Florida Division of Environmental Health
Laurel Harduar-Morano

This is a full transcript of the online presentation. For the presentation itself, go here.

Begin Transcript:


I am Laurel Harduar-Morano and today I will be talking about descriptive statistics.

Descriptive statistics are numerical summaries of a data set that characterize that data set without testing a particular hypothesis. In other words, we are describing the data.

We use descriptive statistics to calculate values that summarize the data, to create appropriate graphs that visualize the data, and to get ideas about which more sophisticated statistical analyses can be done. Some of these analyses include hypothesis testing.

One of the ways to summarize the data is to determine the central tendency or where the middle of the data set is. We want to see what is typical of the data. For example, we want to know what the average gas mileage of a car is before we buy it; we want to know if our child’s growth is typical of her age group; knowing the average rainfall of an area can assist in monitoring of potential floods.

Measurements of central tendency are Mean, Median and Mode.

Mean, which is also known as the average, is the most commonly calculated measure of central tendency. It is computed by adding all the individual values in the group and dividing by the number of values in that group.

The mean is highly sensitive to extreme values. Extreme values are observations that appear to deviate markedly from the other members of the sample. If extreme values are present, the mean may not be the best measure of central tendency.

Here is an example. Let’s say that from 1990 to 2000 we had the following carbon monoxide (CO) poisonings: 3 in 1990, 4 in 1991, 5 in 1992, etc.


Often you will see the mean represented by an x with a line above it, which is called x-bar.

To calculate the mean we add up the number of poisonings for each year and then divide by the number of years, 11.

This gives us a mean of 4.18 persons poisoned by CO per year during the years 1990 to 2000.
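The calculation above can be sketched in a few lines of Python. Only the counts for 1990 to 1992 are stated in the transcript; the remaining yearly counts below are illustrative assumptions, chosen so that the total is consistent with the quoted mean.

```python
# Yearly CO poisoning counts, 1990-2000.
# ASSUMPTION: only 1990-1992 (3, 4, 5) come from the transcript; the other
# values are made up so that the total (46) matches the quoted mean of 4.18.
co_poisonings = {
    1990: 3, 1991: 4, 1992: 5, 1993: 2, 1994: 6, 1995: 7,
    1996: 1, 1997: 4, 1998: 4, 1999: 5, 2000: 5,
}

total = sum(co_poisonings.values())   # add up the poisonings for each year
n_years = len(co_poisonings)          # number of years, 11
mean = total / n_years                # x-bar

print(f"mean = {mean:.2f}")           # prints "mean = 4.18"
```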

 Median is the value in a data set, which has been ranked, that divides the number of observations in the data into two equal parts. When there is an odd number of observations the median is the middle observation. When there is an even number of observations the median is the average of the two central values.

 Using the example of CO poisonings, we now rank the data in order from lowest to highest (notice that the year is now irrelevant). The number that divides the data into two equal parts is 4. On one side we have five numbers and on the other side we have five numbers.

To look at an example with an even number of observations, let’s look at CO poisonings from 1990 to 1995. Again we have ranked the data. The two middle numbers are 4 and 5, so we calculate their average: 4 plus 5, divided by two, gives us the median, 4.5. We can see that there are 3 numbers on either side of 4.5.
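A minimal sketch of both median cases follows. As before, only the first three yearly counts (3, 4, 5) are given in the transcript; the rest are illustrative assumptions chosen so that the medians match the values quoted above.

```python
def median(values):
    """Middle value of the ranked data; average of the two middle values if n is even."""
    ranked = sorted(values)           # rank the data from lowest to highest
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:                    # odd number of observations
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

# ASSUMPTION: hypothetical yearly counts, 1990-2000 (only 3, 4, 5 are from the transcript).
all_years = [3, 4, 5, 2, 6, 7, 1, 4, 4, 5, 5]
first_six = all_years[:6]             # 1990-1995

print(median(all_years))              # 11 observations -> 4
print(median(first_six))              # 6 observations  -> 4.5
```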

The mode is the most frequently occurring value in a set of observations.

For example we have a dataset with the following ranked numbers, 1 through 8. Visually we can see that there are more 3s in the data set than any other number. So the mode = 3.

For our second example we have five weather observations: sun, hail, rain, hail, and cloudy. The mode is the most frequently occurring type, which is hail.

 When the dataset is symmetric, indicating that there are no extreme values, the calculated values of the mean and median are the same.
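The weather example can be checked by counting how often each value occurs. This is a small sketch using Python's standard `collections.Counter`; it assumes the five observations are sun, hail, rain, hail, and cloudy.

```python
from collections import Counter

# The five weather observations from the example (hail appears twice).
weather = ["sun", "hail", "rain", "hail", "cloudy"]

counts = Counter(weather)                        # frequency of each type
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)                    # prints "hail 2"
```

Because the mode only counts occurrences, it works for data with no numeric meaning or order at all.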

However, when the data includes extreme values the mean and the median are different. The green curve is an example of data skewed to the right; the extreme value is larger than any of the other points in the dataset.

 Before we determine when to use the measures of central tendency we need to define the types of data.

There are two main data categories - continuous and categorical.

Continuous data can have an infinite number of values; examples are benzene levels, age, or number of hospitalizations

Categorical data are collected or summarized into categories.

There are two subtypes:

Ordinal data, which has an obvious order to the categories, such as the four categories of BMI: underweight, normal weight, overweight, and obese.

Nominal data, which has no obvious order to the categories, for instance race: black, white, Asian, Pacific Islander, American Indian.

 How do we know when to use the mean, median, or mode?

When the data is continuous and symmetrical, with no extreme values, the mean is appropriate.

When the data is continuous and skewed, with extreme values present, the median is appropriate.

The determination of extreme values can be a judgment call. My suggestion is to calculate both the median and the mean and if the results are similar use the mean. However, if the results are drastically different use the median.

If the data is ordinal (categories with an order), then use the median.

The only time that the mode would be appropriate is if the data was nominal, categorical data without order.
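These rules of thumb can be written down as a small helper. This is only a sketch: the function name `central_tendency`, the `data_type` labels, and the 10% cutoff for "similar" versus "drastically different" are all assumptions made for illustration, since the presentation leaves that judgment to the analyst.

```python
from statistics import mean, median, median_low, mode

def central_tendency(values, data_type, rel_tol=0.10):
    """Pick a measure of central tendency following the rules of thumb above.

    data_type is "continuous", "ordinal", or "nominal" (hypothetical labels).
    rel_tol is an ASSUMED cutoff: mean and median count as "similar" when they
    differ by no more than rel_tol times the median.
    """
    if data_type == "nominal":
        return "mode", mode(values)
    if data_type == "ordinal":
        # Ordered categories are expected to be coded as ranks (1, 2, 3, ...);
        # median_low returns an actual category rather than an average of two.
        return "median", median_low(values)
    m, med = mean(values), median(values)
    if abs(m - med) <= rel_tol * abs(med):
        return "mean", m
    return "median", med

print(central_tendency([1, 2, 3, 4, 5], "continuous"))       # symmetric data
print(central_tendency([2, 3, 3, 4, 4, 40], "continuous"))   # one extreme value
print(central_tendency(["A", "B", "B", "C"], "nominal"))
```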

 A distribution is a set of numbers and their frequency of occurrence collected from measurements over a statistical population.

For example: the distribution of flowering plants in our garden. In the painting Flower Beds in Holland by Vincent van Gogh:

statistical population = the number of beds

frequency of occurrence = the number of flowering plants in each bed.

 The normal distribution is represented by a family of curves defined by the mean (x-bar) and the standard deviation. The standard deviation is written as sd or with the Greek letter sigma. In the figure the mean is equal to zero.

For a normal distribution: 68% of all values fall within 1 standard deviation of the mean, 95% of all values fall within 2 standard deviations of the mean, and 99.7% of all values fall within 3 standard deviations of the mean.
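These percentages can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)); the sketch below uses `math.erf` from the Python standard library, and the helper name `prob_within` is just an illustrative choice.

```python
import math

def prob_within(k):
    """Probability that a normally distributed value lies within k sd of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sd: {prob_within(k):.2%}")
```

The exact figures are slightly different from the rounded 68%, 95%, and 99.7% quoted above, which is why the rule is usually described as approximate.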

For example, let’s say we have measured the height of a particular river for the past 10 years during the rainy season. The river rises and falls occasionally depending on the rainfall. Suppose the heights are normally distributed with a standard deviation of 1 inch. Then, using our data, we can expect that 68% of the time the river will rise or fall 1 inch or less from the mean value, 95% of the time it will rise or fall 2 inches or less, and 99.7% of the time it will rise or fall 3 inches or less. So, if we built a retaining wall that was a little over 3 inches above the normal height of the river, we should expect it to flood about 0.15% of the time during the rainy season (0.3% of the time the river will rise or fall more than 3 inches, and only the rises, half of that, cause a flood). If the rainy season is approximately 90 days per year, then we should see approximately 1 flood in 8 years.

(90 days × 8 years = 720 days; 720 days × 0.15% = 1.08 floods)
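The flood arithmetic can be reproduced step by step. The sketch below assumes, as the example implicitly does, that the standard deviation is 1 inch and that each day of the rainy season can be treated as a separate observation.

```python
# Share of days on which the river is more than 3 sd from its mean height
# (the remainder of the 99.7% figure quoted above).
p_outside_3sd = 0.003

# Only rises above the wall cause a flood, so take the upper half of that share.
p_flood = p_outside_3sd / 2           # 0.15% of days

days_per_season = 90                  # approximate length of the rainy season
years = 8

expected_floods = days_per_season * years * p_flood
print(f"{expected_floods:.2f} expected floods in {years} years")
```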

 Normal distributions are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population.

The green curve is the standard normal curve, which has an sd of 1 and a mean equal to 0.

The red curve has an sd of 0.2 and is tall and skinny while the blue curve has an sd of 5 and is short and wide.

The pink curve has moved along the line to a mean of -2 and is no longer considered a standard normal curve, since the mean does not equal 0.

 This slide presents two different ways to look at a distribution.

To the right is a distribution curve. However, you will typically visualize the data as a graph like the figure on the left.

Both visualizations represent the number of flowering plants in plot 3 from 1990 to 2000. The statistical population is the number of plants in plot 3. It may be that not all the plants in plot 3 are flowering. As can be seen in the bar graph, the frequency of occurrence during 1995 is 7 flowering plants in plot 3; in 2000 the frequency of occurrence is 3.

To visualize the distribution we calculate the mean and the standard deviation.

The mean is the orange line and the sd is represented by the dotted lines. We can see that 4 of the 11 years (1992, 1993, 1994, and 1999) fall within 1 sd.

In addition, this distribution is a little skewed to the left, which is what happens when the mean is smaller than the median (represented by the green line). If the mean is larger than the median, the distribution is skewed to the right. And if the mean equals the median, the distribution is symmetric, as a normal distribution is.

 As I am sure we all know, in epidemiology we utilize information dealing with a population, time period and places.

Population: The collection of units that a sample is drawn from; all the citizens of Hamilton county, or all the males in Florida, institutions such as all elementary schools, hospital records…

Time: The time period under study; 1990 to 1997, June to September…

Place: Area under study; this may be as small as a single farm or as large as the entire nation

 Rate is the measure of frequency of occurrence of a phenomenon in the population under study – in other words how often does something happen.

Examples of rates are: The birth rate in Florida, turnover rate in a pool, rate of asthma in children, the rate of colon cancer in the male population, rate of inflation

Rates are very important in comparing information from one population to another population.

Rate is the number of events in a specified period (e.g., 1995 to 2003) divided by the average population during the same time period.

The resulting value is often multiplied by a power of ten (e.g., 100, 1,000, or 10,000) in order to convert the rate to a more readable number.

It is essential to use rates instead of raw numbers to compare two (or more) populations

Raw numbers can lie – as we shall see in the next slide. 

Here is an example using asthma cases among school children in Florida.

Column three contains the number of students with asthma, or raw numbers. It seems that Brevard County, with 4,548 cases, has a bigger asthma problem than Baker County, with 409. However, this is not a fair assessment because Brevard County has many more students than Baker. After calculating the rates, which take into account the different population sizes, we can see that asthma is more common in Baker County than in Brevard County. There are 65.3 cases of asthma per 1,000 students in Brevard and 89 cases per 1,000 in Baker.

To calculate the asthma rate for Alachua County we take the number of students with asthma (2,345) and divide it by the number of students in Alachua County. The resulting number is multiplied by a power of ten, in this case 1,000.

Our result is the Alachua County asthma rate, which is 78.5 cases per 1,000 children for the school years 1990-2000.

When creating a map with multiple counties that describes a disease, condition, or contamination, use rates.
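The rate calculation can be sketched as a one-line function. The case counts below come from the transcript, but the enrollment figures do not: they are approximate values back-calculated from the quoted rates, and are assumptions used only to make the example runnable.

```python
def rate_per(events, population, per=1_000):
    """Number of events divided by the population, scaled to a rate per `per`."""
    return events / population * per

# county: (students with asthma, enrolled students)
# ASSUMPTION: enrollments are NOT from the transcript; they are back-calculated
# from the quoted rates (78.5, 89, and 65.3 per 1,000) and are approximate.
counties = {
    "Alachua": (2_345, 29_873),
    "Baker": (409, 4_596),
    "Brevard": (4_548, 69_648),
}

rates = {name: rate_per(cases, students)
         for name, (cases, students) in counties.items()}

for name, rate in rates.items():
    cases = counties[name][0]
    print(f"{name}: {cases} cases, {rate:.1f} per 1,000 students")
```

Brevard has roughly eleven times as many cases as Baker, yet the lower rate, which is the point of the comparison.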

 DO NOT use RAW NUMBERS (counts)

Do not use raw numbers when creating maps

Do not use raw numbers when comparing different areas (Leon vs. Orange)

Do not use raw numbers when comparing years (1990 vs. 2000)

 Incidence rate: a measure of the rate at which people without a disease develop the disease during a specific time period

Incidence rate is the number of new cases of a disease over a period of time divided by the population at risk in the same time period

Let’s say there are 400 chickens in area B who are susceptible to West Nile virus

During the summer of 2003, 25 chickens were diagnosed with West Nile virus.

The incidence rate is: 25/400 = 0.0625, and 0.0625 × 100 = 6.25, so 6.25 chickens per 100 were diagnosed with West Nile virus during the summer of 2003.
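The same arithmetic in code, using only the numbers from the chicken example (the variable names are illustrative):

```python
new_cases = 25            # chickens diagnosed during the summer of 2003
population_at_risk = 400  # susceptible chickens in area B

incidence_rate = new_cases / population_at_risk * 100   # per 100 chickens
print(f"{incidence_rate:.2f} new cases per 100 chickens")   # 6.25
```

The denominator is the population at risk, so animals that could not develop the disease would be excluded before dividing.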

 Prevalence: a measure of the number of people in a population who have a particular disease at a given point in time

In other words it is a snapshot in time

Sometimes in epidemiology prevalence is referred to as a prevalence rate

However, this number is a proportion, not a rate. Rates include a time period while a proportion does not.

 Here is an example of prevalence and when it is used:

Let’s say we conducted a survey of all the ER nurses in Hillsborough County. We asked questions about topics such as salary, stress level, and training and certification. The collection of data took place during May and June of 2005.

The results of the survey provide us with a snapshot of the ER nurses during those two months in 2005.

The population is all the ER nurses and the place is Hillsborough County. Our results may show that 4 nurses in 100 earn $40,000 or less per year. This information cannot be generalized to the population at large, only to the participants of our survey.

 A simple way to look at rate, incidence rate, and prevalence is

The rate measures how often something occurs

The incidence rate measures how many people will develop the disease in a certain time period

And Prevalence is a snapshot in time

 Standardization is important because different areas have different population distributions. We want to compare populations in different areas. However, the risk of a disease is greater for some age groups than for others; the elderly, for example, are more likely to have been diagnosed with osteoporosis. If county A has a larger percentage of retired individuals than county B, we may falsely conclude that county A has a higher rate of illness when the difference is really due to the age distribution.

So, we adjust for age in our calculation.

The adjusted rate tells you what the rate would be if the sample population had a similar age structure to that of the standard population

To do this requires Age-specific rates for the sample population and the age-structure of a standard population

 Let’s say that we would like to know the rate of cancer deaths in Florida. To calculate the rate as we did earlier we take the total number of cancer deaths (115) and divide by the population (45,000) and multiply by 1,000, which gives us 2.56 deaths per 1,000 people.

Generally, when we talk about cancer death rates, we multiply by a factor of 1 million; for our example, however, using 1,000 is easier.

When we break the cancer deaths down by age we can see that more deaths occur in the 65-plus population, while there are more people in the 20-64 age group. Using the crude rate we may actually be overestimating the rate of cancer deaths. If we look at the age-specific rate (column 4) we can see that for ages 0-19 there is 1 death per 1,000; for ages 20-64 there are 0.4 deaths per 1,000. The age-specific rate for the 20-64 age group is smaller than that of the 19-or-younger age group because the older age group has a larger population at risk. For those people 65 and older there are 6.67 deaths per 1,000. There is clearly a difference in rates by age.

We then multiply each age-specific rate by the standard population for that age group and add the results together. This gives the standardized number of cancer deaths.

The standardized cancer deaths divided by the total standard population, multiplied by 1,000, provides us with the age-adjusted rate of 1.70 deaths per 1,000.
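The direct age adjustment described above can be sketched as follows. Two sets of numbers here are assumptions, not data from the presentation: the deaths and population per age group are chosen to be consistent with the quoted totals (115 deaths, 45,000 people) and age-specific rates, and the standard population is entirely hypothetical, so the adjusted rate printed below will not equal the 1.70 quoted from the slide.

```python
# age group: (deaths, population) in the study population.
# ASSUMPTION: per-group counts are inferred to match the quoted totals and
# age-specific rates (1, 0.4, and 6.67 deaths per 1,000).
study = {"0-19": (5, 5_000), "20-64": (10, 25_000), "65+": (100, 15_000)}

# ASSUMPTION: a HYPOTHETICAL standard population, for illustration only.
standard = {"0-19": 25_000, "20-64": 60_000, "65+": 15_000}

# Crude rate: all deaths over the whole population, per 1,000.
total_deaths = sum(deaths for deaths, _ in study.values())
total_pop = sum(pop for _, pop in study.values())
crude_rate = total_deaths / total_pop * 1_000

# Age-specific rates (deaths per person) in the study population.
age_specific = {group: deaths / pop for group, (deaths, pop) in study.items()}

# Apply each age-specific rate to the standard population and add up.
standardized_deaths = sum(age_specific[g] * standard[g] for g in standard)
adjusted_rate = standardized_deaths / sum(standard.values()) * 1_000

print(f"crude rate:    {crude_rate:.2f} per 1,000")
print(f"adjusted rate: {adjusted_rate:.2f} per 1,000")
```

With these assumed numbers the study population has a larger share of people aged 65 and older than the standard population, so the adjusted rate comes out lower than the crude rate.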

 We have to be careful when comparing crude rates. The crude cancer death rate for Florida may be higher than that of the study population. However, the study population may be older and therefore, in actuality, have higher cancer death rates.

If the crude rate decreases after adjustment, the study population is older than the standard population.

 If the crude rate increases after adjustment, the study population is younger than the standard population.

In our previous example the study population is older than the 2000 Florida population. The crude rate of 2.56 deaths per 1,000 is larger than the adjusted rate of 1.7 deaths per 1,000.

 To compare rates among subpopulations when confounding is not an issue, we would use crude rates.

In our earlier example age was a confounder. However, if we conducted a survey and selected individuals in such a way that the age distribution was the same as the standard population we could use crude rates.

To compare the health of entire populations or diseases (e.g., cancer, birth defects), use adjusted rates.

They allow for comparison of populations with different demographic structures (e.g., race, age, poverty level).

 A bar graph is any plot of a set of data such that the number of data elements falling within one or more categories is indicated using a rectangle whose height or width is a function of the number of elements.

Used with categorical data: data that are in categories, e.g., race, gender, counties.

In these two graphs the rectangles are a function of height. If they were turned 90 degrees and the rectangles went from right to left, they would be a function of width.

Now notice the difference between the two bar graphs. The first utilizes only crude (raw) numbers and doesn’t take into account the differing populations among counties. The second graph allows for comparison between counties since all rates are per 1,000 students.

When looking at the raw numbers, Miami-Dade appears to have more asthma and allergy cases than Baker or Bay. But when we look at the same data per 1,000 students, we can see that Miami-Dade actually has a lower rate of asthma and allergy cases.

It is important to remember that we do not use raw numbers or counts for comparisons; we only use rates.

A histogram is a graphical representation of a set of observations in which class frequencies (continuous data) are represented by the areas of rectangles (bins) centered on the class interval.

Used with continuous data: data that has an infinite number of values, e.g., age, benzene levels, blood pressure.

 In this example, 140 urban children were tested for lead. The frequency of a particular range is recorded. For instance, we can see that 2 children had blood lead levels between 0 and 0.39 micrograms/dL while 7 children had blood lead levels between 0.4 and 0.79 micrograms/dL.

This information can be presented in a histogram. Our histogram has a class interval of 0.4 and 11 bins. Bin 1 provides the number of children with lead concentrations of 0 to 0.39 micrograms/dL.

Bin 10 provides the number of children with lead concentrations of 3.6 to 3.99 micrograms/dL, which is seven children.

In the histogram we can use raw numbers or counts because we are not comparing the information to anything; we just want to visualize it. And we can see that the most common level is 2.0 to 2.39 micrograms/dL.
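Assigning each measurement to a bin can be sketched as below. The class interval (0.4) and the number of bins (11) come from the example; the readings themselves are hypothetical, since the 140 individual measurements are not in the transcript. Values are converted to whole hundredths before dividing, which avoids floating-point surprises at bin boundaries such as 1.2.

```python
from collections import Counter

BIN_WIDTH_HUNDREDTHS = 40   # class interval of 0.4 micrograms/dL
N_BINS = 11

def bin_number(level):
    """Return the 1-based bin for a blood lead level in micrograms/dL."""
    hundredths = round(level * 100)
    # Anything at or above the last bin's lower edge is placed in the last bin.
    return min(hundredths // BIN_WIDTH_HUNDREDTHS, N_BINS - 1) + 1

# ASSUMPTION: hypothetical readings, for illustration only.
readings = [0.2, 0.5, 0.6, 1.3, 2.0, 2.1, 2.3, 2.39, 2.4, 3.7]

counts = Counter(bin_number(x) for x in readings)

# A text histogram: one row per bin, one mark per child.
for b in range(1, N_BINS + 1):
    low = (b - 1) * BIN_WIDTH_HUNDREDTHS / 100
    print(f"bin {b:2d} (from {low:.1f}): {'#' * counts[b]}")
```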

 Any questions?

Thank you very much for listening to this presentation. If you have any questions, please feel free to email me.


This page was last modified on: 05/22/2007 01:32:40