Laurel Harduar-Morano
This is a full transcript of the online presentation. For the
presentation itself, go
here.
Begin Transcript:
I am Laurel Harduar-Morano and today I will be talking about descriptive
statistics.
Descriptive statistics are numerical summaries from a data set that
characterize that data set without testing a particular hypothesis. In
other words, we are describing the data.
We use descriptive statistics to calculate statistics that summarize
the data, to create appropriate graphs to visualize the data, and to get
ideas for which more sophisticated statistical analyses can be done.
Some of these analyses include hypothesis testing.
One of the ways to summarize the data is to determine the central
tendency or where the middle of the data set is. We want to see what is
typical of the data. For example, we want to know the average gas
mileage of a car before we buy it; we want to know if our child's
growth is typical of her age group; and knowing the average rainfall of
an area can assist in monitoring potential floods.
The measures of central tendency are the mean, median, and mode.
The mean, which is also known as the average, is the most commonly
calculated measure of central tendency. It is computed by adding all the
individual values in the group and dividing by the number of values in
that group.
The mean is highly sensitive to extreme values. Extreme values are
observations that appear to deviate markedly from the other members of
the sample. If extreme values are present, the mean may not be the
best measure of central tendency.
Here is an example. Let's say that from 1990 to 2000 we had the
following carbon monoxide (CO) poisonings: 3 in 1990, 4 in 1991, 5 in
1992, etc.
Often you will see the mean represented by an x with a line above it,
which is called x-bar.
To calculate the mean we add up the number of poisonings for each year
and then divide by the number of years, 11.
This gives us a mean of 4.18 persons poisoned by CO per year during the
years 1990 to 2000.
The median is the value in a ranked data set that divides
the number of observations in the data into two equal parts. When there
is an odd number of observations the median is the middle observation.
When there is an even number of observations the median is the average
of the two central values.
Using the example of CO poisonings, we now rank the data in order from
lowest to highest. (Notice that the year is now irrelevant.) The number
that divides the data into two equal parts is 4. On one side we have
five numbers and on the other side we have five numbers.
To look at an example with an even number of observations, let's look at
CO poisonings from 1990 to 1995. Again we have ranked the data. The two
middle numbers are 4 and 5, so we calculate their average: 4 plus 5
divided by two gives us the median, 4.5. We can see that there are 3
numbers on either side of 4.5.
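The mean and median calculations described above can be sketched in Python. This is only an illustration: the talk quotes just the first three yearly counts (3, 4, 5), so the remaining values below are hypothetical, chosen so that the results line up with the figures quoted in the talk (mean 4.18, median 4, and median 4.5 for 1990 to 1995).

```python
from statistics import mean, median

# Yearly CO poisoning counts for 1990-2000. Only 1990-1992 (3, 4, 5) come
# from the talk; every other value is HYPOTHETICAL, assumed for illustration.
co_poisonings = {
    1990: 3, 1991: 4, 1992: 5, 1993: 1, 1994: 6, 1995: 7,
    1996: 2, 1997: 3, 1998: 4, 1999: 4, 2000: 7,
}

counts = list(co_poisonings.values())

# Mean (x-bar): add all the values and divide by how many there are (11).
x_bar = mean(counts)

# Median with an odd number of observations: the middle ranked value.
median_all = median(counts)

# Median with an even number of observations (1990-1995):
# the average of the two middle ranked values.
first_six = [co_poisonings[year] for year in range(1990, 1996)]
median_six = median(first_six)

print(f"mean 1990-2000:   {x_bar:.2f}")
print(f"median 1990-2000: {median_all}")
print(f"median 1990-1995: {median_six}")
```

Note that `median` ranks the data itself, so the years never enter the calculation.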
The mode is the most frequently occurring value in a set of
observations.
For example, we have a dataset with the following ranked numbers, 1
through 8. Visually we can see that there are more 3s in the data set
than any other number. So the mode is 3.
For our second example we have five weather observations: sun, hail,
rain, hail, and cloudy. The mode is the most frequently occurring type,
which is hail.
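As a small sketch of the weather example, the mode can be found by counting how often each value occurs. The variable names are illustrative only.

```python
from collections import Counter

# The five weather observations from the example above.
weather = ["sun", "hail", "rain", "hail", "cloudy"]

# Count each type and take the most frequent one.
frequency = Counter(weather)
mode_value, mode_count = frequency.most_common(1)[0]

print(f"mode: {mode_value} (occurs {mode_count} times)")
```

Because the mode only counts occurrences, it works on categories such as weather types, where a mean or median would have no meaning.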
When the dataset is symmetric, indicating that there are no extreme
values, the calculated values of the mean and median are the same.
However, when the data include extreme values, the mean and the median
are different. The green curve is an example of data skewed to the
right; the extreme value is larger than any of the other points in the
dataset.
Before we determine when to use the measures of central tendency we
need to define the types of data.
There are two main data categories: continuous and categorical.
Continuous data can have an infinite number of values; examples are
benzene levels, age, or number of hospitalizations.
Categorical data are collected or summarized into categories.
There are two subtypes:
Ordinal data, which have an obvious order to the categories, such as the
four categories of BMI: underweight, normal weight, overweight, and
obese.
Nominal data, which have no obvious order to the categories, for
instance race: black, white, Asian, Pacific Islander, American Indian.
How do we know when to use the mean, median, or mode?
When the data are continuous and symmetrical (no extreme values), the
mean is appropriate.
When the data are continuous and skewed (extreme values present), the
median is appropriate.
The determination of extreme values can be a judgment call. My
suggestion is to calculate both the median and the mean and if the
results are similar use the mean. However, if the results are
drastically different use the median.
If the data are ordinal (ordered categories), then use the median.
The only time that the mode would be appropriate is if the data were
nominal, that is, categorical data without order.
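The rules above can be sketched as a small helper. The talk leaves "similar" versus "drastically different" as a judgment call, so the 10% relative tolerance below is purely an assumption, as are the function name and the two hypothetical data sets.

```python
from statistics import mean, median

def suggest_measure(data_type, values=None, rel_tolerance=0.10):
    """Suggest a measure of central tendency following the rules above.

    rel_tolerance is an ASSUMED cutoff for "similar"; the talk gives none.
    """
    if data_type == "nominal":
        return "mode"
    if data_type == "ordinal":
        return "median"
    # Continuous data: calculate both and compare them.
    m, med = mean(values), median(values)
    scale = max(abs(m), abs(med), 1e-12)  # avoid dividing by zero
    return "mean" if abs(m - med) / scale <= rel_tolerance else "median"

# HYPOTHETICAL data: the second list is the first with one extreme value.
symmetric = [2, 3, 4, 5, 6]
skewed = [2, 3, 4, 5, 60]

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(name, mean(data), median(data), suggest_measure("continuous", data))
```

The single extreme value pulls the mean from 4 up to 14.8 while the median stays at 4, which is why the median is suggested for the skewed list.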
A distribution is a set of numbers and their frequency of occurrence
collected from measurements over a statistical population.
For example, consider the distribution of flowering plants in a garden.
In the painting Flower Beds in Holland by Vincent van Gogh:
statistical population = the number of beds
frequency of occurrence = the number of flowering plants in each bed.
The normal distribution is represented by a family of curves defined by
the mean (x-bar) and the standard deviation. The standard deviation is
written as sd or with the Greek letter sigma. In the figure the mean is
equal to zero.
For a normal distribution: 68% of all values fall within 1 standard
deviation of the mean, 95% of all values fall within 2 standard
deviations of the mean, and 99.7% of all values fall within 3 standard
deviations of the mean.
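These three percentages can be checked numerically. The sketch below uses the standard result that, for a normal distribution, the share of values within k standard deviations of the mean is erf(k / sqrt(2)); the function name is illustrative.

```python
import math

def share_within(k):
    """Fraction of a normal distribution within k sd of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.1%}")
```

The computed shares are about 68.3%, 95.4%, and 99.7%, which the rule rounds to 68%, 95%, and 99.7%.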
For example, let's say we have measured the height of a particular river
for the past 10 years during the rainy season, and that the heights are
normally distributed with a standard deviation of 1 inch. The river
rises and falls occasionally depending on the rainfall. This means that
using our data we can expect that 68% of the time the river will rise or
fall 1 inch or less from the mean value, 95% of the time the river will
rise or fall 2 inches or less, and 99.7% of the time the river will rise
or fall 3 inches or less. So, if we built a retaining wall that was a
little over three inches above the normal height of the river, we should
expect a flood about 0.15% of the time during the rainy season. (0.3% of
the time the river will rise or fall more than 3 inches, and only the
rises, half of that, go over the wall.) If the rainy season is
approximately 90 days per year, then we should see approximately 1 flood
in 8 years.
(90 days * 8 years = 720 days; 720 days * 0.15% = 1.08)
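The flood arithmetic can be written out step by step. This sketch uses only the numbers from the example and assumes, as the example does, that each rainy-season day counts as one independent observation.

```python
rainy_days_per_year = 90
years = 8

# 99.7% of values fall within 3 sd, so 0.3% fall outside (above or below).
share_beyond_3_sd = 0.003

# Only rises above the wall cause a flood, which is half of that share.
share_above_wall = share_beyond_3_sd / 2

total_days = rainy_days_per_year * years
expected_flood_days = total_days * share_above_wall

print(f"{total_days} rainy days, about {expected_flood_days:.2f} expected flood days")
```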
Normal distributions are always symmetrically bell shaped, but the
extent to which the bell is compressed or flattened out depends on the
standard deviation of the population.
The green curve is the standard normal curve and has an sd of 1 and a
mean equal to 0.
The red curve has an sd of 0.2 and is tall and skinny, while the blue
curve has an sd of 5 and is short and wide.
The pink curve has moved along the line, with a mean of -2, and is no
longer considered a standard normal curve since the mean does not equal
0.
This slide presents two different ways to look at a distribution.
To the right is a distribution curve. However, you will typically
visualize the data as a graph like the figure on the left.
Both visualizations represent the number of flowering plants in plot 3
from 1990 to 2000. The statistical population is the number of plants in
plot three. It may be that not all the plants in plot three are
flowering. As can be seen in the bar graph, the frequency of occurrence
during 1995 is 7 flowering plants in plot 3; in 2000 the frequency of
occurrence is 3.
To visualize the distribution we calculate the mean and the standard
deviation.
The mean is the orange line and the sd is represented by the dotted
lines. We can see that 4 of the 11 years (1992, 1993, 1994, and 1999)
fall within 1 sd.
In addition, this distribution is a little skewed to the left, which is
what happens when the mean is smaller than the median (the median is
represented by the green line). If the mean is larger than the median,
the distribution is skewed to the right. And if the mean equals the
median, the distribution is roughly symmetric, as a normal distribution
is.
As I am sure we all know, in epidemiology we utilize information
dealing with a population, a time period, and a place.
Population: the collection of units that a sample is drawn from, for
example all the citizens of Hamilton County, all the males in Florida,
or institutions such as all elementary schools or hospital records.
Time: the time period under study, for example 1990 to 1997 or June to
September.
Place: the area under study; this may be as small as a single farm or as
large as the entire nation.
A rate is the measure of frequency of occurrence of a phenomenon in the
population under study; in other words, how often does something happen?
Examples of rates are the birth rate in Florida, the turnover rate in a
pool, the rate of asthma in children, the rate of colon cancer in the
male population, and the rate of inflation.
Rates are very important in comparing information from one population to
another population.
A rate is the number of events in a specified period (e.g., 1995 to
2003) over the average population during the same time period.
The resulting value is often multiplied by a power of ten (e.g., 100,
1,000, or 10,000) in order to convert the rate to a whole number.
It is essential to use rates instead of raw numbers to compare two (or
more) populations.
Raw numbers can lie, as we shall see in the next slide.
Here is an example using asthma cases among school children in Florida.
Column three contains the number of students with asthma, or raw
numbers. It seems that Brevard with 4,548 cases has more asthma cases
than Baker County with 409. However, this is not a fair assessment
because Brevard County has many more students than Baker. After
calculating the rates, which takes into account the different population
sizes, we can see that cases of asthma are greater in Baker County than
in Brevard County. There are 65.3 cases of asthma per 1,000 in Brevard
and 89 cases per 1,000 in Baker.
To calculate the asthma rate for Alachua County, we take the number of
students with asthma (2,345) and divide it by the number of students in
Alachua County. The resulting number is multiplied by a power of ten,
in this case 1,000.
Our result is the Alachua County asthma rate, which is 78.5 cases per
1,000 children for the school years 1990-2000.
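The Alachua County calculation can be sketched as below. The transcript does not give the number of students in Alachua County, so the enrollment figure here is an assumption, back-calculated from the quoted rate of 78.5 per 1,000; substitute the real enrollment when it is known. The function name is illustrative.

```python
def rate_per(events, population, multiplier=1_000):
    """Number of events per `multiplier` members of the population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return events / population * multiplier

alachua_cases = 2_345       # students with asthma, from the talk
alachua_students = 29_873   # ASSUMED enrollment, not stated in the talk

alachua_rate = rate_per(alachua_cases, alachua_students)
print(f"{alachua_rate:.1f} asthma cases per 1,000 students")
```

Putting every county on the same per-1,000 scale is what makes a large county and a small county comparable.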
When creating a map with multiple counties that describes a disease,
condition, or contamination, use rates.
DO NOT use RAW NUMBERS (counts):
Do not use raw numbers when creating maps.
Do not use raw numbers when comparing different areas (Leon vs. Orange).
Do not use raw numbers when comparing years (1990 vs. 2000).
Incidence rate: a measure of the rate at which people without a disease
develop the disease during a specific time period.
The incidence rate is the number of new cases of a disease over a period
of time divided by the population at risk in the same time period.
Let's say there are 400 chickens in area B that are susceptible to West
Nile virus.
During the summer of 2003, 25 chickens were diagnosed with West Nile
virus.
The incidence rate is 25/400 = 0.0625, which multiplied by 100 gives
6.25 chickens per 100 diagnosed with West Nile virus during the summer
of 2003.
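The chicken example works out as follows. This is a minimal sketch using only the numbers given above; the function name is illustrative.

```python
def incidence_rate(new_cases, population_at_risk, multiplier=100):
    """New cases during the period per `multiplier` individuals at risk."""
    return new_cases / population_at_risk * multiplier

# Summer 2003, area B: 25 new West Nile cases among 400 susceptible chickens.
west_nile_incidence = incidence_rate(new_cases=25, population_at_risk=400)
print(f"{west_nile_incidence} cases per 100 chickens")
```

Only new cases go in the numerator, and only individuals who could still develop the disease go in the denominator.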
Prevalence: a measure of the number of people in a population who have
a particular disease at a given point in time.
In other words, it is a snapshot in time.
Sometimes in epidemiology prevalence is referred to as a prevalence
rate. However, this number is a proportion, not a rate. Rates include a
time period while a proportion does not.
Here is an example of prevalence and when it is used:
Let's say we conducted a survey of all the ER nurses in Hillsborough
County. We asked questions about salary, stress level, and training
and certification. The collection of data took place during May and June
of 2005.
The results of the survey provide us with a snapshot of the ER nurses
during those two months in 2005.
The population is all the ER nurses and the place is Hillsborough
County. Our results may show that 4 nurses in 100 earn 40,000 or less
per year. This information cannot be generalized to the population at
large, only to the participants of our survey.
A simple way to look at rate, incidence rate, and prevalence is:
The rate measures how often something occurs.
The incidence rate measures how many people will develop the disease in
a certain time period.
And prevalence is a snapshot in time.
Standardization is important because different areas have different
population distributions. We want to compare populations in different
areas. However, the risk of a disease is greater for some age groups
than others; for example, the elderly are more likely to have been
diagnosed with osteoporosis. If county A has a larger percentage of
retired individuals than county B, we may falsely assume, because of the
age distribution, that county A has a higher rate of illness.
So, we adjust for age in our calculation.
The adjusted rate tells you what the rate would be if the sample
population had a similar age structure to that of the standard
population.
To do this requires age-specific rates for the sample population and the
age structure of a standard population.
Let's say that we would like to know the rate of cancer deaths in
Florida. To calculate the rate as we did earlier, we take the total
number of cancer deaths (115), divide by the population (45,000), and
multiply by 1,000, which gives us 2.56 deaths per 1,000 people.
Generally when we talk about cancer death rates we multiply by a factor
of 100,000; however, for our example using 1,000 is easier.
When we break the cancer deaths down by age, we can see that more deaths
occur in the 65-plus population while there are more people in the 20-64
age group. Using the crude rate we may actually be overestimating the
rate of cancer deaths. If we look at the age-specific rates (column 4),
we can see that for ages 0-19 there is 1 death per 1,000; for ages 20-64
there are 0.4 deaths per 1,000. The age-specific rate for the 20-64 age
group is smaller than that of the 19-or-younger age group because the
older age group has a larger population at risk. For those people 65 and
older there are 6.67 deaths per 1,000. There is clearly a difference in
rates by age.
We then multiply each age-specific rate by the standard population for
that age group and then add the results together.
The standardized cancer deaths divided by the standard population,
multiplied by 1,000, provides us with the age-adjusted rate of 1.70
deaths per 1,000.
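The standardization steps described above can be sketched as follows. The age-specific rates and the crude totals come from the talk, but the transcript does not include the standard population from the slide, so the standard population below is hypothetical. It was picked only so that the result lands near the quoted 1.70, and it should not be read as the real 2000 Florida figures.

```python
# Age-specific cancer death rates per 1,000, as quoted in the talk.
age_specific_rates = {"0-19": 1.0, "20-64": 0.4, "65+": 6.67}

# HYPOTHETICAL standard population by age group (assumed for illustration).
standard_population = {"0-19": 28_000, "20-64": 54_000, "65+": 18_000}

def age_adjusted_rate(rates, standard_pop, multiplier=1_000):
    """Directly standardized rate per `multiplier` people."""
    # Deaths we would expect if the standard population experienced
    # the study population's age-specific rates.
    expected_deaths = sum(
        rates[group] / multiplier * standard_pop[group]
        for group in standard_pop
    )
    return expected_deaths / sum(standard_pop.values()) * multiplier

# Crude rate from the talk: 115 deaths in a population of 45,000.
crude_rate = 115 / 45_000 * 1_000
adjusted_rate = age_adjusted_rate(age_specific_rates, standard_population)

print(f"crude rate:        {crude_rate:.2f} per 1,000")
print(f"age-adjusted rate: {adjusted_rate:.2f} per 1,000")
```

With a different standard population the adjusted rate changes, which is why adjusted rates are only comparable when they use the same standard.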
We have to be careful when comparing crude rates. The crude cancer death
rate for Florida may be higher than that of the study population.
However, the study population may be older and therefore, in actuality,
have higher cancer death rates.
If the crude rate decreases after adjustment, the study population is
older than the standard population.
If the crude rate increases after adjustment, the study population is
younger than the standard population.
In our previous example the study population is older than the 2000
Florida population. The crude rate of 2.56 deaths per 1,000 is larger
than the adjusted rate of 1.7 deaths per 1,000.
To compare rates among subpopulations when confounding is not an issue,
we would use crude rates.
In our earlier example age was a confounder. However, if we conducted a
survey and selected individuals in such a way that the age distribution
was the same as the standard population, we could use crude rates.
To compare the health of entire populations or diseases (e.g., cancer,
birth defects), use adjusted rates.
They allow for comparison of populations with different demographic
structures (e.g., race, age, poverty level).
A bar graph is any plot of a set of data such that the number of data
elements falling within one or more categories is indicated using a
rectangle whose height or width is a function of the number of elements.
Bar graphs are used with categorical data: data that are in categories,
e.g., race, gender, counties.
In these two graphs the number of elements is shown by the height of the
rectangles. If they were turned 90 degrees and the rectangles went from
right to left, it would be shown by their width.
Now notice the difference between the two bar graphs. The first utilizes
only crude (raw) numbers and doesn't take into account the differing
populations among counties. The second graph allows for comparison
between counties since all rates are per 1,000 students.
When looking at the raw numbers, Miami-Dade appears to have a higher
number of asthma and allergy cases than Baker or Bay. But when we look
at the same data per 1,000 students, we can see that Miami-Dade actually
has a lower rate of asthma and allergy cases.
It is important to remember that we do not use raw numbers or counts for
comparisons; we only use rates.
A histogram is a graphical representation of a set of observations in
which class frequencies (continuous data) are represented by the areas
of rectangles (bins) centered on the class interval.
Histograms are used with continuous data: data that have an infinite
number of values, e.g., age, benzene levels, blood pressure.
In this example, 140 urban children were tested for lead. The frequency
of a particular range is recorded. For instance, we can see that 2
children had blood lead levels between 0 and 0.39 micrograms/dL while 7
children had blood lead levels between 0.4 and 0.79 micrograms/dL.
This information can be presented in a histogram. Our histogram has a
class interval of 0.4 and 11 bins. Bin 1 provides the number of children
with lead concentrations of 0 to 0.39 micrograms/dL.
Bin 10 provides the number of children with lead concentrations of 3.6
to 3.99 micrograms/dL, which is seven children.
In the histogram we can use raw numbers or counts because we are not
comparing the information to anything; we just want to visualize it. And
we can see that the most common level is 2 to 2.39 micrograms/dL.
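The binning behind this histogram can be sketched as below, using the class interval of 0.4 and the 11 bins from the example. The measurements in `levels` are hypothetical (the transcript does not list the 140 individual results), and the sketch assumes levels are recorded to two decimal places.

```python
from collections import Counter

CLASS_INTERVAL_HUNDREDTHS = 40  # class interval of 0.4 micrograms/dL
NUM_BINS = 11

def bin_number(level):
    """Return the 1-based bin for a blood lead level in micrograms/dL."""
    # Work in whole hundredths so float rounding cannot shift a value
    # such as 1.2 into the wrong bin.
    hundredths = round(level * 100)
    number = hundredths // CLASS_INTERVAL_HUNDREDTHS + 1
    if not 1 <= number <= NUM_BINS:
        raise ValueError(f"level {level} is outside the histogram range")
    return number

def bin_label(number):
    """Return the range covered by a bin, e.g. '0.00-0.39' for bin 1."""
    low = (number - 1) * CLASS_INTERVAL_HUNDREDTHS
    high = low + CLASS_INTERVAL_HUNDREDTHS - 1
    return f"{low / 100:.2f}-{high / 100:.2f}"

# HYPOTHETICAL measurements, for illustration only.
levels = [0.2, 0.5, 1.2, 2.1, 2.3, 2.2, 3.7]

histogram = Counter(bin_number(level) for level in levels)
for number in sorted(histogram):
    print(f"bin {number:2d} ({bin_label(number)}): {'#' * histogram[number]}")
```

Each row of output is one bar of the histogram turned on its side, with raw counts as the bar length.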
Any questions?
Thank you very much for listening to this presentation. If you have any
questions, please feel free to email me.