2.1 Sampling Distribution of Means

Biostatistics for the Clinician

University of Texas-Houston
Health Science Center

Lesson 2.1

Sampling Distribution of Means

Lesson 2: Inferential Statistics

2.1 Sampling Distribution of Means

2.1.1 Why Important
In Lesson 1 you learned that there are two cases where you don't need to worry about statistics: when you have the whole population, and when you have large samples. Unfortunately, most medical experiments fall outside those cases. You typically have a small sample, and you typically want to generalize to a larger group. This is the job of inferential statistics.
Actually, in the context above, the word "statistics" was being used in the more restrictive sense of inferential statistics or hypothesis-testing statistics. Inferential statistical methods are the kind you use when you're trying to make inferences about, or generalize to, an entire population based upon research that includes only a small portion of that population.
Of course, you can always use the summary statistics or descriptive statistics of Lesson 1 to summarize or describe data where you're not trying to generalize beyond the group you've collected the data from. If you have quantitative data from any group, you can always find the mean, the median, and the mode. You can find the mean age or the standard deviation or any other of those descriptive measures. These are descriptive statistics. Descriptive statistics can be used to summarize group performance no matter what the size of the group and whether you have the whole group, a part of it, or anything else.

Practice
Exercise 1: You need inferential statistics to:
Describe samples
Generalize from samples to populations
Compute statistics
Describe populations

2.1.2 Sampling Distribution of Means
Let's find out about sampling distributions and hypothesis testing. Consider again the Gaussian distribution plotted with z-scores on the horizontal axis, also called the standard normal distribution. Whenever you look at a graph, make sure you understand what its axes represent before you try to interpret it. Now look closely at the graph below (see Figure 2.5).

Figure 2.5: Gaussian Distribution

Note that on the horizontal axis the segments are marked off in standard deviations: 0, +1, +2, -1, and so forth. You learned last time that any Gaussian distribution, no matter what its mean and standard deviation, can be represented in a plot that looks like this using z-scores. In other words, because every original score has a corresponding z-score, you can represent any normal distribution in terms of its z-scores, and you can plot it that way.
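As a rough illustration of the conversion described above, the sketch below turns raw scores into z-scores by subtracting the group mean and dividing by the group standard deviation. The function name `z_scores` and the blood pressure values are made up for the example; they do not come from the lesson.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Convert raw scores to z-scores using the group's own mean and SD."""
    m = mean(values)
    sd = pstdev(values)  # treats the values as a whole population (assumption)
    return [(v - m) / sd for v in values]

# Hypothetical diastolic blood pressures in mm Hg, for illustration only.
pressures = [70, 75, 80, 85, 90]
z = z_scores(pressures)
print([round(score, 2) for score in z])
```

Whatever the original units, the resulting z-scores have a mean of 0 and a standard deviation of 1, which is why any Gaussian distribution can be drawn on the same horizontal axis.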
The vertical axis in Figure 2.5 represents relative frequency (strictly speaking, probability density). It is scaled so that the total area under this kind of curve is always 1, which is what lets you read areas as proportions.
What the graph then shows is the proportion of the distribution that falls within various limits. The whole curve encompasses the entire population. Fifty percent is above and 50% is below the mean. And, because Gaussian tables give the areas under the normal curve, any interval on the horizontal axis has a probability associated with it. For example, you already know that roughly 68% of any Gaussian distribution lies within plus or minus 1 standard deviation of the mean.
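The table lookups mentioned above can also be reproduced in code. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `area_between` is an assumption made for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def area_between(z_low, z_high):
    """Area under the standard normal curve between two z-scores."""
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(round(area_between(-1, 1), 4))       # 0.6827, the "roughly 68%" figure
print(round(standard_normal.cdf(0), 4))    # 0.5, half the area lies below the mean
```

The same function gives the proportion between any two points on the z-score axis, so it can stand in for a printed Gaussian table.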

Practice
Exercise 2: Regions under the standard normal distribution curve between points on the z-score axis represent:
Areas
Probabilities
Proportions
All of the above

Now look at the areas of the curve associated with +2 and -2 in Figure 2.5. Tables tell us that about 2.3% of the values lie below -2, and about 2.3% lie above +2. (A segment labeled 2.1% on a figure like this is the area between 2 and 3 standard deviations; the small remainder lies beyond 3.) Now the most frequent probability or p value that you hear about in research studies is .05. That .05 typically represents the probability or area that is in the two tails of the distribution. In other words, it's like the two little tail pieces beyond +2 and -2 in the figure, but for .05 each piece would hold 2.5%. If you add those two pieces you have a five percent probability.
The combined probability in the two tails beyond 2 standard deviations is about 4.55%. So the chance of being more than 2 standard deviations away from the mean in a normal distribution is about .0455. You can see that to have a 5% chance you wouldn't need to be quite as far out as 2 standard deviations. In fact, the z value is 1.96. To put it another way, the chance of being more than 1.96 standard deviations away from the mean in a normal distribution is .05.
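The two-tailed areas discussed here can be checked directly from the normal cumulative distribution rather than read from a table. The sketch below is only an illustration; the function name `two_tailed_p` is an assumption.

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Chance of falling more than |z| standard deviations from the mean,
    counting both tails of a normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(round(two_tailed_p(2.0), 4))   # 0.0455
print(round(two_tailed_p(1.96), 4))  # 0.05
```

The second line shows why 1.96, rather than 2, is the z value paired with the familiar .05 level.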
So, the value of 1.96, cutting off 5% in the tails, defines cutoff points for outliers. Any time you randomly collect data that are normally distributed, you can expect some of the values to fall way out in the tails, according to the probabilities just described. And you could put cutoff points for more extreme outliers further out in the tails. Many studies use a cutoff of .01 for extreme values. In that case you're down to 0.5% in the left tail and 0.5% in the right tail, so you're way out in the tails of the distribution.
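Going the other way, from a chosen tail probability to its cutoff point, uses the inverse of the normal cumulative distribution. The following sketch assumes a two-tailed cutoff split evenly between the tails; the name `critical_z` is invented for the example.

```python
from statistics import NormalDist

def critical_z(alpha):
    """z value that leaves alpha/2 in each tail of the standard normal curve."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.05, 0.01):
    print(alpha, round(critical_z(alpha), 3))
```

For .05 this gives 1.96, and for .01 it gives a cutoff of roughly 2.58 standard deviations from the mean.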

Practice
Exercise 3: Z values like plus or minus 1.96 in the tails of a distribution can be used to define cutoff points for rare or unlikely values.
True
False

Suppose you wanted to determine the average diastolic blood pressure for a population of students. To do this you take successive groups of five students and find the mean for each group: the average for the first group of five, for the second group of five, for the third group of five, and so on. Suppose you did this a very large number of times and then plotted all the means from the samples of five in a graph to see how these means were distributed. You wouldn't be looking at the distribution of individual diastolic blood pressures, but at the distribution of the averages of these groups. It turns out that most of the really important curves you see in statistics are these kinds of distributions, rather than distributions of the raw underlying data. They are so important that they have a special name: sampling distributions of the mean. A sampling distribution of the mean is just a distribution of sample means.
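The repeated-sampling procedure just described can be imitated with a small simulation. Everything numeric below is an assumption made for illustration: a simulated population of 10,000 students with diastolic pressures drawn from a normal distribution with mean 80 and standard deviation 10, and 2,000 repeated groups of five.

```python
import random
from statistics import mean, stdev

def sample_means(population, sample_size, n_samples, rng):
    """Draw n_samples random groups of sample_size and return each group's mean."""
    return [mean(rng.sample(population, sample_size)) for _ in range(n_samples)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of diastolic pressures (mm Hg).
population = [rng.gauss(80, 10) for _ in range(10_000)]

means_of_5 = sample_means(population, 5, 2_000, rng)

print("individuals: mean", round(mean(population), 1), "SD", round(stdev(population), 1))
print("means of 5:  mean", round(mean(means_of_5), 1), "SD", round(stdev(means_of_5), 1))
```

The list `means_of_5` is the simulated sampling distribution of the mean. Its center sits near the population mean, while its spread is noticeably smaller than the spread of individual pressures.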

Practice
Exercise 4: Taking repeated samples of a given size, finding each sample's mean, and then plotting the distribution of all the sample means produces a:
Poisson distribution
Binomial distribution
Sampling distribution
Z-score distribution

2.1.3 Properties of Sampling Distribution of Means
An interesting thing happens when you take averages and plot them this way. The size of the sampling groups (5 in the current case) affects the width of the resulting distribution of sample means. Think about it for a moment. As the size of the samples increases, how do you think that affects the width of the resulting distributions of means? What if you formed groups of 10, repeatedly computing the mean of each group of 10? How would the width of the distribution of means of samples of 10 differ from that for a distribution of means of samples of 5? Look at Figure 3.1 below for a hint.

Figure 3.1: Sampling Distributions

You might think intuitively that the larger the sample size gets, the more the means of the samples will tend to be similar. In other words, little samples might have means that are spread out all over the place, while big samples, because they have so many values, will have means that are pretty much the same. In fact, this is exactly what happens. You're averaging out the differences within the group as the samples get larger.
Consequently, when you plot the graphs you get curves that are tighter as the size of the samples grows larger. Looking at Figure 3.1, you see that the underlying population has the widest curve, and the curves become tighter, that is, have less variability, as the sample size increases. With groups of 4 you get a narrower distribution. With groups of 16 the distribution becomes narrower still. With groups of 64 it is narrower yet, and so on.
So what happens is that the fatness of the original underlying curve is reduced as a function of the sample size. You could phrase it as follows: as the sample size increases, the standard deviation of the distribution of sample means decreases. It is also true that the sampling distributions for successively larger samples approximate a Gaussian distribution more and more closely.
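The narrowing with sample size can be seen in a simulation that repeats the sampling for groups of 4, 16, and 64. As before, the population is hypothetical (normal, mean 80, standard deviation 10), and the function name `spread_of_means` is an assumption for this sketch.

```python
import random
from statistics import mean, stdev

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population, for illustration only.
population = [rng.gauss(80, 10) for _ in range(20_000)]

def spread_of_means(sample_size, n_samples=2_000):
    """Standard deviation of the means of n_samples random groups."""
    means = [mean(rng.sample(population, sample_size)) for _ in range(n_samples)]
    return stdev(means)

spreads = {n: spread_of_means(n) for n in (4, 16, 64)}
for n, spread in spreads.items():
    print("groups of", n, "-> SD of sample means", round(spread, 2))
```

With a population standard deviation near 10, the spreads come out at roughly 5, 2.5, and 1.25: each fourfold increase in group size cuts the width of the distribution of means about in half.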

Practice
Exercise 5: Larger samples result in fatter sampling distributions.
True
False

2.1.4 Standard Error
A natural question to ask then is, "What are the standard deviations of these distributions of sample means?" The answer tells us how far the sample means tend to be from the real population mean on average, and so gives us a measure of the average error, or standard error, in the sample means. So the standard deviation of the sampling distribution is called the standard error. The standard error tells us the fatness of the sampling distribution curves. You now know that as the sample size increases the standard error decreases. In other words, the sample means become better (more accurate) estimates of the population mean with larger samples. Because you're summing when you average, these distributions are nearly normal, so you also know that the population mean plus or minus one standard error contains about 68% of the sample means.
Now, it turns out that you don't actually have to carry out this repeated sampling and averaging to find the standard error. You don't need to take 10 people here and 10 people there and 10 more people and so on and average them all. You only need to take 10 people! To find an estimate of the standard error, all you need to do is divide the standard deviation of the sample by the square root of the sample size (see formula below).

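Standard Error of the Mean

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$

where s is the standard deviation of the sample and n is the sample size.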
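As a small worked illustration of the rule "standard deviation of the sample divided by the square root of the sample size", the sketch below estimates the standard error from a single sample of 10. The readings are invented for the example, and the function name `standard_error` is an assumption.

```python
from math import sqrt
from statistics import mean, stdev

def standard_error(sample):
    """Estimated standard error of the mean: sample SD / sqrt(sample size)."""
    return stdev(sample) / sqrt(len(sample))

# Hypothetical diastolic pressures (mm Hg) from one sample of 10 students.
sample = [78, 82, 75, 90, 85, 80, 88, 72, 84, 86]

print("sample mean:", mean(sample))
print("sample SD:", round(stdev(sample), 2))
print("standard error:", round(standard_error(sample), 2))
```

Here `stdev` is the sample standard deviation (dividing by n - 1), which is the usual choice when the sample is being used to estimate a population value.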

You can use the standard error to determine what the chances are that a particular group average comes from an oddball group. More generally, you can use the standard error to determine the chances that a particular group mean lies within any given range of values. About 68% of sample means lie within 1 standard error of the real population mean. You can see, then, how the standard error gives you a measure of the accuracy of any sample mean as an estimate of the real population mean.
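To make the idea of "chances that a group mean lies within a given range" concrete, the sketch below treats the sampling distribution as normal, centered on the population mean with the standard error as its standard deviation. The population mean of 80 and standard error of 2 are hypothetical values, and `prob_mean_in_range` is a name invented for the example.

```python
from statistics import NormalDist

def prob_mean_in_range(pop_mean, std_error, low, high):
    """Chance that a sample mean falls between low and high, assuming the
    sampling distribution is normal with the given mean and standard error."""
    sampling_dist = NormalDist(mu=pop_mean, sigma=std_error)
    return sampling_dist.cdf(high) - sampling_dist.cdf(low)

# Hypothetical values: population mean 80 mm Hg, standard error 2 mm Hg.
print(round(prob_mean_in_range(80, 2, 78, 82), 4))  # 0.6827, within 1 standard error
print(round(prob_mean_in_range(80, 2, 76, 84), 4))  # 0.9545, within 2 standard errors
```

A group mean far outside such a range would be a candidate "oddball" group, in the sense that repeated sampling would rarely produce it.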
So, an estimate of the standard error can be computed, for a given sample size, from one sample. You don't need to take repeated samples. Consequently, from one sample you know approximately what results repeated sampling would produce. The formula also shows that the standard error shrinks, and so the accuracy of the sample mean improves, as the sample size grows.
In most medical research the focus is not on individuals, but on groups of individuals. You have a sample of 10 or 20 or 50 or 60. Any time you have a sample of more than one you can compute this new statistic called the standard error. Remember, it's just the standard deviation of the sample divided by the square root of the sample size. So rather than looking at individuals, you're now looking at groups of individuals. That is probably the most difficult conceptual leap in statistics: moving from individual measures to group measures and the standard error. But keep these things in mind, because they form the foundation for the next topics, particularly the very important t-statistic.

Practice
Exercise 6: The standard deviation of the sampling distribution of the mean is:
1
The standard error
The standard error times the sample size
Approximately 68%

End Lesson 2.1