# Biostatistics for the Clinician

## 2.2 Hypothesis Testing

### 2.2.1 Formulation of Hypotheses

Inferential statistics is all about hypothesis testing. The research hypothesis will typically be that there is a relationship between the independent and dependent variables, or that the treatment has an effect which generalizes to the population. On the other hand, the null hypothesis, upon which the probabilities are based, will typically be that there is no relationship between the independent and dependent variables, or that the treatment has no real effect. In other words, apparent differences between the samples are flukes. To examine these hypotheses many different kinds of statistical tests are used. These tests are chosen based on the kinds of variables being examined and the kinds of samples.
**Practice Exercise 1.** The null hypothesis states that there is:

- No Response
- A relationship between independent and dependent variables
- An effect
- No relationship between independent and dependent variables
- No effect
- Both the 2nd & 3rd buttons are correct
- Both the 4th & 5th buttons are correct

### 2.2.2 Performing Tests

Recall now the standard error of the mean. The standard error is the average error that would be expected in using a sample mean as an estimate of the real population mean. It turns out to also be the basis for many of the most frequently performed statistical tests. In particular, it's the basis of one of the more well known test statistics called the t-statistic.

#### Parametric Tests: t-Tests

Here's how it works. Suppose you have two groups, a group with treatment A and a group with treatment B. So, you have a difference between two sample means. You can then compute the standard error of that difference. You take the ratio of the difference in the sample means to its standard error, and that ratio is the t-statistic. It's very much like a z-statistic. If the t-statistic is too big, it's out in the tails of the distribution, and you conclude it's probably not just chance. It's probably an effect of the treatment.

Once you have the t-statistic you can go to tables that tell you whether or not these two groups, A and B, are significantly different. So the t-test is performed by just dividing the difference between the two group means by the standard error of the difference.
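As a sketch of that arithmetic (plain Python, not the course's software, with made-up blood-pressure readings), the pooled two-sample t-statistic is just the difference in group means divided by the standard error of that difference:

```python
import math
from statistics import mean, stdev

def t_statistic(a, b):
    """Ratio of the difference in sample means to the standard error
    of that difference (pooled-variance, independent samples)."""
    na, nb = len(a), len(b)
    # Pool the two sample variances, weighting by degrees of freedom.
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    se_diff = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / se_diff

# Hypothetical diastolic pressures: treatment group A vs. treatment group B.
group_a = [88, 84, 86, 90, 85, 87]
group_b = [92, 95, 91, 94, 93, 96]
t = t_statistic(group_a, group_b)  # a large |t| lands out in the tails
```

With these illustrative numbers |t| is well beyond 2, so you would go to a t-table and conclude the groups differ significantly.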

It makes a difference whether the groups are independent or related. So, you'll see the terms independent or paired samples. Independent means treatment group A and treatment group B have no relationship to each other. They're just two groups. A lot of times though you'll do an experiment where you measure a group, treat them, and then look at that same group again afterwards. That's called a paired sample. So you're looking not at two different groups, but the same group observed at two different times.

You may even have so-called crossover designs. In a crossover design, you give them drug A for a while, followed by a washout period, and then you give drug B afterwards to the same group. So that's a paired sample too.

The main thing here is the concept. The t-statistic is a standardized measure of how far apart the treatment means are, in the same sense that the z-statistic is a standardized measure of how far a given value is from the group mean. To put it another way, the t-statistic measures how far apart the mean values and their distributions are. Using the t-statistic you can tell whether sample means come from different distributions or from the same distribution.

The skinnier the distributions (that is, the smaller the standard error, which means the larger the n), or the farther apart the mean values are, the better the chances that you'll be able to conclude that treatment A and treatment B really have different results. Independent versus paired samples typically refers to whether it's the same people or different people that you're measuring.

**Practice Exercise 2.** Just as you can compute a z-statistic for a value, from which you find its probability, you can compute a t-statistic for the mean of a sample, from which you find its probability.

- No Response
- True
- False

#### Nonparametric Tests

Remember in the Gaussian distribution there were parameters? You had the mean and the standard deviation. There are other kinds of tests. There is a whole class of tests that doesn't depend upon knowing the distribution of the results. These are the nonparametric tests, also sometimes referred to as distribution free tests. They don't depend on your knowing the mean value or knowing the standard deviation or anything like that, so you can apply them more generally. But, they're not necessarily as likely to detect significant effects. These tests do not require quantitative dependent variables and do not require Gaussian distributions (see Nonparametric Tests of Significance Table below).

**Nonparametric Tests of Significance**

|               | Nominal      | Ordinal                            |
|---------------|--------------|------------------------------------|
| General       | Chi Square   | Mann-Whitney U, Kolmogorov-Smirnov |
| Small numbers | Fisher Exact |                                    |
| Paired        | McNemar      | Sign Test, Wilcoxon                |

It is valuable to have some familiarity with the fact that these tests exist and to know something about when they are used. Don't be overly concerned here about what the names mean, names like Mann-Whitney U and Kolmogorov-Smirnov, but be aware that these are nonparametric tests. You can use these tests with the kinds of non-quantitative variables you see in the table above.

The table says that for nominal values, the appropriate test is one of the well known Chi-square tests. Think back now. If you have means, standard deviations and standard errors, and you have interval or ratio variables that are approximately Gaussian, you apply a parametric test like a t-test. If you have nominal data you apply a chi-square test. If it's a small sample you'd use the Fisher Exact Test for nominal data. If you have paired data, that is the same person is measured twice, you'd use a McNemar test. If it's ordinal data, then you'd use the Mann-Whitney U or Kolmogorov-Smirnov. If it's paired ordinal data use a Sign Test or Wilcoxon.

The point is -- if you know what kind of data you have (nominal, ordinal, interval or ratio) and whether it's paired or not paired, and whether you want a parametric or nonparametric test, you can go to a textbook and find out which test to use. In fact, there are little computer programs that ask you questions like this. "Is your data nominal?" "Is your data ordinal?" And so on. "Is it paired data or not paired?" The programs then tell you what kind of test and how you do the test.
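The question-and-answer programs described above can be sketched as a small decision function. This is illustrative only (the function name and its string labels are not any real package's API), mirroring the decision questions in the text:

```python
def choose_test(level, paired, small_sample=False):
    """Pick a significance test from the measurement level of the data,
    following the decision questions in the text.  Illustrative only."""
    if level in ("interval", "ratio"):
        # Quantitative, roughly Gaussian data: parametric t-tests.
        return "paired t-test" if paired else "t-test"
    if level == "nominal":
        if paired:
            return "McNemar"
        return "Fisher Exact" if small_sample else "Chi-square"
    if level == "ordinal":
        if paired:
            return "Sign Test or Wilcoxon"
        return "Mann-Whitney U or Kolmogorov-Smirnov"
    raise ValueError("level must be nominal, ordinal, interval or ratio")
```

For example, `choose_test("nominal", paired=False, small_sample=True)` returns "Fisher Exact", matching the table above.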

**Practice Exercise 3.** The primary difference between parametric and nonparametric tests is that nonparametric tests:

- No Response
- Are more likely to be significant
- Can be applied more generally
- Require more parameters
- Assume Gaussian distributions

### 2.2.3 Types of Error

The following is critically important. It is big time stuff. It is something you really have to understand if you're going to interpret any kind of statistical analysis. So look at the following table.

**Types of Errors**

| Study's Conclusion                 | Reality: True Effect  | Reality: No Effect   |
|------------------------------------|-----------------------|----------------------|
| True Effect (reject null hyp.)     | Truth                 | Type I (alpha) Error |
| No Effect (don't reject null hyp.) | Type II (beta) Error  | Truth                |

In the past you may have looked at a two-by-two table with "Disease Present" and "Disease Not Present" across the top and "Test Positive" and "Test Negative" down the side. You've seen other kinds of two-by-two tables where you calculated things like sensitivity and specificity and so forth. In this case you're looking at four possible types of events that may occur with the outcome of an experiment. Across the top of the table you have the reality: either there is an effect or there is no effect. Down the side of the table you have the experimenter's conclusion: the study may conclude there is an effect or that there is no effect. So you have four possible outcomes. Two of these events are errors. Two of them are correct conclusions.

Let's look first at the upper left hand corner of the table. Suppose you sampled ten people and from the study you concluded that the drug had an effect; it appeared to reduce their diastolic blood pressure. If in fact your conclusion is not the result of a fluke, a product of sampling error or measurement error or something like that, then your conclusion is correct and you have the truth. You're in the upper left corner of the table. On the other hand, suppose you tried drug A to reduce diastolic blood pressure and concluded it didn't work. If that is in fact the case, you've made another correct decision: you have the truth and you're in the lower right hand corner of the table. So if the test showed no effect and there really is no effect, then you're in the lower right corner.

Let's rephrase it. If there is an effect and you conclude there is an effect, you've made the correct decision in the upper left corner. If there is no effect and you conclude there is no effect, you've made the correct decision in the lower right corner. In both these cases you've made the correct decision.

If there is no effect, but you conclude there is an effect, you've made the error in the upper right corner, a Type I Error or alpha error. Alpha is the probability of a Type I Error. If there is an effect, but you conclude there is no effect, then you've made the error in the lower left corner, a Type II Error or beta error. Beta is the probability of a Type II Error.

Sometimes people have a hard time remembering whether alpha and beta go with Type I or Type II. Just remember both alpha and "I" are first. They go together. Both beta and "II" are second. They go together. You'll never get them mixed up.

Let's be more concrete now. The first kind of error, Type I or alpha errors are like false alarms. You think you're on to something; there's an effect. But, in reality, there isn't. Type I Error occurs when you do an experiment and an apparent effect isn't a real one. Maybe groups were a little strange so that they responded better or worse to the drug than the average person might. So, you get an experiment in which you seem to have an effect, when in reality there is no effect. That's called a Type I or alpha error. It's a false positive.

Now you can also have an experiment where you find no effect. In fact, a lot of experiments are like this: you can't distinguish any effect, but in reality the treatment was producing, or could produce, a good result in the population. In this case, you've missed a good treatment, because your experiment didn't show an effect. That sort of error is a beta error or Type II Error. It's a false negative. So a Type I Error is a false positive and a Type II Error is a false negative.

**Practice Exercise 4.** Which is a Type I Error?

- No Response
- Falsely concluding there is an effect
- Correctly concluding there is an effect
- Falsely concluding there is no effect
- Correctly concluding there is no effect

**Practice Exercise 5.** Which is a Type II Error?

- No Response
- Falsely concluding there is an effect
- Correctly concluding there is an effect
- Falsely concluding there is no effect
- Correctly concluding there is no effect


The next question when doing research is, "How can you minimize the chances of an error?" You'd like to make the correct decision every time, but because of practical constraints you can't. You'd need larger and larger samples to drive the chance of error toward zero, so certainty in every experiment would be very expensive. Instead you use sample sizes of 10 or 50 or 100 rather than 100,000 or a million. You're forced by costs to manage these errors, and there's a set of conventions about how to do it. That's where the p value that you've heard about comes in.

You set, in advance, the probability of a Type I Error you are willing to tolerate in an experiment; it's called the alpha level. Alpha is the probability of a Type I Error. The convention, typically, is to set the alpha level at no more than 5%. What does that mean? It means the chances are less than 5% that you have an oddball sample, a false positive result. You could still have an oddball sample, but the chance would be 5%: it would happen one time out of twenty. What you're saying is that you will not regard any result as statistically significant unless it is one that would occur less than one time out of 20 simply by chance. In other words, you accept it only if it's at least 95% certain that it wasn't just a fluke.
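One way to see what an alpha of .05 means is a quick simulation (a plain-Python sketch with made-up population numbers, mean 100 and sd 10): draw thousands of samples from a population in which the null hypothesis is true, and count how often a two-tailed z-test at the 5% level declares an effect anyway.

```python
import math
import random

random.seed(1)  # make the illustration repeatable

def z_stat(sample, mu0, sigma):
    """z-statistic for a sample mean against a hypothesized population mean."""
    n = len(sample)
    return (sum(sample) / n - mu0) / (sigma / math.sqrt(n))

# The null hypothesis is TRUE here: every sample comes from the same
# population (mean 100, sd 10).  Any "significant" result is a fluke.
trials = 4000
false_positives = sum(
    abs(z_stat([random.gauss(100, 10) for _ in range(25)], 100, 10)) > 1.96
    for _ in range(trials)
)
false_positive_rate = false_positives / trials  # hovers near .05
```

Roughly one trial in twenty still comes out "significant", which is exactly the false-positive risk that alpha controls.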

**Practice Exercise 6.** The probability of a false positive is:

- No Response
- Alpha
- Beta
- .95
- .05

### 2.2.4 Power, Sample Size, Effect Size and Clinical Significance

Let's introduce one more term now: power. Power is 1 - beta. It's called power because it measures the ability of the experiment to distinguish an effect when the effect is really there; it gets at how small an effect the experimental microscope can resolve. You typically want the beta error to be less than 20%, which means a power of about 80%: an 80% chance of finding an effect if it's there. The reason you don't go for more is that you reach a point of diminishing returns at .80; you'd have to get bigger and bigger samples, and it costs too much money. So, typically you like to see an alpha of .05 and a power of .80, that is, a beta of .2.
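The same kind of simulation as before shows power. Here the effect is real (a true mean of 110 against a null of 100, sd 10, and n = 8 per sample; all numbers are illustrative choices that put the design near .80 power), and power is simply the fraction of experiments that detect it:

```python
import math
import random

random.seed(2)  # repeatable illustration

def detects_effect(sample, mu0, sigma):
    """Two-tailed z-test at alpha = .05."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return abs(z) > 1.96

# Here the effect is REAL: the true mean is 110 but the null says 100.
trials = 4000
hits = sum(
    detects_effect([random.gauss(110, 10) for _ in range(8)], 100, 10)
    for _ in range(trials)
)
power = hits / trials   # chance of catching the real effect (near .80)
beta = 1 - power        # chance of a false negative (Type II Error)
```

About four experiments in five catch the effect; the other one in five is the Type II Error that beta measures.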

**Practice Exercise 7.** You should strive to achieve a power of:

- No Response
- .05
- .20
- .80
- .95

#### Factors Affecting the Choice of Alpha and Power

Now, what if you're researching a drug, and you want to know whether the drug is effective and the drug has nasty side effects. Would that affect the alpha error you choose? The alpha error tells you what your chances are of concluding the drug is effective when it's really not. It tells you what your chances of a false positive are. If the drug has really nasty side effects would you want to increase or decrease your alpha? Would you want to set it higher than 5% or lower than 5%?

You'd want to lower your alpha to under 5%. You want to reduce the chance of your falsely concluding this is a good drug when it's not, because it has nasty side effects. So, the 5% is the convention, but if you have good reasons for increasing or decreasing it by all means do so.

On the other hand, suppose you are starting a new program of research, the drug has no harmful side effects, and you want to reduce the chances of missing an important effect (especially since at this point your procedures may be relatively unrefined). Then you may want to increase your alpha level to, say, .10.

That is, your experiment is designed so that you have a 10% chance of a false positive. It matters less if you misapply this drug; it's not going to hurt anybody. So, on the one hand, you may want a .01 or even a .001 alpha level if the drug has nasty side effects. Or, you may want a .10 alpha level if you are doing a pilot study.

What if this drug is for a horrific disease, a crippling disease or a life threatening disease? Well, you don't want to do an experiment that causes you to miss the good drug. Assume it is a very devastating disease and you've got a chance to do something about it. You want to make sure that if the drug works you don't miss it. You could try to reduce the beta error to .1 instead of .2, that is, increase your power from .8 to .9 or maybe .95, whatever it takes. Of course, increasing alpha increases power, so that is one of your alternatives.

So these numbers are important for you to interpret, not just the statistician, because only you know the medical aspects of the treatments you're using, how devastating the disease is, or how painful the side effects might be. You are the person who knows these issues. The statistician then builds an experiment to guarantee that your alpha level is what you want it to be and that your beta error, or your power, is what you want it to be. You'll be interacting with the statistician on these kinds of issues, but you'll have to use your medical knowledge to decide these sorts of things.

**Practice Exercise 8.** If you are researching a treatment that may be dangerous you should consider:

- No Response
- Increasing alpha
- Decreasing alpha
- Increasing power
- Decreasing beta

### 2.2.5 Working with a Statistician

The figure below shows what alpha, beta and power look like in a graph and illustrates some of the relationships between them. Remember, alpha and beta represent the probabilities of Type I and Type II Errors, respectively.

*Figure 3.3: Alpha and Beta Errors*

Since power is 1 - beta, the area to the right of 107.5 under the right curve represents your power in this experiment. It should be apparent from the graph that alpha, beta and power are closely related. In particular, you can see that reducing alpha is equivalent to moving the vertical line between the two sample means to the right. When you do this, alpha decreases, power (1 - beta) decreases, and beta increases. On the other hand, moving that same vertical line to the left increases alpha, increases power, and decreases beta. To put it another way, increases in alpha increase power and decreases in alpha decrease power.

**Practice Exercise 9.** As you decrease alpha from .05 to .01 you:

- No Response
- Increase Type I Error
- Increase power
- Decrease Type II Error
- Decrease power


More specifically, in the graph the assumption is that you have a hypothesized mean value of 100 (IQs or something like that). You have a two-tailed alpha level of .05 (.025 in each tail), giving a critical value of 107.5. That means that if you get a sample mean of 107.5 or larger it will be statistically significant. Let's also suppose that your sample and effect sizes are large enough that your power is .80 with an effect of this size (10 units). That would mean that beta is .20 and you have a 20% chance of a false negative or Type II Error, while you have already set the chance of a false positive in this tail at about 2.5%. This would be a well designed experiment. Say from your sample of 10 or 20 or however many you have, you obtained a mean of 107.5. The question is, "Given the sample mean of 107.5, is this sample from the group on the left with a mean of 100, or from the group on the right with a mean of 110?" There is no effect if it comes from the group with the mean of 100.
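With the figure's numbers you can compute alpha, beta and power directly from the normal curve. This sketch assumes a z-test and infers the standard error (about 3.8) from the 107.5 critical value, which the text does not state; with these particular numbers the power comes out in the mid-.70s, close to the round .80 used in the discussion. Moving the cutoff to the right also shows the trade-off from Figure 3.3:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Figure numbers: null mean 100, alternative mean 110, critical value
# 107.5.  A two-tailed .05 alpha puts the cutoff 1.96 standard errors
# above 100, so the implied SE is 7.5 / 1.96 (about 3.8).
se = 7.5 / 1.96
alpha_upper = 1 - norm_cdf((107.5 - 100) / se)  # upper-tail alpha, .025
beta = norm_cdf((107.5 - 110) / se)             # chance of missing the effect
power = 1 - beta                                # mid-.70s here

# Moving the cutoff to the right (say, to 110) trades one error for
# the other: alpha shrinks but power shrinks with it.
alpha_strict = 1 - norm_cdf((110 - 100) / se)
power_strict = 1 - norm_cdf((110 - 110) / se)
```

The stricter cutoff drops alpha well below .025, but power falls to .50, exactly the relationship described under Figure 3.3.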

And you say, "Well, how do you arrive at things like that?" That's another thing that you as the physician will have to tell the statistician. You'll have to say, "I want to be able to detect the smallest possible effect that's useful to me clinically." That, along with the sample size, determines the power. By achieving .80 power you protect yourself from doing an expensive experiment that has a low probability of detecting an effect. So you will need to tell the statistician how big a difference, or how small a difference, is still clinically important. Now, the bigger you make this difference that's clinically important, the more power you have and the easier it is to do the experiment. You can see from the figure above that the bigger the effect is, the further apart the means are, the less the curves overlap and the greater the power. Suppose the distribution on the right had a mean of 120. Your experiment would require fewer people to detect that kind of difference. So the further apart the means, the bigger the useful effect, the smaller the experiment can be.

**Practice Exercise 10.** Larger effects increase:

- No Response
- Alpha
- Power
- Beta
- 1-Power


You have two ways, other than alpha, of controlling the power: sample size and effect size. Keep in mind that the width of the sampling distributions in Figure 3.3 above is measured by the standard error, which is determined by sample size. On the other hand, the distance between the means is the size of an effect determined to be important by the experimenter; it's what you consider to be the minimum important difference. So, you can either have more people, shrinking the width of these distributions to see that they're different, or you can set the minimum effect you're interested in detecting to a larger value. The table below provides some illustrations of how sample size and effect size affect statistical significance.

**Relationship of Sample Size and Mean Values to Achieve Statistical Significance**

| Sample Size | Sample Mean | Population Mean | p    |
|-------------|-------------|-----------------|------|
| 4           | 110.0       | 100.0           | 0.05 |
| 25          | 104.0       | 100.0           | 0.05 |
| 64          | 102.5       | 100.0           | 0.05 |
| 100         | 102.0       | 100.0           | 0.05 |
| 400         | 101.0       | 100.0           | 0.05 |
| 2,500       | 100.4       | 100.0           | 0.05 |
| 10,000      | 100.2       | 100.0           | 0.05 |

You can see from the table that, in addition to changing your alpha level, you have two more ways of controlling the power of your experiment. One is to increase the size of the sample, which shrinks the width of the sampling distributions. The other is to increase the minimum size of effect you want to be able to detect.
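The table's pattern can be reproduced with the rule of thumb that, at p = .05, a sample mean must sit roughly two standard errors away from the population mean. This assumes a population standard deviation of 10, which the table does not state; it is inferred from the numbers:

```python
import math

sigma = 10.0  # assumed population sd; inferred, not stated in the table
# Two standard errors away from the population mean is roughly the
# p = .05 cutoff, so the smallest detectable difference is
# 2 * sigma / sqrt(n): it shrinks as the sample grows.
smallest_detectable = {
    n: 2 * sigma / math.sqrt(n)
    for n in (4, 25, 64, 100, 400, 2500, 10000)
}
# n = 4 needs a 10-point difference; n = 10,000 detects 0.2 points.
```

The computed differences (10, 4, 2.5, 2, 1, 0.4, 0.2) match the sample-mean column of the table above.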

These are the kinds of considerations you'll be asked to deal with as a physician. Figure 3.4 below shows the same kind of thing you saw illustrated in the table above in a different form (see Figure 3.4).

*Figure 3.4: Power as a Function of Sample Size*

In Figure 3.4 the different curves represent what happens with different effect sizes. Effect sizes are defined in terms of fractions of the standard deviation. In other words the 0.2, 0.4, and 0.6 are all differences in means divided by the standard deviation, which we're assuming is the same for the two distributions. So, in the bottom curve, where the difference between the means is fairly small compared to the standard deviation, it's obvious you have to have a large sample size to get adequate power.

On the other hand, if you confine your interest only to larger effects, you can see by the top curve that a much smaller sample size provides the .80 power you want. With a .6 effect size, a sample of 20 provides .80 power. With a .2 effect size, a sample of 100 still doesn't provide .80 power. So, other things being equal, the minimum effect you're looking for in an experiment determines the sample size you need for adequate statistical power. If you want to find a tiny difference between the treatments, you're going to need a very large sample size. If you're willing to look only for a large effect, you can get by with a smaller sample.
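Curves like those in Figure 3.4 can be approximated with a simple normal formula for two-tailed power. This is a rough sketch (the figure's exact curves may rest on slightly different assumptions, and this approximation lands a bit under the .80-at-20 reading given above), but it shows the same shape: power climbs with both the standardized effect size and n.

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def approx_power(effect_size, n, z_alpha=1.96):
    """Approximate two-tailed power for a standardized effect size
    (difference in means divided by the sd) at sample size n."""
    return 1 - norm_cdf(z_alpha - effect_size * math.sqrt(n))

# A .6 effect reaches respectable power with only about 20 subjects,
# while a .2 effect is still underpowered even at n = 100.
big_effect_small_n = approx_power(0.6, 20)
small_effect_big_n = approx_power(0.2, 100)
```

Pushing the small effect to n = 400 finally brings its power above .90, which is why tiny differences demand very large samples.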

**Practice Exercise 11.** You can control your power in an experiment by changing:

- No Response
- Alpha
- Effect size
- Sample size
- All except "No Response" above
