3.1 Correlation & Regression Analysis

Biostatistics for the Clinician

University of Texas-Houston
Health Science Center

Lesson 3.1

Correlation and Regression Analysis

Lesson 3: Clinical Decision Making in a Multivariable Environment 3.1 - 1

3.1 Correlation and Regression Analysis

3.1.1 Simple Correlation and Regression

Scatterplots

You probably already have a feel for what a relationship between two variables means. To enrich that understanding, the plots in Figure 13.3 below show you some concrete examples of the meaning of a particular measure of relationship called the correlation coefficient.

Figure 13.3 Scatterplots

Let's assume that in the graph the variable X on the horizontal axis represents a mother's weight, and that the variable Y on the vertical axis represents the birth weight of the mother's infant. If the two variables increase together and are perfectly correlated, all points in the scatterplot fall on a rising straight line, as you see in the upper left graph of Figure 13.3. In this case the correlation coefficient is +1. If they are perfectly correlated and as one variable increases the other decreases, the points lie on a falling straight line, as you see in the upper right graph of Figure 13.3. In this case the correlation is -1.

In the middle left graph you see a little bit of scatter about the line. The correlation coefficient here is .93. In the middle right graph you see more scatter and the correlation is about .49. In the lower left graph you see much more scatter and the correlation coefficient is about .02. In general, the correlation coefficient varies from +1 to -1, with 0 indicating no linear relationship between the variables.

The presence of a correlation means that, given an X value on the horizontal axis, you can make a prediction of the Y value by using the straight line predictors you see in the graphs. The stronger the correlation, the more accurate the prediction. The correlation coefficient just measures the degree to which the points in the scatterplot approximate a straight line.
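To make the two extreme cases concrete, here is a minimal sketch of how a correlation coefficient can be computed. The function name pearson_r and the five X values are made up for illustration; they are not data from Figure 13.3.

```python
import numpy as np

def pearson_r(x, y):
    """Correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of X about its mean
    dy = y - y.mean()  # deviations of Y about its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical X values, chosen only to illustrate the two perfect cases.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_rising = pearson_r(x, 2.0 * x + 1.0)     # points on a rising straight line
r_falling = pearson_r(x, -3.0 * x + 10.0)  # points on a falling straight line
print(r_rising, r_falling)
```

Points on a rising line give a coefficient of +1 and points on a falling line give -1, whatever the steepness of the line.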

Correlation and Regression Analysis
Practice Exercise 1: Larger magnitude correlations indicate larger: (Approximations / Lines / Relationships / Variability)

Linear vs. Curvilinear Regression

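If you remember the old equation, y = mx + b, from high school algebra, that's the equation for a straight line. It relates two variables, Y and X, through the slope m and the intercept b, the point where the line crosses the Y axis. The slope, in particular, is tied to the correlation coefficient: for the fitted line the two always have the same sign, and the slope equals the correlation coefficient multiplied by the ratio of the standard deviation of Y to the standard deviation of X.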

Now, the points could be scattered as in the lower right hand corner graph of Figure 13.3. In this sort of situation if you try to fit a straight line, you get a relatively small correlation coefficient (.21 in the present case), showing it's not a very good fit. The correlation typically measures only the amount of straight line, or linear, relationship between variables. On the other hand, it is very clear by looking at the graph that you should be able to make very good predictions about Y by drawing a curve through the points.
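A quick way to convince yourself of this is to compute a linear correlation on points that lie exactly on a U shaped curve. The sketch below uses made-up values (13 evenly spaced X values with Y equal to X squared), not the points of Figure 13.3.

```python
import numpy as np

# Hypothetical data: a perfect U shape, symmetric about X = 0.
x = np.linspace(-3.0, 3.0, 13)
y = x ** 2

# Linear correlation between X and Y, from NumPy's correlation matrix.
r_linear = float(np.corrcoef(x, y)[0, 1])
print(r_linear)
```

Y is completely determined by X here, yet the linear correlation is essentially zero, because the falling half of the curve cancels the rising half.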

You could, if you had some theoretical reason for it, fit your points to a U shaped curve and not use a linear correlation coefficient. In fact, in medicine you'll meet U shaped curves rather than straight lines fairly frequently. Say you have a particular treatment. As you increase the treatment it improves the patient's outcome until the treatment reaches some critical level, and beyond that it worsens the patient's outcome. So you may be looking for a curve rather than a straight line, and you want to know where the maximum is because you want to get the maximum benefit from your treatment. You don't want to have so much treatment that it causes a problem. You would try to fit these points not to a straight line but to a curve. In most elementary statistics classes you only fit things to straight lines, but it's just as easy to fit them to curves if you have a theoretical reason for using a curve.
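Here is a minimal sketch of fitting a curve instead of a line and locating its maximum. The dose and outcome numbers are hypothetical, and a quadratic is only one possible choice of curve; use it when you have a theoretical reason to expect that shape.

```python
import numpy as np

# Hypothetical treatment levels and patient outcomes (an inverted U).
dose = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
outcome = np.array([10.0, 18.0, 24.0, 26.0, 25.0, 19.0, 11.0])

# Least-squares fit of outcome = c2*dose^2 + c1*dose + c0.
c2, c1, c0 = np.polyfit(dose, outcome, deg=2)

# A quadratic with negative c2 peaks where its slope is zero.
best_dose = -c1 / (2.0 * c2)
print(c2, best_dose)
```

The fitted c2 is negative, so the curve opens downward, and best_dose estimates the treatment level with the greatest expected benefit.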

Correlation and Regression Analysis
Practice Exercise 2: Linear correlations typically provide a good fit for U shaped curves. (True / False)

Regression Equations and Estimation

Now let's take a look at relationships between variables from another point of view, the point of view of regression analysis. Regression analysis can be thought of as being sort of like the flip side of correlation. It has to do with finding the equation for the kind of straight lines you were just looking at. Study Table 13.1 below briefly and then continue.

Table 13.1 Prepregnancy Weights of Mothers and Birthweights of their Infants
Case Number	Mother's Weight (kg)	Infant's Birthweight (g)
1	49.4	3515
2	63.5	3742
3	68.0	3629
4	52.2	2680
5	54.4	3006
6	70.3	4068
7	50.8	3373
8	73.9	4124
9	65.8	3572
10	54.4	3359
11	73.5	3230
12	59.0	3572
13	61.2	3062
14	52.2	3374
15	63.1	2722
16	65.8	3345
17	61.2	3714
18	55.8	2991
19	61.2	4026
20	56.7	2920
21	63.5	4152
22	59.0	2977
23	49.9	2764
24	65.8	2920
25	43.1	2693

Table 13.1 contains 25 cases, each giving the mother's weight in kilograms and the infant's birthweight in grams. The question is whether you can find some relationship between the mother's weight and the infant's birth weight that allows you to identify likely candidates for fetal alcohol syndrome. Why would such a relationship be important? How could such a relationship between the mother's weight and the infant's birth weight help you?

You want to check for malnutrition with regard to the fetus. You're looking at fetal alcohol syndrome. So what's the problem? Why can't you just say, "This kid's underweight and that kid's overweight or at weight, so this mother used alcohol and that mother didn't"? What else do you need? Why do you have to fool around with some kind of linear relationship between the mother's weight and the infant's birth weight?

You're talking about making the baby's weight the dependent variable, the one that depends upon the mother's weight. Why do you want to know that? To see if there's a correlation. But, if you find a correlation, how does that help in identifying likely candidates for fetal alcohol syndrome? Why would you want to know something about the relationship between the mother's weight and the infant's weight if you're looking at fetal alcohol syndrome?

You're trying to separate two effects here -- the effect of the mother's weight and the effect of the alcohol. You suspect that the infant's weight may be related to the mother's weight regardless of alcohol, and you suspect that the infant's weight may be related to the mother's alcohol consumption. So, you're trying to remove the effect of the fact that heavier mothers possibly have heavier babies. Once you remove that effect, what's left might well be the fetal alcohol syndrome. In other words, if you could figure out what the baby should weigh based on the mother's weight, then when you see a baby that is way above the predicted weight, you might say that baby is a healthy baby. It's likely that mother is not using alcohol. On the other hand, if the baby were way below the predicted weight based on the mother's weight, then you might suspect it's related to fetal alcohol syndrome.

So, you can use regression to remove a relationship, to produce a situation like having a sample of mothers who all have exactly the same weight, if you could find them, and then comparing those who use alcohol with those who don't.

You don't have the ability to hold something constant with people as you do with experimental subjects in the lab. But you do have statistical tools to remove the effects of certain variables, in effect holding them constant without really doing so. So, what we're going to do here is to identify the typical relationship between the mothers' and infants' weights, and to remove that statistically, so what's left might be just what's related to the fetal alcohol syndrome.

Correlation and Regression Analysis
Practice Exercise 3: Regression analysis provides a statistical way of removing relationships due to nuisance variables. (True / False)

Let's look at Table 13.3 below and see what happens. Using linear correlation and regression methods you can find each infant's expected birth weight based on the mother's weight. You can also find the regression equation, from which the other values in Table 13.3 can quickly be derived.

Table 13.3 Prepregnancy Weights of Mothers and Birthweights of their Infants - Deviations About the Linear Line of Regression
Case Number	Mother's Weight (kg)	Infant's Birthweight (g)	Infant's Expected Birthweight (g)	Residual (g)	Squared Residual (g²)
1	49.4	3515	3022.5	492.4	242,502.5
2	63.5	3742	3456.7	285.3	81,378.9
3	68.0	3629	3595.3	33.7	1,135.9
4	52.2	2680	3108.8	-428.8	183,846.9
5	54.4	3006	3176.5	-170.5	29,076.2
6	70.3	4068	3666.1	401.9	161,507.7
7	50.8	3373	3065.7	307.3	94,455.4
8	73.9	4124	3777.0	347.0	120,427.6
9	65.8	3572	3527.5	44.4	1,975.5
10	54.4	3359	3176.5	182.5	33,299.9
11	73.5	3230	3764.7	-534.7	285,857.1
12	59.0	3572	3318.2	253.8	64,433.0
13	61.2	3062	3385.9	-323.9	104,915.8
14	52.2	3374	3108.8	265.2	70,344.9
15	63.1	2722	3444.4	-722.4	521,880.5
16	65.8	3345	3527.5	-182.6	33,325.6
17	61.2	3714	3385.9	328.1	107,644.9
18	55.8	2991	3219.6	-228.6	52,270.3
19	61.2	4026	3385.9	640.1	409,718.9
20	56.7	2920	3247.3	-327.3	107,151.8
21	63.5	4152	3456.7	695.3	483,400.2
22	59.0	2977	3318.2	-341.2	116,392.5
23	49.9	2764	3038.0	-274.0	75,049.0
24	65.8	2920	3527.6	-607.6	369,120.7
25	43.1	2693	2828.6	-135.6	18,376.8
Y = 30.7926X +1501.40

Notice the equation at the bottom of the table. X in the equation is the mother's weight, so if you know the mother's weight you can plug it in here and predict the baby's birthweight. So you take the mom's weight in kilograms, plug it in for X, multiply it by 30.7926, add 1501.40 to it, and that provides the gram estimate of the infant's weight. The expected weights based on those computations appear in the fourth column from the left.

So, anytime that you see a regression equation with one or more variables that looks like this, no matter what the variables are, you can plug in the values of independent variables, multiply, and get the predicted values, in this case, the infant's expected weight.
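As a check on the equation, here is a minimal sketch that fits a least-squares line to the 25 cases of Table 13.1. The variable names are my own, and np.polyfit is just one of several tools that will do this fit.

```python
import numpy as np

# Table 13.1: mothers' prepregnancy weights (kg) and infants' birthweights (g).
mother_kg = np.array([49.4, 63.5, 68.0, 52.2, 54.4, 70.3, 50.8, 73.9, 65.8,
                      54.4, 73.5, 59.0, 61.2, 52.2, 63.1, 65.8, 61.2, 55.8,
                      61.2, 56.7, 63.5, 59.0, 49.9, 65.8, 43.1])
infant_g = np.array([3515, 3742, 3629, 2680, 3006, 4068, 3373, 4124, 3572,
                     3359, 3230, 3572, 3062, 3374, 2722, 3345, 3714, 2991,
                     4026, 2920, 4152, 2977, 2764, 2920, 2693], dtype=float)

# Least-squares straight line: infant_g ~ slope * mother_kg + intercept.
slope, intercept = np.polyfit(mother_kg, infant_g, deg=1)

# Expected birthweight for case 1 (mother's weight 49.4 kg).
expected_case1 = slope * mother_kg[0] + intercept
print(slope, intercept, expected_case1)
```

The fitted slope and intercept should agree closely with the 30.7926 and 1501.40 printed under Table 13.3, and expected_case1 with the first entry of the expected birthweight column.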

Correlation and Regression Analysis
Practice Exercise 4: Regression analysis allows you to predict expected values of dependent variables from the values of: (Other dependent variables / Residuals / Independent variables / Estimates)

Residuals

The "residuals" in the fifth column are simply the differences between each infant's actual weight and predicted weight (actual minus predicted). In the fifth column you can see some babies are below and some babies are above their expected weight. Large negative numbers in the fifth column indicate babies whose weights are far below what you would expect based on the mother's weight alone.
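The residual column can be reproduced directly from the equation under Table 13.3. The sketch below does so and picks out the case with the largest negative residual; the variable names are my own.

```python
import numpy as np

# Table 13.1 data, in case-number order.
mother_kg = np.array([49.4, 63.5, 68.0, 52.2, 54.4, 70.3, 50.8, 73.9, 65.8,
                      54.4, 73.5, 59.0, 61.2, 52.2, 63.1, 65.8, 61.2, 55.8,
                      61.2, 56.7, 63.5, 59.0, 49.9, 65.8, 43.1])
infant_g = np.array([3515, 3742, 3629, 2680, 3006, 4068, 3373, 4124, 3572,
                     3359, 3230, 3572, 3062, 3374, 2722, 3345, 3714, 2991,
                     4026, 2920, 4152, 2977, 2764, 2920, 2693], dtype=float)

# Expected birthweights from the equation printed under Table 13.3.
expected_g = 30.7926 * mother_kg + 1501.40

# Residual = actual weight minus expected weight.
residuals = infant_g - expected_g

# Case number (1-based) of the baby furthest below the expected weight.
worst_case = int(np.argmin(residuals)) + 1
print(worst_case, residuals[worst_case - 1])
```

With these data the largest negative residual belongs to case 15, matching the -722.4 shown in Table 13.3.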

Now, once you've got the expected values you can plot the scatterplot and regression line together on the same graph showing the residuals as the distances the actual points are above or below the expected values on the regression line (See Figure 13.6 below).

Figure 13.6: Scatterplot and Regression Line

Figure 13.6 above shows what the graph looks like when plotted. The points on the line are the estimates of the babies' weights, so if the mom weighs 70 kilograms, for example, the baby's predicted weight is about 3,660 grams (30.7926 × 70 + 1501.40). Babies with large negative residuals, that is, whose weights are far below the predicted weights, are the ones to examine as likely candidates for fetal alcohol syndrome. So what you've got is a very powerful technique for taking humans, who have a lot of variability, and removing some of that variability statistically, so you can focus on the aspects of health that are the focal concern.

Correlation and Regression Analysis
Practice Exercise 5: Regression analysis can help detect nontypical cases through the examination of: (Dependent variables / Residuals / Independent variables / Estimates)

3.1.2 Use of Computers

Simplicity of Computer Analysis

Nobody does statistical calculations by hand any more. Everybody uses computer programs. Let's look now at one example showing how easy, how dangerously easy, it is. The following shows what it might look like to run a regression analysis using Microsoft Excel.

Running A Regression Analysis by Computer

You can see above that you've got the two columns of data on the left side of the screen. The mother's weight is in the first column. The infant's weight is in the second. The last seven cases of data run off the bottom and so are not visible.

Any of a variety of other computer programs could be used. SPSS, MINITAB, and WinStat are some programs designed especially for high level statistical analyses. These programs allow you to compute a regression analysis in a few point and click steps.

When you choose the "Data Analysis" option from the "Tools" menu, as shown above, the following screen showing a Data Analysis dialogue window appears.

The Data Analysis Dialogue Window Appears

Computation of Regression Equation

At this point you'd choose "Regression" and press the "OK" button. The following screen then appears. You'd enter the cells containing the dependent variable in the "Input Y Range", the cells containing the predictor variable in the "Input X Range", and check the "Line Fit Plots" checkbox as shown below.

The Regression Dialogue Window Appears

At this point you only need to press the "OK" button. The analysis would run to completion and a screen like the following would then appear.

Output of Computer Regression Analysis

So, with the click of a button you'd get all these numbers. You can't see it yet, but there's a scatterplot showing the regression line. The "a" and "b" values for the equation have been computed, and you can see they are approximately equal to the ones you've already seen in Table 13.3. So, in the y = ax + b equation you've got the slope of 30.79, which multiplies the mother's weight, and the intercept of approximately 1501.3. There are also some R values toward the upper left. They tell you how good your fit is. In particular, the important value is the R² value. The R² tells you what percent of the variability in the infants' weights is explained by the regression line. You can see from the R² above that about 27% of the variability in the infant's weight is predictable from the mother's weight, not great, but not too bad either.
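If you'd like to see where the R² comes from, the sketch below computes it from the Table 13.1 data as one minus the ratio of the variability left over after the fit to the total variability. This is the usual definition; I am assuming it is the one the spreadsheet output reports, and the variable names are my own.

```python
import numpy as np

# Table 13.1 data, in case-number order.
mother_kg = np.array([49.4, 63.5, 68.0, 52.2, 54.4, 70.3, 50.8, 73.9, 65.8,
                      54.4, 73.5, 59.0, 61.2, 52.2, 63.1, 65.8, 61.2, 55.8,
                      61.2, 56.7, 63.5, 59.0, 49.9, 65.8, 43.1])
infant_g = np.array([3515, 3742, 3629, 2680, 3006, 4068, 3373, 4124, 3572,
                     3359, 3230, 3572, 3062, 3374, 2722, 3345, 3714, 2991,
                     4026, 2920, 4152, 2977, 2764, 2920, 2693], dtype=float)

slope, intercept = np.polyfit(mother_kg, infant_g, deg=1)
predicted_g = slope * mother_kg + intercept

# Total variability of birthweights about their mean.
ss_total = ((infant_g - infant_g.mean()) ** 2).sum()
# Variability left over after the regression line is removed.
ss_residual = ((infant_g - predicted_g) ** 2).sum()

r_squared = 1.0 - ss_residual / ss_total
print(r_squared)
```

The value should come out close to the 27% quoted from the spreadsheet output.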

If you then scrolled over to the right you'd then see the scatterplot shown below.

Scatterplot showing Regression Line

Guidelines for Computer Use

So, you can see that you just need to type in the data, point and click a few buttons, and you get an answer. It's a little bit like giving a 2nd year medical student a flex sig, pointing to where they start and then leaving them to themselves. That is, you can do these things so easily that you may be tempted to do them without much understanding about what's going on.

The point is that it is very easy for statisticians to do these things, but it is the conceptual understanding they have that protects you from errors. If you have some conceptual understanding you can communicate with them appropriately to obtain the right kinds of tests and other statistics. You should always feel free to call upon the biostatistician for the actual implementation of these analyses and for guidance with the details.

Correlation and Regression Analysis
Practice Exercise 6: Computers make it so easy to do statistical analyses that they appropriately implement them without conceptual understanding on your part. (True / False)

3.1.3 Multiple Regression

Typical Multivariable Clinical Problems

Now, let's look at more than one predictor variable. First, think for a moment. Is there a situation you as physicians frequently encounter in a clinic, where you need to determine what one variable should be based on the values of two other variables? In other words, can you think of an example of something you or one of your assistants might do routinely in a clinic that requires two variable regression? A situation where you need to predict one thing based on two others. In the previous example you predicted the infant's weight from the mother's weight. Now, the question is, what question do you routinely face in your clinic where you use or control two variables to make a prediction?

Weight, Sex and Height

What about ideal weight? How do you determine a person's ideal weight? You use their height and their sex. You use or control two variables. You have, for example, a table with a column for males, a column for females, and you have height. From the two variables then you can look at the table and say, "OK, your ideal weight is so and so, and so you're 10% over, or you're 30% over it."

So you use multiple regression lots of times, but you don't think of it as that sort of thing. You use or control for multiple variables when you use a weight table, and the results of the regression analysis are included in the table. Occasionally you'll even see the curve itself. It's not quite a straight line.

Correlation and Regression Analysis
Practice Exercise 7: Multivariable situations are encountered frequently in the clinic. (True / False)

Cholesterol, Weight and Blood Pressure

Now, let's look at an example where you have another situation (see table below).

Measurements Taken on 11 apparently Normal Males Between the Ages of 14 and 24 Years
Serum Cholesterol (mg/100 cc) Y	Weight (kg) X₁	Systolic Blood Pressure (mm Hg) X₂
162.2	51.0	108
158.0	52.9	111
157.0	56.0	115
155.0	56.5	116
156.0	58.0	117
154.1	60.1	120
169.1	58.0	124
181.0	61.0	127
174.9	59.4	122
180.2	56.1	121
174.0	61.2	125
y_c = 18.25 - 4.06x_1j + 3.20x_2j

Let's say you want to find the males that have a serious problem with serum cholesterol. In this case you assume that their expected cholesterol is related to their weight and their blood pressure. You want to look at the serum cholesterol of these patients with weight and blood pressure taken into account. So, you would need to do a multiple linear regression like that done with height and sex as predictors of weight. Using the table, your dependent variable would be serum cholesterol and your independent variables would be weight in kilograms and systolic blood pressure in mm of mercury.

You can see the equation you get, just below the table. The x_1j is the j'th person's weight. The x_2j is the j'th person's systolic blood pressure. So, what if you wanted to compute the expected or predicted serum cholesterol of the person with the systolic blood pressure of 120, whose weight is 60.1 kg? You find 4.06 times 60.1 and subtract that from 18.25. Then you add 3.20 times 120. That gives you the predicted value: 18.25 - 244.01 + 384.00 = 158.24. Then you compare the predicted value with the actual value of 154.1. You compute the difference (called the residual) between the two, here 154.1 - 158.24 = -4.14. You can then see whether, based on these two predictors, the serum cholesterol is unusually high or low. You could do the same sort of thing with every patient.

Of course, you don't need to know how the equation was derived to use it. You simply need to know whether it is valid. All you need to do is plug in the values of the independent variables. You'll get predictions for the dependent variable.
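The plug-in step can be sketched in a few lines. The function name predicted_cholesterol is my own; the coefficients are the ones printed below the table.

```python
def predicted_cholesterol(weight_kg, systolic_mm_hg):
    """Predicted serum cholesterol (mg/100 cc) from the equation below the table."""
    return 18.25 - 4.06 * weight_kg + 3.20 * systolic_mm_hg

# The patient with weight 60.1 kg and systolic blood pressure 120.
predicted = predicted_cholesterol(60.1, 120)

# Residual = actual minus predicted; this patient's actual value is 154.1.
residual = 154.1 - predicted
print(round(predicted, 2), round(residual, 2))
```

A negative residual means this patient's cholesterol is somewhat lower than expected for his weight and blood pressure. Repeating the calculation for every row gives a residual for each patient.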

Correlation and Regression Analysis
Practice Exercise 8: Multiple regression situations occur when you have 1 dependent variable and the number of predictor variables is: (1 / 2 / 3 / Many / 2 or more)

Log Linear Regression

This is an inoculation, or prevention, lesson. You don't want to have statistician shock. The statistician comes up to you and says, "OK, this is the equation I've come up with to work with your program, doctor." (See example below).

An Example Log Linear Regression Equation
LOG[ p / (1 - p) ] = b₀ + b₁(AGE) + b₂(IC) + b₃(SC) + b₄(BCP) + b₅(IUD)

You've got the log of the odds on the left, where p represents the probability of giving birth to a child and p / (1 - p) is the odds of that happening. An equation of this form is more commonly called a logistic regression. On the right, b₁ is the coefficient of the mother's age. The equation, in this case, doesn't really represent results of real research, but it does represent what might be computed from some real data. In the equation, frequency of intercourse is IC. The SC is the father's sperm count. The use of birth control pills is BCP, and the use of an IUD is represented in the last term of the equation. Now, you might say that's a complicated looking equation. But, don't worry about it! The statisticians handle the computations.

If you take the probability of the event happening (of a birth) divided by 1 minus that probability, you're just looking at the odds, the ratio of it happening to not happening. You then take the log of it. On the right side of the equation, the b's turn out to be logs of odds ratios, so taking the antilog of a b gives you an odds ratio. Odds ratios are real nice to work with. For example, if the antilog of one of these b values was 3, each one-unit increase in that variable would multiply the odds of a birth by 3, with the other variables held constant.
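To see how a b value behaves, here is a minimal sketch with a single predictor. The coefficient values are made up purely for illustration, and I am assuming the log in the equation is the natural log. What the sketch shows is that the antilog of a coefficient is an odds ratio: it multiplies the odds, not the probability.

```python
import math

def probability_from_log_odds(log_odds):
    """Invert LOG[p / (1 - p)] to recover the probability p."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients (not from any real study): intercept and age.
b0, b_age = -1.0, -0.05

def odds_at_age(age):
    p = probability_from_log_odds(b0 + b_age * age)
    return p / (1.0 - p)

# The odds ratio for one extra year of age equals the antilog of b_age.
odds_ratio_per_year = odds_at_age(31) / odds_at_age(30)
print(odds_ratio_per_year, math.exp(b_age))
```

The two printed numbers agree, which is why statisticians report the antilogs of the b's as odds ratios.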

The point is that you should avoid getting bogged down thinking about the equation. Don't worry about the notation. Let the statisticians handle that. You take the results which have been validated and use them in your practice and you work together with a biostatistician in applying them in your research.

Correlation and Regression Analysis
Practice Exercise 9: The clinician uses biostatistics appropriately by having a conceptual grasp of the big picture while relying on the biostatistician for the details. (True / False)

3.1.4 C.R.A.P. Detectors

Have At Least 10 Subjects Per Variable

Now here are some cautions you should remember about multiple regression studies. First, make sure you've got plenty of subjects (see C.R.A.P. Detector #13.3 below).

C.R.A.P DETECTOR #13.3
DON'T REGARD TOO SERIOUSLY ANY STUDY THAT USED FEWER THAN 10 SUBJECTS FOR EACH VARIABLE.

When you have many independent variables that you're taking into account, you need to have many subjects. The recommended number is at least 10 subjects per independent variable. So, if you have five variables that you're using as predictors, make sure you have at least 50 subjects. Be wary of any study that uses multiple independent variables with a small number of people. The sampling error will be too large. You can't expect data from such small groups to be representative. This is a rule to remember. Use it to evaluate published medical research studies. Apply it when you're doing studies of your own that may have multiple variables.
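The rule of thumb is simple enough to write down as a one-line check. The function name and its per_variable default are my own; the threshold of 10 is the one given in the detector above.

```python
def meets_ten_per_variable(n_subjects, n_predictors, per_variable=10):
    """True if the study has at least per_variable subjects for each predictor."""
    return n_subjects >= per_variable * n_predictors

five_predictor_study = meets_ten_per_variable(50, 5)  # 50 subjects, 5 predictors
cholesterol_example = meets_ten_per_variable(11, 2)   # 11 subjects, 2 predictors
print(five_predictor_study, cholesterol_example)
```

Notice that the cholesterol illustration earlier in this lesson, with 11 subjects and two predictors, would itself fall short of the rule. It serves to show the arithmetic, not as a study to rely on.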

Full Credibility Requires Systematic Replication

Did the authors replicate their results? You need to have replications for really credible research (see C.R.A.P. Detector #16.5 below). Multivariable studies can have variations in the sample that really throw things off. So, in addition to having fairly large numbers of people, such studies should be replicated, preferably by different authors.

C.R.A.P DETECTOR #16.5
DID THE AUTHORS REPLICATE THEIR RESULTS? SINCE SO MANY RESULTS WITH MULTIVARIATE PROCEDURES DON'T HOLD UP ON REPLICATION, DON'T BELIEVE THE FINDINGS UNTIL THEY HAVE BEEN REPLICATED ON A DIFFERENT SAMPLE, AND PREFERABLY BY DIFFERENT AUTHORS.

Cross-Validate with Independent Samples

Cross-validation is a very important issue when doing regression studies. Given a result from a particular sample you've drawn, generally you should then find a whole new independent sample and apply your predictors to that group. This is called cross-validation.

C.R.A.P. DETECTOR #14.4
THE GOODNESS OF AN EQUATION IN CLASSIFYING SUBJECTS SHOULD ALWAYS BE TESTED ON A "CROSS-VALIDATION" SAMPLE, I.E., A GROUP OF NEW PEOPLE WHO WERE NOT USED IN DERIVING THE EQUATION.

It's a very good validation technique. You don't just use the predictors within the original group, but you go get a different group and see how accurate the predictions are with the new data. If the predictions hold up well, you've demonstrated how stable or reliable they are when you go from group to group. These are the kinds of cautions you need to employ when you have multiple variables as predictors in regression situations.
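Here is a minimal sketch of the cross-validation idea, using simulated data rather than real patients. The simulate function, its coefficients, and the noise level are all assumptions chosen only to resemble the birthweight example. The equation is derived from one sample and then scored on a second, independent sample.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def simulate(n):
    """Hypothetical sample: X in a plausible range, Y linear in X plus noise."""
    x = rng.uniform(45.0, 75.0, size=n)
    y = 1500.0 + 30.0 * x + rng.normal(0.0, 400.0, size=n)
    return x, y

x_fit, y_fit = simulate(25)  # sample used to derive the equation
x_new, y_new = simulate(25)  # independent cross-validation sample

slope, intercept = np.polyfit(x_fit, y_fit, deg=1)

def r_squared(x, y):
    """Share of variability in y explained by the equation derived above."""
    residuals = y - (slope * x + intercept)
    return 1.0 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()

r2_fit = r_squared(x_fit, y_fit)  # fit on the deriving sample
r2_new = r_squared(x_new, y_new)  # fit on people not used in deriving it
print(r2_fit, r2_new)
```

Comparing the two values shows how well the equation holds up on new people. The second value is the honest estimate, and it can even be negative when an equation transfers badly.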

Correlation and Regression Analysis
Practice Exercise 10: Which of the following is least important when conducting or critically evaluating multivariable medical research studies? (Cross-validation / Replication / Extension / Sample size vs. number of variables)

