3. Correlation versus Causation

Just because two variables are correlated does not mean that one causes the other to change. For example, there is a strong correlation between shoe sizes and vocabulary sizes for grade school children. Clearly larger shoe sizes do not cause larger vocabularies, and larger vocabularies do not cause children’s feet to grow. It’s actually the age of a child that relates to both of the other variables. The variable age is an example of the previously mentioned lurking variable. Often lurking variables result in confounding, which is the belief that two variables have a cause-and-effect relationship, when actually other variables are “in charge.”

Linear Regression

Once you have a scatterplot of data, it’s tempting to act like a child and play connect-the-dots. Many times the dots form a nice pattern. The pattern we are most interested in for this course is a linear pattern. In higher-level statistics courses, you get to study other, more interesting patterns. Linear regression is a technique that summarizes and, more importantly, quantifies the linear relationship between two variables.

Now that we have data from two variables, say X and Y, rather than just quantifying the linear correlation between them, we would often like to model the relationship as an actual line.

Basically, we want to draw a line through the scatter diagram. But again, left up to an individual, we each would draw slightly different lines through the data. What we really need is the line that “best” describes (or fits) the data. This line is called the least-squares regression line. Once we have such a line, we can then predict the average response for all individuals with a given value of the explanatory variable.

Determining the Least-Squares Regression Line

Recall that a linear equation has the mathematical (algebraic) form: y = mx + b, where m is the slope (rise over run) and b is the y-intercept (the location along the y-axis where the line crosses). Try not to get “trapped” by the letters used in the formula, as each textbook or online resource will likely use different letters to represent the slope and the y-intercept.

In the textbook recommended for this course, the linear form is written as: y = a + bx,
where b represents the slope, a represents the y-intercept, and the value y is the predicted value for the response variable Y. This is the variable convention that will be followed in the rest of these notes.
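
Since the notes rely on software for the actual computation, here is a minimal sketch in Python of how the least-squares values of a and b come out of the standard formulas (slope = sum of the cross-deviations of x and y, divided by the sum of the squared deviations of x; intercept chosen so the line passes through the point of means). The data values are made up purely for illustration.

```python
# Least-squares slope and intercept computed by hand.
# The data below are hypothetical, for illustration only.
x = [0, 1, 2, 3, 4, 5]        # explanatory variable (e.g., absences)
y = [95, 90, 88, 80, 76, 70]  # response variable (e.g., final grades)

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# slope: sum of cross-deviations divided by sum of squared x-deviations
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
    / sum((xi - xbar) ** 2 for xi in x)
# intercept: forces the line through the point of means (xbar, ybar)
a = ybar - b * xbar

print(f"slope b = {b:.3f}, intercept a = {a:.3f}")  # b = -5.000, a = 95.667
```

Excel and the TI-83/84 perform this same computation behind the scenes in the examples that follow.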

Example (Continued)

We again return to studying the relationship between absences and grades. Using Excel to determine the linear regression line, we see:

[Figure: Excel scatterplot of absences versus final course grade, with the least-squares regression line and its equation displayed]

Excel not only draws the line of best fit through our data; it also gives you the option of displaying the linear equation on the scatterplot. Notice that the line does not go through every point. (It couldn’t, because the points do not fall along an exact line.) However, the line is created so that it minimizes the total squared vertical deviation (the sum of squared errors) across all data points.
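
To see what “minimizes the squared errors” means concretely, the short sketch below (same made-up data as before) compares the sum of squared errors (SSE) of the least-squares line with that of two slightly perturbed lines; the fitted coefficients always win.

```python
# The least-squares line has the smallest possible sum of squared
# vertical deviations (SSE); perturbing a or b can only increase it.
# Hypothetical data, for illustration only.
x = [0, 1, 2, 3, 4, 5]
y = [95, 90, 88, 80, 76, 70]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
    / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

def sse(intercept, slope):
    # total squared vertical deviation of the points from y = intercept + slope*x
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

print(sse(a, b))        # the smallest achievable SSE
print(sse(a, b + 0.5))  # a slightly steeper line fits worse
print(sse(a + 1, b))    # a slightly shifted line fits worse too
```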

Interpreting the Slope and y-Intercept Values from Linear Regression

In any linear regression, the slope value b is the most important value. Remember that the slope is defined as:

slope = rise / run = (change in y) / (change in x)

Basically, the slope describes how values of the response variable Y will change when values of the explanatory variable X change. For example:

  • If b = 4 (thus, we have a positive linear relationship), then:
      • If x increases by 1, then y will increase by 4
      • If x decreases by 1, then y will decrease by 4
  • If b = -7 (thus, we have a negative linear relationship), then:
      • If x increases by 1, then y will decrease by 7
      • If x decreases by 1, then y will increase by 7

Usually the value of the y-intercept, a, has no real meaning. It is useful only if 0 is a reasonable value for x. In that case, a can be interpreted as the value of y when x = 0. If 0 is not a reasonable value for x, then a does not have an interpretation.
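
These interpretations can be checked mechanically. The sketch below uses a hypothetical line with slope b = 4 (the first bullet case above) and intercept a = 10: raising x by 1 raises the prediction by exactly b, and the prediction at x = 0 is exactly a.

```python
# Slope and intercept interpretation for a line y = a + b*x.
a, b = 10.0, 4.0  # hypothetical values; b = 4 matches the bullets above

def predict(x):
    return a + b * x

print(predict(3) - predict(2))  # 4.0: y rises by b when x rises by 1
print(predict(2) - predict(3))  # -4.0: y falls by b when x falls by 1
print(predict(0))               # 10.0: the intercept is the prediction at x = 0
```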

Example (continued)

From our linear regression, we found that the equation

y = 88.733 – 2.8273x

models the relationship between the number of absences (x) and final course grade (y). In this case, the slope b means that for each additional absence a student has in this particular course, he or she can expect the final course grade to drop, on average, by roughly 2.8 points. The y-intercept a does have a reasonable interpretation in this case: if a student never misses a class (x = 0), he or she can expect to receive (on average) a final grade of 88.733 for the class. This example provides strong evidence of the damage that missing classes can do to your grade in a course!

The real use of a linear regression line is for predictions. Let’s assume a student misses 4 classes throughout the semester. Both the student and the professor have a good idea of what that student’s final grade will be:

y = 88.733 – 2.8273(4) = 77.4

This student is expected to receive a final course grade of 77.4%. In general, the student’s specific final grade will be different, but on average (if we took all students who missed exactly 4 classes and averaged their final course grades), we expect to see a final grade of 77.4%.
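
In code, this prediction is just a one-line function built from the fitted coefficients:

```python
# Prediction from the fitted regression line for the absences example.
def predicted_grade(absences):
    return 88.733 - 2.8273 * absences

print(predicted_grade(4))  # 77.4238, i.e., a final grade of about 77.4%
```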

Extrapolation

In general, we should not use a linear regression model for values of x that are much larger or much smaller than the observed values, i.e., outside the scope of the model. Without collecting more data, we have no idea what happens to the relationship between X and Y outside of our data values. Does the linear relationship continue? Who knows? We would need to collect more data to find out.
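
One way to see the danger is to push the absences model far outside its data. The value x = 35 below is an arbitrary illustrative choice, far beyond any number of absences plausibly in the original data:

```python
# Extrapolating far outside the observed x values can produce nonsense.
def predicted_grade(absences):
    return 88.733 - 2.8273 * absences

print(predicted_grade(35))  # about -10.2, an impossible course grade
```

The line happily returns a negative grade because nothing in the fitted equation knows that grades must fall between 0 and 100.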

Example: Calculating the Correlation Coefficient Using a TI-83/84 Plus Calculator
Calculate the correlation coefficient, r, for the data from the table relating touchdowns thrown and base salaries.

[Table: numbers of touchdowns thrown and corresponding base salaries]

[Figure: scatterplot of touchdowns thrown versus base salary]

Solution

Let’s enter these data into our calculator.

  • Press [STAT].
  • Select option 1:Edit.
  • Enter the values for number of touchdowns (x) in L1 and the values for base salary (y) in L2.
  • Press [STAT].
  • Choose CALC.
  • Choose option 4:LinReg(ax+b).
  • Press [ENTER] twice.

From the scatterplot, we would expect r to be close to 0. The calculator confirms that the correlation coefficient for these two variables is r ≈ -0.251, indicating a weak negative relationship, if any relationship exists at all.

[Figure: TI-83/84 LinReg(ax+b) output screen]
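
If you prefer software to a calculator, r can also be computed directly from its defining formula. The data lists below are placeholders only; substitute the touchdown and base-salary values from the table above.

```python
# Linear correlation coefficient from its defining formula.
def correlation(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

touchdowns = [28, 22, 30, 18, 25]     # placeholder values, NOT the table's data
salaries = [2.5, 3.1, 1.9, 2.8, 2.2]  # placeholder values, NOT the table's data
print(correlation(touchdowns, salaries))
```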

Coefficient of Determination

The coefficient of determination is a measure of how well the least-squares regression line explains the variation in the response variable. Although easy to calculate, it carries a very powerful meaning for a set of data that is linearly related.

Now that we have discussed the scatterplot, linear correlation (r), and least-squares linear regression, we might like to measure how well our model explains the linear relationship. In other words, how much does our model explain? How much does our model improve our prediction? So if we have a value of x, and we use our linear model y = a + bx to predict the value of y, how accurate are we?

To answer these questions, we look to the percentage of variation in the response variable Y that is explained by the least-squares regression line. This percentage of variation is called the coefficient of determination, and is denoted symbolically by R². In other words, R² measures the fraction of the variation in the values of the response variable Y that is explained by the least-squares regression line. The value of R² varies between 0 and 1. A value of R² close to 0 (i.e., 0% explanation) indicates a model with very little explanatory power. A value of R² close to 1 (i.e., 100% explanation) indicates a model with lots of explanatory power. Again, the value of R² is an overall measure of the usefulness of a linear regression prediction.

Luckily, to compute R² we simply square the value of the linear correlation coefficient, r. In symbols:

R² = r²
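
This identity can be checked numerically for simple linear regression: the “fraction of variation explained,” 1 - SSE/SST, comes out equal to the square of r. A sketch with made-up data:

```python
# For simple linear regression, 1 - SSE/SST equals r squared.
# Hypothetical data, for illustration only.
x = [0, 1, 2, 3, 4, 5]
y = [95, 90, 88, 80, 76, 70]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

r = sxy / (sxx * syy) ** 0.5  # linear correlation coefficient
b = sxy / sxx                 # least-squares slope
a = ybar - b * xbar           # least-squares intercept

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
sst = syy                                                    # total variation in y
print(1 - sse / sst)  # coefficient of determination
print(r ** 2)         # the same value, up to floating-point rounding
```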

If you recall, the linear correlation coefficient for the data set of absences versus final course grade was r = -0.947. Thus, the value of R² is:

R² = (-0.947)² = 0.8968

Thus, 89.68% of the variation in final course grades can be explained by the least-squares linear regression y = 88.733 – 2.8273x.

The value for R² also comes directly from the Excel output of a linear regression. For the absences versus final course grade problem, the regression output is shown below. Notice the slight difference between our calculated R² value (0.8968) and the R² value from Excel (0.89755). This is due solely to rounding error. (The value we used for r was rounded to three decimal places.)

[Figure: Excel regression output for absences versus final course grade, showing R² = 0.89755]
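
The rounding note can be verified directly: Excel’s R² of 0.89755 implies r = -√0.89755 ≈ -0.94739, and rounding r to three decimal places before squaring reproduces the hand-calculated 0.8968.

```python
# Reproducing the rounding discrepancy between 0.89755 and 0.8968.
r_exact = -(0.89755 ** 0.5)      # about -0.94739
r_rounded = round(r_exact, 3)    # -0.947, the value used above
print(round(r_rounded ** 2, 4))  # 0.8968, matching the hand calculation
```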

Example: Calculating and Interpreting the Coefficient of Determination

If the correlation coefficient for the relationship between the numbers of rooms in houses and their prices is r = 0.65, how much of the variation in house prices can be associated with the variation in the numbers of rooms in the houses?

Solution

Recall that the coefficient of determination tells us the amount of variation in the response variable (house price) that is associated with the variation in the explanatory variable (number of rooms).

Thus, the coefficient of determination for the relationship between the numbers of rooms in houses and their prices will tell us the proportion or percentage of the variation in house prices that can be associated with the variation in the numbers of rooms in the houses. Also, recall that the coefficient of determination is equal to the square of the correlation coefficient.

Since we know that the correlation coefficient for these data is r = 0.65, we can calculate the coefficient of determination as R² = r² = (0.65)² = 0.4225. Thus, approximately 42.25% of the variation in house prices can be associated with the variation in the numbers of rooms in the houses.
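
The same arithmetic, for completeness, in a couple of lines of Python:

```python
# Coefficient of determination from the correlation coefficient.
r = 0.65
R2 = r ** 2
print(f"{R2:.2%} of the variation in house prices is explained")  # 42.25%
```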