1. Intro

In this sub-competency, you will study the relationship between two variables measured from an individual. In many studies we measure more than one variable for each individual. Some examples are:

  • The weight of a car and its gas mileage (in miles per gallon)
  • Exercise and cholesterol levels for a group of people
  • Height and weight for a group of people

In cases where multiple variables are measured from individuals, we are interested in whether the variables have some kind of a relationship. We’d like to know whether changes in one variable lead to specific (and thus predictable) changes in another variable.

When we have two variables, they could be “connected” in one of several different ways:

  • They could be completely unrelated.
  • One variable (the explanatory or predictor variable) could be used to explain the other (the response or dependent variable).
  • One variable could be thought of as causing the other variable to change.

A response variable measures an outcome of a study (think y-value or dependent variable) while an explanatory variable explains or influences changes in a response variable (think x-value or independent variable). Sometimes it is not clear which variable is the explanatory variable and which is the response variable. Sometimes the two variables are related without either being explanatory or response variables. And sometimes the two variables are both affected by a different variable, called a lurking variable, which was not collected or included in the study. Studies with lurking variables can cause a lot of trouble for people trying to prove a point. An excellent example of a lurking variable is a study that shows the number of television sets in your home can be used to predict your life expectancy! Think about some possible lurking variables in this study.

2. A Scatterplot

The most useful graph to show the relationship between two quantitative variables is the scatter diagram, also called a scatterplot. If a distinction exists between the two variables being studied, plot the explanatory variable (X) on the horizontal scale, and plot the response variable (Y) on the vertical scale. With a scatterplot, each individual in the data set is represented by a single point (x, y) in the xy-plane.

Example (taken from Fundamentals of Statistics, by Sullivan):
[Table: number of absences and final course grade for 10 students]
A professor at a large midwestern university wanted to study the relationship between the number of class absences a student has in a given semester and that student’s final course grade. The data shown were collected from a sample of students in a general education course.

Identifying the relationship between the two data values from a table is difficult, so we create a scatterplot. In this case, the professor hopes that the number of a student’s absences will offer some explanation of his or her final course grade.

Plot the 10 points on the xy-axes, using the points (0, 89.2), (1, 86.4), and so on. Typically we rely on technology to create the scatterplot for us. A scatterplot created in Excel looks like:

[Figure: Excel scatterplot of number of absences versus final course grade]

It’s now quite clear that as the number of absences increases, the final course grade decreases.

Types of Relationships

Once you have a scatterplot, it can be used to identify an overall pattern and deviations from this pattern. You can describe the pattern by form, direction, and strength of the relationship, and you can identify points that do not follow the overall pattern (outliers). This is a process very similar to describing distributions!

Some relationships are such that the points of a scatterplot tend to fall along a more-or-less straight line. Two variables have a linear relationship in a scatter plot when the two variables roughly follow a straight-line pattern. We say two variables have a positive association if above-average values of one variable tend to accompany above-average values of the other variable, and below-average values tend to occur together. Likewise, two variables have a negative association if above-average values of one variable tend to accompany below-average values of the other variable, and vice versa. When the points in a scatter plot do roughly follow a straight line, the direction of the pattern tells how the variables respond to each other. A positive slope indicates that as the values of one variable increase, so do the values of the other variable. This type of relationship between two variables is called a positive linear relationship. A negative slope indicates that as the values of one variable increase, the values of the other variable decrease. This type of relationship between two variables is called a negative linear relationship.

See the provided figures. Some examples of data with a linear relationship are:

  • From a scatterplot of college students, there is a positive association between verbal SAT score and GPA.
  • For used cars, there is a negative association between the age of the car and the selling price.

[Figure: scatterplots showing positive and negative linear relationships]

Some data exhibit a nonlinear (or curved) relationship. An excellent example of a nonlinear data set is the relationship between the speed you drive your car and the corresponding gas mileage. This relationship is more quadratic in nature, with an example shown in the left image.

[Figure: scatterplots showing nonlinear relationships]

Example: Determining Whether a Scatter Plot Would Follow a Straight-Line Pattern

Determine whether the points in a scatter plot for the two variables are likely to have a positive slope, negative slope, or not follow a straight-line pattern.

a.     The number of hours you study for an exam and the score you make on that exam
b.     The price of a used car and the number of miles on the odometer
c.     The pressure on a gas pedal and the speed of the car
d.     Shoe size and IQ for adults

Solution

a.    As the number of hours you study for an exam increases, the score you receive on that exam is usually higher. Thus, the scatter plot would have a positive slope.
b.    As the number of miles on the odometer of a used car increases, the price usually decreases. Thus, the scatter plot would have a negative slope.
c.    The more you push on the gas pedal, the faster the car will go. Thus, the scatter plot would have a positive slope.
d.    Common sense suggests that there is not a relationship, linear or otherwise, between a person’s IQ and his or her shoe size.

Measuring the Strength of a Linear Relationship

There still remains some subjectivity when describing the relationship between two variables from a scatterplot. What you may see in a pattern of dots I may interpret differently (it’s like looking at cloud patterns in the sky). To eliminate this “bias,” we can directly measure the strength of a linear relationship using the correlation coefficient, r. There is an intimidating formula for computing the value of r by hand, so you should always rely on technology for this! If you’re interested, here is the formula:

[Figure: formula for the correlation coefficient r]

Even though the formula is complex, you’ll notice quite a few symbols familiar from before: x̄ and ȳ represent the mean values of the two variables being studied, and s_x and s_y represent their standard deviations. Finally, the value n – 1 is the degrees of freedom, where n is the number of individuals (data pairs) in the sample.
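For reference, one standard way to write the sample correlation coefficient that is consistent with the symbols just described is sketched below. This is a reconstruction from those symbols rather than a copy of the original figure, and the subscript i, which indexes the n data pairs, is an assumed notation.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Each factor in parentheses is a z-score, so r is essentially an average of the products of the paired z-scores.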

The correlation coefficient r, which always lies between -1 and 1, provides a measure of the strength and direction of the linear relationship:

  • the stronger the relationship, the larger the magnitude of r, and
  • a positive r indicates a positive relationship, a negative r indicates a negative relationship.
  • If r = 0, then no linear relationship exists between the two variables.

Scatter plots can reveal different correlations

[Figure: scatter plots illustrating different values of the correlation coefficient]

Example (continued)

Continuing our example of relating class absences to final course grade, we already know the correlation coefficient r will be negative (since the final course grade decreases as the number of absences increases), and as the points lie roughly on a line, the value of r should be close to -1. Using Excel we get the following output:

[Figure: Excel output for the correlation coefficient]

Therefore, r ≈ -0.947 (rounded to 3 decimal places), which is about what we expected!
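If you would like to check this value without Excel, the following is a minimal Python sketch of the by-hand calculation. It assumes the ten (absences, grade) pairs listed in the code are the ones in the data table (only the first two appear in the text), and the function name `correlation` is just an illustrative choice.

```python
from statistics import mean, stdev

# Assumed data: (absences, final grade) for the 10 students in the example.
absences = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
grades = [89.2, 86.4, 83.5, 81.1, 78.2, 73.9, 64.3, 71.8, 65.5, 66.2]

def correlation(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

r = correlation(absences, grades)
print(round(r, 3))  # -0.947
```

The sign is negative and the magnitude is close to 1, matching what the scatterplot suggested.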

3. Correlation versus Causation

Just because two variables are correlated does not mean that one causes the other to change. For example, there is a strong correlation between shoe sizes and vocabulary sizes for grade school children. Clearly larger shoe sizes do not cause larger vocabularies, and larger vocabularies do not cause children’s feet to grow. It’s actually the age of a child that more obviously relates to both other variables. The variable age is an example of the previously mentioned lurking variable. Often lurking variables result in confounding, in which two variables appear to have a cause-and-effect relationship when actually other variables are “in charge.”
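The shoe size example can be illustrated with a small simulation. The sketch below uses entirely made-up numbers (the age range, the effect sizes, and the noise levels are all hypothetical) in which age drives both shoe size and vocabulary, while neither one affects the other.

```python
import random
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r computed from z-scores."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical children: age is the lurking variable behind both measurements.
ages = [random.uniform(5, 11) for _ in range(200)]
shoe_sizes = [age + random.gauss(0, 1) for age in ages]
vocab_sizes = [2000 * age + random.gauss(0, 2000) for age in ages]

# Shoe size and vocabulary are strongly correlated overall...
r_overall = correlation(shoe_sizes, vocab_sizes)

# ...but once the (known, simulated) effect of age is subtracted from each,
# what is left over is essentially unrelated.
shoe_leftover = [s - age for s, age in zip(shoe_sizes, ages)]
vocab_leftover = [v - 2000 * age for v, age in zip(vocab_sizes, ages)]
r_leftover = correlation(shoe_leftover, vocab_leftover)

print(f"overall r = {r_overall:.2f}, after removing age r = {r_leftover:.2f}")
```

In real studies the effect of a lurking variable is not known in advance, which is exactly why such variables cause trouble.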

Linear Regression

Once you have a scatterplot of data, it’s tempting to act like a child and play connect-the-dots. Many times the dots form a nice pattern. The pattern we are most interested in for this course is a linear pattern. In higher level statistics courses, you get to study other more interesting patterns. Linear regression is a technique that summarizes and, more importantly, quantifies the linear relationship between two variables.

Now that we have data from two variables, say X and Y, rather than just quantify the linear correlation between them, we often would like to model the relationship as an actual line.

Basically, we want to draw a line through the scatter diagram. But again, left up to an individual, we each would draw slightly different lines through the data. What we really need is the line that “best” describes (or fits) the data. This line is called the least-squares regression line. Once we have such a line, we can then predict the average response for all individuals with a given value of the explanatory variable.

Determining the Least-Squares Regression Line

Recall that a linear equation has the mathematical (algebraic) form: y = mx + b, where m is the slope (rise over run) and b is the y-intercept (the location along the y-axis where the line crosses). Try to not get “trapped” by the letters used in the formula, as each textbook or online resource will likely use different letters to represent slope and the y-intercept.

In the textbook recommended for this course, the linear form is written as: y = a + bx, where b represents the slope, a represents the y-intercept, and the value y is the predicted value for the response variable Y. This is the variable convention that will be followed in the rest of these notes.

Example (Continued)

We again return to studying the relationship between absences and grades. Using Excel to determine the linear regression line we see:

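[Figure: Excel scatterplot of absences versus final course grade with the least-squares regression line and its equation]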

Excel not only draws the line of best fit through our data, it gives you the option of displaying the linear equation on the scatterplot. Notice that the line does not go through every point. (It couldn’t because the points do not fall along an exact line.) However, the line is chosen so that it minimizes the sum of the squared vertical deviations (errors) between the line and the data points.
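To see where such a line comes from, here is a minimal Python sketch that computes the least-squares slope and intercept directly. It assumes the ten (absences, grade) pairs listed in the code are the ones in the data table, and the function names are illustrative only.

```python
from statistics import mean

# Assumed data: (absences, final grade) for the 10 students in the example.
absences = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
grades = [89.2, 86.4, 83.5, 81.1, 78.2, 73.9, 64.3, 71.8, 65.5, 66.2]

def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + bx with the smallest squared error."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept: the line passes through (x_bar, y_bar)
    return a, b

def sum_squared_errors(xs, ys, a, b):
    """Total squared vertical distance between the points and y = a + bx."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

a, b = least_squares_line(absences, grades)
print(f"y = {a:.3f} + ({b:.4f})x")  # y = 88.733 + (-2.8273)x

# A line with a slightly different slope has a larger total squared error.
best = sum_squared_errors(absences, grades, a, b)
other = sum_squared_errors(absences, grades, a, b + 0.5)
print(best < other)  # True
```

The comparison at the end is what "best" means here: no other choice of slope and intercept gives a smaller sum of squared errors.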

Interpreting the Slope and y-Intercept Values from Linear Regression

In any linear regression, the slope value b is the most important value. Remember that the slope is defined as:

b = rise / run = (change in y) / (change in x)

Basically, the slope describes how values of the response variable Y will change when values of the explanatory variable X change. For example:

  • If b = 4 (thus, we have a positive linear relationship), then:
      • If x increases by 1, then y will increase by 4
      • If x decreases by 1, then y will decrease by 4
  • If b = -7 (thus we have a negative linear relationship), then:
      • If x increases by 1, then y will decrease by 7
      • If x decreases by 1, then y will increase by 7

Usually the value of the y-intercept, a, has no real meaning. It is useful only if 0 is a reasonable value for x. In that case, a can be interpreted as the value of y when x = 0. If 0 is not a reasonable value for x, then a does not have an interpretation.

Example (continued)

From our linear regression, we found that the equation

y = 88.733 – 2.8273x

models the relationship between the number of absences (x) and the final course grade (y). In this case, the slope b means that for each additional absence a student has in this particular course, he or she can expect the final course grade to drop, on average, by roughly 2.8 points. The y-intercept a does have a meaningful interpretation in this case, because 0 is a reasonable value for x. If a student never misses a class (x = 0), he or she can expect to receive (on average) a final grade of 88.733 for the class. This example provides great evidence for the damage that missing classes can have on your grade in a course!

The real use of a linear regression line is for predictions. Let’s assume a student misses 4 classes throughout the semester. Both the student and the professor have a good idea of what that student’s final grade will be:

y = 88.733 – 2.8273(4) = 77.4

This student is expected to receive a final course grade of 77.4%. In general, the student’s specific final grade will be different, but on average (if we took all students who missed exactly 4 classes and averaged their final course grades), we expect to see a final grade of 77.4%.

Extrapolation

In general, we should not use a linear regression model for values of x that are much larger or much smaller than the observed values, i.e., outside of the scope of the model. This is called extrapolation. Without collecting more data, we have no idea what happens to the relationship of X and Y outside of our data values. Does the linear relationship continue? Who knows. We’d need to collect more data.
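As a rough illustration using the absences example, the sketch below wraps the regression equation y = 88.733 – 2.8273x in a function that refuses to predict outside the observed data. The observed range of 0 to 9 absences and both function names are assumptions made for this example.

```python
# Assumed range of absences observed in the sample (0 through 9).
OBSERVED_MIN, OBSERVED_MAX = 0, 9

def predict_grade(absences, a=88.733, b=-2.8273):
    """Predicted average final grade from the regression line y = a + bx."""
    return a + b * absences

def predict_grade_checked(absences):
    """Predict only inside the observed range; refuse to extrapolate."""
    if not OBSERVED_MIN <= absences <= OBSERVED_MAX:
        raise ValueError(
            f"{absences} absences is outside the observed range "
            f"{OBSERVED_MIN} to {OBSERVED_MAX}"
        )
    return predict_grade(absences)

print(round(predict_grade_checked(4), 1))  # 77.4
print(round(predict_grade(35), 1))         # -10.2, an impossible grade
```

The unchecked prediction for 35 absences is a negative grade, which shows how quickly the line stops making sense once we leave the range of the data.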

Example: Calculating the Correlation Coefficient Using a TI-83/84 Plus Calculator
Calculate the correlation coefficient, r, for the data from the table relating touchdowns thrown and base salaries.

[Table: touchdowns thrown and base salaries]

[Figure: scatter plot of touchdowns thrown versus base salary]

 

Solution

Let’s enter these data into our calculator.

  • Press  [STAT].
  • Select option 1:Edit.
  • Enter the values for number of touchdowns (x) in L1 and the values for base salary (y) in L2.
  • Press [STAT].
  • Choose CALC.
  • Choose option 4:LinReg(ax+b).
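  • Press [ENTER] twice.
  • If r does not appear in the output, turn on diagnostics first: press [2ND] [0] to open the CATALOG, scroll to DiagnosticOn, press [ENTER] twice, and then repeat the LinReg steps.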

From the scatter plot, we would expect r to be close to 0. The calculator confirms that the correlation coefficient for these two variables is r ≈ -0.251, indicating a weak negative relationship, if any relationship exists at all.

[Figure: calculator output showing the LinReg results]

Coefficient of Determination

The coefficient of determination is a measure of how well the least-squares regression line explains the variation in the response variable. Although easy to calculate, it carries a very powerful meaning for a set of data that is linearly related.

Now that we have discussed the scatterplot, linear correlation (r), and least-squares linear regression, we might like to measure how well our model explains the linear relationship. In other words, how much does our model explain? How much does our model improve our prediction? So if we have a value of x, and we use our linear model y = a + bx to predict the value of y, how accurate are we?

To answer these questions, we look to the percentage of variation in the response variable Y that is explained by the least-squares regression line. This percentage of variation is called the coefficient of determination, and is denoted symbolically by R². Repeating what was just written, R² measures the fraction of the variation in the values of the response variable Y that is explained by the least-squares regression line. The value of R² varies between 0 and 1. A value of R² close to 0 (i.e., 0% explanation) indicates a model with very little explanatory power. A value of R² close to 1 (i.e., 100% explanation) indicates a model with lots of explanatory power. Again, the value of R² is an overall measure of the usefulness of a linear regression prediction.

Luckily, to compute R² we simply square the value of the linear correlation coefficient, r. In symbols:

R² = r²

If you recall, the linear correlation coefficient for the data set of absences versus final course grade was r = -0.947. Thus, the value of R² is:

R² = (-0.947)² = 0.8968

Thus, 89.68% of the variation in final course grades can be explained by the least-squares linear regression y = 88.733 – 2.8273x.

The value for R² also comes directly from the Excel output of a linear regression. For the absences versus final course grade problem, the regression output is shown. Notice the slight difference in our calculated R² value (0.8968) and the R² value from Excel (0.89755). This is due solely to rounding error. (The value we used for r was rounded to three decimal places.)
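The "variation explained" interpretation can be checked directly. The following Python sketch computes the coefficient of determination as one minus the ratio of unexplained variation to total variation, and compares it with the squared rounded correlation. It assumes the ten (absences, grade) pairs listed in the code are the ones in the data table, and the variable names are illustrative.

```python
from statistics import mean

# Assumed data: (absences, final grade) for the 10 students in the example.
absences = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
grades = [89.2, 86.4, 83.5, 81.1, 78.2, 73.9, 64.3, 71.8, 65.5, 66.2]

x_bar, y_bar = mean(absences), mean(grades)

# Least-squares slope and intercept for y = a + bx.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(absences, grades))
     / sum((x - x_bar) ** 2 for x in absences))
a = y_bar - b * x_bar

predicted = [a + b * x for x in absences]

# Variation left unexplained by the line, and total variation in the grades.
unexplained = sum((y - y_hat) ** 2 for y, y_hat in zip(grades, predicted))
total = sum((y - y_bar) ** 2 for y in grades)

r_squared = 1 - unexplained / total
print(round(r_squared, 4))        # 0.8976
print(round((-0.947) ** 2, 4))    # 0.8968, from the rounded value of r
```

Working from the unrounded data gives a value that agrees with Excel to four decimal places, which supports the explanation that the small gap comes from rounding r.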

[Figure: Excel regression output for absences versus final course grade]

Example 12.9: Calculating and Interpreting the Coefficient of Determination

If the correlation coefficient for the relationship between the numbers of rooms in houses and their prices is r = 0.65, how much of the variation in house prices can be associated with the variation in the numbers of rooms in the houses?

Solution

Recall that the coefficient of determination tells us the proportion of variation in the response variable (house price) that is associated with the variation in the explanatory variable (number of rooms).

Thus, the coefficient of determination for the relationship between the numbers of rooms in houses and their prices will tell us the proportion or percentage of the variation in house prices that can be associated with the variation in the numbers of rooms in the houses. Also, recall that the coefficient of determination is equal to the square of the correlation coefficient.

Since we know that the correlation coefficient for these data is r = 0.65, we can calculate the coefficient of determination as R² = r² = (0.65)² = 0.4225.
Thus, approximately 42.25% of the variation in house prices can be associated with the variation in the numbers of rooms in the houses.