2. A Scatterplot

The most useful graph to show the relationship between two quantitative variables is the scatter diagram. If a distinction exists in the two variables being studied, plot the explanatory variable (X) on the horizontal scale, and plot the response variable (Y) on the vertical scale. With a scatterplot, each individual in the data set is represented by a single point (x, y) in the xy-plane.

Example (taken from Fundamentals of Statistics, by Sullivan):
[Table: number of absences and final course grade for each student in the sample]
A professor at a large midwestern university wanted to study the relationship between the number of class absences a student has in a given semester and that student’s final course grade. The data shown were collected from a sample of students in a general education course.

Identifying the relationship between the two variables from a table is difficult, so we create a scatterplot. In this case, the professor hopes that the number of a student’s absences will offer some explanation of his or her final course grade, so the number of absences is the explanatory variable (X) and the final course grade is the response variable (Y).

Plot the 10 points on the xy-axes, using the points (0, 89.2), (1, 86.4), and so on. Typically we rely on technology to create the scatterplot for us. A scatterplot created in Excel looks like:

[Figure: Excel scatterplot of final course grade versus number of absences]

It’s now quite clear that as the number of absences increases, the final course grade decreases.
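If a spreadsheet isn't handy, the same idea can be sketched in a few lines of plain Python. This is only a rough text-based plot, not the Excel chart above. Only the first two data points come from the example; the remaining eight values, and the helper name `text_scatter`, are made up here for illustration.

```python
# Hypothetical data: only (0, 89.2) and (1, 86.4) appear in the text;
# the other eight grades are invented for illustration.
absences = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
grades = [89.2, 86.4, 84.0, 81.5, 78.0, 74.5, 70.0, 69.0, 66.5, 65.0]


def text_scatter(xs, ys, width=40, height=12):
    """Return a list of strings drawing the points (x, y) on a character grid.

    Assumes xs and ys each contain at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale x to a column index and y to a row index (row 0 is the top).
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y_max - y) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    # Left border for the vertical axis, bottom border for the horizontal axis.
    return ["|" + "".join(r) for r in grid] + ["+" + "-" * width]


plot_lines = text_scatter(absences, grades)
print("\n".join(plot_lines))
```

The explanatory variable (absences) runs left to right and the response variable (grade) runs bottom to top, so the stars drift from the upper left to the lower right.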

Types of Relationships

Once you have a scatterplot, it can be used to identify an overall pattern and deviations from this pattern. You can describe the pattern by form, direction, and strength of the relationship, and you can identify points that do not follow the overall pattern (outliers). This is a process very similar to describing distributions!

Some relationships are such that the points of a scatterplot tend to fall along a more-or-less straight line. Two variables have a linear relationship in a scatter plot when the two variables roughly follow a straight-line pattern.

We say two variables have a positive association if above-average values of one variable tend to accompany above-average values of the other variable, and below-average values tend to occur together. Likewise, two variables have a negative association if above-average values of one variable tend to accompany below-average values of the other variable, and vice versa.

When the points in a scatter plot do roughly follow a straight line, the direction of the pattern tells how the variables respond to each other. A positive slope indicates that as the values of one variable increase, so do the values of the other variable. This type of relationship between two variables is called a positive linear relationship. A negative slope indicates that as the values of one variable increase, the values of the other variable decrease. This type of relationship between two variables is called a negative linear relationship.
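The "above-average with above-average" idea can be checked directly by comparing each point with the two averages. The sketch below is one simple way to do that. The function name `association_direction` and the used-car numbers are hypothetical, and counting points is only a rough check, not a formal measure.

```python
from statistics import mean


def association_direction(xs, ys):
    """Classify the association by comparing each point with the two means."""
    x_bar, y_bar = mean(xs), mean(ys)
    # A product > 0 means x and y are on the same side of their averages;
    # a product < 0 means they are on opposite sides.
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    same_side = sum(1 for p in products if p > 0)
    opposite_side = sum(1 for p in products if p < 0)
    if same_side > opposite_side:
        return "positive"
    if opposite_side > same_side:
        return "negative"
    return "no clear direction"


# Hypothetical used-car data: age in years and selling price in dollars.
car_age = [1, 2, 4, 6, 8, 10]
car_price = [24000, 21500, 17000, 12500, 9000, 6500]
print(association_direction(car_age, car_price))  # prints: negative
```

Here every newer-than-average car sells for more than the average price and every older-than-average car sells for less, so all six points vote for a negative association.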

See the provided figures. Some examples of data with a linear relationship are:

  • From a scatterplot of college students, there is a positive association between verbal SAT score and GPA.
  • For used cars, there is a negative association between the age of the car and the selling price.

[Figure: scatterplots showing linear relationships]

Some data exhibit a nonlinear (or curved) relationship. An excellent example of a nonlinear relationship is the one between the speed you drive your car and the corresponding gas mileage. This relationship is more quadratic in nature, with an example shown in the left image.

[Figure: scatterplots showing nonlinear (curved) relationships]

Example: Determining Whether a Scatter Plot Would Follow a Straight-Line Pattern

Determine whether the points in a scatter plot for the two variables are likely to have a positive slope, negative slope, or not follow a straight-line pattern.

a.     The number of hours you study for an exam and the score you make on that exam
b.     The price of a used car and the number of miles on the odometer
c.     The pressure on a gas pedal and the speed of the car
d.     Shoe size and IQ for adults

Solution

a.    As the number of hours you study for an exam increases, the score you receive on that exam is usually higher. Thus, the scatter plot would have a positive slope.
b.    As the number of miles on the odometer of a used car increases, the price usually decreases. Thus, the scatter plot would have a negative slope.
c.    The more you push on the gas pedal, the faster the car will go. Thus, the scatter plot would have a positive slope.
d.    Common sense suggests that there is not a relationship, linear or otherwise, between a person’s IQ and his or her shoe size.

Measuring the Strength of a Linear Relationship

There still remains some subjectivity when describing the relationship between two variables from a scatterplot. What you may see in a pattern of dots I may interpret differently (it’s like looking at cloud patterns in the sky). To eliminate this “bias,” we can directly measure the strength of a linear relationship using the correlation coefficient, r. There is an intimidating formula for computing the value of r by hand, so you should always rely on technology for this! If you’re interested, here is the formula:

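$$
r = \frac{\sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n - 1}
$$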

Even though the formula is complex, you’ll notice quite a few symbols familiar from before: x̄ and ȳ represent the sample means of the two variables being studied, and s_x and s_y represent their sample standard deviations. Finally, n is the number of individuals (data pairs) in the sample, and dividing by n – 1 uses the same degrees of freedom as the sample standard deviation.
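To see that the formula is less scary than it looks, here is a minimal sketch of the computation in Python. It assumes the usual definition of r: standardize each x and each y, multiply the pairs of standardized values, add the products, and divide by n – 1. The function name `correlation` and the study-hours data are hypothetical.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation coefficient r for paired data xs, ys."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    # Product of the standardized x and standardized y for each individual.
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)


# Hypothetical data: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
scores = [62, 70, 71, 80, 88]
r = correlation(hours, scores)
print(round(r, 3))
```

For these made-up numbers r comes out close to 1, which matches the strong upward pattern in the data. In practice you would still let Excel or a statistics package do this for you.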

The correlation coefficient r provides a measure of the strength and direction of the linear relationship:

  • The value of r always lies between –1 and 1; the stronger the linear relationship, the larger the magnitude of r.
  • A positive r indicates a positive relationship, and a negative r indicates a negative relationship.
  • If r is 0 or close to 0, then there is little or no linear relationship between the two variables.
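Note the word "linear" in the last bullet. A value of r near 0 does not rule out a curved relationship. The sketch below illustrates this with made-up points lying exactly on the curve y = x², using the same hypothetical `correlation` helper built from the standardized-products definition of r.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation coefficient r for paired data xs, ys."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)


# Hypothetical data: seven points that lie exactly on the parabola y = x**2.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_curved = correlation(xs, ys)
print(abs(r_curved) < 1e-9)  # prints: True
```

The points follow a perfect pattern, yet r is 0 because the pattern is not a straight line: the left half slopes down and the right half slopes up, and the two halves cancel. This is why you should always look at the scatterplot before trusting r.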

[Figure: Scatter plots can reveal different correlations]

Example (continued)

Continuing our example of relating class absences to final course grade, we already know the correlation coefficient r will be negative, since the final course grade decreases as the number of absences increases. Because the points lie roughly on a line, the value of r should be close to –1. Using Excel we get the following output:

[Figure: Excel output showing the correlation coefficient]

The Excel output gives the value of r (rounded to 3 decimal places), which is about what we expected!