## 1. Intro

We are now getting into the more computational part of statistics. When we describe a distribution, there are three main characteristics to identify: its shape, center, and spread. We already learned about the many shapes distributions can take. The center and spread of a distribution are *numerical summaries* of a data set. There are many ways to describe both the center and spread of a data set.

## 2. Measures of Center

In this section we learn three **measures of central tendency**, i.e., three ways to identify the “center” of a data set. You should already be somewhat familiar with these basic ideas.

### The Mode

The **mode** of a variable is a simple measure of center that describes the most frequent (recurring) observation in a data set. For example, in a data set such as {0,1,2,2,2,3,4,8}, the value 2 would be the mode of the data set since it appears the most times. If you refer back to the exam score data shown previously, none of the data values appears more than once. This means that the exam score data has NO mode. The mode is rarely used in serious statistics studies.

### The Arithmetic Mean

The **arithmetic mean** of a variable is often what people mean by the “average.” To calculate it, simply add up all the values and divide by how many there are. In symbols, *x̄* = Σ*x*_{i} / *n*. In statistics, the symbol *x̄* (pronounced “x-bar”) is used to denote the **mean of a sample**.

As an example, the data shown in the table are the first exam scores for all 17 students from a calculus class. You should verify the value of the average for yourself.

The mean is only valid for quantitative data and can be thought of as a balance point for the set of data. For example, if you had only three exam scores, such as 81, 85, and 83, you can see how “borrowing” two exam points from the 85 and applying them to the 81 would make all three scores be 83. That is your average value.
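As a quick sketch, the three-score “balance point” example above can be checked directly in Python:

```python
# Sample mean: add up all the values and divide by how many there are.
scores = [81, 85, 83]  # the three hypothetical exam scores from the text

x_bar = sum(scores) / len(scores)
print(x_bar)  # 83.0 -- the "balance point" described above
```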

### The Median

The **median** of a variable, typically denoted by the capital letter M, is another measure of the “center.” The median is simply the numerical value of the data value that occupies the physical middle location of your ordered data set.

After just a moment of thinking, it’s clear that the calculation of the median of a variable is slightly different depending on if there are an odd number of data points, or if there are an even number of data points. (Think about the middle value of 3 data points and the middle value of 4 data points.)

To calculate the median of a data set, arrange the data in order and count the number of observations, *n*. If the value of *n* is odd, then there will be a data value exactly in the middle. That data value is the median M. If *n* is even, then there will be two values on either side of the exact middle. In this case, the median is defined to be the average of these two data values. In either case, the *location* of the median can be found through the simple calculation: (*n* + 1) / 2.

Please be careful to note that the median is **NOT** the value of the fraction (*n* + 1) / 2. The value of this fraction is simply the *location* of the median.

Returning to the example of the first exam scores above, we first sort the scores in ascending order, as shown. Since there are 17 scores, an odd number, there will be an exact middle score. If your data set has just a few values, it is easy to find the middle. However, just to be sure, the location of the median comes from the formula: (17 + 1) / 2 = 9.

Thus, the 9th data value gives us our median, *M* = 76.5.

Now suppose that the student who scored a 46 wasn’t enrolled in class in the first place. That would mean the population has *n* = 16 members (just ignore the 46). In that case, the location of the median would be (16 + 1) / 2 = 8.5.

Obviously there is not a score in the 8.5th location. Thus, we take the two scores from the 8th and 9th positions, 76.5 and 77, and find their average: (76.5 + 77) / 2 = 76.75.

Therefore, the population median would be *M* = 76.75.
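The odd/even median procedure described above can be sketched in a few lines of Python; the small data sets here are just illustrations:

```python
# Median: sort the data, then take the middle value (odd n)
# or the average of the two middle values (even n).
def median(data):
    s = sorted(data)
    n = len(s)
    if n % 2 == 1:                 # odd n: exact middle, at location (n + 1) / 2
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2   # even n: average the two middle values

print(median([3, 1, 2]))        # 2
print(median([4, 1, 3, 2]))     # 2.5
```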

**On the TI-83/84 Calculator:**

First we enter the data into the lists. To do this we hit the STAT key and select the first option, 1:Edit. Then we enter the data into the first list and hit ENTER after every value.

Then we hit STAT again, arrow over to the CALC menu, select 1:1-Var Stats, hit ENTER, choose L1 as the list, and hit ENTER on Calculate. We notice that the value of the mean is shown, as well as the size of our data set, n=13.

If we arrow down further we see some more statistics including the median. Observe that the calculator does not give us the mode. For more details on using the calculator to do statistics check out the Technology Guide on D2L.

### Comparing the Mean and Median Values

Although both the mean and median are measures of the center of a data set, it is rarely the case that the two will be exactly equal. In fact, there are times when the two will be very different. How the mean and median relate to each other tells us a lot about the distribution of the underlying data set.

Basically, it’s important to know that the mean is a measure of center that is *highly* sensitive to changes in the data values. Think about the process of taking an average: *every* data value is used in the computation. If just one data value changes, then the average will change. The change in the value of the mean will be much more drastic if one of the outlying data values changes. In essence, the mean is **not resistant** to changes in extreme values.

On the other hand, the median is a measure of center that is not very sensitive to changes in the data values. Really only one or two data points determine the value of the median. The values of the other data points do not factor at all into the median. They serve only as placeholders, which lead to the middle value. Because of this, the median is **resistant** to changes in extreme values.

**Example:**

| Data | Mean | Median |
|---|---|---|
| {1, 5, 13, 20, 28} | μ = 13.4 | M = 13 |
| {1, 5, 13, 20, 280} | μ = 63.8 | M = 13 |
| {1, 5, 13, 20} | μ = 9.75 | M = 9 |

Notice how drastically the mean changes in each case, while the median stays either the same or changes just slightly.
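A minimal sketch of the resistance idea, using Python’s `statistics` module on the first two data sets from the table:

```python
from statistics import mean, median

# Changing one extreme value drags the mean but barely moves the median.
data = [1, 5, 13, 20, 28]
print(mean(data), median(data))    # mean 13.4, median 13

data[-1] = 280                     # replace the largest value with an outlier
print(mean(data), median(data))    # mean 63.8, median still 13
```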

Because of the sensitivity of the mean, it gets pulled in the direction of the tails for skewed data sets. Basically, if the distribution is:

- **Symmetric:** the mean will usually be close to the median.
- **Left (or negative) skew:** the mean will usually be smaller than the median.
- **Right (or positive) skew:** the mean will usually be larger than the median.

The following picture illustrates the graphical relationship of the mean, median, and mode in the three types of data:

## 3. Measures of Spread

The purpose of identifying a “central” value from a data set was to describe a *typical* value in the data set. Once we know this, we can measure the amount of dispersion or spread of the data values *from the typical, central, value*. In other words, we’re going to calculate how “spread out” our data is. Three main measures of dispersion for a data set are the **range**, the **variance**, and the **standard deviation**.

### The Range

The range of a variable is simply the “distance” between the largest data value and the smallest data value. In math symbols:

Range = largest data value – smallest data value

The table shown provides the first exam scores for a class with 11 students. The range for this data set is:

Range = 95 – 64 = 31

Calculating the range requires the use of only two values: the smallest and largest data values. No other data values affect the range. If either of these two values changes, so does the range. Therefore, the range clearly is **not resistant** to extreme values in the data set.
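As a quick sketch in Python (with a made-up data set, since the range needs only the two extremes):

```python
# Range = largest data value - smallest data value
data = [12, 47, 30, 21, 35]   # hypothetical data set
data_range = max(data) - min(data)
print(data_range)  # 35
```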

### Sample Variance

The variance of a data set is a numerical summary that indicates the *average deviation of each data value from the mean* of a data set. The calculation of the variance of a data set requires us to compare each data value from our raw list, {*x*_{1}, *x*_{2}, *x*_{3}, …, *x*_{n-1}, *x*_{n}}, to the mean *x̄*. The idea of deviation is just the difference, as computed by subtraction. In symbols, the deviation about the mean for the *i*^{th} data value, *x*_{i}, is the value (*x*_{i} – *x̄*).

Because of the definition of the mean of a data set, if you add up the deviation from the mean for each data value, you will always get zero. In symbols, Σ(*x*_{i} – *x̄*) = 0. This is a bit technical, but what it basically means is that we cannot just average the sum of deviations. We’d always get zero!

To get around this, we need a way to make all deviations from the mean positive, regardless of whether a data value is below or above the mean. For example, if you live two miles north of a city and I live two miles south of the same city, it would be ridiculous to say, “I live negative 2 miles from the city.” We both live two miles away.

Mathematically, one way to make all deviations positive is to use an absolute value. Another way, which we’ll use for calculating both the variance and standard deviation of a data set, is to **square** each deviation. For the city example, your deviation value would be 2^{2} = 4 and my deviation value would be (–2)^{2} = 4. Therefore, our deviations, regardless of being positive or negative, would be the same! So, to treat positive differences and negative differences as the same, we square the deviations: (*x*_{i} – *x̄*)^{2}.

Finally, since the variance measures the *average deviation of each data value from the mean* of the entire data set, we add up the squared-deviation value for each data point and divide by the value (*n* – 1), one fewer than the number of data values. This is another technical “difficulty” that we’ll deal with later. The value (*n* – 1) is given the special designation **degrees of freedom** of a data set. The reason for this will be made more clear throughout the class, but imagine the following simple scenario: You and four friends go to a Chinese restaurant, and at the end of the meal, your server brings your group 5 fortune cookies, setting them in a pile in the middle of the table. How many of your party of 5 get to actually *choose* their fortune? Only 4. The reason is obvious: after 4 people have had their choice of fortune cookies, only one remains. The fifth person has **no choice of fortune**. The degrees of freedom for this “problem” is thus 5 – 1 = 4.

Here’s another example, this time from a math standpoint: if someone tells you that they are thinking of 3 numbers whose average is 5, how many of the three numbers do you need to know before you know all 3? After a little thought, you’ll realize the answer is 2. If you are told that two of the numbers are 2 and 10, a little thought (and some algebra) will help you find that the last number must be 15 – 2 – 10 = 3.

Again, the degrees of freedom for this problem is 3 – 1 = 2.

In all its glory, the math formula for calculating the sample variance is:

*s*^{2} = Σ(*x*_{i} – *x̄*)^{2} / (*n* – 1)

where *n* is the size of the sample.

**Example: **Calculations for a Sample Variance

Returning to the sample of exam scores for a class with 11 students, the table illustrates the (sometimes tedious!) calculations for the sample variance. The mean of this sample of data is *x̄* = 82.

The total squared deviation for the data is **1272**. Therefore, the sample variance for the data set is:

*s*^{2} = 1272 / (11 – 1) = 127.2

| Score | Deviation From Mean | Squared Deviation |
|---|---|---|
| 94 | 94 – 82 = 12 | 12^{2} = 144 |
| 87 | 87 – 82 = 5 | 5^{2} = 25 |
| 95 | 95 – 82 = 13 | 13^{2} = 169 |
| 68 | –14 | 196 |
| 72 | –10 | 100 |
| 75 | –7 | 49 |
| 88 | 6 | 36 |
| 89 | 7 | 49 |
| 94 | 12 | 144 |
| 76 | –6 | 36 |
| 64 | –18 | 324 |
| | | SUM = 1272 |
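The table’s calculations can be verified with a short Python sketch:

```python
# Reproducing the table: deviations, squared deviations, and the sample variance.
scores = [94, 87, 95, 68, 72, 75, 88, 89, 94, 76, 64]

x_bar = sum(scores) / len(scores)                # 82.0
sq_devs = [(x - x_bar) ** 2 for x in scores]
total = sum(sq_devs)                             # 1272.0
sample_var = total / (len(scores) - 1)           # 1272 / 10 = 127.2
print(x_bar, total, sample_var)
```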

## 4. Sample Standard Deviation

The standard deviation provides the most commonly used measure of the spread of the underlying distribution of data. When comparing two distributions, the larger the standard deviation, the more dispersion the data set has (the data values, as a whole, differ more from the mean). Think about these:

- Would you rather stand in a line where the standard deviation of wait times is 3 minutes or 6 minutes? Why?
- Would you rather invest in a stock with a standard deviation rate of return of 8% or 20%? Why?

The standard deviation and the mean are the most popular methods for numerically summarizing the distribution of a symmetric variable. Knowing both the mean (center) and standard deviation (spread) tells us a lot about the distribution. We’ll be using these two values for most types of statistical inference later on.

For those of you scared by the above calculations for variance, you can relax! The standard deviation is simply the square root of the variance:

**standard deviation = √variance**

The **sample standard deviation**, denoted by **s**, is simply the square root of the **sample variance**:

*s* = √*var* = √*s*^{2}

**Example Calculations for a Sample Standard Deviation**

For the above example of exam scores, the sample variance was *s*^{2} = 127.2. Therefore, the sample standard deviation is:

*s* = √*s*^{2} = √127.2 ≈ 11.2783

On the TI-83/84 calculator, 1-Var Stats reports the sample standard deviation in its list of statistics as Sx=11.27829774.
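A quick check with Python’s `statistics` module (whose `stdev` uses the same *n* – 1 divisor):

```python
from statistics import stdev

scores = [94, 87, 95, 68, 72, 75, 88, 89, 94, 76, 64]
s = stdev(scores)       # square root of the sample variance 127.2
print(round(s, 4))      # 11.2783
```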

### Five-Number Summary and Boxplots

There are other important tools to summarize data numerically and/or graphically. A five-number summary is simply a list of 5 numerical values from a data set, one of which you’ve already learned to compute: the median. Boxplots are a way to visualize a five-number summary, and are an excellent way to compare the distributions of two (or more) data sets.

### Five-Number Summary

The five-number summary of a data set consists of the following 5 numbers from the data set:

minimum, first quartile, median, third quartile, maximum

*min*, *Q*_{1}, *M* = *Q*_{2}, *Q*_{3}, *max*

The median of a data set, being the middle value, necessarily splits your data set into two pieces: one consisting of all the values to the left of (and thus lower than) the median and one consisting of all the values to the right of (and thus greater than) the median. The first quartile is simply the median value of the lower half of your data set, and the third quartile is the median value of the upper half of your data set. The median of your entire data set is obviously the same as the second quartile.

The quartiles split your data into fourths, which means that 25% of the data values fall between the minimum data value and *Q*_{1}, 25% of the data values lie between *Q*_{1} and the median, 25% of the data values lie between the median and *Q*_{3}, and 25% of the data lies between *Q*_{3} and the maximum value of the data set. In addition, 50% of all observations fall between *Q*_{1} and *Q*_{3}. The five-number summary provides a cool way to describe the distribution of a data set.
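The “median of each half” description above can be sketched in Python. Note that statistical software uses several slightly different quartile conventions, so other tools may report slightly different *Q*_{1} and *Q*_{3} values:

```python
# Five-number summary using the "median of each half" convention.
def five_number_summary(data):
    s = sorted(data)
    n = len(s)

    def med(vals):
        m = len(vals)
        if m % 2 == 1:
            return vals[m // 2]
        return (vals[m // 2 - 1] + vals[m // 2]) / 2

    lower = s[: n // 2]             # values below the median
    upper = s[(n + 1) // 2 :]       # values above the median
    return s[0], med(lower), med(s), med(upper), s[-1]

print(five_number_summary([1, 3, 5, 7, 9, 11, 13]))   # (1, 3, 7, 11, 13)
```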

### Boxplots

An obvious problem with a 5-number summary is that it is simply a list of numbers. A way to visualize the five-number summary is through a **boxplot**. To create a boxplot:

- Draw a horizontal or vertical number line (like an *x*-axis or *y*-axis) that has a scale corresponding to the values within the data set.
- Draw vertical lines at the positions of *Q*_{1}, *Q*_{2} (the median), and *Q*_{3}.
- Enclose these three vertical lines with a box.
- Draw vertical lines at the positions of the minimum and the maximum values of the data set.
- Finally, extend horizontal lines from the central box out toward the minimum and maximum values.

As mentioned earlier, boxplots are very useful when comparing two (or more) data sets. One can simply make comparisons between the data sets using the five-number summary by determining how each data set is distributed.

**Example**

As an example, shown are five-number summaries of three data sets, one corresponding to the home run distances for each of three players, Mark McGwire, Sammy Sosa, and Barry Bonds for a given year.

Creating a boxplot for each player, using a common vertical scale for home run distances allows us to compare the distribution of home run distances for each player.

The distribution shape and boxplot are related. We can identify symmetry or skewness, the quartiles, and the maximum and minimum values of a distribution from a boxplot. And if more than one boxplot is plotted on the same scale, we can visually compare the centers, the spreads, and the extreme values.

**Choosing an Appropriate Measure of Center**

When presented with data, a primary question is: which measure of center best describes this data? Let’s look at an example to help us decide.

**Example: Calculating Measures of Center—Mean, Median, and Mode**

Given the recent economy and change of attitude in society, many people choose to take on another job after retiring from one. Below is a sample of ages at which people truly retired; that is, they stopped working for pay. Calculate the mean, median, and mode for the data.

84, 80, 82, 77, 78, 80, 79, 42

**Solution **

**Mean:** Remember, the mean is the sum of all the data points divided by the number of points.

*x̄* = (84 + 80 + 82 + 77 + 78 + 80 + 79 + 42) / 8 = 602 / 8 = 75.25

**Median:** We have an even number of values, so we will need the mean of the middle two values in the ordered array.

42, 77, 78, **79**, **80**, 80, 82, 84

*M* = (79 + 80) / 2 = 79.5

**Mode: **The number 80 occurs more than any other number, so it is the mode.

As you can see in the figure, of the three measures of center, the mean is closest to the outlier while the median and mode are more similar in value and are not affected by the outlier.
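The three measures for the retirement data can be checked with Python’s `statistics` module:

```python
from statistics import mean, median, mode

ages = [84, 80, 82, 77, 78, 80, 79, 42]
print(mean(ages))    # 75.25 -- pulled down toward the outlier, 42
print(median(ages))  # 79.5  -- average of 79 and 80
print(mode(ages))    # 80    -- the most frequent value
```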

**Determining the Most Appropriate Measure of Center**

- For qualitative data, the mode should be used.
- For quantitative data, the mean should be used, unless the data set contains outliers or is skewed.
- For quantitative data sets that are skewed or contain outliers, the median should be used.

**Example: Choosing the Most Appropriate Measure of Center**

Choose the best measure of center for the following data sets.

- T-shirt sizes (S, M, L, XL) of American women
- Salaries for a professional team of baseball players
- Prices of homes in a subdivision of similar homes
- Professor rankings from student evaluations on a scale of
*best*,*average*, and*worst.*

**Solution **

- T-shirt sizes are qualitative, so the mode is the best measure of center.
- The players’ salaries are quantitative data with outliers, since the superstars on the team make substantially more than the typical players. Therefore, the median is the best choice.
- The home prices are quantitative data with no outliers, since the homes are similar. Therefore, the mean is the best choice.
- The rankings are qualitative, so it’s best to use the mode as a measure of center.

**Summary**

## 1. Intro

In this sub-competency, you will study the relationship between two variables measured from an individual. In many studies we measure more than one variable for each individual. Some examples are:

- The weight of a car and its gas mileage (in miles per gallon)
- Exercise and cholesterol levels for a group of people
- Height and weight for a group of people

In cases where multiple variables are measured from individuals, we are interested in whether the variables have some kind of a relationship. We’d like to know whether changes in one variable lead to specific (and thus predictable) changes in another variable.

When we have two variables, they could be “connected” in one of several different ways:

- They could be completely unrelated.
- One variable (the **explanatory** or **predictor variable**) could be used to explain the other (the **response** or **dependent variable**).
- One variable could be thought of as causing the other variable to change.

A **response variable** measures an outcome of a study (think *y*-value or dependent variable) while an **explanatory variable** explains or influences changes in a response variable (think *x*-value or independent variable). Sometimes it is not clear which variable is the explanatory variable and which is the response variable. Sometimes the two variables are related without either being explanatory or response variables. And sometimes the two variables are both affected by a different variable, called a **lurking variable**, which was not collected or included in the study. Studies with lurking variables can cause a lot of trouble for people trying to prove a point. An excellent example of a lurking variable is a study that shows the number of television sets in your home can be used to predict your life expectancy! Think about some possible lurking variables in this study.

## 2. A Scatterplot

The most useful graph to show the relationship between two quantitative variables is the **scatter diagram**. If a distinction exists in the two variables being studied, plot the **explanatory variable (X)** on the horizontal scale, and plot the **response variable (Y)** on the vertical scale. With a scatterplot, each individual in the data set is represented by a single point (*x, y*) in the *xy*-plane.

**Example** (taken from *Fundamentals of Statistics*, by Sullivan):

A professor at a large midwestern university wanted to study the relationship between the number of class absences a student has in a given semester and that student’s final course grade. The data shown were collected from a sample of students in a general education course.

Identifying the relationship between the two data values from a table is difficult, so we create a scatterplot. In this case, the professor hopes that the number of a student’s absences will offer some explanation of his or her final course grade.

Plot the 10 points on the *xy*-axes, using the points (0, 89.2), (1, 86.4), and so on. Typically we rely on technology to create the scatterplot for us. A scatterplot created in Excel looks like:

It’s now quite clear that as the number of absences increases, the final course grade decreases.

### Types of Relationships

Once you have a scatterplot, it can be used to identify an overall pattern and deviations from this pattern. You can describe the pattern by form, direction, and strength of the relationship, and you can identify points that do not follow the overall pattern (outliers). This is a process very similar to describing distributions!

Some relationships are such that the points of a scatterplot tend to fall along a more-or-less straight line. Two variables have a **linear relationship** in a scatter plot when the two variables roughly follow a straight-line pattern. We say two variables have a **positive association** if above-average values of one variable tend to accompany above-average values of the other variable, and below-average values tend to occur together. Likewise, two variables have a **negative association** if above-average values of one variable tend to accompany below-average values of the other variable, and vice versa.

When the points in a scatter plot do roughly follow a straight line, the direction of the pattern tells how the variables respond to each other. A **positive slope** indicates that as the values of one variable increase, so do the values of the other variable. This type of relationship between two variables is called a positive linear relationship. A **negative slope** indicates that as the values of one variable increase, the values of the other variable decrease. This type of relationship between two variables is called a negative linear relationship.

See the provided figures. Some examples of data with a linear relationship are:

- From a scatterplot of college students, there is a *positive association* between verbal SAT score and GPA.
- For used cars, there is a *negative association* between the age of the car and the selling price.

Some data exhibits a nonlinear (or curved) relationship. An excellent example of a nonlinear data set is the relationship between the speed you drive your car and the corresponding gas mileage. This relationship is more quadratic in nature, with an example shown in the left image.

**Example: Determining Whether a Scatter Plot Would Follow a Straight-Line Pattern**

Determine whether the points in a scatter plot for the two variables are likely to have a positive slope, negative slope, or not follow a straight-line pattern.

a. The number of hours you study for an exam and the score you make on that exam

b. The price of a used car and the number of miles on the odometer

c. The pressure on a gas pedal and the speed of the car

d. Shoe size and IQ for adults

**Solution**

a. As the number of hours you study for an exam increases, the score you receive on that exam is usually higher. Thus, the scatter plot would have a positive slope.

b. As the number of miles on the odometer of a used car increases, the price usually decreases. Thus, the scatter plot would have a negative slope.

c. The more you push on the gas pedal, the faster the car will go. Thus, the scatter plot would have a positive slope.

d. Common sense suggests that there is not a relationship, linear or otherwise, between a person’s IQ and his or her shoe size.

### Measuring the Strength of a Linear Relationship

There still remains some subjectivity when describing the relationship between two data values from a scatterplot. What you may see in a pattern of dots I may interpret differently (it’s like looking at cloud patterns in the sky). To eliminate this “bias,” we can directly measure the strength of a linear relationship using the **correlation coefficient, r**. There is an intimidating formula for computing the value of *r* by hand, so you should always rely on technology for this! If you’re interested, here is the formula:

*r* = [1 / (*n* – 1)] Σ [(*x*_{i} – *x̄*) / *s*_{x}] [(*y*_{i} – *ȳ*) / *s*_{y}]

Even though the formula is complex, you’ll notice quite a few symbols familiar from before: *x̄* and *ȳ* represent the average values of the two variables being studied, and *s*_{x} and *s*_{y} represent the standard deviations of the two variables. Finally, the value *n* – 1 is the degrees of freedom for the *n* data points.
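As a sketch, the standard correlation formula can be translated directly into Python (the data sets below are hypothetical, chosen to lie exactly on a line):

```python
from statistics import mean, stdev

# r = (1 / (n - 1)) * sum of ((x_i - x_bar)/s_x) * ((y_i - y_bar)/s_y)
def correlation(xs, ys):
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)

print(round(correlation([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # 1.0  (perfect positive)
print(round(correlation([1, 2, 3, 4], [9, 7, 5, 3]), 6))   # -1.0 (perfect negative)
```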

The correlation coefficient *r* provides a measure of the strength and direction of the linear relationship:

- the stronger the relationship, the larger the magnitude of *r*,
- a positive *r* indicates a positive relationship, and a negative *r* indicates a negative relationship, and
- if *r* = 0, then no linear relationship exists between the two variables.

Scatter plots can reveal many different types and strengths of correlation.

**Example (continued)**

Continuing our example of relating class absences to final course grade, we already know the correlation coefficient *r* will be negative (since the final course grade decreases as the number of absences increases), and since the points lie roughly on a line, the value of *r* should be close to –1. Using Excel we get the following output:

Therefore, we read the value of *r* (rounded to 3 decimal places) from the output, which is about what we expected!

## 3. Correlation versus Causation

Just because two variables are correlated does not mean that one *causes* the other to change. For example, there is a strong correlation between shoe sizes and vocabulary sizes for grade school children. Clearly larger shoe sizes do not *cause* larger vocabularies, and larger vocabularies do not *cause* children’s feet to grow. It’s actually the age of a child that more obviously relates to both other variables. The variable **age** is an example of the previously mentioned lurking variable. Often lurking variables result in *confounding*, which is the belief that two variables have a cause-and-effect relationship, when actually other variables are “in charge.”

### Linear Regression

Once you have a scatterplot of data, it’s tempting to act like a child and play connect-the-dots. Many times the dots form a nice pattern. The pattern we are most interested in for this course is a linear pattern. In higher level statistics courses, you get to study other more interesting patterns. Linear regression is a technique that summarizes and, more importantly, quantifies the linear relationship between two variables.

Now that we have data from two variables, say *X* and *Y*, rather than just quantify the linear correlation between them, we often would like to *model* the relationship as an actual line.

Basically, we want to draw a line through the scatter diagram. But again, left up to an individual, we each would draw slightly different lines through the data. What we really need is the line that “best” describes (or fits) the data. This line is called the **least-squares regression line**. Once we have such a line, we can then predict the average response for all individuals with a given value of the explanatory variable.

### Determining the Least-Squares Regression Line

Recall that a linear equation has the mathematical (algebraic) form *y* = *mx* + *b*, where *m* is the slope (rise over run) and *b* is the *y*-intercept (the location along the *y*-axis where the line crosses). Try not to get “trapped” by the letters used in the formula, as each textbook or online resource will likely use different letters to represent the slope and the *y*-intercept.

In the textbook recommended for this course, the linear form is written as: *y* = *a* + *bx*,

where *b* represents the slope, *a* represents the *y*-intercept, and the value *y* is the *predicted* value for the response variable *Y*. This is the variable convention that will be followed in the rest of these notes.
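A minimal sketch of computing the least-squares values of *a* and *b* directly. The data below are hypothetical, and the slope formula used is the standard least-squares one, *b* = Σ(*x*_{i} – *x̄*)(*y*_{i} – *ȳ*) / Σ(*x*_{i} – *x̄*)^{2}:

```python
from statistics import mean

# Least-squares estimates for the line y = a + b*x.
def least_squares(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar       # the line always passes through (x_bar, y_bar)
    return a, b

a, b = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)   # 1.0 2.0 -- the line y = 1 + 2x fits these points exactly
```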

**Example (Continued)**

We again return to studying the relationship between absences and grades. Using Excel to determine the linear regression line we see:

Excel not only draws the line of best fit through our data, it gives you the option of displaying the linear equation on the scatterplot. Notice that the line does not go through every point. (It couldn’t because the points do not fall along an exact line.) However, the line is created so that it minimizes the total amount of deviation (squared errors) for all data points.

### Interpreting the Slope and *y*-Intercept Values from Linear Regression

In any linear regression, the slope value *b* is the most important value. Remember that the slope is defined as:

*b* = rise / run = (change in *y*) / (change in *x*)

Basically, the slope describes how values of the response variable *Y* will change when values of the explanatory variable *X* change. For example:

- If *b* = 4 (thus, we have a positive linear relationship), then:
  - If *x* increases by 1, then *y* will increase by 4.
  - If *x* decreases by 1, then *y* will decrease by 4.
- If *b* = –7 (thus, we have a negative linear relationship), then:
  - If *x* increases by 1, then *y* will decrease by 7.
  - If *x* decreases by 1, then *y* will increase by 7.

Usually the value of the *y*-intercept, *a*, has no real meaning. It is useful only if 0 is a reasonable value for *x*. In that case, *a* can be interpreted as the value of *y* when *x* = 0. If 0 is not a reasonable value for *x*, then *a* does not have an interpretation.

**Example (continued)**

From our linear regression, we found that the equation

*y* = 88.733 – 2.8273*x*

models the relationship between the number of absences (*x*) versus final course grade (*y*). In this case, the slope *b* means that for each additional absence a student has in this particular course, he or she can expect the final course grade to **drop**, on average, by roughly 2.8 points. The *y*-intercept *a* *does* have a reasonable value in this case. If a student never misses a class (*x* = 0), he or she can expect to receive (on average) a final grade of 88.733 for the class. This example provides great evidence for the damage that missing classes can have on your grade in a course!

The real use of a linear regression line is for predictions. Let’s assume a student misses 4 classes throughout the semester. Both the student and the professor have a good idea of what that student’s final grade will be:

*y* = 88.733 – 2.8273(4) = 77.4

This student is expected to receive a final course grade of 77.4%. In general, the student’s specific final grade will be different, but *on average* (if we took all students who missed exactly 4 classes and averaged their final course grades), we expect to see a final grade of 77.4%.
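The prediction above is just arithmetic with the regression equation, as a sketch:

```python
# Predicted average final grade from the regression line y = 88.733 - 2.8273x.
def predict_grade(absences):
    return 88.733 - 2.8273 * absences

print(round(predict_grade(4), 1))   # 77.4
print(round(predict_grade(0), 1))   # 88.7 -- the y-intercept
```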

### Extrapolation

In general, we should not use a linear regression model for values of *x* that are much larger or much smaller than the observed values, i.e., outside of the scope of the model. Without collecting more data, we have no idea what happens to the relationship of *X* and *Y* outside of our data values. Does the linear relationship continue? Who knows. We’d need to collect more data.

**Example: Calculating the Correlation Coefficient Using a TI-83/84 Plus Calculator**

Calculate the correlation coefficient, r, for the data from the table relating touchdowns thrown and base salaries.

**Solution**

Let’s enter these data into our calculator.

- Press [STAT].
- Select option 1:Edit.
- Enter the values for number of touchdowns (x) in L1 and the values for base salary (y) in L2.
- Press [STAT].
- Choose CALC.
- Choose option 4:LinReg(ax+b).
- Press [ENTER] twice.

From the scatter plot, we would expect r to be close to 0. The calculator confirms that the correlation coefficient for these two variables is r ≈ -0.251, indicating a weak negative relationship, if any relationship exists at all.
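The touchdown/salary data from the table is not reproduced here, but the quantity the calculator reports can be computed from scratch. A sketch in Python (the tiny data set below is made up purely to illustrate the formula, not taken from the table):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Compute the linear correlation coefficient r from paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross products.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Perfectly linear increasing data gives r = 1.0.
print(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))
```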

**Coefficient of Determination**

The coefficient of determination is a measure of how well the least-squares regression line explains the variation in the response variable. Although easy to calculate, it carries a very powerful meaning for a set of data that is linearly related.

Now that we have discussed the scatterplot, linear correlation (*r*), and least-squares linear regression, we might like to measure how well our model *explains* the linear relationship. In other words, how much does our model explain? How much does our model improve our prediction? So if we have a value of *x*, and we use our linear model *y* = *a* + *bx* to predict the value of *y*, how accurate are we?

To answer these questions, we look to the percentage of variation in the response variable *Y* that is explained by the least-squares regression line. This percentage of variation is called the **coefficient of determination**, and is denoted symbolically by *R*^{2}. Repeating what was just written, *R*^{2} measures the fraction of the variation in the values of the response variable *Y* that is explained by the least-squares regression line. The value of *R*^{2} varies between 0 and 1. A value of *R*^{2} close to 0 (i.e., 0% explanation) indicates a model with very little explanatory power. A value of *R*^{2} close to 1 (i.e., 100% explanation) indicates a model with lots of explanatory power. Again, the value of *R*^{2} is an overall measure of the usefulness of a linear regression prediction.

Luckily, to compute *R*^{2} we simply square the value of the linear correlation coefficient, *r*. In symbols:

*R*^{2} = *r*^{2}

If you recall, the linear correlation coefficient for the data set of absences versus final course grade was *r* = -0.947. Thus, the value of *R*^{2} is:

*R*^{2} = (-0.947)^{2} = 0.8968

Thus, 89.68% of the variation in final course grades can be explained by the least-squares linear regression *y* = 88.733 – 2.8273*x*.
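Since *R*^{2} is just the square of *r*, the arithmetic is a one-liner; a quick Python check:

```python
r = -0.947  # linear correlation for absences vs. final course grade

# The coefficient of determination is the square of r.
r_squared = r ** 2
print(round(r_squared, 4))  # 0.8968, i.e., about 89.68% of variation explained
```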

The value for *R*^{2} also comes directly from the Excel output of a linear regression. For the absences versus final course grade problem, the regression output is shown. Notice the slight difference in our calculated *R*^{2} value (0.8968) and the *R*^{2} value from Excel (0.89755). This is due solely to rounding error. (The value we used for *r* was rounded to three decimal places.)

**Example 12.9: Calculating and Interpreting the Coefficient of Determination**

If the correlation coefficient for the relationship between the numbers of rooms in houses and their prices is r = 0.65, how much of the variation in house prices can be associated with the variation in the numbers of rooms in the houses?

**Solution**

Recall that the coefficient of determination tells us the amount of variation in the response variable (house price) that is associated with the variation in the explanatory variable (number of rooms).

Thus, the coefficient of determination for the relationship between the numbers of rooms in houses and their prices will tell us the proportion or percentage of the variation in house prices that can be associated with the variation in the numbers of rooms in the houses. Also, recall that the coefficient of determination is equal to the square of the correlation coefficient.

Since we know that the correlation coefficient for these data is *r* = 0.65, we can calculate the coefficient of determination as *R*^{2} = (0.65)^{2} = 0.4225.

Thus, approximately 42.25% of the variation in house prices can be associated with the variation in the numbers of rooms in the houses.

## 1. Intro

This competency set will address the subject of data collection. You’ll learn how to appropriately collect meaningful data, either through sampling or experiments, from which conclusions can be drawn. Anyone can collect data. However, if the data is not collected in a way to eliminate (or reduce) bias or the data does not accurately represent the population of interest, all results and conclusions drawn from the data will be practically meaningless.

As a word of warning, there are a tremendous number of terms that form the “language of statistics.” It may be very beneficial to create your own statistics dictionary, consisting of terms, definitions, and examples. As this language will be used throughout the course, having a good understanding of terms from the start will help you succeed.

### Parameters vs. Statistics

Remember from the first unit that a **population** is the *entire* group being studied, while a **sample** is a representative *subset* of the population. By definition, a sample is always smaller than the population.

If you are able to collect data from the entire population (for example, exam scores for *all* students in a statistics course), then the descriptive measures of the population are called **parameters**. Parameters are often written using Greek letters like *μ* (pronounced “mew”) or *σ* (pronounced “sigma”). If you only have data from a sample, the descriptive measures of samples are called **statistics**. Statistics are written using Roman letters like *x̄* and *s*, as you saw in Unit 2. An easy way to remember the distinction is by:

**p**arameter ⇔ **p**opulation

**s**tatistic ⇔ **s**ample

The main reason for the difference is so you know whether someone is reporting a descriptive measure from the entire population or just a sample. This very subtle, yet extremely important, difference forms the basis for the process of *statistical inference*.

## 2. Collecting Data by Surveys

In a sense, the techniques for collecting data are the most important step in the process of statistics; the procedures set the stage for obtaining information that can be used to draw meaningful and accurate conclusions. All of the statistical calculations we’ll be learning can be applied to any set of data, regardless of how the data was obtained, but bad data produces useless results. Therefore, no matter how careful and exacting you are in organizing, summarizing, and analyzing data, your conclusions will be useless if you aren’t careful at the beginning to collect data appropriately.

There are four main ways to obtain data.

- A **census** is a list of all the individuals and their characteristics in a population. An example of a census is the US Census held every 10 years (this is only an example, though). The main advantage of conducting a census is that your conclusions will have 100% certainty. The disadvantages of conducting a census are that it may be difficult or impossible to obtain all the information, and costs may be prohibitive.
- An **existing source** is an appropriate data set that has already been collected and can be used for your study. The advantage of finding an existing source of data is obviously the savings in both time and money. A disadvantage is that it can often be difficult to find the *exact* data you need.
- A **survey sample** is a study in which only a subset of the population is considered and there is no attempt to influence the value of the variable of interest. The advantage of using a survey sample is the savings in both time and money of not having to get information from every individual in the population. The main disadvantage of a survey sample, and this is extremely important, is that choosing an appropriate sample can be difficult. The sample *must* represent the overall population, even though it is just a subset of the population.

  A survey sample is an example of an **observational study**, where there is no attempt to influence the value of the variable. Observational studies are great for detecting *associations* (relationships) between variables, but they cannot isolate causes to determine *causation*. This happens when we fail to observe certain variables, called **lurking variables**.
- A **designed experiment** is a study that applies a treatment to individuals. In an experiment, information from the treated group is often compared with a control (untreated) group. Variables from the individuals and the treatments can be controlled in an experiment. A major advantage of an experiment is that you can analyze individual factors. Disadvantages of experiments are that they cannot be conducted when the variables cannot be controlled or when moral/ethical reasons prevent it. Section 1.5 discusses methods for setting up and conducting an experiment.

When conducting a census is unrealistic (as is usually the case), sampling from the population is the next best thing. There is one main question: How do you choose your sample? For example, if you are interested in knowing the average grade point average (GPA) of graduating high school students in your city, you wouldn’t want your sample to consist of only women or of just athletes or of just honor roll students. You would want your sample to represent the entire population of interest.

We must use the process of **randomness** to select the individuals included in our sample. If we are allowed to do the selecting, our sample will most certainly be **biased**, i.e., it will include a group of individuals that does not represent the entire population, and therefore, conclusions will most certainly systematically favor certain outcomes.

The most popular sampling technique that relies on randomness is **simple random sampling**, a technique where every possible sample of size *n* out of a population of size *N* has an equally likely chance of occurring. For example, a simple random sample of size n = 2 from a population size of N = 4 has 6 possible samples, and each has an equally likely chance of occurring:

**Population:** {1, 2, 3, 4}

**Possible Samples of Size n = 2:** {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}
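These six equally likely samples can be enumerated directly. A short Python sketch (not part of the original course materials):

```python
from itertools import combinations

population = [1, 2, 3, 4]

# Every possible sample of size n = 2; under simple random sampling,
# each is equally likely to be the one chosen.
samples = list(combinations(population, 2))
print(len(samples))  # 6
print(samples)       # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```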

As simple random sampling is similar to “drawing names out of a hat,” we need a method to select the individuals for our sample. We will use either a table of random digits or technology to do this. A quick search on the Internet shows many places to find tables of random digits. One great site contains freely available tables (download as a PDF or read online): http://www.rand.org/pubs/monograph_reports/MR1418.html

In such a table of random digits, each entry is equally likely to be any of the 10 digits 0 through 9, which means that entries are *independent* of one another (knowledge of one number gives us no information about any of the other entries surrounding it). In fact, if we read the table in groups of two numbers, each pair of entries is equally likely to be any of the 100 pairs 00, 01, …, 98, 99. Reading each triple of entries gives us an equally likely chance of seeing any of the 1000 entries 000, 001, 002, …, 998, 999.

To conduct a Simple Random Sample, begin by numbering every member in your population. If your population has size 30, you will read numbers from the table in groups of two (pairs); if your population has size 168, you will read numbers from the table in groups of three (triples). Start anywhere you’d like in the table, and read in any direction, left, right, up, or down. It’s nice to follow a pattern, just so you don’t get lost. Select the random numbers as you move along, and match the numbers chosen to the individuals in your population. If you select a random number that does not correspond to an individual in your population, or if you encounter a repeat number (this WILL happen because the digits are random!), skip it and move on.

For example, suppose I want to select 4 students from a class of 30 to estimate the class average on an exam (this is unrealistic because it’s trivial to find the average of only 30 students, but it’s just an example). I would first assign a number to each student, starting at 01 and ending at 30. Start at the beginning of a line, say line 00263, and read in pairs from left to right. Here are the numbers that I’d record:

32 03 13 96 08 75 99 27 34 45 01 …

**Table of Random Digits**

| Line | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| 00250 | 59467 | 58309 | 87834 | 57213 | 37510 | 33689 | 01259 | 62486 | 56320 | 46265 |
| 00251 | 73452 | 17619 | 56421 | 40725 | 23439 | 41701 | 93223 | 41682 | 45026 | 47505 |
| 00252 | 27635 | 56293 | 91700 | 04391 | 67317 | 89604 | 73020 | 69853 | 61517 | 51207 |
| 00253 | 86040 | 02596 | 01655 | 09918 | 45161 | 00222 | 54577 | 74821 | 47335 | 08582 |
| 00254 | 52403 | 94255 | 26351 | 46527 | 68224 | 90183 | 85057 | 72310 | 34963 | 83462 |
| 00255 | 49465 | 46581 | 61499 | 04844 | 94626 | 02963 | 41482 | 83879 | 44942 | 63915 |
| 00256 | 94365 | 92560 | 12363 | 30246 | 02086 | 75036 | 88620 | 91088 | 67691 | 67762 |
| 00257 | 34261 | 08769 | 91830 | 23313 | 18256 | 28850 | 37639 | 92748 | 57791 | 71328 |
| 00258 | 37110 | 66538 | 39318 | 15626 | 44324 | 82827 | 08782 | 65960 | 58167 | 01305 |
| 00259 | 83950 | 45424 | 72453 | 19444 | 68219 | 64733 | 94088 | 62006 | 89985 | 36936 |
| 00260 | 61630 | 97966 | 76537 | 46467 | 30942 | 07479 | 67971 | 14558 | 22458 | 35148 |
| 00261 | 01929 | 17165 | 12037 | 74558 | 16250 | 71750 | 55546 | 29693 | 94984 | 37782 |
| 00262 | 41659 | 39098 | 23982 | 29899 | 71594 | 77979 | 54477 | 13764 | 17315 | 72893 |
| 00263 | 32031 | 39608 | 75992 | 73445 | 01317 | 50525 | 87313 | 45191 | 30214 | 19769 |
| 00264 | 90043 | 93478 | 58044 | 06949 | 31176 | 88370 | 50274 | 83987 | 45316 | 38551 |

Since the first number does not correspond to anyone in the population, we skip it. The first student selected for the sample would be Student 03. Following this would be Students 13, 08, and finally 27:

32 **03** **13** 96 **08** 75 99 **27** 34 45 01 …

Therefore, all our statistical research would focus on the exam scores of the four students corresponding to the numbers 03, 08, 13, and 27. Numbering students and “allowing” a table to choose the students to include in our study removes the biases that would exist if we tried to choose the students ourselves.
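The pair-reading procedure (skip out-of-range values and repeats, stop when enough students are chosen) can be sketched in Python. The digit string below is row 00263 of the table with spaces removed; the function name is our own:

```python
# Digits from row 00263 of the table, read left to right.
digits = "32031396087599273445013175052587313451913021419769"

def srs_from_digits(digits, population_size, sample_size):
    """Read two-digit groups, skipping out-of-range numbers and repeats."""
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        number = int(digits[i:i + 2])
        # Keep the number only if it names a real, not-yet-chosen student.
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen

print(srs_from_digits(digits, 30, 4))  # [3, 13, 8, 27]
```

Note the sketch reproduces exactly the selections made by hand above: 32, 96, 75, and 99 are skipped, leaving Students 03, 13, 08, and 27.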

### Sampling Errors

Now that we have seen *how* to obtain samples appropriately, here are some of the issues that can arise during a sampling process. There are two types of errors that arise: **sampling errors** and **nonsampling errors**.

**Sampling errors** are very difficult to control or predict. These errors result from using the sample (a *subset* of the population) to describe characteristics of the population. Therefore, the process of sampling may give incomplete information about the population. In other words, even if we use a random process to select a sample, our sample may not “perfectly” represent the population of interest. This occurs because data and information vary from member to member in the population.

There are numerous **nonsampling errors** that result from the sampling process, including the nonresponse of individuals selected in the sample, inaccurate responses to poorly worded questions, bias in the selection of the sample, and so on. Nonsampling errors are often largely avoidable with a good study design, and minimizing these errors is of high priority in designing a sample survey. Some examples of nonsampling errors include:

- Using an incomplete population
- Nonresponse
- Interviewer errors
- Misrepresented answers
- Mistakes in recording or entering data
- Questionnaire design
- Wording of questions
- Order of questions, words, and responses

**Example 1: Identifying Parts of a Survey**

Two shortened survey reports are given. In each report, identify the following: the population, the sample, the results, and whether the results represent a sample statistic or a population parameter.

a. A headline about the rising obesity among young people led a school board to survey local high school students. Out of 231 students surveyed, 58% reported eating a “high fat” snack at least 4 times a week.

b. A nonprofit organization interviewed 618 adult shoppers at malls across Louisiana about their views on obesity in youths. The resulting report stated that an estimated 48% of Louisiana adults are in favor of government regulation of “high fat” fast food options.

**Solution**

a. Population: local high school students.

Sample: the 231 students who were surveyed.

Results: 58% of students surveyed eat a “high fat” snack at least 4 times a week. The result refers to only those students who were surveyed, thus the result is a sample statistic.

b. Population: Louisiana adults

Sample: the 618 adult Louisiana mall shoppers who were surveyed.

Results: 48% of Louisiana adults are in favor of government regulation of “high fat” fast food options. The results refer to all Louisiana adults, thus this is a population parameter. This population parameter is an estimate based on the sample statistics, which were not reported.

**Types of Sampling**

There are many ways one can sample from a population. Ideally, we would like to sample in such a way that the sample reflects all the characteristics of the population, and therefore the statistic represents the parameter well. The quality of a sample statistic (i.e., its accuracy, precision, and representativeness) is highly affected by how the sample is chosen; that is, by the sampling method. Below we describe the different sampling methods.

- A **representative sample** is one that has the same relevant characteristics as the population and does not favor one group of the population over another.
- A **random sample** is one in which every member of the population has an equal chance of being selected.
- A **stratified sample** is one in which members of the population are divided into two or more subgroups, called **strata**, that share similar characteristics like age, gender, or ethnicity. A random sample from each stratum is then drawn.
- A **cluster sample** is one chosen by dividing the population into groups, called clusters, that are each similar to the entire population. The researcher then randomly selects some of the clusters. The sample consists of the data collected from every member of each cluster selected.
- A **systematic sample** is one chosen by selecting every *n*th member of the population. For a given starting point, systematic sampling always produces the same sample for the same *n*; to get a different sample, you need a different *n* value.
- A **convenience sample** is one in which the sample is “convenient” to select, so named because it is convenient for the researcher.
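As a sketch of two of these methods (the toy population and variable names are our own, not from the course), systematic and stratified selection might look like this in Python:

```python
import random

population = list(range(1, 101))  # members numbered 1 through 100

random.seed(1)  # fixed seed so the sketch is reproducible

# Systematic sample: pick a random starting point, then take every 10th member.
start = random.randrange(10)
systematic = population[start::10]

# Stratified sample: split the population into two strata,
# then draw a random sample from each stratum.
stratum_a = population[:50]
stratum_b = population[50:]
stratified = random.sample(stratum_a, 5) + random.sample(stratum_b, 5)

print(len(systematic), len(stratified))  # 10 10
```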

Cluster sampling and stratified sampling are often confused, but perhaps a simple thought experiment will help. Suppose you wanted to compare the fuel efficiency of different cars driven in the United States. Can you think of some ways to divide the cars into strata that might represent a broader scope of vehicles on the market? For example: size of engine, manufacturer, make, safety rating, number of doors. Notice that these are characteristics the individuals in the sample may or may not have, and they are qualitative. For clustering, look at the same example and ask yourself: would this method produce a good representative sample of vehicles if we allowed our clusters to be price ranges? Why or why not? Because cluster sampling is an “all from one group” method, comparing mpg figures from cars in only certain price ranges would not produce a representative sample.

Suppose instead you decide to gather data from half the students in your class for our fuel-efficiency comparison. Do you think this example of convenience sampling would give an accurate picture of the population of cars driven in the United States? It is unlikely that students (or any single age group, for that matter) drive a wide range of cars. Newer, more expensive cars are less likely to be driven by students and would not be well represented in the student sample.

Lastly, let’s try identifying every 5th car entering the interstate at a particular entrance ramp during rush hour traffic. This is an example of systematic sampling for our fuel study. Can you identify any potential biases that we might need to be aware of when choosing the observation spot? The location of the entrance ramp might lend itself to cars on only one end of the price scale, depending on the businesses located in the area.

**Example 2: Identifying Sampling Techniques**

Identify the sampling technique used to obtain a sample in each of the following situations.

a. To conduct a survey on collegiate social life, you knock on every 5th dorm room door on campus.

b. Student ID numbers are randomly selected from a computer print out for free tickets to the championship game.

c. Fourth grade reading levels across the county were analyzed by the school board by randomly selecting 25 fourth graders from each school in the county district.

d. In order to determine what ice cream flavors would sell best, a grocery store polls shoppers that are in the frozen foods section.

e. To determine the average number of cars per household, each household in 4 of the 20 local counties were sent a survey regarding car ownership.

**Solution**

a. Because the sample is obtained by choosing every nth dorm room, this is systematic sampling. This is a representative sample, as long as students were randomly assigned to dorm rooms and there are no hidden potential biases, like only males may live in every nth room.

b. Since every member has an equal chance of being selected, this is random sampling.

c. The students were divided into strata based on their schools and then a random sample from each school was chosen. This is stratified sampling.

d. Because of the ease of choosing shoppers right in their own store, this is convenience sampling. In this case, convenience sampling is a viable method for gaining a representative sample since the store would be interested in knowing the thoughts of their customers.

e. Cluster sampling was used here because the counties are the natural clusters and all of the households in some of the counties received the surveys.

## 3. Collecting Data Through Experiments

In this section you will learn how to appropriately design, set up, and implement a **designed experiment** to collect data. Recall that data can be collected in two main ways: (1) through sample surveys or (2) through designed experiments. While sample surveys lead to observational studies, designed experiments enable researchers to *control* variables, leading to additional conclusions.

A **designed experiment** is a controlled study whose purpose is to control as many factors as possible to isolate the effects of a particular factor. Designed experiments must be carefully set up to achieve their purposes.

The variables in a designed experiment that are controlled are called the **explanatory variables** or are sometimes called **the factors**. Factors have values that can be changed by the researcher and are considered as possible causes. Examples of factors are:

- The dosage of a drug in a medical experiment
- The type of teaching method in an education experiment
- One drug by itself compared with that drug used in conjunction with another

The designed experiment analyzes the effects of the factors on the **response variable**. Response variables are not part of a controlled environment and have values that are measured by the researcher. Examples of response variables are:

- The blood pressures of the patients
- The test scores for a class
- The sizes of a cancerous tumor for patients

A **treatment** is the specific combination of the values of the factors. Examples of treatments are:

- Giving one medication to one group of patients and a different medication to another
- Using one type of fertilizer on a set of plots of corn and a different type of fertilizer on a different set of plots
- Playing country music to one group of mice and rap music to another

A treatment is applied to “experimental units” (people, plants, materials, or other objects). When experimental units are people, we refer to them as **subjects**. Subjects in an experiment correspond to individuals in a survey.

Here is an example of a designed experiment. While reading this, think about who are the subjects, what is/are the factors and treatments, and what is/are the response variables.

**Example 3: Drug Trials**

Suppose you want to determine whether a new drug, Drug N, is more effective at treating high blood pressure than the existing drug, Drug E. Patients with high blood pressure are given either Drug N or Drug E, and the blood pressures are measured one month later.

**Solution**

For this experiment, the **subjects** are the patients selected to receive either Drug N or Drug E; the **factor** is the type of drug that a subject receives; the **treatment** is the specific drug administered (Drug N or Drug E); and the **response variable** is the subject’s blood pressure after one month. If patients given Drug N have significantly lower blood pressures than patients given Drug E, we would like to conclude that Drug N is more effective. However, it’s not that easy to immediately draw such conclusions.

A carefully designed experiment ensures that the behavior of the researcher and/or subjects does not influence the outcome of the experiment. It is important for subjects to **not** know which treatment they get. In addition, many experiments will have a group of subjects that are not given any medication. These subjects are given a **placebo** (e.g., a sugar tablet) to control against the possibility that subjects imagine a change in their response variable because they know they are receiving “medication.” It is also important for the researchers to not know which group of patients is given which medication or placebo. An experiment where neither the experimenter nor the experimental unit knows what treatment is being administered is called a **double-blind experiment**.

Conducting an experiment involves considerable planning. Here are some steps to consider:

1. **Identify the problem**. The first step in planning an experiment (or most any project) is to identify the problem. This includes identifying the general purpose of the experiment, the response variable of interest, and the population. The identified problem is often referred to as a **claim** about the population of interest.
2. **Determine the factors**. The second step in planning an experiment is to determine the factors to be studied. Factors can be identified by experts in the field, by the overall purpose of the experiment, or by using results from previous studies. Factors *must* be identified as either fixed at some predetermined level, controlled (those that will be manipulated in the experiment), or uncontrolled.
3. **Determine the number of experimental units (i.e., the sample size)**. In general, the more experimental units, the more effective the experiment. However, the number of experimental units may be limited by time or money. We will learn some techniques later in the semester to calculate an appropriate number of experimental units.
4. **Determine the level of each factor**. There are three ways to deal with the factors:
   - Control – Fix the levels at a constant level (for factors not of interest).
   - Manipulate – Set the levels at predetermined levels (for factors of interest).
   - Randomize – Randomize the experimental units (for uncontrolled factors not of interest). Randomization decreases (or averages out) the effects of uncontrolled factors, even ones not identified or thought about in advance.
5. **Conduct the experiment**. Subjects must be assigned *at random* to a treatment group. There are several good methods for assigning treatments to experimental units: completely random, matched-pairs (see below), and randomized blocks. If a treatment is applied to more than one experimental unit, this is called **replication**, which can improve experimental accuracy and further decrease the effects of uncontrolled factors. In this step, the experimenter then collects and processes the data.
6. **Test the claim**. In the final step, we conduct **inferential statistics**, which will be studied in detail in sub-competencies 8 through 12.

A **completely randomized design** is when each experimental unit is assigned to a treatment completely at random.
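A completely randomized assignment can be sketched with a random shuffle. The subject IDs, group sizes, and treatment names below are made up purely for illustration:

```python
import random

subjects = [f"subject_{i}" for i in range(1, 13)]  # 12 hypothetical subjects
treatments = ["Drug N", "Drug E", "placebo"]

random.seed(42)  # fixed seed so the sketch is reproducible
random.shuffle(subjects)

# Deal the shuffled subjects into equal-sized treatment groups.
group_size = len(subjects) // len(treatments)
groups = {t: subjects[i * group_size:(i + 1) * group_size]
          for i, t in enumerate(treatments)}

for treatment, members in groups.items():
    print(treatment, len(members))  # each treatment gets 4 subjects
```

Because the shuffle is random, every subject is equally likely to land in any treatment group, which is exactly the point of a completely randomized design.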

Another type of experimental design is the **matched-pairs design**. A matched-pairs design is when the experimental units are paired up (e.g., twins, the same person before and after the treatment, a husband and wife) and each of the pair is assigned to a different treatment. There are only two levels of treatment (one for each of the pair). For example, a researcher would collect and compare information from the same subject before receiving a certain medication and then after receiving the medication.

Finally, we cannot always control all factors whose effects we do not care about but suspect might affect our response variable (or the factors affecting it). For example, customers with young children have different purchasing habits than those without, and men and women may respond differently to a treatment. These are not factors that can be assigned to subjects. Factors like these may account for some of the variation in the response in experiments because subjects at different levels may respond differently. We deal with them by grouping, or **blocking**, our subjects together and, in effect, analyzing the experiment separately for each block. Such factors are called **blocking factors**, and their levels are called **blocks**. Blocking an experiment is like stratifying in survey design.

**Example 4**

An Internet sales site randomly sent customers to one of three versions of its welcome page and recorded how long each visitor stayed on the site. Additionally, analysts want to know whether customers who came directly to the site (by typing in the URL) behave differently than those who were referred to the site from other sources (such as search engines). They decide to block by how the customers arrived. Draw a diagram of their experimental design.

**Solution**