We are now getting into the more computational part of statistics. When we describe a distribution, there are three main characteristics to identify: its shape, center, and spread. We already learned about the many shapes distributions can take. The center and spread of a distribution are *numerical summaries* of a data set. There are many ways to describe both the center and spread of a data set.

## 1. Intro

## 2. Measures of Center

In this section we learn three **measures of central tendency**, i.e., three ways to identify the “center” of a data set. You should already be somewhat familiar with these basic ideas.

### The Mode

The **mode** of a variable is a simple measure of center that describes the most frequent (recurring) observation in a data set. For example, in a data set such as {0,1,2,2,2,3,4,8}, the value 2 would be the mode of the data set since it appears the most times. If you refer back to the exam score data shown previously, none of the data values appears more than once. This means that the exam score data has NO mode. The mode is rarely used in serious statistics studies.
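As a quick check of the example above, here is a minimal Python sketch using the standard library's `statistics.multimode`. The second list, `no_repeats`, is hypothetical data made up for illustration (it is not the exam score data), and the `has_mode` test simply encodes the convention used in the text that a data set with no repeated value has no mode.

```python
from statistics import multimode

data = [0, 1, 2, 2, 2, 3, 4, 8]
modes = multimode(data)   # all values tied for the highest frequency
print(modes)              # [2]

# Hypothetical data in which no value repeats.
no_repeats = [61, 70, 76.5, 88, 93]
ties = multimode(no_repeats)

# If every distinct value is "tied" for most frequent, nothing stands out,
# so by the convention in the text the data set has no mode.
has_mode = len(ties) < len(set(no_repeats))
print(has_mode)           # False
```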

### The Arithmetic Mean

The **arithmetic mean** of a variable is often what people mean by the “average.” To calculate it, simply add up all the values and divide by how many there are. In statistics, the symbol x̄ (pronounced x-bar) is used to denote the **mean of a sample**.

As an example, the data shown in the table are the first exam scores for all 17 students from a calculus class. You should verify that the average is:

The mean is only valid for quantitative data and can be thought of as a balance point for the set of data. For example, if you had only three exam scores, such as 81, 85, and 83, you can see how “borrowing” two exam points from the 85 and applying them to the 81 would make all three scores be 83. That is your average value.
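The balance-point idea can be sketched in a few lines of Python. The variable names are illustrative; the three scores are the ones from the paragraph above.

```python
scores = [81, 85, 83]

mean = sum(scores) / len(scores)
print(mean)               # 83.0

# Distance of each score from the mean: points "owed" below the mean
# are exactly offset by points "spare" above it.
deviations = [x - mean for x in scores]
print(deviations)         # [-2.0, 2.0, 0.0]
print(sum(deviations))    # 0.0
```

The two points the 85 has to spare are exactly the two points the 81 is missing, which is why the deviations cancel.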

### The Median

The **median** of a variable, typically denoted by the capital letter M, is another measure of the “center.” The median is simply the numerical value of the data value that occupies the physical middle location of your ordered data set.

After just a moment of thinking, it’s clear that the calculation of the median of a variable is slightly different depending on if there are an odd number of data points, or if there are an even number of data points. (Think about the middle value of 3 data points and the middle value of 4 data points.)

To calculate the median of a data set, arrange the data in order and count the number of observations, *n*. If the value of *n* is odd, then there will be a data value that is exactly in the middle. That data value is the median *M*. If *n* is even, then there will be two values on either side of the exact middle. In this case, the median is defined to be the average of these two data values. In either case, the *location* of the median can be found through the simple calculation:

$$\text{location of } M = \frac{n+1}{2}$$

Please be careful to note that the median is **NOT** the value of the fraction $\frac{n+1}{2}$. The value of this fraction is simply the *location* of the median.

Returning to the example of the first exam scores above, we first sort the scores in ascending order, as shown. Since there are 17 scores, an odd number, there will be an exact middle score. If your data set has just a few values, it is easy to find the middle. However, just to be sure, the location of the median comes from the formula

$$\frac{n+1}{2} = \frac{17+1}{2} = 9$$

Thus, the 9th data value gives us our median, *M* = 76.5.

Now suppose that the student who scored a 46 wasn’t enrolled in class in the first place. That would mean the population has *n* = 16 members (just ignore the 46). In that case, the location of the median would be

$$\frac{n+1}{2} = \frac{16+1}{2} = 8.5$$

Obviously there is not a score in the 8.5th location. Thus, we take the two scores from the 8th and 9th positions, 76.5 and 77, and find their average:

$$M = \frac{76.5 + 77}{2} = 76.75$$

Therefore, the population median would be *M* = 76.75.
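The location rule can be sketched in Python as follows. The function names and the two short lists are hypothetical and chosen only for illustration; the sketch assumes a non-empty list. When the location is a whole number, the values just below and just above it are the same data value, so averaging them returns that value.

```python
import math
from statistics import median

def median_location(n):
    """1-based position of the median in the ordered data: (n + 1) / 2."""
    return (n + 1) / 2

def median_by_location(values):
    """Median found by averaging the values on either side of the location."""
    ordered = sorted(values)
    loc = median_location(len(ordered))
    below = ordered[math.floor(loc) - 1]   # convert 1-based position to 0-based index
    above = ordered[math.ceil(loc) - 1]
    return (below + above) / 2

print(median_location(17))                 # 9.0
print(median_location(16))                 # 8.5

odd_data = [7, 3, 9]                       # hypothetical, n = 3
even_data = [7, 3, 9, 4]                   # hypothetical, n = 4
print(median_by_location(odd_data))        # 7.0
print(median_by_location(even_data))       # 5.5

# Cross-check against the standard library.
print(median(even_data))                   # 5.5
```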

**On the TI-83/84 Calculator:**

First we enter the data into the lists. To do this we hit the STAT key and select the first option, 1:Edit. Then we enter the data into the first list, hitting ENTER after every value.

Then we hit STAT again and then arrow over to the CALC menu and select 1: 1-Var Stats, hit ENTER and then choose your list to be L1 and hit ENTER on the Calculate. We notice that the value of the mean is shown as well as the size of our data set n=13.

If we arrow down further we see some more statistics including the median. Observe that the calculator does not give us the mode. For more details on using the calculator to do statistics check out the Technology Guide on D2L.

### Comparing the Mean and Median Values

Although both the mean and median are measures of the center of a data set, it is rarely the case that the two will be exactly equal. In fact, there are times when the two will be very different. How the mean and median relate to each other tells us a lot about the distribution of the underlying data set.

Basically, it’s important to know that the mean is a measure of center that is *highly* sensitive to changes in the data values. Think about the process of taking an average: *every* data value is used in the computation. If even one data value changes, then the average will change. The change in the value of the mean will be much more drastic if one of the outlying data values changes. In essence, the mean is **not resistant** to changes to extreme values.

On the other hand, the median is a measure of center that is not very sensitive to changes in the data values. Really only one or two data points determine the value of the median. The values of the other data points do not factor at all into the median. They serve only as placeholders, which lead to the middle value. Because of this, the median is **resistant** to changes in extreme values.

**Example:**

| Data | Mean | Median |
|---|---|---|
| {1, 5, 13, 20, 28} | μ = 13.4 | M = 13 |
| {1, 5, 13, 20, 280} | μ = 63.8 | M = 13 |
| {1, 5, 13, 20} | μ = 9.75 | M = 9 |

Notice how drastically the mean changes when 28 is replaced by the extreme value 280, while the median does not change at all. Removing the largest value altogether moves both measures by only a few units.
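The three data sets in this example can be run through Python's standard `statistics` module to see the mean and median side by side. This is only a sketch for checking the arithmetic; the variable names are illustrative.

```python
from statistics import mean, median

data_sets = [
    [1, 5, 13, 20, 28],
    [1, 5, 13, 20, 280],   # largest value replaced by an extreme value
    [1, 5, 13, 20],        # largest value removed
]

# One (mean, median) pair per data set.
summary = [(mean(d), median(d)) for d in data_sets]

for d, (m, med) in zip(data_sets, summary):
    print(d, "mean =", m, "median =", med)
```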

Because of the sensitivity of the mean, it gets pulled in the direction of the tails for skewed data sets. Basically, if the distribution is:

- **Symmetric:** the mean will usually be close to the median
- **Left (or Negative) Skew:** the mean will usually be smaller than the median
- **Right (or Positive) Skew:** the mean will usually be larger than the median

The following picture illustrates the graphical relationship of the mean, median, and mode in the three types of data:

## 3. Measures of Spread

The purpose of identifying a “central” value from a data set was to describe a *typical* value in the data set. Once we know this, we can measure the amount of dispersion or spread of the data values *from the typical, central, value*. In other words, we’re going to calculate how “spread out” our data is. Three main measures of dispersion for a data set are the **range**, the **variance**, and the **standard deviation**.

### The Range

The range of a variable is simply the “distance” between the largest data value and the smallest data value. In math symbols:

Range = largest data value – smallest data value

The table shown provides the first exam scores for a class with 11 students. The range for this data set is:

Range = 95 – 64 = 31

Calculating the range requires the use of only two values: the smallest and largest data values. If either of the two values changes, so does the range. Therefore, the range clearly is **not resistant** to extreme values in the data set. No other data values affect the range.
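To see that only the two extreme values matter, here is a small Python sketch. It reuses the data set {1, 5, 13, 20, 28} from the mean-versus-median example earlier; the helper name `data_range` and the modified lists are illustrative.

```python
def data_range(values):
    """Largest data value minus smallest data value."""
    return max(values) - min(values)

original = [1, 5, 13, 20, 28]
extreme_changed = [1, 5, 13, 20, 280]   # the maximum changes
middle_changed = [1, 5, 19, 20, 28]     # only an interior value changes

print(data_range(original))          # 27
print(data_range(extreme_changed))   # 279
print(data_range(middle_changed))    # 27
```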

### Sample Variance

The variance of a data set is a numerical summary that indicates, roughly, the *average squared deviation of each data value from the mean* of a data set. The calculation of the variance of a data set requires us to compare each data value from our raw list, {$x_1, x_2, x_3, \ldots, x_{n-1}, x_n$}, to the mean $\bar{x}$. The idea of deviation is just the difference, as computed by subtraction. In symbols, the deviation about the mean for the $i$th data value, $x_i$, is the value $(x_i - \bar{x})$.

Because of the definition of the mean of a data set, if you add up the deviation from the mean for each data value, you will always get zero. In symbols, $\sum (x_i - \bar{x}) = 0$. This is a bit technical, but what it basically means is that we cannot just average the deviations. We’d always get zero!

To get around this, we need a way to make all deviations from the mean positive, regardless of whether a data value is below or above the mean. For example, if you live two miles north of a city and I live two miles south of the same city, it would be ridiculous to say, “I live negative 2 miles from the city.” We both live two miles away.

Mathematically, one way to make all deviations positive is to use an absolute value. Another way, which we’ll use for calculating both the variance and standard deviation of a data set, is to **square** each deviation. For the city example, your deviation value would be $2^2 = 4$ and my deviation value would be $(-2)^2 = 4$. Therefore, our squared deviations, regardless of the original sign, would be the same! So, to treat positive differences and negative differences as the same, we square the deviations: $(x_i - \bar{x})^2$.

Finally, since the variance measures the *average squared deviation of each data value from the mean* of the entire data set, we add up the squared-deviation value for each data point and divide by the value (*n* – 1), one fewer than the number of data values. This is another technical “difficulty” that we’ll deal with later. The value (*n* – 1) is given the special designation **degrees of freedom** of a data set. The reason for this will be made clearer throughout the class, but imagine the following simple scenario: You and four friends go to a Chinese restaurant, and at the end of the meal, your server brings your group 5 fortune cookies, setting them in a pile in the middle of the table. How many of your party of 5 get to actually *choose* their fortune? Only 4. The reason is obvious: after 4 people have had their choice of fortune cookies, only one remains. The fifth person has **no choice of fortune**. The degrees of freedom for this “problem” is thus 5 – 1 = 4.

Here’s another example, this time from a math standpoint: if someone tells you that they are thinking of 3 numbers whose average is 5, how many of the three numbers do you need to know before you know all 3? After a little thought, you’ll realize the answer is 2. If you are told that two of the numbers are 2 and 10, a little thought (and some algebra) will help you find that the last number must be

$$3(5) - (2 + 10) = 15 - 12 = 3$$

Again, the degrees of freedom for this problem is 3 – 1 = 2.

In all its glory, the math formula for calculating the sample variance is:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

where *n* is the size of the sample.

**Example:** Calculations for a Sample Variance

Returning to the exam scores for a class with 11 students, the table below illustrates the (sometimes tedious!) calculations for the sample variance. The mean of this data set is *x̄* = 82.

The total squared deviation for the data is **1272**. Therefore, the variance for the data set is:

$$s^2 = \frac{1272}{11 - 1} = 127.2$$

| Score | Deviation From Mean | Squared Deviation |
|---|---|---|
| 94 | 94 – 82 = 12 | 12^{2} = 144 |
| 87 | 87 – 82 = 5 | 5^{2} = 25 |
| 95 | 95 – 82 = 13 | 13^{2} = 169 |
| 68 | – 14 | 196 |
| 72 | – 10 | 100 |
| 75 | – 7 | 49 |
| 88 | 6 | 36 |
| 89 | 7 | 49 |
| 94 | 12 | 144 |
| 76 | – 6 | 36 |
| 64 | – 18 | 324 |
| | | SUM = 1272 |
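The table's arithmetic can be reproduced step by step in Python. This is a sketch meant for checking the hand calculation; the variable names are illustrative, and the last line cross-checks the result against `statistics.variance`, which also divides by *n* – 1.

```python
from statistics import variance

scores = [94, 87, 95, 68, 72, 75, 88, 89, 94, 76, 64]

n = len(scores)
x_bar = sum(scores) / n                       # sample mean
deviations = [x - x_bar for x in scores]      # x_i - x_bar
squared = [d ** 2 for d in deviations]        # (x_i - x_bar)^2

total_squared = sum(squared)
s2 = total_squared / (n - 1)                  # divide by degrees of freedom

print(x_bar)              # 82.0
print(sum(deviations))    # 0.0  (raw deviations always cancel)
print(total_squared)      # 1272.0
print(s2)                 # 127.2
print(variance(scores))   # 127.2
```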

## 4. Sample Standard Deviation

The standard deviation provides the most commonly used measure of the spread of the underlying distribution of data. When comparing two distributions, the larger the standard deviation, the more dispersion the data set has (the data values, as a whole, differ more from the mean). Think about these:

- Would you rather stand in a line where the standard deviation of wait times is 3 minutes or 6 minutes? Why?
- Would you rather invest in a stock whose rate of return has a standard deviation of 8% or 20%? Why?

The standard deviation and the mean are the most popular methods for numerically summarizing the distribution of a symmetric variable. Knowing both the mean (center) and standard deviation (spread) tells us a lot about the distribution. We’ll be using these two values for most types of statistical inference later on.

For those of you scared by the above calculations for variance, you can relax! The standard deviation is simply the square root of the variance:

**standard deviation = √variance**

The **sample standard deviation**, denoted by ***s***, is simply the square root of the **sample variance**:

*s* = √*var* = √*s*^{2}

**Example Calculations for a Sample Standard Deviation**

For the above example of exam scores, the sample variance was *s*^{2} = 127.2. Therefore, the sample standard deviation is:

*s* = √*s*^{2} = √127.2 ≈ 11.2783

On the TI-83/84 Calculator we can use the 1-Var Stats to find the standard deviation on the list of statistics as Sx=11.27829774.
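The same value can be obtained in Python, either by taking the square root of the sample variance or by calling `statistics.stdev` directly. This is a minimal sketch using the 11 exam scores from the variance example; the variable names are illustrative.

```python
import math
from statistics import stdev, variance

scores = [94, 87, 95, 68, 72, 75, 88, 89, 94, 76, 64]

s_from_variance = math.sqrt(variance(scores))  # square root of s^2
s_direct = stdev(scores)                       # sample standard deviation

print(round(s_from_variance, 4))   # 11.2783
print(round(s_direct, 4))          # 11.2783
```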

### Five-Number Summary and Boxplots

There are other important tools to summarize data numerically and/or graphically. A five-number summary is simply a list of 5 numerical values from a data set, one of which you’ve already learned to compute: the median. Boxplots are a way to visualize a five-number summary, and are an excellent way to compare the distributions of two (or more) data sets.

### Five-Number Summary

The five-number summary of a data set consists of the following 5 numbers from the data set:

minimum, first quartile, median, third quartile, maximum

*min*, *Q*_{1}, *M* = *Q*_{2}, *Q*_{3}, *max*

The median of a data set, being the middle value, necessarily splits your data set into two pieces…one consisting of all the values to the left of (and thus lower than) the median and one consisting of all the values to the right of (and thus greater than) the median. The first quartile is simply the median value of the lower half of your data set and the third quartile is the median value of the upper half of your data set. The median of your entire data set is obviously the same as a second quartile. The quartiles split your data into fourths, which means that 25% of the data values fall between the minimum data value and *Q*_{1}, 25% of the data values lie between *Q*_{1} and the median, 25% of the data values lie between the median and *Q*_{3}, and 25% of the data lies between *Q*_{3} and the maximum value of the data set. In addition, 50% of all observations fall between *Q*_{1} and *Q*_{3}. The five-number summary provides a cool way to describe the distribution of a data set.
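The "median of each half" description can be sketched in Python as shown here. One assumption is made that the text does not spell out: when *n* is odd, the median itself is left out of both halves (a common textbook convention; other software may compute quartiles differently). The function name is illustrative, and the sketch assumes at least two data values. The data are the 11 exam scores from the variance example.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                  # values left of the median
    # Assumption: for odd n, skip the middle value so it is in neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

scores = [94, 87, 95, 68, 72, 75, 88, 89, 94, 76, 64]
print(five_number_summary(scores))   # (64, 72, 87, 94, 95)
```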

### Boxplots

An obvious problem with a 5-number summary is that it is simply a list of numbers. A way to visualize the five-number summary is through a **boxplot**. To create a boxplot:

- Draw a horizontal or vertical number line (like an *x*-axis or *y*-axis) that has a scale corresponding to the values within the data set.
- Draw vertical lines at the positions of *Q*_{1}, *Q*_{2} (the median), and *Q*_{3}.
- Enclose these three vertical lines with a box.
- Draw vertical lines at the positions of the minimum and the maximum values of the data set.
- Finally, extend horizontal lines from the central box out toward the minimum and maximum values.

As mentioned earlier, boxplots are very useful when comparing two (or more) data sets. One can simply make comparisons between the data sets using the five-number summary by determining how each data set is distributed.

**Example**

As an example, shown are five-number summaries of three data sets, one corresponding to the home run distances for each of three players, Mark McGwire, Sammy Sosa, and Barry Bonds for a given year.

Creating a boxplot for each player, using a common vertical scale for home run distances allows us to compare the distribution of home run distances for each player.

The distribution shape and boxplot are related. We can identify symmetry or skewness, the quartiles, and the maximum and minimum values of a distribution from a boxplot. And if more than one boxplot is plotted on the same scale, we can visually compare the centers, the spreads, and the extreme values.

**Choosing an Appropriate Measure of Center**

When presented with data, a primary question is: which measure of center best describes it? Let’s look at an example to help us decide.

**Example: Calculating Measures of Center—Mean, Median, and Mode**

Given the recent economy and change of attitude in society, many people chose to take on another job after retiring from one. Below is a sample of ages at which people truly retired; that is, they stopped working for pay. Calculate the mean, median, and mode for the data.

84, 80, 82, 77, 78, 80, 79, 42

**Solution**

**Mean:** Remember, the mean is the sum of all the data points divided by the number of points.

$$\bar{x} = \frac{84 + 80 + 82 + 77 + 78 + 80 + 79 + 42}{8} = \frac{602}{8} = 75.25$$

**Median:** We have an even number of values, so we will need the mean of the middle two values in the ordered array.

42, 77, 78, **79**, **80**, 80, 82, 84

$$M = \frac{79 + 80}{2} = 79.5$$

**Mode:** The number 80 occurs more than any other number, so it is the mode.

As you can see in the figure, of the three measures of center, the mean is closest to the outlier while the median and mode are more similar in value and are not affected by the outlier.
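The three measures for the retirement ages can be checked with Python's `statistics` module. As an extra illustration that is not part of the original example, the sketch also recomputes the mean and median with the outlier 42 removed; the variable names are illustrative.

```python
from statistics import mean, median, multimode

ages = [84, 80, 82, 77, 78, 80, 79, 42]

age_mean = mean(ages)
age_median = median(ages)
age_modes = multimode(ages)

print(age_mean)     # 75.25
print(age_median)   # 79.5
print(age_modes)    # [80]

# Illustration only: drop the outlier and compare.
without_outlier = [a for a in ages if a != 42]
mean_without = mean(without_outlier)
median_without = median(without_outlier)
print(mean_without, median_without)
```

Without the 42, the mean rises by almost five years while the median moves by only half a year, which is the resistance property in action.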

**Determining the Most Appropriate Measure of Center**

- For qualitative data, the mode should be used.
- For quantitative data, the mean should be used, unless the data set contains outliers or is skewed.
- For quantitative data sets that are skewed or contain outliers, the median should be used.

**Example: Choosing the Most Appropriate Measure of Center**

Choose the best measure of center for the following data sets.

- T-shirt sizes (S, M, L, XL) of American women
- Salaries for a professional team of baseball players
- Prices of homes in a subdivision of similar homes
- Professor rankings from student evaluations on a scale of *best*, *average*, and *worst*.

**Solution**

- T-shirt sizes are qualitative, so the mode is the best measure of center.
- The players’ salaries are quantitative data with outliers, since the superstars on the team make substantially more than the typical players. Therefore, the median is the best choice.
- The home prices are quantitative data with no outliers, since the homes are similar. Therefore, the mean is the best choice.
- The rankings are qualitative, so it’s best to use the mode as a measure of center.
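For qualitative data such as T-shirt sizes, the mode is just the most common category, which a frequency count makes easy to find. The list of sizes below is hypothetical and invented purely for illustration.

```python
from collections import Counter

# Hypothetical sample of T-shirt sizes (not real survey data).
sizes = ["M", "L", "S", "M", "XL", "M", "L"]

counts = Counter(sizes)                       # frequency of each category
mode_size, frequency = counts.most_common(1)[0]

print(mode_size, frequency)                   # M 3
```

Notice that a mean or median of these labels would not make sense, since the sizes cannot be added together.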

**Summary**