4. Sample Standard Deviation

The standard deviation provides the most commonly used measure of the spread of the underlying distribution of data. When comparing two distributions, the larger the standard deviation, the more dispersion the data set has (the data values, as a whole, differ more from the mean). Think about these:

  1. Would you rather stand in a line where the standard deviation of wait times is 3 minutes or 6 minutes? Why?
  2. Would you rather invest in a stock whose rate of return has a standard deviation of 8% or 20%? Why?

The standard deviation and the mean are the most popular methods for numerically summarizing the distribution of a symmetric variable. Knowing both the mean (center) and standard deviation (spread) tells us a lot about the distribution. We’ll be using these two values for most types of statistical inference later on.

For those of you scared by the above calculations for variance, you can relax! The standard deviation is simply the square root of the variance:

standard deviation = √variance

The sample standard deviation, denoted by s, is simply the square root of the sample variance:

s = √var = √s²

Example Calculations for a Sample Standard Deviation

For the above example of exam scores, the sample variance was s² = 127.2. Therefore, the sample standard deviation is:

s = √s² = √127.2 ≈ 11.2783

On a TI-83/84 calculator, 1-Var Stats reports the sample standard deviation in its list of statistics as Sx = 11.27829774.

[Figure: TI-83/84 1-Var Stats output]
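
The square-root relationship is easy to check in code. The following is a minimal Python sketch, not part of the original lesson. It starts from the sample variance 127.2 given in the example above, and then uses a small made-up list (`hypothetical_scores`, an assumption for illustration only) to confirm that the standard library's `statistics.stdev` agrees with the square root of `statistics.variance`.

```python
import math
import statistics

# Sample variance taken from the exam-score example in the text.
sample_variance = 127.2
s = math.sqrt(sample_variance)
print(round(s, 4))  # 11.2783

# Hypothetical data, invented only to illustrate the relationship.
hypothetical_scores = [2, 4, 4, 4, 5, 5, 7, 9]

# statistics.variance and statistics.stdev both use the sample (n - 1) formulas.
var_from_data = statistics.variance(hypothetical_scores)
sd_from_data = statistics.stdev(hypothetical_scores)

# The standard deviation is the square root of the variance.
print(math.isclose(sd_from_data, math.sqrt(var_from_data)))  # True
```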

Five-Number Summary and Boxplots

There are other important tools to summarize data numerically and/or graphically. A five-number summary is simply a list of 5 numerical values from a data set, one of which you’ve already learned to compute: the median. Boxplots are a way to visualize a five-number summary, and are an excellent way to compare the distributions of two (or more) data sets.

Five-Number Summary

The five-number summary of a data set consists of the following 5 numbers from the data set:

minimum, first quartile, median, third quartile, maximum

min, Q1, M = Q2, Q3, max

The median of a data set, being the middle value, splits the ordered data set into two halves: the values to the left of the median (no larger than it) and the values to the right of the median (no smaller than it). The first quartile (Q1) is the median of the lower half of the data set, and the third quartile (Q3) is the median of the upper half. The median of the entire data set is the same as the second quartile (Q2).

The quartiles split the data into fourths: about 25% of the data values fall between the minimum and Q1, about 25% lie between Q1 and the median, about 25% lie between the median and Q3, and about 25% lie between Q3 and the maximum. It follows that about 50% of all observations fall between Q1 and Q3. The five-number summary is a compact way to describe the distribution of a data set.
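
The median-of-halves procedure described above can be sketched in Python. This is an illustrative sketch under stated assumptions, not a definitive implementation: the function name `five_number_summary` and the sample list are made up for this example, and when the number of values is odd the median itself is left out of both halves. Calculators and software packages may use slightly different quartile conventions, so their Q1 and Q3 can differ a little from these.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data values")

    half = n // 2
    # Lower half: everything to the left of the median position.
    lower = ordered[:half]
    # Upper half: everything to the right; skip the middle value when n is odd.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]

    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


# Hypothetical data, invented for illustration.
print(five_number_summary([13, 1, 7, 3, 11, 5, 9]))  # (1, 3, 7, 11, 13)
```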

Boxplots

One limitation of a five-number summary is that it is simply a list of numbers. A way to visualize the five-number summary is through a boxplot. To create a boxplot:

  1. Draw a horizontal number line (like an x-axis) with a scale that covers the values in the data set. A vertical number line (like a y-axis) works just as well if you swap "vertical" and "horizontal" in the steps below.
  2. Draw vertical lines at the positions of Q1, Q2 (the median), and Q3.
  3. Enclose these three vertical lines with a box.
  4. Draw vertical lines at the positions of the minimum and the maximum values of the data set.
  5. Finally, extend horizontal lines from the central box out toward the minimum and maximum values.
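
The steps above can be sketched as a short Python function that draws a rough text boxplot from a five-number summary. This is only an illustration: the function name `ascii_boxplot`, the `width` parameter, and the characters used for the box and whiskers are all assumptions made for this sketch, and the input values are hypothetical.

```python
def ascii_boxplot(minimum, q1, med, q3, maximum, width=41):
    """Render a one-line horizontal boxplot from a five-number summary."""
    if not (minimum <= q1 <= med <= q3 <= maximum):
        raise ValueError("values must satisfy min <= Q1 <= median <= Q3 <= max")
    span = maximum - minimum
    if span <= 0:
        raise ValueError("maximum must be greater than minimum")

    def col(value):
        # Step 1: map a data value to a column on the number line.
        return round((value - minimum) / span * (width - 1))

    row = [" "] * width
    # Step 5: horizontal lines running out to the minimum and maximum.
    for i in range(col(minimum), col(maximum) + 1):
        row[i] = "-"
    # Step 3: the box enclosing everything between Q1 and Q3.
    for i in range(col(q1), col(q3) + 1):
        row[i] = "="
    # Steps 2 and 4: marks at Q1, Q3, the minimum, and the maximum.
    for value in (minimum, q1, q3, maximum):
        row[col(value)] = "|"
    # Step 2: the median gets its own mark inside the box.
    row[col(med)] = "M"
    return "".join(row)


# Hypothetical five-number summary: min=0, Q1=10, median=20, Q3=30, max=40.
print(ascii_boxplot(0, 10, 20, 30, 40))
# |---------|=========M=========|---------|
```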

As mentioned earlier, boxplots are very useful when comparing two (or more) data sets. Because each boxplot displays a five-number summary, placing them side by side shows at a glance how the center and spread of each data set compare.

Example

As an example, consider the five-number summaries of three data sets: the home run distances in a given year for each of three players, Mark McGwire, Sammy Sosa, and Barry Bonds.

[Figure: five-number summaries of home run distances for McGwire, Sosa, and Bonds]
Creating a boxplot for each player, using a common vertical scale for home run distances, allows us to compare the distribution of home run distances for each player.

[Figure: boxplots of home run distances for the three players on a common vertical scale]
The distribution shape and boxplot are related. We can identify symmetry or skewness, the quartiles, and the maximum and minimum values of a distribution from a boxplot. And if more than one boxplot is plotted on the same scale, we can visually compare the centers, the spreads, and the extreme values.

Choosing an Appropriate Measure of Center

When presented with data, a primary question is: which measure of center best describes the data? Let's look at an example to help us decide.

Example: Calculating Measures of Center—Mean, Median, and Mode

Given the recent economy and changing attitudes in society, many people choose to take on another job after retiring from one. Below is a sample of ages at which people truly retired; that is, they stopped working for pay. Calculate the mean, median, and mode for the data.

84, 80, 82, 77, 78, 80, 79, 42

Solution

Mean: Remember, the mean is the sum of all the data points divided by the number of points.

mean = (84 + 80 + 82 + 77 + 78 + 80 + 79 + 42) / 8 = 602 / 8 = 75.25

Median: We have an even number of values, so we will need the mean of the middle two values in the ordered array.

42, 77, 78, 79, 80, 80, 82, 84

median = (79 + 80) / 2 = 79.5

Mode: The number 80 occurs more often than any other number, so it is the mode.

As you can see in the figure, of the three measures of center, the mean (75.25) is pulled toward the outlier, 42, while the median (79.5) and mode (80) are close in value and are largely unaffected by the outlier.

[Figure: the retirement ages with the mean, median, and mode marked]
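
If you would like to check these three values yourself, here is a minimal Python sketch using the standard library's `statistics` module. The variable names are made up for this illustration. As an extra step that is not part of the original example, it also recomputes the measures with the outlier 42 removed, to show how much the mean moves compared with the median and mode.

```python
from statistics import mean, median, mode

# Retirement ages from the example.
ages = [84, 80, 82, 77, 78, 80, 79, 42]

print(mean(ages))    # 75.25
print(median(ages))  # 79.5
print(mode(ages))    # 80

# Same data with the outlier (42) removed, for comparison only.
ages_without_outlier = [age for age in ages if age != 42]

print(mean(ages_without_outlier))    # mean rises from 75.25 to 80
print(median(ages_without_outlier))  # median moves only from 79.5 to 80
print(mode(ages_without_outlier))    # mode stays at 80
```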

Determining the Most Appropriate Measure of Center

  1. For qualitative data, the mode should be used.
  2. For quantitative data, the mean should be used, unless the data set contains outliers or is skewed.
  3. For quantitative data sets that are skewed or contain outliers, the median should be used.
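
The three rules above can be written as a small decision function. This is only a sketch: the name `measure_of_center` and its parameters are hypothetical, and the function does not detect skewness or outliers on its own. You still make that judgment by looking at the data and pass it in as `skewed_or_outliers`.

```python
from statistics import mean, median, mode


def measure_of_center(values, quantitative, skewed_or_outliers=False):
    """Pick a measure of center by the three rules and return (name, value)."""
    if not quantitative:
        # Rule 1: qualitative data -> mode.
        return "mode", mode(values)
    if skewed_or_outliers:
        # Rule 3: skewed quantitative data, or data with outliers -> median.
        return "median", median(values)
    # Rule 2: other quantitative data -> mean.
    return "mean", mean(values)


# Hypothetical qualitative data.
print(measure_of_center(["S", "M", "M", "L"], quantitative=False))
# ('mode', 'M')

# Retirement ages from the earlier example, which contain the outlier 42.
ages = [84, 80, 82, 77, 78, 80, 79, 42]
print(measure_of_center(ages, quantitative=True, skewed_or_outliers=True))
# ('median', 79.5)
```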

 

Example: Choosing the Most Appropriate Measure of Center

Choose the best measure of center for the following data sets.

  1. T-shirt sizes (S, M, L, XL) of American women
  2. Salaries for a professional team of baseball players
  3. Prices of homes in a subdivision of similar homes
  4. Professor rankings from student evaluations on a scale of best, average, and worst.

Solution

  1. T-shirt sizes are qualitative, so the mode is the best measure of center.
  2. The players’ salaries are quantitative data with outliers, since the superstars on the team make substantially more than the typical players. Therefore, the median is the best choice.
  3. The home prices are quantitative data with no outliers, since the homes are similar. Therefore, the mean is the best choice.
  4. The rankings are qualitative, so it's best to use the mode as a measure of center.

Summary

[Figure: summary of this section]