1. Statistics in General

Statistics is a discipline that focuses on the collection, organization, and analysis of data (information) to answer questions or make predictions. In most cases it is impossible to know certain characteristics about an entire population, which is simply every member of a group being studied. For example, imagine trying to answer the question: “How many hours of TV, on average, do Americans watch each week?” If you wanted an exact answer, you would have to ask this question of more than 300 million people, record their answers, and finally perform mathematical calculations to reach your conclusion. This is clearly not possible, in terms of both time and expense.

With the science of statistics, however, we can ask the question of a representative sample of individuals from the overall population, and then use results from the sample to infer conclusions about the population. One difference between statistics and other math courses you’ve likely taken is that answers from statistics are not usually 100% accurate. This is because data is variable. With statistics, variability in information (data) leads to conclusions that are not certain. Is this a problem? No!
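To make the idea of sampling concrete, here is a minimal Python sketch. The population is synthetic (randomly generated weekly TV hours, not real survey data), and the names `population` and `sample` are made up for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: weekly TV hours for 100,000 people (synthetic data).
population = [random.uniform(0, 40) for _ in range(100_000)]

# Ask the question of a random sample of 500 people instead of everyone.
sample = random.sample(population, 500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means will usually be close but not identical, and a different sample would give a slightly different answer. That is the variability described above.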

Example 1.1: Identifying Population and Sample

Identify the population and the sample.

  1. In a survey, 359 college students at the University of Jackson were asked if they had tried the October flavor of the month at the campus coffee shop. Eighty-three of the students surveyed said yes.
  2. A survey of 1125 households in the United States found that 24% subscribe to satellite radio.

Solution

  1. Population: All college students at the University of Jackson
     Sample: The 359 college students who were surveyed
  2. Population: All households in the United States
     Sample: The 1125 households in the United States that were surveyed

The characteristics of individuals under study are called variables (because information is different from person to person). Basically, variables fall into two categories. Qualitative (or categorical) variables describe a characteristic about an individual such as hair color, gender, or favorite ice cream flavor. Quantitative variables are numerical variables that can be measured with a scale, such as temperature, weight, height, or distance. Notice that all the quantitative variable examples can be ordered (from least to most, for example), whereas there is no natural “ordering” of hair color or ice cream flavor.

Example 1.2: Classifying Data as Qualitative or Quantitative

Classify the following data as either qualitative or quantitative.

  1. Shades of red paint in a home improvement store
  2. Rankings of the most popular paint colors for the season
  3. Amount of red primary dye necessary to make one gallon of each shade of red paint
  4. Numbers of paint choices available at several stores

Solution

  1. Shades of paint are descriptions and cannot be measured, so these are qualitative data.
  2. Rankings are numeric but not measurements or counts, so these are qualitative data.
  3. The amounts of dye needed are measured and therefore are quantitative data.
  4. The numbers of paint choices must be counted, so they are quantitative data as well.

Quantitative variables can be further classified as either discrete (those with a finite or countable number of possible values) or continuous (those with an uncountably infinite number of possible values, such as any value in an interval). An example of a discrete variable is the number of offspring a raccoon produces each year. Variables such as height and weight are continuous.

Example 1.5: Classifying Data as Continuous or Discrete

Determine whether the following data are continuous or discrete.

  1. Temperatures in Fahrenheit of cities in South Carolina
  2. Numbers of houses in various neighborhoods in a city
  3. Numbers of elliptical machines in every YMCA in your state
  4. Heights of doors

Solution

  1. Temperatures could be measured to any level of precision based on the thermometer used, so these are continuous data.
  2. Numbers of houses are discrete data because houses are counted in whole numbers. A house under construction is still a house.
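  3. The numbers of elliptical machines are counts, so these are discrete data.
  4. Heights of doors are measurements that could be taken to any level of precision, so these are continuous data.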

One word of warning: although the word quantitative means numerical, that doesn’t mean that numerical variables are automatically classified as quantitative variables. For example, social security numbers and zip codes are numerical, but mathematical operations of adding, subtracting, averaging, or even ordering provide results that make no sense. Suppose you grew up in Buhl, Idaho (zip code 83316) and now live in Manitowoc, WI (zip code 54220). Do the numbers:

section1-1

or

section1-2

provide any useful information about where you have lived? Definitely not! Just for fun, find out where the cities corresponding to the numbers 29096 and 68786 are located.
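As a quick illustration of the warning above, the Python sketch below does arithmetic on the two zip codes from the text. The variable names are made up for illustration. The code runs without complaint, but the results are only more five- or six-digit labels with no geographic meaning.

```python
buhl_zip = 83316        # Buhl, Idaho
manitowoc_zip = 54220   # Manitowoc, WI

# Python treats these as ordinary integers, so the arithmetic "works",
# but neither result tells you anything about where someone has lived.
difference = buhl_zip - manitowoc_zip
total = buhl_zip + manitowoc_zip

print(difference)  # 29096
print(total)       # 137536
```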

2. Visualizing Qualitative Data

Once data has been collected, what you usually have is simply a long list of the raw data. It is very difficult, if not impossible, to determine any patterns or underlying “themes” from a list of data, especially if the data set has more than 20 or so elements. There are three main methods used to summarize qualitative data: in a table (tabular form), in a bar graph, or with a pie chart.

Tables

An easy way to initially summarize the data is with a frequency distribution, which simply lists each main category in the data set, along with the corresponding number of occurrences within each category.

For example, take a bag of plain M&M candies (the ones in the brown package). If you open the bag and simply dump the M&Ms into a bowl, you see lots of colors, but no underlying patterns. If, however, you divide the M&Ms into a separate category for each color (Brown, Red, Blue, Yellow, Green, and Orange), you can count the number of M&Ms of each color. A frequency distribution for a bag of M&Ms might look like:

COLOR     FREQUENCY (NUMBER)
Brown      9
Red       11
Blue      12
Yellow     8
Green      4
Orange    14

If you were to open another bag, your frequencies would likely be somewhat different.
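Tallying categories by hand is exactly what a frequency distribution does. A minimal Python sketch of the same tally follows; the raw list `bag` is rebuilt from the counts in the table above, since the order in which the candies came out of the bag is not recorded.

```python
from collections import Counter

# Rebuild a raw list of candies from the counts in the table above.
counts_from_table = {"Brown": 9, "Red": 11, "Blue": 12,
                     "Yellow": 8, "Green": 4, "Orange": 14}
bag = [color for color, n in counts_from_table.items() for _ in range(n)]

# A frequency distribution is a tally of how often each category occurs.
frequency = Counter(bag)

for color, count in frequency.items():
    print(f"{color:<8}{count:>3}")
```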

It may still be difficult to identify any underlying patterns if your data set has a large number of categories, or if the frequencies are large numbers. In this case, there is a better way to make a table. If you divide the frequency of each category by the total number of observations, you create a relative frequency distribution, which lists the proportion (or percent) of observations within each category relative to the total number of observations. Mathematically, this is easy: we just divide the frequency in each category by the total to get a proportion, and multiply by 100 if we want a percent. For my bag of M&Ms, which had 58 M&Ms in total, the relative frequency distribution would be:

COLOR     RELATIVE FREQUENCY
Brown     0.155 ( = 9/58 )
Red       0.190 ( = 11/58 )
Blue      0.207 ( = 12/58 )
Yellow    0.138 ( = 8/58 )
Green     0.069 ( = 4/58 )
Orange    0.241 ( = 14/58 )

Because every frequency is now related to the total, it is extremely easy to make comparisons among the different categories, since they are now on the same “scale.” One quick check is to add up the percentages. Since we counted every M&M within the bag, and every M&M must belong to one of the six categories, our percentages should add up to 1 ( = 100%). Every once in a while, due to rounding errors, you may not get exactly 1.
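The division and the "adds up to 1" check can be sketched in a few lines of Python. The dictionary below simply restates the M&M counts from the frequency table; the variable names are illustrative.

```python
frequency = {"Brown": 9, "Red": 11, "Blue": 12,
             "Yellow": 8, "Green": 4, "Orange": 14}
total = sum(frequency.values())  # 58 candies in the bag

# Relative frequency = category count / total count, rounded to 3 places.
relative = {color: round(count / total, 3) for color, count in frequency.items()}

for color, share in relative.items():
    print(f"{color:<8}{share:.3f}")

# Quick check: the proportions should add up to 1, apart from rounding error.
check = sum(relative.values())
print(f"sum = {check:.3f}")
```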

Bar Graphs

Although tables are useful, they still aren’t a nice “picture” of the data. In general, visual methods, such as bar graphs, provide a much better summary of data than just a table alone. This doesn’t mean we wasted our time creating a table, because we’ll need it to draw our bar graph!

A bar graph is a graph constructed in the Cartesian coordinate system in which the data categories are listed on the x-axis and a bar (or rectangle) is drawn above each category, with the height of each rectangle equal to that category’s frequency or relative frequency. In addition, a horizontal space separates neighboring bars (this helps distinguish data that fall into distinct categories from data that have a continuous “flow”). Such graphs are typically easy to create by hand, and even easier in computational tools like Excel. One drawback to creating a bar graph by hand is that you need to be very careful and precise when drawing the heights of the rectangles, so as not to provide an inaccurate picture of the data. Two bar graphs for the above distributions are shown below. The graph on the left is the bar graph of our frequency distribution, and the graph on the right is the bar graph of our relative frequency distribution. Notice that the values on the vertical scale in the left bar graph are the counts of M&Ms within each category, and the values on the vertical scale in the right bar graph are the proportions of M&Ms within each category. Both graphs tell the same story and allow for an easier category-to-category comparison than the tables.

section1-3

There are some common good practices in constructing bar graphs that should be followed. With the horizontal scale, categories should be spaced equally apart, and the rectangles should have the same widths. The vertical scale should begin with 0, should be incremented in reasonable steps, and should go somewhat, but not significantly, beyond the largest frequency or relative frequency.
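If no graphing tool is at hand, the same practices can be tried out with a rough text version of a bar graph. The Python sketch below is only a stand-in for a real chart: the bars run horizontally, and the function name `text_bar_graph` is made up for illustration.

```python
frequency = {"Brown": 9, "Red": 11, "Blue": 12,
             "Yellow": 8, "Green": 4, "Orange": 14}

def text_bar_graph(freq):
    """Return one line per category; the bar length equals the frequency."""
    lines = []
    for category, count in freq.items():
        # Every bar starts at 0 and uses one '#' per observation,
        # so bar lengths can be compared directly.
        lines.append(f"{category:<8}|{'#' * count} {count}")
    return lines

lines = text_bar_graph(frequency)
for line in lines:
    print(line)
```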

One more simplification with bar graphs that allows for easier comparisons (especially if you have numerous categories) is to arrange the bars in order of height, starting with the tallest bar on the left, followed by the second tallest, and so on, until the shortest bar, which sits all the way on the right. This special type of bar graph is commonly referred to as a Pareto chart, named after the Italian economist Vilfredo Pareto. A Pareto chart for our relative frequency distribution of M&Ms looks like:

section1-4
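The reordering behind a Pareto chart is just a sort from largest to smallest. A minimal Python sketch, using the relative frequencies from the M&M table (the name `pareto_order` is illustrative):

```python
relative = {"Brown": 0.155, "Red": 0.190, "Blue": 0.207,
            "Yellow": 0.138, "Green": 0.069, "Orange": 0.241}

# Sort categories from the largest relative frequency to the smallest.
pareto_order = sorted(relative.items(), key=lambda item: item[1], reverse=True)

for color, share in pareto_order:
    print(f"{color:<8}{share:.3f}")
```

The first bar drawn would be Orange and the last would be Green.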

Pie Charts

The final graphical method for categorical data is a pie chart. Although rarely used for conveying information in scientific fields (you will seldom open a scientific research paper and see a pie chart), newspapers and other media use them because they are a relatively simple way to get a point across quickly. However, pie charts are not very effective if there are too many categories or if some relative frequencies are too small.

section1-5

To create a pie chart, we need the percent values from our relative frequency distribution. Since a whole pie is 100% of itself, we use a pie piece with an appropriate size to represent the percent of data within each category. Different colors should be used to distinguish each category, and each category should be labeled with the category name and relative frequency. The pie chart above represents the M&M data.
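When drawing a pie chart by hand, the "appropriate size" of each piece is usually worked out as an angle: a full circle is 360 degrees, so each category gets its relative frequency times 360. A minimal Python sketch for the M&M data (variable names are illustrative):

```python
frequency = {"Brown": 9, "Red": 11, "Blue": 12,
             "Yellow": 8, "Green": 4, "Orange": 14}
total = sum(frequency.values())

# Each slice's angle is its share of the full 360-degree circle.
angles = {color: 360 * count / total for color, count in frequency.items()}

for color, angle in angles.items():
    print(f"{color:<8}{angle:6.1f} degrees")
```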

3. Visualizing Quantitative Data

There are more ways to summarize quantitative data than qualitative data because numerical data comes in two forms: discrete or continuous (as mentioned earlier). You will learn how to create tables and histograms of each type of data. Two other summary methods for quantitative data are stem-and-leaf plots and dot plots. These plots are rarely used except as preliminary (quick-and-dirty) techniques for understanding your data.

Discrete Data

The methods for summarizing discrete data are similar to methods used for summarizing qualitative data, since discrete data can be put into separate categories.

Tables

Discrete quantitative data can be presented in tables in several of the same ways as qualitative data: by values listed in a table, by a frequency table, or by a relative frequency table. The only difference is that instead of using category names, we use the discrete values taken by the data.

Histograms

Discrete quantitative data can be presented in bar graphs in the same ways as qualitative data. A bar graph for any type of quantitative data is called a histogram. The discrete values taken by the data are labeled in ascending order across the horizontal axis, and a rectangle is drawn vertically so that the height of each rectangle corresponds to that value’s frequency or relative frequency. The main visual difference between a bar graph (qualitative data) and a histogram (quantitative data) is that there should be no horizontal spacing between numerical values along the horizontal axis. In other words, rectangles touch each other in a histogram.
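The sketch below shows one way to lay out the rows of a histogram for discrete data in Python. The litter sizes are hypothetical values invented for illustration. Note that every whole number between the smallest and largest value gets a slot, in ascending order, even if it never occurs.

```python
from collections import Counter

# Hypothetical discrete data: litter sizes recorded for 15 raccoons.
litter_sizes = [3, 4, 2, 4, 6, 3, 4, 2, 3, 4, 3, 6, 4, 3, 2]

counts = Counter(litter_sizes)

# Walk every whole number from the minimum to the maximum, in ascending order,
# so a value that never occurs (here 5) still gets an empty slot.
histogram_rows = [(value, counts.get(value, 0))
                  for value in range(min(litter_sizes), max(litter_sizes) + 1)]

for value, count in histogram_rows:
    print(f"{value} | {'#' * count}")
```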

Stem-and-Leaf Plots

A stem-and-leaf plot is a graph of quantitative data that is similar to a histogram in the way that it visually displays the distribution. A stem-and-leaf plot retains the original data. The leaves are usually the last digit in each data value and the stems are the remaining digits. A legend, sometimes called a key, should be included so that the reader can interpret the information.

Constructing a Stem-and-Leaf Plot

  1. Create two columns, one on the left for stems and one on the right for leaves.
  2. List each stem that occurs in the data set in numerical order. Each stem is normally listed only once; however, the stems are sometimes listed two or more times if splitting the leaves would make the data set’s features clearer.
  3. List each leaf next to its stem. Each leaf will be listed as many times as it occurs in the original data set. There should be as many leaves as there are data values. Be sure to line up the leaves in straight columns so that the table is visually accurate.
  4. Create a key to guide interpretation of the stem-and-leaf plot.
  5. If desired, put the leaves in numerical order to create an ordered stem-and-leaf plot.
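The steps above can be sketched in Python as follows. The scores are hypothetical two-digit values invented for illustration, and the helper name `stem_and_leaf` is made up. The sketch assumes non-negative whole numbers and uses the last digit as the leaf.

```python
from collections import defaultdict

# Hypothetical scores, invented for illustration.
scores = [18, 21, 25, 19, 30, 22, 27, 21, 33, 24, 16, 28]

def stem_and_leaf(values):
    """Map each stem (all but the last digit) to its leaves (the last digits)."""
    plot = defaultdict(list)
    for value in sorted(values):       # sorting first gives an ordered plot
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(scores)
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
print("Key: 2 | 1 = 21")
```

Every data value contributes exactly one leaf, so the number of leaves printed matches the number of scores.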

 Example 1.3: Creating a Stem-and-Leaf Plot

Create a stem-and-leaf plot of the following ACT scores from a group of college freshmen.

1.3 Creating A Stem and Leaf Plot

Solution

1.3 Creating A Stem and Leaf Plot - Answer

Continuous Data

Continuous data has an infinite number of possibilities (like weights, heights, and times). In terms of summarizing techniques, the main difference between discrete data and continuous data is that continuous data cannot directly be put into frequency tables since they do not have any obvious categories (you cannot create a table or histogram with an infinite number of categories).

To get around this, categories are created using classes, or intervals (ranges) of numbers. Each class has a lower class limit, which is the smallest value within the class, and an upper class limit, which is the largest value within the class. The class width is the difference between the lower class limits of two consecutive classes (for classes “5 – 5.9” and “6 – 6.9,” the class width is 6 − 5 = 1). Finally, if a class does not have a lower or upper class limit (e.g., “shorter than 4 feet” or “60 and older”), the class is said to be open ended.

Tables

Once classes are established for a continuous variable, each data value will belong to one (and only one) class. Counts of the number of data values within each class can now be made, resulting in a table of either a frequency distribution (raw counts) or of a relative frequency distribution (percentage).

Some good practices for constructing tables for continuous variables are listed below. The word “reasonable” in the last two points is very subjective.

    • Classes should not overlap.
    • Classes should not have any gaps between them.
    • Classes should have the same width (except for possible open-ended classes at the extreme low or extreme high ends).
    • Class boundaries should be reasonable numbers.
    • A class width should be a reasonable number.

Histograms

Once a table of values has been created for your continuous data, a histogram is created, as before, where the classes now make up the horizontal scale (remember that in a histogram, the rectangles touch each other). One drawback with histograms of continuous data is that when changing the class width, the appearance of the histogram can change dramatically.

A Continuous Data Example

section1-6

The included table presents the percentage of each state’s residents who were living in poverty in 2002. The data includes a data value for the District of Columbia so there are 51 data values. I’ll describe the process for creating frequency and relative frequency tables and their corresponding histograms.

Frequency and Relative Frequency Tables

A look over the raw data (which itself illustrates the problem of obtaining information from a raw list of numbers) shows that the lowest poverty percentage is 5.6% and the highest is 18.0%. Since percentages are typically not discrete, we must create classes of percentages to begin summarizing the data. Let’s use a class width of 1 (percent), with the first class beginning at 5%.

When labeling the classes, you must be careful so that there is no overlap. If we had classes labeled “5 – 6,” “6 – 7,” “7 – 8,” and so on, confusion would arise when you encounter a percentage of, say, 7.0%. Which class does the data value fall into: “6 – 7” or “7 – 8”? Therefore, be careful and label the classes “5 – 5.9,” “6 – 6.9,” “7 – 7.9,” and so on.
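One way to see that non-overlapping labels leave no ambiguity is to compute the class directly from the data value. The Python sketch below is an assumption-laden illustration: the function name `class_label` is made up, and it assumes the data are recorded to one decimal place, as the poverty percentages are.

```python
import math

def class_label(value, start=5, width=1):
    """Return the 'lower – upper' label of the class that contains value."""
    # Classes cover [start, start + width), [start + width, start + 2*width), ...
    index = math.floor((value - start) / width)
    lower = start + index * width
    upper = lower + width - 0.1   # assumes data recorded to one decimal place
    return f"{lower} – {upper:.1f}"

# Hypothetical values recorded to one decimal place.
for value in [5.6, 7.0, 9.9, 18.0]:
    print(value, "->", class_label(value))
```

A value of 7.0 lands in “7 – 7.9” and nowhere else. Calling `class_label(7.0, width=2)` gives “7 – 8.9” for a wider set of classes.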

Class        Frequency   Relative Frequency
5 – 5.9          1       0.0196 ( = 1/51)
6 – 6.9          1       0.0196
7 – 7.9          3       0.0588
8 – 8.9          7       0.1373
9 – 9.9          9       0.1765
10 – 10.9        6       0.1176
11 – 11.9        5       0.0980
12 – 12.9        3       0.0588
13 – 13.9        5       0.0980
14 – 14.9        4       0.0784
15 – 15.9        1       0.0196
16 – 16.9        2       0.0392
17 – 17.9        3       0.0588
18 – 18.9        1       0.0196

How you round the percentages in the “Relative Frequency” column is a matter of taste. Just make sure not to round too much or to carry unnecessary decimals.
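The relative frequency column can be reproduced with a short Python sketch. The frequencies are copied from the table above and keyed by each class's lower limit; rounding to four decimal places is one choice among many.

```python
# Frequencies from the class-width-1 table, keyed by each class's lower limit.
frequency = {5: 1, 6: 1, 7: 3, 8: 7, 9: 9, 10: 6, 11: 5,
             12: 3, 13: 5, 14: 4, 15: 1, 16: 2, 17: 3, 18: 1}

total = sum(frequency.values())  # 51: the 50 states plus the District of Columbia

# Relative frequency = class count / total count, rounded to 4 places.
relative = {lower: round(count / total, 4) for lower, count in frequency.items()}

for lower, share in relative.items():
    print(f"{lower} – {lower}.9   {frequency[lower]:>2}   {share:.4f}")
```

Because each entry is rounded separately, the rounded column may add up to slightly less or more than 1.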

Histograms of the Frequency and Relative Frequency Table Results

I’m using Excel for this, although if you’re careful, you can draw them by hand.

section1-7

Notice that, as with bar graphs, both histograms have exactly the same shape. The only difference is the vertical scale. It’s up to you, and the point you’re trying to get across to your audience, which graphical summary to present.

If we repeat the above process with a class width of 2, rather than 1, we get the following table and histograms:

Class        Frequency   Relative Frequency
5 – 6.9          2       0.0392 ( = 2/51)
7 – 8.9         10       0.1961
9 – 10.9        15       0.2941
11 – 12.9        8       0.1569
13 – 14.9        9       0.1765
15 – 16.9        3       0.0588
17 – 18.9        4       0.0784

section1-8
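Widening the classes from 1 to 2 amounts to merging neighboring classes in pairs. A minimal Python sketch of that merge, starting from the width-1 frequencies (the names `width_1` and `width_2` are illustrative):

```python
# Frequencies for class width 1, keyed by each class's lower limit.
width_1 = {5: 1, 6: 1, 7: 3, 8: 7, 9: 9, 10: 6, 11: 5,
           12: 3, 13: 5, 14: 4, 15: 1, 16: 2, 17: 3, 18: 1}

# Merge neighboring classes in pairs: 5 and 6 -> "5 – 6.9", 7 and 8 -> "7 – 8.9", ...
width_2 = {}
for lower, count in width_1.items():
    merged_lower = 5 + 2 * ((lower - 5) // 2)
    width_2[merged_lower] = width_2.get(merged_lower, 0) + count

for lower, count in width_2.items():
    print(f"{lower} – {lower + 1}.9   {count:>2}   {count / 51:.4f}")
```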

Notice the differences between the histograms for class widths of 1% and 2%. Which histogram better conveys the information from the data? Why? Detail can be lost when too much data is grouped together. On the other hand, if too many classes contain few or no data values (counts of 0 or 1), then the classes are likely too narrow. You may have to experiment with a few histograms and choose the one that looks best. Again, this can be a matter of taste.

Describing the Shape of a Histogram

One important characteristic of continuous data is that we can describe its distribution. This results from continuous data having a unique ordering of possible values (since it is numerical). In plain terms, a distribution shows how data values are spread out (or distributed) across all possible results. We will focus mainly on one specific distribution in class, the Normal distribution, but you should keep in mind that this is only one of infinitely many possible distributions of data. Wikipedia has an impressive list of many of the more important discrete and continuous distributions that often arise in statistical studies. See:

http://en.wikipedia.org/wiki/List_of_probability_distributions

There is a common vocabulary for describing the overall shape of data distributions. Are all the rectangles (roughly) the same height for each category? Is there one central peak, with frequencies falling off in either direction? Here are the common shapes we’ll encounter:

Uniform Distribution: A distribution where each of the values tends to occur with the same frequency. In this case, the histogram looks flat.

Bell-Shaped Distribution: A distribution where most of the values fall in the middle (a central peak), and the frequencies tail off to the left and to the right. A bell-shaped distribution is symmetric: both the right and left sides have (roughly) the same shape.

Right-Skewed Distribution: A distribution that is not symmetric, and where the tail to the right is longer than the tail to the left.

Left-Skewed Distribution: A distribution that is not symmetric, and where the tail to the left is longer than the tail to the right.

If you look at the histograms of class width 1% created for the previous example, it is clear that these are right-skewed distributions, since the peak of values falls around 9%, and there is a tail of values extending far to the right, all the way out to 18%.
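A rough way to confirm the visual impression is to locate the peak class and count how many classes lie on each side of it. The Python sketch below does this for the width-1 frequencies. It is a crude heuristic for illustration only, not a formal measure of skewness.

```python
# Frequencies for classes 5 – 5.9 up to 18 – 18.9 (class width 1).
frequency = [1, 1, 3, 7, 9, 6, 5, 3, 5, 4, 1, 2, 3, 1]
lower_limits = list(range(5, 19))

# The peak is the class with the largest frequency.
peak_index = frequency.index(max(frequency))
peak_class = lower_limits[peak_index]

# Count how many classes lie on each side of the peak.
classes_left = peak_index
classes_right = len(frequency) - 1 - peak_index

print(peak_class, classes_left, classes_right)
```

The peak sits in the class starting at 9, with 4 classes to its left and 9 to its right, which matches the longer right tail seen in the histogram.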

Time-Series Data

section1-9

When a variable of interest is measured at different points in time, the data is time-series data. When such data is graphed with time on the horizontal axis, the graph is called a time-series plot. Time-series plots are used to identify long-term trends in a variable, or to identify regularly recurring patterns. The time-series plot shown here illustrates how fetal head circumference changes throughout a gestational period. Notice the roughly linear pattern. This information can be used to identify changes in or problems with fetal growth.

Careful Graphical Techniques

Statistical displays can distort the truth. It’s frightening how easy it can be to mislead or even deceive people with graphical summaries. If you unintentionally distort the truth with a bad graph, this is called misleading. If you intentionally distort your results, this is called deceiving. Many people, including scientists and especially the media, are guilty of both.

section1-10

If you want to illustrate a certain point, but your graphical summary does not support your point, it is relatively simple to make changes (for example, to rectangle heights or vertical/horizontal scales) that will then lend credibility to your point. For example, look at the two graphs provided here, which illustrate the average distance of PGA golfers’ drives off the tee for the 10-year period 1997 – 2006. The graph on the left shows a sharp increase in driving distance over the years, while the graph on the right shows little, if any, increase in driving distance at all. Both are from the same set of data! The difference between the two graphs is the choice of vertical scale. While the graph on the left shows distances from 250 – 300 yards, the graph on the right shows distances from 0 – 400 yards. In other words, the graph on the left has zoomed in on the relevant range of distances.
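The effect of the vertical scale can be put into numbers by asking what fraction of the axis height a given change occupies. In the Python sketch below, the two yearly averages are hypothetical values chosen for illustration (the actual PGA figures are not listed in the text), and `share_of_axis` is a made-up helper name.

```python
def share_of_axis(low_value, high_value, axis_min, axis_max):
    """Fraction of the vertical axis covered by the change from low to high."""
    return (high_value - low_value) / (axis_max - axis_min)

# Hypothetical averages in yards; the real PGA figures are not given in the text.
first_year, last_year = 268, 289

zoomed = share_of_axis(first_year, last_year, 250, 300)
full = share_of_axis(first_year, last_year, 0, 400)

print(f"250-300 axis: {zoomed:.0%} of the height")
print(f"0-400 axis:   {full:.0%} of the height")
```

The same 21-yard change fills a large part of the zoomed-in axis but only a sliver of the 0 – 400 axis, which is why the two graphs leave such different impressions.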

Although no social or political harm can come from misleading the public on golf driving distances, there are many issues where certain groups can easily deceive the public by showing graphs in ways that illustrate the point(s) they are trying to get across. It is up to you to take a few moments when you see a graphical summary of someone’s data to see what is actually going on.

Summary

Qualitative Data

Summary of Qualitative Data - Visual

Quantitative Data

Summary of Quantitative Data