3. Visualizing Quantitative Data

There are more ways to summarize quantitative data than qualitative data because numerical data comes in two forms: discrete or continuous (as mentioned earlier). You will learn how to create tables and histograms of each type of data. Two other summary methods for quantitative data are stem-and-leaf plots and dot-plots. These plots are rarely used except as preliminary (quick-and-dirty) techniques for understanding your data.

Discrete Data

The methods for summarizing discrete data are similar to methods used for summarizing qualitative data, since discrete data can be put into separate categories.

Tables

Discrete quantitative data can be presented in tables in several of the same ways as qualitative data: by values listed in a table, by a frequency table, or by a relative frequency table. The only difference is that instead of using category names, we use the discrete values taken by the data.
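As a rough illustration of these table types, the following Python sketch builds a frequency table and a relative frequency table for a small discrete data set. The data values (number of siblings reported by 10 students) are made up for illustration and do not come from this text.

```python
from collections import Counter

# Hypothetical discrete data (made up for illustration):
# number of siblings reported by 10 students.
siblings = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0]

def frequency_table(values):
    """Return (value, frequency, relative frequency) rows in ascending order of value."""
    counts = Counter(values)
    n = len(values)
    return [(v, counts[v], counts[v] / n) for v in sorted(counts)]

table = frequency_table(siblings)
print("Value  Frequency  Relative Frequency")
for value, freq, rel in table:
    print(f"{value:<6} {freq:<10} {rel:.2f}")
```

The rows are keyed by the discrete values themselves (0, 1, 2, 3) rather than by category names, which is the only change from the qualitative case.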

Histograms

Discrete quantitative data can be presented in bar graphs in the same ways as qualitative data. A bar graph for any type of quantitative data is called a histogram. The discrete values taken by the data are labeled in ascending order across the horizontal axis, and a rectangle is drawn above each value so that its height corresponds to that value’s frequency or relative frequency. The main visual difference between a bar graph (qualitative data) and a histogram (quantitative data) is that a histogram has no horizontal spacing between adjacent rectangles. In other words, rectangles touch each other in a histogram.

Stem-and-Leaf Plots

A stem-and-leaf plot is a graph of quantitative data that is similar to a histogram in the way that it visually displays the distribution. A stem-and-leaf plot retains the original data. The leaves are usually the last digit in each data value and the stems are the remaining digits. A legend, sometimes called a key, should be included so that the reader can interpret the information.

Constructing a Stem‑and-Leaf Plot

  1. Create two columns, one on the left for stems and one on the right for leaves.
  2. List each stem that occurs in the data set in numerical order. Each stem is normally listed only once; however, the stems are sometimes listed two or more times if splitting the leaves would make the data set’s features clearer.
  3. List each leaf next to its stem. Each leaf will be listed as many times as it occurs in the original data set. There should be as many leaves as there are data values. Be sure to line up the leaves in straight columns so that the table is visually accurate.
  4. Create a key to guide interpretation of the stem‑and-leaf plot.
  5. If desired, put the leaves in numerical order to create an ordered stem-and-leaf plot.
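The steps above can be sketched in Python as follows. This is a minimal version that assumes non-negative whole-number data, uses the last digit as the leaf, and lists each stem once. The scores are hypothetical values made up for illustration, and the function name stem_and_leaf is likewise an illustrative choice.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of an ordered stem-and-leaf plot for non-negative integers.

    The leaf is the last digit of each value; the stem is the remaining digits.
    """
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):            # sorting first gives an ordered plot (step 5)
        stem, leaf = divmod(value, 10)
        leaves_by_stem[stem].append(leaf)   # one leaf per data value (step 3)

    stems = sorted(leaves_by_stem)          # each occurring stem once, in order (step 2)
    stem_width = len(str(stems[-1]))
    lines = []
    for stem in stems:
        leaves = " ".join(str(leaf) for leaf in leaves_by_stem[stem])
        lines.append(f"{stem:>{stem_width}} | {leaves}")

    smallest = min(values)                  # key for interpreting the plot (step 4)
    lines.append(f"Key: {smallest // 10} | {smallest % 10} = {smallest}")
    return lines

# Hypothetical test scores (made up for illustration).
scores = [18, 21, 25, 19, 30, 22, 27, 21, 24, 33]
for line in stem_and_leaf(scores):
    print(line)
```

Separating leaves with a single space keeps them lined up in columns as long as every leaf is one digit, which is what makes the plot read like a sideways histogram.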

 Example 1.3: Creating a Stem-and-Leaf Plot

Create a stem-and-leaf plot of the following ACT scores from a group of college freshmen.

[Figure: ACT scores for Example 1.3]

Solution

[Figure: stem-and-leaf plot of the ACT scores (answer to Example 1.3)]

Continuous Data

Continuous data can take infinitely many possible values (like weights, heights, and times). In terms of summarizing techniques, the main difference between discrete data and continuous data is that continuous data cannot directly be put into frequency tables, since it does not have any obvious categories (you cannot create a table or histogram with an infinite number of categories).

To get around this, categories are created using classes, or intervals (ranges) of numbers. Each class has a lower class limit, which is the smallest value within the class, and an upper class limit, which is the largest value within the class. The class width is the difference between the lower class limits of two consecutive classes (for example, the classes “5 – 5.9” and “6 – 6.9” have a class width of 1). Finally, if a class does not have a lower or upper class limit (e.g., “shorter than 4 feet” or “60 and older”), the class is said to be open ended.

Tables

Once classes are established for a continuous variable, each data value will belong to one (and only one) class. Counts of the number of data values within each class can now be made, resulting in a table of either a frequency distribution (raw counts) or a relative frequency distribution (proportions or percentages).
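One way to carry out this counting is sketched below in Python. It assumes equal-width classes that each include their lower limit and exclude the next class’s lower limit, so every value lands in exactly one class. The commute times are made-up values for illustration, and the function name class_frequencies is hypothetical.

```python
import math

def class_frequencies(values, start, width, num_classes):
    """Count how many values fall in each class [lower, lower + width)."""
    counts = [0] * num_classes
    for x in values:
        index = math.floor((x - start) / width)   # which class x belongs to
        if not 0 <= index < num_classes:
            raise ValueError(f"{x} falls outside the classes")
        counts[index] += 1
    return counts

# Hypothetical continuous data (made up for illustration): commute times in minutes.
times = [12.5, 7.0, 22.1, 15.0, 9.9, 18.4, 10.0, 27.3, 14.2, 5.5]

counts = class_frequencies(times, start=5, width=5, num_classes=5)
relative = [c / len(times) for c in counts]

for i, (c, r) in enumerate(zip(counts, relative)):
    lower = 5 + 5 * i
    print(f"{lower} to under {lower + 5}: frequency {c}, relative frequency {r:.2f}")
```

Because each value is assigned a single class index, the frequencies always add up to the number of data values.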

Some good practices for constructing tables for continuous variables are listed below. The word “reasonable” in the last two points is very subjective.

    • Classes should not overlap.
    • Classes should not have any gaps between them.
    • Classes should have the same width (except for possible open-ended classes at the extreme low or extreme high ends).
    • Class boundaries should be reasonable numbers.
    • A class width should be a reasonable number.

Histograms

Once a table of values has been created for your continuous data, a histogram is created, as before, where the classes now make up the horizontal scale (remember that in a histogram, the rectangles touch each other). One drawback with histograms of continuous data is that when changing the class width, the appearance of the histogram can change dramatically.

A Continuous Data Example

[Table: percentage of each state’s residents living in poverty, 2002]
The included table presents the percentage of each state’s residents who were living in poverty in 2002. The data includes a data value for the District of Columbia, so there are 51 data values. I’ll describe the process for creating frequency and relative frequency tables and their corresponding histograms.

Frequency and Relative Frequency Tables

In just looking over the data (which illustrates the problem of obtaining information from just a raw list of numbers), the lowest percentage of poverty is 5.6% and the highest is 18.0%. Since percentages are typically not discrete, we must create classes of percentages to begin summarizing the data. Let’s use a class width of 1 (percent) with the first class limit beginning at 5%.

When labeling the classes, you must be careful so that there is no overlap. If we had classes labeled “5 – 6,” “6 – 7,” “7 – 8,” and so on, confusion arises when you encounter a percentage of, say, 7.0%. Which class does the data value fall into: “6 – 7” or “7 – 8”? Therefore, be careful and label each class as “5 – 5.9,” “6 – 6.9,” “7 – 7.9,” and so on.
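The labeling rule can be sketched in Python as follows. This is an illustration only: the function name class_labels and the step parameter (the precision of the recorded data, here one decimal place) are assumptions, not something defined in the text.

```python
import math

def class_labels(start, width, num_classes, step=0.1):
    """Build non-overlapping class labels such as '5 – 5.9'.

    step is the precision of the data (0.1 for values recorded to one decimal),
    so each upper limit stops one step short of the next lower limit.
    """
    labels = []
    for i in range(num_classes):
        lower = start + i * width
        upper = lower + width - step
        labels.append(f"{lower} – {upper:.1f}")
    return labels

labels = class_labels(start=5, width=1, num_classes=14)

# A boundary value such as 7.0 now belongs to exactly one class.
index_of_7 = math.floor((7.0 - 5) / 1)
print(labels[index_of_7])
```

With these labels, 7.0% falls only in the class whose lower limit is 7, so the ambiguity described above disappears.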

Class        Frequency   Relative Frequency
5 – 5.9      1           0.0196 ( = 1/51)
6 – 6.9      1           0.0196
7 – 7.9      3           0.0588
8 – 8.9      7           0.1373
9 – 9.9      9           0.1765
10 – 10.9    6           0.1176
11 – 11.9    5           0.0980
12 – 12.9    3           0.0588
13 – 13.9    5           0.0980
14 – 14.9    4           0.0784
15 – 15.9    1           0.0196
16 – 16.9    2           0.0392
17 – 17.9    3           0.0588
18 – 18.9    1           0.0196

How you round the percentages in the “Relative Frequency” column is a matter of taste. Just make sure not to round too much or to carry unnecessary decimals.
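For readers who would rather compute than divide by hand, here is a short Python sketch of the relative frequency calculation. The frequencies are copied from the class-width-1 table; rounding to four decimals is simply the choice made in this example.

```python
# Frequencies from the class-width-1 table (classes 5 – 5.9 through 18 – 18.9).
frequencies = [1, 1, 3, 7, 9, 6, 5, 3, 5, 4, 1, 2, 3, 1]

total = sum(frequencies)                                # number of data values
relative = [round(f / total, 4) for f in frequencies]   # each frequency divided by the total

for f, r in zip(frequencies, relative):
    print(f"{f:>2}  {r:.4f}")

# Rounded values need not add up to exactly 1, which is one cost of rounding.
print(f"Sum of rounded relative frequencies: {sum(relative):.4f}")
```

The unrounded relative frequencies always sum to 1; any small shortfall in the rounded column is purely a rounding effect.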

Histograms of the Frequency and Relative Frequency Table Results

I’m using Excel for this, although if you’re careful, you can draw them by hand.

[Figure: frequency and relative frequency histograms of the poverty data, class width 1%]

Notice that, as with bar graphs, both histograms have exactly the same shape. The only difference is the vertical scale. It’s up to you, and the point you’re trying to get across to your audience, which graphical summary to present.
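If Excel is not at hand, a rough text version of the frequency histogram can be produced with a few lines of Python. This sketch draws the histogram sideways, one "#" per data value, using the frequencies from the class-width-1 table; the function name text_histogram is an illustrative choice.

```python
def text_histogram(labels, counts):
    """Return one line per class, drawing one '#' for each data value in the class."""
    pad = max(len(label) for label in labels)
    return [f"{label:<{pad}} | {'#' * count}" for label, count in zip(labels, counts)]

# Class labels and frequencies from the class-width-1 table.
labels = [f"{lower} – {lower + 0.9:.1f}" for lower in range(5, 19)]
frequencies = [1, 1, 3, 7, 9, 6, 5, 3, 5, 4, 1, 2, 3, 1]

lines = text_histogram(labels, frequencies)
for line in lines:
    print(line)
```

A relative frequency version would divide every count by 51, which rescales all the bars by the same factor and leaves the shape unchanged.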

If we repeat the above process with a class width of 2, rather than 1, we get the following table and histograms:

Class        Frequency   Relative Frequency
5 – 6.9      2           0.0392 ( = 2/51)
7 – 8.9      10          0.1961
9 – 10.9     15          0.2941
11 – 12.9    8           0.1569
13 – 14.9    9           0.1765
15 – 16.9    3           0.0588
17 – 18.9    4           0.0784

[Figure: frequency and relative frequency histograms of the poverty data, class width 2%]

Notice the differences between the histograms for class widths of 1% and 2%. Which histogram better conveys the information from the data? Why? Detail can be lost when grouping too much data together. On the other hand, if too many classes have few or no data values (counts of 0 or 1), then the histogram is likely too specific. You may have to experiment with a few histograms and choose the one that looks best. Again, this can be a matter of taste.
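Because both sets of classes start at 5% and each width-2 class is made up of two adjacent width-1 classes, the width-2 frequencies can be obtained by adding neighboring width-1 frequencies. A small Python sketch of that arithmetic, using the frequencies from the two tables:

```python
# Frequencies from the class-width-1 table (5 – 5.9 through 18 – 18.9).
width1 = [1, 1, 3, 7, 9, 6, 5, 3, 5, 4, 1, 2, 3, 1]

# Add each pair of adjacent classes: (5 – 5.9) + (6 – 6.9) gives 5 – 6.9, and so on.
width2 = [width1[i] + width1[i + 1] for i in range(0, len(width1), 2)]

print(width2)
print(sum(width1), sum(width2))   # the total number of data values is unchanged
```

Widening the classes halves the number of bars while keeping the same 51 data values, which is why the wider histogram looks smoother but shows less detail.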

Describing the Shape of a Histogram

One important characteristic of continuous data is that we can describe its distribution. This is possible because continuous data, being numerical, has a natural ordering of possible values. In plain terms, a distribution shows how data values are spread out (or distributed) across all possible results. We will focus mainly on one specific distribution in class, the Normal distribution, but you should keep in mind that this is only one of infinitely many possible distributions of data. Wikipedia has an impressive list of many of the more important discrete and continuous distributions that often arise in statistical studies. See:

http://en.wikipedia.org/wiki/List_of_probability_distributions

There is a common vocabulary for describing the overall shape of data distributions. Are all the rectangles (roughly) the same height for each category? Is there one central peak, with frequencies falling off in either direction? Here are the common shapes we’ll encounter:

Uniform Distribution: A distribution where each of the values tends to occur with the same frequency. In this case, the histogram looks flat.

Bell-Shaped Distribution: A distribution where most of the values fall in the middle (a central peak), and the frequencies tail off to the left and to the right. A bell-shaped distribution is called symmetric, where both the right and left sides have (roughly) the same shape.

Right-Skewed Distribution: A distribution that is not symmetric, and where the tail to the right is longer than the tail to the left.

Left-Skewed Distribution: A distribution that is not symmetric, and where the tail to the left is longer than the tail to the right.

If you look at the histograms of class width 1% created for the previous example, it is clear that these are right-skewed distributions, since the peak of values falls around 9%, and there is a tail of values extending far to the right, all the way out to 18%.
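The same observation can be made numerically by locating the peak class and counting how many classes lie on each side of it. The Python sketch below is only a crude heuristic built directly on the tail-length definitions above, not a standard measure of skewness; judging shape is ultimately done by eye.

```python
# Lower class limits and frequencies from the class-width-1 table.
lowers = list(range(5, 19))
frequencies = [1, 1, 3, 7, 9, 6, 5, 3, 5, 4, 1, 2, 3, 1]

peak = frequencies.index(max(frequencies))    # position of the tallest rectangle
left_tail = peak                              # number of classes below the peak
right_tail = len(frequencies) - 1 - peak      # number of classes above the peak

if right_tail > left_tail:
    shape = "right-skewed"
elif left_tail > right_tail:
    shape = "left-skewed"
else:
    shape = "roughly symmetric"

print(f"Peak class starts at {lowers[peak]}%")
print(f"Classes to the left: {left_tail}, to the right: {right_tail} ({shape})")
```

Here the peak sits in the class beginning at 9%, with far more classes stretching to the right than to the left, which matches the visual impression of a right-skewed histogram.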

Time-Series Data

[Figure: time-series plot of fetal head circumference over the gestational period]

When a variable of interest is measured at different points in time, the data is time-series data. When such data is graphed, time is the variable on the horizontal axis, and the resulting plot is called a time-series plot. Time-series plots are used to identify long-term trends in a variable, or to identify regularly occurring patterns. This time-series plot illustrates how fetal head circumference changes throughout a gestational period. Notice the roughly linear pattern. This information can be used to identify changes in or problems with fetal growth.

Careful Graphical Techniques

Statistical displays can distort the truth. It’s frightening how easy it can be to mislead or even deceive people with graphical summaries. If you unintentionally distort the truth with a bad graph, this is called misleading. If you intentionally distort your results, this is called deceiving. Many people, including scientists and especially the media, are guilty of both.

[Figure: two graphs of average PGA driving distance, 1997 – 2006, drawn with different vertical scales]

If you want to illustrate a certain point, but your graphical summary does not support your point, it is relatively simple to make changes (for example, to rectangle heights or vertical/horizontal scales) that will then lend credibility to your point. For example, look at the two graphs provided here, which illustrate the average distance of PGA golfers’ drives off the tee for the 10-year period 1997 – 2006. The graph on the left shows a sharp increase in driving distance over the years, while the graph on the right shows little, if any, increase in driving distance at all. Both are from the same set of data! The difference between the two graphs is the choice of vertical scale. While the graph on the left shows distances from 250 – 300 yards, the graph on the right shows distances from 0 – 400 yards. In other words, the graph on the left has zoomed in on the relevant range of distances.
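The effect of the vertical scale can be put into numbers: a value is drawn at a height equal to (value - axis minimum) / (axis maximum - axis minimum) of the plot. The Python sketch below applies this to the two axis ranges mentioned above. The two driving distances are hypothetical placeholders, since the actual PGA values are not reproduced in this text, and the function names are illustrative.

```python
def drawn_height(value, axis_min, axis_max):
    """Fraction of the plot's height at which a value is drawn."""
    return (value - axis_min) / (axis_max - axis_min)

def apparent_rise(first, last, axis_min, axis_max):
    """How much of the plot's height the change from first to last occupies."""
    return drawn_height(last, axis_min, axis_max) - drawn_height(first, axis_min, axis_max)

# Hypothetical average driving distances in yards (placeholders, not the real PGA data).
first_year, last_year = 270.0, 290.0

zoomed = apparent_rise(first_year, last_year, 250, 300)   # axis from 250 to 300 yards
full = apparent_rise(first_year, last_year, 0, 400)       # axis from 0 to 400 yards

print(f"Axis 250 to 300: the change spans {zoomed:.0%} of the plot height")
print(f"Axis 0 to 400:   the change spans {full:.0%} of the plot height")
```

The same 20-yard change fills a large share of the zoomed-in plot but only a sliver of the full-range plot, even though the underlying numbers are identical.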

Although no social or political harm can come from misleading the public on golf driving distances, there are many issues where certain groups can easily deceive the public by showing graphs in ways that illustrate the point(s) they are trying to get across. It is up to you to take a few moments when you see a graphical summary of someone’s data to see what is actually going on.

Summary

Qualitative Data

[Figure: visual summary of methods for qualitative data]

 Quantitative Data

[Figure: summary of methods for quantitative data]