## 2. Collecting Data by Surveys

In a sense, the techniques for collecting data are the most important step in the process of statistics; the procedures set the stage for obtaining information that can be used to draw meaningful and accurate conclusions. All of the statistical calculations we’ll be learning can be used on any set of data, regardless of how the data was obtained. Useless results come from bad data. Therefore, no matter how careful and exacting you are in organizing, summarizing, and analyzing data, your conclusions will be useless if you aren’t careful in the beginning to collect data appropriately.

There are four main ways to obtain data.

1. A census is a list of all the individuals and their characteristics in a population. An example of a census is the US Census, held every 10 years. The main advantage of using a census to obtain information is that your conclusions will have 100% certainty, because every individual in the population is included. The disadvantages of conducting a census are that it may be difficult or impossible to obtain all the information, and costs may be prohibitive.
2. An existing source is an appropriate data set that has already been collected, and can be used for your study. The advantage of finding an existing source of data is obviously the savings in both time and money. A disadvantage is that it can often be difficult to find the exact data you need.
3. A survey sample is a study in which only a subset of the population is considered and there is no attempt to influence the value of the variable of interest. The advantage of using a survey is the savings in both time and money of not having to get information from every individual in the population. The main disadvantage of a survey sample, and this is extremely important, is that choosing an appropriate sample can be difficult. The sample must represent the overall population, even though it is just a subset of the population.
A survey sample is an example of an observational study, where there is no attempt to influence the value of the variable. Observational studies are great for detecting associations (relationships) between variables, but they cannot establish causation. The reason is that variables we fail to observe, called lurking variables, may be responsible for the association.
4. A designed experiment is a study that applies a treatment to individuals. In an experiment, information from the treated group is often compared with a control (untreated) group. Variables from the individuals and the treatments can easily be controlled in an experiment. A major advantage of an experiment is that you can analyze individual factors. Disadvantages are that an experiment cannot be conducted when the variables cannot be controlled, or when applying the treatment would be immoral or unethical. Section 1.5 discusses methods for setting up and conducting an experiment.

When conducting a census is unrealistic (as is usually the case), sampling from the population is the next best thing. There is one main question: How do you choose your sample? For example, if you are interested in knowing the average grade point average (GPA) of graduating high school students in your city, you wouldn’t want your sample to consist of only women or of just athletes or of just honor roll students. You would want your sample to represent the entire population of interest.

We must use the process of randomness to select the individuals included in our sample. If we are allowed to do the selecting, our sample will most certainly be biased, i.e., it will include a group of individuals that does not represent the entire population, and therefore, conclusions will most certainly systematically favor certain outcomes.

The most popular sampling technique that relies on randomness is simple random sampling, a technique where every possible sample of size n out of a population of size N has an equally likely chance of occurring. For example, a simple random sample of size n = 2 from a population size of N = 4 has 6 possible samples, and each has an equally likely chance of occurring:

Population: {1, 2, 3, 4}
Possible Samples of Size n = 2: {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}
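
The list of possible samples above can be reproduced with a short script. This is only an illustrative sketch using Python's standard library; the variable names are our own.

```python
from itertools import combinations
from math import comb

population = [1, 2, 3, 4]
n = 2

# Every subset of size n is one possible sample; order does not matter.
samples = list(combinations(population, n))
print(samples)       # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(len(samples))  # 6

# Under simple random sampling, each possible sample is equally likely.
probability_each = 1 / comb(len(population), n)
```

The count of possible samples is the number of combinations of N items taken n at a time, so it grows very quickly as the population gets larger.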

As simple random sampling is similar to “drawing names out of a hat,” we need a method to select the individuals for our sample. We will use either a table of random digits or technology to do this. A quick search on the Internet shows many places to find tables of random digits. One great site contains freely available tables (download as a PDF or read online): http://www.rand.org/pubs/monograph_reports/MR1418.html

In such a table of random digits, each entry is equally likely to be any of the 10 digits 0 through 9, and entries are independent of one another (knowledge of one number gives us no information about any of the other entries surrounding it). Because of this, if we read the table in groups of two numbers, each pair of entries is equally likely to be any of the 100 pairs 00, 01, …, 98, 99. Reading each triple of entries gives us an equally likely chance of seeing any of the 1000 entries 000, 001, 002, …, 998, 999.

To conduct a Simple Random Sample, begin by numbering every member in your population. If your population has size 30, you will read numbers from the table in groups of two (pairs); if your population has size 168, you will read numbers from the table in groups of three (triples). Start anywhere you’d like in the table, and read in any direction, left, right, up, or down. It’s nice to follow a pattern, just so you don’t get lost. Select the random numbers as you move along, and match the numbers chosen to the individuals in your population. If you select a random number that does not correspond to an individual in your population, or if you encounter a repeat number (this WILL happen because the digits are random!), skip it and move on.

For example, suppose I want to select 4 students from a class of 30 to estimate the class average on an exam (this is unrealistic because it’s trivial to find the average of only 30 students, but it’s just an example). I would first assign a number to each student, starting at 01 and ending at 30. Start at the beginning of a line in the table below, say line 00263, and read in pairs from left to right. Here are the numbers that I’d record:

32 03 13 96 08 75 99 27 34 45 01 …

Table of Random Digits
00250 59467 58309 87834 57213 37510 33689 01259 62486 56320 46265
00251 73452 17619 56421 40725 23439 41701 93223 41682 45026 47505
00252 27635 56293 91700 04391 67317 89604 73020 69853 61517 51207
00253 86040 02596 01655 09918 45161 00222 54577 74821 47335 08582
00254 52403 94255 26351 46527 68224 90183 85057 72310 34963 83462
00255 49465 46581 61499 04844 94626 02963 41482 83879 44942 63915
00256 94365 92560 12363 30246 02086 75036 88620 91088 67691 67762
00257 34261 08769 91830 23313 18256 28850 37639 92748 57791 71328
00258 37110 66538 39318 15626 44324 82827 08782 65960 58167 01305
00259 83950 45424 72453 19444 68219 64733 94088 62006 89985 36936
00260 61630 97966 76537 46467 30942 07479 67971 14558 22458 35148
00261 01929 17165 12037 74558 16250 71750 55546 29693 94984 37782
00262 41659 39098 23982 29899 71594 77979 54477 13764 17315 72893
00263 32031 39608 75992 73445 01317 50525 87313 45191 30214 19769
00264 90043 93478 58044 06949 31176 88370 50274 83987 45316 38551

Since the first number, 32, does not correspond to anyone in the population, we skip it. The first student to be selected for the sample would be Student 03. Following this would be Students 13, 08, and finally 27:

32 **03** **13** 96 **08** 75 99 **27** 34 45 01 …

Therefore, all our statistical research would focus on the exam scores of the four students corresponding to the numbers 03, 08, 13, and 27. Numbering students and “allowing” a table to choose the students to include in our study removes the biases that would exist if we tried to choose the students ourselves.
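
The table-reading procedure can also be sketched in code. The helper `select_from_digits` below is a hypothetical name of our own, not part of any library; it reads fixed-width groups of digits, skips labels that are out of range or repeated, and stops once the sample is full. The last lines show one way "technology" might replace the table entirely, using Python's standard `random` module with an arbitrarily chosen seed.

```python
import random

def select_from_digits(digits, population_size, sample_size):
    """Read fixed-width groups of digits and keep valid, unrepeated labels."""
    width = len(str(population_size))   # 30 -> pairs, 168 -> triples
    chosen = []
    for i in range(0, len(digits) - width + 1, width):
        label = int(digits[i:i + width])
        # Skip labels outside 1..population_size and labels already chosen.
        if 1 <= label <= population_size and label not in chosen:
            chosen.append(label)
        if len(chosen) == sample_size:
            break
    return chosen

# Line 00263 of the table above, with the spaces between blocks removed.
row_263 = "32031 39608 75992 73445 01317 50525 87313 45191 30214 19769".replace(" ", "")

print(select_from_digits(row_263, population_size=30, sample_size=4))
# [3, 13, 8, 27]

# Alternative using technology: draw 4 distinct labels from 1..30 directly.
rng = random.Random(0)   # seed is an arbitrary choice, for repeatability only
tech_sample = rng.sample(range(1, 31), 4)
```

The function returns the students in the order they were drawn, which matches the order 03, 13, 08, 27 found by hand.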

### Sampling Errors

Now that we have seen how to obtain samples appropriately, here are some of the issues that can arise during a sampling process. There are two types of errors that arise: sampling errors and nonsampling errors.

Sampling errors are very difficult to control or predict. These errors result from using the sample (a subset of the population) to describe characteristics of the population. Therefore, the process of sampling may give incomplete information about the population. In other words, even if we use a random process to select a sample, our sample may not “perfectly” represent the population of interest. This occurs because data and information vary from member to member in the population.
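
A small simulation can make sampling error concrete. The sketch below uses a made-up population of 1,000 exam scores (the scores, the sample size of 30, and the seed are all assumptions for illustration, not data from any study) and shows that different random samples give different sample means.

```python
import random
from statistics import mean

rng = random.Random(42)  # arbitrary seed, only for repeatability

# Hypothetical population: 1,000 made-up exam scores between 40 and 100.
population = [rng.randint(40, 100) for _ in range(1000)]
population_mean = mean(population)

# Draw several simple random samples of size 30 and compare their means.
sample_means = [mean(rng.sample(population, 30)) for _ in range(5)]

for m in sample_means:
    # The gap between a sample mean and the population mean is the
    # sampling error for that particular sample.
    print(f"sample mean = {m:.2f}, sampling error = {m - population_mean:+.2f}")
```

Each sample was chosen by a proper random process, yet the sample means still differ from the population mean and from one another.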

There are numerous nonsampling errors that result from the sampling process, including the nonresponse of individuals selected in the sample, inaccurate responses to poorly worded questions, bias in the selection of the sample, and so on. Nonsampling errors are often largely avoidable with a good study design, and minimizing these errors is of high priority in designing a sample survey. Some examples of nonsampling errors include:

• Using an incomplete population
• Nonresponse
• Interviewer errors
• Mistakes in recording or entering data
• Questionnaire design
• Wording of questions
• Order of questions, words, and responses

Example 1: Identifying Parts of a Survey

Two shortened survey reports are given. In each report, identify the following: the population, the sample, the results, and whether the results represent a sample statistic or a population parameter.

a.     A headline about the rising obesity among young people led a school board to survey local high school students. Out of 231 students surveyed, 58% reported eating a “high fat” snack at least 4 times a week.

b.     A nonprofit organization interviewed 618 adult shoppers at malls across Louisiana about their views on obesity in youths. The resulting report stated that an estimated 48% of Louisiana adults are in favor of government regulation of “high fat” fast food options.

Solution

a.     Population: local high school students.

Sample: the 231 students who were surveyed.

Results: 58% of students surveyed eat a “high fat” snack at least 4 times a week. The result refers to only those students who were surveyed, thus the result is a sample statistic.

b.     Population: all Louisiana adults.

Sample: the 618 adult Louisiana mall shoppers who were surveyed.

Results: an estimated 48% of Louisiana adults are in favor of government regulation of “high fat” fast food options. The results refer to all Louisiana adults, so the report presents this as a population parameter. It is an estimate of that parameter based on the sample statistics, which were not reported.

### Types of Sampling

There are many ways that one can sample from a population. Ideally we would like to sample in such a way that we get a sample that reflects all the characteristics of the population and therefore a statistic that represents the parameter well. The quality of a sample statistic (i.e., accuracy, precision, representativeness) is highly affected by how the sample is chosen; that is, by the sampling method. Below we describe different sampling methods.

• A representative sample is one that has the same relevant characteristics as the population and does not favor one group of the population over another.
• A random sample is one in which every member of the population has an equal chance of being selected.
• A stratified sample is one in which members of the population are divided into two or more subgroups, called strata, that share similar characteristics like age, gender, or ethnicity. A random sample from each stratum is then drawn.
• A cluster sample is one chosen by dividing the population into groups, called clusters, that are each similar to the entire population. The researcher then randomly selects some of the clusters. The sample consists of the data collected from every member of each cluster selected.
• A systematic sample is one chosen by selecting every nth member of the population. For a given ordering of the population and a given starting point, systematic sampling always produces the same sample for the same n. To get a different sample you will need a different n value or a different starting point.
• A convenience sample is one in which the sample is “convenient” for the researcher to select, for example because the individuals are close at hand.
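
To make the differences between these methods concrete, here is a minimal sketch in Python. The population of 12 members, the group labels, and the function names are all hypothetical, and the same labels are used as strata in one function and as clusters in the other purely to contrast the mechanics.

```python
import random

rng = random.Random(1)  # arbitrary seed, only for repeatability

# Hypothetical population: 12 members, each tagged with a group label.
population = [{"id": i, "group": "ABC"[i % 3]} for i in range(12)]

def by_group(members):
    """Collect members into a dict keyed by their group label."""
    groups = {}
    for m in members:
        groups.setdefault(m["group"], []).append(m)
    return groups

def stratified_sample(members, per_stratum):
    """Draw a random sample from every stratum."""
    return [m for stratum in by_group(members).values()
            for m in rng.sample(stratum, per_stratum)]

def cluster_sample(members, n_clusters):
    """Randomly pick whole clusters and keep every member of each."""
    groups = by_group(members)
    picked = rng.sample(sorted(groups), n_clusters)
    return [m for label in picked for m in groups[label]]

def systematic_sample(members, n, start=0):
    """Take every nth member, beginning at index `start`."""
    return members[start::n]
```

Stratified sampling takes some members from every group, while cluster sampling takes every member from some groups. The `start` parameter in `systematic_sample` is our own addition, included to show that the starting point also determines which members are chosen.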

Cluster sampling and stratified sampling are often confused, but a simple thought experiment may help. Suppose you wanted to compare the fuel efficiency of different cars driven in the United States. Can you think of some ways to divide the cars into strata that might represent a broader scope of vehicles on the market? For example: size of engine, manufacturer, make, safety rating, or number of doors. Notice that these are characteristics that the individuals in the sample may or may not have, and that they serve as categories for grouping the cars. For cluster sampling, look at the same example and ask yourself: do you think this method would produce a good representative sample of vehicles if we allowed our clusters to be price ranges? Why or why not? Because cluster sampling is an “all from one group” method, comparing mpg from cars in only certain price ranges would not produce a representative sample.

Suppose instead you decide to gather data from half the students in your class for our comparison of fuel-efficient cars. Do you think this example of convenience sampling would be an accurate picture of the population of cars driven in the United States? It is unlikely that students (or any age group for that matter) will drive a wide range of cars. Newer, more expensive cars are less likely to be driven by students and would not be well represented in the student sample.

Lastly, let’s try identifying every 5th car accessing the interstate on a particular entrance ramp during rush hour traffic. This is an example of systematic sampling for our fuel study. Can you identify any potential biases that we might need to be aware of when choosing the observation spot? The location of the entrance ramp might lend itself to having cars only on one end of the price scale, depending on the businesses located in the area.

Example 2: Identifying Sampling Techniques

Identify the sampling technique used to obtain a sample in each of the following situations.

a.     To conduct a survey on collegiate social life, you knock on every 5th dorm room door on campus.

b.     Student ID numbers are randomly selected from a computer printout for free tickets to the championship game.

c.     Fourth grade reading levels across the county were analyzed by the school board by randomly selecting 25 fourth graders from each school in the county district.

d.     In order to determine what ice cream flavors would sell best, a grocery store polls shoppers that are in the frozen foods section.

e.     To determine the average number of cars per household, each household in 4 of the 20 local counties was sent a survey regarding car ownership.

Solution

a.     Because the sample is obtained by choosing every nth dorm room, this is systematic sampling. The sample is representative as long as students were randomly assigned to dorm rooms and there are no hidden biases, such as every nth room housing only male students.

b.     Since every member has an equal chance of being selected, this is random sampling.

c.     The students were divided into strata based on their schools and then a random sample from each school was chosen. This is stratified sampling.

d.     Because of the ease of choosing shoppers right in their own store, this is convenience sampling. In this case, convenience sampling is a viable method for gaining a representative sample, since the store is interested in the opinions of its own customers.

e.     Cluster sampling was used here because the counties are the natural clusters and all of the households in some of the counties received the surveys.