In a sense, the techniques for collecting data are the most important step in the process of statistics; the procedures set the stage for obtaining information that can be used to draw meaningful and accurate conclusions. All of the statistical calculations we’ll be learning can be used on any set of data, regardless of how the data was obtained. Useless results come from bad data. Therefore, no matter how careful and exacting you are in organizing, summarizing, and analyzing data, your conclusions will be useless if you aren’t careful in the beginning to collect data appropriately.

There are four main ways to obtain data.

- A **census** is a list of all the individuals and their characteristics in a population. An example of a census is the US Census held every 10 years (this is only an example, though). The main advantage of using a census to obtain information is that your conclusions will have 100% certainty, because every individual in the population is included. The disadvantages of conducting a census are that it may be difficult or impossible to obtain all the information, and costs may be prohibitive.
- An **existing source** is an appropriate data set that has already been collected and can be used for your study. The advantage of finding an existing source of data is the savings in both time and money. A disadvantage is that it can often be difficult to find the *exact* data you need.
- A **survey sample** is a study in which only a subset of the population is considered and there is no attempt to influence the value of the variable of interest. The advantage of using a survey is the savings in both time and money from not having to get information from every individual in the population. The main disadvantage of a survey sample, and this is extremely important, is that choosing an appropriate sample can be difficult. The sample *must* represent the overall population, even though it is just a subset of the population.

  A survey sample is an example of an **observational study**, where there is no attempt to influence the value of the variable. Observational studies are great for detecting *associations* (relationships) between variables, but they cannot isolate causes to determine *causation*. This limitation arises because variables we fail to observe, called **lurking variables**, may be responsible for the association.
- A **designed experiment** is a study that applies a treatment to individuals. In an experiment, information from the treated group is often compared with a control (untreated) group. Variables from the individuals and the treatments can be controlled in an experiment. A major advantage of an experiment is that you can analyze individual factors. The disadvantages are that an experiment cannot be conducted when the variables cannot be controlled or when applying the treatment would be immoral or unethical. Section 1.5 discusses methods for setting up and conducting an experiment.

When conducting a census is unrealistic (as is usually the case), sampling from the population is the next best thing. There is one main question: How do you choose your sample? For example, if you are interested in knowing the average grade point average (GPA) of graduating high school students in your city, you wouldn’t want your sample to consist of only women or of just athletes or of just honor roll students. You would want your sample to represent the entire population of interest.

We must use the process of **randomness** to select the individuals included in our sample. If we are allowed to do the selecting, our sample will most certainly be **biased**, i.e., it will include a group of individuals that does not represent the entire population, and therefore, conclusions will most certainly systematically favor certain outcomes.

The most popular sampling technique that relies on randomness is **simple random sampling**, a technique where every possible sample of size *n* out of a population of size *N* is equally likely to occur. For example, there are 6 possible samples of size n = 2 from a population of size N = 4, and each is equally likely to occur:

**Population:** {1, 2, 3, 4}

**Possible Samples of Size n = 2:** {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}
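The list of possible samples above can be reproduced with a short Python sketch. It assumes that order within a sample does not matter, so the number of samples is the binomial coefficient "N choose n" (a standard counting fact, not something stated in the text above).

```python
from itertools import combinations
from math import comb

population = [1, 2, 3, 4]
n = 2

# Every subset of size n is one possible sample; order does not matter.
samples = list(combinations(population, n))

for s in samples:
    print(set(s))

# The number of samples equals "N choose n", here 6.
print(len(samples), comb(len(population), n))

# Under simple random sampling, each sample is chosen with probability 1/6.
probability_of_each = 1 / len(samples)
```

For larger populations the list grows quickly, which is why in practice we select one sample at random rather than enumerating them all.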

As simple random sampling is similar to “drawing names out of a hat,” we need a method to select the individuals for our sample. We will use either a table of random digits or technology to do this. A quick search on the Internet shows many places to find tables of random digits. One great site contains freely available tables (download as a PDF or read online): http://www.rand.org/pubs/monograph_reports/MR1418.html

In such a table of random digits, each entry is equally likely to be any of the 10 digits 0 through 9, and the entries are *independent* of one another (knowledge of one number gives us no information about any of the other entries surrounding it). As a result, if we read the table in groups of two numbers, each pair of entries is equally likely to be any of the 100 pairs 00, 01, …, 98, 99. Reading each triple of entries gives us an equally likely chance of seeing any of the 1000 entries 000, 001, 002, …, 998, 999.

To conduct a simple random sample, begin by numbering every member in your population. Read the table in groups that have as many digits as the largest number you assigned: if your population has size 30, you will read numbers from the table in groups of two (pairs); if your population has size 168, you will read numbers from the table in groups of three (triples). Start anywhere you’d like in the table, and read in any direction: left, right, up, or down. It’s nice to follow a pattern, just so you don’t get lost. Select the random numbers as you move along, and match the numbers chosen to the individuals in your population. If you select a random number that does not correspond to an individual in your population, or if you encounter a repeat number (this WILL happen because the digits are random!), skip it and move on.

For example, suppose I want to select 4 students from a class of 30 to estimate the class average on an exam (this is unrealistic because it’s trivial to find the average of only 30 students, but it’s just an example). I would first assign a number to each student, starting at 01 and ending at 30. Start at the beginning of a line, say 263, and read in pairs from left to right. Here are the numbers that I’d record:

32 03 13 96 08 75 99 27 34 45 01 …

**Table of Random Digits**

| Line | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| 00250 | 59467 | 58309 | 87834 | 57213 | 37510 | 33689 | 01259 | 62486 | 56320 | 46265 |
| 00251 | 73452 | 17619 | 56421 | 40725 | 23439 | 41701 | 93223 | 41682 | 45026 | 47505 |
| 00252 | 27635 | 56293 | 91700 | 04391 | 67317 | 89604 | 73020 | 69853 | 61517 | 51207 |
| 00253 | 86040 | 02596 | 01655 | 09918 | 45161 | 00222 | 54577 | 74821 | 47335 | 08582 |
| 00254 | 52403 | 94255 | 26351 | 46527 | 68224 | 90183 | 85057 | 72310 | 34963 | 83462 |
| 00255 | 49465 | 46581 | 61499 | 04844 | 94626 | 02963 | 41482 | 83879 | 44942 | 63915 |
| 00256 | 94365 | 92560 | 12363 | 30246 | 02086 | 75036 | 88620 | 91088 | 67691 | 67762 |
| 00257 | 34261 | 08769 | 91830 | 23313 | 18256 | 28850 | 37639 | 92748 | 57791 | 71328 |
| 00258 | 37110 | 66538 | 39318 | 15626 | 44324 | 82827 | 08782 | 65960 | 58167 | 01305 |
| 00259 | 83950 | 45424 | 72453 | 19444 | 68219 | 64733 | 94088 | 62006 | 89985 | 36936 |
| 00260 | 61630 | 97966 | 76537 | 46467 | 30942 | 07479 | 67971 | 14558 | 22458 | 35148 |
| 00261 | 01929 | 17165 | 12037 | 74558 | 16250 | 71750 | 55546 | 29693 | 94984 | 37782 |
| 00262 | 41659 | 39098 | 23982 | 29899 | 71594 | 77979 | 54477 | 13764 | 17315 | 72893 |
| 00263 | 32031 | 39608 | 75992 | 73445 | 01317 | 50525 | 87313 | 45191 | 30214 | 19769 |
| 00264 | 90043 | 93478 | 58044 | 06949 | 31176 | 88370 | 50274 | 83987 | 45316 | 38551 |

Since the first number, 32, does not correspond to anyone in the population, we skip it. The first student to be selected for the sample would be Student 03. Following this would be Students 13, 08, and finally 27:

32 **03** **13** 96 **08** 75 99 **27** 34 45 01 …

Therefore, all our statistical research would focus on the exam scores of the four students corresponding to the numbers 03, 08, 13, and 27. Numbering students and “allowing” a table to choose the students to include in our study removes the biases that would exist if we tried to choose the students ourselves.
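The table-reading procedure can also be sketched in Python. The function name `select_from_digits` and its parameters are hypothetical, chosen only for this illustration. The sketch reads the digits of line 00263 in pairs, skips numbers with no matching student, skips repeats, and stops once four students are chosen.

```python
def select_from_digits(digits, population_size, sample_size):
    """Read a string of random digits in fixed-width groups and pick a sample."""
    # Group width = number of digits in the largest label (30 -> 2, 168 -> 3).
    width = len(str(population_size))
    chosen = []
    for start in range(0, len(digits) - width + 1, width):
        label = int(digits[start:start + width])
        if label < 1 or label > population_size:
            continue  # no individual carries this number, so skip it
        if label in chosen:
            continue  # repeat number, so skip it
        chosen.append(label)
        if len(chosen) == sample_size:
            break
    return chosen


# Line 00263 of the table, read from left to right.
line_263 = "32031 39608 75992 73445 01317 50525 87313 45191 30214 19769"
digits = line_263.replace(" ", "")

print(select_from_digits(digits, population_size=30, sample_size=4))
# [3, 13, 8, 27]
```

If the digits run out before enough individuals are found, this sketch simply returns the shorter list; with a real table you would continue on the next line.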

### Sampling Errors

Now that we have seen *how* to obtain samples appropriately, here are some of the issues that can arise during a sampling process. There are two types of errors that arise: **sampling errors** and **nonsampling errors**.

**Sampling errors** are very difficult to control or predict. These errors result from using the sample (a *subset* of the population) to describe characteristics of the population. Therefore, the process of sampling may give incomplete information about the population. In other words, even if we use a random process to select a sample, our sample may not “perfectly” represent the population of interest. This occurs because data and information vary from member to member in the population.
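To see sampling error in action, here is a small simulation sketch. The exam scores are made up for illustration (they are not data from this section), and Python's `random.sample` stands in for the table of random digits. Each simple random sample of 4 students gives a slightly different mean, and the gap between a sample mean and the population mean is the sampling error.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical exam scores for a class of 30 students (made-up values).
scores = [random.randint(50, 100) for _ in range(30)]
population_mean = mean(scores)

# Draw five simple random samples of size 4, each without replacement.
sample_means = [mean(random.sample(scores, 4)) for _ in range(5)]

print("population mean:", round(population_mean, 1))
for m in sample_means:
    # The difference below is the sampling error for that particular sample.
    print("sample mean:", round(m, 1), "error:", round(m - population_mean, 1))
```

No sample here was chosen badly; the differences come only from which students happened to be selected.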

There are numerous **nonsampling errors** that result from the sampling process, including the nonresponse of individuals selected in the sample, inaccurate responses to poorly worded questions, bias in the selection of the sample, and so on. Nonsampling errors are often largely avoidable with a good study design, and minimizing these errors is of high priority in designing a sample survey. Some examples of nonsampling errors include:

- Using an incomplete population
- Nonresponse
- Interviewer errors
- Misrepresented answers
- Mistakes in recording or entering data
- Questionnaire design
- Wording of questions
- Order of questions, words, and responses

**Example 1: Identifying Parts of a Survey**

Two shortened survey reports are given. In each report, identify the following: the population, the sample, the results, and whether the results represent a sample statistic or a population parameter.

a. A headline about the rising obesity among young people led a school board to survey local high school students. Out of 231 students surveyed, 58% reported eating a “high fat” snack at least 4 times a week.

b. A nonprofit organization interviewed 618 adult shoppers at malls across Louisiana about their views on obesity in youths. The resulting report stated that an estimated 48% of Louisiana adults are in favor of government regulation of “high fat” fast food options.

**Solution**

a. Population: local high school students.

Sample: the 231 students who were surveyed.

Results: 58% of students surveyed eat a “high fat” snack at least 4 times a week. The result refers to only those students who were surveyed, thus the result is a sample statistic.

b. Population: Louisiana adults

Sample: the 618 adult Louisiana mall shoppers who were surveyed.

Results: 48% of Louisiana adults are in favor of government regulation of “high fat” fast food options. The result refers to all Louisiana adults, so it is reported as a population parameter. This value is an estimate of the parameter based on the sample statistic, which was not reported.

**Types of Sampling**

There are many ways that one can sample from a population. Ideally we would like to sample in such a way that we get a sample that reflects all the characteristics of the population and therefore a statistic that represents the parameter well. The quality of a sample statistic (i.e., accuracy, precision, representativeness) is highly affected by how the sample is chosen; that is, by the sampling method. Below we describe different sampling methods.

- A **representative sample** is one that has the same relevant characteristics as the population and does not favor one group of the population over another.
- A **random sample** is one in which every member of the population has an equal chance of being selected.
- A **stratified sample** is one in which members of the population are divided into two or more subgroups, called **strata**, that share similar characteristics like age, gender, or ethnicity. A random sample from each stratum is then drawn.
- A **cluster sample** is one chosen by dividing the population into groups, called **clusters**, that are each similar to the entire population. The researcher then randomly selects some of the clusters. The sample consists of the data collected from every member of each cluster selected.
- A **systematic sample** is one chosen by selecting every *n*th member of the population. Systematic sampling is easy to detect because it always produces the same sample for the same *n* and the same starting point. To get a different sample you will need a different *n* value or a different starting point.
- A **convenience sample** is one chosen because it is “convenient” for the researcher to select.
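The differences between stratified, cluster, and systematic sampling can be sketched in a few lines of Python. Everything here is hypothetical: the 12-student population, the `grade` and `homeroom` fields, and the function names are invented for illustration, and the systematic sample is assumed to start with the first member.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 12 students, each with a grade (used as the
# stratum) and a homeroom (used as the cluster). All values are made up.
population = [
    {"id": i, "grade": 9 + i % 3, "homeroom": "ABCD"[i % 4]}
    for i in range(12)
]


def group_by(people, key):
    """Split the population into groups that share the same value of key."""
    groups = {}
    for person in people:
        groups.setdefault(person[key], []).append(person)
    return groups


def stratified_sample(people, key, per_stratum):
    """Draw a random sample of per_stratum members from every stratum."""
    return [p for group in group_by(people, key).values()
            for p in random.sample(group, per_stratum)]


def cluster_sample(people, key, n_clusters):
    """Randomly pick whole clusters and keep every member of each one."""
    groups = group_by(people, key)
    picked = random.sample(sorted(groups), n_clusters)
    return [p for name in picked for p in groups[name]]


def systematic_sample(people, step, start=0):
    """Take every step-th member, beginning at position start."""
    return people[start::step]


strat = stratified_sample(population, "grade", per_stratum=2)
clust = cluster_sample(population, "homeroom", n_clusters=2)
syst = systematic_sample(population, step=4)

print([p["id"] for p in strat])  # 2 students from each of the 3 grades
print([p["id"] for p in clust])  # all 3 students from 2 of the 4 homerooms
print([p["id"] for p in syst])   # students 0, 4, 8
```

Notice that the stratified sample takes some members from every group, while the cluster sample takes every member from some groups.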

Cluster sampling and stratified sampling are often confused, but perhaps a simple thought experiment will help. Suppose you wanted to compare the fuel efficiency of different cars driven in the United States. Can you think of some ways to divide the cars into strata that might represent a broader scope of vehicles on the market? For example: size of engine, manufacturer, make, safety rating, number of doors. Notice that these are characteristics the individuals in the sample may or may not have, and that they are treated as qualitative categories. For a cluster-sampling look at the same example, ask yourself: do you think this method would produce a good representative sample of vehicles if we allowed our clusters to be price ranges? Why or why not? Because cluster sampling is an “all from one group” method, comparing the mpg of cars in only certain price ranges would not produce a representative sample.

Suppose instead you decide to gather data from half the students in your class for our comparison of fuel efficiency. Do you think this example of convenience sampling would give an accurate picture of the population of cars driven in the United States? It is unlikely that students (or any single age group, for that matter) will drive a wide range of cars. Newer, more expensive cars are less likely to be driven by students and would not be well represented in the student sample.

Lastly, let’s try identifying every 5th car accessing the interstate on a particular entrance ramp during rush hour traffic. This is an example of systematic sampling for our fuel study. Can you identify any potential biases that we might need to be aware of when choosing the observation spot? The location of the entrance ramp might lend itself to having cars only on one end of the price scale, depending on the businesses located in the area.

**Example 2: Identifying Sampling Techniques**

Identify the sampling technique used to obtain a sample in each of the following situations.

a. To conduct a survey on collegiate social life, you knock on every 5th dorm room door on campus.

b. Student ID numbers are randomly selected from a computer print out for free tickets to the championship game.

c. Fourth grade reading levels across the county were analyzed by the school board by randomly selecting 25 fourth graders from each school in the county district.

d. In order to determine what ice cream flavors would sell best, a grocery store polls shoppers that are in the frozen foods section.

e. To determine the average number of cars per household, each household in 4 of the 20 local counties was sent a survey regarding car ownership.

**Solution**

a. Because the sample is obtained by choosing every nth dorm room, this is systematic sampling. This is a representative sample as long as students were randomly assigned to dorm rooms and there are no hidden biases, such as only males living in every nth room.

b. Since every member has an equal chance of being selected, this is random sampling.

c. The students were divided into strata based on their schools and then a random sample from each school was chosen. This is stratified sampling.

d. Because of the ease of choosing shoppers right in their own store, this is convenience sampling. In this case, convenience sampling is a viable method for gaining a representative sample since the store would be interested in knowing the thoughts of their customers.

e. Cluster sampling was used here because the counties are the natural clusters and all of the households in some of the counties received the surveys.