## 1. Intro

This competency set will address the subject of data collection. You’ll learn how to appropriately collect meaningful data, either through sampling or experiments, from which conclusions can be drawn. Anyone can collect data. However, if the data is not collected in a way to eliminate (or reduce) bias or the data does not accurately represent the population of interest, all results and conclusions drawn from the data will be practically meaningless.

As a word of warning, there are a tremendous number of terms that form the “language of statistics.” It may be very beneficial to create your own statistics dictionary, consisting of terms, definitions, and examples. As this language will be used throughout the course, having a good understanding of terms from the start will help you succeed.

### Parameters vs. Statistics

Remember from the first unit that a population is the entire group being studied, while a sample is a representative subset of the population. By definition, a sample is always smaller than the population.

If you are able to collect data from the entire population (for example, exam scores for all students in a statistics course), then the descriptive measures of the population are called parameters. Parameters are often written using Greek letters like μ (pronounced “mew”) or σ (pronounced “sigma”). If you only have data from a sample, the descriptive measures of the sample are called statistics. Statistics are written using Roman letters like x̄ (pronounced “x-bar”) and s, as you saw in Unit 2. An easy way to remember the distinction is:

parameter ⇔ population
statistic ⇔ sample

The main reason for the difference is so you know whether someone is reporting a descriptive measure from the entire population or just a sample. This very subtle, yet extremely important, difference forms the basis for the process of statistical inference.
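The distinction can be made concrete with a short Python sketch. The exam scores below are invented for illustration, and the variable names (`mu`, `sigma`, `x_bar`, `s`) are simply chosen to mirror the notation above. Note that the standard library uses a different function for a population standard deviation (`pstdev`) than for a sample standard deviation (`stdev`).

```python
import statistics

# Hypothetical exam scores for every student in a small course (the population).
population_scores = [72, 85, 90, 66, 78, 94, 81, 59, 88, 77]

# Parameters: descriptive measures computed from the entire population.
mu = statistics.mean(population_scores)       # population mean
sigma = statistics.pstdev(population_scores)  # population standard deviation

# Hypothetical sample: scores from only four of those students.
sample_scores = [85, 66, 94, 77]

# Statistics: descriptive measures computed from the sample only.
x_bar = statistics.mean(sample_scores)  # sample mean
s = statistics.stdev(sample_scores)     # sample standard deviation (divides by n - 1)

print(f"Parameters: mu = {mu}, sigma = {sigma:.2f}")
print(f"Statistics: x_bar = {x_bar}, s = {s:.2f}")
```

Here the sample mean (80.5) is close to, but not equal to, the population mean (79), which is exactly the gap that statistical inference has to account for.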

## 2. Collecting Data by Surveys

In a sense, the techniques for collecting data are the most important step in the process of statistics; the procedures set the stage for obtaining information that can be used to draw meaningful and accurate conclusions. All of the statistical calculations we’ll be learning can be used on any set of data, regardless of how the data was obtained. Useless results come from bad data. Therefore, no matter how careful and exacting you are in organizing, summarizing, and analyzing data, your conclusions will be useless if you aren’t careful in the beginning to collect data appropriately.

There are four main ways to obtain data.

1. A census is a list of all the individuals and their characteristics in a population. One example is the US Census, held every 10 years. The main advantage of using a census to obtain information is that your conclusions will have 100% certainty. The disadvantages of conducting a census are that it may be difficult or impossible to obtain all the information, and costs may be prohibitive.
2. An existing source is an appropriate data set that has already been collected, and can be used for your study. The advantage of finding an existing source of data is obviously the savings in both time and money. A disadvantage is that it can often be difficult to find the exact data you need.
3. A survey sample is a study in which only a subset of the population is considered and there is no attempt to influence the value of the variable of interest. The advantage of using a survey is the savings in both time and money of not having to get information from every individual in the population. The main disadvantage of a survey sample, and this is extremely important, is that choosing an appropriate sample can be difficult. The sample must represent the overall population, even though it is just a subset of the population.
A survey sample is an example of an observational study, where there is no attempt to influence the value of the variable. Observational studies are great for detecting associations (relationships) between variables, but they cannot establish causation, because variables we fail to observe, called lurking variables, may be responsible for the association.
4. A designed experiment is a study that applies a treatment to individuals. In an experiment, information from the treated group is often compared with that from a control (untreated) group. Variables from the individuals and the treatments can easily be controlled in an experiment. A major advantage of an experiment is that you can analyze individual factors. The disadvantages are that experiments cannot be conducted when the variables cannot be controlled or when moral or ethical reasons rule them out. Section 1.5 discusses methods for setting up and conducting an experiment.

When conducting a census is unrealistic (as is usually the case), sampling from the population is the next best thing. There is one main question: How do you choose your sample? For example, if you are interested in knowing the average grade point average (GPA) of graduating high school students in your city, you wouldn’t want your sample to consist of only women or of just athletes or of just honor roll students. You would want your sample to represent the entire population of interest.

We must use the process of randomness to select the individuals included in our sample. If we do the selecting ourselves, our sample will almost certainly be biased; that is, it will include a group of individuals that does not represent the entire population, and our conclusions will therefore systematically favor certain outcomes.

The most popular sampling technique that relies on randomness is simple random sampling, a technique where every possible sample of size n out of a population of size N has an equally likely chance of occurring. For example, a simple random sample of size n = 2 from a population size of N = 4 has 6 possible samples, and each has an equally likely chance of occurring:

Population: {1, 2, 3, 4}
Possible Samples of Size n = 2: {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}
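As a quick check on the count above, the possible samples can be enumerated in Python. This is only an illustrative sketch using the standard library: `itertools.combinations` lists every unordered sample of size n, and `math.comb` gives the count N! / (n!(N − n)!) directly.

```python
from itertools import combinations
from math import comb

population = [1, 2, 3, 4]
n = 2

# Every possible sample of size n, ignoring order and without repeats.
samples = list(combinations(population, n))
print(samples)

# The number of possible samples, counted two ways.
print(len(samples), comb(len(population), n))
```

Under simple random sampling, each of these 6 samples has probability 1/6 of being the one selected.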

As simple random sampling is similar to “drawing names out of a hat,” we need a method to select the individuals for our sample. We will use either a table of random digits or technology to do this. A quick search on the Internet shows many places to find tables of random digits. One great site contains freely available tables (download as a PDF or read online): http://www.rand.org/pubs/monograph_reports/MR1418.html

In such a table of random digits, each entry is equally likely to be any of the 10 digits 0 through 9, and the entries are independent of one another (knowledge of one number gives us no information about any of the other entries surrounding it). As a result, if we read the table in groups of two numbers, each pair of entries is equally likely to be any of the 100 pairs 00, 01, …, 98, 99. Reading each triple of entries gives us an equally likely chance of seeing any of the 1000 entries 000, 001, 002, …, 998, 999.

To conduct a simple random sample, begin by numbering every member in your population. If your population has size 30, you will read numbers from the table in groups of two (pairs); if your population has size 168, you will read numbers from the table in groups of three (triples). Start anywhere you’d like in the table, and read in any direction: left, right, up, or down. It’s nice to follow a pattern, just so you don’t get lost. Select the random numbers as you move along, and match the numbers chosen to the individuals in your population. If you select a random number that does not correspond to an individual in your population, or if you encounter a repeated number (this WILL happen because the digits are random!), skip it and move on.

For example, suppose I want to select 4 students from a class of 30 to estimate the class average on an exam (this is unrealistic because it’s trivial to find the average of only 30 students, but it’s just an example). I would first assign a number to each student, starting at 01 and ending at 30. Start at the beginning of a line in the table of random digits, say line 00263, and read in pairs from left to right. Here are the numbers that I’d record:

32 03 13 96 08 75 99 27 34 45 01 …

Table of Random Digits

| Line | Random digits |
|---|---|
| 00250 | 59467 58309 87834 57213 37510 33689 01259 62486 56320 46265 |
| 00251 | 73452 17619 56421 40725 23439 41701 93223 41682 45026 47505 |
| 00252 | 27635 56293 91700 04391 67317 89604 73020 69853 61517 51207 |
| 00253 | 86040 02596 01655 09918 45161 00222 54577 74821 47335 08582 |
| 00254 | 52403 94255 26351 46527 68224 90183 85057 72310 34963 83462 |
| 00255 | 49465 46581 61499 04844 94626 02963 41482 83879 44942 63915 |
| 00256 | 94365 92560 12363 30246 02086 75036 88620 91088 67691 67762 |
| 00257 | 34261 08769 91830 23313 18256 28850 37639 92748 57791 71328 |
| 00258 | 37110 66538 39318 15626 44324 82827 08782 65960 58167 01305 |
| 00259 | 83950 45424 72453 19444 68219 64733 94088 62006 89985 36936 |
| 00260 | 61630 97966 76537 46467 30942 07479 67971 14558 22458 35148 |
| 00261 | 01929 17165 12037 74558 16250 71750 55546 29693 94984 37782 |
| 00262 | 41659 39098 23982 29899 71594 77979 54477 13764 17315 72893 |
| 00263 | 32031 39608 75992 73445 01317 50525 87313 45191 30214 19769 |
| 00264 | 90043 93478 58044 06949 31176 88370 50274 83987 45316 38551 |

Since the first number, 32, does not correspond to anyone in the population, we skip it. The first student to be selected for the sample would be Student 03. Following this would be Students 13, 08, and finally 27:

32 **03** **13** 96 **08** 75 99 **27** 34 45 01 …

Therefore, all our statistical research would focus on the exam scores of the four students corresponding to the numbers 03, 08, 13, and 27. Numbering students and “allowing” a table to choose the students to include in our study removes the biases that would exist if we tried to choose the students ourselves.
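The table-reading procedure in this example can also be sketched in Python. The function name `select_from_digits` and its parameters are hypothetical, made up for this illustration; the digits are copied from line 00263 of the table. The function reads the digits in fixed-width groups, skips numbers outside the population and repeats, and stops once the sample is full.

```python
def select_from_digits(digits, population_size, sample_size, width=2):
    """Read a string of random digits in groups of `width` and pick a sample."""
    chosen = []
    for i in range(0, len(digits) - width + 1, width):
        number = int(digits[i:i + width])
        # Skip numbers that match no individual, and skip repeats.
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen

# Line 00263 of the table of random digits, with the spaces removed.
line_00263 = "32031 39608 75992 73445 01317 50525 87313 45191 30214 19769"
digits = line_00263.replace(" ", "")

selected = select_from_digits(digits, population_size=30, sample_size=4)
print(selected)  # prints [3, 13, 8, 27]
```

The result matches the hand selection above: Students 03, 13, 08, and 27. For a population of 168, the same sketch would be called with `width=3` so that the digits are read as triples.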

### Sampling Errors

Now that we have seen how to obtain samples appropriately, here are some of the issues that can arise during a sampling process. There are two types of errors that arise: sampling errors and nonsampling errors.

Sampling errors are very difficult to control or predict. These errors result from using the sample (a subset of the population) to describe characteristics of the population. Therefore, the process of sampling may give incomplete information about the population. In other words, even if we use a random process to select a sample, our sample may not “perfectly” represent the population of interest. This occurs because data and information vary from member to member in the population.
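A small simulation can make sampling error visible. The sketch below is an illustration under invented assumptions: a hypothetical population of 1,000 exam scores, generated with a fixed seed so the run is repeatable, and five simple random samples of size 25. Each sample is chosen by a random process, yet each sample mean differs from the population mean by a different amount.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population of 1,000 exam scores between 40 and 100.
population = [rng.randint(40, 100) for _ in range(1000)]
mu = statistics.mean(population)

# Draw several simple random samples and record each sample mean.
sample_means = []
for _ in range(5):
    sample = rng.sample(population, 25)  # sampling without replacement
    sample_means.append(statistics.mean(sample))

for i, x_bar in enumerate(sample_means, start=1):
    print(f"sample {i}: x_bar = {x_bar:.2f}, sampling error = {x_bar - mu:+.2f}")
```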

There are numerous nonsampling errors that result from the sampling process, including the nonresponse of individuals selected in the sample, inaccurate responses to poorly worded questions, bias in the selection of the sample, and so on. Nonsampling errors are often largely avoidable with a good study design, and minimizing these errors is of high priority in designing a sample survey. Some examples of nonsampling errors include:

• Using an incomplete population
• Nonresponse
• Interviewer errors
• Mistakes in recording or entering data
• Questionnaire design
• Wording of questions
• Order of questions, words, and responses

Example 1: Identifying Parts of a Survey

Two shortened survey reports are given. In each report, identify the following: the population, the sample, the results, and whether the results represent a sample statistic or a population parameter.

a.     A headline about the rising obesity among young people led a school board to survey local high school students. Out of 231 students surveyed, 58% reported eating a “high fat” snack at least 4 times a week.

b.     A nonprofit organization interviewed 618 adult shoppers at malls across Louisiana about their views on obesity in youths. The resulting report stated that an estimated 48% of Louisiana adults are in favor of government regulation of “high fat” fast food options.

Solution

a.     Population: local high school students.

Sample: the 231 students who were surveyed.

Results: 58% of students surveyed eat a “high fat” snack at least 4 times a week. The result refers to only those students who were surveyed, thus the result is a sample statistic.

b.     Population: all Louisiana adults.

Sample: the 618 adult Louisiana mall shoppers who were surveyed.

Results: 48% of Louisiana adults are in favor of government regulation of “high fat” fast food options. The result refers to all Louisiana adults, thus it describes a population parameter. The reported value is an estimate of that parameter based on the sample statistic, which was not reported.

### Types of Sampling

There are many ways that one can sample from a population. Ideally we would like to sample in such a way that we get a sample that reflects all the characteristics of the population and therefore a statistic that represents the parameter well. The quality of a sample statistic (i.e., accuracy, precision, representativeness) is highly affected by how the sample is chosen; that is, by the sampling method. Below we will describe different sampling methods.

• A representative sample is one that has the same relevant characteristics as the population and does not favor one group of the population over another.
• A random sample is one in which every member of the population has an equal chance of being selected.
• A stratified sample is one in which members of the population are divided into two or more subgroups, called strata, that share similar characteristics like age, gender, or ethnicity. A random sample from each stratum is then drawn.
• A cluster sample is one chosen by dividing the population into groups, called clusters, that are each similar to the entire population. The researcher then randomly selects some of the clusters. The sample consists of the data collected from every member of each cluster selected.
• A systematic sample is one chosen by selecting every nth member of the population. Given the same starting point, systematic sampling always produces the same sample for the same n. To get a different sample you will need a different n value or a different starting point.
• A convenience sample is one whose members are chosen because they are easy for the researcher to reach. It is so named because it is “convenient” for the researcher.
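The sampling methods listed above can be sketched in a few lines of Python. Everything in this example is hypothetical: the tiny population of 12 students, the `grade` and `homeroom` fields, and the function names are all invented for illustration, and the sketch assumes every stratum and cluster is large enough to sample from.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 12 students, each with a grade level and a homeroom.
population = [
    {"id": i, "grade": 9 + (i % 4), "homeroom": "ABC"[i % 3]}
    for i in range(1, 13)
]

def simple_random(pop, n):
    """Random sample: n members drawn at random, without replacement."""
    return rng.sample(pop, n)

def stratified(pop, key, n_per_stratum):
    """Stratified sample: a random sample drawn from EVERY stratum."""
    strata = {}
    for person in pop:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster(pop, key, n_clusters):
    """Cluster sample: EVERY member of a few randomly chosen clusters."""
    names = sorted({person[key] for person in pop})
    chosen = set(rng.sample(names, n_clusters))
    return [person for person in pop if person[key] in chosen]

def systematic(pop, step, start=0):
    """Systematic sample: every `step`-th member, beginning at index `start`."""
    return pop[start::step]

print([p["id"] for p in simple_random(population, 3)])
print([p["id"] for p in stratified(population, "grade", 1)])
print([p["id"] for p in cluster(population, "homeroom", 1)])
print([p["id"] for p in systematic(population, 4)])
```

Comparing `stratified` and `cluster` side by side shows the key difference: stratified sampling takes some members from every group, while cluster sampling takes all members from some groups.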

Cluster sampling and stratified sampling are often confused, but perhaps a simple thought experiment will help. Suppose you wanted to compare the fuel efficiency of different cars driven in the United States. Can you think of some ways to divide the cars into strata that might represent a broader scope of vehicles on the market? For example: size of engine, manufacturer, make, safety rating, number of doors. Notice that these are characteristics that the individuals in the sample may or may not have and are qualitative. For cluster sampling, look at the same example and ask yourself: do you think this method would produce a good representative sample of vehicles if we allowed our clusters to be price ranges? Why or why not? Because cluster sampling is an “all from one group” method, comparing the fuel efficiency (mpg) of cars in only certain price ranges would not produce a representative sample.

Suppose instead you decide to gather data from half the students in your class for our comparison of fuel-efficient cars. Do you think this example of convenience sampling would be an accurate picture of the population of cars driven in the United States? It is unlikely that students (or any age group for that matter) will drive a wide range of cars. Newer, more expensive cars are less likely to be driven by students and would not be well represented in the student sample.

Lastly, let’s try identifying every 5th car accessing the interstate on a particular entrance ramp during rush hour traffic. This is an example of systematic sampling for our fuel study. Can you identify any potential biases that we might need to be aware of when choosing the observation spot? The location of the entrance ramp might lend itself to having cars only on one end of the price scale, depending on the businesses located in the area.

Example 2: Identifying Sampling Techniques

Identify the sampling technique used to obtain a sample in each of the following situations.

a.     To conduct a survey on collegiate social life, you knock on every 5th dorm room door on campus.

b.     Student ID numbers are randomly selected from a computer print out for free tickets to the championship game.

c.     Fourth grade reading levels across the county were analyzed by the school board by randomly selecting 25 fourth graders from each school in the county district.

d.     In order to determine what ice cream flavors would sell best, a grocery store polls shoppers that are in the frozen foods section.

e.     To determine the average number of cars per household, each household in 4 of the 20 local counties was sent a survey regarding car ownership.

Solution

a.     Because the sample is obtained by choosing every nth dorm room, this is systematic sampling. This is a representative sample, as long as students were randomly assigned to dorm rooms and there are no hidden potential biases, such as every nth room being occupied only by males.

b.     Since every member has an equal chance of being selected, this is random sampling.

c.     The students were divided into strata based on their schools and then a random sample from each school was chosen. This is stratified sampling.

d.     Because of the ease of choosing shoppers right in the store itself, this is convenience sampling. In this case, convenience sampling is a viable method for gaining a representative sample since the store would be interested in knowing the thoughts of its own customers.

e.     Cluster sampling was used here because the counties are the natural clusters and all of the households in some of the counties received the surveys.

## 3. Collecting Data Through Experiments

In this section you will learn how to appropriately design, set up, and implement a designed experiment to collect data. Recall that two of the main ways to collect data are (1) sample surveys and (2) designed experiments. While sample surveys lead to observational studies, designed experiments enable researchers to control variables, leading to additional conclusions.

A designed experiment is a controlled study whose purpose is to control as many factors as possible to isolate the effects of a particular factor. Designed experiments must be carefully set up to achieve their purposes.

The variables in a designed experiment that are controlled are called the explanatory variables, or sometimes the factors. Factors have values that can be changed by the researcher and are considered as possible causes. Examples of factors are:

• The dosage of a drug in a medical experiment
• The type of teaching method in an education experiment
• One drug by itself compared with that drug used in conjunction with another

The designed experiment analyzes the effects of the factors on the response variable. Response variables are not part of a controlled environment and have values that are measured by the researcher. Examples of response variables are:

• The blood pressures of the patients
• The test scores for a class
• The sizes of a cancerous tumor for patients

A treatment is the specific combination of the values of the factors. Examples of treatments are:

• Giving one medication to one group of patients and a different medication to another
• Using one type of fertilizer on a set of plots of corn and a different type of fertilizer on a different set of plots
• Playing country music to one group of mice and rap music to another

A treatment is applied to “experimental units” (people, plants, materials, or other objects). When experimental units are people, we refer to them as subjects. Subjects in an experiment correspond to individuals in a survey.

Here is an example of a designed experiment. While reading it, think about who the subjects are, what the factors and treatments are, and what the response variables are.

Example 3: Drug Trials

Suppose you want to determine whether a new drug, Drug N, is more effective at treating high blood pressure than the existing drug, Drug E. Patients with high blood pressure are given either Drug N or Drug E, and the blood pressures are measured one month later.

Solution

For this experiment, the subjects are the patients selected to receive either Drug N or Drug E; the factor is the type of drug that a subject receives; the treatment is the specific drug administered (Drug N or Drug E); and the response variable is the subject’s blood pressure after one month. If patients given Drug N have significantly lower blood pressures than patients given Drug E, we would wish to conclude that Drug N is more effective. However, it’s not that easy to immediately draw such conclusions.

A carefully designed experiment ensures that the behavior of the researcher and/or subjects does not influence the outcome of the experiment. It is important for subjects to not know which treatment they get. In addition, many experiments will have a group of subjects that are not given any medication. These subjects are given a placebo (e.g., a sugar tablet) to control against the possibility that subjects imagine a change in their response variable because they know they are receiving “medication.” It is also important for the researchers to not know which group of patients is given which medication or placebo. An experiment where neither the experimenter nor the experimental unit knows what treatment is being administered is called a double-blind experiment.

Conducting an experiment involves considerable planning. Here are some steps to consider:

1. Identify the problem. The first step in planning an experiment (or in most any project at all) is to identify the problem. This includes identifying the general purpose of the experiment, the response variable of interest, and the population. The identified problem is often referred to as a claim about the population of interest.
2. Determine the factors. The second step in planning an experiment is to determine the factors to be studied. Factors can be identified by experts in the field, by the overall purpose of the experiment, or by using results from previous studies. Factors must be identified as either fixed at some predetermined level, controlled (those that will be manipulated in the experiment), or uncontrolled.
3. Determine the number of experimental units (i.e., the sample size). In general, the more experimental units, the more effective the experiment. However, the number of experimental units may have to be limited by time or money. We will learn some techniques later in the semester to calculate an appropriate number of experimental units.
4. Determine the level of each factor. There are three ways to deal with the factors:
• Control – Fix the levels at a constant level (for factors not of interest)
• Manipulate – Set the levels at predetermined levels (for factors of interest)
• Randomize – Randomize the experimental units (for uncontrolled factors not of interest). Randomization decreases (or averages out) the effects of uncontrolled factors, even ones not identified or thought about in advance.
5. Conduct the experiment. Subjects must be assigned at random to a treatment group. There are several good methods for assigning treatments to experimental units: completely random, matched-pairs (see below), and randomized blocks. Applying a treatment to more than one experimental unit is called replication, which can be useful for experimental accuracy and to further decrease the effects of uncontrolled factors. In this step, the experimenter then collects and processes the data.
6. Test the claim. In the final step, we conduct inferential statistics, which will be studied in detail in sub-competencies 8 through 12.

A completely randomized design is when each experimental unit is assigned to a treatment completely at random.
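A completely randomized design can be sketched in Python as a shuffle followed by dealing units out to the treatments. The function name `completely_randomized`, the patient labels, and the choice to keep group sizes as equal as possible are assumptions made for this illustration; the drug names follow Example 3.

```python
import random

def completely_randomized(units, treatments, seed=None):
    """Assign every unit to a treatment completely at random."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)  # random order removes any choice by the researcher
    groups = {t: [] for t in treatments}
    # Deal the shuffled units out in turn so group sizes stay balanced.
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical subjects for the drug trial in Example 3.
patients = [f"patient_{i:02d}" for i in range(1, 13)]
assignment = completely_randomized(patients, ["Drug N", "Drug E"], seed=42)

for treatment, group in assignment.items():
    print(treatment, sorted(group))
```

Because each treatment here is applied to six patients rather than one, the sketch also illustrates replication.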

Another type of experimental design is the matched-pairs design. A matched-pairs design is when the experimental units are paired up (e.g., twins, the same person before and after the treatment, a husband and wife) and each member of the pair is assigned to a different treatment. There are only two levels of treatment (one for each member of the pair). For example, a researcher would collect and compare information from the same subject before receiving a certain medication and then after receiving the medication.
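For pairs of distinct units (such as twins), the random step in a matched-pairs design is deciding which member of each pair gets which treatment. The sketch below is a minimal illustration; the function name `matched_pairs`, the twin labels, and the treatment names are all hypothetical.

```python
import random

def matched_pairs(pairs, treatments=("treatment", "control"), seed=None):
    """Within each pair, randomly decide which member gets which treatment."""
    rng = random.Random(seed)
    assignment = {}
    for first, second in pairs:
        order = list(treatments)
        rng.shuffle(order)  # the equivalent of a coin flip for this pair
        assignment[first] = order[0]
        assignment[second] = order[1]
    return assignment

# Hypothetical pairs of twins.
twins = [("twin_1a", "twin_1b"), ("twin_2a", "twin_2b"), ("twin_3a", "twin_3b")]
pair_assignment = matched_pairs(twins, seed=7)

for a, b in twins:
    print(a, "->", pair_assignment[a], "|", b, "->", pair_assignment[b])
```

In a before-and-after design on the same person there is nothing to randomize in this way, since the order of the two measurements is fixed.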

Finally, we cannot always control every factor whose effect we do not care about but that we suspect might affect our response variable or the factors affecting our response variable. For example, customers with young children have different purchasing habits than those without. Perhaps men and women will respond differently to a treatment. However, these are not factors that can be assigned to subjects. Factors like these may account for some of the variation in the response in experiments because subjects at different levels may respond differently. So we deal with them by grouping, or blocking, our subjects together and, in effect, analyzing the experiment separately for each block. Such factors are called blocking factors, and their levels are called blocks. Blocking an experiment is like stratifying in survey design.
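A randomized block design can be sketched as "group first, then randomize within each group." In the Python sketch below, the function name `randomized_block`, the customer records, and the two offers are hypothetical; the blocking factor (whether a customer has young children) is taken from the example in the paragraph above.

```python
import random

def randomized_block(units, block_of, treatments, seed=None):
    """Group units into blocks, then randomize treatments within each block."""
    rng = random.Random(seed)
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of(unit), []).append(unit)
    design = {}
    for block, members in blocks.items():
        rng.shuffle(members)  # randomization happens separately inside each block
        design[block] = {t: [] for t in treatments}
        for i, unit in enumerate(members):
            design[block][treatments[i % len(treatments)]].append(unit)
    return design

# Hypothetical customers: (name, has young children?)
customers = [
    ("c1", True), ("c2", True), ("c3", True), ("c4", True),
    ("c5", False), ("c6", False), ("c7", False), ("c8", False),
]

design = randomized_block(
    customers,
    block_of=lambda customer: "children" if customer[1] else "no children",
    treatments=["offer A", "offer B"],
    seed=3,
)

for block, groups in design.items():
    for treatment, members in groups.items():
        print(block, "|", treatment, "|", sorted(name for name, _ in members))
```

The blocking factor is never assigned by the researcher; only the treatment is randomized, and the results for each block can then be compared separately.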

Example 4

An Internet sales site randomly sent customers to one of three versions of its welcome page. It recorded how long each visitor stayed on the site. Additionally, analysts want to know if customers who came directly to the site (by typing in the URL) behave differently than those who were referred to the site from other sources (such as search engines). They decide to block by how the customers arrived. Draw a diagram of their experimental design.

Solution