## 1. Intro

So far you’ve learned how to appropriately obtain meaningful data, how to make sense of that data through distributions or numerical summaries, and how to determine the probability of seeing certain outcomes.

From now on, we’re going to use sample data to draw conclusions about the larger population from which the sample was taken. If you spend some time thinking about this, it’s pretty cool that we can make very educated guesses about a characteristic of a population just by looking at a small sample of individuals from that population. We won’t be computing exact values (because that would entail knowing results from the entire population); rather, we’ll be computing a range of possible values for the population. All of our inference methods are based on one extremely important fact: data varies. Therefore, we’ll never know with certainty the exact value of a population’s characteristic. But with the methods of inference, we’ll take the variation into account and create a range of possible values.

Using the idea of the law of large numbers, we do know that if we increase the size of our sample, our sample mean $\bar{x}$ will tend to get closer and closer to the true, unknown population mean $\mu$. But there is a trade-off: although a larger sample gives a more accurate estimate of the population mean, obtaining a larger sample requires more time and money to collect, record, and summarize the data.
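To see the law of large numbers in action, here is a minimal simulation sketch. The population (normal, with mean 128 and standard deviation 6) and the helper name `sample_mean` are assumptions made up for illustration; they do not come from any real data set.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

# Hypothetical population (an assumption for illustration only):
# values are normally distributed with mean 128 and standard deviation 6.
POP_MEAN = 128.0
POP_SD = 6.0


def sample_mean(n):
    """Draw a random sample of size n and return its sample mean."""
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(n)]
    return statistics.fmean(sample)


for n in (10, 100, 1_000, 10_000):
    x_bar = sample_mean(n)
    error = abs(x_bar - POP_MEAN)
    print(f"n = {n:>6}: sample mean = {x_bar:8.3f}, distance from true mean = {error:.3f}")
```

The distances generally shrink as the sample size grows, although any single run can bounce around a little because each sample is random.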

## 2. Sampling Distribution of the Sample Mean

The most important idea of this unit is that of the sampling distribution of the sample mean. To help understand this, think about the following example:

Example

Suppose I want to estimate the average height of all eight-year-old girls. I can proceed by randomly selecting 100 eight-year-old girls, computing the sample mean $\bar{x}_1$ of the 100 heights, and then using $\bar{x}_1$ as my estimate of the average height of all eight-year-old girls. This is an example of using the sample mean to estimate the population mean.

Now suppose you conduct your own random sample of 100 eight-year-old girls and compute the mean $\bar{x}_2$ of your sample. You would use your average value $\bar{x}_2$ as the estimate of the average height of all eight-year-old girls. Almost surely my sample mean $\bar{x}_1$ and your sample mean $\bar{x}_2$ will be different values, since data varies. My random sample will consist of different girls than your sample, which will lead to different estimates of the true (yet still unknown) average height.

Since our two sample means are different, we decide to work together to conduct one last random sample of 100 eight-year-old girls. We compute the mean $\bar{x}_3$ of the sample. What do you think will happen? Most likely $\bar{x}_3$ will be different from both $\bar{x}_1$ and $\bar{x}_2$! In other words, we now have yet another estimate of the average height of all eight-year-old girls. In fact, each time we sample a different group of 100 girls, we will likely get a different result. This means that the sample mean $\bar{x}$ is a random variable!

There is only one true average height of all eight-year-old girls… but we’d need to get this information from the entire population of eight-year-old girls. Here’s the real question: How is it that we can use different values of the sample average as estimates for the one, true average height of all eight-year-old girls? The answer to this question reveals the beauty of the sampling distribution of the sample mean. Because the sample mean $\bar{x}$ is a random variable, the sample mean itself has a mean and a standard deviation, and thus has a probability distribution.

If we took many random samples of 100 eight-year-old girls, computed the average for each, and then created a distribution of our average values, we would see a stunning picture… our distribution of sample average values will be a normal distribution!
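The repeated-sampling idea above can be sketched in a few lines of Python. The height population used here (normal, mean 128 cm, standard deviation 6 cm) is a hypothetical stand-in, and the function names are illustrative only. The sketch checks one fingerprint of a normal distribution: roughly 68% of values within one standard deviation of the center and roughly 95% within two.

```python
import random
import statistics

random.seed(1)  # fixed seed so repeated runs give the same numbers

# Hypothetical population of heights in cm (assumed values, not from the text).
POP_MEAN = 128.0
POP_SD = 6.0
SAMPLE_SIZE = 100      # girls per sample, as in the example
NUM_SAMPLES = 2_000    # how many times the sampling is repeated


def draw_sample_means(num_samples, sample_size):
    """Return a list containing one sample mean per simulated sample."""
    means = []
    for _ in range(num_samples):
        sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means


def fraction_within(values, center, spread, k):
    """Fraction of values lying within k * spread of center."""
    inside = sum(1 for v in values if abs(v - center) <= k * spread)
    return inside / len(values)


means = draw_sample_means(NUM_SAMPLES, SAMPLE_SIZE)
center = statistics.fmean(means)
spread = statistics.stdev(means)

print(f"mean of the sample means: {center:.2f}")
print(f"SD of the sample means:   {spread:.2f}")
# A normal distribution puts about 68% and 95% of its values within
# 1 and 2 standard deviations of its mean, respectively.
print(f"within 1 SD: {fraction_within(means, center, spread, 1):.1%}")
print(f"within 2 SD: {fraction_within(means, center, spread, 2):.1%}")
```

Notice also that the sample means cluster far more tightly than individual heights would: their spread comes out much smaller than the assumed population standard deviation of 6.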

### The Mean and Standard Deviation of the Sampling Distribution of the Sample Mean

Suppose the random variable $X$ has a normal distribution $N(\mu, \sigma)$. We need some new notation for the mean and standard deviation of the distribution of sample means, simply to differentiate them from the mean and standard deviation of the distribution of individual values. Denote the mean of the distribution of sample means by $\mu_{\bar{x}}$ and the standard deviation of the distribution of sample means by $\sigma_{\bar{x}}$.

The following figure illustrates a surprising property of sampling distributions of sample means: no matter what the distribution of individual values looks like, if we make a histogram of many sample means, that histogram will almost always be approximately normal! The value $n$ represents the size of each sample drawn from the original population (for example, we used $n = 100$ in the above example of eight-year-old girls’ heights).

Image from: Nature Methods 10, 809–810 (2013); doi:10.1038/nmeth.2613. Published online 29 August 2013.

If you pay attention to the shapes of the distribution of sample means from the figure, you should notice two things:

1. The distribution of sample means seems to have the same center as the distribution of individual values, and
2. The variation in the distribution of sample means is much less than the original distribution. (In other words, the distribution of sample means is much skinnier than the distribution of individual values.)

The first point tells us that the mean of the distribution of sample means, denoted $\mu_{\bar{x}}$, is the same as the mean of the individual values, $\mu$! Therefore, $\mu_{\bar{x}} = \mu$.

What about the second point? The process of averaging takes many individual values that are spread out and reduces them to one value, namely the sample mean $\bar{x}$. Much of the variation in the individual values is averaged away! Therefore, it should make sense that $\sigma_{\bar{x}} < \sigma$ whenever the sample contains more than one value. In fact, due to calculations and theory that go beyond this course, it turns out that

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}},$$

where $n$ is the size of the random sample.

Important Aside Note: You can see how the law of large numbers operates now… as the size $n$ of your sample increases, the standard deviation of the distribution of sample means decreases. (A fraction with a fixed numerator and an increasingly large denominator becomes a very small fraction.) In other words, as $n$ increases, the value $\sigma/\sqrt{n}$ gets smaller, and thus $\bar{x}$ is algebraically “forced” to fall closer and closer to the true population mean $\mu$.
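Both facts about the distribution of sample means (its center matches the population mean, and its standard deviation is the population standard deviation divided by the square root of the sample size) can be checked empirically. The sketch below assumes a hypothetical normal population with mean 50 and standard deviation 12; those values and the function name `summarize_sample_means` are illustrative choices, not part of the course material.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so repeated runs give the same numbers

# Hypothetical population parameters (assumptions for illustration).
MU = 50.0
SIGMA = 12.0


def summarize_sample_means(n, num_samples=2_000):
    """Simulate num_samples samples of size n.

    Returns (mean of the sample means, standard deviation of the sample means).
    """
    means = [
        statistics.fmean([random.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(num_samples)
    ]
    return statistics.fmean(means), statistics.stdev(means)


for n in (4, 25, 100):
    sim_mean, sim_sd = summarize_sample_means(n)
    theory_sd = SIGMA / math.sqrt(n)
    print(
        f"n = {n:>3}: mean of sample means = {sim_mean:6.2f}, "
        f"simulated SD = {sim_sd:5.2f}, sigma/sqrt(n) = {theory_sd:5.2f}"
    )
```

For each sample size, the simulated standard deviation should land close to the theoretical value, and both shrink as the sample size grows.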

Putting the above two points together tells us that the distribution of sample means, $\bar{x}$, follows the normal distribution:

$$\bar{x} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$

## 3. The Central Limit Theorem (CLT)

We saw above that if the individual values from a data set follow a normal distribution $N(\mu, \sigma)$, then the distribution of sample means also has a normal distribution, but with a smaller standard deviation, $\sigma/\sqrt{n}$. What if the set of individual data values does not come from a normal distribution? Another important idea taken from the above picture is the Central Limit Theorem (CLT), which states that as the sample size $n$ increases, the sampling distribution of $\bar{x}$ becomes approximately normal. Therefore, even if the individual data values come from a distribution that is skewed, by averaging enough values in each sample, the distribution of sample means will become approximately normal.
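A small simulation can make the CLT concrete. The sketch below uses an exponential distribution as an assumed example of a strongly right-skewed population, and measures skewness (0 for a perfectly symmetric distribution such as the normal) of the sample means for several sample sizes. The helper names are illustrative.

```python
import random
import statistics

random.seed(3)  # fixed seed so repeated runs give the same numbers


def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric distribution)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


def sample_means_from_skewed(n, num_samples=5_000):
    """Means of num_samples samples of size n from an exponential population.

    The exponential distribution (rate 1) is an assumed stand-in for
    'a skewed population'; it has mean 1 and a long right tail.
    """
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]


for n in (1, 5, 30, 100):
    means = sample_means_from_skewed(n)
    print(f"n = {n:>3}: skewness of sample means = {skewness(means):.2f}")
```

With a sample size of 1 the "sample means" are just individual values, so the skewness is large. As the sample size increases, the skewness drifts toward 0, which is what the CLT predicts.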

Example (from Fundamentals of Statistics, by Sullivan)

This problem deals with the non-preventable contamination of food with certain particles. The Food and Drug Administration (FDA) sets acceptable levels of foreign substances that end up in our food and drink. For example, the acceptable level for insect fragments in peanut butter is 3 fragments per 10 grams. Suppose a random sample of $n = 50$ ten-gram portions of peanut butter is collected and it is found that $\bar{x} = 3.6$ for the 50 portions.

1. We can be sure that the sampling distribution of the sample mean $\bar{x}$ will be approximately normal because the sample size $n = 50$ is rather large. See the CLT for more information.
2. If we know that the mean and standard deviation of the individual data values are $\mu = 3$ and $\sigma = \sqrt{3}$, then the mean and standard deviation of the sample mean $\bar{x}$ will be $\mu_{\bar{x}} = 3$ and $\sigma_{\bar{x}} = \sqrt{3}/\sqrt{50} \approx 0.245$. Thus, the sampling distribution of sample means is approximately normally distributed according to $N(3, 0.245)$.
3. Since our sample of size $n = 50$ results in a sample mean of 3.6, we want to know the probability of seeing such a sample mean value or larger; in symbols:

$P(\bar{x} \geq 3.6)$

First we need to sketch a normal curve with the mean and standard deviation values from part 2. Notice how tiny the area is in the upper (right) tail. To compute this probability, we first need to convert the value 3.6 into a standard normal value using the Z-transformation:

$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{3.6 - 3}{0.245} \approx 2.45$$

Remember, this z-value tells us that our sample mean result from the 50 portions is a value that falls 2.45 standard deviations ABOVE the intended target mean of $\mu = 3$.

Using a table of standard normal values, we find that with z = 2.45, the associated area (or probability) is 0.9929. Does this seem correct? No! Since our area is to the right of the z-score, we need to subtract 0.9929 from 1. Therefore:

$P(\bar{x} \geq 3.6) = 0.0071 = 0.71\%$

In other words, if the true mean really is 3, then in only about 71 out of 10,000 such samples would we expect 50 ten-gram portions of peanut butter to produce a sample average of 3.6 or more insect fragments. This is very unusual. In fact, it’s so unusual that we would hardly ever expect it to occur by chance alone. In other words, something could be very wrong at the plant responsible for packaging the peanut butter!
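The arithmetic in this example can be reproduced without a printed table. The sketch below uses `NormalDist` from Python's standard `statistics` module; the variable names are illustrative, and the inputs are the values given in the example.

```python
from math import sqrt
from statistics import NormalDist

mu = 3.0          # population mean from the example
sigma = sqrt(3)   # population standard deviation from the example
n = 50            # number of ten-gram portions in the sample
x_bar = 3.6       # observed sample mean

# Standard deviation of the sampling distribution of the sample mean.
standard_error = sigma / sqrt(n)

# Z-transformation of the observed sample mean.
z = (x_bar - mu) / standard_error

# Area to the right of z under the standard normal curve.
p_upper = 1 - NormalDist().cdf(z)

print(f"standard error = {standard_error:.3f}")
print(f"z              = {z:.2f}")
print(f"P(mean >= 3.6) = {p_upper:.4f}")
```

The computed probability comes out at roughly 0.007, in line with the table-based answer. Any small difference in the last digit arises because the table lookup rounds z to two decimal places first.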

Notice how the work in part 3 differs from the work we did in sub-competency 3. At that point we wanted to know about the probability of finding one individual data value that satisfied some condition. In part 3 we are asking about the probability of taking a random sample of size $n = 50$ and having the average of the sample be larger than a specified value. What do you think is harder to find: one value that satisfies a condition, or an average of a bunch of values that satisfies the condition?
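One way to explore that question is to compute the same upper-tail probability for different sample sizes. The sketch below is a rough comparison only: treating a single value (a sample of size 1) as normally distributed is an assumption made purely for illustration, since real fragment counts are whole numbers. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

MU = 3.0          # population mean from the peanut butter example
SIGMA = sqrt(3)   # population standard deviation from the example
THRESHOLD = 3.6


def prob_mean_at_least(threshold, n):
    """P(sample mean >= threshold) for samples of size n, using N(MU, SIGMA / sqrt(n)).

    With n = 1 this treats a single value as normally distributed, which is
    an assumption made only for comparison (real counts are whole numbers).
    """
    return 1 - NormalDist(MU, SIGMA / sqrt(n)).cdf(threshold)


for n in (1, 10, 50):
    print(f"n = {n:>2}: P(mean >= {THRESHOLD}) = {prob_mean_at_least(THRESHOLD, n):.4f}")
```

Under these assumptions, a single value at or above 3.6 is fairly common, while an average of 50 values at or above 3.6 is rare. Averages vary less than individual values, so extreme averages are harder to come by.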

## 4. Distribution of a Sample Proportion

In this part you will learn about the distribution of sample proportions. In a sense, this closely mirrors the sampling distribution of the sample mean. Whereas the mean of a population is obtained by averaging the values of interest, a proportion is simply the fraction of a population that has a certain characteristic. The value of a proportion must, therefore, fall between 0 and 1, inclusive: $0 \leq p \leq 1$.

Our goal, once again, is to use our sample data to predict something about the population. In this case, we’ll use the proportion of our sample that satisfies some specific condition to predict the overall proportion of the population that satisfies the same condition. We denote the population proportion by $p$, and the sample proportion by $\hat{p}$.

### Calculating a Sample Proportion

To calculate the value of $\hat{p}$ from a sample of size $n$, simply count the number of people, $x$, in the sample that satisfy the required condition and divide by the size of the sample, $n$. In symbols:

$$\hat{p} = \frac{x}{n}$$

### The Sampling Distribution of the Sample Proportion

Just as with the sample mean, the larger our sample size, the closer $\hat{p}$ will tend to be to the true population proportion $p$. But since there is randomness in every sample obtained, the value of $\hat{p}$ will vary from sample to sample. Thus, $\hat{p}$ is a random variable, and must have a mean and standard deviation. It turns out that the distribution of the sample proportion will be approximately normal (as long as the sample size is large enough) with mean (or expected value) $p$ (the true population proportion) and standard deviation:

$$\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$$

In symbols, the distribution of the sample proportion $\hat{p}$ is approximately normal with distribution

$$\hat{p} \sim N\!\left(p, \sqrt{\frac{p(1 - p)}{n}}\right)$$

It turns out this distribution of the sample proportion holds only when the sample size satisfies an important size requirement, namely that the sample size $n$ be less than or equal to 5% of the population size, $N$. So $n \leq 0.05 \cdot N$. Although important, in this class we will not focus on this result.
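The behavior of the sample proportion can be checked by simulation: sample proportions should center on the true proportion p, with a standard deviation close to the square root of p(1 − p)/n. In the sketch below, the true proportion 0.30, the sample size 200, and the function name `sample_proportion` are all hypothetical choices for illustration.

```python
import math
import random
import statistics

random.seed(4)  # fixed seed so repeated runs give the same numbers

P_TRUE = 0.30        # hypothetical population proportion (assumed)
N = 200              # sample size
NUM_SAMPLES = 2_000  # number of simulated samples


def sample_proportion(p, n):
    """Simulate n individuals, each having the characteristic with probability p.

    Returns the sample proportion x / n, where x counts the individuals
    who have the characteristic.
    """
    x = sum(1 for _ in range(n) if random.random() < p)
    return x / n


p_hats = [sample_proportion(P_TRUE, N) for _ in range(NUM_SAMPLES)]
theory_sd = math.sqrt(P_TRUE * (1 - P_TRUE) / N)

print(f"mean of sample proportions: {statistics.fmean(p_hats):.4f} (true p = {P_TRUE})")
print(f"SD of sample proportions:   {statistics.stdev(p_hats):.4f} (theory = {theory_sd:.4f})")
```

Each individual sample proportion differs from 0.30, but collectively they center on it, and their spread lands close to the theoretical standard deviation.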

Example (from Fundamentals of Statistics, by Sullivan)

In a survey, 500 parents were asked about the importance of sports for both boys and girls. Of the parents interviewed, 60% agreed that the genders are equal and should have equal opportunities to participate in sports. Describe the sampling distribution of the sample proportion $\hat{p}$ of parents who agree that the genders are equal and should have equal opportunities.

We will assume that the sample of 500 represents a random sample of the parents of all boys and girls in the United States. The true proportion in the population is equal to some unknown value $p$. The sampling distribution of $\hat{p}$ can be approximated by a normal distribution with distribution

$$N\!\left(p, \sqrt{\frac{p(1 - p)}{500}}\right)$$

Even though we do not know the exact value of $p$, we can use $\hat{p} = 0.6$ as a good estimator of $p$ to approximate the standard deviation of the distribution of the sample proportion:

$$\sqrt{\frac{0.6(1 - 0.6)}{500}} \approx 0.022$$

The distribution of the sample proportion $\hat{p}$ is approximately normal with distribution $N(0.6, 0.022)$.
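The numbers in this example can be reproduced with a short calculation. The sketch below also goes one small step beyond the example, as an illustration only: it reports the range that would contain the middle 95% of sample proportions under the approximate normal distribution. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 500        # number of parents surveyed
p_hat = 0.60   # sample proportion, standing in for the unknown p

# Approximate standard deviation of the sample proportion.
standard_error = sqrt(p_hat * (1 - p_hat) / n)

# Approximate sampling distribution of the sample proportion.
sampling_dist = NormalDist(p_hat, standard_error)

# Range covering the middle 95% of this approximate normal distribution.
low = sampling_dist.inv_cdf(0.025)
high = sampling_dist.inv_cdf(0.975)

print(f"standard error is about {standard_error:.3f}")
print(f"middle 95% of sample proportions: {low:.3f} to {high:.3f}")
```

The standard error rounds to 0.022, matching the hand calculation, and the middle 95% range spans only a few percentage points on either side of 0.60.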