Part 2: The Level of Confidence C

What does a “level of confidence” actually represent? Suppose we have a process for calculating confidence intervals with a 90% level of confidence. Obtain a series of 50 random samples, and then apply our process to the data from each random sample to obtain a confidence interval for each. Then we would expect that 90% of those 50 confidence intervals, or about 45 intervals, would contain the true, yet still unknown, population proportion.
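The repeated-sampling idea can be tried out with a small simulation. The sketch below is only an illustration: the true proportion `p = 0.3`, the sample size `n = 200`, and the seed are made-up values, and each interval is built with the formula developed later in this section.

```python
import math
import random
from statistics import NormalDist

def coverage(p=0.3, n=200, level=0.90, reps=50, seed=1):
    """Fraction of simulated confidence intervals that contain the true p."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # critical value for this level
    hits = 0
    for _ in range(reps):
        # Draw one random sample of size n and count the successes.
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p <= p_hat + margin:
            hits += 1
    return hits / reps

print(coverage(reps=50))    # roughly 0.90; changes with the seed
print(coverage(reps=5000))  # settles nearer 0.90 with many repetitions
```

With only 50 samples the observed fraction can easily be a few intervals above or below 45; the 90% figure describes the long-run behavior of the process.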

There are two quick mathematical facts to recall. First, one way to represent an interval of numbers is using the parentheses notation: (a, b). For example, the interval (2, 3) is a representation for ALL numbers that are bigger than 2 yet less than 3.

Second, recall the empirical rule associated with a normal distribution. Because of the way that normally distributed data lie around the population mean, we know that approximately 68% of the data values lie within 1 standard deviation of the middle of the distribution. Now remember, and this is important, we are no longer looking at the distribution of just individual data values. We are looking at a distribution of sample proportions p̂. We know that the standard deviation of the distribution of p̂ is

$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$

Combining this with the empirical rule tells us:

  1. 68% of the time, p̂ will lie within 1 standard deviation
    $$\sqrt{\frac{p(1-p)}{n}}$$
    of the population proportion p. In symbols, 68% of the time, p̂ will fall in the interval:
    $$\left(p - \sqrt{\frac{p(1-p)}{n}},\; p + \sqrt{\frac{p(1-p)}{n}}\right)$$
  2. 95% of the time, p̂ will lie within 2 standard deviations
    $$2 \cdot \sqrt{\frac{p(1-p)}{n}}$$
    of the population proportion p. In symbols, 95% of the time, p̂ will fall in the interval:
    $$\left(p - 2\sqrt{\frac{p(1-p)}{n}},\; p + 2\sqrt{\frac{p(1-p)}{n}}\right)$$
  3. 99.7% of the time, p̂ will lie within 3 standard deviations
    $$3 \cdot \sqrt{\frac{p(1-p)}{n}}$$
    of the population proportion p. In symbols, 99.7% of the time, p̂ will fall in the interval:
    $$\left(p - 3\sqrt{\frac{p(1-p)}{n}},\; p + 3\sqrt{\frac{p(1-p)}{n}}\right)$$
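As a quick check of the 68%, 95%, and 99.7% figures, the following sketch computes the area under a standard normal curve within k standard deviations of the mean. It uses Python's standard-library `NormalDist`; the helper name `prob_within` is just an illustrative choice.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def prob_within(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The exact areas are about 0.6827, 0.9545, and 0.9973, which the empirical rule rounds to 68%, 95%, and 99.7%.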

Back in sub-competency 3 we learned how to identify which value(s) of the standard normal distribution separate off specific regions (i.e., areas = probabilities) under the normal curve. We’ll use the same techniques here. Recall that the total area under a normal distribution is 1 (100%). If we want to have a probability C of obtaining a value around the population proportion, that means there is a probability of (1 – C) of falling outside of that central region. Due to the symmetry of the normal distribution, half of that probability (1 – C) must occur in each tail:

[Figure: normal curve with central area C and area (1 – C)/2 in each tail]

We call the z-value that separates the central area C from the two tails the critical value and denote it by the symbol zα/2, where α = 1 – C. Remember that a z-value indicates a number of standard deviations from the mean. Make sure you are familiar with these.
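Critical values can be looked up in a table or computed with software. A minimal sketch, assuming Python's standard-library `NormalDist` (the function name `z_critical` is hypothetical):

```python
from statistics import NormalDist

def z_critical(level):
    """z-value that leaves area (1 - level)/2 in each tail of the standard normal."""
    alpha = 1 - level
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_critical(level), 3))
```

For the most common levels of confidence this gives approximately 1.645 (90%), 1.960 (95%), and 2.576 (99%).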

The General Formula for a Level C Confidence Interval about a Population Proportion

Putting together the preceding discussions gives us the formula for creating a confidence interval for a population proportion:

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Keep in mind, the reason we create an interval of confidence is that we know our point estimate is likely to not be the same as the true population proportion. Therefore, we “hope,” with level C = (1 – α) ⋅ 100% confidence, that by allowing for some error on either side of our point estimate, our resulting interval contains the true value of the population proportion. In symbols:

$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \;<\; p \;<\; \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

It is important to note that this formula does not say that the true population proportion falls directly in the middle of the interval; it could be anywhere in the interval.

Example 1:

Suppose a sample of 410 randomly selected radio listeners revealed that 48 listened to WXQI. To obtain an interval estimate of the proportion of all listeners who listen to WXQI, we must first specify the amount of confidence to be placed in the interval. Suppose we desire 95% confidence.

Solution

$$\hat{p} = \frac{48}{410} \approx 0.117, \qquad \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{(0.117)(0.883)}{410}} \approx 0.0159$$

Note that the sample proportion p̂ is used in place of p in the computation of the standard deviation. For any realistic problem, this will always be the case. Fortunately, unless p̂ and p are far apart, the value of the standard deviation will not be greatly affected.

Computing the confidence interval results in:

0.117 ± 1.96 (0.0159) = 0.117 ± 0.03116

This gives the confidence interval 0.0858 to 0.1482.

We are 95% confident in the procedure that created this interval. Another interpretation would be that we are 95% confident that the point estimate, .117, has a maximum error of estimation of .03116. A maximum error of only .03116 with 95% confidence suggests a rather high level of accuracy in the estimation of the proportion.
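The computation in Example 1 can be reproduced with a short sketch. The function name `proportion_ci` is hypothetical, and the critical value comes from Python's standard-library `NormalDist` rather than from a table, so the endpoints differ from the hand calculation in the fourth decimal place because nothing is rounded along the way.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, level):
    """Level-C confidence interval for a population proportion."""
    p_hat = successes / n                            # point estimate
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)    # critical value
    std_dev = math.sqrt(p_hat * (1 - p_hat) / n)     # uses p_hat in place of p
    margin = z * std_dev
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(successes=48, n=410, level=0.95)
print(round(low, 4), round(high, 4))  # close to 0.0858 and 0.1482
```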

The Margin of Error

The margin of error comes up in many problems. The margin of error, E, in a level C = (1 – α) ⋅ 100% confidence interval for a population proportion is

$$E = z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

By studying this formula, you will see how the margin of error is influenced by three factors: (1) the level of confidence C, (2) the sample size n, and (3) the standard deviation of the population, which is estimated by

$$\sqrt{\hat{p}(1-\hat{p})}$$

Changing the level of confidence C changes the value of zα/2. Increasing the sample size n causes a decrease in the margin of error, while a decrease in the sample size n increases the margin of error.
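To see these effects numerically, the sketch below evaluates the margin of error for a few sample sizes and levels of confidence. The values p̂ = 0.5 and the particular n's are arbitrary illustrations, and `margin_of_error` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def margin_of_error(p_hat, n, level):
    """Margin of error E for a level-C interval about a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Effect of the sample size at a fixed 95% level of confidence.
for n in (100, 400, 1600):
    print(n, round(margin_of_error(0.5, n, 0.95), 4))

# Effect of the level of confidence at a fixed sample size.
for level in (0.90, 0.95, 0.99):
    print(level, round(margin_of_error(0.5, 400, level), 4))
```

Because n sits under a square root, quadrupling the sample size only cuts the margin of error in half.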

Determining the Necessary Sample Size

Often we have the reverse problem, in which we want an experiment to result in an answer with a particular, predetermined accuracy, or error. In this case, we start with a given margin of error E and need to find the sample size n needed to achieve this goal. From a math standpoint, this means we need to solve for the value n in the margin of error formula. Using a little bit of algebra, we can arrive at the solution:

$$n = \hat{p}(1-\hat{p})\left(\frac{z_{\alpha/2}}{E}\right)^2$$

where p̂ is a prior estimate of the population proportion. Remember we are going to determine an appropriate sample size n before doing any sampling. This means we won’t have a value of p̂ to actually use. To be safe, if nothing is known about the population proportion p, we choose p̂ = 0.5. Can you think why this is the “best” strategy? Think of the game where two people are asked to guess a number between 1 and 10, with the person guessing the number closest to the secret number winning. A good strategy is for the first person to choose 5, right in the middle of the interval. More precisely, the product p̂(1 – p̂) is largest when p̂ = 0.5, so this choice gives the largest, and therefore safest, sample size. So in this case where we have no prior information about the population proportion, our value for the sample size n would be:

$$n = 0.25\left(\frac{z_{\alpha/2}}{E}\right)^2$$

The most important note about determining a necessary sample size is that typically your calculations will result in a number with decimal places. Since you cannot sample a portion of an individual, you should always round up to the next whole individual. For example, if you find that n = 462.15 individuals, we must sample n = 463 individuals to obtain our predetermined margin of error. The simple reason for rounding up is that we’d rather over sample than under sample. In other words, it’s better to have a slightly larger sample than to have one that is a bit too small. Remember, more useful information typically comes from larger amounts of data.

After calculating a few sample sizes, you will begin to understand why political polls and almost any poll presented in the media often have a margin of error of 3 or 4 percentage points. Since it takes a large sample (n = 16,577) in order to be 99% confident within 1 percentage point, the 3 or 4 percentage points margin of error targets are good compromises between accuracy and cost-effectiveness. For a margin of error of 3 percentage points, a polling institution only has to obtain a random sample of (roughly):

$$n = 0.25\left(\frac{z_{\alpha/2}}{0.03}\right)^2 \approx 1844 \quad \text{(at 99\% confidence)}$$

Isn’t it odd (but really cool!) that a sample of size 1844 individuals is large enough to give us fairly accurate information about the entire population?
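The sample-size calculation, including the round-up step, can be sketched as follows. The function name and the optional `z` argument are assumptions made for illustration; by default the critical value comes from Python's `NormalDist`, while passing `z=2.575` mimics a rounded table value.

```python
import math
from statistics import NormalDist

def sample_size_proportion(margin, level, p_prior=0.5, z=None):
    """Smallest whole sample size that achieves the given margin of error."""
    if z is None:
        z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    n = p_prior * (1 - p_prior) * (z / margin) ** 2
    return math.ceil(n)  # always round up to the next whole individual

print(sample_size_proportion(0.03, 0.99))           # 3 percentage points
print(sample_size_proportion(0.01, 0.99, z=2.575))  # 1 percentage point, table z
```

The first call reproduces the 1844 quoted above. The 16,577 figure corresponds to the rounded table value z = 2.575; a more precise critical value gives a slightly larger n.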

A Confidence Interval for a Population Mean

In this section we focus on how to create a confidence interval for the population mean μ. Recall that the process of creating a confidence interval requires two steps:

  • Obtain a point estimate of the population parameter of interest. (This time we’ll focus on the population mean.)
  • Quantify the accuracy and precision of the point estimate using the standard deviation of the distribution of sample means. This step mathematically creates a confidence level in our point estimate.

Again, the first step is easy. The most obvious point estimate for the population mean μ is the mean from a sample taken from that population, which we denote by x̄. However, the chances of a small sample giving us precise information about the population are extremely low. In other words, there is almost no chance that our sample mean is exactly equal to the unknown population mean. Therefore, we will use the sampling distribution of the sample mean to quantify the accuracy and precision of our point estimate x̄.

A confidence interval for the population mean μ should look the same as it does for a population proportion:

point estimate ± margin of error

As the point estimate of the population mean is x̄, our confidence interval will look like:

x̄ ± margin of error

Our goal for the rest of this section is to figure out, mathematically, what the margin of error portion of the confidence interval looks like. Unfortunately, it’s not as “easy” as the confidence interval for a population proportion.

Previously we learned that the sample mean x̄ has a normal distribution with a center of μ and a standard deviation of σ/√n. In symbols,

$$\bar{x} \sim N\left(\mu,\; \frac{\sigma}{\sqrt{n}}\right)$$

If you look at the formula, you should notice a pretty significant problem for us. In order to figure out how spread out our distribution of x̄ values will be, we need to know the population standard deviation σ. How likely is it that we would not know the population mean μ yet know the exact value of the population standard deviation σ? This is basically impossible. The question now becomes: what value can we use as a proxy, or estimate, of σ if all we have at our disposal is a sample mean x̄ and a sample size n?

Hopefully you could guess that the sample standard deviation s will be our stand-in for the population standard deviation σ in the formula above. That seems easy enough. But there is another slight problem, based on very deep statistical theories that we will not cover in this class. The basic problem arises from the computations of z-scores. You should recall that z-scores, which are calculated using the formula

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

follow a normal distribution with mean 0 and standard deviation 1: N(0,1). We relied heavily on these “critical values,” denoted by zα/2, in our creation of confidence intervals for a population proportion.

Unfortunately, the values we obtain by simply replacing σ with s

$$\frac{\bar{x} - \mu}{s/\sqrt{n}}$$

do not follow a normal distribution! Therefore, our confidence interval for a population mean will not simply be:

$$\bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}$$

It turns out that the collection of values

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

follows what is called the Student’s t-distribution. This result was discovered by an employee of the Guinness brewing company. Versions of the story differ on why it’s called “Student’s” t-distribution instead of naming it after the actual employee (William S. Gosset).

It seems crazy that one little change to the formula can completely change its distribution. Just remember, statistics is all about information: the more we know, the better. When the sample size n is very large, the sample is likely to contain elements representative of the whole population. In this case, the sample standard deviation, s, would be a very good estimate of σ. But when the sample size n is small in relation to the size of the population, s is likely just a mediocre estimate of the true population standard deviation. Therefore, the distribution of values

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

must change as the sample size n changes.
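A small simulation makes the effect of replacing σ with s visible. The sketch below draws many samples from a standard normal population (an arbitrary choice, with μ = 0 and σ = 1), computes (x̄ – μ)/(s/√n) for each, and reports how often the result lies beyond ±1.96. If these values were normally distributed, that fraction would be about 5%.

```python
import math
import random
from statistics import mean, stdev

def tail_fraction(n=5, cutoff=1.96, reps=10000, seed=1):
    """Fraction of simulated (x_bar - mu) / (s / sqrt(n)) values beyond +/- cutoff."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    extreme = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        # s is the sample standard deviation, standing in for sigma.
        t = (mean(sample) - mu) / (stdev(sample) / math.sqrt(n))
        if abs(t) > cutoff:
            extreme += 1
    return extreme / reps

print(tail_fraction(n=5))   # noticeably more than 0.05
print(tail_fraction(n=50))  # much closer to 0.05
```

For small n the extreme values occur far more often than 5% of the time, which is why z critical values would make the interval too narrow.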

Properties of the t-distribution

The following are several properties of the Student’s t-distribution:

  • Just like the normal distribution, it is centered at 0 and symmetric about 0.
  • Just like the normal curve, the total area under the Student’s t curve is 1.
  • The area to the left of 0 is 1/2, and the area to the right of 0 is also 1/2.
  • Just like the normal curve, as values for t increase, the Student’s t curve gets close to, but never reaches, 0.

There are a few key differences, however:

  • The number of degrees of freedom, (n – 1), is crucial for the t-distributions.
  • Unlike the normal distribution, there are many different “standard” t-distributions:
      - There is a “standard” t-distribution with 1 degree of freedom
      - There is a “standard” t-distribution with 2 degrees of freedom
      - There is a “standard” t-distribution with 3 degrees of freedom, etc.

  • The spread of the t-distributions is a bit greater than that of the standard Normal curve (i.e., the t curve is slightly “fatter”). This means that there is more variation in t-values than z-values.
  • As the number of degrees of freedom increases (i.e., the sample size gets larger), the Student’s t-distribution becomes closer and closer to the normal distribution.

Determining t-Values

What is important to keep in mind is that the critical value tα/2, used to determine the number of standard deviations (s/√n) to add and subtract from our point estimate of the population mean, changes as our sample size changes. As n increases, it turns out that tα/2 becomes closer and closer to zα/2. But if our sample size is pretty small, especially when compared with the size of the population, tα/2 will be much larger than zα/2. This makes sense because if our sample size is pretty small, we have relatively little information about the population, and therefore our point estimate x̄ is not likely to be a very good estimate of the population mean μ. Because of this, we’ll need our margin of error to be quite a bit larger, in the hopes that our confidence interval captures the true mean.
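The way the t critical value shrinks toward the z critical value can be checked numerically. Python's standard library has no t-distribution, so the sketch below is a rough stand-in, not a production routine: it integrates the Student's t density with Simpson's rule and finds the critical value by bisection. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def t_cdf(t, df, steps=1000):
    """Area to the left of t (for t >= 0): 1/2 plus Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(level, df):
    """Bisection search for the t-value whose central area equals `level`."""
    target = 1 - (1 - level) / 2
    lo, hi = 0.0, 200.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z = NormalDist().inv_cdf(0.975)
for df in (1, 2, 6, 30, 100):
    print(df, round(t_critical(0.95, df), 3), "vs z =", round(z, 3))
```

In practice a statistics package or a t-table gives the same values directly; the point here is only that the gap between t and z closes as the degrees of freedom grow.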

A Confidence Interval for a Population Mean

The new formula for a level C = (1 – α) ⋅ 100% confidence interval for a population mean is:

$$\bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}}$$

which, in interval notation, is:

$$\left(\bar{x} - t_{\alpha/2}\,\frac{s}{\sqrt{n}},\;\; \bar{x} + t_{\alpha/2}\,\frac{s}{\sqrt{n}}\right)$$

The values for tα/2 come from either technology or a table of standard t-values (which can be found anywhere on the Internet), with (n – 1) degrees of freedom. The interpretation of a level C = (1 – α) ⋅ 100% confidence interval is the same as before: we are (1 – α) ⋅ 100% confident that our interval contains the true population mean. That is, if we were to draw samples of size n from the population repeatedly, about (1 – α) ⋅ 100% of the resulting intervals would contain μ.

Example 2:

Given the following data drawn from a normal population with unknown mean and variance, construct a 95% confidence interval for the population mean.

Seven data values have been selected randomly from the population:

25, 19, 37, 29, 40, 28, 31

The sample mean and standard deviation are x̄ = 29.86 and s = 7.08, respectively.
The degrees of freedom associated with the problem is d.f. = n − 1 = 7 − 1 = 6.

The t-value corresponding to 6 degrees of freedom and 95% confidence is given in Table D of Appendix A as t(0.025, 6) = 2.447. The corresponding confidence interval is:

$$29.86 \pm 2.447\left(\frac{7.08}{\sqrt{7}}\right) = 29.86 \pm 6.55$$

Thus, we are 95 percent “confident” that the interval 23.31 to 36.41 will contain the population mean.

An alternate interpretation would be that we are 95% confident that the point estimate (29.86) has a maximum error of estimation of 6.55.
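Example 2 can be checked with a few lines of code. In this sketch the critical value 2.447 is taken from the t-table, as in the worked solution, and passed in as a parameter; the function name `mean_ci` is hypothetical.

```python
import math
from statistics import mean, stdev

def mean_ci(data, t_crit):
    """Confidence interval for a population mean, given a t critical value."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    margin = t_crit * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

data = [25, 19, 37, 29, 40, 28, 31]
low, high = mean_ci(data, t_crit=2.447)  # 6 degrees of freedom, 95% confidence
print(round(low, 2), round(high, 2))     # close to 23.31 and 36.41
```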

Determining the Necessary Sample Size

Often we have the reverse problem, in which we want an experiment to result in an answer with a particular, predetermined accuracy, or error. Using a method similar to the one in Section 9.1, we start with a given margin of error E and use a few algebraic steps to solve for the sample size n needed to achieve our goal:

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2$$

where σ is the population standard deviation or, in practice, a prior estimate of it such as a sample standard deviation s.

The most important note about determining a necessary sample size is that typically your calculations will result in a number with decimal places. Since you cannot sample a portion of an individual, you should always round up to the next whole individual. For example, if you find that n = 462.15 individuals, we must sample n = 463 individuals to obtain our predetermined margin of error. The simple reason for rounding up is that we’d rather over sample than under sample. In other words, it’s better to have a slightly larger sample than to have one that is a bit too small. Remember, more useful information typically comes from larger amounts of data.
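A sketch of this calculation follows, assuming the formula n = (zα/2 · σ / E)² with a prior value for σ. The inputs σ = 2.8 and E = 0.5 are invented purely for illustration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_mean(sigma, margin, level):
    """Smallest whole sample size giving margin of error `margin` for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    n = (z * sigma / margin) ** 2
    return math.ceil(n)  # round up: never sample a fraction of an individual

print(sample_size_mean(sigma=2.8, margin=0.5, level=0.95))
```

With these illustrative inputs the raw result is a little over 120, so the required sample size is 121.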

Confidence Intervals on the TI-83/84

All of these intervals can be found by pressing the [STAT] button and arrowing over to the TESTS menu.

Calculator Example 1: Confidence interval with a z

A sample of 38 items is chosen from a normally distributed population with a population standard deviation of 2.8. The sample mean is 12.5. Construct a 95% confidence interval for the population mean.

Solution:

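We choose [7:ZInterval] since the population standard deviation is known and we are using a z distribution. Enter the information shown in screen 1 below, highlight [Calculate] and press ENTER to get screen 2:

[Screens 1 and 2: the ZInterval input screen with σ = 2.8, x̄ = 12.5, n = 38, and C-Level = .95, and the resulting interval (11.61, 13.39)]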

Note: This is the form of the interval that the calculator gives. (11.61, 13.39) is equivalent to 11.61< µ <13.39.
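The calculator's result can be cross-checked by hand or in code using x̄ ± zα/2 · σ/√n. The sketch below is one way to do so; `z_interval` is a hypothetical helper, not a calculator command.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, level):
    """Confidence interval for a mean when the population sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

low, high = z_interval(x_bar=12.5, sigma=2.8, n=38, level=0.95)
print(round(low, 2), round(high, 2))  # 11.61 13.39
```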

Calculator Example 2: Confidence interval for data with a t

A sample of 7 items is chosen from a normal distribution with the following results: {1,5,6,8,12,16,18}. Construct a 95% confidence interval for the true population mean.

Solution:

Here we are given the actual data from the sample. We can have the calculator do all the work on the sample by entering the data into a list, say L1, as shown in screen 3. Choose [8:TInterval] and enter the information shown in screen 4, highlight [Calculate] and press ENTER to get the result shown in screen 5.

[Screens 3, 4, and 5: the data entered in list L1, the TInterval input screen, and the resulting interval]

Note: Freq stands for frequency, which may be used if many of your data points are repeated. For example, if your data consisted of {1,1,1,2,2,3,4}, you can enter the distinct values in L1 and the frequencies in L2, so L1 = {1,2,3,4} and L2 = {3,2,1,1}. Then you would enter L1 as the list option for the interval and L2 as the Freq. Most often we will use 1 as the Freq, but this option is nice for summary data.
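The relationship between a value list and a frequency list can be illustrated outside the calculator. In this sketch, `expand` is a hypothetical helper that rebuilds the raw data from the two lists, so that ordinary summary statistics can be computed from it.

```python
from statistics import mean, stdev

values = [1, 2, 3, 4]  # distinct data values (the role of L1)
freqs = [3, 2, 1, 1]   # how many times each value occurs (the role of L2)

def expand(values, freqs):
    """Repeat each value according to its frequency to recover the raw data."""
    data = []
    for value, count in zip(values, freqs):
        data.extend([value] * count)
    return data

data = expand(values, freqs)
print(data)                     # [1, 1, 1, 2, 2, 3, 4]
print(mean(data), stdev(data))  # the sample statistics an interval would use
```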

Calculator Example 3: Confidence interval for a proportion
A certain carnival game is won if a blue fish is drawn from a barrel full of blue and red fish. After playing the game 35 times you have won 14 times. Construct a 95% confidence interval for the proportion of blue fish in the barrel.

Solution:
Select [A:1-PropZInt] and enter the information as shown in screen 6, highlight [Calculate] and press ENTER to get the results in screen 7.

[Screens 6 and 7: the 1-PropZInt input screen with x = 14, n = 35, and C-Level = .95, and the resulting interval]

Note: There is also a 1-PropTInt for when you want to use Student’s t distribution instead of a z.