1. Intro

The second type of inferential statistics, hypothesis testing (also called a test of significance), is very important in scientific fields. Basically, you use a hypothesis test when you want to investigate statements (or beliefs) about a characteristic of one or more populations. In this sub-competency you are introduced to the major ideas, computations, and conclusions of hypothesis testing. You will also see the types of errors (or mistakes) that can be made when using hypothesis tests.

An example will illustrate the language and all necessary components of a hypothesis test.

Example: What Is the “Normal” Human Body Temperature?

We all believe that the acceptable, or normal, human body temperature is 98.6°F. How do we know this? Have any of us done our own testing or research to prove this? No! Our parents or friends or doctors told us this. So again how do we know that the normal human body temperature is 98.6°F? The answer is that we don’t know for sure…we take it on faith that 98.6°F is correct. This belief, in statistical terms, is called the null hypothesis, and is denoted by H0. A null hypothesis is an underlying belief about a characteristic of a population.

To test this deeply ingrained belief, i.e., to test the validity of the null hypothesis H0, doctors from the University of Maryland took a sample of n = 106 body temperatures from healthy subjects. Their sample results were x̄ = 98.2°F and s = 0.62°F.

The question to answer now is this:

Do the results from the sample support the null hypothesis or do the results lead us to believe that the normal human body temperature is different from 98.6°F?

The two numbers, 98.6°F and 98.2°F, are extremely close mathematically…they differ by only 0.4°F. But you have to keep the main issue of statistics in mind: data varies. In terms of the variation of the sample results, s = 0.62°F and the size of our sample, n = 106, how close are the two values 98.6°F and 98.2°F?

This is where the z-score will come into play again. We need to determine how different these two values are in terms of the variation and the size of the sample. Recall that the sample mean x̄ has a distribution that is approximately normal with mean μ and standard deviation σ/√n, which we estimate with s/√n. Performing the calculation with this in mind gives us our test statistic:

z = (x̄ − μ) / (s/√n) = (98.2 − 98.6) / (0.62/√106) ≈ −6.64

This means that the two numbers 98.6°F and 98.2°F actually differ by more than 6 standard deviations! Recalling the empirical rule discussed in subcompetency 3, roughly 99.7% of all data falls within 3 standard deviations of the mean. So, if the true population mean human body temperature really is 98.6°F, then our sample produced a result that differs from the true population mean by more than 6 standard deviations. What are the chances this would happen? In other words, how likely is it that such a large sample (n = 106) will produce a result that is so different from what is believed to be the true population mean? We'll learn how to determine the exact probability, or likelihood, of this happening in the next subcompetency.
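The arithmetic behind this test statistic can be checked with a few lines of Python. The numbers 98.2, 98.6, 0.62, and 106 come directly from the example; the variable names are our own:

```python
import math

# Sample results from the University of Maryland study
x_bar = 98.2   # sample mean body temperature (°F)
mu0   = 98.6   # claimed population mean under H0 (°F)
s     = 0.62   # sample standard deviation (°F)
n     = 106    # sample size

# Standard error of the sample mean: s / sqrt(n)
se = s / math.sqrt(n)

# Test statistic: how many standard errors x_bar lies from mu0
z = (x_bar - mu0) / se

print(round(z, 2))  # about -6.64
```

Running this confirms that the sample mean sits more than 6 standard errors below the claimed population mean.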

Finally, what are the conclusions we can draw from our test of the null hypothesis? There are only two choices:

• The true population mean really is 98.6°F and we were just really unlucky to get a sample result that was so different from the true population mean. It turns out, we’ll find later, that the probability of getting a sample result this different from the claimed mean is:

0.00000000142

In other words, we would only expect to get a sample this different roughly 14 times out of every 10 billion samples of size n = 106. Were we just that unlucky? Or…

• The true population mean is not 98.6°F. In fact, since our sample result of x̄ = 98.2°F is so far below 98.6°F, our data is suggesting that the true population mean is likely lower than 98.6°F. We cannot claim that the true population mean is the same as the sample mean, μ = x̄ = 98.2°F (because data varies!!), but our sample does suggest that a normal human body temperature is likely much lower than what is commonly believed.

Our best conclusion statistically would be no. 2, that the true population mean is not 98.6°F. Therefore, our data and results are strongly suggesting that we need to reject the belief that the normal human body temperature is 98.6°F. Statistically we say that, "we reject the null hypothesis H0 because our sample data provides strong evidence that the temperature is different from (and most likely a bit lower than) 98.6°F."

2. Four Parts of a Hypothesis Test

Let’s quickly investigate the four main parts of any hypothesis test:

1. The Null and Alternative Hypotheses

In statistics, a hypothesis is a statement, or assumption, about the characteristics of one or more variables in one or more populations. Since a statement can be either true or false, there are two hypotheses to identify.

• The null hypothesis, denoted by H0, is the statement about a value of a population parameter that we intend to test. Since H0 is the statement we (or someone else) believe to be true, H0 is the statement of “no difference,” and thus always contains the condition of equality. The conclusions of our hypothesis test will be either: “reject H0” or “do not reject H0.” Keep in mind that we always assume the null hypothesis to be true until we get evidence from sample data that suggests otherwise.
• The alternative hypothesis, denoted by Hα (or H1 in other resources), must be what is true if the null hypothesis is false. There are three ways to differ from the value in the null hypothesis: larger than, smaller than, or just plain different (not equal). Therefore, the notation for Hα will always contain a condition of inequality. If Hα contains either of the inequalities > or <, we call the hypothesis test a one-tailed test, since the sample result can differ from the null value in only one direction. If Hα contains the inequality ≠, then we call the hypothesis test a two-tailed test, since being not equal to something means you could be less than or greater than the given value.

To see the difference between a one-tailed test and a two-tailed test, imagine that I ask you to guess a number between 1 and 10, and in addition I tell you that the number is greater than 5. What numbers will you guess? You will choose numbers in only one direction from 5…those greater than 5. You won't bother guessing numbers below 5, since I specified the direction in which to guess. Now suppose I ask you to guess a number between 1 and 10, and all I tell you is that the number is different from (i.e., not equal to) 5. Which numbers will you guess this time? You now have to choose numbers in two directions from 5…those less than 5 and those greater than 5.

2. The Test Statistic

Once we have a statement of hypothesis, the next thing that happens is we analyze sample data (collected appropriately…see sampling in subcompetency 5) using both graphical and numerical summaries (subcompetencies 1 and 2, respectively). We specifically want to use numerical summaries to compare with what is being claimed in the null hypothesis. The types of hypothesis tests you will run in this course will focus on either population means (μ vs. x̄) or population proportions (p vs. p̂).

The nice thing about computing a test statistic is that it is a computation you are already familiar with. In subcompetency 3 you were introduced to the z-score

z = (x − μ) / σ

which in English terms is:

z = (data value − mean) / (standard deviation)

For hypothesis testing, we'll use the same form but slightly modify the terms:

z = (sample statistic − claimed population parameter) / (standard deviation of the sample statistic)

For a test about a population mean, this becomes z = (x̄ − μ) / (s/√n).

3. Probability Values and Statistical Significance

The key to making an appropriate conclusion to a hypothesis test is to identify results that are statistically significant. When results observed from the sample are unlikely under the assumption that the null hypothesis is true, we say the result is statistically significant. If our sample results are unlikely, then we reject the null hypothesis H0. Repeat this to yourself over and over: if we have statistically significant results, we reject H0. In other words, if the difference between our sample result and the null hypothesis claim is large (which results in a really small probability value), then we reject H0.

There are actually TWO ways to proceed with hypothesis testing, the "classical" method and the P-value (probability value) approach, with both giving the same conclusions to the test. For this class we will use the P-value approach in our hypothesis testing, as it is far more prevalent in scientific research. Our goal is to calculate the probability of observing a sample statistic as extreme as, or more extreme than, the one observed from our sample, under the assumption that the null hypothesis is true. Since AREA = PROBABILITY, a P-value represents the total area under the curve in the tail(s) beyond our test statistic. The main question we are trying to answer is: Could random variation alone account for the difference between the null hypothesis and our observations from a random sample? A small P-value implies that random variation through the sampling process alone is not likely to account for the observed difference. Therefore, with a small P-value, we reject H0 and are led to believe that the true value of the population parameter is significantly different from what was stated in H0. Small P-values are strong evidence against H0.

Once we know the value of our test statistic and whether our hypothesis test is a one-tail or two-tail test, we determine the P-value. Keeping in mind the idea of statistical significance, our hypothesis test boils down to two situations:

• A small P-value implies that our sample result would be very unlikely if the null hypothesis were true. This is considered significant evidence against the null hypothesis. Therefore, a small P-value means we reject the null hypothesis H0.
• A large P-value implies that our sample result would not be unlikely if the null hypothesis were true. This is not proof that the null hypothesis is true, only that our data are consistent with it, so we do not reject the null hypothesis H0. Notice, we never conclude that the null hypothesis is indeed true!
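As a sketch of how a P-value can be computed from a z test statistic without tables, the standard normal tail area is available through the complementary error function in Python's math module (the function name `p_value` is ours, not from the text):

```python
import math

def p_value(z, tails=2):
    """P-value for a z test statistic under the standard normal curve.

    tails=1 gives the one-tailed area beyond |z|;
    tails=2 doubles it for a two-tailed test.
    """
    # P(Z > |z|) = 0.5 * erfc(|z| / sqrt(2)) for a standard normal Z
    tail_area = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return tails * tail_area

# A familiar check: z = 1.96 gives a two-tailed P-value of about 0.05
print(round(p_value(1.96), 3))
```

The z = 1.96 check matches the well-known fact that about 5% of the standard normal curve lies beyond ±1.96.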

We call the level of significance of our hypothesis test the value of alpha, α. Typical values for α are: α = 0.10, α = 0.05, or α = 0.01. The value of α chosen for a hypothesis test must be reported using language such as, “Our hypothesis test has a level of significance α = 0.01.”
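Once α is chosen, the reject / fail-to-reject decision is a single comparison. A minimal sketch (the helper name `decide` is ours):

```python
def decide(p, alpha=0.05):
    """Compare a P-value to the significance level alpha."""
    if p < alpha:
        return "reject H0"
    return "fail to reject H0"

print(decide(0.003))             # small P-value: reject H0
print(decide(0.27))              # large P-value: fail to reject H0
print(decide(0.03, alpha=0.01))  # not small enough at the stricter level
```

Note that the same P-value of 0.03 leads to rejection at α = 0.05 but not at α = 0.01, which is why the chosen level must always be reported.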

4. The Conclusions of Hypothesis Testing

It is important to keep in mind that the statistical results from a hypothesis test only deal with the null hypothesis H0. There are only two statistical conclusions to make at the end of a hypothesis test: "we reject H0" or "we fail to reject H0." The statistical conclusions never deal with the alternative hypothesis Hα. Moreover, we never say that, "we accept H0." It is either, "we reject H0" or "we fail to reject H0."

After stating the statistical conclusions, it is important to write a sentence or two about what our conclusions are in a way that a non-statistics person can understand. Usually this sentence starts out with, "Our data provides sufficient evidence that…" or, "Our data does not provide sufficient evidence that…" It will take lots of practice to become comfortable making the proper conclusions from a hypothesis test.

Concluding Example

Returning to the example at the beginning of our discussion, a hypothesis test would look like the following:

1. Statement of Null and Alternative Hypotheses
• H0: μ = 98.6°
• Hα: μ ≠ 98.6°
2. Calculate the Test Statistic
• z = (x̄ − μ) / (s/√n) = (98.2 − 98.6) / (0.62/√106) ≈ −6.64
3. Compute the Probability Value (Two-Tailed Test)
Using a table of probability values, all we'd be able to say is that the probability of seeing such a test statistic, assuming that 98.6°F is indeed the average human body temperature, is much less than 0.0005 (or 0.05%). Relying on technology to calculate the probability, we find a P-value of 0.00000000142, which is as close to 0 as we'd ever care to see.
4. Make Conclusions
As our P-value is lower than any common significance level (such as α = 0.05), we would state:
• “We reject the null hypothesis H0: μ = 98.6°. Our sample data provides sufficient evidence that the typical average human body temperature is significantly different from 98.6°.”
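The four steps above can be put together in a short script. One caveat: we compute the P-value from the standard normal curve here, so the exact figure comes out even smaller than the quoted 0.00000000142 (software typically produces that number from a slightly different distribution); either way, it is far below any α:

```python
import math

# 1. Hypotheses: H0: mu = 98.6  vs  Ha: mu != 98.6 (two-tailed)
mu0, x_bar, s, n = 98.6, 98.2, 0.62, 106

# 2. Test statistic
z = (x_bar - mu0) / (s / math.sqrt(n))

# 3. Two-tailed P-value from the standard normal curve
p = math.erfc(abs(z) / math.sqrt(2))

# 4. Conclusion at alpha = 0.05
alpha = 0.05
conclusion = "reject H0" if p < alpha else "fail to reject H0"

print(round(z, 2), p < 0.0005, conclusion)
```

This reproduces the table-based statement in step 3: the P-value is much less than 0.0005, so we reject H0.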

That’s it… the entire process of carrying out a hypothesis test. To conclude this subcompetency, it must be acknowledged that even with careful data collection and precise test statistic and P-value computations, mistakes (errors) can be made! Remember, we are relying on sample data on which to base our conclusions. Even using proper experimental design with randomization throughout the process, sometimes by chance our sample data may not adequately represent the overall population. There is nothing we can do about that! This means there is a possibility that we could draw an inaccurate conclusion from the hypothesis test!

3. Type I and Type II Errors

There are four possible outcomes when conducting a hypothesis test:

• We reject the null hypothesis when the alternative hypothesis is actually true.
• We do not reject the null hypothesis when the null hypothesis is actually true.
• We reject the null hypothesis when it is actually true.
• We do not reject the null hypothesis when the alternative hypothesis is actually true.

With careful thinking, it’s easy to see that the first two possibilities are CORRECT decisions (for example, in the first possibility we are rejecting the null hypothesis…telling the world we have data that shows our underlying belief is likely not true…when indeed the alternative hypothesis is correct). It is the last two possibilities, no. 3 and no. 4, that are INCORRECT decisions.

For the third choice, we would be rejecting the null hypothesis (showing we have data that leads us to believe it is incorrect) when it is actually true. Our sample data leads us to an incorrect decision. This mistake is called a TYPE I ERROR. For the fourth choice, we would fail to reject the null hypothesis (our sample data would actually support the value of the null hypothesis) when indeed the alternative hypothesis is actually the “true” value. This mistake is called a TYPE II ERROR.

Nice visuals of Type I and Type II errors can be found all over the Internet, including a chart in the suggested textbook for the course.

In most problems we do, we try to keep the probability of making a Type I Error, denoted by the symbol alpha α (YES, the same α from the hypothesis testing!), as small as possible, since making a Type I Error can often be more serious. In this class we will rarely, if ever, discuss Type II Errors. If you go on to take additional statistics courses, you will become familiar with Type II Errors then.
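One way to see why α is exactly the probability of a Type I Error is a small simulation (entirely our own illustration, not from the text): draw many samples from a population where H0 really is true, run a two-tailed z test on each, and count how often we wrongly reject. The rejection rate should hover near α:

```python
import math
import random

random.seed(1)  # fixed seed so the illustration is reproducible

mu0, sigma, n = 98.6, 0.62, 30   # H0 is TRUE: the population mean really is 98.6
alpha, trials = 0.05, 2000

rejections = 0
for _ in range(trials):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    z = (x_bar - mu0) / (sigma / math.sqrt(n))   # z test with known sigma
    p = math.erfc(abs(z) / math.sqrt(2))         # two-tailed P-value
    if p < alpha:
        rejections += 1   # a Type I Error: rejecting a true H0

print(rejections / trials)  # close to alpha = 0.05
```

Even though the null hypothesis is true in every trial, about 5% of the samples are unusual enough, purely by chance, to produce a rejection. That 5% is the Type I Error rate α.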