2. Two-Way Tables

Recall from sub-competency 6 that we can organize two categorical variables in a two-way table of counts. Here is an example to start us down the path to analyzing this type of data.

Example: Asthma and Smoking

The table below describes the smoking habits of a group of asthma sufferers, broken down by their continent of residence.

| Location | Nonsmoker | Occasional Smoker | Regular Smoker | Heavy Smoker | Total |
| --- | --- | --- | --- | --- | --- |
| North America | 339 | 33 | 61 | 34 | 467 |
| South America | 377 | 132 | 184 | 136 | 829 |
| Total | 716 | 165 | 245 | 170 | 1296 |

 

Since there are 2 continents to consider and 4 possible smoking habits, there are 8 counts that occupy the cells of the table. Comparing these counts directly is hard because many more of the respondents are from South America than from North America. One way to adjust for this is to rewrite the table in terms of percentages of the row totals.

| Location | Nonsmoker | Occasional Smoker | Regular Smoker | Heavy Smoker | Total |
| --- | --- | --- | --- | --- | --- |
| North America | 72.59% | 7.07% | 13.06% | 7.28% | 100% |
| South America | 45.48% | 15.92% | 22.20% | 16.41% | 100% |

 

Just as a reminder, the table of row percentages is the conditional distribution of smoking habits, given continent of residence. If we organize these percentages into a graph, we can more easily compare the categorical data.

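[Figure section12-1: graph of the conditional distribution of smoking habits for North America and South America]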

To compare the two continents, we begin with the assumption that the distribution of smoking habits is the same for North and South America. This forms our null hypothesis. Checking it creates a problem of multiple comparisons, because there are four categories to compare at once. It is a mistake to look only at the nonsmoker category and declare that North America is greater just because that category shows the largest difference. Hopefully you are not fooled, since in every other smoking category South America is greater. Looking at the entire graph, you may be tempted to say that there is a difference, since South America is higher in three of the four categories, but you should ask whether that difference is significant. You should be even more cautious because the direction of the difference is not consistent across the categories.

Expected Counts of Two-Way Tables

Our null hypothesis is that there is no relationship between the two categorical variables in our two-way table. Under this assumption, what value would we expect to see in each cell? These values are called the expected counts. To find the expected count for a cell, multiply its row total by its column total and then divide by the table's grand total.

Expected Counts Table

| Location | Nonsmoker | Occasional Smoker | Regular Smoker | Heavy Smoker | Total |
| --- | --- | --- | --- | --- | --- |
| North America | 716 ⋅ 467 / 1296 = 258.00 | 165 ⋅ 467 / 1296 = 59.46 | 245 ⋅ 467 / 1296 = 88.28 | 170 ⋅ 467 / 1296 = 61.26 | 467 |
| South America | 716 ⋅ 829 / 1296 = 458.00 | 165 ⋅ 829 / 1296 = 105.54 | 245 ⋅ 829 / 1296 = 156.72 | 170 ⋅ 829 / 1296 = 108.74 | 829 |
| Total | 716 | 165 | 245 | 170 | 1296 |
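The expected-count rule (row total times column total, divided by the grand total) can be sketched in a few lines of Python. This is one possible implementation under the assumption that the table is stored as a list of rows. The names `observed` and `expected_counts` are hypothetical and chosen only for illustration.

```python
# Observed counts: one inner list per continent, one entry per smoking habit.
observed = [
    [339, 33, 61, 34],     # North America
    [377, 132, 184, 136],  # South America
]


def expected_counts(table):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print([f"{value:.2f}" for value in row])
```

For the North America row this prints 258.00, 59.46, 88.28 and 61.26. Each row of expected counts also adds back up to its observed row total, which is a quick check on the arithmetic.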

 

The expected-count calculation works because, if the null hypothesis is true, each row should split across the columns in the same proportions as the table as a whole. Equivalently, you can view the column total divided by the table total as the probability of falling in that column, and then multiply by the row total to get the expected count for that row category. This is similar to using np to find an expected value.
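To make the connection to np concrete, here is the North America nonsmoker cell worked both ways. The symbols $E$, $R$, $C$, and $N$ (expected count, row total, column total, grand total) are notation introduced here for illustration and are not used elsewhere in the lesson.

```latex
E = \frac{R \cdot C}{N} = R \cdot \frac{C}{N}
\qquad\text{with } n = R \text{ and } p = \frac{C}{N}

p = \frac{716}{1296} \approx 0.5525,
\qquad
E = np = 467 \cdot \frac{716}{1296} \approx 258.00
```

Here $p$ is the overall proportion of nonsmokers and $n$ is the number of North American respondents, so $np$ is the number of North American nonsmokers we would expect if continent made no difference.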