Recall from sub-competency 6 that we can organize two categorical variables in a two-way table of counts. The following example starts us down the path to analyzing this type of data.
Example: Asthma and Smoking
The table below describes the smoking habits of a group of asthma sufferers, broken down by their continent of residence.
| Location | Nonsmoker | Occasional Smoker | Regular Smoker | Heavy Smoker | Total |
|---|---|---|---|---|---|
| North America | 339 | 33 | 61 | 34 | 467 |
| South America | 377 | 132 | 184 | 136 | 829 |
| Total | 716 | 165 | 245 | 170 | 1296 |
Since there are 2 continents to consider and 4 possible smoking habits, there are 8 counts that occupy the cells of the table. Comparing these counts directly is hard because far more respondents come from South America (829) than from North America (467). One way to adjust for this is to rewrite the table in terms of percentages, using the row totals.
| Location | Nonsmoker | Occasional Smoker | Regular Smoker | Heavy Smoker | Total |
|---|---|---|---|---|---|
| North America | 72.59% | 7.07% | 13.06% | 7.28% | 100% |
| South America | 45.48% | 15.92% | 22.20% | 16.41% | 100% |
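The row percentages can be reproduced with a few lines of code. The following is a minimal sketch in plain Python, using only the counts from the table; the names `counts`, `habits`, and `row_percentages` are illustrative choices and not part of the original lesson.

```python
# Observed counts from the asthma and smoking table, one list per continent.
habits = ["Nonsmoker", "Occasional Smoker", "Regular Smoker", "Heavy Smoker"]
counts = {
    "North America": [339, 33, 61, 34],
    "South America": [377, 132, 184, 136],
}


def row_percentages(row):
    """Return each count as a percentage of its row total."""
    total = sum(row)
    return [100 * c / total for c in row]


# Conditional distribution of smoking habit given continent.
conditional = {place: row_percentages(row) for place, row in counts.items()}

for place, pcts in conditional.items():
    cells = ", ".join(f"{h}: {p:.2f}%" for h, p in zip(habits, pcts))
    print(f"{place} -> {cells}")
```

Dividing by the row total is what makes the two rows comparable even though the two groups have very different sizes.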
As a reminder, each row is the conditional distribution of smoking habits given the continent of residence. Organizing these percentages into a graph makes the categories easier to compare.
To compare the two continents, we begin with the assumption that the distribution of smoking habits is the same for North and South America. This forms our null hypothesis. Because there are four categories to compare at once, we face a multiple-comparison problem. It would be a mistake to look only at the nonsmoker category, which shows the largest difference, and declare that North America is higher, since South America is higher in all of the other smoking categories. Looking at the entire graph, you may be tempted to say there is a difference because South America is higher in three of the four categories, but you should ask whether that difference is significant. You should be even more cautious because the direction of the difference is not consistent across the categories.
Expected Counts of Two-Way Tables
Our null hypothesis is that there is no relationship between the two categorical variables in our two-way table. Under this assumption, what value would we expect in each cell? These values are called the expected counts. Each one is found by multiplying the cell's row total by its column total and then dividing by the table's total: expected count = (row total ⋅ column total) / table total.
Expected Counts Table

| Location | Nonsmoker | Occasional Smoker | Regular Smoker | Heavy Smoker | Total |
|---|---|---|---|---|---|
| North America | 716 ⋅ 467 / 1296 = 258.00 | 165 ⋅ 467 / 1296 = 59.46 | 245 ⋅ 467 / 1296 = 88.28 | 170 ⋅ 467 / 1296 = 61.26 | 467 |
| South America | 716 ⋅ 829 / 1296 = 458.00 | 165 ⋅ 829 / 1296 = 105.54 | 245 ⋅ 829 / 1296 = 156.72 | 170 ⋅ 829 / 1296 = 108.74 | 829 |
| Total | 716 | 165 | 245 | 170 | 1296 |
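The expected-count calculation works because, if the two variables are unrelated, each row should split across the columns in the same proportions as the whole table. Equivalently, you can view the column total divided by the table total as a probability p and the row total as a number of trials n; multiplying them gives the expected count for that cell. For example, 716 / 1296 ≈ 0.5525 of all respondents are nonsmokers, so among the 467 North American respondents we expect about 467 ⋅ 0.5525 ≈ 258 nonsmokers. This is the same idea as using np for the expected value of a binomial count.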
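As a rough check on the arithmetic, the expected counts can be computed both ways: directly from the row and column totals, and through the np view. This is a minimal sketch in plain Python; the variable names (`observed`, `expected`, `expected_np`, `col_share`) are assumptions made for illustration.

```python
# Observed counts: rows are continents, columns are smoking habits.
observed = [
    [339, 33, 61, 34],     # North America
    [377, 132, 184, 136],  # South America
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # table total

# Formula from the text: row total * column total / table total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Equivalent "n * p" view: p is each column's share of the whole table,
# and each row total plays the role of n.
col_share = [c / n for c in col_totals]
expected_np = [[r * p for p in col_share] for r in row_totals]

for row in expected:
    print([round(x, 2) for x in row])
```

Both approaches give the same numbers because (row total ⋅ column total) / table total is just row total ⋅ (column total / table total) with the division regrouped. Each row of expected counts also adds back up to its row total.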