Recall from subcompetency 6 that we can organize two categorical variables in a two-way table of counts. Here is an example to start us down the path to analyzing this type of data.
Example: Asthma and Smoking
The table below describes the smoking habits of a group of asthma sufferers, broken down by their continent of residence.
| Location | Nonsmoker | Occasional Smoker | Regular Smoker | Heavy Smoker | Total |
|---|---|---|---|---|---|
| North America | 339 | 33 | 61 | 34 | 467 |
| South America | 377 | 132 | 184 | 136 | 829 |
| Total | 716 | 165 | 245 | 170 | 1296 |

Since there are 2 continents to consider and 4 possible smoking habits, there are 8 possible counts that occupy the cells of the table. Comparing these counts directly is hard because so many more people responded from South America than from North America. One way to adjust for this is to rewrite the table in terms of percentages, using the row totals.
| Location | Nonsmoker | Occasional Smoker | Regular Smoker | Heavy Smoker | Total |
|---|---|---|---|---|---|
| North America | 72.59% | 7.07% | 13.06% | 7.28% | 100% |
| South America | 45.48% | 15.92% | 22.20% | 16.41% | 100% |

Just as a reminder, this is the conditional distribution of the smoking habits, given their continent of residence. If we organize this into a graph, we can more easily compare the categorical data.
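As a minimal sketch of how these row percentages can be reproduced, the Python snippet below divides each observed count by its row total. The names `observed`, `habits`, and `row_percentages` are illustrative choices, not part of the lesson.

```python
# Observed counts from the asthma table
# (rows: continent of residence, columns: smoking habit).
habits = ["Nonsmoker", "Occasional Smoker", "Regular Smoker", "Heavy Smoker"]
observed = {
    "North America": [339, 33, 61, 34],
    "South America": [377, 132, 184, 136],
}


def row_percentages(counts):
    """Return each count as a percentage of its row total."""
    total = sum(counts)
    return [100 * c / total for c in counts]


# Conditional distribution of smoking habit, given continent.
conditional = {place: row_percentages(counts) for place, counts in observed.items()}

for place, percents in conditional.items():
    cells = ", ".join(f"{h}: {p:.2f}%" for h, p in zip(habits, percents))
    print(f"{place} -> {cells}")
```

Because each row is divided by its own total, the two continents can be compared even though their sample sizes differ.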
If we want to compare the two continents, we begin with the assumption that the distribution of smoking habits is the same in North and South America. This forms our null hypothesis. Comparing the categories one at a time, however, creates a problem of multiple comparisons. It would be a mistake to look only at the nonsmoker category and declare that North America is greater because that category shows the largest difference; in every other smoking category, South America is greater. Looking at the entire graph, you may be tempted to say that there is a difference, since South America is higher in three of the four categories, but you should ask: is that difference significant? You should wonder all the more because the direction of the difference is not consistent across the categories.
Expected Counts of Two-Way Tables
Our null hypothesis is that there is no relationship between the two categorical variables in our two-way table. Under this assumption, we can ask what counts we would expect to see in the cells if the null hypothesis were true. These are called the expected counts. The expected count for a cell is found by multiplying its row total by its column total and then dividing by the table's total.
Expected Counts Table


| Location | Nonsmoker | Occasional Smoker | Regular Smoker | Heavy Smoker | Total |
|---|---|---|---|---|---|
| North America | 716 ⋅ 467 / 1296 = 258.00 | 165 ⋅ 467 / 1296 = 59.46 | 245 ⋅ 467 / 1296 = 88.28 | 170 ⋅ 467 / 1296 = 61.26 | 467 |
| South America | 716 ⋅ 829 / 1296 = 458.00 | 165 ⋅ 829 / 1296 = 105.54 | 245 ⋅ 829 / 1296 = 156.72 | 170 ⋅ 829 / 1296 = 108.74 | 829 |
| Total | 716 | 165 | 245 | 170 | 1296 |

The expected-count calculation works because, if the two variables are unrelated, each row should split across the columns in the same proportions as the table as a whole. Equivalently, you can view the column total divided by the table total as a probability and then multiply it by the row total to get the expected count for that cell. This is similar to using np to find an expected value, with the row total playing the role of n and the column proportion playing the role of p.
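The expected-count rule can be sketched in a few lines of Python. This is an illustration under the lesson's null hypothesis, using the row and column totals from the asthma example; the names `row_totals`, `col_totals`, and `expected_count` are hypothetical.

```python
# Marginal totals taken from the asthma and smoking table.
row_totals = {"North America": 467, "South America": 829}
col_totals = {
    "Nonsmoker": 716,
    "Occasional Smoker": 165,
    "Regular Smoker": 245,
    "Heavy Smoker": 170,
}
table_total = 1296


def expected_count(row_total, col_total, total):
    """Expected cell count under no association: (row total * column total) / table total."""
    return row_total * col_total / total


# Build the full table of expected counts, one entry per cell.
expected = {
    place: {
        habit: expected_count(r, c, table_total)
        for habit, c in col_totals.items()
    }
    for place, r in row_totals.items()
}

for place, cells in expected.items():
    row = ", ".join(f"{habit}: {value:.2f}" for habit, value in cells.items())
    print(f"{place} -> {row}")
```

The same value comes from the np view: `(col_total / table_total) * row_total` is the column proportion p multiplied by the row size n. A quick check is that the expected counts in each row add back up to that row's total.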