17 Contingency Tables

A contingency table, sometimes called a two-way frequency table, presents a summary (the counts) of categorical data. It consists of at least two rows and at least two columns; an example would be the number of votes for each political party in each of the fifty states. Contingency tables are used to study the association between two categorical variables and to check whether data follow an assumed pattern.

17.1 Testing independence between two factors

Let’s assume you want to find out whether voting behavior differs by age group. You have collected data on 52,615 registered voters, recording each voter’s age group and whether they voted (1) or not (0). Normally you would read in a file with that information; here I just generate a sample data set:

# initialize vectors of the right length; the values are filled in below
age <- character(52615)
voted <- numeric(52615)
age[1:5412] <- "18-19"
age[5413:11455] <- "20-24"
age[11456:18787] <- "25-29"
age[18788:27252] <- "30-39"
age[27253:33852] <- "40-49"
age[33853:39908] <- "50-59"
age[39909:46397] <- "60-69"
age[46398:52615] <- "70+"

voted[1:52615] <- 0
voted[1:3084] <- 1
voted[5413:8977] <- 1
voted[11456:16148] <- 1
voted[18788:23951] <- 1
voted[27253:31608] <- 1
voted[33853:38151] <- 1
voted[39909:44905] <- 1
voted[46398:51620] <- 1

voter_data <- data.frame(age, voted)

In this format the data are hard to read. I use the table() command to arrange the data in a more readable format, called a contingency table (also known as a cross tabulation or crosstab):

cont_table <- table(voter_data$age, voter_data$voted)
cont_table
#>        
#>            0    1
#>   18-19 2328 3084
#>   20-24 2478 3565
#>   25-29 2639 4693
#>   30-39 3301 5164
#>   40-49 2244 4356
#>   50-59 1757 4299
#>   60-69 1492 4997
#>   70+    995 5223

Next, I want to add the row totals (the size of each age group) to the table:

total_age_group <- rowSums(cont_table)
cont_table1 <- cbind(cont_table, total_age_group)
cont_table1
#>          0    1 total_age_group
#> 18-19 2328 3084            5412
#> 20-24 2478 3565            6043
#> 25-29 2639 4693            7332
#> 30-39 3301 5164            8465
#> 40-49 2244 4356            6600
#> 50-59 1757 4299            6056
#> 60-69 1492 4997            6489
#> 70+    995 5223            6218
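
Base R can also do this in one step: addmargins() from the stats package appends marginal sums to a table. A minimal sketch (margin = 2 adds a column of row totals; calling it without margin would add both row and column totals):

addmargins(cont_table, margin = 2)   # appends a "Sum" column of row totals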

I also compute the percentage of voters and non-voters per age group:

percent_not_voting <- cont_table1[,1]/cont_table1[,3]
percent_voting <- cont_table1[,2]/cont_table1[,3]
cont_table2 <- cbind(cont_table1, percent_not_voting, percent_voting)
cont_table2
#>          0    1 total_age_group percent_not_voting
#> 18-19 2328 3084            5412          0.4301552
#> 20-24 2478 3565            6043          0.4100612
#> 25-29 2639 4693            7332          0.3599291
#> 30-39 3301 5164            8465          0.3899587
#> 40-49 2244 4356            6600          0.3400000
#> 50-59 1757 4299            6056          0.2901255
#> 60-69 1492 4997            6489          0.2299276
#> 70+    995 5223            6218          0.1600193
#>       percent_voting
#> 18-19      0.5698448
#> 20-24      0.5899388
#> 25-29      0.6400709
#> 30-39      0.6100413
#> 40-49      0.6600000
#> 50-59      0.7098745
#> 60-69      0.7700724
#> 70+        0.8399807
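
These row-wise proportions can also be obtained directly with prop.table(), which divides each cell of the table by its row total when margin = 1:

prop.table(cont_table, margin = 1)   # columns: proportion not voting / voting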

The question I want to answer is: “Is voting behavior (i.e., voted or didn’t) independent of age group?” Formally, we test

\(\small H_0\): voting behavior and age groups are independent

\(\small H_a\): voting behavior and age groups are not independent

We will start by finding out how many people voted / didn’t vote overall, and what percent voted / didn’t vote overall.

voters <- sum(cont_table2[,2])
non_voters <- sum(cont_table2[,1])
total <- voters + non_voters

perc_voters <- voters / total
perc_non_voters <- non_voters / total
paste("Out of ", total, " people ", non_voters, " or ", round(perc_non_voters*100, 2), "percent didn't vote")
#> [1] "Out of  52615  people  17234  or  32.75 percent didn't vote"
paste(voters, " people or ", round(perc_voters*100, 2), " percent voted")
#> [1] "35381  people or  67.25  percent voted"

We can now compute the expected number of voters / non-voters in each age group under the assumption that each age group votes at the same rate as the population overall.

expected_voters <- perc_voters * cont_table2[,3]
expected_non_voters <- perc_non_voters * cont_table2[,3]
cont_table3 <- cbind(cont_table2, expected_non_voters, expected_voters)
cont_table3
#>          0    1 total_age_group percent_not_voting
#> 18-19 2328 3084            5412          0.4301552
#> 20-24 2478 3565            6043          0.4100612
#> 25-29 2639 4693            7332          0.3599291
#> 30-39 3301 5164            8465          0.3899587
#> 40-49 2244 4356            6600          0.3400000
#> 50-59 1757 4299            6056          0.2901255
#> 60-69 1492 4997            6489          0.2299276
#> 70+    995 5223            6218          0.1600193
#>       percent_voting expected_non_voters expected_voters
#> 18-19      0.5698448            1772.696        3639.304
#> 20-24      0.5899388            1979.380        4063.620
#> 25-29      0.6400709            2401.591        4930.409
#> 30-39      0.6100413            2772.704        5692.296
#> 40-49      0.6600000            2161.825        4438.175
#> 50-59      0.7098745            1983.638        4072.362
#> 60-69      0.7700724            2125.467        4363.533
#> 70+        0.8399807            2036.701        4181.299
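
As a sanity check, these expected counts agree with what R’s built-in chi-square test computes internally; chisq.test() returns them in its expected component:

chisq.test(cont_table)$expected   # should match the last two columns above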

Our test statistic \(C^2\) is based on the difference between expected and observed frequency of voting:

\[ \small C^2=\sum_{\text{all age groups, all vote types}} \frac{(\text{observed value} - \text{expected value})^2}{\text{expected value}} = \sum_{\text{all cells}}\frac{(O-E)^2}{E}\]

If \(\small H_0\) is true, then the test statistic \(\small C^2\) has an approximate chi-square distribution with \(\small (n-1)(m-1)\) degrees of freedom, where \(n\) is the number of rows and \(m\) the number of columns of the table. There are several (similar) rules of thumb for when this approximation is reliable (a quick check in R follows the list):

  • some sources say all expected cell values have to be larger than 5
  • others say no more than 20% of cells can have an expected count less than 5
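
With the expected counts computed above, both conditions are easy to check; here every expected count is in the thousands, so the approximation is safe:

expected <- c(expected_non_voters, expected_voters)
min(expected)        # smallest expected cell count (about 1773 here)
mean(expected < 5)   # fraction of cells with expected count below 5 (0 here)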

Here we compute the test statistic:

# (O-E)^2/E for the non-voting and voting cell in each row
ratio <- (cont_table3[,1] - cont_table3[,6])^2 / cont_table3[,6] +
  (cont_table3[,2] - cont_table3[,7])^2 / cont_table3[,7]
Csq <- sum(ratio)
Csq
#> [1] 1746.287

We have n = 8 age groups (rows) and m = 2 vote categories (columns), so we need the chi-square distribution with (8-1)(2-1) = 7 degrees of freedom.

p_value <- 1-pchisq(Csq, df=7)
p_value
#> [1] 0
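
The p-value is so small that R reports it as exactly zero; the true value is simply below machine precision. Equivalently, we could compare the test statistic to a critical value: at the 5% level we reject \(\small H_0\) whenever \(\small C^2\) exceeds the 95th percentile of the chi-square distribution with 7 degrees of freedom.

qchisq(0.95, df = 7)   # about 14.07; Csq = 1746.287 is vastly larger, so we reject H0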

The following code visualizes the test. The red vertical line represents the test statistic; the p-value is the (here vanishingly small) area under the density to the right of the red line.

# dense grid where the density lives, plus a coarse grid out to the test statistic
x <- c(seq(0, 20, by = 0.1), seq(2, 1750, by = 2))
y <- dchisq(x, df = 7)
data <- data.frame(x, y)
library(ggplot2)
ggplot(data, aes(x = x, y = y)) +
  geom_line() +
  ggtitle("Chi-square distribution with 7 degrees of freedom") +
  geom_vline(xintercept = Csq, color = "red")

Instead of programming the test ourselves, we can use the built-in chi-square test in just two lines:

vote <- as.factor(voted)
chisq.test(age, vote)
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  age and vote
#> X-squared = 1746.3, df = 7, p-value < 2.2e-16
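
Note that chisq.test() also accepts the contingency table we built earlier directly, which avoids recomputing the cross tabulation:

chisq.test(cont_table)   # same result: X-squared = 1746.3, df = 7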

17.2 Testing Homogeneity

In the previous section we drew a sample from a single population (eligible voters) and recorded two factors (age group, voting behavior). We then tested for independence between the two factors. In this setting the number of people in each age group was not fixed from the outset.

We could also interpret the situation as drawing from 8 different populations, taking each age group as a separate population, with a fixed number (5412, 6043, etc.) drawn from each. In this case we ask whether a single variable (voting behavior) has the same distribution in each population:

\(\small H_0\): voting behavior has the same distribution across age groups

\(\small H_a\): voting behavior has different distributions across age groups

or more formally: let \(1, 2, \ldots, 8\) denote the eight age groups, let 1 / 0 denote voted / didn’t vote, and let \(P_{i,j}\) = P(a person in group \(i\) has voting status \(j\)). Then

\(\small H_0\): \(P_{1,1}=P_{2,1}=\dots=P_{8,1}\) and \(P_{1,0}=P_{2,0}=\dots=P_{8,0}\)

\(\small H_a\): at least one of these equalities does not hold

We call this testing for homogeneity. As we saw in class, the test statistic turns out to be the same as before, but there is an important difference in what is being tested: independence concerns two factors measured on one sample, while homogeneity compares one factor across several fixed samples.
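
Since homogeneity here amounts to asking whether the proportion of voters is the same in all eight age groups, we could also phrase the test with prop.test(), passing the number of voters and the group size for each age group; for more than two groups it should report the same Pearson chi-square statistic as above (no continuity correction is applied):

# 'successes' = voters per age group, 'trials' = age group sizes
prop.test(x = cont_table[, 2], n = rowSums(cont_table))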

17.3 Testing Goodness-of-fit, all parameters known

We use this method if we have an idea what distribution the data follow and want to test that assumption. As an example, let’s look at the weekdays on which the presidents were born. (I got the data from a list of presidents’ birthdays and extracted the weekday.)

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
days <- read.csv("pres_age.csv")
bdate <- mdy(days$bday)      # parse month/day/year strings into dates
day_born <- weekdays(bdate)
table(day_born)
#> day_born
#>    Friday    Monday  Saturday    Sunday  Thursday   Tuesday 
#>         6         9         8         5         8         7 
#> Wednesday 
#>         3

We will test

\(\small H_0\) presidents’ birthdays are uniformly distributed

\(\small H_a\) presidents’ birthdays are not uniformly distributed

The observed values, reordered Monday through Sunday, are 9, 7, 3, 8, 6, 8, 5.

Under \(\small H_0\) every weekday is equally likely, so with 46 presidents the expected value for each of the 7 days is 46/7 ≈ 6.57. (All expected counts exceed 5, so the rule of thumb from above is satisfied.)

Here we have n - 1 = 6 degrees of freedom, because we have n = 7 different groups and no estimated parameters. We compute \(\small C^2\) and the p-value:

Obs <- c(9, 7, 3, 8, 6, 8, 5)
Exp <- rep(46/7, 7)
Csq <- sum((Obs - Exp)^2 / Exp)
p_value <- 1 - pchisq(Csq, df = 6)
p_value
#> [1] 0.6884428

We have a large p-value, so we fail to reject \(\small H_0\): the data show no evidence against the birthdays being uniformly distributed over the weekdays. (A large p-value does not prove uniformity; it only means the data are consistent with it.)
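
As before, the built-in test does this in one line; chisq.test() accepts a vector of counts together with the hypothesized cell probabilities:

chisq.test(Obs, p = rep(1/7, 7))   # same statistic and p-value as above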

17.4 Testing Goodness-of-fit, not all parameters known

We use this method if we have an idea how the data are distributed but don’t quite know all the parameters. For example, we want to test whether the data below come from a normal distribution. Because the population mean and population standard deviation are unknown, we substitute the sample mean and sample standard deviation.

x <- c( 2.93,  1.30,  2.60, -0.97,  3.47, -3.39,  3.34,  3.27, -2.04,  3.20,  1.46,  2.88,  4.38, -1.20,  1.10, -2.62,  2.12, -3.90, -0.26,  2.35, -0.48,  2.37, -2.79,  2.98,  5.23,  3.03, -1.25,  3.88, -0.21,  1.15,  3.84,  5.68,  7.83,  3.11,  4.25,  3.56, 1.70,  0.97,  6.19, -0.60)
smean <- mean(x)
smean
#> [1] 1.7615
ssdev <- sd(x)
ssdev
#> [1] 2.709547
x <- sort(x)
x
#>  [1] -3.90 -3.39 -2.79 -2.62 -2.04 -1.25 -1.20 -0.97 -0.60
#> [10] -0.48 -0.26 -0.21  0.97  1.10  1.15  1.30  1.46  1.70
#> [19]  2.12  2.35  2.37  2.60  2.88  2.93  2.98  3.03  3.11
#> [28]  3.20  3.27  3.34  3.47  3.56  3.84  3.88  4.25  4.38
#> [37]  5.23  5.68  6.19  7.83

We now have to group our 40 observations, but the groups can’t be too large (we would lose information) or too small (we need an expected count of about 5 per group). Let’s use 5 equally likely groups; the cut-offs are then the percentiles \(\small P_{20}, P_{40}, P_{60}, P_{80}\) of the fitted normal distribution:

cut_offs_expected <- qnorm(p=c(0.2, 0.4, 0.6, 0.8), mean = smean, sd=ssdev)
cut_offs_expected
#> [1] -0.5189121  1.0750442  2.4479558  4.0419121

Since this is such a small example, we can count the group frequencies by hand; with 5 equally likely groups, each expected count is 40 × 0.2 = 8:

Obs <- c(9, 4, 8, 13, 6)
Exp <- rep(8, 5)

Or you could use the code chunk below:

count <- numeric(5)   # observed count per group
count[1] <- sum(x <= cut_offs_expected[1])
for (i in 2:4){
  count[i] <- sum(x <= cut_offs_expected[i]) - sum(x <= cut_offs_expected[i-1])
}
count[5] <- sum(x > cut_offs_expected[4])
Obs <- count
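
For larger problems, cut() together with table() produces the same counts in one step (the intervals are closed on the right, matching the <= comparisons above):

Obs <- as.vector(table(cut(x, breaks = c(-Inf, cut_offs_expected, Inf))))
Obs   # again 9 4 8 13 6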

Next we find the test statistic \(\small C^2\) and the p-value.

Csq <- sum((Obs - Exp)^2 / Exp)
Csq
#> [1] 5.75

The degrees of freedom used are n - i - 1, where n is the number of cells or groups (5 in this example) and i is the number of parameters estimated from the data (2 in this case: the mean and the standard deviation). That gives 5 - 2 - 1 = 2 degrees of freedom.

p_value <- 1 - pchisq(Csq, df = 2)
p_value
#> [1] 0.05641614

The p-value is just above 0.05, so at the 5% significance level we fail to reject the hypothesis that the data come from a normal distribution.
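
A caveat: we cannot simply call the built-in test here, because chisq.test() does not know that two parameters were estimated from the data; it would use n - 1 = 4 degrees of freedom and report a p-value that is too large. When parameters are estimated, the p-value has to be computed manually as above.

chisq.test(Obs, p = rep(0.2, 5))   # wrong df (4 instead of 2): do not use here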

17.5 Assignment

  1. Explain the difference between the two tests (independence, homogeneity) introduced above, and why that difference matters. Write up the derivation showing that the two test statistics are the same.
  2. Use the presidents’ birthday data and extract the birth month. Can we run a test for uniform distribution on these data?
  3. Group the presidents’ birthday data by season and test the claim that an equal number of presidents was born in each season.