# 17 Contingency Tables

A contingency table, sometimes called a two-way frequency table, presents a summary (counts) of categorical data. It consists of at least two rows and at least two columns; an example would be the number of votes for each political party in each of the fifty states. Contingency tables are used to study the association between two variables, and to check whether data follow an assumed pattern.

## 17.1 Testing independence between two factors

Let’s assume you want to find out whether voting behavior differs by age group. You collected data from 52,615 registered voters that records their age group and whether they voted (1) or not (0). Normally you would read in a file with that information; here I just generate a sample data set:

```
age <- 1:52615
voted <- 1:52615
age[1:5412] <- "18-19"
age[5413:11455] <- "20-24"
age[11456:18787] <- "25-29"
age[18788:27252] <- "30-39"
age[27253:33852] <- "40-49"
age[33853:39908] <- "50-59"
age[39909:46397] <- "60-69"
age[46398:52615] <- "70+"
voted[1:52615] <- 0
voted[1:3084] <- 1
voted[5413:8977] <- 1
voted[11456:16148] <- 1
voted[18788:23951] <- 1
voted[27253:31608] <- 1
voted[33853:38151] <- 1
voted[39909:44905] <- 1
voted[46398:51620] <- 1
voter_data <- data.frame(age, voted)
```

In this format, the data is hard to read. I am using the `table` command to arrange the data in a more readable format, called a contingency table (also known as a cross tabulation or crosstab):

```
cont_table <- table(voter_data$age,voter_data$voted)
cont_table
#>
#> 0 1
#> 18-19 2328 3084
#> 20-24 2478 3565
#> 25-29 2639 4693
#> 30-39 3301 5164
#> 40-49 2244 4356
#> 50-59 1757 4299
#> 60-69 1492 4997
#> 70+ 995 5223
```

Next, I want to add some totals to the table:

```
total_age_group <- rowSums(cont_table)
cont_table1 <- cbind(cont_table, total_age_group)
cont_table1
#> 0 1 total_age_group
#> 18-19 2328 3084 5412
#> 20-24 2478 3565 6043
#> 25-29 2639 4693 7332
#> 30-39 3301 5164 8465
#> 40-49 2244 4356 6600
#> 50-59 1757 4299 6056
#> 60-69 1492 4997 6489
#> 70+ 995 5223 6218
```

and also the percentage of voters and non-voters per age group:

```
percent_not_voting <- cont_table1[,1]/cont_table1[,3]
percent_voting <- cont_table1[,2]/cont_table1[,3]
cont_table2 <- cbind(cont_table1, percent_not_voting,percent_voting)
cont_table2
#> 0 1 total_age_group percent_not_voting
#> 18-19 2328 3084 5412 0.4301552
#> 20-24 2478 3565 6043 0.4100612
#> 25-29 2639 4693 7332 0.3599291
#> 30-39 3301 5164 8465 0.3899587
#> 40-49 2244 4356 6600 0.3400000
#> 50-59 1757 4299 6056 0.2901255
#> 60-69 1492 4997 6489 0.2299276
#> 70+ 995 5223 6218 0.1600193
#> percent_voting
#> 18-19 0.5698448
#> 20-24 0.5899388
#> 25-29 0.6400709
#> 30-39 0.6100413
#> 40-49 0.6600000
#> 50-59 0.7098745
#> 60-69 0.7700724
#> 70+ 0.8399807
```

The question I want to answer is “Is voting behavior (i.e., voted or didn’t vote) independent of age group?” Formally, we test

\(\small H_0\): voting behavior and age groups are independent

\(\small H_a\): voting behavior and age groups are not independent

We will start by finding out how many people voted / didn’t vote overall, and what percent voted / didn’t vote overall.

```
voters <- sum(cont_table2[,2])
non_voters <- sum(cont_table2[,1])
total <- voters + non_voters
perc_voters <- voters / total
perc_non_voters <- non_voters / total
```

```
paste("Out of", total, "people", non_voters, "or",
      round(100 * perc_non_voters, 2), "percent didn't vote")
#> [1] "Out of 52615 people 17234 or 32.75 percent didn't vote"
paste(voters, "people or", round(100 * perc_voters, 2), "percent voted")
#> [1] "35381 people or 67.25 percent voted"
```

We can now compute the expected number of voters / non-voters in each age group *if every group voted at the same rate as the population overall*.

```
expected_voters <- perc_voters * cont_table2[,3]
expected_non_voters <- perc_non_voters * cont_table2[,3]
cont_table3 <- cbind(cont_table2, expected_non_voters,expected_voters)
cont_table3
#> 0 1 total_age_group percent_not_voting
#> 18-19 2328 3084 5412 0.4301552
#> 20-24 2478 3565 6043 0.4100612
#> 25-29 2639 4693 7332 0.3599291
#> 30-39 3301 5164 8465 0.3899587
#> 40-49 2244 4356 6600 0.3400000
#> 50-59 1757 4299 6056 0.2901255
#> 60-69 1492 4997 6489 0.2299276
#> 70+ 995 5223 6218 0.1600193
#> percent_voting expected_non_voters expected_voters
#> 18-19 0.5698448 1772.696 3639.304
#> 20-24 0.5899388 1979.380 4063.620
#> 25-29 0.6400709 2401.591 4930.409
#> 30-39 0.6100413 2772.704 5692.296
#> 40-49 0.6600000 2161.825 4438.175
#> 50-59 0.7098745 1983.638 4072.362
#> 60-69 0.7700724 2125.467 4363.533
#> 70+ 0.8399807 2036.701 4181.299
```

Our test statistic \(C^2\) is based on the difference between expected and observed frequency of voting:

\[ \small C^2=\sum_{\text{all age groups, all vote types}} \frac{(\text{observed value - expected value})^2}{\text{expected value}} = \sum_{\text{all cells}}\frac{(O-E)^2}{E}\]
*If* \(\small H_0\) is true, then the test statistic \(\small C^2\) has an approximate Chi-square distribution with (n-1)(m-1) degrees of freedom, where n is the number of rows and m the number of columns. There are several (similar) rules of thumb for using this test:

- some sources say all expected cell values have to be larger than 5
- others say no more than 20% of cells can have an expected count less than 5
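For our table this condition is easy to verify, since the expected counts (the last two columns of `cont_table3` above) are all in the thousands. A quick sketch of the check:

```
expected <- cont_table3[, c("expected_non_voters", "expected_voters")]
all(expected > 5)   # every expected cell count exceeds 5
#> [1] TRUE
```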

Here we compute the test statistic:

```
ratio <- (cont_table3[,1]-cont_table3[,6])^2/cont_table3[,6] +
         (cont_table3[,2]-cont_table3[,7])^2/cont_table3[,7]
Csq <- sum(ratio)
Csq
#> [1] 1746.287
```

We have n = 8 age groups and m = 2 categories (voted / didn’t vote), so we need the Chi-square distribution with (8-1)(2-1) = 7 degrees of freedom.

```
p_value <- 1-pchisq(Csq, df=7)
p_value
#> [1] 0
```
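The p-value is not literally zero; the test statistic lies so far in the right tail that the tail probability underflows to 0 in double precision. To see just how small it is, we can ask for the log of the tail probability (using `lower.tail = FALSE` instead of `1 - pchisq(...)`):

```
# log of P(Chi-square with 7 df > Csq)
pchisq(Csq, df = 7, lower.tail = FALSE, log.p = TRUE)
```

The result is a log-probability on the order of -850, i.e. a p-value far smaller than anything double precision can represent.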

The following is a visualization: the red vertical line represents the test statistic, and the area to the right of the red line is the p-value.

```
x <- c(seq(0, 20, by = 0.1), seq(2,1750, by=2))
y <- dchisq(x, df = 7)
data <- data.frame(x,y)
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.3.2
ggplot(data, aes(x = x, y = y)) +
  geom_line() +
  ggtitle("Chi-square distribution with 7 degrees of freedom") +
  geom_vline(xintercept = Csq, color = "red")
```

R’s built-in function `chisq.test` performs all of these steps at once and confirms our result:

```
vote <- as.factor(voted)
chisq.test(age, vote)
#>
#> Pearson's Chi-squared test
#>
#> data: age and vote
#> X-squared = 1746.3, df = 7, p-value < 2.2e-16
```

## 17.2 Testing Homogeneity

In the previous section we drew a sample from a single population (eligible voters) and recorded two factors (age group, voting behavior). We then tested for independence between the two factors. In this case the number of people in each age group was not fixed from the onset.

We could also interpret the situation as drawing from 8 different populations, treating each age group as a separate population from which we draw a set number of people (5412, 6043, etc.). The question then becomes whether a single variable (voting behavior) has the same distribution in each population. In this case:

\(\small H_0\): voting behavior has the same distribution across age groups

\(\small H_a\): voting behavior has different distributions across age groups

or more formally: let 1, 2, 3, 4, 5, 6, 7, 8 denote the eight age groups, let 1 / 0 denote voted / didn’t vote, and let \(P_{i,j}\) = P(a person in group \(i\) has voting status \(j\)). Then

\(\small H_0\): \(P_{1,1}=P_{2,1}=\dots=P_{8,1}\) and \(P_{1,0}=P_{2,0}=\dots=P_{8,0}\)

\(\small H_a\): at least one of these equalities doesn’t hold

We call this *testing for homogeneity*. As we saw in class, the test statistic turns out to be the same as before, but there is an important difference in what you are testing.
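Because the statistic is the same, `chisq.test` applied to the contingency table from Section 17.1 carries out either test; only the interpretation of the result changes. A quick check (assuming `cont_table` from above is still in the workspace):

```
chisq.test(cont_table)
#>
#> Pearson's Chi-squared test
#>
#> data: cont_table
#> X-squared = 1746.3, df = 7, p-value < 2.2e-16
```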

## 17.3 Testing Goodness-of-fit, all parameters known

We use this method if we have an idea what distribution the data follow and want to test that assumption. As an example, let’s look at the weekdays on which presidents were born. (I got the data by taking a list of presidents’ birthdays and extracting the weekday.)

```
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
days <- read.csv("pres_age.csv")
bdates <- mdy(days$bday)
day_born <- weekdays(bdates)
table(day_born)
#> day_born
#> Friday Monday Saturday Sunday Thursday Tuesday
#> 6 9 8 5 8 7
#> Wednesday
#> 3
```

We will test

\(\small H_0\) presidents’ birthdays are uniformly distributed

\(\small H_a\) presidents’ birthdays are not uniformly distributed

The observed values (Monday through Sunday) are 9, 7, 3, 8, 6, 8, 5.

The expected values are all 46/7, since the 46 birthdays would be spread evenly over the 7 weekdays.

Here we have n-1 = 6 degrees of freedom, because we have n=7 different groups. We compute \(C\) and the p-value:

```
Obs <- c(9,7,3,8,6,8,5)
Exp <- c(46/7, 46/7,46/7,46/7,46/7,46/7,46/7)
C <- sum((Obs-Exp)^2/Exp)
p_value <- 1-pchisq(C, df=6)
p_value
#> [1] 0.6884428
```

We have a large p-value, so we fail to reject \(\small H_0\): the data are consistent with the birthdays being uniformly distributed, although this does not prove that they are.
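As a check, the built-in `chisq.test` with a single vector of counts tests against a uniform distribution by default and reproduces this result:

```
chisq.test(c(9, 7, 3, 8, 6, 8, 5))
#>
#> Chi-squared test for given probabilities
#>
#> data: c(9, 7, 3, 8, 6, 8, 5)
#> X-squared = 3.913, df = 6, p-value = 0.6884
```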

## 17.4 Testing Goodness-of-fit, not all parameters known

We use this method if we have an idea how the data are distributed, but we don’t quite know all the parameters. For example, we want to test whether the data below come from a normal distribution. Because the population mean and population standard deviation are unknown, we substitute the sample mean and sample standard deviation.

```
x <- c( 2.93, 1.30, 2.60, -0.97, 3.47, -3.39, 3.34, 3.27, -2.04, 3.20, 1.46, 2.88, 4.38, -1.20, 1.10, -2.62, 2.12, -3.90, -0.26, 2.35, -0.48, 2.37, -2.79, 2.98, 5.23, 3.03, -1.25, 3.88, -0.21, 1.15, 3.84, 5.68, 7.83, 3.11, 4.25, 3.56, 1.70, 0.97, 6.19, -0.60)
smean <- mean(x)
smean
#> [1] 1.7615
ssdev <- sd(x)
ssdev
#> [1] 2.709547
x <- sort(x)
x
#> [1] -3.90 -3.39 -2.79 -2.62 -2.04 -1.25 -1.20 -0.97 -0.60
#> [10] -0.48 -0.26 -0.21 0.97 1.10 1.15 1.30 1.46 1.70
#> [19] 2.12 2.35 2.37 2.60 2.88 2.93 2.98 3.03 3.11
#> [28] 3.20 3.27 3.34 3.47 3.56 3.84 3.88 4.25 4.38
#> [37] 5.23 5.68 6.19 7.83
```

We now have to group our 40 observations, but the groups can’t be too large (we would lose information) or too small (we need at least 5 expected entries per group). Let’s use 5 groups of equal probability; the cut-offs are then the percentiles \(\small P_{20}, P_{40}, P_{60}, P_{80}\):

```
cut_offs_expected <- qnorm(p=c(0.2, 0.4, 0.6, 0.8), mean = smean, sd=ssdev)
cut_offs_expected
#> [1] -0.5189121 1.0750442 2.4479558 4.0419121
```

This example is small enough that we could count the groups by hand, or we can use the code chunk below:

```
count <- numeric(5)
# first group: everything up to the lowest cut-off
count[1] <- sum(x <= cut_offs_expected[1])
# middle groups: between consecutive cut-offs
for (i in 2:4){
  count[i] <- sum(x <= cut_offs_expected[i]) - sum(x <= cut_offs_expected[i-1])
}
# last group: everything above the highest cut-off
count[5] <- sum(x > cut_offs_expected[4])
Obs <- count
```
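The counting loop can also be replaced by `cut` and `table`, which bin the observations in one step (a sketch; it assumes `x` and `cut_offs_expected` from above and gives the same counts as the loop):

```
# right-closed intervals (-Inf, c1], (c1, c2], ..., matching the sums above
bins <- cut(x, breaks = c(-Inf, cut_offs_expected, Inf))
Obs <- as.vector(table(bins))
```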

Next we find the test statistic \(C\) and the p-value

```
Exp <- rep(40/5, 5)   # with 5 equally likely groups we expect 40/5 = 8 observations each
C <- sum((Obs-Exp)^2/Exp)
C
#> [1] 5.75
```

The degrees of freedom are n-i-1, where n is the number of cells or groups (5 in this example) and i is the number of estimated parameters (2 in this case, the mean and the standard deviation).

```
p_value <- 1-pchisq(C, df=3)
p_value
#> [1] 0.1244274
```

## 17.5 Assignment

- Explain the difference between the two tests (independence, homogeneity) introduced above and why that difference matters. Write up the derivation showing that the two test statistics are the same.
- Use the presidents’ birthday data and extract the birth month. Can we run a test for uniform distribution on this data?
- Group the presidents’ birthday data by season and test the claim that an equal number of presidents was born in each season.