Chapter 11 Association in categorical data
11.1 Contingency table
A contingency table (also called a cross-tabulation or cross table) is a type of table used in statistics to summarize the relationship between two categorical variables. It displays the frequency (counts) or proportions of observations that fall into each combination of categories.
- Rows represent categories of one variable.
- Columns represent categories of another variable.
- Cells contain counts or proportions of observations.
Contingency tables help to:
- identify patterns or associations between categorical variables,
- compare distributions across groups,
- provide input for association measures (such as Cramér’s V) or statistical tests (such as the chi-squared test).
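As a quick sketch, a contingency table is just a two-way count of paired observations. The following Python snippet builds one from a handful of hypothetical raw answers (invented here for illustration, not taken from the survey data used in this chapter):

```python
from collections import Counter

# Hypothetical raw answers: one (eye colour, gender) pair per student
answers = [("Blue", "Female"), ("Blue", "Male"),
           ("Brown", "Female"), ("Blue", "Female"),
           ("Green", "Male")]

counts = Counter(answers)
rows = sorted({eye for eye, _ in answers})
cols = sorted({g for _, g in answers})

# Cell (i, j) holds the number of students with that combination
table = {eye: {g: counts[(eye, g)] for g in cols} for eye in rows}
print(table["Blue"]["Female"])  # 2
```

Row and column totals (the margins of the table) are then simple sums over this dictionary.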
Example
In October 2025, 94 students were asked about their gender and eye colour. The answers are summarised in the following contingency table:
| | Female | Male | Total |
|---|---|---|---|
| Blue | 16 | 15 | 31 |
| Brown | 16 | 17 | 33 |
| Green | 5 | 14 | 19 |
| Other | 5 | 6 | 11 |
| Total | 42 | 52 | 94 |
11.2 Cramér’s V
Cramér’s V is a measure of association between two categorical variables, based on a contingency table.
Consider a contingency table with:
- \(r\) rows,
- \(c\) columns,
- observed frequencies \(O_{ij}\) in cells \((i,j)\),
- total sample size \(n\).
The chi-squared (\(\chi^2\)) statistic is defined as:
\[\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{11.1} \]
where the expected frequencies \(E_{ij}\) are given by:
\[E_{ij} = \frac{(\text{row total}_i)(\text{column total}_j)}{n} \tag{11.2} \]
Notice that the more similar the observed frequencies are to the expected frequencies, the smaller the chi-squared statistic is. On the other hand, the greater the discrepancies between observed and expected frequencies, the larger the chi-squared statistic becomes, providing stronger evidence that there is a statistical dependence in the data-generating process.
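Equations (11.1) and (11.2) can be sketched in a few lines of Python. The snippet below applies them to the gender and eye colour table from Section 11.1 (pure Python, no statistics library):

```python
observed = [[16, 15],   # Blue
            [16, 17],   # Brown
            [5, 14],    # Green
            [5, 6]]     # Other

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Expected frequencies under independence, equation (11.2)
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Chi-squared statistic, equation (11.1)
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
print(round(chi2, 4))  # 3.3912
```

The largest contributions to the statistic come from the Green row, where the observed counts (5 female, 14 male) deviate most from the expected ones.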
Cramér’s V is defined as a function of the chi-squared statistic:
\[V = \sqrt{ \frac{\chi^2}{n \cdot \min(r - 1, c - 1)} } \tag{11.3} \]
where:
- \(\chi^2\) is the chi-squared statistic defined above,
- \(n\) is the total number of observations,
- \(r\) is the number of rows,
- \(c\) is the number of columns.
By construction, Cramér’s V satisfies:
\(0 \le V \le 1\).
- \(V = 0\) indicates no association between the variables
- \(V = 1\) indicates a perfect association
Common informal benchmarks used in practice are:
| Cramér’s V | Strength of association |
|---|---|
| 0.00–0.10 | Negligible |
| 0.10–0.30 | Weak |
| 0.30–0.50 | Moderate |
| > 0.50 | Strong |
Interpretation should always be made in the context of the domain.
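As a sketch in Python, equation (11.3) applied to the eye colour table from Section 11.1 gives:

```python
import math

observed = [[16, 15], [16, 17], [5, 14], [5, 6]]
n = sum(map(sum, observed))
r, c = len(observed), len(observed[0])

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Chi-squared statistic, equation (11.1)
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# Cramér's V, equation (11.3)
v = math.sqrt(chi2 / (n * min(r - 1, c - 1)))
print(round(v, 3))  # 0.19
```

With \(V \approx 0.19\), the benchmarks above would label this association weak.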
11.3 Chi-squared independence test
The chi-squared test of independence is a statistical procedure used to assess whether two categorical variables are statistically independent in the data-generating process.
It uses the chi-squared statistic introduced in equation (11.1). A small chi-squared value indicates that observed frequencies are close to expected frequencies, providing little evidence against independence. A large chi-squared value indicates substantial deviations from independence, suggesting an association between the variables.
Statistical packages performing the chi-squared test of independence compute the p-value. In this test, the p-value indicates how likely it is to observe a chi-squared statistic at least as large as the one obtained, assuming that the null hypothesis – that the variables are independent in the data-generating process – is true.
The chi-squared test performs best when all (or most) expected frequencies exceed 5.
Importantly, the chi-squared test indicates whether an association exists, but not how strong it is. Measures such as Cramér’s V are therefore used to quantify the strength of association.
Example
```
## 
## 	Pearson's Chi-squared test
## 
## data:  table(students$eye_colour, students$gender)
## X-squared = 3.3912, df = 3, p-value = 0.3352
```
The chi-squared (\(\chi^2\)) test statistic is 3.39 and the corresponding p-value is 0.335.
This means that such a discrepancy between the observed and expected frequencies would occur quite often – about one third of the time – if gender and eye colour were actually independent. More precisely, if gender and eye colour were generated independently, we would obtain a \(\chi^2\) value equal to or greater than 3.39 in 33.5% of similar samples purely due to random variation.
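As a sketch, the reported p-value can be reproduced without a statistics package. For 3 degrees of freedom the chi-squared survival function happens to have the closed form \(\operatorname{erfc}(\sqrt{x/2}) + \sqrt{2x/\pi}\, e^{-x/2}\):

```python
import math

chi2 = 3.3912  # statistic reported in the output above, df = 3

# Survival function of the chi-squared distribution,
# closed form valid only for df = 3
p_value = (math.erfc(math.sqrt(chi2 / 2))
           + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))
print(round(p_value, 3))  # 0.335
```

This agrees with the p-value of 0.3352 reported in the output up to rounding.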
Therefore, the data do not provide evidence of an association between the two variables.
11.4 Correlation ratio and eta-squared
The correlation ratio (denoted by \(\eta\), eta) and eta squared (\(\eta^2\)) are measures of association used primarily when:
- one of the variables (often referred to in this context as the independent variable, just as in regression modelling) is qualitative (categorical) with \(k\) categories (groups),
- the other (the dependent variable) is quantitative.
They quantify how much of the variability in a numerical variable can be explained by group membership.
Let \(y_{ij}\) denote observation \(j\) in group \(i\), \(\bar{y}_i\) denote the mean of group \(i\), \(\bar{y}\) denote the overall mean.
The total sum of squares (SST) is:
\[\text{SST} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 \tag{11.4}\]
The between-group sum of squares (SSB) is:
\[\text{SSB} = \sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2 \tag{11.5} \]
Eta squared \((\eta^2)\) is defined as the proportion of total variation explained by the grouping variable:
\[\eta^2 = \frac{\text{SSB}}{\text{SST}} \tag{11.6} \]
Properties:
- \(0 \le \eta^2 \le 1\)
- \(\eta^2 = 0\) indicates no group effect (the categorical and quantitative variables are not associated)
- \(\eta^2 = 1\) indicates perfect separation of groups (perfect association)
Sometimes we take the square root: the correlation ratio \(\eta\) is defined as the square root of eta squared:
\[\eta = \sqrt{\eta^2} = \sqrt{ \frac{\text{SSB}}{\text{SST}}} \tag{11.7} \]
The correlation ratio can be interpreted as a generalization of the Pearson correlation coefficient to situations where one of the variables is categorical and the other is quantitative.
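Equations (11.4)–(11.7) can be sketched on a small hypothetical data set (two groups of a quantitative measurement, invented here for illustration):

```python
# Hypothetical quantitative measurements recorded in two groups
groups = {"A": [10, 12, 14], "B": [20, 22, 24]}

all_values = [y for ys in groups.values() for y in ys]
grand_mean = sum(all_values) / len(all_values)

# Between-group sum of squares, equation (11.5)
ssb = sum(len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2
          for ys in groups.values())
# Total sum of squares, equation (11.4)
sst = sum((y - grand_mean) ** 2 for y in all_values)

eta_squared = ssb / sst      # equation (11.6)
eta = eta_squared ** 0.5     # equation (11.7)
print(round(eta_squared, 3), round(eta, 3))  # 0.904 0.951
```

Here the two group means (12 and 22) are far apart relative to the spread within each group, so group membership explains about 90% of the total variation.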
11.5 Links
Contingency tables and chi-squared test: https://istats.shinyapps.io/ChiSquaredTest/
11.6 Exercises
Exercise 11.1 In October 2025, 94 students were asked about their gender and eye colour. The responses are summarised in the following contingency table:
| | Female | Male | Total |
|---|---|---|---|
| Blue | 16 | 15 | 31 |
| Brown | 16 | 17 | 33 |
| Green | 5 | 14 | 19 |
| Other | 5 | 6 | 11 |
| Total | 42 | 52 | 94 |
Provide the expected frequencies under the assumption of independence based on the table above.
| | Female | Male | Total |
|---|---|---|---|
| Blue | | | 31 |
| Brown | | | 33 |
| Green | | | 19 |
| Other | | | 11 |
| Total | 42 | 52 | 94 |
Compute \(\chi^2\):
Compute Cramér’s V:
Exercise 11.2 94 students were asked whether they agree or disagree with the following statement: “Statistics is difficult”. Their answers are presented in a contingency table below. Compute an appropriate association measure. What are your conclusions?
| | Female | Male | Total |
|---|---|---|---|
| Agree | 20 | 24 | 44 |
| Neither agree nor disagree | 17 | 18 | 35 |
| Disagree | 5 | 10 | 15 |
| Total | 42 | 52 | 94 |
Exercise 11.3 At three lectures in three different groups, the lecturer brought a large jar filled with coins and asked the students to guess the total amount of money inside.
Using the linked data, show whether there is an association between the group and the estimated amount.