Chapter 11 Association in categorical data

11.1 Contingency table

A contingency table (also called a cross-tabulation or cross table) is a type of table used in statistics to summarize the relationship between two categorical variables. It displays the frequency (counts) or proportions of observations that fall into each combination of categories.

  • Rows represent categories of one variable.
  • Columns represent categories of another variable.
  • Cells contain counts or proportions of observations.

Contingency tables help to:

  • identify patterns or associations between categorical variables,
  • compare distributions across groups,
  • provide input for association measures (such as Cramér’s V) or statistical tests (like the chi-squared test).

Example

In October 2025, 94 students were asked about their gender and eye colour. The answers are summarised in the following contingency table:

Eye colour   Female   Male   Total
Blue             16     15      31
Brown            16     17      33
Green             5     14      19
Other             5      6      11
Total            42     52      94
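In R, a contingency table is usually produced with `table()` on the raw data; here we reconstruct it directly from the counts above (the object name `eye_gender` is our own):

```r
# Reconstruct the contingency table from the counts above
eye_gender <- matrix(
  c(16, 15,
    16, 17,
     5, 14,
     5,  6),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    eye_colour = c("Blue", "Brown", "Green", "Other"),
    gender     = c("Female", "Male")
  )
)

# addmargins() appends the row and column totals
addmargins(eye_gender)
```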

11.2 Cramér’s V

Cramér’s V is a measure of association between two categorical variables, based on a contingency table.

Consider a contingency table with:

  • \(r\) rows,
  • \(c\) columns,
  • observed frequencies \(O_{ij}\) in cells \((i,j)\),
  • total sample size \(n\).

The chi-squared (\(\chi^2\)) statistic is defined as:

\[\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{11.1} \]

where the expected frequencies \(E_{ij}\) are given by:

\[E_{ij} = \frac{(\text{row total}_i)(\text{column total}_j)}{n} \tag{11.2} \]

Notice that the more similar the observed frequencies are to the expected frequencies, the smaller the chi-squared statistic is. On the other hand, the greater the discrepancies between observed and expected frequencies, the larger the chi-squared statistic becomes, providing stronger evidence that there is a statistical dependence in the data-generating process.
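Equations (11.1) and (11.2) can be evaluated directly. A sketch in R, re-entering the counts from the eye colour example so the snippet is self-contained:

```r
# Observed frequencies from the eye colour / gender example
O <- matrix(c(16, 15,
              16, 17,
               5, 14,
               5,  6), nrow = 4, byrow = TRUE)
n <- sum(O)

# Expected frequencies under independence, equation (11.2):
# E_ij = (row total_i)(column total_j) / n
E <- outer(rowSums(O), colSums(O)) / n

# Chi-squared statistic, equation (11.1)
chisq <- sum((O - E)^2 / E)
chisq  # approximately 3.39
```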

Cramér’s V is defined as a function of the chi-squared statistic:

\[V = \sqrt{ \frac{\chi^2}{n \cdot \min(r - 1, c - 1)} } \tag{11.3} \]

where:

  • \(\chi^2\) is the chi-squared statistic defined above,
  • \(n\) is the total number of observations,
  • \(r\) is the number of rows,
  • \(c\) is the number of columns.

By construction, Cramér’s V satisfies:

\(0 \le V \le 1\).

  • \(V = 0\) indicates no association between the variables
  • \(V = 1\) indicates a perfect association
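Base R has no built-in function for Cramér’s V, but equation (11.3) is straightforward to implement; a minimal sketch (the function name `cramers_v` is our own):

```r
# Cramér's V from a contingency table, equation (11.3)
cramers_v <- function(tab) {
  # chi-squared statistic; warnings about small expected counts are suppressed
  chisq <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab) - 1, ncol(tab) - 1)
  unname(sqrt(chisq / (n * k)))
}

# Eye colour / gender example table
tab <- matrix(c(16, 15, 16, 17, 5, 14, 5, 6), nrow = 4, byrow = TRUE)
cramers_v(tab)
```

For the eye colour table this gives \(V \approx 0.19\), a weak association according to the informal benchmarks below.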

Common informal benchmarks used in practice are:

Cramér’s V    Strength of association
0.00–0.10     Negligible
0.10–0.30     Weak
0.30–0.50     Moderate
> 0.50        Strong

Interpretation should always be made in the context of the domain.

11.3 Chi-squared independence test

The chi-squared test of independence is a statistical procedure used to assess whether two categorical variables are statistically independent in the data-generating process.

It uses the chi-squared statistic introduced in equation (11.1). A small chi-squared value indicates that observed frequencies are close to expected frequencies, providing little evidence against independence. A large chi-squared value indicates substantial deviations from independence, suggesting an association between the variables.

Statistical packages performing the chi-squared test of independence compute the p-value. In this test, the p-value indicates how likely it is to observe a chi-squared statistic at least as large as the one obtained, assuming that the null hypothesis – that the variables are independent in the data-generating process – is true.
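Under the null hypothesis the statistic approximately follows a chi-squared distribution with \((r-1)(c-1)\) degrees of freedom, so the p-value is the upper-tail probability of that distribution. A sketch in R, using the statistic from the eye colour example:

```r
# p-value: upper-tail probability of a chi-squared distribution
# with (r - 1)(c - 1) degrees of freedom
chisq <- 3.3912          # statistic from the eye colour / gender example
df <- (4 - 1) * (2 - 1)  # r = 4 eye colours, c = 2 genders
pchisq(chisq, df = df, lower.tail = FALSE)
```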

The chi-squared test performs best when all (or most) expected frequencies exceed 5.

Importantly, the chi-squared test indicates whether an association exists, but not how strong it is. Measures such as Cramér’s V are therefore used to quantify the strength of association.

Example

chisq.test(table(students$eye_colour, students$gender))
## 
##  Pearson's Chi-squared test
## 
## data:  table(students$eye_colour, students$gender)
## X-squared = 3.3912, df = 3, p-value = 0.3352

The chi-squared (\(\chi^2\)) test statistic is 3.39 and the corresponding p-value is 0.335.

This means that such a discrepancy between the observed and expected frequencies would occur quite often – about one third of the time – if gender and eye colour were actually independent. More precisely, if gender and eye colour were generated independently, we would obtain a \(\chi^2\) value of 3.39 or greater in 33.5% of similar samples purely due to random variation.

Therefore, the data do not provide evidence of an association between the two variables.

11.4 Correlation ratio and eta-squared

The correlation ratio (denoted by \(\eta\), eta) and eta squared (\(\eta^2\)) are measures of association used primarily when:

  • one of the variables (often referred to in this context as an independent variable, just like in regression modelling) is qualitative (categorical) with \(k\) categories (groups)

  • and the other (the dependent variable) is quantitative.

They quantify how much of the variability in a numerical variable can be explained by group membership.

Let \(y_{ij}\) denote observation \(j\) in group \(i\), \(\bar{y}_i\) the mean of group \(i\), and \(\bar{y}\) the overall mean.

The total sum of squares (SST) is defined as:

\[\text{SST} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 \tag{11.4}\]

The between-group sum of squares (SSB) is:

\[\text{SSB} = \sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2 \tag{11.5} \]

Eta squared \((\eta^2)\) is defined as the proportion of total variation explained by the grouping variable:

\[\eta^2 = \frac{\text{SSB}}{\text{SST}} \tag{11.6} \]

Properties:

  • \(0 \le \eta^2 \le 1\)
  • \(\eta^2 = 0\) indicates no group effect (the categorical and quantitative variables are not associated)
  • \(\eta^2 = 1\) indicates perfect separation of groups (perfect association)

The correlation ratio \(\eta\) is defined as the square root of eta squared:

\[\eta = \sqrt{\eta^2} = \sqrt{ \frac{\text{SSB}}{\text{SST}}} \tag{11.7} \]

The correlation ratio can be interpreted as a generalization of the Pearson correlation coefficient to situations where one variable is categorical and the other is quantitative.
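Equations (11.4)–(11.6) translate directly into R. A sketch on made-up data (three hypothetical groups of numerical observations; the function name `eta_squared` is our own):

```r
# Eta squared: proportion of total variation explained by group membership
eta_squared <- function(y, group) {
  y_bar <- mean(y)                                   # overall mean
  sst <- sum((y - y_bar)^2)                          # total sum of squares (11.4)
  group_means <- tapply(y, group, mean)
  group_sizes <- tapply(y, group, length)
  ssb <- sum(group_sizes * (group_means - y_bar)^2)  # between-group SS (11.5)
  ssb / sst                                          # eta squared (11.6)
}

# Hypothetical example: a numerical response measured in three groups
y     <- c(50, 60, 55, 80, 85, 90, 70, 75, 65)
group <- rep(c("A", "B", "C"), each = 3)
eta_squared(y, group)        # eta itself is the square root, equation (11.7)
```

For data like these, \(\eta^2\) also equals the \(R^2\) of the corresponding one-way ANOVA fit, e.g. `summary(lm(y ~ group))$r.squared`.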

11.5 Exercises

Exercise 11.1 In October 2025, 94 students were asked about their gender and eye colour. The responses are summarised in the following contingency table:

Eye colour   Female   Male   Total
Blue             16     15      31
Brown            16     17      33
Green             5     14      19
Other             5      6      11
Total            42     52      94

Provide the expected frequencies under the assumption of independence based on the table above.

Eye colour   Female   Male   Total
Blue                            31
Brown                           33
Green                           19
Other                           11
Total            42     52      94

Compute \(\chi^2\):

Compute Cramér’s V:

Exercise 11.2 The same 94 students were asked whether they agree or disagree with the following statement: “Statistics is difficult”. Their answers, broken down by gender, are presented in the contingency table below. Compute an appropriate association measure. What are your conclusions?

                             Female   Male   Total
Agree                            20     24      44
Neither agree nor disagree       17     18      35
Disagree                          5     10      15
Total                            42     52      94

Exercise 11.3 At three lectures in three different groups, the lecturer brought a large jar filled with coins and asked the students to guess the total amount of money inside.

Using the linked data, show whether there is an association between the group and the estimated amount.