4.3 Categorical Data Analysis

Categorical data analysis is used when we have categorical outcomes:

  • Nominal variables: no logical ordering (e.g., sex)
  • Ordinal variables: logical order, but relative distances between values are not clear (e.g., small, medium, large)

An association exists when the distribution of one variable changes as the level (or values) of the other variable changes: the row percentages differ across the columns.

4.3.1 Inferences for Small Samples

The approximate tests based on the asymptotic normality of \(\hat{p}_1 - \hat{p}_2\) do not apply for small samples.

Using Fisher’s Exact Test to evaluate \(H_0: p_1 = p_2\)

  • Assume \(X_1\) and \(X_2\) are independent Binomial counts based on samples of sizes \(n_1\) and \(n_2\), with success probabilities \(p_1\) and \(p_2\)
  • Let \(x_1\) and \(x_2\) be the corresponding observed values.
  • Let \(n= n_1 + n_2\) be the total sample size
  • \(m = x_1 + x_2\) be the observed number of successes.
  • By assuming that \(m\) (total successes) is fixed, and conditioning on this value, one can show that the conditional distribution of the number of successes from sample 1 is Hypergeometric (see the code sketch below)
  • If we want to test \(H_0: p_1 = p_2\) and \(H_a: p_1 \neq p_2\), we have

\[ Z^2 = \left(\frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}}\right)^2 \sim \chi_{1}^2 \]

where \(\hat{p} = m/n\) is the pooled estimate of the common proportion; we reject \(H_0\) at level \(\alpha\) if \(Z^2 > \chi_{1,\alpha}^2\), the upper \(\alpha\) percentage point of the central Chi-squared distribution with one d.f.
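As a minimal sketch (the counts below are made up for illustration), Fisher’s exact test is available in R via fisher.test, and the conditional Hypergeometric distribution can be checked directly with dhyper:

# made-up small-sample counts
x1 = 4; n1 = 10
x2 = 9; n2 = 12
fisher.test(
    matrix(c(x1, n1 - x1, x2, n2 - x2), nrow = 2, byrow = TRUE),
    alternative = "two.sided"
)
# conditional distribution of successes from sample 1, given m = x1 + x2;
# note dhyper's m, n arguments here are the group sizes n1, n2
dhyper(x1, m = n1, n = n2, k = x1 + x2)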

This idea extends to the contingency table setting, where we test whether the observed frequencies equal those expected under a null hypothesis of no association.

4.3.2 Test of Association

The Pearson Chi-square test statistic is

\[ \chi^2 = \sum_{\text{all categories}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \]

Comparison of proportions for several independent surveys or experiments:

|                     | Experiment 1  | Experiment 2  | … | Experiment k  |
|---------------------|---------------|---------------|---|---------------|
| Number of successes | \(x_1\)       | \(x_2\)       | … | \(x_k\)       |
| Number of failures  | \(n_1 - x_1\) | \(n_2 - x_2\) | … | \(n_k - x_k\) |
| Total               | \(n_1\)       | \(n_2\)       | … | \(n_k\)       |

\(H_0: p_1 = p_2 = \dots = p_k\) vs. the alternative that the null is not true (at least one pair is not equal).

We estimate the common value of the probability of success on a single trial assuming \(H_0\) is true:

\[ \hat{p} = \frac{x_1 + x_2 + \dots + x_k}{n_1 + n_2 + \dots + n_k} \]

We use the table of expected counts when \(H_0\) is true:

|         | Experiment 1       | Experiment 2       | … | Experiment k        |
|---------|--------------------|--------------------|---|---------------------|
| Success | \(n_1 \hat{p}\)    | \(n_2 \hat{p}\)    | … | \(n_k \hat{p}\)     |
| Failure | \(n_1(1-\hat{p})\) | \(n_2(1-\hat{p})\) | … | \(n_k (1-\hat{p})\) |
| Total   | \(n_1\)            | \(n_2\)            | … | \(n_k\)             |

\[ \chi^2 = \sum_{\text{all cells in table}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \]

with \(k-1\) degrees of freedom.
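As a minimal sketch (counts made up for illustration), the pooled estimate \(\hat{p}\), the expected counts, and the \(\chi^2\) statistic can be computed directly in R:

# made-up counts for k = 3 experiments
x = c(48, 56, 38)   # successes
n = c(100, 120, 90) # sample sizes
p.hat = sum(x) / sum(n)                      # pooled estimate under H0
expected = rbind(n * p.hat, n * (1 - p.hat)) # expected counts table
observed = rbind(x, n - x)
chi2 = sum((observed - expected)^2 / expected)
chi2
pchisq(chi2, df = length(n) - 1, lower.tail = FALSE) # p-value

prop.test(x, n) reproduces the same statistic and p-value.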

4.3.2.1 Two-way Count Data

|              | 1          | 2          | … | j          | … | c          | Row Total  |
|--------------|------------|------------|---|------------|---|------------|------------|
| 1            | \(n_{11}\) | \(n_{12}\) | … | \(n_{1j}\) | … | \(n_{1c}\) | \(n_{1.}\) |
| 2            | \(n_{21}\) | \(n_{22}\) | … | \(n_{2j}\) | … | \(n_{2c}\) | \(n_{2.}\) |
| ⋮            | ⋮          | ⋮          |   | ⋮          |   | ⋮          | ⋮          |
| r            | \(n_{r1}\) | \(n_{r2}\) | … | \(n_{rj}\) | … | \(n_{rc}\) | \(n_{r.}\) |
| Column Total | \(n_{.1}\) | \(n_{.2}\) | … | \(n_{.j}\) | … | \(n_{.c}\) | \(n\)      |

Design 1
The total sample size is fixed, \(n\) = constant (e.g., a survey on job satisfaction and income); both row and column totals are random variables.

Design 2
The sample size in each group (each row) is fixed (e.g., drug treatments classified as success or failure); there is a fixed number of participants for each treatment, i.e., independent random samples from the row populations.

These different sampling designs imply two different probability models.

4.3.2.2 Total Sample Size Fixed

Design 1

A random sample of size \(n\) is drawn from a single population, and the sample units are cross-classified into \(r\) row categories and \(c\) column categories.

This results in an \(r \times c\) table of observed counts

\(n_{ij}\), \(i = 1,\dots,r\); \(j = 1,\dots,c\)

Let \(p_{ij}\) be the probability of classification into cell \((i,j)\), with \(\sum_{i=1}^r \sum_{j=1}^c p_{ij} = 1\). Let \(N_{ij}\) be the random variable corresponding to \(n_{ij}\).
The joint distribution of the \(N_{ij}\) is multinomial with unknown parameters \(p_{ij}\).

Denote the row variable by \(X\) and the column variable by \(Y\); then \(p_{ij} = P(X=i,Y = j)\), and \(p_{i.} = P(X = i)\) and \(p_{.j} = P(Y = j)\) are the marginal probabilities.


The null hypothesis that \(X\) and \(Y\) are statistically independent (i.e., no association) is:

\[ H_0: p_{ij} = P(X =i,Y=j) = P(X =i) P(Y =j) = p_{i.}p_{.j} \text{ for all } i,j \\ H_a: p_{ij} \neq p_{i.}p_{.j} \text{ for some } i,j \]

4.3.2.3 Row Total Fixed

Design 2

Random samples of sizes \(n_1,...,n_r\) are drawn independently from \(r \ge 2\) row populations. In this case, the 2-way table row totals are \(n_{i.} = n_i\) for \(i = 1,...,r\).

The counts from each row are modeled by independent multinomial distributions.

\(X\) is fixed, \(Y\) is observed.

Then, \(p_{ij}\) represents the conditional probability \(p_{ij} = P(Y=j|X=i)\).

The null hypothesis is that the probability of response \(j\) is the same regardless of the row population (i.e., no association):

\[ \begin{cases} H_0: p_{ij} = P(Y = j | X = i) = p_j & \text{for all } i,j =1,2,...,c \\ \text{or } H_0: (p_{i1},p_{i2},...,p_{ic}) = (p_1,p_2,...,p_c) & \text{ for all } i\\ H_a: (p_{i1},p_{i2},...,p_{ic}) & \text{ are not the same for all } i \end{cases} \]

Although the hypotheses being tested differ between the two sampling designs, the \(\chi^2\) test is identical.

The estimated expected frequencies are:

\[ \hat{e}_{ij} = \frac{n_{i.}n_{.j}}{n} \]

The Chi-square statistic is

\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij}-\hat{e}_{ij})^2}{\hat{e}_{ij}} \sim \chi^2_{(r-1)(c-1)} \]

The \(\alpha\)-level test rejects \(H_0\) if \(\chi^2 > \chi^2_{(r-1)(c-1),\alpha}\).
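As a minimal sketch (counts made up for illustration), the test is available in R as chisq.test; the \(\hat{e}_{ij}\) are returned in the expected component:

# made-up 2 x 3 table of observed counts
tab = matrix(
    c(30, 20, 10,
      20, 25, 25),
    nrow = 2, byrow = TRUE
)
fit = chisq.test(tab)
fit$expected # estimated expected frequencies n_i. n_.j / n
fit          # chi-square statistic with (r-1)(c-1) df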

4.3.2.4 Pearson Chi-square Test

  • Determine whether an association exists
  • Sometimes, \(H_0\) represents the model whose validity is to be tested. Contrast this with the conventional formulation of \(H_0\) as the hypothesis that is to be disproved. The goal in this case is not to disprove the model, but to see whether data are consistent with the model and if deviation can be attributed to chance.
  • These tests do not measure the strength of an association.
  • These tests depend on and reflect the sample size: doubling the sample size by copying each observation doubles the \(\chi^2\) statistic even though the strength of the association does not change (see the sketch after this list).
  • The Pearson Chi-square Test is not appropriate when more than about 20% of the cells have an expected cell frequency of less than 5 (large-sample p-values not appropriate).
  • When the sample size is small the exact p-values can be calculated (this is prohibitive for large samples); calculation of the exact p-values assumes that the column totals and row totals are fixed.
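A minimal sketch of the sample-size dependence noted above, using a made-up \(2 \times 2\) table:

tab = matrix(c(20, 30, 40, 10), nrow = 2)
chisq.test(tab, correct = FALSE)$statistic     # chi-square for the table
chisq.test(2 * tab, correct = FALSE)$statistic # doubling every count doubles it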
# One- and two-sample proportion tests with prop.test,
# using observed counts for two samples (July and September)
july.x = 480
july.n = 1000
sept.x = 704
sept.n = 1600

\[ H_0: p_J = 0.5 \\ H_a: p_J < 0.5 \]

prop.test(
    x = july.x,
    n = july.n,
    p = 0.5,
    alternative = "less",
    correct = FALSE
)
#> 
#>  1-sample proportions test without continuity correction
#> 
#> data:  july.x out of july.n, null probability 0.5
#> X-squared = 1.6, df = 1, p-value = 0.103
#> alternative hypothesis: true p is less than 0.5
#> 95 percent confidence interval:
#>  0.0000000 0.5060055
#> sample estimates:
#>    p 
#> 0.48

\[ H_0: p_J = p_S \\ H_a: p_J \neq p_S \]

prop.test(
    x = c(july.x, sept.x),
    n = c(july.n, sept.n),
    correct = FALSE
)
#> 
#>  2-sample test for equality of proportions without continuity correction
#> 
#> data:  c(july.x, sept.x) out of c(july.n, sept.n)
#> X-squared = 3.9701, df = 1, p-value = 0.04632
#> alternative hypothesis: two.sided
#> 95 percent confidence interval:
#>  0.0006247187 0.0793752813
#> sample estimates:
#> prop 1 prop 2 
#>   0.48   0.44

4.3.3 Ordinal Association

  • An ordinal association implies that as one variable increases, the other tends to increase or decrease (depending on the nature of the association).
  • For these tests, variables with two or more levels must have their levels in a logical order.

4.3.3.1 Mantel-Haenszel Chi-square Test

The Mantel-Haenszel Chi-square test is more powerful for testing an ordinal association, but it does not measure the strength of the association.

This test is presented here for the case where one has a series of \(2 \times 2\) tables that examine the same effect under different conditions (if there are \(K\) such tables, we have a \(2 \times 2 \times K\) table).

In stratum \(k\), given the marginal totals \((n_{.1k},n_{.2k},n_{1.k},n_{2.k})\), the sampling model for cell counts is the Hypergeometric (knowing \(n_{11k}\) determines \((n_{12k},n_{21k},n_{22k})\), given the marginal totals)

Assuming conditional independence, the Hypergeometric mean and variance of \(n_{11k}\) are

\[ m_{11k} = E(n_{11k}) = \frac{n_{1.k} n_{.1k}}{n_{..k}} \\ var(n_{11k}) = \frac{n_{1.k} n_{2.k} n_{.1k} n_{.2k}}{n_{..k}^2(n_{..k}-1)} \]

To test conditional independence, Mantel and Haenszel proposed

\[ M^2 = \frac{(|\sum_{k} n_{11k} - \sum_k m_{11k}| - 0.5)^2}{\sum_{k}var(n_{11k})} \sim \chi^2_{1} \]

This method can be extended to general \(I \times J \times K\) tables.

Example: a \(2 \times 2 \times 3\) table of bronchitis status by particulate level, stratified by age

Bron = array(
    c(20, 9, 382, 214, 10, 7, 172, 120, 12, 6, 327, 183),
    dim = c(2, 2, 3),
    dimnames = list(
        Particulate = c("High", "Low"),
        Bronchitis = c("Yes", "No"),
        Age = c("15-24", "25-39", "40+")
    )
)
margin.table(Bron, c(1, 2))
#>            Bronchitis
#> Particulate Yes  No
#>        High  42 881
#>        Low   22 517
# assess whether the relationship between 
# Bronchitis by Particulate Level varies by Age
library(samplesizeCMH)
marginal_table = margin.table(Bron, c(1, 2))
odds.ratio(marginal_table)
#> [1] 1.120318

#  whether these odds vary by age. 
# The conditional odds can be calculated using the original table.
apply(Bron, 3, odds.ratio)
#>     15-24     25-39       40+ 
#> 1.2449098 0.9966777 1.1192661

# Mantel-Haenszel Test
mantelhaen.test(Bron, correct = TRUE)
#> 
#>  Mantel-Haenszel chi-squared test with continuity correction
#> 
#> data:  Bron
#> Mantel-Haenszel X-squared = 0.11442, df = 1, p-value = 0.7352
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.6693022 1.9265813
#> sample estimates:
#> common odds ratio 
#>          1.135546
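As a minimal sketch, the \(M^2\) formula above can be computed directly from Bron; with the 0.5 continuity correction it should reproduce the Mantel-Haenszel statistic reported above:

n11k = Bron["High", "Yes", ]                   # n_11k in each stratum
n1.k = apply(Bron, 3, function(t) sum(t[1, ])) # row totals n_1.k
n.1k = apply(Bron, 3, function(t) sum(t[, 1])) # column totals n_.1k
n..k = apply(Bron, 3, sum)                     # stratum totals n_..k
m11k = n1.k * n.1k / n..k                      # Hypergeometric means
v11k = n1.k * (n..k - n1.k) * n.1k * (n..k - n.1k) /
    (n..k^2 * (n..k - 1))                      # Hypergeometric variances
(abs(sum(n11k) - sum(m11k)) - 0.5)^2 / sum(v11k)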

4.3.3.1.1 McNemar’s Test

McNemar’s test is a special case of the Mantel-Haenszel Chi-square test, applied to paired binary responses arranged in a \(2 \times 2\) table, where the off-diagonal cells count the discordant pairs.

# 2 x 2 table of paired responses; off-diagonal cells are discordant pairs
vote = cbind(c(682, 22), c(86, 810))
mcnemar.test(vote, correct = TRUE)
#> 
#>  McNemar's Chi-squared test with continuity correction
#> 
#> data:  vote
#> McNemar's chi-squared = 36.75, df = 1, p-value = 1.343e-09
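With the continuity correction, the statistic depends only on the discordant (off-diagonal) counts \(b\) and \(c\):

\[ M^2 = \frac{(|b - c| - 1)^2}{b + c} \]

Here \(b = 86\) and \(c = 22\), so \(M^2 = (|86 - 22| - 1)^2/(86 + 22) = 63^2/108 = 36.75\), matching the output above.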

4.3.3.2 Spearman Rank Correlation

To test the strength of association between two ordinally scaled variables, we can use the Spearman rank correlation statistic.

Let \(X\) and \(Y\) be two random variables measured on an ordinal scale. Consider \(n\) pairs of observations (\(x_i,y_i\)), \(i = 1,\dots,n\)

The Spearman rank correlation coefficient (denoted by \(r_S\)) is calculated using the Pearson correlation formula, but based on the ranks of \(x_i\) and \(y_i\).

The Spearman rank correlation is calculated as follows:

  1. Assign ranks to \(x_i\)’s and \(y_i\)’s separately. Let \(u_i = \text{rank}(x_i)\) and \(v_i = \text{rank}(y_i)\)
  2. Calculate \(r_S\) using the formula for the Pearson correlation coefficient, but applied to the ranks:

\[ r_S = \frac{\sum_{i=1}^{n}(u_i - \bar{u})(v_i - \bar{v})}{\sqrt{(\sum_{i = 1}^{n}(u_i - \bar{u})^2)(\sum_{i=1}^{n}(v_i - \bar{v})^2)}} \]

\(r_S\) ranges between \(-1\) and \(+1\), with

  • \(r_S = -1\) if there is a perfect negative monotone association
  • \(r_S = +1\) if there is a perfect positive monotone association between X and Y.

To test

  • \(H_0:\) \(X\) and \(Y\) independent

  • \(H_a\): \(X\) and \(Y\) positively associated

For large \(n\) (e.g., \(n \ge 10\)), under \(H_0\), approximately

\[ r_S \sim N(0,1/(n-1)) \]

Then,

\[ Z = r_s \sqrt{n-1} \sim N(0,1) \]
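A minimal sketch with made-up data and no ties: compute \(r_S\) from the ranks, form \(Z = r_S \sqrt{n-1}\), and compare with cor.test (which uses an exact or t-based p-value, so the results may differ slightly):

x = 1:10                             # made-up ordinal scores
y = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)
n = length(x)
r.s = cor(rank(x), rank(y))          # Pearson correlation of the ranks
z = r.s * sqrt(n - 1)                # large-sample test statistic
pnorm(z, lower.tail = FALSE)         # one-sided p-value for Ha: positive association
cor.test(x, y, method = "spearman", alternative = "greater")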