Chapter 12 Association in binary data

12.1 2x2 tables

The smallest possible contingency table is 2×2. It summarizes the joint distribution of two binary variables and forms the basis for many measures of association and classification performance.

Example

Three hundred and eighty-one students of both genders were asked the following questions: “Did you watch videos for more than half an hour yesterday?” and “Do you have siblings?”

The results are presented in the following table:

                        Watched videos   Didn’t watch videos
Has siblings            270              54
Doesn’t have siblings   49               8

12.1.1 Phi coefficient

The phi coefficient measures the strength and direction of association between two binary variables.

We consider the following contingency table:

        Y = 1   Y = 0   Total
X = 1   a       b       a + b
X = 0   c       d       c + d
Total   a + c   b + d   n

where \(n = a + b + c + d\).

The \(\phi\) coefficient for a \(2 \times 2\) contingency table is defined as:

\[\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \tag{12.1}\]

The phi coefficient is confined to the range \([-1, 1]\).

It equals the Pearson correlation coefficient computed for two binary variables coded as 0/1 (see Exercise 12.1).

The phi coefficient is also closely related to the chi-squared (\(\chi^2\)) statistic and Cramér’s V:

\[ \phi^2 = \chi^2/n \]

\[ |\phi| = V \]

Here, \(n\) denotes the total number of observations (sample size), and \(V\) is Cramér’s V for a 2×2 table (11.3).
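The relation \(\phi^2 = \chi^2/n\) can be checked numerically. The following is a minimal sketch in plain Python, using the counts from the siblings and video-watching table in section 12.1. The function names `phi_coefficient` and `chi_squared` are illustrative, and the chi-squared statistic is assumed to be the Pearson statistic without continuity correction (with a correction applied, the identity would no longer hold exactly).

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]], following formula (12.1)."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom


def chi_squared(a, b, c, d):
    """Pearson chi-squared statistic (assumption: no continuity correction)."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Counts from the siblings / video-watching example in section 12.1
a, b, c, d = 270, 54, 49, 8
n = a + b + c + d

phi = phi_coefficient(a, b, c, d)
chi2 = chi_squared(a, b, c, d)

print(f"phi       = {phi:.4f}")
print(f"phi^2     = {phi ** 2:.6f}")
print(f"chi^2 / n = {chi2 / n:.6f}")
```

For these counts \(\phi\) is slightly negative and close to zero, so the two questions are almost unrelated in this sample; the last two printed values agree up to floating-point error.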

Example

Three hundred and eighty-two students of both genders were asked whether they drank more than 2 liters of carbonated drinks (e.g., cola, Sprite) in the previous week. The results are as follows:

      Female   Male
No    151      106
Yes   39       86

The phi coefficient (the correlation between being male and drinking soda) is:

\[\phi = \frac{151\cdot86-39\cdot106}{\sqrt{(151+39)(106+86)(151+106)(39+86)}} \approx 0.259\]
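Spelling out the intermediate arithmetic may help when checking the result by hand. Here the cells are read as \(a = 151\), \(b = 106\), \(c = 39\), \(d = 86\), which is an assumed labeling that matches the order of the rows and columns in the table:

\[\phi = \frac{12986 - 4134}{\sqrt{190 \cdot 192 \cdot 257 \cdot 125}} = \frac{8852}{\sqrt{1\,171\,920\,000}} \approx \frac{8852}{34233.3} \approx 0.259\]

The positive sign reflects that the “Yes” answers are over-represented among males relative to females.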

12.1.2 Odds ratio

The odds ratio (OR) is a simple way to compare how likely an event is, measured by its odds, in two groups.

Before defining the odds ratio, one has to define odds. Odds compare how many times an event occurred with how many times it did not occur within the same group. Consider a binary outcome (Event / No event) and a single group A, in which the event was observed \(a\) times and not observed \(b\) times.

          Event   No event
Group A   a       b

The odds of the event are then defined as:

\[\text{odds} = \frac{\text{number of events}}{\text{number of non-events}} = \frac{a}{b}.\]

  • Odds = 1 → the event and non-event occur equally often

  • Odds > 1 → the event occurs more often than the non-event

  • Odds < 1 → the event occurs less often than the non-event

The odds ratio compares odds in two groups, say A and B:

\[ \text{OR} = \frac{\text{odds in group A}}{\text{odds in group B}}.\]

  • OR = 1 → no difference between groups

  • OR > 1 → the event is more likely (in terms of odds) in group A

  • OR < 1 → the event is less likely in group A
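The two definitions above translate directly into code. The sketch below is illustrative only: the function names `odds` and `odds_ratio` and the counts are hypothetical, chosen so the arithmetic is easy to follow.

```python
def odds(events, non_events):
    """Odds of the event within one group: events / non-events."""
    if non_events == 0:
        raise ValueError("odds are undefined when there are no non-events")
    return events / non_events


def odds_ratio(events_a, non_events_a, events_b, non_events_b):
    """Odds in group A divided by odds in group B."""
    odds_b = odds(events_b, non_events_b)
    if odds_b == 0:
        raise ValueError("odds ratio is undefined when group B has no events")
    return odds(events_a, non_events_a) / odds_b


# Hypothetical counts, for illustration only:
# group A: 30 events, 10 non-events; group B: 20 events, 20 non-events
odds_a = odds(30, 10)                 # event is 3 times as frequent as non-event
odds_b = odds(20, 20)                 # event and non-event equally frequent
or_ab = odds_ratio(30, 10, 20, 20)

print(f"odds in A = {odds_a}, odds in B = {odds_b}, OR = {or_ab}")
```

With these made-up counts the odds ratio is greater than 1, so the event is more likely (in terms of odds) in group A. The explicit checks for zero counts are a design choice of this sketch; in practice such tables are often handled by other means, which are outside the scope of this section.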

The odds ratio is widely used in the medical sciences (e.g., case–control studies) and in statistical modeling, especially logistic regression.

Example: Titanic survival by gender

  • Women: ~73.4% survived

  • Men: ~20.5% survived

2×2 table:

                 Female   Male
Survived         359      352
Didn’t survive   130      1366

Odds of survival:

  • Women: \(359 / 130 \approx 2.76\)

  • Men: \(352 / 1366 \approx 0.26\)

Odds ratio (women vs men):

\[\text{OR} \approx \frac{2.76}{0.26} \approx 11\]

The odds of surviving the Titanic disaster were about 11 times higher for women than for men.
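The Titanic figures can be reproduced from the four counts in the table. The sketch below (variable names are illustrative) also uses two standard identities that are not stated in the text above: odds can be obtained from a proportion \(p\) as \(p/(1-p)\), and for a 2×2 table the odds ratio equals the cross-product ratio \(ad/(bc)\).

```python
# Counts from the Titanic survival table
survived_f, died_f = 359, 130
survived_m, died_m = 352, 1366

# Survival proportions within each group
p_f = survived_f / (survived_f + died_f)
p_m = survived_m / (survived_m + died_m)

# Odds of survival computed from counts
odds_f = survived_f / died_f
odds_m = survived_m / died_m

# The same odds computed from proportions, p / (1 - p)
odds_f_from_p = p_f / (1 - p_f)
odds_m_from_p = p_m / (1 - p_m)

# Odds ratio (women vs men), directly and via the cross-product
or_direct = odds_f / odds_m
or_cross = (survived_f * died_m) / (died_f * survived_m)

print(f"survival: women {p_f:.1%}, men {p_m:.1%}")
print(f"odds:     women {odds_f:.2f}, men {odds_m:.2f}")
print(f"OR (women vs men): {or_direct:.2f}")
```

Without intermediate rounding the odds ratio comes out at roughly 10.7, which is consistent with the statement that the odds were about 11 times higher for women.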

12.1.3 Confusion matrix

12.2 Binary and quantitative variables

12.2.1 Point-biserial correlation

12.2.2 Cohen’s d

12.3 Binary and ordinal variables

12.3.1 AUC

12.3.2 Somers’ D

12.5 Exercises

Exercise 12.1 Students of both sexes were asked the question: “Did you watch videos for more than half an hour yesterday?”

Using the collected responses, compute the φ (phi) coefficient measuring the association between sex and video-watching behavior.

Check whether both methods

  • the one based on the contingency table (formula (12.1)), and
  • the one based on variables transformed into 0/1 form (i.e., recode the variables as 0/1 and compute Pearson’s correlation)

return the same result.