Chapter 12 Relationships Between Categorical Variables

12.1 Probability vs Odds

So far, we have always framed the chance of an event happening in terms of probability. Suppose $r$ of $n$ equally likely outcomes are deemed a ‘success’.

For example, suppose I choose one card from a standard deck of $n=52$ cards and I am looking for one of the $r=13$ hearts. Let $A$ be the event that the card is a heart. The probability of event $A$ is $\text{Probability}=\frac{r}{n}=\frac{13}{52}=\frac{1}{4}$

The odds of an event are the ratio of the number of outcomes in which it happens to the number of outcomes in which it does NOT happen.

$\text{Odds}=\frac{r}{n-r}=\frac{13}{52-13}=\frac{13}{39}=\frac{1}{3}$

We say the probability of getting a heart is $\frac{1}{4}$ and the odds are 1-to-3 for (or 3-to-1 against).
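The conversion between probability and odds for the card example can be checked with a short Python sketch. The function names here are illustrative only, and exact fractions are used so the results match the hand calculation.

```python
from fractions import Fraction

def odds_from_counts(r, n):
    """Odds in favor: successes divided by failures."""
    return Fraction(r, n - r)

def odds_from_probability(p):
    """Convert a probability into odds in favor."""
    return p / (1 - p)

def probability_from_odds(odds):
    """Convert odds in favor back into a probability."""
    return odds / (1 + odds)

r, n = 13, 52                      # 13 hearts in a 52-card deck
probability = Fraction(r, n)
odds = odds_from_counts(r, n)

print(probability)                 # 1/4
print(odds)                        # 1/3
print(probability_from_odds(odds)) # 1/4
```

Going from odds back to probability uses $p=\frac{\text{odds}}{1+\text{odds}}$, which is why odds of $\frac{1}{3}$ return the original probability of $\frac{1}{4}$.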

12.2 A $2 \times 2$ Contingency Table

A contingency table (also known as a two-way table or a cross-tab) is a way to display information on two categorical variables simultaneously. Here, the rows of the table represent whether a patient has Alzheimer’s or not. The columns represent testing positive or negative for E4, an enzyme that might indicate a genetic risk for Alzheimer’s.

A random sample of 50 individuals at autopsy with Alzheimer’s showed that 18 had E4, while a similar random sample of 200 individuals without Alzheimer’s showed that 24 had E4.

The resulting $2 \times 2$ contingency table is:

	Positive $T^+$	Negative $T^-$	Total
Alzheimer’s $D^+$	$a=18$	$b=32$	$a+b=50$
No Alzheimer’s $D^-$	$c=24$	$d=176$	$c+d=200$
Total	$a+c=42$	$b+d=208$	$n=a+b+c+d=250$

12.3 Relative Risk, $RR$

The relative risk is the ratio of the rate of disease given exposure (a positive test) to the rate of disease given no exposure (a negative test).
It can also compare the rate of incidence between a treatment group and a control group (e.g. the physicians’ aspirin study).
With disease status in the rows and test result in the columns, it is computed as: $RR=\frac{a/(a+c)}{b/(b+d)}=\frac{a(b+d)}{b(a+c)}=\frac{18(208)}{32(42)}\approx 2.786$
This ratio tells me that, in this sample, Alzheimer’s is about 2.8 times as likely in patients with E4 as in those without E4.
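As a check on the arithmetic, the following Python sketch applies the definition directly: the rate of Alzheimer’s among those testing positive for E4 divided by the rate among those testing negative. The variable names are illustrative, and the cell counts are those of the table in Section 12.2.

```python
# Cell counts (rows: disease status, columns: E4 test result)
a, b = 18, 32     # Alzheimer's:    E4 positive, E4 negative
c, d = 24, 176    # no Alzheimer's: E4 positive, E4 negative

rate_given_positive = a / (a + c)   # 18 of the 42 E4-positive individuals
rate_given_negative = b / (b + d)   # 32 of the 208 E4-negative individuals

relative_risk = rate_given_positive / rate_given_negative
print(round(relative_risk, 3))      # 2.786
```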

12.4 Odds Ratio, $OR$

The odds ratio gives the odds of the disease given positive test/exposure in ratio to the odds of the disease given negative test/no exposure. This is similar, but not identical, to the relative risk.
It is computed as $OR=\frac{a/b}{c/d}=\frac{ad}{bc}=\frac{n_{11} n_{22}}{n_{12} n_{21}}=\frac{18/32}{24/176}=\frac{18(176)}{32(24)}=4.125$, where $n_{ij}$ denotes the count in row $i$ and column $j$ of the table.
This ratio indicates that the odds of having Alzheimer’s are a little over 4 times as high in those with E4 as in those without E4.
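A minimal Python sketch of the same calculation, using exact fractions, shows that the ratio of the two odds and the cross-product $\frac{ad}{bc}$ agree. The variable names are illustrative only.

```python
from fractions import Fraction

# Cell counts from the 2x2 table in Section 12.2
a, b, c, d = 18, 32, 24, 176

odds_given_positive = Fraction(a, c)   # odds of Alzheimer's among E4 positive: 18/24
odds_given_negative = Fraction(b, d)   # odds of Alzheimer's among E4 negative: 32/176

odds_ratio = odds_given_positive / odds_given_negative
cross_product = Fraction(a * d, b * c)

print(float(odds_ratio))               # 4.125
print(odds_ratio == cross_product)     # True
```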

12.5 A Typical Medical Journal

The following findings come from an article published in the medical journal The Lancet on March 11, 2020, written by a team of Chinese doctors studying COVID-19 in China.

191 patients (135 from Jinyintan Hospital and 56 from Wuhan Pulmonary Hospital) were included in this study, of whom 137 were discharged and 54 died in hospital. 91 (48%) patients had a comorbidity, with hypertension being the most common (58 [30%] patients), followed by diabetes (36 [19%] patients) and coronary heart disease (15 [8%] patients). Multivariable regression showed increasing odds of in-hospital death associated with older age (odds ratio 1.10, 95% CI 1.03–1.17, per year increase; p=0·0043), …

Notice that the $OR=1.10$ for age indicates that the odds of in-hospital death rise by about 10% for each additional year of age, with a confidence interval that lies completely above 1 and a $p$-value that is less than $\alpha=0.05$.
Other such results appear later in the article. The sex of the patient was not significant in terms of the odds of death: $OR=0.61$ (with females having lower odds of death, but not to a statistically significant degree), the 95% CI of (0.31–1.20) contains 1, and the $p$-value was 0.15.
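Two habits for reading such results can be sketched in Python. First, an odds ratio reported per year compounds multiplicatively, so under the regression model’s assumption of a constant per-year odds ratio, a 10-year age difference corresponds to $1.10^{10}$. Second, an odds ratio is statistically significant at the 5% level when its 95% confidence interval excludes 1. The helper names below are hypothetical, and the numbers are the ones quoted from the article.

```python
def scale_odds_ratio(or_per_unit, units):
    """Odds ratio for a difference of `units`, assuming a constant per-unit odds ratio."""
    return or_per_unit ** units

def interval_excludes_one(lower, upper):
    """True when a confidence interval for an odds ratio does not contain 1."""
    return lower > 1 or upper < 1

age_or, age_ci = 1.10, (1.03, 1.17)   # per year of age
sex_or, sex_ci = 0.61, (0.31, 1.20)   # sex of the patient

ten_year_or = scale_odds_ratio(age_or, 10)
ten_year_ci = tuple(scale_odds_ratio(limit, 10) for limit in age_ci)

print(round(ten_year_or, 2))             # 2.59
print(interval_excludes_one(*age_ci))    # True
print(interval_excludes_one(*sex_ci))    # False
```

Under that assumption, a patient 10 years older has roughly 2.6 times the odds of in-hospital death, which is easier to picture than a 10% increase per year.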

12.6 Relative Risk vs Odds Ratio

Rate and odds aren’t equivalent. Relative risk is the ratio of the rate of incidence of disease when exposed/not exposed, while the odds ratio is the ratio of the odds of the incidence of disease when exposed/not exposed.
In the case of COVID-19, an epidemiologist might compare the mortality rate (where the event is the death of a COVID-19 patient) between those at risk (defined as over a certain age and/or immunocompromised) and those who are not, using one of these statistics.
$OR$ and $RR$ seem quite similar, and mathematically $OR$ approaches $RR$ when the event is rare in both groups, because the odds $\frac{p}{1-p}$ are then nearly equal to the probability $p$.
However, $RR$ is easier to interpret. From Crichton’s Information Point (JCN, 2001):

The odds ratio is used extensively in the healthcare literature. However, few people have a natural ability to interpret odds ratios, except perhaps bookmakers. It is much easier to interpret relative risks. In many situations we will be able to interpret odds ratios by pretending that they are relative risks because, when the events are rare, risks and odds are very similar.

There are statistical reasons to prefer the odds ratio, particularly when fitting certain mathematical models known as logistic regression and Cox regression. These are more sophisticated versions of the linear regression model, appropriate when the response is binary (logistic) or based on survival (Cox).
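The claim that the odds ratio approaches the relative risk for rare events can be illustrated numerically. In the Python sketch below, the relative risk of 2 and the three baseline risks are hypothetical values chosen only for illustration, and the function names are not from any library.

```python
def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1 - p)

def odds_ratio_from_risks(baseline_risk, relative_risk):
    """Odds ratio implied by a baseline (unexposed) risk and a relative risk."""
    exposed_risk = relative_risk * baseline_risk
    return odds(exposed_risk) / odds(baseline_risk)

relative_risk = 2.0                       # hypothetical relative risk
for baseline_risk in (0.20, 0.01, 0.001): # hypothetical risks in the unexposed group
    implied_or = odds_ratio_from_risks(baseline_risk, relative_risk)
    print(baseline_risk, round(implied_or, 3))
```

With a common event (baseline risk 0.20) the implied odds ratio is about 2.67, well above the relative risk of 2. With baseline risks of 0.01 and 0.001 it falls to about 2.02 and 2.00, so treating the odds ratio as a relative risk does little harm.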

12.7 A Larger Contingency Table

A common way to present categorical data is in a two-way contingency table, where the rows of the table classify the subjects on one variable and the columns classify them on a second variable.

For this example, we will have a $2 \times 3$ contingency table; the 2 rows will represent boys and girls and the 3 columns will represent three different flavors of ice cream that might be their favorite.

##       Chocolate Vanilla Strawberry
## Boys         75      75         50
## Girls       150     100         50

The marginal row totals are 200 and 300 (there are 200 boys and 300 girls, for a total of 500 children). The marginal column totals are 225, 175, and 100 (225 like chocolate, 175 vanilla, 100 strawberry).
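The marginal totals can be reproduced with a few lines of Python. This is a minimal sketch that stores the table as a plain dictionary; the variable names are illustrative.

```python
flavors = ["Chocolate", "Vanilla", "Strawberry"]
table = {
    "Boys":  [75, 75, 50],
    "Girls": [150, 100, 50],
}

# Row totals: sum across the flavors for each group
row_totals = {group: sum(counts) for group, counts in table.items()}

# Column totals: sum across the groups for each flavor
column_totals = {
    flavor: sum(counts[i] for counts in table.values())
    for i, flavor in enumerate(flavors)
}

grand_total = sum(row_totals.values())

print(row_totals)      # {'Boys': 200, 'Girls': 300}
print(column_totals)   # {'Chocolate': 225, 'Vanilla': 175, 'Strawberry': 100}
print(grand_total)     # 500
```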

$P(Boy)=\frac{200}{500}=\frac{2}{5}=0.40$ ; 40% of the sample are boys

$P(Girl)=\frac{300}{500}=\frac{3}{5}=0.60$ ; 60% of the sample are girls

$P(Chocolate)=\frac{225}{500}=0.45$ ; 45% of the children prefer chocolate ice cream

$P(Boy \: and \: Chocolate)=\frac{75}{500}=0.15$ ; 15% of the children are boys that prefer chocolate ice cream

$P(Boy \: or \: Chocolate)=\frac{200}{500}+\frac{225}{500}-\frac{75}{500}=\frac{350}{500}=0.70$ ; notice we subtract the intersection to avoid double-counting the boys that like chocolate; this counts all boys and also the girls that like chocolate

$P(Chocolate|Boy)=\frac{75}{200}=0.375$ ; 37.5% of the boys like chocolate the most

$P(Chocolate|Girl)=\frac{150}{300}=0.5$ ; 50% of the girls like chocolate the most

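Girls are more likely and boys less likely to prefer chocolate. If sex and flavor preference were independent, both conditional probabilities would equal $P(Chocolate)=0.45$; since they do not, the events are NOT independent.

Notice that $P(Girl|Chocolate)=\frac{150}{225}=\frac{2}{3}\approx 0.667$ ; 66.7% of the children that prefer chocolate are girls.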
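The probabilities in this section can be verified with a short Python sketch that works from the cell counts and uses exact fractions. The `prob` helper is a hypothetical convenience function written for this example, not part of any library.

```python
from fractions import Fraction

counts = {
    ("Boy", "Chocolate"): 75,   ("Boy", "Vanilla"): 75,   ("Boy", "Strawberry"): 50,
    ("Girl", "Chocolate"): 150, ("Girl", "Vanilla"): 100, ("Girl", "Strawberry"): 50,
}
total = sum(counts.values())

def prob(sex=None, flavor=None):
    """P(sex and flavor); leaving an argument as None means 'any value'."""
    matching = sum(
        n for (s, f), n in counts.items()
        if (sex is None or s == sex) and (flavor is None or f == flavor)
    )
    return Fraction(matching, total)

p_boy = prob(sex="Boy")
p_choc = prob(flavor="Chocolate")
p_boy_and_choc = prob("Boy", "Chocolate")
p_boy_or_choc = p_boy + p_choc - p_boy_and_choc        # addition rule
p_choc_given_boy = p_boy_and_choc / p_boy              # conditional probability
p_choc_given_girl = prob("Girl", "Chocolate") / prob(sex="Girl")
p_girl_given_choc = prob("Girl", "Chocolate") / p_choc

# Independence would require P(Boy and Chocolate) = P(Boy) * P(Chocolate)
independent = p_boy_and_choc == p_boy * p_choc

print(p_boy_or_choc)       # 7/10
print(p_choc_given_boy)    # 3/8
print(p_girl_given_choc)   # 2/3
print(independent)         # False
```

Here $P(Boy) \times P(Chocolate)=0.40 \times 0.45=0.18$, which differs from $P(Boy \: and \: Chocolate)=0.15$, confirming that the two events are not independent in this sample.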