# Chapter 32 Causal Inference

We want to know what we can learn about causation from observations. We know that “correlation does not necessarily imply causation” and that experiments are our best way to learn about causes. But we also understand that observation has some use, and we want to know how to evaluate causal claims in observational studies.

## 32.1 What is a cause?

Like so much of statistics, understanding causation requires a healthy dose of imagination.

Specifically, we imagine multiple worlds. For example, we can imagine a world in which there was some treatment (e.g. we drank coffee, we got a vaccine, we raised taxes, etc.) and one in which that treatment was absent (e.g. we didn’t have coffee, we didn’t get a vaccine, we didn’t raise taxes, etc.), and we then follow some response variable of interest. We say that the treatment is a cause of the outcome if changing it will change the outcome, on average. Note that for quantitative treatments, we can imagine a bunch of worlds in which the treatment was modified by some quantitative value.

In causal inference, considering the outcome if we had changed a treatment is called counterfactual thinking, and it is critical to our ability to think about causes.
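
To make the "multiple worlds" idea concrete, here is a minimal sketch in base R. All numbers (a baseline outcome of 10 and an average treatment effect of 1) are made up for illustration. In a simulation we get to see both worlds for every individual, which we never can with real data.

```r
set.seed(1)
n <- 10000

# Two imagined worlds for every individual (hypothetical numbers):
# y_control is the outcome without the treatment, y_treated the outcome with it.
y_control <- rnorm(n, mean = 10, sd = 2)
y_treated <- y_control + rnorm(n, mean = 1, sd = 0.5)  # true average effect is 1

# The causal effect for an individual is the difference between their two worlds.
individual_effect <- y_treated - y_control
true_ate <- mean(individual_effect)

# In reality we only observe one world per individual.
treated  <- rbinom(n, size = 1, prob = 0.5) == 1
observed <- ifelse(treated, y_treated, y_control)

# Because the treatment was assigned at random, the difference in group means
# estimates the average causal effect.
estimated_ate <- mean(observed[treated]) - mean(observed[!treated])

round(c(true = true_ate, estimated = estimated_ate), 2)
```

The estimate is close to the truth only because the treatment was assigned by a coin flip; the rest of the chapter is about what happens when it is not.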

## 32.2 DAGs, confounds, and experiments

Say we wanted to know if smoking causes cancer.

### 32.2.1 Confounds

In the specifics of this case, Fisher turned out to be quite wrong – genes do influence the probability of smoking and genes do influence the probability of lung cancer, but smoking has a much stronger influence on the probability of getting lung cancer than does genetics.

### 32.2.2 Randomized Controlled Experiments

This is why, despite their limitations (Ch. 30), randomized controlled experiments are our best way to learn about causation: we randomly place participants in the alternative realities that we imagine and look at the outcomes of alternative treatments. That is, we bring our imaginary worlds to life.

So to distinguish between the claim that smoking causes cancer and Fisher’s claim that genetics is a confound and that smoking does not cause cancer, we could randomly assign some people to smoke and some not to. Of course, this is not feasible for both ethical and logistical reasons, so we need some other way to work through this. This is our goal today!
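
Since we cannot run this experiment on people, we can at least run it on simulated ones. The sketch below builds a hypothetical world in which Fisher is right: a made-up genetic propensity (`genes`) raises both the chance of smoking and the chance of cancer, and smoking itself does nothing. The effect sizes are assumptions chosen only for illustration.

```r
set.seed(2)
n <- 20000
genes <- rnorm(n)  # hypothetical genetic propensity

# Fisher's hypothetical world: genes influence cancer, smoking has no effect.
p_cancer <- plogis(-2 + 1 * genes)

# Observational world: people with "risky" genes are also more likely to smoke.
smokes_obs <- rbinom(n, 1, plogis(1.5 * genes))
cancer_obs <- rbinom(n, 1, p_cancer)
obs_diff   <- mean(cancer_obs[smokes_obs == 1]) - mean(cancer_obs[smokes_obs == 0])

# Experimental world: smoking is assigned by coin flip, independent of genes.
smokes_rct <- rbinom(n, 1, 0.5)
cancer_rct <- rbinom(n, 1, p_cancer)
rct_diff   <- mean(cancer_rct[smokes_rct == 1]) - mean(cancer_rct[smokes_rct == 0])

round(c(observational = obs_diff, randomized = rct_diff), 3)
```

In the observational world smokers get more cancer even though smoking is harmless by construction, while random assignment breaks the link between genes and smoking and the difference shrinks toward zero.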

### 32.2.3 DAGs

I’ve introduced two DAGs so far.

• Figure 32.1 is a causal model of smoking causing lung cancer. Note this does not mean that nothing else causes lung cancer, or that everyone who smokes will get lung cancer, or that no one who doesn’t smoke will get lung cancer. Rather, it means that if we copied each person, and had one version of them smoke and the other not, there would be more cases of lung cancer in the smoking clones than the nonsmoking clones.

• Figure 32.2 presents Fisher’s argument that smoking does not cause cancer and that rather, both smoking and cancer are influenced by a common cause – genetics.

These are not the only plausible causal models for an association between smoking and cancer. I present three other possibilities in Figure 32.3.

• A pipe is presented in Figure 32.3a. That is, genes cause smoking and smoking causes cancer. Empirically and statistically, this is a hard model to evaluate because changing genes would “cause” cancer in an experiment, and “controlling for genetics” by including it in a linear model would hide the effect of smoking. The right thing to do is to ignore the genetic component, but that feels wrong, and how do we justify it? One way to get at this is to “match” on genetics and then compare outcomes for cancer. A 2017 study compared the incidence of lung cancer between monozygotic twins of which one smoked and one did not, and found a higher incidence of cancer in the smoking twin (Hjelmborg et al. 2017).
• A collider is presented in Figure 32.3b, as both genes and smoking cause cancer (they “collide”). Here there are two “paths” between smoking and cancer: (1) the front door causal path, in which smoking causes cancer, and (2) the back door non-causal path connecting smoking to cancer via the confounding variable, genetics. Here the challenge is to appropriately partition and attribute causes.
• A more complex and realistic model, including the effects of the environment on cancer and smoking, is presented in Figure 32.3c. Note that in this model genes do not cause the environment and the environment does not cause genes.

Collider bias: Colliders can have funny consequences when we condition on an outcome. Say that, in the world, there is no association between smoking and a genetic propensity to get lung cancer for reasons unrelated to smoking. If we only looked at lung cancer patients, it would appear that there is a negative correlation between smoking and genetic risk, because non-smokers with low genetic risk rarely get lung cancer and so rarely appear among the patients. This is known as “selection bias”, “M bias”, or “collider bias”.
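
Collider bias is easy to see in a small simulation. The sketch below is not an analysis of real data: the smoking rate and the effect sizes on the logit scale are invented for illustration. Smoking and genetic risk are independent by construction, and both raise the chance of cancer.

```r
set.seed(3)
n <- 50000

smokes       <- rbinom(n, 1, 0.3)  # hypothetical smoking rate
genetic_risk <- rnorm(n)           # simulated independently of smoking

# Both smoking and genetic risk raise the chance of cancer (assumed effect sizes).
cancer <- rbinom(n, 1, plogis(-3 + 2 * smokes + 1.5 * genetic_risk))

# In the whole population the two causes are unrelated ...
cor_everyone <- cor(smokes, genetic_risk)

# ... but conditioning on the collider (looking only at patients) links them.
cor_patients <- cor(smokes[cancer == 1], genetic_risk[cancer == 1])

round(c(everyone = cor_everyone, patients_only = cor_patients), 3)
```

Among patients the correlation is negative: a patient who never smoked most likely got there through high genetic risk.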

## 32.3 When correlation is (not) good enough

So we are going to think through causation – but we might wonder when we need to know causes.

• We don’t need to understand causation to make predictions under the status quo. If I just want to make good predictions, I can build a good multiple regression model and make predictions from it, and I will be just fine. If I want to buy good corn, I can go to the farm stand that reliably sells yummy corn; I don’t care if the corn is yummy because of the soil, the light environment, or the farmers playing Taylor Swift every morning to get the corn excited. Similarly, if I was selling insurance, I would just need to reliably predict who would get lung cancer; I wouldn’t need to know why.
• We need to understand causation when we want to intervene (or make causal claims). If I want to grow my own yummy corn, I would want to know what about the farmers’ practice made the corn yummy. I wouldn’t need to worry about fertilizing my soil if it turned out that pumping some Taylor Swift tunes was all I needed to do to make yummy corn. Similarly, if I was giving public health advice, I would need to know that smoking caused cancer to credibly suggest that people quit smoking to reduce their chance of developing lung cancer.

## 32.4 Multiple regression, and causal inference

So far we have considered how we draw and think about causal models. This is incredibly useful – drawing a causal model makes our assumptions and reasoning clear.

But what can we do with these plots, and how can they help us do statistics? It turns out they can be pretty useful! To work through this I will simulate fake data under different causal models and run different linear regressions on the simulated data to see what happens.

### 32.4.1 Imaginary scenario

In evolution, fitness is the metric we care about most. While it is nearly impossible to measure and define, we often can measure things related to it, like the number of children that an organism has. For the purposes of this example let’s say that is good enough.

So, say we are studying a fish and want to see if being big (in length) increases fitness (measured as the number of eggs produced). To make things more interesting, let’s say that fish live in environments whose quality we can measure. For the purposes of this example, let’s say that we can reliably and correctly estimate all these values without bias, and that all have normally distributed residuals, etc.

#### 32.4.1.1 Causal model 1: The confound

Let’s start with a simple confound. Say a good environment makes fish bigger and increases their fitness, but being bigger itself has no impact on fitness. First let’s simulate:

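library(tidyverse) # tibble(), %>%, and ggplot()
library(car)       # Anova() for type II sums of squares, used below
n_fish          <- 100
confounded_fish <- tibble(
  env_quality = rnorm(n = n_fish, mean = 50, sd = 5),                # simulating the environment
  fish_length = rnorm(n = n_fish, mean = env_quality, sd = 2),       # length depends on the environment
  fish_eggs   = rnorm(n = n_fish, mean = env_quality / 2, sd = 6) %>% round()) # eggs depend only on the environment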

Now we know that fish length does not cause fish to lay more eggs, as we did not model this. Nonetheless, a plot and a statistical test show a strong association between length and eggs if we do not include environmental quality in our model.

confound_plot <- ggplot(confounded_fish, aes(x = fish_length, y = fish_eggs)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Confound", subtitle = "# eggs increases with length\nwithout a causal relationship.")

confound_plot

Our statistical analysis will not show cause

We can build a simple linear model predicting the number of fish eggs as a function of fish length. We can see that the prediction is good, and makes sense – egg number reliably increases with fish length. But we know this is not a causal relationship (because we didn’t have this cause in our simulation).

lm(fish_eggs ~ fish_length, confounded_fish) %>%  summary()
##
## Call:
## lm(formula = fish_eggs ~ fish_length, data = confounded_fish)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.1647  -3.8101   0.2738   3.4682  22.1632
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.9121     5.6811   0.337 0.737157
## fish_length   0.4320     0.1155   3.739 0.000311 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.164 on 98 degrees of freedom
## Multiple R-squared:  0.1249, Adjusted R-squared:  0.1159
## F-statistic: 13.98 on 1 and 98 DF,  p-value: 0.0003107
lm(fish_eggs ~ fish_length, confounded_fish) %>%  anova()  
## Analysis of Variance Table
##
## Response: fish_eggs
##             Df Sum Sq Mean Sq F value    Pr(>F)
## fish_length  1  531.3  531.28  13.982 0.0003107 ***
## Residuals   98 3723.6   38.00
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding the confound into our model

So, let’s build a model including the confound environmental quality.

fish_lm_w_confound <- lm(fish_eggs~ env_quality + fish_length, confounded_fish)  

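Looking at the estimates from the model we see that the answers don’t make a ton of sense: the estimated effect of fish length is negative, even though length has no effect on egg number in our simulation.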

fish_lm_w_confound %>% coef() %>% round(digits = 2)
## (Intercept) env_quality fish_length
##       -3.40        1.06       -0.52

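In this case, an ANOVA with type I sums of squares (with environment entered first) and an ANOVA with type II sums of squares agree that environmental quality is a significant predictor of egg number and that fish length is not. But notice that the sum of squares attributed to the environment falls from 780.2 to 338.3 under type II, because much of the variation it explains is shared with fish length.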

fish_lm_w_confound %>% anova()            
## Analysis of Variance Table
##
## Response: fish_eggs
##             Df Sum Sq Mean Sq F value    Pr(>F)
## env_quality  1  780.2  780.24 22.3564 7.678e-06 ***
## fish_length  1   89.4   89.37  2.5608    0.1128
## Residuals   97 3385.3   34.90
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fish_lm_w_confound %>% Anova(type = "II")
## Anova Table (Type II tests)
##
## Response: fish_eggs
##             Sum Sq Df F value  Pr(>F)
## env_quality  338.3  1  9.6942 0.00243 **
## fish_length   89.4  1  2.5608 0.11280
## Residuals   3385.3 97
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
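
A single simulated data set can mislead us, so it helps to repeat the simulation many times. The sketch below uses base R only and reuses the parameter values from the confound simulation above. The helper `simulate_confound()` and the choice of 500 replicates are my own additions for illustration.

```r
set.seed(4)

# Simulate one confounded data set and return the slope for fish length
# from a model without, and a model with, environmental quality.
simulate_confound <- function(n_fish = 100) {
  env_quality <- rnorm(n_fish, mean = 50, sd = 5)
  fish_length <- rnorm(n_fish, mean = env_quality, sd = 2)
  fish_eggs   <- round(rnorm(n_fish, mean = env_quality / 2, sd = 6))
  c(length_only  = unname(coef(lm(fish_eggs ~ fish_length))["fish_length"]),
    length_w_env = unname(coef(lm(fish_eggs ~ env_quality + fish_length))["fish_length"]))
}

slopes <- replicate(500, simulate_confound())  # 2 x 500 matrix of slopes

slope_means <- rowMeans(slopes)       # average slope across replicates
slope_sds   <- apply(slopes, 1, sd)   # replicate-to-replicate variability
round(rbind(mean = slope_means, sd = slope_sds), 2)
```

Under these assumptions the slope from the length-only model sits well above zero on average, while the slope from the model that includes the environment averages close to zero but varies much more from one replicate to the next. That extra variability is why any single fit with both predictors can look so odd.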

What to do?

First let’s look at all the relationships in our data

The right thing to do in this case is to just build a model with the environmental quality.

lm(fish_eggs ~ env_quality , confounded_fish) %>% summary()
##
## Call:
## lm(formula = fish_eggs ~ env_quality, data = confounded_fish)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -13.2822  -3.8655   0.3601   3.4024  22.4367
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -3.7814     5.7464  -0.658    0.512
## env_quality   0.5486     0.1169   4.691 8.81e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.954 on 98 degrees of freedom
## Multiple R-squared:  0.1834, Adjusted R-squared:  0.175
## F-statistic: 22.01 on 1 and 98 DF,  p-value: 8.813e-06
lm(fish_eggs ~ env_quality , confounded_fish) %>% anova()
## Analysis of Variance Table
##
## Response: fish_eggs
##             Df Sum Sq Mean Sq F value    Pr(>F)
## env_quality  1  780.2  780.24  22.006 8.813e-06 ***
## Residuals   98 3474.7   35.46
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multicollinearity: This example also shows a statistical problem, multicollinearity. That is, our predictors are correlated. This makes building and interpreting a model challenging.
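
One way to put a number on this is the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the other(s). The sketch below computes it by hand in base R for a freshly simulated data set using the same parameters as above; it is an illustration, not part of the original analysis.

```r
set.seed(5)
n_fish <- 100
env_quality <- rnorm(n_fish, mean = 50, sd = 5)
fish_length <- rnorm(n_fish, mean = env_quality, sd = 2)

# How strongly are the two predictors correlated?
predictor_cor <- cor(env_quality, fish_length)

# R^2 from predicting one predictor from the other, and the resulting VIF.
r2  <- summary(lm(fish_length ~ env_quality))$r.squared
vif <- 1 / (1 - r2)

round(c(correlation = predictor_cor, r_squared = r2, vif = vif), 2)
```

A VIF well above 1 means that the sampling variance of a predictor's coefficient is inflated by its overlap with the other predictor, which fits the large standard errors we saw when both were in the model.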

#### 32.4.1.2 Causal model 2: The pipe

So now let’s look at a pipe in which the environment causes fish length and fish length causes fitness, but the environment itself has no direct impact on fitness. First let’s simulate:

pipe_fish <- tibble(
  env_quality = rnorm(n = n_fish, mean = 50, sd = 5),                # simulating the environment
  fish_length = rnorm(n = n_fish, mean = env_quality, sd = 2),       # length depends on the environment
  fish_eggs   = rnorm(n = n_fish, mean = fish_length / 2, sd = 5) %>% round()) # eggs depend only on length

Now we know that environmental quality does not directly cause fish to lay more eggs, as we did not model this. Nonetheless, a plot and a statistical test show a strong association between quality and eggs if we do not include fish length in our model.

pipe_plot <- ggplot(pipe_fish, aes(x = env_quality, y = fish_eggs)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(subtitle = "# eggs increases with env quality\nalthough the causal relationship is indirect.")

pipe_plot

Our statistical analysis will not show cause

We can build a simple linear model predicting the number of fish eggs as a function of environmental quality. We can see that the prediction is good, and makes sense: egg number reliably increases with environmental quality. But we know this is not a direct causal relationship, because in our simulation the environment affects egg number only through fish length.

lm(fish_eggs ~ env_quality, pipe_fish) %>%  summary()
##
## Call:
## lm(formula = fish_eggs ~ env_quality, data = pipe_fish)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.5129  -3.5799  -0.1769   3.8154  14.2840
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.22358    4.75834  -0.467    0.641
## env_quality  0.52996    0.09399   5.639 1.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.295 on 98 degrees of freedom
## Multiple R-squared:  0.245,  Adjusted R-squared:  0.2372
## F-statistic: 31.79 on 1 and 98 DF,  p-value: 1.655e-07
lm(fish_eggs ~ env_quality, pipe_fish) %>%  anova() 
## Analysis of Variance Table
##
## Response: fish_eggs
##             Df Sum Sq Mean Sq F value    Pr(>F)
## env_quality  1  891.3  891.30  31.793 1.655e-07 ***
## Residuals   98 2747.3   28.03
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding the immediate cause into our model

So, let’s build a model including the immediate cause, fish length.

fish_lm_w_cause <- lm(fish_eggs~ fish_length + env_quality, pipe_fish)  

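Looking at the estimates from the model we see that the answers don’t make a ton of sense: the estimated effect of fish length is negative, even though length is the only direct cause of egg number in our simulation.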

fish_lm_w_cause %>% coef() %>% round(digits = 2)
## (Intercept) fish_length env_quality
##       -2.80       -0.19        0.73

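The stats here again come out a bit funny. With type I sums of squares the answer depends on the order of the predictors: whichever one enters the model first gets most of the credit. With type II sums of squares, environmental quality is significant (p = 0.015) but fish length is not (p = 0.47), even though fish length is the only direct cause of egg number in our simulation.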

lm(fish_eggs~ fish_length + env_quality, pipe_fish)  %>% anova()
## Analysis of Variance Table
##
## Response: fish_eggs
##             Df  Sum Sq Mean Sq F value    Pr(>F)
## fish_length  1  733.33  733.33 26.0306 1.669e-06 ***
## env_quality  1  172.62  172.62  6.1273   0.01504 *
## Residuals   97 2732.69   28.17
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm(fish_eggs~ env_quality + fish_length, pipe_fish)  %>% anova()
## Analysis of Variance Table
##
## Response: fish_eggs
##             Df  Sum Sq Mean Sq F value    Pr(>F)
## env_quality  1  891.30  891.30 31.6377 1.793e-07 ***
## fish_length  1   14.66   14.66  0.5202    0.4725
## Residuals   97 2732.69   28.17
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm(fish_eggs~ fish_length + env_quality, pipe_fish)  %>% Anova(type = "II")
## Anova Table (Type II tests)
##
## Response: fish_eggs
##              Sum Sq Df F value  Pr(>F)
## fish_length   14.66  1  0.5202 0.47248
## env_quality  172.62  1  6.1273 0.01504 *
## Residuals   2732.69 97
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What to do?

First let’s look at all the relationships in our data

The right thing to do in this case is to just build a model with the fish length.

lm(fish_eggs ~ fish_length, pipe_fish) %>% summary()
##
## Call:
## lm(formula = fish_eggs ~ fish_length, data = pipe_fish)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.7642  -3.8217  -0.2454   3.5790  14.3729
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.81992    4.38096   0.644    0.521
## fish_length  0.43252    0.08696   4.974  2.8e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.445 on 98 degrees of freedom
## Multiple R-squared:  0.2015, Adjusted R-squared:  0.1934
## F-statistic: 24.74 on 1 and 98 DF,  p-value: 2.803e-06
lm(fish_eggs ~ fish_length, pipe_fish) %>% anova()
## Analysis of Variance Table
##
## Response: fish_eggs
##             Df  Sum Sq Mean Sq F value    Pr(>F)
## fish_length  1  733.33  733.33  24.736 2.803e-06 ***
## Residuals   98 2905.31   29.65
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
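
With only 100 fish and strongly correlated predictors, the estimates above are noisy. As a rough check of what the pipe implies, the sketch below reruns the same simulation in base R with a much larger sample (the value of `n_big` is an arbitrary choice of mine, and I skip the rounding of egg number) and compares the total effect of the environment with its direct effect.

```r
set.seed(6)
n_big <- 50000  # far more fish than we could ever measure; illustration only

env_quality <- rnorm(n_big, mean = 50, sd = 5)
fish_length <- rnorm(n_big, mean = env_quality, sd = 2)
fish_eggs   <- rnorm(n_big, mean = fish_length / 2, sd = 5)

# Total effect of the environment: everything that flows through the pipe.
total_effect <- unname(coef(lm(fish_eggs ~ env_quality))["env_quality"])

# Direct effect of the environment: what is left once length is in the model.
direct_effect <- unname(coef(lm(fish_eggs ~ env_quality + fish_length))["env_quality"])

round(c(total = total_effect, direct = direct_effect), 2)
```

In this simulated pipe the total effect is about 0.5 eggs per unit of environmental quality, and it all but disappears once fish length is included. Which number is the "right" one depends on whether we are asking about changing the environment or about what the environment does on its own.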

#### 32.4.1.3 Causal model 3: The collider

So now let’s look at a collider in which the environment causes both fish length and fitness, and fish length also causes fitness. First let’s simulate:

collide_fish <- tibble(
  env_quality = rnorm(n = n_fish, mean = 50, sd = 5),                # simulating the environment
  fish_length = rnorm(n = n_fish, mean = env_quality, sd = 2),       # length depends on the environment
  fish_eggs   = rnorm(n = n_fish, mean = (env_quality / 4 + fish_length * 3 / 4) / 2, sd = 7) %>% round()) # eggs depend on both

Now we know that environmental quality increases fish length and both environmental quality and fish length directly cause fish to lay more eggs.

But our models have a bunch of trouble figuring this out. Again, a type one sums of squares puts most of the “blame” on the first thing in the model.

lm(fish_eggs ~ env_quality + fish_length, collide_fish ) %>%  summary()
##
## Call:
## lm(formula = fish_eggs ~ env_quality + fish_length, data = collide_fish)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -20.388  -4.192   0.616   6.299  16.046
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.6019     7.6949   1.248    0.215
## env_quality  -0.4427     0.4282  -1.034    0.304
## fish_length   0.7621     0.4083   1.866    0.065 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.309 on 97 degrees of freedom
## Multiple R-squared:  0.07381,    Adjusted R-squared:  0.05472
## F-statistic: 3.865 on 2 and 97 DF,  p-value: 0.02426
lm(fish_eggs ~ env_quality + fish_length, collide_fish ) %>%  anova() 
## Analysis of Variance Table
##
## Response: fish_eggs
##             Df Sum Sq Mean Sq F value  Pr(>F)
## env_quality  1  226.9 226.892  4.2474 0.04199 *
## fish_length  1  186.1 186.059  3.4830 0.06502 .
## Residuals   97 5181.6  53.419
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm(fish_eggs ~ fish_length + env_quality, collide_fish ) %>%  anova() 
## Analysis of Variance Table
##
## Response: fish_eggs
##             Df Sum Sq Mean Sq F value  Pr(>F)
## fish_length  1  355.9  355.87  6.6618 0.01135 *
## env_quality  1   57.1   57.09  1.0686 0.30382
## Residuals   97 5181.6   53.42
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm(fish_eggs ~ env_quality + fish_length, collide_fish ) %>%  Anova(type = "II") 
## Anova Table (Type II tests)
##
## Response: fish_eggs
##             Sum Sq Df F value  Pr(>F)
## env_quality   57.1  1  1.0686 0.30382
## fish_length  186.1  1  3.4830 0.06502 .
## Residuals   5181.6 97
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
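
Part of the trouble here is simply that 100 fish are not enough to pull apart two highly correlated causes. From the simulation code, the true direct effects are 1/8 for environmental quality and 3/8 for fish length. The base R sketch below (with an arbitrarily large `n_big` that I chose for illustration, and without rounding egg number) checks whether a multiple regression recovers them when data are plentiful.

```r
set.seed(7)
n_big <- 100000  # illustration only

env_quality <- rnorm(n_big, mean = 50, sd = 5)
fish_length <- rnorm(n_big, mean = env_quality, sd = 2)
fish_eggs   <- rnorm(n_big, mean = (env_quality / 4 + fish_length * 3 / 4) / 2, sd = 7)

big_fit   <- lm(fish_eggs ~ env_quality + fish_length)
estimates <- coef(big_fit)[c("env_quality", "fish_length")]

# Direct effects implied by the simulation: (1/4)/2 and (3/4)/2.
truth <- c(env_quality = 1 / 8, fish_length = 3 / 8)

round(rbind(truth, estimates), 3)
```

So the model is not wrong so much as starved of information: with correlated predictors, far more data are needed before the estimates settle near the truth.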

## 32.6 Wrap up

The examples above show how complex it is to decipher causes without experiments. But they also shed light on how we can infer causation, because causal diagrams can point to testable hypotheses.

If we can’t do experiments, causal diagrams offer us a glimpse into how we can infer causation.

Perhaps the best way to do this is by matching. If we can match subjects that are identical for all causal paths except the one we are testing, we can then test for a statistical association, and make a causal claim we can believe in.
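
As a rough illustration of the idea, the sketch below applies a crude form of matching to the confounded fish: we compare fish only to other fish from nearly identical environments by grouping them into one-unit-wide bins of environmental quality. The binning rule and the sample size are assumptions of mine, and real matching methods are more careful than this.

```r
set.seed(8)
n_fish <- 5000  # a larger sample than above, so every bin holds several fish

# The confound scenario: environment drives both length and eggs.
env_quality <- rnorm(n_fish, mean = 50, sd = 5)
fish_length <- rnorm(n_fish, mean = env_quality, sd = 2)
fish_eggs   <- rnorm(n_fish, mean = env_quality / 2, sd = 6)

# Crude "matching": fish in the same stratum have nearly the same environment.
env_stratum <- factor(floor(env_quality))

# Ignoring the environment, length appears to increase egg number.
naive_slope <- unname(coef(lm(fish_eggs ~ fish_length))["fish_length"])

# Comparing only within matched strata, the apparent effect nearly vanishes.
matched_slope <- unname(coef(lm(fish_eggs ~ fish_length + env_stratum))["fish_length"])

round(c(naive = naive_slope, matched = matched_slope), 2)
```

Within a stratum the environment is nearly constant, so any remaining association between length and eggs cannot come from that path.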

The field of causal inference is developing rapidly. If you want to hear more, the popular book The Book of Why (Pearl and Mackenzie 2018) is a good place to start.

### References

Hjelmborg, Jacob, Tellervo Korhonen, Klaus Holst, Axel Skytthe, Eero Pukkala, Julia Kutschke, Jennifer R Harris, et al. 2017. “Lung Cancer, Genetic Predisposition and Smoking: The Nordic Twin Study of Cancer.” Thorax 72 (11): 1021–27. https://doi.org/10.1136/thoraxjnl-2015-207921.
Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. New York: Basic Books.