# Chapter 28 Causal Inference

We want to (know what we can) learn about causation from observations: we know “correlation does not necessarily imply causation,” and that experiments are our best way to learn about causes. But we also understand that there is some use in observation, and we want to know how we can evaluate causal claims in observational studies.

**Required reading / Viewing:**

## 28.1 What is a cause?

Like so much of statistics, understanding causation requires a healthy dose of imagination.

Specifically, imagine multiple worlds. For example, we can imagine a world in which there was some treatment (e.g. we drank coffee, we got a vaccine, we raised taxes, etc.) and one in which that treatment was absent (e.g. we didn’t have coffee, we didn’t raise taxes, etc.), and we then follow some response variable of interest. We say that the treatment is a cause of the outcome if changing it will change the outcome, on average. Note that for quantitative treatments, we can imagine a bunch of worlds in which the treatment was modified by some quantitative value.

In causal inference, considering the outcome if we had changed a treatment is called *counterfactual thinking*, and it is critical to our ability to think about causes.

## 28.2 DAGs, confounds, and experiments

Say we wanted to know if smoking causes cancer.

### 28.2.1 Confounds

Fisher famously argued that the association between smoking and lung cancer need not be causal – perhaps genes that predispose people to smoke also predispose them to lung cancer. In the specifics of this case, Fisher turned out to be quite wrong – genes do influence the probability of smoking and genes do influence the probability of lung cancer, but smoking has a much stronger influence on the probability of getting lung cancer than does genetics.

### 28.2.2 Randomized Controlled Experiments

This is why, despite their limitations (Ch. 27), randomized controlled experiments are our best tool for learning about causation – we randomly place participants in the alternative realities that we imagine and look at the outcomes of the alternative treatments. That is, we bring our imaginary worlds to life.

So, to distinguish between the claim that smoking causes cancer and Fisher’s claim that genetics is a confound and that smoking does not cause cancer, we could randomly assign some people to smoke and some not to. Of course, this is not feasible for both ethical and logistical reasons, so we need some other way to work through this. This is our goal today!

### 28.2.3 DAGs

I’ve introduced two DAGs so far.

Figure 28.1 is a causal model of smoking causing lung cancer. Note this does not mean that nothing else causes lung cancer, or that everyone who smokes will get lung cancer, or that no one who doesn’t smoke will get lung cancer. Rather, it means that if we copied each person, and had one version of them smoke and the other not, there would be more cases of lung cancer among the smoking clones than the nonsmoking clones.

Figure 28.2 presents Fisher’s argument that smoking does not cause cancer and that rather, both smoking and cancer are influenced by a common cause – genetics.

These are not the only plausible causal models for an association between smoking and cancer. I present three other possibilities in Figure 28.3.

- A **pipe** is presented in Figure 28.3a. That is – genes cause smoking and smoking causes cancer. Empirically and statistically, this is a hard model to evaluate because changing genes would “cause” cancer in an experiment, and “controlling for genetics” by including it in a linear model would hide the effect of smoking. The right thing to do is to ignore the genetic component – but that feels wrong, and how do we justify it? One way to get at this is to “match” on genetics and then compare outcomes for cancer. A 2017 study compared the incidence of lung cancer between monozygotic twins for which one smoked and one did not, and found a higher incidence of cancer in the smoking twin (Hjelmborg et al. 2017).

- A **collider** is presented in Figure 28.3b, as both genes and smoking cause cancer (they “collide”). Here there are two “paths” between smoking and cancer: 1. the *front door* causal path – smoking causes cancer – and 2. the *back door* non-causal path connecting smoking to cancer via the confounding variable, genetics. Here the challenge is to appropriately partition and attribute causes.

- A more complex and realistic model including the effects of the environment on cancer and smoking is presented in Figure 28.3c. Note that in this model genes do not cause the environment and the environment does not cause genes.

**Collider bias:** Colliders can have funny consequences when we condition on an outcome. Say in the world there is no association between smoking and a genetic propensity to get lung cancer for reasons unrelated to smoking. If we only looked at lung cancer patients, it would appear that there is a negative correlation between smoking and a genetic risk for cancer unrelated to smoking, because we rarely see cancer patients who neither smoked nor carried genetic risk. This is known as “selection bias”, “M bias”, or “collider bias”.
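A quick simulation (a sketch with made-up numbers, not data from this chapter) makes collider bias concrete: smoking and genetic risk are simulated independently, yet among cancer patients they appear negatively correlated.

```r
library(tidyverse)
set.seed(1)

n_people <- 10000
people <- tibble(smoker       = rbinom(n = n_people, size = 1, prob = 0.5),
                 genetic_risk = rbinom(n = n_people, size = 1, prob = 0.5),
                 # cancer (the collider) is caused by both smoking and genetic risk
                 cancer       = rbinom(n = n_people, size = 1,
                                       prob = 0.05 + 0.30 * smoker + 0.30 * genetic_risk))

# In the full population, smoking and genetic risk are (nearly) uncorrelated
cor(people$smoker, people$genetic_risk)

# Conditioning on the collider -- looking only at cancer patients --
# induces a negative association between its two causes
people %>%
  filter(cancer == 1) %>%
  summarise(cor_among_patients = cor(smoker, genetic_risk))
```

The negative correlation among patients is purely an artifact of selection – neither variable causes the other.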

## 28.3 When correlation is (not) good enough

So we are going to think through causation – but we might wonder when we need to know causes.

**We don’t need to understand causation to make predictions under the status quo.** If I just want to make good predictions, I can build a good multiple regression model, and make predictions from it, and we will be just fine. If I want to buy good corn – I can go to the farm stand that reliably sells yummy corn; I don’t care if the corn is yummy because of the soil, the light environment, or the farmers playing Taylor Swift every morning to get the corn excited. Similarly, if I were selling insurance, I would just need to reliably predict who would get lung cancer; I wouldn’t need to know why.

**We need to understand causation when we want to intervene (or make causal claims).** If I want to grow my own yummy corn, I would want to know what about the farmer’s practice made the corn yummy. I wouldn’t need to worry about fertilizing my soil if it turned out that pumping some Taylor Swift tunes was all I needed to do to make yummy corn. Similarly, if I were giving public health advice, I would need to know that smoking caused cancer to credibly suggest that people quit smoking to reduce their chance of developing lung cancer.

## 28.4 Multiple regression, and causal inference

So far we have considered how we draw and think about causal models. This is incredibly useful – drawing a causal model makes our assumptions and reasoning clear.

But what can we do with these plots, and how can they help us do statistics? It turns out they can be pretty useful! To work through this I will simulate fake data under different causal models and run different linear regressions on the simulated data to see what happens.

### 28.4.1 Imaginary scenario

In evolution, fitness is the metric we care about most. While it is nearly impossible to measure and define, we often can measure things related to it, like the number of children that an organism has. For the purposes of this example let’s say that is good enough.

So, say we are studying a fish and want to see if being big (in length) increases fitness (measured as the number of eggs produced). To make things more interesting, let’s say that fish live in environments whose quality we can measure. For the purposes of this example, let’s say that we can reliably and correctly estimate all these values without bias, and that all have normally distributed residuals, etc.

#### 28.4.1.1 Causal model 1: The confound

Let’s start with a simple confound – say a good environment makes fish bigger and increases their fitness, but being bigger itself has no impact on fitness. First, let’s simulate:

```
n_fish <- 100
confounded_fish <- tibble(env_quality = rnorm(n = n_fish, mean = 50, sd = 5), # simulating the environment
                          fish_length = rnorm(n = n_fish, mean = env_quality, sd = 2),
                          fish_eggs   = rnorm(n = n_fish, mean = env_quality / 2, sd = 6) %>% round())
```

Now we know that fish length does not cause fish to lay more eggs – as we did not model this. Nonetheless, a plot and a statistical test show a strong association between length and eggs if we do not include environmental quality in our model.

```
confound_plot <- ggplot(confounded_fish, aes(x = fish_length, y = fish_eggs)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Confound", subtitle = "# eggs increases with length\nwithout a causal relationship.")
confound_plot
```

**Our statistical analysis will not show cause**

We can build a simple linear model predicting the number of fish eggs as a function of fish length. We can see that the prediction is good, and makes sense – egg number reliably increases with fish length. But we know this is not a causal relationship (because we didn’t have this cause in our simulation).

`lm(fish_eggs ~ fish_length, confounded_fish) %>% summary()`

```
##
## Call:
## lm(formula = fish_eggs ~ fish_length, data = confounded_fish)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.350 -3.491 -0.379 4.059 15.040
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0163 5.9963 0.00 1
## fish_length 0.5206 0.1228 4.24 5.1e-05 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.16 on 98 degrees of freedom
## Multiple R-squared: 0.155, Adjusted R-squared: 0.146
## F-statistic: 18 on 1 and 98 DF, p-value: 5.11e-05
```

`lm(fish_eggs ~ fish_length, confounded_fish) %>% anova() `

```
## Analysis of Variance Table
##
## Response: fish_eggs
## Df Sum Sq Mean Sq F value Pr(>F)
## fish_length 1 681 681 18 5.1e-05 ***
## Residuals 98 3718 38
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

**Adding the confound into our model**

So, let’s build a model including the confound environmental quality.

`fish_lm_w_confound <- lm(fish_eggs ~ env_quality + fish_length, confounded_fish)`

Looking at the estimates from the model, we see that the answers don’t make a ton of sense.

`fish_lm_w_confound %>% coef() %>% round(digits = 2)`

```
## (Intercept) env_quality fish_length
## -3.82 0.54 0.06
```

In this case, an ANOVA with Type I sums of squares gives reasonable p-values, while an ANOVA with Type II sums of squares shows that neither environment nor length is a significant predictor of egg number. This is weird.

`fish_lm_w_confound %>% anova()`

```
## Analysis of Variance Table
##
## Response: fish_eggs
## Df Sum Sq Mean Sq F value Pr(>F)
## env_quality 1 766 766 20.45 1.7e-05 ***
## fish_length 1 1 1 0.03 0.86
## Residuals 97 3632 37
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`fish_lm_w_confound %>% Anova(type = "II")`

```
## Anova Table (Type II tests)
##
## Response: fish_eggs
## Sum Sq Df F value Pr(>F)
## env_quality 86 1 2.29 0.13
## fish_length 1 1 0.03 0.86
## Residuals 3632 97
```

**What to do?**

First, let’s look at all the relationships in our data.
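One way to do that (a sketch using base R’s `pairs()`, re-simulating the confounded data from above with a seed so the numbers are reproducible) is a scatterplot matrix plus the pairwise correlations:

```r
library(tidyverse)
set.seed(1)

n_fish <- 100
confounded_fish <- tibble(env_quality = rnorm(n = n_fish, mean = 50, sd = 5),
                          fish_length = rnorm(n = n_fish, mean = env_quality, sd = 2),
                          fish_eggs   = rnorm(n = n_fish, mean = env_quality / 2, sd = 6) %>% round())

# scatterplot matrix of every pair of variables
pairs(confounded_fish)

# pairwise correlations: length and quality are tightly linked
cor(confounded_fish) %>% round(digits = 2)
```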

The right thing to do in this case is to just build a model with environmental quality alone.

`lm(fish_eggs ~ env_quality , confounded_fish) %>% summary()`

```
##
## Call:
## lm(formula = fish_eggs ~ env_quality, data = confounded_fish)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.639 -3.474 -0.511 3.739 14.850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.840 6.432 -0.60 0.55
## env_quality 0.595 0.131 4.54 1.6e-05 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.09 on 98 degrees of freedom
## Multiple R-squared: 0.174, Adjusted R-squared: 0.166
## F-statistic: 20.7 on 1 and 98 DF, p-value: 1.57e-05
```

`lm(fish_eggs ~ env_quality , confounded_fish) %>% anova()`

```
## Analysis of Variance Table
##
## Response: fish_eggs
## Df Sum Sq Mean Sq F value Pr(>F)
## env_quality 1 766 766 20.7 1.6e-05 ***
## Residuals 98 3633 37
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

**Multicollinearity:** This example also shows the statistical problem of *multicollinearity* – that is, our predictors are correlated. This makes building and interpreting a model challenging.
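One way to diagnose this (a sketch; `vif()` comes from the car package, which this chapter already uses for `Anova()`) is to check the correlation between the predictors and their variance inflation factors:

```r
library(tidyverse)
library(car)   # for vif(); the chapter already uses car's Anova()
set.seed(1)

n_fish <- 100
confounded_fish <- tibble(env_quality = rnorm(n = n_fish, mean = 50, sd = 5),
                          fish_length = rnorm(n = n_fish, mean = env_quality, sd = 2),
                          fish_eggs   = rnorm(n = n_fish, mean = env_quality / 2, sd = 6) %>% round())

# the two predictors are strongly correlated
cor(confounded_fish$env_quality, confounded_fish$fish_length)

# variance inflation factors: values well above ~5 signal that
# correlated predictors are inflating our coefficient uncertainty
vif(lm(fish_eggs ~ env_quality + fish_length, data = confounded_fish))
```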

#### 28.4.1.2 Causal model 2: The pipe

So now let’s look at a *pipe*, in which the environment causes fish length and fish length causes fitness, but the environment itself has no direct impact on fitness. First, let’s simulate:

```
pipe_fish <- tibble(env_quality = rnorm(n = n_fish, mean = 50, sd = 5), # simulating the environment
                    fish_length = rnorm(n = n_fish, mean = env_quality, sd = 2),
                    fish_eggs   = rnorm(n = n_fish, mean = fish_length / 2, sd = 5) %>% round())
```

Now we know that environmental quality does not directly cause fish to lay more eggs – as we did not model this. Nonetheless, a plot and a statistical test show a strong association between quality and eggs if we do not include fish length in our model.

```
pipe_plot <- ggplot(pipe_fish, aes(x = env_quality, y = fish_eggs)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(subtitle = "# eggs increases with env quality\nalthough the causal relationship is indirect.")
pipe_plot
```

**Our statistical analysis will not show cause**

We can build a simple linear model predicting the number of fish eggs as a function of environmental quality. We can see that the prediction is good, and makes sense – egg number reliably increases with environmental quality. But we know this is not a direct causal relationship (because in our simulation the environment affects eggs only through fish length).

`lm(fish_eggs ~ env_quality, pipe_fish) %>% summary()`

```
##
## Call:
## lm(formula = fish_eggs ~ env_quality, data = pipe_fish)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.187 -3.333 -0.859 3.132 13.215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.3920 4.8859 -0.49 0.63
## env_quality 0.5270 0.0979 5.38 5e-07 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.96 on 98 degrees of freedom
## Multiple R-squared: 0.228, Adjusted R-squared: 0.22
## F-statistic: 29 on 1 and 98 DF, p-value: 5.03e-07
```

`lm(fish_eggs ~ env_quality, pipe_fish) %>% anova() `

```
## Analysis of Variance Table
##
## Response: fish_eggs
## Df Sum Sq Mean Sq F value Pr(>F)
## env_quality 1 712 712 29 5e-07 ***
## Residuals 98 2408 25
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

**Adding the immediate cause into our model**

So, let’s build a model including the immediate cause, fish length.

`fish_lm_w_cause <- lm(fish_eggs ~ fish_length + env_quality, pipe_fish)`

Looking at the estimates from the model, we see that the answers don’t make a ton of sense.

`fish_lm_w_cause %>% coef() %>% round(digits = 2)`

```
## (Intercept) fish_length env_quality
## -2.15 0.36 0.16
```

The stats here again come out a bit funny. A Type I sums of squares attributes most of the explained variance to whichever predictor we enter first, while a Type II sums of squares finds strong support for neither predictor.

`lm(fish_eggs~ fish_length + env_quality, pipe_fish) %>% anova()`

```
## Analysis of Variance Table
##
## Response: fish_eggs
## Df Sum Sq Mean Sq F value Pr(>F)
## fish_length 1 751 751 30.88 2.4e-07 ***
## env_quality 1 8 8 0.31 0.58
## Residuals 97 2361 24
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`lm(fish_eggs~ env_quality + fish_length, pipe_fish) %>% anova()`

```
## Analysis of Variance Table
##
## Response: fish_eggs
## Df Sum Sq Mean Sq F value Pr(>F)
## env_quality 1 712 712 29.25 4.6e-07 ***
## fish_length 1 47 47 1.94 0.17
## Residuals 97 2361 24
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`lm(fish_eggs~ fish_length + env_quality, pipe_fish) %>% Anova(type = "II")`

```
## Anova Table (Type II tests)
##
## Response: fish_eggs
## Sum Sq Df F value Pr(>F)
## fish_length 47 1 1.94 0.17
## env_quality 8 1 0.31 0.58
## Residuals 2361 97
```

**What to do?**

First, let’s look at all the relationships in our data.

The right thing to do in this case is to just build a model with fish length alone.

`lm(fish_eggs ~ fish_length, pipe_fish) %>% summary()`

```
##
## Call:
## lm(formula = fish_eggs ~ fish_length, data = pipe_fish)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.584 -4.021 -0.517 3.147 13.698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1254 4.4914 -0.25 0.8
## fish_length 0.5013 0.0899 5.58 2.2e-07 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.92 on 98 degrees of freedom
## Multiple R-squared: 0.241, Adjusted R-squared: 0.233
## F-statistic: 31.1 on 1 and 98 DF, p-value: 2.17e-07
```

`lm(fish_eggs ~ fish_length, pipe_fish) %>% anova()`

```
## Analysis of Variance Table
##
## Response: fish_eggs
## Df Sum Sq Mean Sq F value Pr(>F)
## fish_length 1 751 751 31.1 2.2e-07 ***
## Residuals 98 2368 24
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

#### 28.4.1.3 Causal model 3: The collider

So now let’s look at a *collider*, in which both the environment and fish length directly increase fitness, and the environment also increases fish length (the two causes “collide” at fitness). First, let’s simulate:

```
collide_fish <- tibble(env_quality = rnorm(n = n_fish, mean = 50, sd = 5), # simulating the environment
                       fish_length = rnorm(n = n_fish, mean = env_quality, sd = 2),
                       fish_eggs   = rnorm(n = n_fish, mean = (env_quality / 4 + fish_length * 3/4) / 2, sd = 7) %>% round())
```

Now we know that environmental quality increases fish length and both environmental quality and fish length directly cause fish to lay more eggs.

But our models have a bunch of trouble figuring this out. Again, a Type I sums of squares puts most of the “blame” on whichever predictor comes first in the model.

`lm(fish_eggs ~ env_quality + fish_length, collide_fish ) %>% summary()`

```
##
## Call:
## lm(formula = fish_eggs ~ env_quality + fish_length, data = collide_fish)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.36 -5.32 1.46 4.85 21.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.539 6.697 -0.08 0.94
## env_quality 0.412 0.433 0.95 0.34
## fish_length 0.109 0.397 0.28 0.78
##
## Residual standard error: 7.18 on 97 degrees of freedom
## Multiple R-squared: 0.137, Adjusted R-squared: 0.119
## F-statistic: 7.69 on 2 and 97 DF, p-value: 0.000795
```

`lm(fish_eggs ~ env_quality + fish_length, collide_fish ) %>% anova() `

```
## Analysis of Variance Table
##
## Response: fish_eggs
## Df Sum Sq Mean Sq F value Pr(>F)
## env_quality 1 789 789 15.30 0.00017 ***
## fish_length 1 4 4 0.08 0.78375
## Residuals 97 4999 52
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`lm(fish_eggs ~ fish_length + env_quality, collide_fish ) %>% anova() `

```
## Analysis of Variance Table
##
## Response: fish_eggs
## Df Sum Sq Mean Sq F value Pr(>F)
## fish_length 1 746 746 14.5 0.00025 ***
## env_quality 1 47 47 0.9 0.34391
## Residuals 97 4999 52
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`lm(fish_eggs ~ env_quality + fish_length, collide_fish ) %>% Anova(type = "II") `

```
## Anova Table (Type II tests)
##
## Response: fish_eggs
## Sum Sq Df F value Pr(>F)
## env_quality 47 1 0.90 0.34
## fish_length 4 1 0.08 0.78
## Residuals 4999 97
```

## 28.6 Wrap up

The examples above show the complexity of deciphering causes without experiments. But they also shed light on how we can infer causation when experiments are impossible, because causal diagrams point to testable hypotheses.

Perhaps the best way to do this is by matching – if we can match subjects that are identical for all causal paths except the one we are testing, we can test for a statistical association and make a causal claim we can believe in.
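The twin-study logic from the smoking example can be sketched in a simulation (made-up numbers, not the Hjelmborg et al. data): each pair shares a genetic risk, one twin smokes and one does not, so comparing within pairs removes the genetic backdoor path.

```r
library(tidyverse)
set.seed(1)

n_pairs <- 1000
twins <- tibble(shared_genetics  = rnorm(n = n_pairs, mean = 0, sd = 1),
                # cancer risk on the log-odds scale: shared genetics,
                # plus an effect of smoking for the smoking twin only
                cancer_smoker    = rbinom(n = n_pairs, size = 1,
                                          prob = plogis(-2 + shared_genetics + 1)),
                cancer_nonsmoker = rbinom(n = n_pairs, size = 1,
                                          prob = plogis(-2 + shared_genetics)))

# because twins are matched on genetics, the within-pair difference
# in cancer incidence isolates the effect of smoking
mean(twins$cancer_smoker) - mean(twins$cancer_nonsmoker)
```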

The field of causal inference is developing rapidly. If you want to hear more, the popular book, *The Book of Why* (Pearl and Mackenzie 2018) is a good place to start.

### References

Hjelmborg, Jacob, et al. 2017. “Lung Cancer, Genetic Predisposition and Smoking: The Nordic Twin Study of Cancer.” *Thorax* 72 (11): 1021–27. https://doi.org/10.1136/thoraxjnl-2015-207921.

Pearl, Judea, and Dana Mackenzie. 2018. *The Book of Why: The New Science of Cause and Effect*. New York: Basic Books.