Chapter 30: Cause

A light introduction to causal inference.

We want to (know what we can) learn about causation from observations: We know “correlation does not necessarily imply causation”, and that experiments are our best way to learn about causes. But we also understand that there is some use in observation, and we want to know how we can evaluate causal claims in observational studies.

What is a Cause?

Like much of statistics, understanding causation requires a healthy dose of imagination.

Imagine a series of parallel worlds: in one, a specific treatment is applied (e.g., drinking coffee, receiving a vaccine, raising taxes), and in another, it is not. By observing how an outcome of interest responds in each world, we can start to see whether the treatment influences that outcome. We say that a treatment causes the outcome if changing it would alter the outcome, on average. For quantitative treatments, we can envision multiple worlds where the treatment level varies by degree, observing how each variation impacts the outcome.

This concept of evaluating what would happen if we had changed a treatment is known as counterfactual thinking, and it is foundational to reasoning about causation.

Required reading:

DAGs, confounds, and experiments

THE curious associations with lung cancer found in relation to smoking habits do not, in the minds of some of us, lend themselves easily to the simple conclusion that the products of combustion reaching the surface of the bronchus induce, though after a long interval, the development of a cancer. If, for example, it were possible to infer that smoking cigarettes is a cause of this disease, it would equally be possible to infer on exactly similar grounds that inhaling cigarette smoke was a practice of considerable prophylactic value in preventing the disease, for the practice of inhaling is rarer among patients with cancer of the lung than with others.

Such results suggest that an error has been made,of an old kind, in arguing from correlation to causation, and that the possibility should be explored that the different smoking classes, non-smokers, cigarettesmokers, cigar smokers, pipe smokers, etc., have adopted their habits partly by reason of their personal temperaments and dispositions, and are not lightly to be assumed to be equivalent in their genotypic composition. Such differences in genetic make-up between these classes would naturally be associated with differences of disease incidence without the disease being causally connected with smoking. It would then seem not so paradoxical that the stronger fumes of pipes or cigars should be so much less associated with cancer than those of cigarettes, or that the practice of drawing cigarette smoke in bulk into the lung should have apparently a protective effect.

Cancer and SmokingFisher (1958)

R.A. Fisher—a pipe enthusiast, notorious asshole, eugenicist, and widely regarded as the father of modern statistics and population genetics.

Figure 1: R.A. Fisher—a pipe enthusiast, notorious asshole, eugenicist, and widely regarded as the father of modern statistics and population genetics.

It is now well established that cigarette smoking causes cancer—but this wasn’t always the consensus. One of the most prominent statisticians in history, R.A. Fisher, famously argued that the observed relationship between smoking and cancer was spurious, insisting that smoking did not cause cancer. The extent to which his flawed logic was influenced by his own fondness for smoking, financial ties to the tobacco industry, or simple skepticism is unclear (Stolley (1991)). What is clear, however, is that Fisher raised a plausible argument—and that he was profoundly mistaken.

A cause is neither a guarantee nor the only pathway to an outcome. For example, many smokers have not developed lung cancer, and many non-smokers have (because there are multiple ways to develop lung cancer). We can interpret a cause as a thing that – if changed – increases the probability of some outcome.

Randomized Controlled Experiments

Randomized controlled experiments are our strongest tool for learning about causation. By randomly assigning participants to different treatments, we effectively place them in these alternative realities we imagine and observe the outcomes of each treatment. In other words, we bring our hypothetical scenarios to life.

To distinguish between the claim that smoking causes cancer and Fisher’s claim that genetics is a confounder, we would ideally randomly assign some people to smoke and others not to smoke. However, for both ethical and logistical reasons, this is not feasible. So, we need alternative methods to approach this question, and that is our goal today! The directed acyclic graph is a visual approach that can get us started solving this problem.

DAGs

We represent causal relationships between variables by a Directed Acyclic Graph, or a DAG. Each variable is represented by a node (a point in the graph), and each causal relationship is shown as a directed arrow pointing from the cause to the effect. DAGs are directed because arrows point in the direction of causation. DAGs are Acyclic as arrows can’t loop back to a variable they originated from. This is because causation flows in only one direction (i.e. something cannot cause itself (See the Cosmological_argument from the philosophy of religion literature). DAGs help us map out complex relationships and clarify our assumptions about how variables are related. They can show us where potential biases or confounding might exist and guide us in deciding which variables we need to include or exclude in an analysis.

Above we introduced two simple causal models for the association between smoking and cancer

These are not the only plausible causal models for an association between smoking and cancer. I present three other possibilities in Figure 2.

We could represent this causal claim with the simplest causal graph we can imagine. This is our first formal introduction to a Directed Acyclic Graph (herefater DAG). This is *Directed* because **WE** are pointing a causal arrow from smoking to cancer. It is acyclic because causality in these models only flows in one direction, and its a graph because we are looking at it. These DAGs are the backbone of causal thinking because they allow us to lay our causal models out there for the world to see. Here we will largely use DAGs to consider potential causal paths, but these can be used for mathematical and statistical analyses.

Figure 2: We could represent this causal claim with the simplest causal graph we can imagine. This is our first formal introduction to a Directed Acyclic Graph (herefater DAG). This is Directed because WE are pointing a causal arrow from smoking to cancer. It is acyclic because causality in these models only flows in one direction, and its a graph because we are looking at it. These DAGs are the backbone of causal thinking because they allow us to lay our causal models out there for the world to see. Here we will largely use DAGs to consider potential causal paths, but these can be used for mathematical and statistical analyses.

Collider bias: Colliders can have funny consequences when we condition on an outcome. ay in the world, there is no association between smoking and a genetic propensity to get lung cancer for reasons unrelated to smoking. If we only looked at lung cancer patients, it would appear that there is a negative correlation between smoking and a genetic risk for cancer unrelated to smoking because we do not see non-smokers with low genetic risk for lung cancer. This is known as “selection bias”, “M bias”, or “collider bias”.

When Correlation Is (Not) Good Enough

As we explore causation, we might wonder when it’s essential to understand causes.

Multiple Regression and Causal Inference

So far, we’ve discussed how to construct and interpret causal models. This process is incredibly valuable: creating a causal model clarifies our assumptions and reasoning.

But how can we use these models in statistical analysis? As it turns out, they can be quite useful! To illustrate this, I’ll simulate data under different causal models and apply various linear regressions to the simulated data to examine the outcomes.

I’m demonstrating these concepts through simulation because, with simulated data, we know the true underlying model. This allows us to clearly see what happens when models are misspecified.

Set Up

In evolutionary studies, fitness is the central metric of interest. Although it is challenging to measure and define directly, we can often assess traits related to it, such as the number of offspring an organism produces. For this example, let’s assume this measure is sufficiently representative of fitness.

Now, suppose we are studying a species of fish and want to determine if being larger (in terms of length) increases fitness, measured as the number of eggs produced. To make things more interesting, let’s assume that fish inhabit environments whose quality we can measure. For this example, we’ll assume that all measurements are accurate, unbiased, and follow a normal distribution.

Causal Model 1: The Confound

Let’s begin with a simple confound—where a high-quality environment leads to larger fish and increased fitness, but size itself has no direct effect on fitness. Here’s how we’ll set up the simulation:

In this model, fish length has no causal impact on the number of eggs laid.

We use the rnorm() function to simulate data from a normal distribution as follows:

n_fish          <- 100
confounded_fish <- tibble(env_quality = rnorm(n = n_fish, mean = 50, sd = 5), # simulating environment quality
                          fish_length = rnorm(n = n_fish, mean = env_quality, sd = 2),
                          fish_eggs   = rnorm(n = n_fish, mean = env_quality / 2, sd = 6) %>% round())

We know that fish length does not cause fish to lay more eggs—since we did not model it this way. Nonetheless, a plot and a statistical test show a strong association between length and egg count if we exclude environmental quality from the model.

Our statistical analysis will not show cause

We can build a simple linear model predicting the number of fish eggs as a function of fish length. We can see that the prediction is good, and makes sense – egg number reliably increases (0.039) with fish length (slope = 0.226, df = (1,98), F = 4.01, p-value = 0.048). But we know this is not a causal relationship (because we didn’t have this cause in our simulation).

Model with confound only
lm(fish_eggs~ fish_length, data = confounded_fish)
term estimate std.error statistic p.value
fish_length 0.226 0.113 2.003 0.048

A Model Including the True Cause

Let’s say we didn’t have a clear idea of which trait causally increased the number of eggs a fish laid. In this example, it’s plausible (although not necessarily true) to believe that both environmental quality and fish length influence the number of eggs laid. So, let’s build a model that includes both variables, including the confounding variable, environmental quality. Then, we’ll find our estimates and use the appropriate Type II sums of squares to evaluate significance. Finally, we’ll use the emtrends() and confint() functions from the emmeans package to estimate uncertainty in our slope estimates.

library(emmeans)
fish_lm_w_confound <- lm(fish_eggs ~ env_quality + fish_length, confounded_fish)  
anova_table        <- Anova(fish_lm_w_confound, type = "II")
env_qual_est       <- emtrends(fish_lm_w_confound, var = "env_quality") %>% confint()
fish_length_est    <- emtrends(fish_lm_w_confound, var = "fish_length") %>% confint()
ANOVA Table: Model with Cause & Confound
Model: lm(fish_eggs ~ env_quality + fish_length, data = confounded_fish)
term sumsq df F_value P_value
env_quality 53.971 1 1.536 0.218
fish_length 6.575 1 0.187 0.666
Residuals 3408.435 97 NA NA
Slope Estimates and Uncertainty: Model with Cause & Confound
Model: lm(fish_eggs ~ env_quality + fish_length, data = confounded_fish)
variable slope SE df lower.CL upper.CL
env_quality 0.403 0.325 97 -0.242 1.048
fish_length -0.135 0.312 97 -0.754 0.484

The Good News is that now that we’ve added the true cause, we no longer find that fish_length is statistically associated with the number of eggs laid.
The Bad News is that, because we still have the wrong variable in our model, we lose the ability to correctly identify that environmental quality strongly predicts the number of eggs laid.

The results above are from a simulation. If you run this multiple times, you’ll see different outcomes. Sometimes, we will correctly reject the false null hypothesis in favor of the truth—that environmental quality increases the number of eggs laid. However, including the confounder in our model reduces our power to identify the true cause.

What to do?

First let’s look at all the relationships in our data.

The right thing to do in this case is to just build a model with the environmental quality.

Model with cause only
lm(fish_eggs~ env_quality, data = confounded_fish)
term estimate std.error statistic p.value
env_quality 0.272 0.117 2.33 0.022
Multicolinearity: This example also shows a statistical problem of multicolinearity – that is our predictors are correlated. This makes building and interpreting a model challenging.

Causal model 2: The pipe

So now let’s look at a pipe in which the environment causes fish length and fish length causes fitness, but environment itself has has no impact on fitness. First let’s simulate:

pipe_fish <- tibble(
  env_quality = rnorm(n = n_fish, mean = 50, sd = 5),  # Simulating environment
  fish_length = rnorm(n = n_fish, mean = env_quality, sd = 2),
  fish_eggs   = rnorm(n = n_fish, mean = fish_length / 2, sd = 5) %>% round()
)

In this scenario, we know that environmental quality does not directly cause fish to lay more eggs, as the model does not include a direct link from quality to egg count. However, both a plot and a statistical test reveal a strong association between environmental quality and egg count when fish length is not included in the model.

Thinking about cause: In a real way you could say that environmental quality was a cause here. If we did an experiment and changed environmental quality we would get more eggs. Of course, we would get more eggs because fish length increased with environmental quality and this increase in length would be the cause of more eggs. If we could somehoe increase fish length other means (e.g., selective breeding) without improving environmental quality, or if environmental quality improved but length remained constant, we would not – according to this DAG – see anincrease in egg numbers.

Our Statistical Analysis Cannot Prove Causation

We can fit a simple linear model predicting egg count based solely on environmental quality. This model will show a positive and reliable association (slope = 0.4596; \(r^2 =\) 0.191) —- egg count appears to increase with environmental quality. Of course, this is not a causal relationship, as we did not model environmental quality as a direct cause of egg production.

Model with indirect cause only
lm(fish_eggs ~ env_quality, pipe_fish)
term estimate std.error statistic p.value
env_quality 0.46 0.095 4.816 0

Adding the immediate cause into our model

So, let’s build a model including both the immediate cause, fish length, as well as the indirect cause, environmental quality. We see pretty strange behavior here - when both variables are in the statistical model, neither is significantly associated with the number of eggs laid.

library(emmeans)
fish_lm_w_cause    <- lm(fish_eggs~ fish_length + env_quality, pipe_fish)  
anova_table        <- Anova(fish_lm_w_cause, type = "II")
fish_length_est    <- emtrends(fish_lm_w_cause, var = "fish_length") %>% confint()
env_qual_est       <- emtrends(fish_lm_w_cause, var = "env_quality") %>% confint()
ANOVA Table: Model with Indirect & Direct Cause
Model: lm(fish_eggs ~ env_quality + fish_length, data = pipe_fish)
term sumsq df F_value P_value
fish_length 85.841 1 2.936 0.090
env_quality 0.472 1 0.016 0.899
Residuals 2835.860 97 NA NA
Slope Estimates and Uncertainty: Model with Indirect & Direct Cause
Model: lm(fish_eggs ~ env_quality + fish_length, data = pipe_fish)
variable slope SE df lower.CL upper.CL
fish_length 0.478 0.279 97 -0.076 1.033
env_quality -0.039 0.306 97 -0.646 0.568

What to do?

First let’s look at all the relationships in our data

The right thing to do in this case is to just build a model with the fish length.

Model with direct cause only
lm(fish_eggs~ fish_length, data = pipe_fish)
term estimate std.error statistic p.value
fish_length 0.445 0.086 5.181 0

Causal model 3: Collider

A collider is something caused by two variables. Say in our example both fish length and environmental quality caused fish to lay more eggs. Here egg number is a colider. Lets simulate it!

n_fish <-1000
collider_fish <- tibble(
  env_quality = rnorm(n = n_fish, mean = 50, sd = 8),  # Simulating environment
  fish_length = rnorm(n = n_fish, mean = 50, sd = 8),
  fish_eggs   = rnorm(n = n_fish, mean = (env_quality+fish_length) / 4, sd = 2) %>% round()
) %>%
  mutate(fish_eggs = ifelse(fish_eggs <0,0,fish_eggs))

Let’s check out our data:

We see everyhting that is expected, and our stats back this up.

library(emmeans)
fish_lm_w_collider <- lm(fish_eggs~ fish_length + env_quality, collider_fish)  
anova_table        <- Anova(fish_lm_w_collider, type = "II")
fish_length_est    <- emtrends(fish_lm_w_collider, var = "fish_length") %>% confint()
env_qual_est       <- emtrends(fish_lm_w_collider, var = "env_quality") %>% confint()
ANOVA Table: Modeling a collider
Model: lm(fish_eggs ~ env_quality + fish_length, data = collider_fish)
term sumsq df F_value P_value
fish_length 4213.963 1 1057.773 0
env_quality 3608.377 1 905.761 0
Residuals 3971.856 997 NA NA
Slope Estimates and Uncertainty: Modeling a collider
Model: lm(fish_eggs ~ env_quality + fish_length, data = collider_fish)
variable slope SE df lower.CL upper.CL
fish_length 0.261 0.008 997 0.245 0.277
env_quality 0.242 0.008 997 0.226 0.258

Here the correct model is to include both causes. exliding one will likely lead to unbiased estimates of the effect of the other, but it will reduce power and increase uncertainty.

Here’s an edited version:

Beware of Collider Bias

Suppose we’re interested in fish that lay a large number of eggs, and we want to explore factors that might help them lay even more. If we focus only on fish that laid more than the average number of eggs (e.g., 30), we might observe an unexpected pattern:

ggplot(collider_fish %>% filter(fish_eggs >= 30),
       aes(x = env_quality, y = fish_length)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Even though our predictors are uncorrelated, we see a significant negative relationship between them. This correlation is artificially induced because we’ve conditioned on a high outcome. This phenomenon, known as collider bias, should be carefully avoided.

Model with direct cause only
lm(fish_length ~ env_quality, data = collider_fish %>% filter(fish_eggs>=30))
term estimate std.error statistic p.value
env_quality -0.432 0.088 -4.892 0

What to do?

The examples above show the complexity in deciphering causes without experiments. But they also show us the light about how we can infer causation, because causal diagrams can point to testable hypotheses. If we cannot do experiments, causal diagrams offer us a glimpse into how we can infer causation. Perhaps the best way to do this is by matching – if we can match subjects that are identical for all causal paths except the one we are testing, we can then test for a statistical association, ad make a causal claim we can believe in.

The field of causal inference is developing rapidly. If you want to hear more, the popular book, The Book of Why (Pearl and Mackenzie 2018) is a good place to start. The field of causal inference offers many exciting techniques for deciphering cause. For now we will just say that a good biological intution and carefull thinking about your model and causation is a great starting place.

Quiz

Be sure to complete the quiz, based on

Figure 3: Quiz on causal inference here

Fisher, A., Ronald. 1958. “Cancer and Smoking.” Nature 182 (4635): 596–96. https://doi.org/10.1038/182596a0.
Hjelmborg, Jacob, Tellervo Korhonen, Klaus Holst, Axel Skytthe, Eero Pukkala, Julia Kutschke, Jennifer R Harris, et al. 2017. “Lung Cancer, Genetic Predisposition and Smoking: The Nordic Twin Study of Cancer.” Thorax 72 (11): 1021–27. https://doi.org/10.1136/thoraxjnl-2015-207921.
Stolley, Paul D. 1991. When Genius Errs: R. A. Fisher and the Lung Cancer Controversy.” American Journal of Epidemiology 133 (5): 416–25. https://doi.org/10.1093/oxfordjournals.aje.a115904.

References