39.2 Good Controls
39.2.1 Omitted Variable Bias Correction
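A variable Z is a good control when adjusting for it blocks back-door paths from the treatment X to the outcome Y without opening new non-causal paths. When the adjustment set blocks every back-door path and contains no descendant of X, the back-door criterion identifies the causal effect of X on Y.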
39.2.1.1 Simple Confounder
In this DAG, Z is a common cause of both X and Y, i.e., a confounder.
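rm(list = ls())
library(dagitty)  # DAG specification: dagitty(), coordinates(), latents()
library(ggdag)    # DAG plotting: ggdag(), theme_dag()
# Z is a common cause of X and Y
model <- dagitty("dag{
x -> y
z -> x
z -> y
}")
# Node positions used only for plotting
coordinates(model) <- list(
  x = c(x = 1, z = 2, y = 3),
  y = c(x = 1, z = 2, y = 1)
)
ggdag(model) + theme_dag()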
Controlling for Z removes the bias from the back-door path X←Z→Y.
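n <- 1e4
z <- rnorm(n)
causal_coef <- 2  # true effect of x on y
beta2 <- 3        # effect of the confounder z on y
x <- z + rnorm(n)
y <- causal_coef * x + beta2 * z + rnorm(n)
# Model 1 omits z (biased); Model 2 adjusts for z
jtools::export_summs(lm(y ~ x), lm(y ~ x + z))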
| | Model 1 | Model 2 |
|---|---|---|
| (Intercept) | -0.02 | 0.01 |
| | (0.02) | (0.01) |
| x | 3.49 *** | 2.00 *** |
| | (0.02) | (0.01) |
| z | | 2.98 *** |
| | | (0.01) |
| N | 10000 | 10000 |
| R2 | 0.82 | 0.97 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
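The size of the bias in Model 1 follows from the standard omitted variable bias formula. As a worked check under the simulation's own assumptions (independent standard normal draws, so Cov(X, Z) = Var(Z) = 1 and Var(X) = 2), and writing β1 and β2 for `causal_coef` and `beta2` (notation introduced here for illustration):

```latex
\begin{aligned}
\operatorname{plim}\,\hat{\beta}_{x}^{\text{short}}
  &= \beta_1 + \beta_2\,\frac{\operatorname{Cov}(X, Z)}{\operatorname{Var}(X)} \\
  &= 2 + 3 \cdot \frac{1}{2} = 3.5
\end{aligned}
```

This agrees with the estimate of 3.49 in Model 1. Including Z removes the second term, which is why Model 2 returns a coefficient close to 2.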
39.2.1.2 Confounding via a Latent Variable
In this structure, U is unobserved and causes both Z and Y, while Z affects X. Although U cannot be adjusted for directly, adjusting for Z blocks the back-door path X ← Z ← U → Y.
rm(list = ls())
# U is a latent common cause of Z and Y; Z affects X
model <- dagitty("dag{
x -> y
u -> z
z -> x
u -> y
}")
latents(model) <- "u"
coordinates(model) <- list(
  x = c(x = 1, z = 2, u = 3, y = 4),
  y = c(x = 1, z = 2, u = 3, y = 1)
)
ggdag(model) + theme_dag()
n <- 1e4
u <- rnorm(n)              # unobserved
z <- u + rnorm(n)
x <- z + rnorm(n)
y <- 2 * x + u + rnorm(n)  # true effect of x on y is 2
# Model 1 omits z; Model 2 adjusts for z
jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
| | Model 1 | Model 2 |
|---|---|---|
| (Intercept) | -0.02 | -0.01 |
| | (0.01) | (0.01) |
| x | 2.33 *** | 1.99 *** |
| | (0.01) | (0.01) |
| z | | 0.51 *** |
| | | (0.01) |
| N | 10000 | 10000 |
| R2 | 0.91 | 0.92 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
Although the coefficient on Z is statistically significant, it has no causal interpretation here: Z is included to reduce omitted variable bias in the estimate for X.
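To make this concrete, the sketch below re-runs the same data-generating process and compares the regression coefficient on Z with what an intervention on Z would produce. The seed, the larger sample size, and the names `z_do`, `x_do`, and `y_do` are assumptions added for illustration and are not part of the original simulation.

```r
set.seed(1)  # assumed seed, only so this sketch is reproducible
n <- 1e5

# Observational data: same structure as the simulation above
u <- rnorm(n)              # unobserved common cause of z and y
z <- u + rnorm(n)
x <- z + rnorm(n)
y <- 2 * x + u + rnorm(n)

# Coefficient on z when it is used as a control alongside x
coef_z_adjusted <- unname(coef(lm(y ~ x + z))["z"])

# Hypothetical intervention: z is set independently of u,
# then x and y respond through the same structural equations
z_do <- rnorm(n)
x_do <- z_do + rnorm(n)
y_do <- 2 * x_do + u + rnorm(n)
effect_z_total <- unname(coef(lm(y_do ~ z_do))["z_do"])

round(c(adjusted_coef = coef_z_adjusted, total_effect = effect_z_total), 2)
```

In this design Z has no direct effect on Y and its total effect runs entirely through X (1 × 2 = 2), yet the adjusted coefficient sits near 0.5 because it absorbs the association induced by U.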
39.2.1.3 Z is caused by U, but also causes Y
This DAG illustrates a subtler case: Z lies on the non-causal path X ← U → Z → Y. Because the shared cause U is unobserved, adjusting for Z is what blocks this path.
rm(list = ls())
# U is a latent common cause of Z and X; Z affects Y
model <- dagitty("dag{
x -> y
u -> z
u -> x
z -> y
}")
latents(model) <- "u"
coordinates(model) <- list(
  x = c(x = 1, z = 3, u = 2, y = 4),
  y = c(x = 1, z = 2, u = 3, y = 1)
)
ggdag(model) + theme_dag()
n <- 1e4
u <- rnorm(n)              # unobserved
z <- u + rnorm(n)
x <- u + rnorm(n)
y <- 2 * x + z + rnorm(n)  # true effect of x on y is 2
# Model 1 omits z; Model 2 adjusts for z
jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
| | Model 1 | Model 2 |
|---|---|---|
| (Intercept) | 0.02 | 0.01 |
| | (0.02) | (0.01) |
| x | 2.49 *** | 1.98 *** |
| | (0.01) | (0.01) |
| z | | 1.02 *** |
| | | (0.01) |
| N | 10000 | 10000 |
| R2 | 0.83 | 0.93 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
Again, we cannot interpret the coefficient on Z causally, but including Z helps reduce omitted variable bias from the unobserved confounder U.
39.2.1.4 Summary of Omitted Variable Correction
# Model 1: Z is a confounder
model1 <- dagitty("dag{
x -> y
z -> x
z -> y
}")
coordinates(model1) <- list(
  x = c(x = 1, z = 2, y = 3),
  y = c(x = 1, z = 2, y = 1)
)
# Model 2: Z is on the path from U to X
model2 <- dagitty("dag{
x -> y
u -> z
z -> x
u -> y
}")
latents(model2) <- "u"
coordinates(model2) <- list(
  x = c(x = 1, z = 2, u = 3, y = 4),
  y = c(x = 1, z = 2, u = 3, y = 1)
)
# Model 3: Z is caused by U and affects Y
model3 <- dagitty("dag{
x -> y
u -> z
u -> x
z -> y
}")
latents(model3) <- "u"
coordinates(model3) <- list(
  x = c(x = 1, z = 3, u = 2, y = 4),
  y = c(x = 1, z = 2, u = 3, y = 1)
)
# par(mfrow) does not apply to ggplot2 output, so the three DAGs are
# arranged side by side with gridExtra (assumed to be installed)
gridExtra::grid.arrange(
  ggdag(model1) + theme_dag(),
  ggdag(model2) + theme_dag(),
  ggdag(model3) + theme_dag(),
  ncol = 3
)
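The three structures can also be checked without simulation. The sketch below assumes the dagitty package is installed; it uses `isAdjustmentSet()` to confirm that {Z} is a valid adjustment set in each DAG and `adjustmentSets()` to list the minimal sets for one of the latent-variable cases. The helper `check_z` and the objects `g1`, `g2`, and `g3` are hypothetical names that only mirror the models above.

```r
library(dagitty)

# Re-declare the three DAGs so the sketch is self-contained
g1 <- dagitty("dag{
x -> y
z -> x
z -> y
}")
g2 <- dagitty("dag{
x -> y
u -> z
z -> x
u -> y
}")
g3 <- dagitty("dag{
x -> y
u -> z
u -> x
z -> y
}")
latents(g2) <- "u"
latents(g3) <- "u"

# Hypothetical helper: is {z} a valid adjustment set for the effect of x on y?
check_z <- function(g) {
  isTRUE(isAdjustmentSet(g, "z", exposure = "x", outcome = "y"))
}

z_is_valid <- vapply(list(g1, g2, g3), check_z, logical(1))
z_is_valid

# Minimal adjustment sets reported by dagitty when u is latent
adjustmentSets(g2, exposure = "x", outcome = "y")
```

A graphical check like this complements the regressions: it depends only on the assumed DAG, not on the linear functional form used in the simulations.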
39.2.2 Omitted Variable Bias in Mediation Correction
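When a variable Z confounds the treatment X and a mediator M, the back-door path X ← Z → M → Y biases the estimated effect of X on Y. Controlling for Z blocks this path, so the total effect of X (its direct effect plus the indirect effect through M) is estimated without that bias.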
39.2.2.1 Observed Confounder of Mediator and Treatment
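rm(list = ls())
# Z is an observed common cause of the treatment X and the mediator M
model <- dagitty("dag{
x -> y
z -> x
x -> m
z -> m
m -> y
}")
coordinates(model) <- list(
  x = c(x = 1, z = 2, m = 3, y = 4),
  y = c(x = 1, z = 2, m = 1, y = 1)
)
ggdag(model) + theme_dag()
n <- 1e4
z <- rnorm(n)
x <- z + rnorm(n)
m <- 2 * x + z + rnorm(n)
y <- m + rnorm(n)  # direct effect of x is set to 0, so the total effect is 2 (via m)
# Model 1 omits z; Model 2 adjusts for z
jtools::export_summs(lm(y ~ x), lm(y ~ x + z))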
| | Model 1 | Model 2 |
|---|---|---|
| (Intercept) | -0.01 | 0.01 |
| | (0.02) | (0.01) |
| x | 2.50 *** | 1.99 *** |
| | (0.01) | (0.01) |
| z | | 1.01 *** |
| | | (0.02) |
| N | 10000 | 10000 |
| R2 | 0.84 | 0.87 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
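The simulation above sets the direct effect of X on Y to zero, so the total and indirect effects coincide at 2. As a sketch of how the pieces separate when a direct effect is present, the example below adds an assumed direct effect of 0.5 and compares three regressions. The value 0.5, the seed, the sample size, and the object names are hypothetical choices made for illustration.

```r
set.seed(1)    # assumed seed, only so this sketch is reproducible
n <- 1e5
direct <- 0.5  # hypothetical direct effect of x on y (zero in the simulation above)

z <- rnorm(n)
x <- z + rnorm(n)
m <- 2 * x + z + rnorm(n)       # z confounds x and m
y <- direct * x + m + rnorm(n)  # indirect effect of x via m is 2 * 1 = 2

# Omitting z leaves the path x <- z -> m -> y open
naive <- unname(coef(lm(y ~ x))["x"])
# Adjusting for z targets the total effect (direct + indirect)
total <- unname(coef(lm(y ~ x + z))["x"])
# Adjusting for z and m targets the direct effect in this linear design
direct_hat <- unname(coef(lm(y ~ x + z + m))["x"])

round(c(naive = naive, total = total, direct = direct_hat), 2)
```

Under these assumptions the naive slope should land near 3, the Z-adjusted slope near 2.5, and the slope adjusted for both Z and M near 0.5. Conditioning on M is safe here only because Z is also in the model and blocks the path that conditioning on M would otherwise open.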
39.2.2.2 Latent Common Cause of Mediator and Treatment
rm(list = ls())
# U is a latent common cause of Z and the mediator M; Z affects X
model <- dagitty("dag{
x -> y
u -> z
z -> x
x -> m
u -> m
m -> y
}")
latents(model) <- "u"
coordinates(model) <- list(
  x = c(x = 1, z = 2, u = 3, m = 4, y = 5),
  y = c(x = 1, z = 2, u = 3, m = 1, y = 1)
)
ggdag(model) + theme_dag()
n <- 1e4
u <- rnorm(n)              # unobserved
z <- u + rnorm(n)
x <- z + rnorm(n)
m <- 2 * x + u + rnorm(n)
y <- m + rnorm(n)          # direct effect of x is set to 0, so the total effect is 2 (via m)
# Model 1 omits z; Model 2 adjusts for z
jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
| | Model 1 | Model 2 |
|---|---|---|
| (Intercept) | 0.01 | 0.01 |
| | (0.02) | (0.02) |
| x | 2.34 *** | 2.01 *** |
| | (0.01) | (0.02) |
| z | | 0.50 *** |
| | | (0.02) |
| N | 10000 | 10000 |
| R2 | 0.86 | 0.87 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
39.2.2.3 Z Affects Mediator, U Affects Both X and Z
rm(list = ls())
# U is a latent common cause of X and Z; Z affects the mediator M
model <- dagitty("dag{
x -> y
u -> z
z -> m
x -> m
u -> x
m -> y
}")
latents(model) <- "u"
coordinates(model) <- list(
  x = c(x = 1, z = 3, u = 2, m = 4, y = 5),
  y = c(x = 1, z = 2, u = 3, m = 1, y = 1)
)
ggdag(model) + theme_dag()
n <- 1e4
u <- rnorm(n)              # unobserved
z <- u + rnorm(n)
x <- u + rnorm(n)
m <- 2 * x + z + rnorm(n)
y <- m + rnorm(n)          # direct effect of x is set to 0, so the total effect is 2 (via m)
# Model 1 omits z; Model 2 adjusts for z
jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
| | Model 1 | Model 2 |
|---|---|---|
| (Intercept) | 0.00 | 0.00 |
| | (0.02) | (0.01) |
| x | 2.49 *** | 1.99 *** |
| | (0.01) | (0.01) |
| z | | 1.00 *** |
| | | (0.01) |
| N | 10000 | 10000 |
| R2 | 0.78 | 0.87 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
39.2.2.4 Summary of Mediation Correction
# Model 4: Z is an observed common cause of X and M
model4 <- dagitty("dag{
x -> y
z -> x
x -> m
z -> m
m -> y
}")
coordinates(model4) <- list(
  x = c(x = 1, z = 2, m = 3, y = 4),
  y = c(x = 1, z = 2, m = 1, y = 1)
)
# Model 5: latent U causes Z and M; Z affects X
model5 <- dagitty("dag{
x -> y
u -> z
z -> x
x -> m
u -> m
m -> y
}")
latents(model5) <- "u"
coordinates(model5) <- list(
  x = c(x = 1, z = 2, u = 3, m = 4, y = 5),
  y = c(x = 1, z = 2, u = 3, m = 1, y = 1)
)
# Model 6: latent U causes X and Z; Z affects M
model6 <- dagitty("dag{
x -> y
u -> z
z -> m
x -> m
u -> x
m -> y
}")
latents(model6) <- "u"
coordinates(model6) <- list(
  x = c(x = 1, z = 3, u = 2, m = 4, y = 5),
  y = c(x = 1, z = 2, u = 3, m = 1, y = 1)
)
# par(mfrow) does not apply to ggplot2 output, so the three DAGs are
# arranged side by side with gridExtra (assumed to be installed)
gridExtra::grid.arrange(
  ggdag(model4) + theme_dag(),
  ggdag(model5) + theme_dag(),
  ggdag(model6) + theme_dag(),
  ncol = 3
)
A statistically significant coefficient on Z does not by itself imply that Z has a causal effect on Y of that size. In these valid-control scenarios, Z is included to isolate the causal effect of X, and its own coefficient should generally not be given a causal interpretation.