39.2 Good Controls

39.2.1 Omitted Variable Bias Correction

A variable \(Z\) is a good control when conditioning on it blocks every back-door path from the treatment \(X\) to the outcome \(Y\) and \(Z\) is not itself a descendant of \(X\). This is the back-door criterion from causal inference.

39.2.1.1 Simple Confounder

In this DAG, \(Z\) is a common cause of both \(X\) and \(Y\), i.e., a confounder.

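rm(list = ls())
library(dagitty)  # DAG specification: dagitty(), coordinates(), latents()
library(ggdag)    # DAG plotting: ggdag(), theme_dag()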

model <- dagitty("dag{
  x -> y
  z -> x
  z -> y
}")
coordinates(model) <- list(
  x = c(x = 1, z = 2, y = 3),
  y = c(x = 1, z = 2, y = 1)
)
ggdag(model) + theme_dag()

Controlling for \(Z\) removes the bias from the back-door path \(X \leftarrow Z \rightarrow Y\).
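The claim above can also be checked directly on the graph before simulating any data. The following is a minimal sketch using `paths()` and `adjustmentSets()` from the dagitty package; the object names `g`, `unadjusted`, `adjusted`, and `sets` are illustrative and not part of the original example.

```r
library(dagitty)

# Same DAG as above: z is a common cause of x and y
g <- dagitty("dag{
  x -> y
  z -> x
  z -> y
}")

# All paths between x and y, and whether each is open with no adjustment
unadjusted <- paths(g, from = "x", to = "y")
unadjusted

# The same paths after conditioning on z: the back-door path should be closed
adjusted <- paths(g, from = "x", to = "y", Z = "z")
adjusted

# Minimal sufficient adjustment set(s) for the total effect of x on y
sets <- adjustmentSets(g, exposure = "x", outcome = "y")
sets
```

Only the causal path \(X \rightarrow Y\) should remain open once \(Z\) is in the conditioning set.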

n <- 1e4
z <- rnorm(n)
causal_coef <- 2
beta2 <- 3
x <- z + rnorm(n)
y <- causal_coef * x + beta2 * z + rnorm(n)

jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
              Model 1      Model 2
(Intercept)   -0.02        0.01
              (0.02)       (0.01)
x             3.49 ***     2.00 ***
              (0.02)       (0.01)
z                          2.98 ***
                           (0.01)
N             10000        10000
R2            0.82         0.97
*** p < 0.001; ** p < 0.01; * p < 0.05.
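The gap between Model 1 and Model 2 is what the standard omitted variable bias formula predicts: the short-regression slope equals the causal coefficient plus \(\beta_2 \,\mathrm{Cov}(X, Z) / \mathrm{Var}(X)\). With the data-generating process above, \(\mathrm{Cov}(X, Z) = 1\) and \(\mathrm{Var}(X) = 2\), so the expected slope is \(2 + 3 \times 0.5 = 3.5\). The sketch below compares that value with a fresh simulation; the seed and the variable names `analytic_slope`, `short_slope`, and `long_slope` are assumptions added for illustration.

```r
set.seed(1)  # assumption: seed added only for reproducibility
n <- 1e4
causal_coef <- 2
beta2 <- 3

z <- rnorm(n)
x <- z + rnorm(n)
y <- causal_coef * x + beta2 * z + rnorm(n)

# Population value of the short-regression slope:
# causal_coef + beta2 * Cov(x, z) / Var(x) = 2 + 3 * (1 / 2)
analytic_slope <- causal_coef + beta2 * (1 / 2)

# Estimated slopes without and with the confounder z
short_slope <- unname(coef(lm(y ~ x))["x"])
long_slope  <- unname(coef(lm(y ~ x + z))["x"])

c(analytic = analytic_slope, short = short_slope, long = long_slope)
```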

39.2.1.2 Confounding via a Latent Variable

In this structure, \(U\) is unobserved but causes both \(Z\) and \(Y\), and \(Z\) affects \(X\). Even though \(U\) is not observed, adjusting for \(Z\) blocks \(X \leftarrow Z \leftarrow U \rightarrow Y\), which is the only back-door path from \(X\) to \(Y\) in this DAG.

rm(list = ls())

model <- dagitty("dag{
  x -> y
  u -> z
  z -> x
  u -> y
}")
latents(model) <- "u"
coordinates(model) <- list(
  x = c(x = 1, z = 2, u = 3, y = 4),
  y = c(x = 1, z = 2, u = 3, y = 1)
)
ggdag(model) + theme_dag()

n <- 1e4
u <- rnorm(n)
z <- u + rnorm(n)
x <- z + rnorm(n)
y <- 2 * x + u + rnorm(n)

jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
              Model 1      Model 2
(Intercept)   -0.02        -0.01
              (0.01)       (0.01)
x             2.33 ***     1.99 ***
              (0.01)       (0.01)
z                          0.51 ***
                           (0.01)
N             10000        10000
R2            0.91         0.92
*** p < 0.001; ** p < 0.01; * p < 0.05.

Although the coefficient on \(Z\) is statistically significant, \(Z\) has no direct effect on \(Y\) in this DAG. It is included to remove omitted variable bias, and its coefficient should not be given a causal interpretation.
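To see where the nonzero coefficient on \(Z\) comes from, it helps to use the fact that \(U\) is available inside a simulation even though it would be unobserved in real data. The sketch below is an illustration under that assumption: it compares the coefficient on `z` in the outcome regression with the slope from regressing `u` on `z`, which is 0.5 in this data-generating process. The seed and the names `coef_z` and `proj_u_on_z` are illustrative.

```r
set.seed(2)  # assumption: seed added only for reproducibility
n <- 1e4
u <- rnorm(n)              # latent in practice, visible only because we simulate it
z <- u + rnorm(n)
x <- z + rnorm(n)
y <- 2 * x + u + rnorm(n)  # z has no direct arrow into y

# Coefficient on z in the adjusted outcome regression
coef_z <- unname(coef(lm(y ~ x + z))["z"])

# Part of u that z predicts linearly: Cov(u, z) / Var(z) = 1 / 2
proj_u_on_z <- unname(coef(lm(u ~ z))["z"])

c(coef_z = coef_z, proj_u_on_z = proj_u_on_z)
```

Both numbers should be close to 0.5: the coefficient on \(Z\) is standing in for the omitted \(U\), not measuring an effect of \(Z\) on \(Y\).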

39.2.1.3 \(Z\) is caused by \(U\), but also causes \(Y\)

This DAG illustrates a subtler case: \(Z\) lies on the non-causal path \(X \leftarrow U \rightarrow Z \rightarrow Y\), so adjusting for \(Z\) blocks the bias that flows through the unobserved common cause \(U\).

rm(list = ls())

model <- dagitty("dag{
  x -> y
  u -> z
  u -> x
  z -> y
}")
latents(model) <- "u"
coordinates(model) <- list(
  x = c(x = 1, z = 3, u = 2, y = 4),
  y = c(x = 1, z = 2, u = 3, y = 1)
)
ggdag(model) + theme_dag()

n <- 1e4
u <- rnorm(n)
z <- u + rnorm(n)
x <- u + rnorm(n)
y <- 2 * x + z + rnorm(n)

jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
              Model 1      Model 2
(Intercept)   0.02         0.01
              (0.02)       (0.01)
x             2.49 ***     1.98 ***
              (0.01)       (0.01)
z                          1.02 ***
                           (0.01)
N             10000        10000
R2            0.83         0.93
*** p < 0.001; ** p < 0.01; * p < 0.05.

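In this particular DAG the coefficient on \(Z\) happens to match its structural effect on \(Y\), because conditioning on \(X\) blocks the path \(Z \leftarrow U \rightarrow X \rightarrow Y\). In general, however, the coefficient on a control should not be read causally. The reason to include \(Z\) here is that it removes the omitted variable bias created by the unobserved confounder \(U\).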
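A single simulated sample could be a lucky draw, so it can be worth repeating the exercise. The sketch below is a small Monte Carlo for this DAG; the helper name `one_draw`, the seed, the sample size of 1000, and the 200 replications are all illustrative assumptions. With \(\mathrm{Cov}(X, Z) = 1\) and \(\mathrm{Var}(X) = 2\), the unadjusted slope should center on \(2 + 1 \times 0.5 = 2.5\) and the adjusted slope on 2.

```r
set.seed(4)  # assumption: seed added only for reproducibility

# Simulate one data set from the DAG and return both estimates of the x slope
one_draw <- function(n = 1000) {
  u <- rnorm(n)
  z <- u + rnorm(n)
  x <- u + rnorm(n)
  y <- 2 * x + z + rnorm(n)
  c(
    short = unname(coef(lm(y ~ x))["x"]),      # omits z
    long  = unname(coef(lm(y ~ x + z))["x"])   # adjusts for z
  )
}

# 2 x 200 matrix: one column per replication
draws <- replicate(200, one_draw())

# Average estimate across replications
avg <- rowMeans(draws)
avg
```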

39.2.1.4 Summary of Omitted Variable Correction

# Model 1: Z is a confounder
model1 <- dagitty("dag{
  x -> y
  z -> x
  z -> y
}")
coordinates(model1) <-
    list(x = c(x = 1, z = 2, y = 3), y = c(x = 1, z = 2, y = 1))

# Model 2: Z is on path from U to X
model2 <- dagitty("dag{
  x -> y
  u -> z
  z -> x
  u -> y
}")
latents(model2) <- "u"
coordinates(model2) <-
    list(x = c(
        x = 1,
        z = 2,
        u = 3,
        y = 4
    ),
    y = c(
        x = 1,
        z = 2,
        u = 3,
        y = 1
    ))

# Model 3: Z influenced by U, affects Y
model3 <- dagitty("dag{
  x -> y
  u -> z
  u -> x
  z -> y
}")
latents(model3) <- "u"
coordinates(model3) <-
    list(x = c(
        x = 1,
        z = 3,
        u = 2,
        y = 4
    ),
    y = c(
        x = 1,
        z = 2,
        u = 3,
        y = 1
    ))

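# par(mfrow = ...) only affects base graphics, not ggplot objects,
# so the three DAG plots are arranged explicitly (requires gridExtra)
gridExtra::grid.arrange(
    ggdag(model1) + theme_dag(),
    ggdag(model2) + theme_dag(),
    ggdag(model3) + theme_dag(),
    ncol = 3
)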
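The three structures in this subsection share one conclusion: \(Z\) alone is enough to identify the effect of \(X\) on \(Y\). As a rough check, the sketch below asks dagitty for the minimal adjustment sets of each DAG, with \(U\) declared latent so that it cannot be proposed as a control. The list name `dags` and the result name `sets` are illustrative.

```r
library(dagitty)

# The three DAGs from this subsection
dags <- list(
  model1 = dagitty("dag{
    x -> y
    z -> x
    z -> y
  }"),
  model2 = dagitty("dag{
    x -> y
    u -> z
    z -> x
    u -> y
  }"),
  model3 = dagitty("dag{
    x -> y
    u -> z
    u -> x
    z -> y
  }")
)

# u is unobserved in models 2 and 3, so it is not eligible for adjustment
latents(dags$model2) <- "u"
latents(dags$model3) <- "u"

# Minimal sufficient adjustment sets for the total effect of x on y
sets <- lapply(dags, adjustmentSets, exposure = "x", outcome = "y")
sets
```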

39.2.2 Omitted Variable Bias in Mediation Correction

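When a variable \(Z\) is a common cause of the treatment \(X\) and a mediator \(M\), it opens the back-door path \(X \leftarrow Z \rightarrow M \rightarrow Y\). Controlling for \(Z\) blocks this path, so the total effect of \(X\) on \(Y\), and its direct and indirect components, can be estimated without that bias.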

39.2.2.1 Observed Confounder of Mediator and Treatment

rm(list = ls())

model <- dagitty("dag{
  x -> y
  z -> x
  x -> m
  z -> m
  m -> y
}")
coordinates(model) <-
    list(x = c(
        x = 1,
        z = 2,
        m = 3,
        y = 4
    ),
    y = c(
        x = 1,
        z = 2,
        m = 1,
        y = 1
    ))
ggdag(model) + theme_dag()

n <- 1e4
z <- rnorm(n)
x <- z + rnorm(n)
m <- 2 * x + z + rnorm(n)
y <- m + rnorm(n)  # the DAG allows x -> y, but its coefficient is 0 here; total effect of x is 2 (via m)

jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
              Model 1      Model 2
(Intercept)   -0.01        0.01
              (0.02)       (0.01)
x             2.50 ***     1.99 ***
              (0.01)       (0.01)
z                          1.01 ***
                           (0.02)
N             10000        10000
R2            0.84         0.87
*** p < 0.001; ** p < 0.01; * p < 0.05.
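The coefficient on \(X\) in Model 2 is a total effect. One way to split it into direct and indirect parts, assuming the linear structure used in this simulation, is the product-of-coefficients decomposition sketched below. This is a swapped-in illustration rather than something the regressions above perform; the seed and the names `fit_m`, `fit_y`, `direct`, `indirect`, and `total` are assumptions.

```r
set.seed(3)  # assumption: seed added only for reproducibility
n <- 1e4
z <- rnorm(n)
x <- z + rnorm(n)
m <- 2 * x + z + rnorm(n)
y <- m + rnorm(n)  # no direct x -> y effect in this simulation

# Mediator model: effect of x on m, adjusting for the common cause z
fit_m <- lm(m ~ x + z)

# Outcome model: effect of m on y and the direct effect of x, adjusting for z
fit_y <- lm(y ~ x + m + z)

a        <- unname(coef(fit_m)["x"])  # x -> m
b        <- unname(coef(fit_y)["m"])  # m -> y
direct   <- unname(coef(fit_y)["x"])  # x -> y not through m
indirect <- a * b                     # x -> m -> y

# Total effect from the regression that adjusts for z only
total <- unname(coef(lm(y ~ x + z))["x"])

c(direct = direct, indirect = indirect, total = total)
```

With ordinary least squares and the same covariates in each model, `direct + indirect` reproduces `total` up to floating-point error, so the decomposition is an accounting identity in the linear case.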

39.2.2.2 Latent Common Cause of Mediator and Treatment

rm(list = ls())

model <- dagitty("dag{
  x -> y
  u -> z
  z -> x
  x -> m
  u -> m
  m -> y
}")
latents(model) <- "u"
coordinates(model) <-
    list(x = c(
        x = 1,
        z = 2,
        u = 3,
        m = 4,
        y = 5
    ),
    y = c(
        x = 1,
        z = 2,
        u = 3,
        m = 1,
        y = 1
    ))
ggdag(model) + theme_dag()

n <- 1e4
u <- rnorm(n)
z <- u + rnorm(n)
x <- z + rnorm(n)
m <- 2 * x + u + rnorm(n)
y <- m + rnorm(n)  # the DAG allows x -> y, but its coefficient is 0 here; total effect of x is 2 (via m)

jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
              Model 1      Model 2
(Intercept)   0.01         0.01
              (0.02)       (0.02)
x             2.34 ***     2.01 ***
              (0.01)       (0.02)
z                          0.50 ***
                           (0.02)
N             10000        10000
R2            0.86         0.87
*** p < 0.001; ** p < 0.01; * p < 0.05.

39.2.2.3 \(Z\) Affects Mediator, \(U\) Affects Both \(X\) and \(Z\)

rm(list = ls())

model <- dagitty("dag{
  x -> y
  u -> z
  z -> m
  x -> m
  u -> x
  m -> y
}")
latents(model) <- "u"
coordinates(model) <-
    list(x = c(
        x = 1,
        z = 3,
        u = 2,
        m = 4,
        y = 5
    ),
    y = c(
        x = 1,
        z = 2,
        u = 3,
        m = 1,
        y = 1
    ))
ggdag(model) + theme_dag()

n <- 1e4
u <- rnorm(n)
z <- u + rnorm(n)
x <- u + rnorm(n)
m <- 2 * x + z + rnorm(n)
y <- m + rnorm(n)  # the DAG allows x -> y, but its coefficient is 0 here; total effect of x is 2 (via m)

jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
              Model 1      Model 2
(Intercept)   0.00         0.00
              (0.02)       (0.01)
x             2.49 ***     1.99 ***
              (0.01)       (0.01)
z                          1.00 ***
                           (0.01)
N             10000        10000
R2            0.78         0.87
*** p < 0.001; ** p < 0.01; * p < 0.05.

39.2.2.4 Summary of Mediation Correction

# Model 4: Z is an observed common cause of X and M
model4 <- dagitty("dag{
  x -> y
  z -> x
  x -> m
  z -> m
  m -> y
}")
coordinates(model4) <-
    list(x = c(
        x = 1,
        z = 2,
        m = 3,
        y = 4
    ),
    y = c(
        x = 1,
        z = 2,
        m = 1,
        y = 1
    ))

# Model 5: latent U causes Z and M; Z causes X
model5 <- dagitty("dag{
  x -> y
  u -> z
  z -> x
  x -> m
  u -> m
  m -> y
}")
latents(model5) <- "u"
coordinates(model5) <-
    list(x = c(
        x = 1,
        z = 2,
        u = 3,
        m = 4,
        y = 5
    ),
    y = c(
        x = 1,
        z = 2,
        u = 3,
        m = 1,
        y = 1
    ))

# Model 6: latent U causes X and Z; Z causes M
model6 <- dagitty("dag{
  x -> y
  u -> z
  z -> m
  x -> m
  u -> x
  m -> y
}")
latents(model6) <- "u"
coordinates(model6) <-
    list(x = c(
        x = 1,
        z = 3,
        u = 2,
        m = 4,
        y = 5
    ),
    y = c(
        x = 1,
        z = 2,
        u = 3,
        m = 1,
        y = 1
    ))

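# par(mfrow = ...) only affects base graphics, not ggplot objects,
# so the three DAG plots are arranged explicitly (requires gridExtra)
gridExtra::grid.arrange(
    ggdag(model4) + theme_dag(),
    ggdag(model5) + theme_dag(),
    ggdag(model6) + theme_dag(),
    ncol = 3
)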
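For the mediation DAGs, the relevant adjustment set depends on whether the target is the total or the direct effect of \(X\). The sketch below queries dagitty for both, using the `effect` argument of `adjustmentSets()`; the names `med_dags`, `total_sets`, and `direct_sets` are illustrative. For the total effect, the mediator \(M\) should not appear in the returned sets, since it lies on the causal path.

```r
library(dagitty)

# The three mediation DAGs from this subsection
med_dags <- list(
  model4 = dagitty("dag{
    x -> y
    z -> x
    x -> m
    z -> m
    m -> y
  }"),
  model5 = dagitty("dag{
    x -> y
    u -> z
    z -> x
    x -> m
    u -> m
    m -> y
  }"),
  model6 = dagitty("dag{
    x -> y
    u -> z
    z -> m
    x -> m
    u -> x
    m -> y
  }")
)

# u is unobserved in models 5 and 6
latents(med_dags$model5) <- "u"
latents(med_dags$model6) <- "u"

# Adjustment sets for the total effect of x on y
total_sets <- lapply(med_dags, adjustmentSets,
                     exposure = "x", outcome = "y", effect = "total")
total_sets

# Adjustment sets for the direct effect of x on y (not through m)
direct_sets <- lapply(med_dags, adjustmentSets,
                      exposure = "x", outcome = "y", effect = "direct")
direct_sets
```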

While \(Z\) may be statistically significant in these regressions, significance alone does not establish that \(Z\) causes \(Y\): the coefficient on \(Z\) can simply absorb the influence of an unobserved variable such as \(U\). In these valid control scenarios, \(Z\) serves to isolate the causal effect of \(X\) and is not itself the quantity of interest.