Chapter 3 ATE I: Binary treatment

Source RMD file: link

library(lmtest)
library(sandwich)
library(grf)
library(glmnet)
library(splines)
library(ggplot2)
library(reshape2)

3.1 Notation and definitions

Let’s establish some notation. Each data point will be defined by the triple \((X_i, W_i, Y_i)\). The vector \(X_i\) represents covariates that are observed for individual \(i\). Treatment assignment is indicated by \(W_i \in \{0, 1\}\), with 1 representing treatment, and 0 representing control. The scalar \(Y_i\) is the observed outcome, and it can be real or binary. Each observation is drawn independently and from the same distribution. We’ll be interested in assessing the causal effect of treatment on outcome.

A difficulty in estimating causal quantities is that we observe each individual in only one treatment state: either they were treated, or they weren’t. However, it’s often useful to imagine that each individual is endowed with two random variables \((Y_i(1), Y_i(0))\), where \(Y_i(1)\) represents the value of this individual’s outcome if they receive treatment, and \(Y_i(0)\) represents their outcome if they are not treated. These random variables are called potential outcomes. The observed outcome \(Y_i\) corresponds to whichever potential outcome we got to see: \[\begin{equation} Y_i \equiv Y_i(W_i) = \begin{cases} Y_i(1) \qquad \text{if }W_i = 1 \text{ (treated)} \\ Y_i(0) \qquad \text{if }W_i = 0 \text{ (control)}\\ \end{cases} \end{equation}\]

Since we can’t observe both \(Y_i(1)\) and \(Y_i(0)\), we won’t be able to make statistical claims about the individual treatment effect \(Y_i(1) - Y_i(0)\). Instead, our goal will be to estimate the average treatment effect (ATE): \[\begin{equation} \tag{3.1} \tau := \mathop{\mathrm{E}}[Y_i(1) - Y_i(0)]. \end{equation}\]

Here, when we refer to the randomized setting we mean that we have data generated by a randomized control trial. The key characteristic of this setting is that the probability that an individual is assigned to the treatment arm is fixed. In particular, it does not depend on the individual’s potential outcomes: \[\begin{equation} \tag{3.2} Y_i(1), Y_i(0) \perp W_i. \end{equation}\] This precludes situations in which individuals may self-select into or out of treatment. The canonical failure example is a job training program in which workers enroll more often if they are more likely to benefit from treatment because in that case \(W_i\) and \(Y_i(1) - Y_i(0)\) would be positively correlated.

When condition (3.2) is violated, we say that we are in an observational setting. This is a more complex setting encompassing several different scenarios: sometimes it’s still possible to estimate the ATE under additional assumptions, sometimes the researcher can exploit other sources of variation to obtain other interesting estimands. Here, we will focus on ATE estimation under the following assumption:
\[\begin{equation} \tag{3.3} Y_i(1), Y_i(0) \perp W_i \ | \ X_i. \end{equation}\]

In this document, we call this assumption unconfoundeness, though it is also known as no unmeasured confounders, ignorability or selection on observables. It says that all possible sources of self-selection, etc., can be explained by the observable covariates \(X_i\). Continuing the example above, it may be that older or more educated are more likely to self-select into treatment; but when we compare two workers that have the same age and level of education, etc., there’s nothing else that we could infer about their relative potential outcomes if we knew that one went into job training and the other did not.

As we’ll see below, a key quantity of interest will be the treatment assignment probability, or propensity score \(e(X_i) := \mathop{\mathrm{P}}[W_i = 1 | X_i]\). In an experimental setting this quantity is usually known and fixed, and in observational settings it must be estimated from the data. We will often need to assume that the propensity score is bounded away from zero and one. That is, there exists some \(\eta > 0\) such that \[\begin{equation} \tag{3.4} \eta < e(x) < 1 - \eta \qquad \text{for all }x. \end{equation}\] This assumption is known as overlap, and it means that for all types of people in our population (i.e., all values of observable characteristics) we can find some portion of individuals in treatment and some in control. Intuitively, this is necessary because we’d like to will be comparing treatment and control at each level of the covariates and then aggregate those results.

As a running example, in what follows we’ll be using an abridged version of a public dataset from the General Social Survey (GSS) (Smith, 2016). The setting is a randomized control trial. Individuals were asked about their thoughts on government spending on the social safety net. The treatment is the wording of the question: about half of the individuals were asked if they thought government spends too much on “welfare” \((W_i = 1)\), while the remaining half was asked about “assistance to the poor” \((W_i = 0)\). The outcome is binary, with \(Y_i = 1\) corresponding to a positive answer. In the data set below, we also collect a few other demographic covariates.

# Read in data
data <- read.csv("https://docs.google.com/uc?id=1kSxrVci_EUcSr_Lg1JKk1l7Xd5I9zfRC&export=download")
n <- nrow(data)


# Treatment: does the the gov't spend too much on "welfare" (1) or "assistance to the poor" (0)
treatment <- "w"

# Outcome: 1 for 'yes', 0 for 'no'
outcome <- "y"

# Additional covariates
covariates <- c("age", "polviews", "income", "educ", "marital", "sex")

3.2 Difference-in-means estimator

Let’s begin by considering a simple estimator that is available in experimental settings. The difference-in-means estimator is the sample average of outcomes in treatment minus the sample average of outcomes in control. \[\begin{equation} \tag{3.5} \widehat{\tau}^{DIFF} = \frac{1}{n_1} \sum_{i:W_i = 1} Y_i - \frac{1}{n_0} \sum_{i:W_i = 0} Y_i \qquad \text{where} \qquad n_w := |\{i : W_i = w \}| \end{equation}\]

Here’s one way to compute the difference-in-means estimator and its associated statistics (t-stats, p-values, etc.) directly:

# Only valid in the randomized setting. Do not use in observational settings.
Y <- data[,outcome]
W <- data[,treatment]
ate.est <- mean(Y[W==1]) - mean(Y[W==0])
ate.se <- sqrt(var(Y[W == 1]) / sum(W == 1) + var(Y[W == 0]) / sum(W == 0))
ate.tstat <- ate.est / ate.se
ate.pvalue <- 2*(pnorm(1 - abs(ate.est/ate.se)))
ate.results <- c(estimate=ate.est, std.error=ate.se, t.stat=ate.tstat, pvalue=ate.pvalue)
print(ate.results)

##      estimate     std.error        t.stat        pvalue 
##  -0.347115545   0.004895638 -70.903028958   0.000000000

Or, alternatively, via the t.test function:

fmla <- formula(paste(outcome, '~', treatment))  # y ~ w
t.test(fmla, data=data)

## 
##  Welch Two Sample t-test
## 
## data:  y by w
## t = 70.903, df = 20840, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3375197 0.3567114
## sample estimates:
## mean in group 0 mean in group 1 
##       0.4398290       0.0927135

We can also compute the same quantity via linear regression, using the fact that \[\begin{equation} Y_i = Y_i(0) + W_i \left( Y_i(1) - Y_i(0) \right), \end{equation}\]

so that taking expectations conditional on treatment assignment, \[\begin{equation} \mathop{\mathrm{E}}[Y_i | W_i] = \alpha + W_i \tau \qquad \text{where} \qquad \alpha := \mathop{\mathrm{E}}[Y_i(0)] \end{equation}\]

[Exercise: make sure you understand this decomposition. Where is the unconfoundedness assumption (3.2) used?]

This result implies that we can estimate the ATE of a binary treatment via a linear regression of observed outcomes \(Y_i\) on a vector consisting of intercept and treatment assignment \((1, W_i)\).

# Do not use! standard errors are not robust to heteroskedasticity! (See below)
fmla <- formula(paste0(outcome, '~', treatment))
ols <- lm(fmla, data=data)
coef(summary(ols))[2,]

##      Estimate    Std. Error       t value      Pr(>|t|) 
##  -0.347115545   0.004732343 -73.349622060   0.000000000

The point estimate we get is the same as we got above by computing the ATE “directly” via mean(Y[W==1]) - mean(Y[W==0]). However, the standard errors are different. This is because in R the command lm does not compute heteroskedasticity-robust standard errors. This is easily solvable.

# Use this instead. Standard errors are heteroskedasticity-robust.
# Only valid in randomized setting.
fmla <- formula(paste0(outcome, '~', treatment))
ols <- lm(fmla, data=data)
coeftest(ols, vcov=vcovHC(ols, type='HC2'))[2,]

##      Estimate    Std. Error       t value      Pr(>|t|) 
##  -0.347115545   0.004895638 -70.903028958   0.000000000

The difference-in-means estimator is a simple, easily computable, unbiased, “model-free” estimator of the treatment effect. It should always be reported when dealing with data collected in a randomized setting. However, as we’ll see later in this chapter, there are other estimators that have smallest variance. Therefore,

3.3 Turning an experimental dataset into an observational one

In what follows we’ll purposely create a biased sample to show how the difference in means estimator fails in observational settings. We’ll focus on two covariates: age and political views (polviews). Presumably, younger or more liberal individuals are less affected by the change in wording in the question. So let’s see what happens when we make these individuals more prominent in our sample of treated individuals, and less prominent in our sample of untreated individuals.

# Probabilistically dropping observations in a manner that depends on x

# copying old dataset, just in case
data.exp <- data

# defining the group that we will be dropped with some high probability
grp <- ((data$w == 1) &  # if treated AND...
        (
            (data$age > 45) |     # belongs an older group OR
            (data$polviews < 5)   # more conservative
        )) | # OR
        ((data$w == 0) &  # if untreated AND...
        (
            (data$age < 45) |     # belongs a younger group OR
            (data$polviews > 4)   # more liberal
        )) 

# Individuals in the group above have a small chance of being kept in the sample
prob.keep <- ifelse(grp, .15, .85)
keep.idx <- as.logical(rbinom(n=nrow(data), prob=prob.keep, size = 1))

# Dropping
data <- data[keep.idx,]

Let’s see what happens to our sample before and after the change. This next figure shows the original dataset. Each observation is represented by a point on either the left or right scatterplots. An outcome of \(Y_i=1\) is denoted by a blue circle, and \(Y_i=0\) is denoted by a red square, but let’s not focus on the outcome at the moment. For now, what’s important to note is that the covariate distributions for treated and untreated populations are very similar. This is what we should expect in an experimental setting under unconfoundedness.

X <- model.matrix(formula("~ 0 + age + polviews"), data.exp)  # old 'experimental' dataset
W <- data.exp$w
Y <- data.exp$y
par(mfrow=c(1,2))
for (w in c(0, 1)) {
  plot(X[W==w,1] + rnorm(n=sum(W==w), sd=.1), X[W==w,2] + rnorm(n=sum(W==w), sd=.1), 
       pch=ifelse(Y, 23, 21), cex=1, col=ifelse(Y, rgb(1,0,0,1/4), rgb(0,0,1,1/4)),
       bg=ifelse(Y, rgb(1,0,0,1/4), rgb(0,0,1,1/4)), main=ifelse(w, "Treated", "Untreated"), xlab="age", ylab="polviews")
}

On the other hand, this is what the modified dataset looks like. The treated population is much younger and more liberal, while the untreated population is older and more conservative.

X <- model.matrix(formula("~ 0 + age + polviews"), data)
W <- data$w
Y <- data$y
par(mfrow=c(1,2))
for (w in c(0, 1)) {
  plot(X[W==w,1] + rnorm(n=sum(W==w), sd=.1), X[W==w,2] + rnorm(n=sum(W==w), sd=.1), 
       pch=ifelse(Y, 23, 21), cex=1, col=ifelse(Y, rgb(1,0,0,1/4), rgb(0,0,1,1/4)),
       bg=ifelse(Y, rgb(1,0,0,1/4), rgb(0,0,1,1/4)), main=ifelse(w, "Treated", "Untreated"), xlab="age", ylab="polviews")
}

As we would expect the difference-in-means estimate is biased toward zero because in this new dataset we mostly treated individuals for which we expect the effect to be smaller.

# Do not use in observational settings.
# This is only to show how the difference-in-means estimator is biased in that case.
fmla <- formula(paste0(outcome, '~', treatment))
ols <- lm(fmla, data=data)
coeftest(ols, vcov=vcovHC(ols, type='HC2'))[2,]

##       Estimate     Std. Error        t value       Pr(>|t|) 
##  -2.887039e-01   8.630364e-03  -3.345211e+01  2.645275e-231

Note that the dataset created above still satisfies unconfoundedness (3.3), since discrepancies in treatment assignment probability are described by observable covariates (age and polviews). Moreover, it also satisfies the overlap assumption (3.4), since never completely dropped all treated or all untreated observations in any region of the covariate space. This is important because, in what follows, we’ll consider different estimators of the ATE that are available in observational settings under unconfoundedness and overlap.

3.4 Direct estimation

Our first estimator is suggested by the following decomposition of the ATE, which is possible due to unconfoundedness (3.3).

\[\begin{equation} \mathop{\mathrm{E}}[Y_i(1) - Y_i(0)] = \mathop{\mathrm{E}}[\mathop{\mathrm{E}}[Y_i | X_i, W_i=1]] - \mathop{\mathrm{E}}[\mathop{\mathrm{E}}[Y_i | X_i, W_i=0]] \end{equation}\]

[Exercise: make sure you understand this. Where is the unconfoundedness assumption used?]

The decomposition above suggests the following procedure, sometimes called the direct estimate of the ATE:

Estimate \(\mu(x, w) := E[Y_i|X_i = x,W_i=w]\), preferably using nonparametric methods.
Predict \(\hat{\mu}(X_i, 1)\) and \(\hat{\mu}(X_i, 0)\) for each observation in the data.
Average out the predictions and subtract them. \[\begin{equation} \widehat{\tau}^{DM} := \frac{1}{n} \sum_{i=1}^{n} \hat{\mu}(X_i, 1) - \hat{\mu}(X_i, 0) \end{equation}\]

# Do not use! We'll see a better estimator below.

# Fitting some model of E[Y|X,W]
fmla <- as.formula(paste0(outcome, "~ ", paste("bs(", covariates, ", df=3)", "*", treatment, collapse="+")))
model <- lm(fmla, data=data)  

# Predicting E[Y|X,W=w] for w in {0, 1}
data.1 <- data
data.1[,treatment] <- 1
data.0 <- data
data.0[,treatment] <- 0
muhat.treat <- predict(model, newdata=data.1)
muhat.ctrl <- predict(model, newdata=data.0)

# Averaging predictions and taking their difference
ate.est <- mean(muhat.treat) - mean(muhat.ctrl)
print(ate.est)

## [1] -0.3423029

This estimator allows us to leverage regression techniques to estimate the ATE, so the resulting estimate should have smaller root-mean-squared error. However, it has several disadvantages that make it undesirable. First, its properties will rely heavily on the model \(\hat{\mu}(x, w)\) being correctly specified: it will be an unbiased and/or consistent estimate of the ATE provided that \(\hat{\mu}(x, w)\) is an unbiased and/or consistent estimator of \(\mathop{\mathrm{E}}[Y|X=x, W=w]\). In practice, having a well-specified model is not something we want to rely upon. In general, it will also not be asymptotically normal, which means that we can’t easily compute t-statistics and p-values for it.

A technical note. Step 1 above can be done by regressing \(Y_i\) on \(X_i\) using only treated observations to get an estimate \(\hat{\mu}(x, 1)\) first, and then repeating the same to obtain \(\hat{\mu}(x, 0)\) from the control observations. Or it can be done by regression \(Y_i\) on both covariates \((X_i, W_i)\) together and obtaining a function \(\hat{\mu}(x, w)\). Both have advantages and disadvantages, and we refer to Künzel, Sekhon, Bickel, Yu (2019) for a discussion.

3.5 Inverse propensity-weighted estimator

To understand this estimator, let’s first consider a toy problem. Suppose that we’d like to estimate the average effect of a certain learning intervention on students’ grades, measured on a scale of 0-100. We run two separate experiments in schools of type A and B. Suppose that, unknown to us, grades among treated students in schools of type A are approximately distributed as \(Y_i(1) | A \sim N(60, 5^2)\), whereas in schools of type B they are distributed as \(Y_i(1) | B \sim N(70, 5^2)\). Moreover, for simplicity both schools have the same number of students. If we could treat the same number of students in both types of schools, we’d get an unbiased estimate of the population mean grade among treated individuals: \((1/2)60 + (1/2)70 = 75\).

However, suppose that enrollment in the treatment is voluntary. In school A, only 5% enroll in the study, whereas in school B the number is 40%. Therefore, if we take an average of treated students’ grades without taking school membership into account, school B’s students would be over-represented, and therefore our estimate of treated student’s grades would be biased upward.

# Simulating the scenario above a large number of times
A.mean <- 60
B.mean <- 70
pop.mean <- .5 * A.mean + .5 * B.mean  # both schools have the same size

# simulating the scenario about a large number of times
sample.means <- replicate(1000, {
    school <- sample(c("A", "B"), p=c(.5, .5), size=100, replace=TRUE)
    treated <- ifelse(school == 'A', rbinom(100, 1, .05), rbinom(100, 1, .4))
    grades <- ifelse(school == 'A', rnorm(100, A.mean, 5), rnorm(100, B.mean, 5))
    mean(grades[treated == 1])  # average grades among treated students, without taking school into account
  })
hist(sample.means, freq=F, main="Sample means of treated students' grades", xlim=c(55, 75), col=rgb(0,0,1,1/8), ylab="", xlab="")
abline(v=pop.mean, lwd=3, lty=2)
legend("topleft", "truth", lwd=3, lty=2, bty="n")

To solve this problem, we can consider each school separately, and then aggregate the results. Denote by \(n_A\) the number of students from school \(A\), and \(n_{A,1}\) denote the number of treated students in that school. Likewise, denote by \(n_B\) and \(n_{B,1}\) the same quantities for school \(B\). Then our aggregated means estimator can be written as: \[\begin{equation} \tag{3.6} \overbrace{ \left( \frac{n_{A}}{n} \right) }^{\text{Proportion of A}} \underbrace{ \frac{1}{n_{A, 1}} \sum_{\substack{i \in A \\ i \text{ treated}}} Y_i }_{\text{avg. among treated in A}} + \overbrace{ \left( \frac{n_{B}}{n} \right) }^{\text{Proportion of B}} \underbrace{ \frac{1}{n_{B, 1}} \sum_{\substack{i \in B \\ i \text{ treated}}} Y_i }_{\text{avg. among treated in B}}. \end{equation}\]

# simulating the scenario about a large number of times
agg.means <- replicate(1000, {
    school <- sample(c("A", "B"), p=c(.5, .5), size=100, replace=TRUE)
    treated <- ifelse(school == 'A', rbinom(100, 1, .05), rbinom(100, 1, .4))
    grades <- ifelse(school == 'A', rnorm(100, A.mean, 5), rnorm(100, B.mean, 5))
    # average grades among treated students in each school
    mean.treated.A <- mean(grades[(treated == 1) & (school == 'A')])
    mean.treated.B <- mean(grades[(treated == 1) & (school == 'B')])
    # probability of belonging to each school
    prob.A <- mean(school == 'A')
    prob.B <- mean(school == 'B')
    prob.A * mean.treated.A + prob.B * mean.treated.B
  })
hist(agg.means, freq=F, main="Aggregated sample means of treated students' grades", xlim=c(50, 75), col=rgb(0,0,1,1/8), ylab="", xlab="")
abline(v=pop.mean, lwd=3, lty=2)
legend("topright", "truth", lwd=3, lty=2, bty="n")

Next, we’ll manipulate the expression (3.6). This next derivation can be a bit overwhelming, but please keep in mind that we’re simply doing algebraic manipulations, as well as establishing some common and useful notation. First, note that we can rewrite the average for school A in (3.6) as \[\begin{equation} \tag{3.7} \frac{1}{n_{A, 1}} \sum_{\substack{i \in A \\ i \text{ treated}}} Y_i = \frac{1}{n_A} \sum_{\substack{i \in A \\ i \text{ treated}}} \frac{1}{(n_{A,1} / n_A)} Y_i = \frac{1}{n_A} \sum_{i \in A} \frac{W_i}{(n_{A,1} / n_A)} Y_i, \end{equation}\]

where \(W_i\) is the treatment indicator. Plugging (3.7) back into (3.6), \[\begin{equation} \left( \frac{n_{A}}{n} \right) \frac{1}{n_A} \sum_{i \in A} \frac{W_i}{(n_{A,1} / n_A)} Y_i + \left( \frac{n_{B}}{n} \right) \frac{1}{n_B} \sum_{i \in B} \frac{W_i}{(n_{B,1} / n_B)} Y_i. \end{equation}\]

Note that the \(n_A\) and \(n_B\) will cancel out. Finally, if we denote the sample proportion of treated students in school \(A\) by \(\hat{e}(A_i) \approx n_{A,1}/n_A\) and similar for school \(B\), (3.6) can be written compactly as \[\begin{equation} \tag{3.8} \frac{1}{n} \sum_{i=1}^{n} \frac{W_i}{\hat{e}(X_i)} Y_i, \end{equation}\] where \(X_i \in \{A, B\}\) denotes the school membership. Quantity \(\hat{e}(X_i)\) is an estimate of the probability of treatment given the control variable \(e(X_i) = \mathop{\mathrm{P}}[W_i=1 | X_i ]\), also called the propensity score. Because (3.8) is a weighted average with weights \(W_i/\hat{e}(X_i)\), when written in this form we say that it is an inverse propensity-weighted estimator (IPW).

The IPW estimator is also defined when there are continuous covariates. In that case, we must estimate the assignment probability given some model, for example using logistic regression, forests, etc. With respect to modeling, the behavior of the IPW estimator is similar to the direct estimator: if \(\hat{e}(X_i)\) is an unbiased estimate of \(e(X_i)\), then (3.8) is an unbiased estimate of \(E[Y(1)]\); and we can show that if \(\hat{e}(X_i)\) is a consistent estimator of \(e(X_i)\), then (3.8) is consistent for \(E[Y(1)]\).

When the treatment propensity \(\hat{e}(X_i)\) is small, the summands in (3.8) can be very large. In particular, if \(\hat{e}(x)\) is exactly zero, then this estimator is undefined. That is why, in addition to requiring conditional unconfoundedness (3.3), it also requires the overlap condition (3.4). In any case, when overlap is small (i.e., \(\hat{e}(x)\) is very close to zero for many \(x\)), IPW becomes an unattractive estimator due to its high variance. We’ll see in the next section an improved estimator that builds upon IPW but is strictly superior and should be used instead.

Finally, we just derived an estimator of the average treated outcome, but we could repeat the argument for control units instead. Subtracting the two estimators leads to the following inverse propensity-weighted estimate of the treatment effect: \[\begin{equation} \tag{3.9} \widehat{\tau}^{IPW} := \frac{1}{n} \sum_{i=1}^{n} Y_i \frac{W_i}{\widehat{e}(X_i)} - \frac{1}{n} \sum_{i=1}^{n} Y_i \frac{(1 - W_i) }{1-\widehat{e}(X_i)}. \end{equation}\]

The argument above suggests the following algorithm:

Estimate the propensity scores \(e(X_i)\) by regressing \(W_i\) on \(X_i\), preferably using a non-parametric method.
Compute the IPW estimator summand: \[Z_i = Y_i \times \left(\frac{W_i}{\hat{e}(X_i)} - \frac{(1-W_i)}{(1-\hat{e}(X_i)} \right)\]
Compute the mean and standard error of the new variable \(Z_i\)

# Available in randomized settings and observational settings with unconfoundedness+overlap

# Estimate the propensity score e(X) via logistic regression using splines
fmla <- as.formula(paste0("~", paste0("bs(", covariates, ", df=3)", collapse="+")))
W <- data[,treatment]
Y <- data[,outcome]
XX <- model.matrix(fmla, data)
logit <- cv.glmnet(x=XX, y=W, family="binomial")
e.hat <- predict(logit, XX, s = "lambda.min", type="response")

# Using the fact that
z <- Y * (W/e.hat - (1-W)/(1-e.hat))
ate.est <- mean(z)
ate.se <- sd(z) / sqrt(length(z))
ate.tstat <- ate.est / ate.se
ate.pvalue <- 2*(pnorm(1 - abs(ate.est/ate.se)))
ate.results <- c(estimate=ate.est, std.error=ate.se, t.stat=ate.tstat, pvalue=ate.pvalue)
print(ate.results)

##       estimate      std.error         t.stat         pvalue 
##  -3.393831e-01   1.385475e-02  -2.449579e+01  4.503613e-122

3.6 Augmented inverse propensity-weighted (AIPW) estimator

The next augmented inverse propensity-weighted estimator (AIPW) of the treatment effect is available under unconfoundedness (3.3) and overlap (3.4): \[\begin{equation} \tag{3.10} \begin{aligned} \widehat{\tau}^{AIPW} &:= \frac{1}{n} \sum_{i=1}^{n} \hat{\mu}^{-i}(X_i, 1) - \hat{\mu}^{-i}(X_i, 0) \\ &+ \frac{W}{\widehat{e}^{-i}(X_i)} \left( Y_i - \hat{\mu}^{-i}(X_i, 1) \right) - \frac{1-W}{1-\widehat{e}^{-i}(X_i)} \left( Y_i - \hat{\mu}^{-i}(X_i, 0) \right) \end{aligned} \end{equation}\]

Let’s parse (3.10). Ignoring the superscripts for now, the first two terms correspond to an estimate of the treatment effect obtained via direct estimation. The next two terms resemble the IPW estimator, except that we replaced the outcome \(Y_i\) with the residuals \(Y_i - \hat{\mu}(X_i, W_i)\). At a high level, the last two terms estimate and subtract an estimate of the bias of the first.

The superscripts have to do with a technicality called cross-fitting that is required to prevent a specific form of overfitting and allow certain desirable asymptotic properties to hold. That is, we need to fit the outcome and propensity models using one portion of the data, and compute predictions on the remaining portion. An easy way of accomplishing this is to divide the data into K folds, and then estimate \(K\) models that are fitted on \(K-1\) folds, and then compute predictions on the remaining Kth fold. The example below does that with \(K=5\). Note that this is different from cross-validation: the goal of cross-validation is to obtain accurate estimates of the loss function for model selection, whereas the goal of cross-fitting is simply to not use the same observation to fit the model and produce predictions.

The AIPW estimator (3.10) has several desirable properties. First, it will be unbiased provided that either \(\hat{\mu}(X_i, w)\) or \(\widehat{e}(X_i, w)\) is unbiased. Second, it will be consistent provided that either \(\hat{\mu}(X_i, w)\) or \(\widehat{e}(X_i, w)\) is consistent. This property is called double robustness (DR), because only one of the outcome or propensity score models needs to be correctly specified, so the estimator is “robust” to misspecification in the remaining model. Under mild assumptions, the AIPW estimator is also asymptotically normal and efficient — that is has the smallest asymptotic variance among a large class of estimators that includes the IPW estimator above. This is an attractive property because smaller variance implies smaller confidence intervals, and more power to test hypotheses about this parameter. In fact, it even has smaller variance than the difference-in-means estimators in the randomized setting. Therefore, we recommend the following. In randomized settings, report the difference-in-means estimator (for simplicity) along with AIPW-based estimates (for efficiency); in observational settings with unconfoundedness and overlap, report AIPW-based estimates.

The next snippet provides an implementation of the AIPW estimator where outcome and propensity models are estimated using generalized linear models with splines (via glmnet and splines packages).

# Available in randomized settings and observational settings with unconfoundedness+overlap

# A list of vectors indicating the left-out subset
n <- nrow(data)
n.folds <- 5
indices <- split(seq(n), sort(seq(n) %% n.folds))

# Preparing data
W <- data[,treatment]
Y <- data[,outcome]

# Matrix of (transformed) covariates used to estimate E[Y|X,W]
fmla.xw <- formula(paste("~ 0 +", paste0("bs(", covariates, ", df=3)", "*", treatment, collapse=" + ")))
XW <- model.matrix(fmla.xw, data)
# Matrix of (transformed) covariates used to predict E[Y|X,W=w] for each w in {0, 1}
data.1 <- data
data.1[,treatment] <- 1
XW1 <- model.matrix(fmla.xw, data.1)  # setting W=1
data.0 <- data
data.0[,treatment] <- 0
XW0 <- model.matrix(fmla.xw, data.0)  # setting W=0

# Matrix of (transformed) covariates used to estimate and predict e(X) = P[W=1|X]
fmla.x <- formula(paste(" ~ 0 + ", paste0("bs(", covariates, ", df=3)", collapse=" + ")))
XX <- model.matrix(fmla.x, data)

# (Optional) Not penalizing the main effect (the coefficient on W)
penalty.factor <- rep(1, ncol(XW))
penalty.factor[colnames(XW) == treatment] <- 0

# Cross-fitted estimates of E[Y|X,W=1], E[Y|X,W=0] and e(X) = P[W=1|X]
mu.hat.1 <- rep(NA, n)
mu.hat.0 <- rep(NA, n)
e.hat <- rep(NA, n)
for (idx in indices) {
  # Estimate outcome model and propensity models
  # Note how cross-validation is done (via cv.glmnet) within cross-fitting! 
  outcome.model <- cv.glmnet(x=XW[-idx,], y=Y[-idx], family="gaussian", penalty.factor=penalty.factor)
  propensity.model <- cv.glmnet(x=XX[-idx,], y=W[-idx], family="binomial")

  # Predict with cross-fitting
  mu.hat.1[idx] <- predict(outcome.model, newx=XW1[idx,], type="response")
  mu.hat.0[idx] <- predict(outcome.model, newx=XW0[idx,], type="response")
  e.hat[idx] <- predict(propensity.model, newx=XX[idx,], type="response")
}

# Commpute the summand in AIPW estimator
aipw.scores <- (mu.hat.1 - mu.hat.0
                + W / e.hat * (Y -  mu.hat.1)
                - (1 - W) / (1 - e.hat) * (Y -  mu.hat.0))

# Tally up results
ate.aipw.est <- mean(aipw.scores)
ate.aipw.se <- sd(aipw.scores) / sqrt(n)
ate.aipw.tstat <- ate.aipw.est / ate.aipw.se
ate.aipw.pvalue <- 2*(pnorm(1 - abs(ate.aipw.tstat)))
ate.aipw.results <- c(estimate=ate.aipw.est, std.error=ate.aipw.se, t.stat=ate.aipw.tstat, pvalue=ate.aipw.pvalue)
print(ate.aipw.results)

##       estimate      std.error         t.stat         pvalue 
##  -3.341320e-01   9.564019e-03  -3.493636e+01  1.939058e-252

Here’s another example of AIPW-based estimation using random forests via the package grf. The function average_treatment_effect computes the AIPW estimate of the treatment effect, and uses forest-based estimates of the outcome model and propensity scores (unless those are passed directly via the arguments Y.hat and W.hat). Also, because forests are an ensemble method, cross-fitting is accomplished via out-of-bag predictions – that is, predictions for observation \(i\) are computed using trees that were not constructed using observation \(i\).

# Available in randomized settings and observational settings with unconfoundedness+overlap

# Input covariates need to be numeric. 
XX <- model.matrix(formula(paste0("~", paste0(covariates, collapse="+"))), data=data)

# Estimate a causal forest.
forest <- causal_forest(
              X=XX,  
              W=data[,treatment],
              Y=data[,outcome],
              #W.hat=...,  # In randomized settings, set W.hat to the (known) probability of assignment
              num.trees = 100)
forest.ate <- average_treatment_effect(forest)
print(forest.ate)

##    estimate     std.err 
## -0.33233019  0.01273536

Finally, even when overlap holds, when the propensity score is small AIPW can have unstable performance. In that case, other estimators of the average treatment effect are available. For example, see the approximate residual balancing (ARB) method in Athey, Imbens and Wager (2016) and its accompanying R package balanceHD.

3.7 Diagnostics

Here we show some basic diagnostics and best practices that should always be conducted. Although failing these tests should usually cause concern, passing them does not mean that the analysis is entirely free from issues. For example, even if our estimated propensity scores are bounded away from zero and one (and therefore seem to satisfy the overlap assumption), they may still be incorrect – for example, if we rely on modeling assumptions that are simply not true. Also, although we should expect to pass all the diagnostic tests below in randomized settings, it’s good practice to check them anyway, as one may find problems with the experimental design.

3.7.1 Assessing balance

As we saw in Section 3.3, in observational settings the covariate distributions can be very different for treated and untreated individuals. Such discrepancies can lead to biased estimates of the ATE, but as hinted in Section 3.5, one would hope that by up- or down-weighting observations based on inverse propensity weights their averages should be similar. In fact, we should also expect that averages of basis functions of the covariates should be similar after reweighting. This property is called balance. One way to check it is as follows. Given some variable \(Z_i\) (e.g., a covariate \(X_{i1}\), or an interaction between covariates \(X_{i1}X_{i2}\), or a polynomial in covariates \(X_{i1}^2\), etc), we can check the absolute standardized mean difference (ASMD) of \(Z_i\) between treated and untreated individuals in our data, \[\begin{equation} \frac{|\bar{Z}_1 - \bar{Z}_0|}{\sqrt{s_1^2 + s_0^2}}, \end{equation}\] where \(\bar{Z}_1\) and \(\bar{Z}_0\) are sample averages of \(Z_i\), and \(s_1\) and \(s_0\) are standard deviations of \(Z_i\) for the two samples of treated and untreated individuals. Next, we can check the same quantity for their weighted counterparts \(Z_{i}W_i/\hat{e}(X_i)\) and \(Z_i(1-W_i)/(1-\hat{e}(X_i))\). If our propensity scores are well-calibrated, the ASMD for the weighted version should be close to zero.

# Here, adding covariates and their interactions, though there are many other possibilities.
fmla <- formula(paste("~ 0 +", paste(apply(expand.grid(covariates, covariates), 1, function(x) paste0(x, collapse="*")), collapse="+")))

# Using the propensity score estimated above
e.hat <- forest$W.hat

XX <- model.matrix(fmla, data)
W <- data[,treatment]
pp <- ncol(XX)

# Unadjusted covariate means, variances and standardized abs mean differences
means.treat <- apply(XX[W == 1,], 2, mean)
means.ctrl <- apply(XX[W == 0,], 2, mean)
abs.mean.diff <- abs(means.treat - means.ctrl)

var.treat <- apply(XX[W == 1,], 2, var)
var.ctrl <- apply(XX[W == 0,], 2, var)
std <- sqrt(var.treat + var.ctrl)

# Adjusted covariate means, variances and standardized abs mean differences
means.treat.adj <- apply(XX*W/e.hat, 2, mean)
means.ctrl.adj <- apply(XX*(1-W)/(1-e.hat), 2, mean)
abs.mean.diff.adj <- abs(means.treat.adj - means.ctrl.adj)

var.treat.adj <- apply(XX*W/e.hat, 2, var)
var.ctrl.adj <- apply(XX*(1-W)/(1-e.hat), 2, var)
std.adj <- sqrt(var.treat.adj + var.ctrl.adj)

# Plotting
par(oma=c(0,4,0,0))
plot(-2, xaxt="n", yaxt="n", xlab="", ylab="", xlim=c(-.01, 1.3), ylim=c(0, pp+1), main="Standardized absolute mean differences")
axis(side=1, at=c(-1, 0, 1), las=1)
lines(abs.mean.diff / std, seq(1, pp), type="p", col="blue", pch=19)
lines(abs.mean.diff.adj / std.adj, seq(1, pp), type="p", col="orange", pch=19)
legend("topright", c("Unadjusted", "Adjusted"), col=c("blue", "orange"), pch=19)
abline(v = seq(0, 1, by=.25), lty = 2, col = "grey", lwd=.5)
abline(h = 1:pp,  lty = 2, col = "grey", lwd=.5)
mtext(colnames(XX), side=2, cex=0.7, at=1:pp, padj=.4, adj=1, col="black", las=1, line=.3)
abline(v = 0)

Note above how in particular age and polviews – the variables we chose to introduce imbalance in Section 3.3 – are far from balanced before adjustment, as are other variables that correlate with them.

In addition to the above, we can check the entire distribution of covariates (and their transformations). The next snippet plots histograms for treated and untreated individuals with and without adjustment. Note how the adjusted histograms are very similar for untreated (red) and treated (green) samples – that is what we should expect.

# change this if working on different data set
XX <- model.matrix(fmla, data)[,c("age", "polviews", "age:polviews", "educ")] 
W <- data[,treatment]
pp <- ncol(XX)

plot.df <- data.frame(XX, W = as.factor(W), IPW = ifelse(W == 1, 1 / e.hat, 1 / (1 - e.hat)))
plot.df <- reshape(plot.df, varying = list(1:pp), direction = "long", v.names = "X",
                   times = factor(colnames(XX), levels = colnames(XX)))

ggplot(plot.df, aes(x = X, fill = W)) +
  geom_histogram(alpha = 0.5, position = "identity", bins=30) +
  facet_wrap( ~ time, ncol = 2, scales = "free") +
  ggtitle("Covariate histograms (unajusted)")

ggplot(plot.df, aes(x = X, weight=IPW, fill = W)) +
  geom_histogram(alpha = 0.5, position = "identity", bins=30) +
  facet_wrap( ~ time, ncol = 2, scales = "free") +
  ggtitle("Covariate histograms (adjusted)")

There exist other types of balance checks. We recommend checking out the R package cobalt. This vignette is a good place to start.

3.7.2 Assessing overlap

It’s also important to check the estimated propensity scores. If they seem to cluster at zero or one, we should expect IPW and AIPW estimators to behave very poorly. Here, the propensity score is trimodal because of our sample modification procedure in Section 3.3: some observations are untouched and therefore remain with assignment probability \(0.5\), some are dropped (kept) with probability 15% (85%).

e.hat <- forest$W.hat
hist(e.hat, main="Estimated propensity scores (causal forest)", breaks=100, freq=FALSE, xlab="", ylab="", xlim=c(-.1, 1.1))

When overlap fails, the methods described above will not produce good estimates. In that case, one may consider changing the estimand and targeting the average treatment effect on the treated (ATT) \(\mathop{\mathrm{E}}[Y_i(1) - Y_i(0) | W_i=1]\), or trimming the sample and focusing on some subgroup \(G_i\) with bounded propensity scores \(\mathop{\mathrm{E}}[Y_i(1) - Y_i(0) | G_i]\), as discussed in
Crump, Hotz, Imbens and Mitnik (Biometrika, 2009).

3.8 Further reading

Besides the references noted in the text, for more information on the theory behind the estimators studied in this tutorial we recommend Stefan Wager’s lecture notes (Lectures 1-4 and 7).