4 ATE I: Binary treatment

Before you jump into this tutorial, please make sure you’ve read the previous chapters.

4.1 Setup

Our goal will be to estimate the average treatment effect (ATE) of a binary treatment.

As an example, we will be using the following experimental dataset.

# Load the packages used in this tutorial
library(grf)       # causal forests and AIPW ATE estimates
library(glmnet)    # lasso (cv.glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1
library(splines)   # bs() spline basis expansions
library(lmtest)    # coeftest()
library(sandwich)  # heteroskedasticity-robust covariance estimators
library(MASS)

# Read in the experimental data
data <- read.csv(file = "https://raw.githubusercontent.com/gsbDBI/ExperimentData/master/Welfare/ProcessedData/welfarenolabel3.csv")
n <- nrow(data)
outcome <- "y"
treatment <- "w"

4.2 Difference-in-means estimator

The difference-in-means estimator of the average treatment effect is very simple. It’s the sample average of outcomes in treatment minus the sample average of outcomes in control.

\[\begin{equation} \widehat{\tau}^{DIFF} = \frac{1}{n_1} \sum_{i: W_i = 1} Y_i - \frac{1}{n_0} \sum_{i: W_i = 0} Y_i \qquad \text{where} \quad n_w := |\{ i: W_i = w \}| \end{equation}\]

Here’s one way to compute the difference-in-means estimator and its associated statistics (t-stats, p-values, etc.) directly.

ate.est <- mean(data$y[data$w == 1]) - mean(data$y[data$w == 0])
ate.se <- sqrt(var(data$y[data$w == 1]) / sum(data$w == 1) + var(data$y[data$w == 0]) / sum(data$w == 0))
ate.tstat <- ate.est/ate.se
ate.pvalue <- 2 * pnorm(-abs(ate.tstat))  # two-sided p-value
ate.results <- c(estimate=ate.est, std.error=ate.se, t.stat=ate.tstat, pvalue=ate.pvalue)
print(ate.results)
##     estimate    std.error       t.stat       pvalue 
##  -0.33795434   0.00431672 -78.28959748   0.00000000

We can also compute the same quantity via linear regression, using the fact that

\[Y_i = Y_i(0) + W_i (Y_i(1) - Y_i(0)).\]

Therefore, taking expectations conditional on treatment assignment,

\[E[Y_i \, | \, W_i] = \alpha + W_i \tau \qquad \text{where } \alpha = E[Y_i(0)].\]

[Exercise: make sure you understand this decomposition. Where is the unconfoundedness assumption used?] This result implies that we can estimate the ATE of a binary treatment via a linear regression of observed outcomes \(Y_i\) on a vector consisting of intercept and treatment assignment, \((1, W_i)\).

# Do not use! standard errors are not robust to heteroskedasticity! (See below)
ols <- lm(y ~ w, data=data)
coef(summary(ols))[2,]
##     Estimate   Std. Error      t value     Pr(>|t|) 
##  -0.33795434   0.00417836 -80.88211212   0.00000000

The point estimate is the same as the one we got above by computing the ATE “directly” via mean(Y[W==1]) - mean(Y[W==0]). However, the standard errors are different: by default, R’s lm command does not compute heteroskedasticity-robust standard errors. This is easily fixed.

# Use this instead. Standard errors are heteroskedasticity-robust!
# Only valid for experimental data!
ols <- lm(y ~ w, data=data)
lmtest::coeftest(ols, vcov=sandwich::vcovHC(ols, type='HC2'))[2,]
##     Estimate   Std. Error      t value     Pr(>|t|) 
##  -0.33795434   0.00431672 -78.28959748   0.00000000

Takeaways

  • The difference-in-means estimator is a simple, easily computable, unbiased, “model-free” estimator of the treatment effect. It should always be reported.

  • In a binary randomized experiment, one can estimate the average treatment effect (ATE) by running a linear regression of outcomes on an intercept and the treatment assignment. The coefficient on the treatment indicator equals the “difference-in-means” estimate of the ATE. Make sure to use heteroskedasticity-robust standard errors.

  • The difference-in-means estimator will be biased and inconsistent if the unconfoundedness assumption is violated; for example, if observations are allowed to self-select into treatment. In that case, one should use the AIPW estimators in the next section.

4.3 AIPW estimators

The estimators presented in this section can be used even in observational settings (provided the assumptions below are satisfied). They also leverage information about how average outcomes \(Y_i\) change depending on observable characteristics \(X_i\).

4.3.1 Inverse propensity-weighted estimator

Let’s consider two possible estimators of the ATE that have particular advantages and disadvantages, and then combine them in a way that allows us to use the best properties of both.

The inverse propensity-weighted (IPW) estimator is \[ \widehat{\tau}^{IPW} := \frac{1}{n} \sum_{i=1}^n \frac{Y_i \mathbb{I}\{W_i = 1\}}{\hat{e}(X_i)} - \frac{1}{n} \sum_{i=1}^n \frac{Y_i \mathbb{I}\{W_i = 0\}}{1 - \hat{e}(X_i)}, \] where \(\hat{e}(X_i)\) is an estimate of the propensity score \(e(x) := P(W_i = 1 | X_i = x)\).
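
Since this is a randomized experiment, we can sketch the IPW estimator using a known assignment probability. The snippet below assumes treatment was assigned with probability 0.5 (the same assumption made via the W.hat argument in the causal forest example further down); in an observational setting, estimated propensity scores \(\hat{e}(X_i)\) would be plugged in instead, as in the lasso example below.

# Minimal IPW sketch. ASSUMPTION: treatment assigned with known probability 0.5;
# in an observational setting, replace e.hat.known with estimated propensity scores.
e.hat.known <- rep(0.5, n)
ipw.scores <- data$y * data$w / e.hat.known - data$y * (1 - data$w) / (1 - e.hat.known)
ipw.results <- c(estimate=mean(ipw.scores), std.error=sd(ipw.scores) / sqrt(n))
print(ipw.results)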

4.3.2 Direct estimation

Another estimator is suggested by the following decomposition of the ATE. \[\mathbb{E}[Y_i(1) - Y_i(0)] = \mathbb{E}[\mathbb{E}[Y_i | X_i, W_i=1]] - \mathbb{E}[\mathbb{E}[Y_i | X_i, W_i=0]]\]

[Exercise: make sure you understand this. Where is the conditional unconfoundedness assumption used?]

The decomposition above suggests the following procedure. First, estimate a model of the inner conditional expectations \(\mathbb{E}[Y | X, W=1]\) and \(\mathbb{E}[Y | X, W=0]\) via regression, and then average the predictions over the data. Denoting the estimated model by \(\hat{\mu}(x, w)\), this leads to the following estimate of the ATE, sometimes called the direct estimate: \[ \widehat{\tau}^{DM} := \frac{1}{n} \sum_{i=1}^n \hat{\mu}(X_i, 1) - \frac{1}{n} \sum_{i=1}^n \hat{\mu}(X_i, 0). \]

# Do not use! We'll see a better estimator below.
covariates <- c("income", "educ", "marital", "sex", "race")
fmla <- as.formula(paste0("y ~ ", paste(covariates, "* w", collapse="+")))
model <- lm(fmla, data=data)  # Fitting some model of E[Y|X,W]
muhat.treat <- predict(model, newdata=transform(data, w=1))
muhat.ctrl <- predict(model, newdata=transform(data, w=0))
ate.est <- mean(muhat.treat) - mean(muhat.ctrl)
print(ate.est)
## [1] -0.336924

The advantage of this estimator is that it allows us to leverage regression techniques to estimate the ATE, so the resulting estimate should have smaller variance than alternatives. However, it has several disadvantages that make it undesirable. First, its properties rely heavily on the model \(\hat{\mu}(x, w)\) being well-specified: it will be an unbiased and/or consistent estimate of the ATE provided that \(\hat{\mu}(x, w)\) is an unbiased and/or consistent estimator of \(\mathbb{E}[Y | X=x, W = w]\). In practice, having a well-specified model is not something we want to rely upon. [1] In general, it will also not be asymptotically normal, which means that we can’t easily compute t-statistics and p-values.

4.3.3 Augmented inverse propensity-weighted (AIPW) estimator

The following estimator combines the two approaches above:

\[\begin{equation} \tag{4.1} \widehat{\tau}^{DR} := \frac{1}{n} \sum_{i=1}^n \left[ \hat{\mu}^{-i}(X_i, 1) - \hat{\mu}^{-i}(X_i, 0) + \frac{\mathbb{I}\{W_i = 1\}}{\hat{e}^{-i}(X_i)} \left( Y_i - \hat{\mu}^{-i}(X_i, 1) \right) - \frac{\mathbb{I}\{W_i = 0\}}{1 - \hat{e}^{-i}(X_i)} \left( Y_i - \hat{\mu}^{-i}(X_i, 0) \right) \right] \end{equation}\]

Let’s parse (4.1). Ignoring the superscripts for now, the first two terms correspond to an estimate of the treatment effect obtained via direct estimation. The next two terms resemble the IPW estimator, except that we replaced the outcome \(Y_i\) with the residuals \(Y_i - \hat{\mu}(X_i, w)\).

This estimator is called the augmented inverse propensity-weighted (AIPW) estimator. It has several desirable properties. First, it will be unbiased provided that either \(\hat{\mu}(X_i, w)\) or \(\hat{e}(X_i)\) is unbiased. Second, it will be consistent provided that either \(\hat{\mu}(X_i, w)\) or \(\hat{e}(X_i)\) is consistent. This property is called double robustness (DR), because only one of the outcome or propensity score models needs to be correctly specified, so the estimator is “robust” to misspecification in the remaining model. Under mild assumptions, it is also asymptotically normal and efficient, that is, it has the smallest asymptotic variance among a large class of estimators [2].

The superscripts have to do with a technicality called cross-fitting. For the properties above to hold, we need to fit the outcome and propensity models on one portion of the data and compute predictions on the remaining portion. An easy way of accomplishing this is to divide the data into K folds, estimate K models, each fitted on the other K-1 folds, and then compute predictions on the held-out fold. The lasso example below does this with K=5; the forest-based example handles cross-fitting automatically via out-of-bag prediction.

4.3.3.1 Using GRF

Here’s an example using random forests via the package grf. The function average_treatment_effect computes the AIPW estimate of the treatment effect, using forest-based estimates of the outcome model and propensity scores (unless those are passed directly via the arguments Y.hat and W.hat). Also, because forests are an ensemble method, cross-fitting is accomplished via out-of-bag predictions: predictions for observation \(i\) are computed using only trees that were not constructed using observation \(i\).

covariates <- c("income", "educ", "marital", "sex", "race")
forest <- grf::causal_forest(
              X=data[,covariates],  # Note all covariates need to be numeric
              W=data$w,
              Y=data$y,
              W.hat=rep(.5, nrow(data)),  # Remove this line in an observational setting
              num.trees=200, # Using few trees for speed; in practice, use num.trees roughly equal to nrow(data) if you can
              seed=1)
forest.ate <- grf::average_treatment_effect(forest)
print(forest.ate)
##    estimate     std.err 
## -0.33540108  0.00414159

4.3.3.2 Using LASSO

Here, we fit the outcome model via lasso on spline expansions of the covariates interacted with the treatment indicator (leaving the treatment coefficient unpenalized), and the propensity score via lasso-penalized logistic regression, with cross-fitting over K=5 folds.

fmla.xw <- formula(paste("~ 0 +", paste0("bs(", covariates, ")", "*", treatment, collapse=" + ")))
fmla.x <- formula(paste(" ~ 0 + ", paste0("bs(", covariates, ")", collapse=" + ")))

W <- data[[treatment]]
Y <- data[[outcome]]
XW <- model.matrix(fmla.xw, data)
XW1 <- model.matrix(fmla.xw, transform(data, w=1))  # setting W=1
XW0 <- model.matrix(fmla.xw, transform(data, w=0))  # setting W=0
XX <- model.matrix(fmla.x, data)

penalty.factor <- rep(1, ncol(XW))
penalty.factor[colnames(XW) == treatment] <- 0

n.folds <- 5
indices <- split(seq(n), sort(seq(n) %% n.folds))  # list of row indices, one element per fold

mu.hat.1 <- rep(NA, n)
mu.hat.0 <- rep(NA, n)
e.hat <- rep(NA, n)
for (idx in indices) {
  # Estimate outcome model and propensity models
  outcome.model <- cv.glmnet(x=XW[-idx,], y=Y[-idx], family="gaussian", penalty.factor=penalty.factor)
  propensity.model <- cv.glmnet(x=XX[-idx,], y=W[-idx], family="binomial")

  # Predict with cross-fitting
  mu.hat.1[idx] <- predict(outcome.model, newx=XW1[idx,], type="response")
  mu.hat.0[idx] <- predict(outcome.model, newx=XW0[idx,], type="response")
  e.hat[idx] <- predict(propensity.model, newx=XX[idx,], type="response")
}

# AIPW scores, as in (4.1)
aipw.scores <- (mu.hat.1 - mu.hat.0
                + W / e.hat * (Y -  mu.hat.1)
                - (1 - W) / (1 - e.hat) * (Y -  mu.hat.0))

# tally up results
ate.aipw.est <- mean(aipw.scores)
ate.aipw.se <- sd(aipw.scores) / sqrt(n)
ate.aipw.tstat <- ate.aipw.est / ate.aipw.se
ate.aipw.pvalue <- 2 * pnorm(-abs(ate.aipw.tstat))  # two-sided p-value

ate.aipw.results <- c(estimate=ate.aipw.est, std.error=ate.aipw.se, t.stat=ate.aipw.tstat, pvalue=ate.aipw.pvalue)
ate.aipw.results
##    estimate   std.error      t.stat      pvalue 
##  -0.3395912   0.0042817 -79.3122294   0.0000000

Takeaways

  • Under additional assumptions, there exist estimators of the ATE that have smaller asymptotic variance relative to the simple “difference-in-means” estimator. In addition, these alternative estimators will also be available to us even in observational settings, where the “difference-in-means” estimator would be biased and inconsistent.

  • The two key assumptions are unconfoundedness and overlap, stated formally below.
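
For reference, these two assumptions can be stated as follows, where \(e(x)\) denotes the propensity score defined earlier.

\[\{Y_i(0), Y_i(1)\} \perp W_i \mid X_i \qquad \text{(unconfoundedness)}\]

\[0 < e(x) < 1 \quad \text{for all } x \qquad \text{(overlap)}\]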

4.4 Technical notes

[1] In practice, one should use non-parametric regression methods such as forests or sieve estimators (i.e., flexible “machine learning” methods). These methods increase in complexity with the size of the dataset, and often under mild conditions they will be consistent. However, non-parametric methods need regularization, which introduces “regularization bias.” Therefore, the direct method will often be consistent, but not unbiased. It will also not be asymptotically normal in general.

[2] When is double robustness actually useful? In practice, if the outcome model \(\hat{\mu}(x, w)\) or the propensity score model \(\hat{e}(x)\) is unknown, we should use non-parametric estimators, so “consistency” is always guaranteed. Also, finite-sample bias may often vanish fast enough that we may not care about it. However, this property is very convenient in experiments in which propensity scores are known, since in that case the estimator is guaranteed to be unbiased.
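
To illustrate the last point, here’s a minimal sketch that recomputes the AIPW scores from the lasso example above, but plugs in the known assignment probability (assumed here to be 0.5, as in the causal forest example) in place of the estimated propensity scores.

# ASSUMPTION: treatment assigned with known probability 0.5.
# Reuses mu.hat.1, mu.hat.0, W, Y, and n from the lasso example above.
e.known <- rep(0.5, n)
aipw.scores.known <- (mu.hat.1 - mu.hat.0
                      + W / e.known * (Y - mu.hat.1)
                      - (1 - W) / (1 - e.known) * (Y - mu.hat.0))
print(c(estimate=mean(aipw.scores.known), std.error=sd(aipw.scores.known) / sqrt(n)))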


4.5 Read more

Stefan Wager’s lecture notes, Lectures 1-4: https://web.stanford.edu/~swager/stats361.pdf
