5 HTE I: Binary treatment
Before you jump into this tutorial, please make sure you've read the preceding chapters, in particular ATE I, whose results we build on here.
5.1 Setup

# Load the packages used below.
library(grf)         # causal forests
library(lmtest)      # coeftest
library(sandwich)    # vcovHC
library(rpart.plot)  # plotting trees

data <- read.csv(file = "https://raw.githubusercontent.com/gsbDBI/ExperimentData/master/Welfare/ProcessedData/welfarenolabel3.csv")

covariates <- c("income", "educ", "marital", "sex", "polviews")
outcome <- "y"
treatment <- "w"

data <- data[1:10000,]  # for now

num.groups <- 5
n <- nrow(data)
5.2 Pre-specified hypotheses
We are interested in testing pre-specified null hypotheses of the form \[\begin{equation} \tag{5.1} H_0: \mathbb{E}[Y(1) - Y(0) | G_{i}=1] = \mathbb{E}[Y(1) - Y(0) | G_{i}=0]. \end{equation}\]
That is, the null states that the treatment effect is the same regardless of membership in some group \(G_i\). Importantly, for now we'll assume that the group \(G_i\) was defined before running the experiment.
In experimental data, if both the treatment \(W_i\) and group membership \(G_i\) are binary, we can write \[\begin{equation} \tag{5.2} \mathbb{E}[Y_i | W_i, G_i] = \beta_0 + \beta_w W_i + \beta_g G_i + \beta_{wg} W_i G_i. \end{equation}\]
This decomposition is true without loss of generality. Why?
This allows us to write the average effects of \(W_i\) and \(G_i\) on \(Y_i\) in the following way, \[\begin{equation} \tag{5.3} \begin{aligned} \mathbb{E}[Y(1) | G_i=1] &= \mathbb{E}[Y| W_i=1, G_i=1] = \beta_0 + \beta_w + \beta_g + \beta_{wg} \\ \mathbb{E}[Y(1) | G_i=0] &= \mathbb{E}[Y| W_i=1, G_i=0] = \beta_0 + \beta_w \\ \mathbb{E}[Y(0) | G_i=1] &= \mathbb{E}[Y| W_i=0, G_i=1] = \beta_0 + \beta_g \\ \mathbb{E}[Y(0) | G_i=0] &= \mathbb{E}[Y| W_i=0, G_i=0] = \beta_0. \end{aligned} \end{equation}\]
Rewriting the null hypothesis (5.1) in terms of the decomposition (5.3), note that \(\mathbb{E}[Y(1) - Y(0) | G_i=1] - \mathbb{E}[Y(1) - Y(0) | G_i=0] = \beta_{wg}\), so the null boils down to a test about the interaction coefficient: \(\beta_{wg} = 0\). Here's an example:
# Example: suppose this group was defined prior to collecting the data.
data$conservative <- factor(data$polviews < 4)  # binary
group <- 'conservative'

# Only valid for experimental data!
fmla <- formula(paste(outcome, ' ~ ', treatment, '*', group))
ols <- lm(fmla, data=data)
coeftest(ols, vcov=vcovHC(ols, type='HC2'))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.441134 0.008446 52.230 < 2e-16 ***
## w -0.347980 0.009635 -36.115 < 2e-16 ***
## conservativeTRUE -0.121201 0.015928 -7.609 3.01e-14 ***
## w:conservativeTRUE 0.082977 0.017658 4.699 2.65e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpret the results above.
When there are many groups, we need to correct for the fact that we are testing multiple hypotheses, or we will end up with many false positives. Here, each null hypothesis compares a group \(g\) with a baseline group: \[\begin{equation} \tag{5.4} H_0: \mathbb{E}[Y(1) - Y(0) | G_{i}=g] = \mathbb{E}[Y(1) - Y(0) | G_{i}=1] \qquad \text{(for many values of }g\text{)}. \end{equation}\]
The Bonferroni correction is a very conservative method for dealing with multiple hypothesis testing: it simply multiplies each unadjusted p-value by the number of hypotheses being tested, capping the result at one.
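As a quick illustration of the mechanics (our example, using base R's p.adjust): with two hypotheses, each p-value is doubled.

p.adjust(c(0.01, 0.04), method='bonferroni')
## [1] 0.02 0.08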
# Example: these groups must be defined prior to collecting the data.
group <- 'polviews'

# Only valid for experimental data!
# Likely too conservative to be useful.
fmla <- formula(paste(outcome, ' ~ ', treatment, '*', 'factor(', group, ')'))
ols <- lm(fmla, data=data)
interact <- which(sapply(names(coef(ols)), function(x) grepl("w:", x)))

ols.res <- coeftest(ols, vcov=vcovHC(ols, type='HC2'))
unadj.p.value <- ols.res[interact, 4]
adj.p.value <- p.adjust(unadj.p.value, method='bonferroni')

data.frame(estimate=coef(ols)[interact],
           std.err=ols.res[interact, 2],
           unadj.p.value, adj.p.value)
## estimate std.err unadj.p.value adj.p.value
## w:factor(polviews)2 0.0144885 0.0504350 0.77391115 1.0000000
## w:factor(polviews)3 -0.0305790 0.0501349 0.54191895 1.0000000
## w:factor(polviews)4 -0.0851243 0.0470780 0.07061270 0.4942889
## w:factor(polviews)4.1220088 -0.0148913 0.0566752 0.79275065 1.0000000
## w:factor(polviews)5 -0.1335512 0.0496252 0.00713144 0.0499201
## w:factor(polviews)6 -0.0763852 0.0511340 0.13525321 0.9467724
## w:factor(polviews)7 -0.1558322 0.0720950 0.03068162 0.2147713
The Bonferroni correction is often too stringent for most purposes. A less conservative option is the Romano-Wolf correction, which takes the dependence among the test statistics into account. The code below performs this correction.
# Computes adjusted p-values following the Romano-Wolf method.
# For a reference, see http://ftp.iza.org/dp12845.pdf page 8
#   t.orig: vector of t-statistics from original model
#   t.boot: matrix of t-statistics from bootstrapped models
romano_wolf_correction <- function(t.orig, t.boot) {
  abs.t.orig <- abs(t.orig)
  abs.t.boot <- abs(t.boot)
  abs.t.sorted <- sort(abs.t.orig, decreasing = TRUE)

  max.order <- order(abs.t.orig, decreasing = TRUE)
  rev.order <- order(max.order)

  M <- nrow(t.boot)
  S <- ncol(t.boot)

  p.adj <- rep(0, S)
  p.adj[1] <- mean(apply(abs.t.boot, 1, max) > abs.t.sorted[1])
  for (s in seq(2, S)) {
    cur.index <- max.order[s:S]
    p.init <- mean(apply(abs.t.boot[, cur.index, drop=FALSE], 1, max) > abs.t.sorted[s])
    p.adj[s] <- max(p.init, p.adj[s-1])
  }
  p.adj[rev.order]
}
# Computes adjusted p-values for linear regression (lm) models.
summary_rw_lm <- function(model, indices=NULL, cov.type="HC2", num.boot=10000, seed=2020) {

  if (is.null(indices)) {
    indices <- 1:nrow(coef(summary(model)))
  }

  # Grab the original t values.
  summary <- coef(summary(model))[indices, , drop=FALSE]
  t.orig <- summary[, "t value"]

  # Null resampling.
  # This is a bit of a trick to speed up bootstrapping linear models.
  # Here, we don't really need to re-fit linear regressions, which would be slow.
  # We know that betahat ~ N(beta, Sigma), and we have an estimate Sigma.hat.
  # So we can approximate "null t-values" by:
  #   - Drawing beta.boot ~ N(0, Sigma.hat) -- note the zero mean here; this is what makes it a *null* t-value.
  #   - Computing t.boot = beta.boot / sqrt(diag(Sigma.hat))
  Sigma.hat <- sandwich::vcovHC(model, type=cov.type)[indices, indices]
  se.orig <- sqrt(diag(Sigma.hat))
  num.coef <- length(se.orig)
  set.seed(seed)  # for reproducible bootstrap draws
  beta.boot <- MASS::mvrnorm(n=num.boot, mu=rep(0, num.coef), Sigma=Sigma.hat)
  t.boot <- sweep(beta.boot, 2, se.orig, "/")
  p.adj <- romano_wolf_correction(t.orig, t.boot)

  result <- cbind(summary[, c(1, 2, 4), drop=F], p.adj)
  colnames(result) <- c('Estimate', 'Std. Error', 'Orig. p-value', 'Adj. p-value')
  result
}
# Only valid for experimental data!
fmla <- formula(paste(outcome, ' ~ ', treatment, '*', 'factor(', group, ')'))
ols <- lm(fmla, data=data)
interact <- which(sapply(names(coef(ols)), function(x) grepl("w:", x)))
summary_rw_lm(ols, indices=interact)
## Estimate Std. Error Orig. p-value Adj. p-value
## w:factor(polviews)2 0.0144885 0.0560185 0.7959202 0.9412
## w:factor(polviews)3 -0.0305790 0.0553014 0.5803085 0.8479
## w:factor(polviews)4 -0.0851243 0.0524521 0.1046434 0.2484
## w:factor(polviews)4.1220088 -0.0148913 0.0613334 0.8081713 0.9412
## w:factor(polviews)5 -0.1335512 0.0543942 0.0140957 0.0556
## w:factor(polviews)6 -0.0763852 0.0549884 0.1648285 0.3417
## w:factor(polviews)7 -0.1558322 0.0705879 0.0272926 0.0935
When working with either experimental or observational data, one can use AIPW scores \(\widehat{\Gamma}_i\) in place of the raw outcomes \(Y_i\).
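As a reminder (this restates the construction from ATE I), these doubly robust scores take the form \[\begin{equation} \widehat{\Gamma}_i = \widehat{\mu}(X_i, 1) - \widehat{\mu}(X_i, 0) + \frac{W_i}{\widehat{e}(X_i)}\left(Y_i - \widehat{\mu}(X_i, 1)\right) - \frac{1-W_i}{1-\widehat{e}(X_i)}\left(Y_i - \widehat{\mu}(X_i, 0)\right), \end{equation}\] where \(\widehat{\mu}\) and \(\widehat{e}\) are (ideally cross-fitted) estimates of the conditional mean outcome and the propensity score. Here's how to do it using grf: in the code below, the get_scores function retrieves these scores from a fitted causal forest.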
# Valid for both experimental and observational data.
cf <- causal_forest(X=data[,covariates],
                    Y=data[,outcome],
                    W=data[,treatment],
                    W.hat=NULL,      # use true propensity score if known
                    num.trees=200)   # in practice use num.trees ~ nrow(data)

fmla <- formula(paste('scores ~ ', treatment, '* factor(', group, ')'))
ols <- lm(fmla, data=transform(data, scores=get_scores(cf)))
interact <- which(sapply(names(coef(ols)), function(x) grepl("w:", x)))
summary_rw_lm(ols, indices=interact)
## Estimate Std. Error Orig. p-value Adj. p-value
## w:factor(polviews)2 -0.00792069 0.122445 0.948424 0.9999
## w:factor(polviews)3 -0.00971677 0.120877 0.935932 0.9999
## w:factor(polviews)4 0.01698675 0.114650 0.882218 0.9999
## w:factor(polviews)4.1220088 -0.02304272 0.134062 0.863535 0.9999
## w:factor(polviews)5 0.01035451 0.118895 0.930602 0.9999
## w:factor(polviews)6 -0.01676302 0.120193 0.889084 0.9999
## w:factor(polviews)7 0.20342214 0.154291 0.187389 0.4891
One can also construct AIPW scores using any other method. In ATE I we showed how to construct AIPW scores using lasso and splines, via the glmnet and splines packages.
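For concreteness, here is a minimal sketch (ours, not from ATE I) of what that construction might look like with lasso outcome models via glmnet. It assumes a completely randomized experiment with a constant treatment probability and, for brevity, skips the cross-fitting that ATE I uses.

# A minimal AIPW sketch with lasso outcome models (see assumptions above).
library(glmnet)

XX <- as.matrix(data[, covariates])
WW <- data[[treatment]]
YY <- data[[outcome]]
p.hat <- mean(WW)  # assumption: constant, known randomization probability

# Lasso outcome models fit separately by treatment arm.
# (For valid inference, these should be cross-fitted as in ATE I.)
fit1 <- cv.glmnet(XX[WW == 1,], YY[WW == 1])
fit0 <- cv.glmnet(XX[WW == 0,], YY[WW == 0])
mu.hat.1 <- drop(predict(fit1, newx=XX, s="lambda.min"))
mu.hat.0 <- drop(predict(fit0, newx=XX, s="lambda.min"))

# AIPW scores, as in the formula above.
gamma.hat <- (mu.hat.1 - mu.hat.0
              + WW / p.hat * (YY - mu.hat.1)
              - (1 - WW) / (1 - p.hat) * (YY - mu.hat.0))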
5.3 Data-driven hypotheses
Instead of coming up with hypotheses before running the experiment, we can let the data itself suggest which subgroups have different treatment effects. To keep inference valid, we rely on sample splitting: one part of the data is used to discover the subgroups, and a disjoint part is used to estimate and test the treatment effects within them.
5.3.1 Via causal trees
When dealing with experimental data, we can discover such subgroups using the honest.causalTree function from the causalTree package.
# Only valid for experimental data!
# remotes::install_github('susanathey/causalTree')  # Uncomment this to install the causalTree package
library(causalTree)

fmla <- paste(outcome, " ~", paste(covariates, collapse = " + "))

# Dividing data into three subsets
indices <- split(seq(nrow(data)), sort(seq(nrow(data)) %% 3))
names(indices) <- c('split', 'est', 'test')

# Fitting the tree
ct.unpruned <- honest.causalTree(
  formula=fmla,                          # Define the model
  data=data[indices$split,],
  treatment=data[indices$split, treatment],
  est_data=data[indices$est,],
  est_treatment=data[indices$est, treatment],
  minsize=2,                             # Min. number of treatment and control cases in each leaf
  HonestSampleSize=length(indices$est),  # Num obs used in estimation after splitting
  # Recommended parameters.
  split.Rule="CT",                       # Define the splitting option
  cv.option="TOT",                       # Cross validation options
  cp=0,                                  # Complexity parameter
  split.Honest=TRUE,                     # Use honesty when splitting
  cv.Honest=TRUE                         # Use honesty when performing cross-validation
)
# Table of cross-validated values by tuning parameter.
ct.cptable <- as.data.frame(ct.unpruned$cptable)

# Obtain optimal complexity parameter to prune tree.
cp.selected <- which.min(ct.cptable$xerror)
cp.optimal <- ct.cptable[cp.selected, "CP"]

# Prune the tree at optimal complexity parameter.
ct.pruned <- prune(tree=ct.unpruned, cp=cp.optimal)

# Predict point estimates (on estimation sample)
tau.hat.est <- predict(ct.pruned, newdata=data[indices$est,])

# Create a factor column 'leaf' indicating leaf assignment in the estimation set
num.leaves <- length(unique(tau.hat.est))
leaf <- factor(tau.hat.est, levels=sort(unique(tau.hat.est)), labels=seq(num.leaves))
Note: if your tree is not splitting at all, try decreasing the parameter minsize, which controls the minimum size of each leaf. For more details, please refer to the package documentation.
rpart.plot(
x=ct.pruned, # Pruned tree
type=3, # Draw separate split labels for the left and right directions
fallen=TRUE, # Position the leaf nodes at the bottom of the graph
leaf.round=1, # Rounding of the corners of the leaf node boxes
extra=100, # Display the percentage of observations in the node
branch=.1, # Shape of the branch lines
box.palette="RdBu") # Palette for coloring the node
Following the same logic as above, we can test whether the average treatment effect within each leaf differs from that of the leaf with the smallest (or most negative) treatment effect. This is valid because of our careful sample-splitting scheme.
# Only for experimental data!
fmla <- paste0(outcome, ' ~ ', treatment, '*leaf')
ols <- lm(fmla, data=transform(data[indices$est,], leaf=leaf))
interact <- which(sapply(names(coef(ols)), function(x) grepl("w:", x)))

if (num.leaves > 2) {
  # Use Romano-Wolf test correction if there are three or more leaves
  summary_rw_lm(ols, indices=interact, cov.type='HC2')
} else if (num.leaves == 2) {
  # Case of exactly two leaves
  coeftest(ols, vcov=vcovHC(ols, 'HC2'))[interact,,drop=F]
} else {
  print("Skipping since there's a single leaf.")
}
## Estimate Std. Error t value Pr(>|t|)
## w:leaf2 0.0894818 0.0296649 3.01642 0.00257716
5.3.2 Via grf
Assessing systematic variation in treatment effect heterogeneity is a difficult task. Let’s begin by looking at some popular measures and see how they may lead to subtle pitfalls.
First, having fit a non-parametric method such as a causal forest, a researcher may (incorrectly) start by looking at the distribution of its predictions of the treatment effect. One might be tempted to think: “if the histogram is concentrated at a point, then there is no heterogeneity; if the histogram is spread out, then our estimator has found interesting heterogeneity.” However, both of these assertions may be false!
If the histogram is concentrated at a point, we may simply be underpowered: our method was not able to detect any heterogeneity, but maybe it would detect it if we had more data. If the histogram is spread out, we may be simply overfitting: our model is producing very noisy estimates \(\widehat{\tau}(X_i)\), but in fact the truth \(\tau_i\) could be a lot smoother.
# Do not use!
X <- data[,covariates]
W <- data[[treatment]]
Y <- data[[outcome]]

cf <- causal_forest(X, Y, W, num.trees=200)
tau.hat <- predict(cf)$predictions
hist(tau.hat, main="CATE estimates", freq=F)
Next, the grf package also produces a measure of variable importance that indicates how often a variable was used in a tree split. Much like the histogram above, this can be a rough diagnostic, but it should not be interpreted as indicating that, for example, a variable with low importance is unrelated to heterogeneity. The reasoning is the same as the one presented in the causal trees section: if two covariates are highly correlated, the trees might split on one covariate but not the other; yet if the first were removed, the trees might split on the remaining one, and the leaf definitions might be essentially unchanged (see the simulation sketch following the output below).
var_imp <- c(variable_importance(cf))
names(var_imp) <- covariates
sorted_var_imp <- sort(var_imp, decreasing = TRUE)
sorted_var_imp[1:5]
## income polviews marital educ sex
## 0.4967509 0.1616885 0.1574553 0.1483933 0.0357121
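To see this caveat concretely, here is a small simulation of our own (not part of the welfare data analysis): only x1 enters the true CATE, but its near-duplicate x2 typically receives a sizable share of the importance.

# Simulated illustration: x2 is a near-copy of x1, but only x1 enters the true CATE.
set.seed(1)
n.sim <- 2000
x1 <- rnorm(n.sim)
x2 <- x1 + 0.1 * rnorm(n.sim)   # highly correlated with x1
x3 <- rnorm(n.sim)              # pure noise
W.sim <- rbinom(n.sim, 1, 0.5)
Y.sim <- pmax(x1, 0) * W.sim + rnorm(n.sim)

cf.sim <- causal_forest(cbind(x1, x2, x3), Y.sim, W.sim, num.trees=200)
c(variable_importance(cf.sim))  # importance is typically split between x1 and x2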
5.3.2.1 Finding subgroups via grf
# Valid for both observational and experimental data!
X <- data[,covariates]
W <- data[[treatment]]
Y <- data[[outcome]]

n.folds <- 5
indices <- split(seq(n), sort(seq(n) %% n.folds))

res <- lapply(indices, function(idx) {

  # Fit nuisance models on the data excluding this fold.
  forest.m <- regression_forest(X[-idx,], Y[-idx], num.trees=200)  # increase num.trees
  forest.e <- regression_forest(X[-idx,], W[-idx], num.trees=200)  # increase num.trees

  # Predict on the held-out fold.
  m.hat <- predict(forest.m, X[idx,])$predictions
  e.hat <- predict(forest.e, X[idx,])$predictions

  forest.tau <- causal_forest(X[-idx,], Y[-idx], W[-idx], num.trees=200)  # increase num.trees
  tau.hat <- predict(forest.tau, X[idx,])$predictions

  mu.hat.0 <- m.hat - e.hat * tau.hat
  mu.hat.1 <- m.hat + (1 - e.hat) * tau.hat

  aipw.scores <- (mu.hat.1 - mu.hat.0
                  + W[idx] / e.hat * (Y[idx] - mu.hat.1)
                  - (1 - W[idx]) / (1 - e.hat) * (Y[idx] - mu.hat.0))

  # Rank held-out observations into groups by their estimated CATE.
  tau.hat.quantiles <- quantile(tau.hat, probs=seq(0, 1, length.out=num.groups+1))
  ranking <- cut(tau.hat, tau.hat.quantiles, include.lowest = TRUE,
                 labels = paste0("Q", seq(num.groups)))

  data.frame(aipw.scores, tau.hat, ranking=factor(ranking))
})
res <- do.call(rbind, res)
Having run the code above, we can test, e.g., whether the average effect within each group is larger than the one in the first group.
ols <- lm(aipw.scores ~ ranking, data=res)
coeftest(ols, vcov=vcovHC(ols, type='HC2'))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.38157 0.01964 -19.431 < 2e-16 ***
## rankingQ2 0.01162 0.02754 0.422 0.6730
## rankingQ3 0.04652 0.02745 1.694 0.0902 .
## rankingQ4 0.06669 0.02702 2.469 0.0136 *
## rankingQ5 0.16100 0.02567 6.272 3.7e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And, as before, we may want to correct for multiple hypothesis testing using the same methods as above.
interact <- which(sapply(names(coef(ols)), function(x) grepl("Q", x)))
summary_rw_lm(ols, indices=interact)
## Estimate Std. Error Orig. p-value Adj. p-value
## rankingQ2 0.0116239 0.0264047 6.59786e-01 0.6623
## rankingQ3 0.0465169 0.0263482 7.75158e-02 0.1346
## rankingQ4 0.0666940 0.0263714 1.14531e-02 0.0291
## rankingQ5 0.1609990 0.0263780 1.07604e-09 0.0000
We can also check whether different groups have different average covariate levels across n-tiles of estimated conditional treatment effects. Recalling the warning against over-interpreting variable importance measures: a covariate may not be "important" for splitting and yet vary strongly with treatment effects. Comparing all covariates across n-tiles of treatment effects presents a fuller picture of how high-treatment-effect individuals differ from low-treatment-effect individuals.
mapply(function(covariate) {
  fmla <- formula(paste0(covariate, " ~ 0 + ranking"))
  ols.x <- lm(fmla, data=transform(data, ranking=res$ranking))
  t(coef(summary(ols.x))[, 1:2])
}, covariates[1:3], SIMPLIFY = F)  # Showing only the first three covariates
## $income
## rankingQ1 rankingQ2 rankingQ3 rankingQ4 rankingQ5
## Estimate 10.10907 -47.73729 -104.15519 -135.99099 -139.01554
## Std. Error 6.38224 6.43024 6.40291 6.41412 6.41734
##
## $educ
## rankingQ1 rankingQ2 rankingQ3 rankingQ4 rankingQ5
## Estimate 12.61775 11.85606 10.56587 8.94492 2.98897
## Std. Error 1.27211 1.28168 1.27623 1.27846 1.27910
##
## $marital
## rankingQ1 rankingQ2 rankingQ3 rankingQ4 rankingQ5
## Estimate 1.75310 1.920483 0.437126 2.322984 3.305263
## Std. Error 0.38724 0.390152 0.388494 0.389174 0.389369
5.3.2.2 Partial dependence plots
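A simple way to visualize how the estimated CATE varies with a single covariate is to predict \(\widehat{\tau}\) along a grid of values for that covariate while holding all other covariates fixed, e.g., at their medians. The sketch below is our own minimal illustration; the choice of covariate (educ) and of medians as reference values is arbitrary.

# A minimal partial dependence sketch (illustrative assumptions noted above):
# vary 'educ' along a grid, holding the other covariates at their medians.
covariate.pd <- "educ"   # illustrative choice of covariate
grid.size <- 100
grid <- seq(min(X[, covariate.pd]), max(X[, covariate.pd]), length.out=grid.size)

# Reference dataset: every covariate at its median, except the one we vary.
X.pd <- as.data.frame(lapply(sapply(X, median), rep, grid.size))
X.pd[, covariate.pd] <- grid

# Predict CATEs along the grid, with pointwise variance estimates.
pred <- predict(cf, X.pd, estimate.variance=TRUE)
plot(grid, pred$predictions, type="l", xlab=covariate.pd, ylab="Estimated CATE")
lines(grid, pred$predictions + 1.96 * sqrt(pred$variance.estimates), lty=2)
lines(grid, pred$predictions - 1.96 * sqrt(pred$variance.estimates), lty=2)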
5.3.2.3 Best linear projection
The grf function best_linear_projection provides a (doubly robust) fit to the linear model \(\widehat{\tau}(X_i) = \beta_0 + A_i'\beta_1\), where \(A_i\) can be a subset of the covariate columns.
grf::best_linear_projection(cf, X[,1:2])
##
## Best linear projection of the conditional average treatment effect.
## Confidence intervals are cluster- and heteroskedasticity-robust (HC3):
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.219e-01 9.029e-03 -35.649 <2e-16 ***
## income -2.008e-05 2.898e-05 -0.693 0.488
## educ -1.934e-04 1.605e-04 -1.205 0.228
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
5.3.2.4 Assessing heterogeneity
The function test_calibration provides an omnibus test for the presence of heterogeneity: it computes the best linear fit of the target estimand using the forest's held-out predictions, with the mean forest prediction and the differential forest prediction as regressors. A mean-prediction coefficient near 1 suggests the average prediction is well calibrated, while a differential-prediction coefficient significantly greater than zero suggests the forest has detected genuine heterogeneity.
test_calibration(cf)
##
## Best linear fit using forest predictions (on held-out data)
## as well as the mean forest prediction as regressors, along
## with one-sided heteroskedasticity-robust (HC3) SEs:
##
## Estimate Std. Error t value Pr(>t)
## mean.forest.prediction 0.9979 0.0246 40.564 < 2e-16 ***
## differential.forest.prediction 0.6928 0.0918 7.547 2.42e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1