19 Experimental Design

Randomized Control Trials (RCT) or Experiments have always been and are likely to continue in the future to be the holy grail of causal inference, because of
- unbiased estimates
- elimination of confounding factors on average (covariate imbalance is always possible. Hence, you want to do Rerandomization to achieve platinum standard set by (Tukey 1993))
RCT means you have two group treatment (or experimental) gorp and control group. Hence, as you introduce the treatment (your exogenous variable) to the treatment group, the only expected difference in the outcomes of the two group should be due to the treatment.
Subjects from the same population will be randomly assigned to either treatment or control group. This random assignment give us the confidence that changes in the outcome variable will be due only the treatment, not any other source (variable).
It can be easier for hard science to have RCT because they can introduce the treatment, and have control environments. But it’s hard for social scientists because their subjects are usually human, and some treatment can be hard to introduce, or environments are uncontrollable. Hence, social scientists have to develop different tools (Quasi-experimental) to recover causal inference or to recreate the treatment and control group environment.
With RCT, you can easily establish internal validity
Even though random assignment is not the same thing as ceteris paribus (i.e., holding everything else constant), it should have the same effect (i.e., under random manipulation, other things equal can be observed, on average, across treatment and control groups).

Selection Problem

Assume we have

binary treatment \(D_i =(0,1)\)
outcome of interest \(Y_i\) for individual \(i\)
- \(Y_{0i}\) are those were not treated
- \(Y_{1i}\) are those were treated

\[ \text{Potential Outcome} = \begin{cases} Y_{1i} \text{ if } D_i = 1 \\ Y_{0i} \text{ if } D_i = 0 \end{cases} \]

Then, what we observe in the outcome variable is

\[ Y_i = Y_{0i} + (Y_{1i} - Y_{0i})D_i \]

It’s likely that \(Y_{1i}\) and \(Y_{0i}\) both have their own distributions (i.e., different treatment effect for different people). Since we can’t see both outcomes for the same individual (unless we have a time machine), then we can only make inference regarding the average outcome of those who were treated and those who were not.

\[ \begin{aligned} E[Y_i | D_i = 1] - E[Y_i | D_i = 0] &= (E[Y_{1i} | D_i = 1] - E[Y_{0i}|D_i = 1] ) + (E[Y_{0i} |D_i = 1] - E[Y_{0i} |D_i = 0]) \\ &= (E[Y_{1i}-Y_{0i}|D_i = 1] ) + (E[Y_{0i} |D_i = 1] - E[Y_{0i} |D_i = 0]) \\ \text{Observed difference in treatment} &= \text{Average treatment effect on the treated} + \text{Selection bias} \end{aligned} \]

The average treatment effect is the average between between a person who is treated and the same person (in another parallel universe) who is not treated
The selection bias is the difference between those who were treated and those who weren’t treated

With random assignment of treatment (\(D_i\)) under Experimental Design, we can have \(D_i\) independent of potential outcomes

\[ \begin{aligned} E[Y_i | D_i = 1] - E[Y_i|D_i = 0] &= E[Y_{1i}|D_i = 1]-E[Y_{0i}|D_i = 0)]\\ &= E[Y_{1i}|D_i = 1]-E[Y_{0i}|D_i = 0)] && D_i \perp Y_i \\ &= E[Y_{1i} - Y_{0i}|D_i = 1] \\ &= E[Y_{1i} - Y_{0i}] \end{aligned} \]

Another representation under regression

Suppose that you know the effect is

\[ Y_{1i} - Y_{0i} = \rho \]

The observed outcome variable (for an individual) can be rewritten as

\[ \begin{aligned} Y_i &= E(Y_{0i}) + (Y_{1i}-Y_{0i})D_i + [Y_{0i} - E(Y_{0i})]\\ &= \alpha + \rho D_i + \eta_i \end{aligned} \]

where \(\eta_i\) = random variation of \(Y_{0i}\)

Hence, the conditional expectation of an individual outcome on treatment status is

\[ \begin{aligned} E[Y_i |D_i = 1] &= \alpha + \rho &+ E[\eta_i |D_i = 1] \\ E[Y_i |D_i = 0] &= \alpha &+ E[\eta_i |D_i = 0] \end{aligned} \]

Thus,

\[ E[Y_i |D_i = 1] - E[Y_i |D_i = 0] = \rho + E[\eta_i |D_i = 1] -E[\eta_i |D_i = 0] \]

where \(E[\eta_i |D_i = 1] -E[\eta_i |D_i = 0]\) is the selection bias - correlation between the regression error term (\(\eta_i\)), and the regressor (\(D_i\))

Under regression, we have

\[ E[\eta_i |D_i = 1] -E[\eta_i |D_i = 0] = E[Y_{0i} |D_i = 1] -E[Y_{0i}|D_i = 0] \]

which is the difference in outcomes between those who weren’t treated get treated and those who weren’t treated stay untreated

Say you have control variables (\(X_i\)), that is uncorrelated with the treatment (\(D_i\)), then you can include in your model, and it won’t (in principle) affect your estimate of the treatment effect (\(\rho\)) with an added benefit of reducing the residual variance, which subsequently reduces the standard error of other estimates.

\[ Y_i = \alpha + \rho D_i + X_i'\gamma + \eta_i \]

Examples:

Bertrand and Mullainathan (2004) randomly assign race to a job application to study the effect of race on callbacks.

19.1 Notes

For outcomes with 0s, we can’t use log-like transformation, because it’s sensitive to outcome unit (J. Chen and Roth 2023). For info on this issue, check [Zero-valued Outcomes]. We should use:

Percentage changes in the Average: by using Poisson QMLE, we can interpret the coefficients of the effect of treatment on the treated group relative to the mean of the control group.
Extensive vs. Intensive Margins: Distinguish the treatment effect on the intensive (outcome: 10 to 11) vs. extensive margins (outcome: 0 to 1).
- To get the bounds for the intensive-margin, use Lee (2009) (assuming that treatment has a monotonic effect on outcome)

set.seed(123) # For reproducibility
library(tidyverse)

n <- 1000 # Number of observations
p_treatment <- 0.5 # Probability of being treated

# Step 1: Generate the treatment variable D
D <- rbinom(n, 1, p_treatment)

# Step 2: Generate potential outcomes
# Untreated potential outcome (mostly zeroes)
Y0 <- rnorm(n, mean = 0, sd = 1) * (runif(n) < 0.3)

# Treated potential outcome (shifting both the probability of being positive - extensive margin and its magnitude - intensive margin)
Y1 <- Y0 + rnorm(n, mean = 2, sd = 1) * (runif(n) < 0.7)

# Step 3: Combine effects based on treatment
Y_observed <- (1 - D) * Y0 + D * Y1

# Add explicit zeroes to model situations with no effect
Y_observed[Y_observed < 0] <- 0


data <-
    data.frame(
        ID = 1:n,
        Treatment = D,
        Outcome = Y_observed,
        X = rnorm(n)
    ) |>
    # whether outcome is positive
    dplyr::mutate(positive = Outcome > 0)

# Viewing the first few rows of the dataset
head(data)
#>   ID Treatment   Outcome          X positive
#> 1  1         0 0.0000000  1.4783345    FALSE
#> 2  2         1 2.2369379 -1.4067867     TRUE
#> 3  3         0 0.0000000 -1.8839721    FALSE
#> 4  4         1 3.2192276 -0.2773662     TRUE
#> 5  5         1 0.6649693  0.4304278     TRUE
#> 6  6         0 0.0000000 -0.1287867    FALSE

hist(data$Outcome)

Percentage changes in the Average

library(fixest)
res_pois <-
    fepois(
        fml = Outcome ~ Treatment + X,
        data = data, 
        vcov = "hetero"
    )
etable(res_pois)
#>                           res_pois
#> Dependent Var.:            Outcome
#>                                   
#> Constant        -2.223*** (0.1440)
#> Treatment        2.579*** (0.1494)
#> X                  0.0235 (0.0406)
#> _______________ __________________
#> S.E. type       Heteroskedas.-rob.
#> Observations                 1,000
#> Squared Cor.               0.33857
#> Pseudo R2                  0.26145
#> BIC                        1,927.9
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To calculate the proportional effect

# proportional effect
exp(coefficients(res_pois)["Treatment"]) - 1
#> Treatment 
#>  12.17757

# SE
exp(coefficients(res_pois)["Treatment"]) *
    sqrt(res_pois$cov.scaled["Treatment", "Treatment"])
#> Treatment 
#>  1.968684

Hence, we conclude that the treatment effect is 1215% higher for the treated group as compared to the control group.

Extensive vs. Intensive Margins

Here, we can estimate the intensive-margin treatment effect (i.e., the treatment effect for “always-takers”).

res <- causalverse::lee_bounds(
    df = data,
    d = "Treatment",
    m = "positive",
    y = "Outcome",
    numdraws = 10
) |> 
    causalverse::nice_tab(2)
print(res)
#>          term estimate std.error
#> 1 Lower bound    -0.22      0.09
#> 2 Upper bound     2.77      0.14

Since in this case, the bounds contains 0, we can’t say much about the intensive margin for always-takers.

If we aim to examine the sensitivity of always-takers, we should consider scenarios where the average outcome of compliers are \(100 \times c\%\) lower or higher than that of always-takers.

We assume that \(E(Y(1)|Complier) = (1-c)E(Y(1)|Always-taker)\)

set.seed(1)
c_values = c(.1, .5, .7)

combined_res <- bind_rows(lapply(c_values, function(c) {
    res <- causalverse::lee_bounds(
        df = data,
        d = "Treatment",
        m = "positive",
        y = "Outcome",
        numdraws = 10,
        c_at_ratio = c
    )
    
    res$c_value <- as.character(c)
    return(res)
}))

combined_res |> 
    dplyr::select(c_value, everything()) |> 
    causalverse::nice_tab()
#>   c_value           term estimate std.error
#> 1     0.1 Point estimate     6.60      0.71
#> 2     0.5 Point estimate     2.54      0.13
#> 3     0.7 Point estimate     1.82      0.08

If we assume \(c = 0.1\) (i.e., under treatment, compliers would have an outcome equal to 10% of the outcome of always-takers), then the intensive-margin effect for always-takers is 6.6 more in the unit of the outcome.
If we assume \(c = 0.5\) (i.e., under treatment, compliers would have an outcome equal to 50% of the outcome of always-takers), then the intensive-margin effect for always-takers is 2.54 more in the unit of the outcome.

19.2 Semi-random Experiment

Chicago Open Enrollment Program (Cullen, Jacob, and Levitt 2005)

Students can apply to “choice” schools
Many schools are oversubscribed (Demand > Supply)
Resolve scarcity via random lotteries
Non-random enrollment, we only have random lottery which mean the above

Let

\[ \delta_j = E(Y_i | Enroll_{ij} = 1; Apply_{ij} = 1) - E(Y_i | Enroll_{ij} = 0; Apply_{ij} = 1) \]

and

\[ \theta_j = E(Y_i | Win_{ij} = 1; Apply_{ij} = 1) - E(Y_i | Win_{ij} = 0; Apply_{ij} = 1) \]

Hence, we can clearly see that \(\delta_j \neq \theta_j\) because you can only enroll, but you cannot ensure that you will win. Thus, intention to treat is different from treatment effect.

Non-random enrollment, we only have random lottery which means we can only estimate \(\theta_j\)

To recover the true treatment effect, we can use

\[ \delta_j = \frac{E(Y_i|W_{ij} = 1; A_{ij} = 1) - E(Y_i | W_{ij}=0; A_{ij} = 1)}{P(Enroll_{ij} = 1| W_{ij}= 1; A_{ij}=1) - P(Enroll_{ij} = 1| W_{ij}=0; A_{ij}=1)} \]

where

\(\delta_j\) = treatment effect
\(W\) = Whether students win the lottery
\(A\) = Whether student apply for the lottery
i = application
j = school

Say that we have

10 win

Number students	Type	Selection effect	Treatment effect	Total effect
1	Always Takers	+0.2	+1	+1.2
2	Compliers	0	+1	+1
7	Never Takers	-0.1	0	-0.1

10 lose

Number students	Type	Selection effect	Treatment effect	Total effect
1	Always Takers	+0.2	+1	+1.2
2	Compliers	0	0	0
7	Never Takers	-0.1	0	-0.1

Intent to treatment = Average effect of who you give option to choose

\[ \begin{aligned} E(Y_i | W_{ij}=1; A_{ij} = 1) &= \frac{1*(1.2)+ 2*(1) + 7 * (-0.1)}{10}\\ &= 0.25 \end{aligned} \]

\[ \begin{aligned} E(Y_i | W_{ij}=0; A_{ij} = 1) &= \frac{1*(1.2)+ 2*(0) + 7 * (-0.1)}{10}\\ &= 0.05 \end{aligned} \]

Hence,

\[ \begin{aligned} \text{Intent to treatment} &= 0.25 - 0.05 = 0.2 \\ \text{Treatment effect} &= 1 \end{aligned} \]

\[ \begin{aligned} P(Enroll_{ij} = 1 | W_{ij} = 1; A_{ij}=1 ) &= \frac{1+2}{10} = 0.3 \\ P(Enroll_{ij} = 1 | W_{ij} = 0; A_{ij}=1 ) &= \frac{1}{10} = 0.1 \end{aligned} \]

\[ \text{Treatment effect} = \frac{0.2}{0.3-0.1} = 1 \]

After knowing how to recover the treatment effect, we turn our attention to the main model

\[ Y_{ia} = \delta W_{ia} + \lambda L_{ia} + e_{ia} \]

where

\(W\) = whether a student wins a lottery
\(L\) = whether student enrolls in the lottery
\(\delta\) = intent to treat

Hence,

Conditional on lottery, the \(\delta\) is valid
But without lottery, your \(\delta\) is not random
Winning and losing are only identified within lottery
Each lottery has multiple entries. Thus, we can have within estimator

We can also include other control variables (\(X_i \theta\))

\[ Y_{ia} = \delta_1 W_{ia} + \lambda_1 L_{ia} + X_i \theta + u_{ia} \]

\[ \begin{aligned} E(\delta) &= E(\delta_1) \\ E(\lambda) &\neq E(\lambda_1) && \text{because choosing a lottery is not random} \end{aligned} \]

Including \((X_i \theta)\) just shifts around control variables (i.e., reweighting of lottery), which would not affect your treatment effect \(E(\delta)\)

19.3 Rerandomization

Since randomization only balances baseline covariates on average, imbalance in variables due to random chance can still happen.
In case that you have a “bad” randomization (i.e., imbalance for important baseline covariates), (Morgan and Rubin 2012) introduce the idea of rerandomization.
Rerandomization is checking balance during the randomization process (before the experiment), to eliminate bad allocation (i.e., those with unacceptable balance).
The greater the number of variables, the greater the likelihood that at least one covariate would be imbalanced across treatment groups.
- Example: For 10 covariates, the probability of a significant difference at \(\alpha = .05\) for at least one covariate is \(1 - (1-.05)^{10} = 0.4 = 40\%\)
Rerandomization increase treatment effect estimate precision if the covariates are correlated with the outcome.
- Improvement in precision for treatment effect estimate depends on (1) improvement in covariate balance and (2) correlation between covariates and the outcome.
You also need to take into account rerandomization into your analysis when making inference.
Rerandomization is equivalent to increasing our sample size.
Alternatives include
- Stratified randomization (Johansson and Schultzberg 2022)
- Matched randomization (Greevy et al. 2004; Kapelner and Krieger 2014)
- Minimization (Pocock and Simon 1975)

Rerandomization Criterion

Acceptable randomization is based on a function of covariate matrix \(\mathbf{X}\) and vector of treatment assignments \(\mathbf{W}\)

\[ W_i = \begin{cases} 1 \text{ if treated} \\ 0 \text{ if control} \end{cases} \]

Mahalanobis Distance, \(M\), can be used as criteria for acceptable balance

Let \(M\) be the multivariate distance between groups means

\[ \begin{aligned} M &= (\bar{\mathbf{X}}_T - \bar{\mathbf{X}}_C)' cov(\bar{\mathbf{X}}_T - \bar{\mathbf{X}}_C)^{-1} (\bar{\mathbf{X}}_T - \bar{\mathbf{X}}_C) \\ &= (\frac{1}{n_T}+ \frac{1}{n_C})^{-1}(\bar{\mathbf{X}}_T - \bar{\mathbf{X}}_C)' cov(\mathbf{X})^{-1}(\bar{\mathbf{X}}_T - \bar{\mathbf{X}}_C) \end{aligned} \]

With large sample size and “pure” randomization \(M \sim \chi^2_k\) where \(k\) is the number of covariates to be balanced

Then let \(p_a\) be the probability of accepting a randomization. Choosing appropriate \(p_a\) is a tradeoff between balance and time.

Then the rule of thumb is re-randomize when \(M > a\)

19.4 Two-Stage Randomized Experiments with Interference and Noncompliance

(Imai, Jiang, and Malani 2021)

18 Causal Inference

20 Sampling