22.3 Selection Problem
A fundamental challenge in causal inference is that we never observe both potential outcomes for the same individual—only one or the other. This creates the selection problem, which we formalize below.
Assume we have:
- A binary treatment variable $D_i \in \{0, 1\}$, where:
  - $D_i = 1$ indicates that individual $i$ receives the treatment.
  - $D_i = 0$ indicates that individual $i$ does not receive the treatment.
- The outcome of interest $Y_i$, which depends on whether the individual is treated:
  - $Y_{0i}$: the outcome if not treated.
  - $Y_{1i}$: the outcome if treated.
Thus, the potential outcomes framework is defined as:
$$
\text{Potential Outcome} =
\begin{cases}
Y_{1i} & \text{if } D_i = 1 \ (\text{Treated}) \\
Y_{0i} & \text{if } D_i = 0 \ (\text{Untreated})
\end{cases}
$$
However, we only observe one outcome per individual:
$$
Y_i = Y_{0i} + (Y_{1i} - Y_{0i}) D_i
$$
This means that for any given person, we observe either $Y_{1i}$ or $Y_{0i}$, but never both. Since we cannot observe counterfactuals (unless we invent a time machine), we must rely on statistical inference to estimate treatment effects.
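The switching equation above can be made concrete with a short simulation. This is a minimal sketch; the distributions and the effect size of 3 are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

y0 = rng.normal(10, 2, n)   # potential outcome if untreated
y1 = y0 + 3                 # potential outcome if treated (assumed effect of 3)
d = rng.integers(0, 2, n)   # treatment indicator

# Switching equation: Y_i = Y0_i + (Y1_i - Y0_i) * D_i
y = y0 + (y1 - y0) * d

# For each individual we see y1[i] if d[i] == 1, else y0[i] -- never both.
for i in range(n):
    seen = "Y1" if d[i] == 1 else "Y0"
    print(f"i={i}: D={d[i]}, observed {seen} = {y[i]:.2f}")
```

Both potential outcomes exist in the simulation, but the observed `y` reveals only one of them per individual, which is exactly the selection problem.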
22.3.1 The Observed Difference in Outcomes
The goal is to estimate the difference in expected outcomes between treated and untreated individuals:
$$
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]
$$
Expanding this equation:
$$
\begin{aligned}
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]
&= \big( E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 1] \big) + \big( E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0] \big) \\
&= E[Y_{1i} - Y_{0i} \mid D_i = 1] + \big( E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0] \big)
\end{aligned}
$$
This equation decomposes the observed difference into two components:
- **Treatment Effect on the Treated (ATT):** $E[Y_{1i} - Y_{0i} \mid D_i = 1]$, the causal impact of the treatment on those who are treated.
- **Selection Bias:** $E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]$, the systematic difference between treated and untreated groups that would exist even in the absence of treatment.
Thus, the observed difference in outcomes is:
$$
\text{Observed Difference} = \text{ATT} + \text{Selection Bias}
$$
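This decomposition holds as an exact identity in any sample, which a simulation can verify. In the sketch below (all numbers are illustrative assumptions), individuals with high baseline outcomes are more likely to opt into treatment, creating positive selection bias:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

y0 = rng.normal(10, 2, n)
y1 = y0 + 3                                       # constant treatment effect of 3
d = (y0 + rng.normal(0, 2, n) > 10).astype(int)   # selection: high-baseline units opt in

y = y0 + (y1 - y0) * d                            # observed outcome

observed_diff = y[d == 1].mean() - y[d == 0].mean()
att = (y1 - y0)[d == 1].mean()                    # = 3 by construction
selection_bias = y0[d == 1].mean() - y0[d == 0].mean()

print(observed_diff)           # much larger than the true effect of 3
print(att + selection_bias)    # matches the observed difference exactly
```

The naive comparison of means overstates the effect because the treated group would have had higher outcomes even without treatment.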
22.3.2 Eliminating Selection Bias with Random Assignment
With random assignment of treatment, $D_i$ is independent of the potential outcomes. Under true randomization:

$$
E[Y_{0i} \mid D_i = 1] = E[Y_{0i} \mid D_i = 0],
$$

which eliminates selection bias. Consequently, the observed difference directly estimates the true causal effect:

$$
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = E[Y_{1i} - Y_{0i} \mid D_i = 1] = E[Y_{1i} - Y_{0i}]
$$
Thus, randomized controlled trials provide an unbiased estimate of the average treatment effect.
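Rerunning the earlier selection simulation with the treatment assigned by a coin flip shows the bias disappearing. Again, the data-generating values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

y0 = rng.normal(10, 2, n)
y1 = y0 + 3                      # constant treatment effect of 3
d = rng.integers(0, 2, n)        # randomized: independent of (y0, y1)

y = y0 + (y1 - y0) * d

observed_diff = y[d == 1].mean() - y[d == 0].mean()
selection_bias = y0[d == 1].mean() - y0[d == 0].mean()

print(round(observed_diff, 2))   # close to the true effect of 3
print(round(selection_bias, 2))  # close to 0
```

Because $D_i$ is now independent of $Y_{0i}$, the treated and untreated groups have the same expected baseline, and the difference in means recovers the causal effect up to sampling noise.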
22.3.3 Another Representation Under Regression
So far, we have framed the selection problem using expectations and potential outcomes. Another way to represent treatment effects is through regression models, which provide a practical framework for estimation.
Suppose the treatment effect is constant across individuals:
$$
Y_{1i} - Y_{0i} = \rho
$$

This implies that each treated individual experiences the same treatment effect $\rho$, though their baseline outcomes $Y_{0i}$ may vary.
Since we only observe one of the potential outcomes, the observed outcome can be expressed as:
$$
Y_i = E(Y_{0i}) + (Y_{1i} - Y_{0i}) D_i + [Y_{0i} - E(Y_{0i})] = \alpha + \rho D_i + \eta_i
$$
where:
- $\alpha = E(Y_{0i})$, the expected outcome for untreated individuals.
- $\rho$, the causal treatment effect.
- $\eta_i = Y_{0i} - E(Y_{0i})$, the individual deviation from the mean untreated outcome.
Thus, the regression model provides an intuitive way to express treatment effects.
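Under random assignment, an ordinary least squares fit of $Y_i$ on a constant and $D_i$ recovers $\alpha$ and $\rho$. A minimal sketch, with $\alpha = 5$ and $\rho = 2.5$ as assumed illustrative values:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

rho = 2.5
y0 = rng.normal(5, 1, n)             # baseline outcomes; alpha = E(Y0) = 5
d = rng.integers(0, 2, n)            # randomized treatment
y = y0 + rho * d                     # Y_i = alpha + rho*D_i + eta_i

# OLS of Y on a constant and D
X = np.column_stack([np.ones(n), d])
alpha_hat, rho_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print(round(alpha_hat, 2), round(rho_hat, 2))   # close to 5 and 2.5
```

With a single binary regressor, `rho_hat` is numerically identical to the difference in group means, so the regression and the expectations-based framing give the same estimate.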
22.3.3.1 Conditional Expectations and Selection Bias
Taking expectations conditional on treatment status:
$$
\begin{aligned}
E[Y_i \mid D_i = 1] &= \alpha + \rho + E[\eta_i \mid D_i = 1] \\
E[Y_i \mid D_i = 0] &= \alpha + E[\eta_i \mid D_i = 0]
\end{aligned}
$$
The observed difference in means between treated and untreated groups is:
$$
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = \rho + E[\eta_i \mid D_i = 1] - E[\eta_i \mid D_i = 0]
$$

Here, the term $E[\eta_i \mid D_i = 1] - E[\eta_i \mid D_i = 0]$ represents selection bias: the correlation between the regression error term $\eta_i$ and the treatment variable $D_i$.
Under random assignment, we assume that potential outcomes are independent of treatment $D_i$:

$$
E[\eta_i \mid D_i = 1] - E[\eta_i \mid D_i = 0] = E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0] = 0
$$
Thus, under true randomization, selection bias disappears, and the observed difference directly estimates the causal effect ρ.
22.3.3.2 Controlling for Additional Variables
In many real-world scenarios, random assignment is imperfect, and selection bias may still exist. To mitigate this, we introduce control variables ($X_i$), such as demographic characteristics, firm size, or prior purchasing behavior.

If $X_i$ is uncorrelated with the treatment $D_i$, including it in our regression model does not bias the estimate of $\rho$ and has two advantages:
- It reduces the residual variance, improving the precision of the estimate of $\rho$.
- It accounts for additional sources of variability, making the model more robust.
Thus, our regression model extends to:
$$
Y_i = \alpha + \rho D_i + X_i' \gamma + \eta_i
$$
where:
- $X_i$ represents a vector of control variables.
- $\gamma$ captures the effect of $X_i$ on the outcome.
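The precision gain from a control uncorrelated with treatment can be seen by comparing the standard error of $\rho$ with and without it. A sketch under assumed values ($\rho = 1$, one control with coefficient $0.5$):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

rho = 1.0
x = rng.normal(0, 3, n)              # control variable, independent of D
d = rng.integers(0, 2, n)            # randomized treatment
y = 2.0 + rho * d + 0.5 * x + rng.normal(0, 1, n)

def ols(X, y):
    """OLS coefficients and classical standard errors."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se

# Without the control: rho_hat is unbiased but noisier
b_short, se_short = ols(np.column_stack([np.ones(n), d]), y)
# With the control: same target rho, smaller standard error
b_long, se_long = ols(np.column_stack([np.ones(n), d, x]), y)

print(round(b_short[1], 2), round(se_short[1], 3))
print(round(b_long[1], 2), round(se_long[1], 3))
```

Both specifications estimate the same $\rho$, but adding $x$ absorbs part of the outcome variance, shrinking the standard error on the treatment coefficient.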
22.3.3.3 Example: Racial Discrimination in Hiring
A famous study by Bertrand and Mullainathan (2004) examined racial discrimination in hiring by randomly assigning Black- and White-sounding names to identical job applications. By ensuring that names were assigned randomly, the authors eliminated confounding factors like education and experience, allowing them to estimate the causal effect of race on callback rates.