3 Introduction to Causal Inference

3.1 Notation and definitions


Each unit in our data set will be represented by:

  • \(X_{i}\): a \(p\)-dimensional vector of observable pre-treatment characteristics
  • \(W_{i} \in \{0, 1\}\): a binary variable indicating whether unit \(i\) was treated (\(1\)) or not (\(0\))
  • \(Y_{i}^{obs} \in \mathbb{R}\): a real-valued variable indicating the observed outcome for that unit

Throughout our analysis, we will often be talking in terms of counterfactual questions, e.g. “what would have happened if we had assigned the treatment to certain control units?” In order to express this mathematically, we will make use of the potential outcome framework of Rubin (1974). For example, for the case of a binary treatment (treat or not), we can define the following random variables:

  • \(Y_{i}(1)\): the outcome unit \(i\) would attain if they received the treatment
  • \(Y_{i}(0)\): the outcome unit \(i\) would attain if they were part of the control group

Naturally, we only ever get to observe one of these two for each unit, but it’s convenient to define both so that we can reason about counterfactuals. In fact, we can think of much of causal inference as a “missing value” problem: there exists an underlying data-generating process producing random variables \((X_{i}, Y_{i}(0), Y_{i}(1))\), but we can only observe the realization of \((X_{i}, Y_{i}(0))\) (for control units) or \((X_{i}, Y_{i}(1))\) (for treated units), with the remaining potential outcome being missing.

| \(X_{i}\) | \(Y_{i}(0)\) | \(Y_{i}(1)\) |
|-----------|--------------|--------------|
| \(X_{1}\) | \(Y_{1}(0)\) | \(Y_{1}(1)\) |
| \(X_{2}\) | \(Y_{2}(0)\) | \(Y_{2}(1)\) |
| \(X_{3}\) | \(Y_{3}(0)\) | \(Y_{3}(1)\) |
| \(\cdots\) | \(\cdots\) | \(\cdots\) |
| \(X_{n}\) | \(Y_{n}(0)\) | \(Y_{n}(1)\) |

Using the potential outcome notation above, the observed outcome can also be written as

\[Y_{i}^{obs} = W_{i}Y_{i}(1) + (1-W_{i})Y_{i}(0)\]

In order to avoid clutter, from now on we’ll denote \(Y_{i}^{obs}\) simply by \(Y_{i}\).
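The identity above is easy to verify in a small simulation. The sketch below uses an entirely made-up data-generating process (our own illustration, not part of the tutorial's data) to construct potential outcomes and check that the observed outcome equals \(Y_i(1)\) for treated units and \(Y_i(0)\) for control units.

```python
import numpy as np

# Illustrative simulation: draw covariates, potential outcomes, and a
# treatment indicator, then construct the observed outcome
#   Y_i = W_i * Y_i(1) + (1 - W_i) * Y_i(0).
rng = np.random.default_rng(0)
n, p = 1000, 3

X = rng.normal(size=(n, p))          # pre-treatment covariates
Y0 = X[:, 0] + rng.normal(size=n)    # potential outcome under control
Y1 = Y0 + 2.0                        # potential outcome under treatment
W = rng.binomial(1, 0.5, size=n)     # binary treatment indicator

Y = W * Y1 + (1 - W) * Y0            # only this combination is ever observed
```

Note that although the simulation generates both `Y0` and `Y1`, an analyst would only ever see `X`, `W`, and `Y`; the unchosen potential outcome plays the role of the missing value discussed above.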

3.2 Settings

In this tutorial we will assume we have one of two types of data:

3.2.0.1 Experimental setting

The main feature of this setting is randomized treatment assignment, often with known probabilities of assignment.

The crucial assumption is unconfoundedness (also called ignorability), which states that treatment assignment is independent of the potential outcomes,

\[W_i \perp (Y_i(0), Y_i(1)).\]

This assumption rules out situations where observations self-select into treatment. It will usually not be true in an observational setting. Under unconfoundedness, the difference-in-means estimator is unbiased for the ATE.
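As a quick sanity check on that last claim, here is a small simulation (with an illustrative data-generating process of our own) in which treatment is randomized, so unconfoundedness holds by construction, and the difference-in-means estimate lands near the true ATE.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

X = rng.normal(size=n)                # covariate (ignored by assignment)
Y0 = X + rng.normal(size=n)           # potential outcome under control
Y1 = Y0 + 2.0                         # constant treatment effect: true ATE = 2
W = rng.binomial(1, 0.5, size=n)      # randomization => W independent of (Y0, Y1)
Y = W * Y1 + (1 - W) * Y0             # observed outcome

diff_in_means = Y[W == 1].mean() - Y[W == 0].mean()
print(diff_in_means)                  # close to the true ATE of 2
```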

3.2.0.2 Observational setting

The first assumption that is required is (conditional) unconfoundedness (also called selection on observables or (conditional) ignorability). This assumption states that, conditional on observable characteristics, treatment assignment is independent of the potential outcomes, \[W_i \perp (Y_i(0), Y_i(1)) \, \big| \, X_i.\]

This assumption is satisfied if treatment is assigned solely based on observable covariates. For example, if age is observable, then this allows for situations where younger people are more (or less) likely to receive treatment than older people.

The second assumption is stated in terms of the propensity score \(e(x) := P[W_i = 1 \, | \, X_i = x]\), the probability of receiving treatment given covariates.

This assumption, called overlap, is that everyone in our data had at least some probability of being in either the control or treatment group. That is, there isn’t anyone who is deterministically sent to treatment or control, \[\eta < e(x) < 1 - \eta \quad \text{for some } \eta > 0 \text{ and all } x.\]

Under these two assumptions, we will be able to derive an estimator of the ATE that has several good statistical properties.
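To preview why these assumptions matter, the sketch below (again with a made-up data-generating process, using inverse-propensity weighting purely as an illustration, not necessarily the estimator derived later) lets assignment depend on \(X_i\). The naive difference in means is now biased, but reweighting by the true propensity score, which is bounded inside \((0.1, 0.9)\) so overlap holds, recovers the ATE.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

X = rng.normal(size=n)
e = 0.1 + 0.8 / (1 + np.exp(-X))      # true propensity score in (0.1, 0.9): overlap holds
Y0 = X + rng.normal(size=n)
Y1 = Y0 + 2.0                          # true ATE = 2
W = rng.binomial(1, e)                 # assignment depends on X, but only on X:
Y = W * Y1 + (1 - W) * Y0              # conditional unconfoundedness holds

naive = Y[W == 1].mean() - Y[W == 0].mean()       # biased upward: treated units have larger X
ipw = np.mean(W * Y / e - (1 - W) * Y / (1 - e))  # inverse-propensity-weighted estimate
print(naive, ipw)                                  # naive overshoots 2; ipw is close to 2
```

In practice \(e(x)\) is of course unknown and must be estimated from data, which is part of what makes the observational setting harder than the experimental one.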