20.2 Mathematical Setup

20.2.1 Probability Space and Data

We posit a probability space \((\Omega, \mathcal{F}, P)\) and random variables \((X, Y)\) on it. We typically have an i.i.d. sample \(\{(X_i, Y_i)\}_{i=1}^n\) from the true distribution \(\mathcal{D}\). Let:

\[ (X, Y) \sim \mathcal{D}, \quad (X_i, Y_i) \overset{\text{i.i.d.}}{\sim} \mathcal{D}. \]

In prediction, we train on \(\{(X_i, Y_i)\}_{i=1}^n\) to obtain \(\hat{f}\), and we evaluate on a test point \((\tilde{X}, \tilde{Y})\) drawn from \(\mathcal{D}\). In causal inference, we scrutinize the data-generating process carefully to ensure that the causal effect of interest is identified. For example, we may require:

  • Potential outcomes \(\{Y_i(0), Y_i(1)\}\) for treatment effect settings.
  • Unconfoundedness or randomization assumptions.
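The role of these assumptions can be seen in a toy simulation. The sketch below (all numbers and the data-generating process are invented for illustration) draws potential outcomes \(\{Y_i(0), Y_i(1)\}\), assigns treatment at random so that unconfoundedness holds by construction, and recovers the average treatment effect by a difference in means:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical potential outcomes: Y(1) = Y(0) + 2, so the true ATE is 2.
y0 = rng.normal(loc=1.0, scale=1.0, size=n)
y1 = y0 + 2.0

# Randomized treatment assignment makes T independent of (Y(0), Y(1)).
t = rng.integers(0, 2, size=n)

# We observe only one potential outcome per unit: Y = Y(1) if treated, else Y(0).
y = np.where(t == 1, y1, y0)

# Difference-in-means estimate of the ATE.
ate_hat = y[t == 1].mean() - y[t == 0].mean()
print(f"ATE estimate: {ate_hat:.2f}")  # close to the true ATE of 2
```

Without randomization (or an unconfoundedness argument), the difference in means would conflate the treatment effect with selection into treatment.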

20.2.2 Loss Functions and Risk

A general framework for both tasks is the risk minimization approach. For a function \(f\), define:

  • The population (or expected) risk: \[ \mathcal{R}(f) = \mathbb{E}[L(f(X), Y)]. \]
  • The empirical risk (on a sample of size \(n\)): \[ \hat{\mathcal{R}}_n(f) = \frac{1}{n} \sum_{i=1}^n L(f(X_i), Y_i). \]
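These two quantities can be made concrete in a short sketch (the data-generating process and the fixed function \(f(x) = 2x\) here are invented for illustration): the empirical risk is a sample average of losses, and by the law of large numbers it approximates the population risk.

```python
import numpy as np

def empirical_risk(f, X, Y, loss):
    """Empirical risk: (1/n) * sum_i L(f(X_i), Y_i)."""
    return np.mean([loss(f(x), y) for x, y in zip(X, Y)])

def squared_loss(pred, y):
    return (pred - y) ** 2

# Toy sample from a hypothetical D: Y = 2X + noise, noise ~ N(0, 0.5^2).
rng = np.random.default_rng(1)
X = rng.normal(size=500)
Y = 2 * X + rng.normal(scale=0.5, size=500)

# For f(x) = 2x, the population risk E[(f(X) - Y)^2] is the noise
# variance 0.25; the empirical risk is a noisy estimate of it.
r_hat = empirical_risk(lambda x: 2 * x, X, Y, squared_loss)
```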

Prediction: We often solve the empirical risk minimization (ERM) problem:

\[ \hat{f} = \arg \min_{f \in \mathcal{F}} \hat{\mathcal{R}}_n(f), \]

possibly with regularization. The measure of success is \(\mathcal{R}(\hat{f})\), i.e., how well \(\hat{f}\) generalizes beyond the training sample.
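As a minimal sketch of ERM and its evaluation (the linear model, coefficients, and noise level here are invented for illustration), we can minimize the empirical squared-error risk over the class of affine functions, which reduces to ordinary least squares, and then approximate \(\mathcal{R}(\hat{f})\) with a large fresh sample:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    """Draw n points from a hypothetical D: Y = 3X + 1 + noise."""
    X = rng.normal(size=n)
    Y = 3 * X + 1 + rng.normal(scale=1.0, size=n)
    return X, Y

# ERM over the affine class f(x) = a*x + b with squared loss
# has a closed form: ordinary least squares on the training sample.
X_tr, Y_tr = sample(1000)
A = np.column_stack([X_tr, np.ones_like(X_tr)])
coef, *_ = np.linalg.lstsq(A, Y_tr, rcond=None)
f_hat = lambda x: coef[0] * x + coef[1]

# Approximate the population risk R(f_hat) on a large fresh test sample.
X_te, Y_te = sample(100_000)
test_risk = np.mean((f_hat(X_te) - Y_te) ** 2)
# test_risk is close to the noise variance 1.0, the best achievable
# risk within this class for this data-generating process.
```

The gap between the empirical risk of \(\hat{f}\) on the training sample and its risk on fresh data is precisely the generalization question that the prediction task cares about.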

Causal/Parameter Estimation: We might define an \(M\)-estimator for \(\beta\) (Newey and McFadden 1994). Consider a function \(\psi(\beta; X, Y)\) such that the true parameter \(\beta_0\) satisfies:

\[ \mathbb{E}[\psi(\beta_0; X, Y)] = 0. \]

The empirical \(M\)-estimator solves

\[ \hat{\beta} = \arg \min_\beta \left\| \frac{1}{n} \sum_{i=1}^n \psi(\beta; X_i, Y_i) \right\|, \]

or, when an exact root exists, equivalently sets the sample moment to zero in a method-of-moments sense:

\[ \frac{1}{n} \sum_{i=1}^n \psi(\hat{\beta}; X_i, Y_i) = 0. \]

Properties like consistency (\(\hat{\beta} \overset{p}{\to} \beta_0\)) or asymptotic normality (\(\sqrt{n}(\hat{\beta} - \beta_0) \overset{d}{\to} N(0, \Sigma)\)) are central. The emphasis is on uncovering the true \(\beta_0\) rather than purely predictive accuracy.
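A concrete instance of this moment-condition approach (the model and moment function here are invented for illustration): for the no-intercept linear model \(Y = \beta_0 X + \varepsilon\) with \(\mathbb{E}[\varepsilon X] = 0\), the choice \(\psi(\beta; X, Y) = (Y - \beta X)X\) satisfies \(\mathbb{E}[\psi(\beta_0; X, Y)] = 0\), and the sample moment equation has a closed-form root:

```python
import numpy as np

rng = np.random.default_rng(3)

# Moment function for the no-intercept model Y = beta0*X + eps:
# psi(beta; X, Y) = (Y - beta*X) * X, with E[psi(beta0; X, Y)] = 0.
def psi(beta, X, Y):
    return (Y - beta * X) * X

beta0 = 1.5
n = 50_000
X = rng.normal(size=n)
Y = beta0 * X + rng.normal(size=n)

# Solve (1/n) * sum_i psi(beta_hat; X_i, Y_i) = 0.
# For this psi the root is beta_hat = sum(X*Y) / sum(X**2), i.e. OLS.
beta_hat = np.sum(X * Y) / np.sum(X ** 2)

# Consistency in action: beta_hat concentrates around beta0 = 1.5,
# and the solved moment condition holds to numerical precision.
avg_moment = np.mean(psi(beta_hat, X, Y))
```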


References

Newey, Whitney K., and Daniel McFadden. 1994. “Large Sample Estimation and Hypothesis Testing.” Handbook of Econometrics 4: 2111–2245.