20.2 Mathematical Setup

20.2.1 Probability Space and Data

We posit a probability space (\Omega, \mathcal{F}, \mathbb{P}) and random variables (X, Y) on it. We typically have an i.i.d. sample \{(X_i, Y_i)\}_{i=1}^n from the true distribution \mathcal{D}. Let:

(X, Y) \sim \mathcal{D}, \qquad (X_i, Y_i) \overset{\text{i.i.d.}}{\sim} \mathcal{D}.

In prediction, we train on \{(X_i, Y_i)\}_{i=1}^n to obtain \hat{f}, and we evaluate on a test point (\tilde{X}, \tilde{Y}) drawn from \mathcal{D}. In causal inference, we scrutinize the data-generating process carefully, ensuring that the causal effect of interest is identified. For example, we may require:

  • Potential outcomes \{Y_i(0), Y_i(1)\} for treatment effect settings.
  • Unconfoundedness or randomization assumptions.
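To make the role of these assumptions concrete, here is a small simulation sketch (the data-generating process, effect size, and all variable names are assumptions for illustration, not from the text): under randomized treatment assignment, only one potential outcome is observed per unit, yet the difference in sample means recovers the average treatment effect \mathbb{E}[Y(1) - Y(0)].

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

X = rng.normal(size=n)             # covariate
Y0 = X + rng.normal(size=n)        # potential outcome under control
Y1 = X + 2.0 + rng.normal(size=n)  # potential outcome under treatment: true ATE = 2

# Randomized treatment: T is independent of (Y0, Y1), so unconfoundedness holds.
T = rng.integers(0, 2, size=n)
Y = np.where(T == 1, Y1, Y0)       # only one potential outcome is ever observed

ate_hat = Y[T == 1].mean() - Y[T == 0].mean()
print(ate_hat)  # close to the true ATE of 2
```

If T were instead correlated with (Y_0, Y_1), the same difference in means would be biased, which is exactly what the unconfoundedness assumption rules out.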

20.2.2 Loss Functions and Risk

A general framework for both tasks is the risk minimization approach. For a function f, define:

  • The population (or expected) risk: \mathcal{R}(f) = \mathbb{E}[L(f(X), Y)].
  • The empirical risk (on a sample of size n): \hat{\mathcal{R}}_n(f) = \frac{1}{n} \sum_{i=1}^n L(f(X_i), Y_i).

Prediction: We often solve the empirical risk minimization (ERM) problem:

\hat{f} = \arg\min_{f \in \mathcal{F}} \hat{\mathcal{R}}_n(f),

possibly with regularization. The measure of success is \mathcal{R}(\hat{f}), i.e., how well \hat{f} generalizes beyond the training sample.
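As a minimal ERM sketch (the squared loss, the linear function class, and the data-generating process below are assumptions chosen for illustration): with L(f(X), Y) = (f(X) - Y)^2 and \mathcal{F} the class of linear functions, the empirical risk minimizer is ordinary least squares, and evaluating \hat{\mathcal{R}}_n on a fresh sample approximates the population risk \mathcal{R}(\hat{f}).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1_000, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.normal(size=n)  # noise variance 1

# ERM with squared loss over linear f: beta_hat = argmin (1/n) sum_i (X_i beta - Y_i)^2
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

def empirical_risk(beta, X, Y):
    return np.mean((X @ beta - Y) ** 2)

# Generalization check: empirical risk on a fresh test sample from the same distribution.
X_test = rng.normal(size=(n, d))
Y_test = X_test @ beta_true + rng.normal(size=n)
print(empirical_risk(beta_hat, X, Y), empirical_risk(beta_hat, X_test, Y_test))
```

Both risks hover near the irreducible noise variance of 1, with the training risk typically slightly lower, reflecting the usual optimism of in-sample evaluation.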

Causal/Parameter Estimation: We might define an M-estimator for \beta (Newey and McFadden 1994). Consider a function \psi(\beta; X, Y) such that the true parameter \beta_0 satisfies:

\mathbb{E}[\psi(\beta_0; X, Y)] = 0.

The empirical M-estimator solves

\hat{\beta} = \arg \min_\beta \left\| \frac{1}{n} \sum_{i=1}^n \psi(\beta; X_i, Y_i) \right\|,

or equivalently sets it to zero in a method-of-moments sense:

\frac{1}{n} \sum_{i=1}^n \psi(\hat{\beta}; X_i, Y_i) = 0.

Properties like consistency (\hat{\beta} \overset{p}{\to} \beta_0) or asymptotic normality (\sqrt{n}(\hat{\beta} - \beta_0) \overset{d}{\to} N(0, \Sigma)) are central. The emphasis is on uncovering the true \beta_0 rather than purely predictive accuracy.
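The moment-condition machinery above can be sketched in a few lines (the choice \psi(\beta; X, Y) = X(Y - X\beta) for a simple linear model, and the simulation setup, are assumptions for illustration): the sample moment equation has a closed-form root here, and re-solving it at increasing n shows the estimate converging to \beta_0.

```python
import numpy as np

rng = np.random.default_rng(2)
beta_0 = 1.5  # true parameter

def beta_hat(n):
    """Solve the sample moment equation (1/n) sum_i X_i (Y_i - X_i beta) = 0."""
    X = rng.normal(size=n)
    Y = beta_0 * X + rng.normal(size=n)
    # E[X (Y - X beta_0)] = 0, and the sample analogue solves in closed form:
    return np.sum(X * Y) / np.sum(X ** 2)

# Consistency: the estimate approaches beta_0 as n grows.
for n in (100, 10_000, 1_000_000):
    print(n, beta_hat(n))
```

The printed estimates tighten around \beta_0 at roughly the 1/\sqrt{n} rate that the asymptotic normality result predicts.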


References

Newey, Whitney K., and Daniel McFadden. 1994. “Large Sample Estimation and Hypothesis Testing.” In Handbook of Econometrics, vol. 4, 2111–2245.