20.2 Mathematical Setup
20.2.1 Probability Space and Data
We posit a probability space (\Omega, \mathcal{F}, P) and random variables (X, Y) on it. We typically have an i.i.d. sample \{(X_i, Y_i)\}_{i=1}^n from the true distribution D. Let:
(X, Y) \sim D, \qquad (X_i, Y_i) \overset{\text{i.i.d.}}{\sim} D.
In prediction, we train on \{(X_i, Y_i)\}_{i=1}^n to obtain \hat{f}, and we evaluate on a test point (\tilde{X}, \tilde{Y}) drawn from D. In causal inference, we scrutinize the data-generating process carefully, ensuring that we can identify a causal effect. For example, we may require:
- Potential outcomes {Yi(0),Yi(1)} for treatment effect settings.
- Unconfoundedness or randomization assumptions.
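The role of these assumptions can be illustrated with a small simulation. The sketch below is a hypothetical example (the sample size, effect size, and variable names are all assumptions, not from the text): treatment is randomized, so unconfoundedness holds by design and a simple difference in means identifies the average treatment effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(size=n)

# Potential outcomes Y(0), Y(1); the true average treatment effect is 2.0.
Y0 = X + rng.normal(size=n)
Y1 = X + 2.0 + rng.normal(size=n)

# Randomized treatment assignment, independent of the potential outcomes.
T = rng.integers(0, 2, size=n)
Y = np.where(T == 1, Y1, Y0)  # we observe only one potential outcome per unit

# Under randomization, the difference in means identifies the ATE.
ate_hat = Y[T == 1].mean() - Y[T == 0].mean()
print(round(ate_hat, 2))
```

If the assignment T instead depended on the potential outcomes, the same difference in means would generally be biased; that is what the unconfoundedness assumption rules out.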
20.2.2 Loss Functions and Risk
A general framework for both tasks is the risk minimization approach. For a function f, define:
- The population (or expected) risk: \mathcal{R}(f) = \mathbb{E}[L(f(X), Y)].
- The empirical risk (on a sample of size n): \hat{\mathcal{R}}_n(f) = \frac{1}{n} \sum_{i=1}^n L(f(X_i), Y_i).
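For a fixed f, the empirical risk is a sample average, so by the law of large numbers it converges to the population risk. A minimal sketch (assuming a simple Gaussian model and squared-error loss, both chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    # Assumed data-generating process: Y = 2X + noise, noise variance 1.
    X = rng.normal(size=n)
    Y = 2 * X + rng.normal(size=n)
    return X, Y

def f(x):
    return 2 * x  # a fixed predictor (here, the true regression function)

def loss(pred, y):
    return (pred - y) ** 2  # squared-error loss

# The empirical risk approximates the population risk
# R(f) = E[(2X + eps - 2X)^2] = Var(eps) = 1.
X, Y = sample(1_000_000)
empirical_risk = loss(f(X), Y).mean()
print(round(empirical_risk, 2))
```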
Prediction: We often solve the empirical risk minimization (ERM) problem:
\hat{f} = \arg\min_{f \in \mathcal{F}} \hat{\mathcal{R}}_n(f),
possibly with regularization. The measure of success is \mathcal{R}(\hat{f}), i.e., how well \hat{f} generalizes beyond the training sample.
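The ERM problem above can be sketched concretely. Assuming \mathcal{F} is the class of linear functions f(x) = x \cdot w with squared-error loss (an assumption made for this example, in which case ERM reduces to least squares), we can fit \hat{f} on training data and then measure \mathcal{R}(\hat{f}) on fresh test data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + rng.normal(size=n)  # noise variance 1

# ERM over the linear class: minimize (1/n) * sum_i (X_i . w - Y_i)^2.
# For squared loss this is ordinary least squares.
w_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Estimate the population risk of f_hat on an independent test sample from D.
X_test = rng.normal(size=(10_000, d))
Y_test = X_test @ w_true + rng.normal(size=10_000)
test_risk = ((X_test @ w_hat - Y_test) ** 2).mean()
print(round(test_risk, 1))  # should be near the irreducible noise variance, 1.0
```

The test risk, not the training risk, is the quantity of interest: it measures how well \hat{f} generalizes beyond the training sample.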
Causal/Parameter Estimation: We might define an M-estimator for \beta (Newey and McFadden 1994). Consider a function \psi(\beta; X, Y) such that the true parameter \beta_0 satisfies:
\mathbb{E}[\psi(\beta_0; X, Y)] = 0.
The empirical M-estimator solves
\hat{\beta} = \arg \min_\beta \left\| \frac{1}{n} \sum_{i=1}^n \psi(\beta; X_i, Y_i) \right\|,
or equivalently sets it to zero in a method-of-moments sense:
\frac{1}{n} \sum_{i=1}^n \psi(\hat{\beta}; X_i, Y_i) = 0.
Properties like consistency (\hat{\beta} \overset{p}{\to} \beta_0) or asymptotic normality (\sqrt{n}(\hat{\beta} - \beta_0) \overset{d}{\to} N(0, \Sigma)) are central. The emphasis is on uncovering the true \beta_0 rather than purely predictive accuracy.
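A small simulation illustrates consistency in the method-of-moments form. Assuming a univariate linear model Y = \beta_0 X + noise with moment function \psi(\beta; X, Y) = X(Y - X\beta) (an example choice, not prescribed by the text), \mathbb{E}[\psi(\beta_0; X, Y)] = 0 holds at the true \beta_0, and the sample moment equation has a closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(3)

def beta_hat(n):
    # Assumed model: Y = beta_0 * X + noise, with beta_0 = 1.5.
    X = rng.normal(size=n)
    Y = 1.5 * X + rng.normal(size=n)
    # Solve (1/n) * sum_i X_i (Y_i - X_i beta) = 0 for beta:
    # beta = sum_i X_i Y_i / sum_i X_i^2.
    return (X * Y).sum() / (X ** 2).sum()

# Consistency: beta_hat approaches beta_0 = 1.5 as n grows.
for n in (100, 10_000, 1_000_000):
    print(n, round(beta_hat(n), 3))
```

The estimates concentrate around \beta_0 as n increases, matching \hat{\beta} \overset{p}{\to} \beta_0; fluctuations around \beta_0 shrink at the \sqrt{n} rate predicted by asymptotic normality.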