20.2 Mathematical Setup
20.2.1 Probability Space and Data
We posit a probability space \((\Omega, \mathcal{F}, P)\) and random variables \((X, Y)\) defined on it. We typically observe an i.i.d. sample \(\{(X_i, Y_i)\}_{i=1}^n\) from the unknown true distribution \(\mathcal{D}\) of \((X, Y)\). Let:
\[ (X, Y) \sim \mathcal{D}, \quad (X_i, Y_i) \overset{\text{i.i.d.}}{\sim} \mathcal{D}. \]
In prediction, we train on \(\{(X_i, Y_i)\}_{i=1}^n\) to obtain \(\hat{f}\) and evaluate it on an independent test point \((\tilde{X}, \tilde{Y})\) drawn from \(\mathcal{D}\). In causal inference, we scrutinize the data-generating process itself, since identifying a causal effect requires assumptions beyond i.i.d. sampling. For example, we may require (see the simulation sketch after this list):
- Potential outcomes \(\{Y_i(0), Y_i(1)\}\) in treatment-effect settings, with a binary treatment \(W_i\) and observed outcome \(Y_i = Y_i(W_i)\).
- Unconfoundedness, \((Y_i(0), Y_i(1)) \perp W_i \mid X_i\), or outright randomization of the treatment.
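The following minimal Python sketch makes the randomized case concrete. The data-generating process, the treatment indicator \(W\), and the true average treatment effect of 2 are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Covariate and potential outcomes (the DGP here is purely illustrative).
X = rng.normal(size=n)
Y0 = X + rng.normal(size=n)        # outcome under control, Y_i(0)
Y1 = X + 2.0 + rng.normal(size=n)  # outcome under treatment, Y_i(1); true ATE = 2

# Randomization makes (Y(0), Y(1)) independent of the treatment W.
W = rng.binomial(1, 0.5, size=n)
Y = np.where(W == 1, Y1, Y0)       # only one potential outcome is ever observed

# Under randomization, the difference in means identifies the ATE.
ate_hat = Y[W == 1].mean() - Y[W == 0].mean()
print(f"estimated ATE: {ate_hat:.3f} (truth: 2.0)")
```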
20.2.2 Loss Functions and Risk
A general framework covering both tasks is risk minimization. For a loss function \(L\) and a candidate function \(f\), define (a numerical sketch follows the list):
- The population (or expected) risk: \[ \mathcal{R}(f) = \mathbb{E}[L(f(X), Y)]. \]
- The empirical risk (on a sample of size \(n\)): \[ \hat{\mathcal{R}}_n(f) = \frac{1}{n} \sum_{i=1}^n L(f(X_i), Y_i). \]
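To make the distinction concrete, here is a small numerical sketch assuming squared-error loss and a toy distribution \(\mathcal{D}\), both illustrative choices; the population risk, being an expectation, is approximated by Monte Carlo on a large fresh sample.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n i.i.d. pairs (X, Y) from a toy distribution D (illustrative)."""
    X = rng.normal(size=n)
    Y = np.sin(X) + 0.3 * rng.normal(size=n)
    return X, Y

def empirical_risk(f, X, Y):
    """Average squared-error loss L(f(X), Y) = (f(X) - Y)^2 over a sample."""
    return np.mean((f(X) - Y) ** 2)

f = np.sin  # a fixed candidate function, not fit to any data

X_train, Y_train = sample(100)    # small sample: empirical risk is noisy
X_big, Y_big = sample(1_000_000)  # large fresh sample approximates R(f)

print("empirical risk         :", empirical_risk(f, X_train, Y_train))
print("population risk (approx):", empirical_risk(f, X_big, Y_big))
```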
Prediction: We often solve the empirical risk minimization (ERM) problem over a class \(\mathcal{H}\) of candidate functions:
\[ \hat{f} = \arg \min_{f \in \mathcal{H}} \hat{\mathcal{R}}_n(f), \]
possibly with regularization. The measure of success is \(\mathcal{R}(\hat{f})\), i.e., how well \(\hat{f}\) generalizes beyond the training sample.
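A minimal ERM sketch, assuming squared-error loss, a cubic-polynomial class \(\mathcal{H}\), and a small ridge penalty; all three choices are illustrative. With these choices the ERM problem has a closed-form least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(2)

# Training sample from the same toy distribution as above (illustrative DGP).
X = rng.normal(size=200)
Y = np.sin(X) + 0.3 * rng.normal(size=200)

# Hypothesis class H: cubic polynomials. ERM with squared-error loss plus a
# ridge penalty reduces to regularized least squares on polynomial features.
degree, lam = 3, 1e-3
Phi = np.vander(X, degree + 1)  # columns X^3, X^2, X, 1
coef = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ Y)

def f_hat(x):
    return np.vander(np.atleast_1d(x), degree + 1) @ coef

# Estimate the generalization risk R(f_hat) on a fresh draw from D.
X_test = rng.normal(size=100_000)
Y_test = np.sin(X_test) + 0.3 * rng.normal(size=100_000)
print("estimated test risk:", np.mean((f_hat(X_test) - Y_test) ** 2))
```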
Causal/Parameter Estimation: We might instead define an \(M\)-estimator for a parameter \(\beta\) (Newey and McFadden 1994). Consider a moment function \(\psi(\beta; X, Y)\) such that the true parameter \(\beta_0\) satisfies:
\[ \mathbb{E}[\psi(\beta_0; X, Y)] = 0. \]
The empirical \(M\)-estimator solves
\[ \hat{\beta} = \arg \min_\beta \left\| \frac{1}{n} \sum_{i=1}^n \psi(\beta; X_i, Y_i) \right\|, \]
or, when an exact root exists (as in the just-identified case), equivalently sets the sample moment to zero:
\[ \frac{1}{n} \sum_{i=1}^n \psi(\hat{\beta}; X_i, Y_i) = 0. \]
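As a concrete sketch, ordinary least squares fits this template with \(\psi(\beta; X, Y) = X\,(Y - X^\top \beta)\). The code below solves the sample moment equation numerically with a generic root finder; the linear model, dimensions, and true \(\beta_0\) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(3)
n, beta0 = 5_000, np.array([1.5, -0.7])

# Linear model Y = X @ beta0 + noise. With psi(beta; X, Y) = X * (Y - X @ beta),
# the population moment condition E[psi(beta0; X, Y)] = 0 holds exactly.
X = rng.normal(size=(n, 2))
Y = X @ beta0 + rng.normal(size=n)

def sample_moment(beta):
    """(1/n) * sum_i psi(beta; X_i, Y_i), a vector in R^2."""
    return X.T @ (Y - X @ beta) / n

beta_hat = root(sample_moment, x0=np.zeros(2)).x
print("beta_hat:", beta_hat, "  truth:", beta0)
```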
Properties like consistency (\(\hat{\beta} \overset{p}{\to} \beta_0\)) and asymptotic normality (\(\sqrt{n}(\hat{\beta} - \beta_0) \overset{d}{\to} N(0, \Sigma)\)) are central. The emphasis is on recovering the true \(\beta_0\) rather than on purely predictive accuracy.
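A quick Monte Carlo sketch of both properties for the scalar least-squares case; the design (standard normal \(X\) and noise, for which the sandwich formula gives \(\Sigma = \operatorname{Var}(\varepsilon)/\mathbb{E}[X^2] = 1\)) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
beta0, n, reps = 2.0, 500, 2_000

# Repeatedly draw samples and record sqrt(n) * (beta_hat - beta0) for the
# scalar least-squares estimator beta_hat = sum(X*Y) / sum(X**2).
draws = np.empty(reps)
for r in range(reps):
    X = rng.normal(size=n)
    Y = beta0 * X + rng.normal(size=n)
    draws[r] = np.sqrt(n) * ((X @ Y) / (X @ X) - beta0)

# Under this design Sigma = 1, so draws should look like N(0, 1):
# mean near 0 (consistency), standard deviation near 1 (asymptotic normality).
print("mean:", round(draws.mean(), 3), "  std:", round(draws.std(), 3))
```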