20.2 Mathematical Setup
20.2.1 Probability Space and Data
We posit a probability space \((\Omega, \mathcal{F}, P)\) and random variables \((X, Y)\) defined on it. We typically observe an i.i.d. sample \(\{(X_i, Y_i)\}_{i=1}^n\) from the unknown true distribution \(\mathcal{D}\) of \((X, Y)\). Let:
\[ (X, Y) \sim \mathcal{D}, \quad (X_i, Y_i) \overset{\text{i.i.d.}}{\sim} \mathcal{D}. \]
In prediction, we train on \(\{(X_i, Y_i)\}_{i=1}^n\) to obtain \(\hat{f}\) and evaluate it on an independent test point \((\tilde{X}, \tilde{Y})\) drawn from \(\mathcal{D}\). In causal inference, we scrutinize the data-generating process itself, since identifying a causal effect requires assumptions beyond i.i.d. sampling. For example, we may require (see the simulation sketch after this list):
- Potential outcomes \(\{Y_i(0), Y_i(1)\}\) in treatment-effect settings, with a binary treatment \(W_i\) and observed outcome \(Y_i = Y_i(W_i)\).
- Unconfoundedness, \((Y_i(0), Y_i(1)) \perp W_i \mid X_i\), or outright randomization of the treatment.
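The following minimal Python sketch makes the randomized case concrete. The data-generating process, the treatment indicator \(W\), and the true average treatment effect of 2 are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Covariate and potential outcomes (the DGP here is purely illustrative).
X = rng.normal(size=n)
Y0 = X + rng.normal(size=n)        # outcome under control, Y_i(0)
Y1 = X + 2.0 + rng.normal(size=n)  # outcome under treatment, Y_i(1); true ATE = 2

# Randomization makes (Y(0), Y(1)) independent of the treatment W.
W = rng.binomial(1, 0.5, size=n)
Y = np.where(W == 1, Y1, Y0)       # only one potential outcome is ever observed

# Under randomization, the difference in means identifies the ATE.
ate_hat = Y[W == 1].mean() - Y[W == 0].mean()
print(f"estimated ATE: {ate_hat:.3f} (truth: 2.0)")
```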
20.2.2 Loss Functions and Risk
A general framework covering both tasks is risk minimization. For a loss function \(L\) and a candidate function \(f\), define (a numerical sketch follows the list):
- The population (or expected) risk: \[ \mathcal{R}(f) = \mathbb{E}[L(f(X), Y)]. \]
- The empirical risk (on a sample of size \(n\)): \[ \hat{\mathcal{R}}_n(f) = \frac{1}{n} \sum_{i=1}^n L(f(X_i), Y_i). \]
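To make the distinction concrete, here is a small numerical sketch assuming squared-error loss and a toy distribution \(\mathcal{D}\), both illustrative choices; the population risk, being an expectation, is approximated by Monte Carlo on a large fresh sample.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n i.i.d. pairs (X, Y) from a toy distribution D (illustrative)."""
    X = rng.normal(size=n)
    Y = np.sin(X) + 0.3 * rng.normal(size=n)
    return X, Y

def empirical_risk(f, X, Y):
    """Average squared-error loss L(f(X), Y) = (f(X) - Y)^2 over a sample."""
    return np.mean((f(X) - Y) ** 2)

f = np.sin  # a fixed candidate function, not fit to any data

X_train, Y_train = sample(100)    # small sample: empirical risk is noisy
X_big, Y_big = sample(1_000_000)  # large fresh sample approximates R(f)

print("empirical risk         :", empirical_risk(f, X_train, Y_train))
print("population risk (approx):", empirical_risk(f, X_big, Y_big))
```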
Prediction: We often solve the empirical risk minimization (ERM) problem over a class \(\mathcal{H}\) of candidate functions:
\[ \hat{f} = \arg \min_{f \in \mathcal{H}} \hat{\mathcal{R}}_n(f), \]
possibly with regularization. The measure of success is \(\mathcal{R}(\hat{f})\), i.e., how well \(\hat{f}\) generalizes beyond the training sample.
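A minimal ERM sketch, assuming squared-error loss, a cubic-polynomial class \(\mathcal{H}\), and a small ridge penalty; all three choices are illustrative. With these choices the ERM problem has a closed-form least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(2)

# Training sample from the same toy distribution as above (illustrative DGP).
X = rng.normal(size=200)
Y = np.sin(X) + 0.3 * rng.normal(size=200)

# Hypothesis class H: cubic polynomials. ERM with squared-error loss plus a
# ridge penalty reduces to regularized least squares on polynomial features.
degree, lam = 3, 1e-3
Phi = np.vander(X, degree + 1)  # columns X^3, X^2, X, 1
coef = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ Y)

def f_hat(x):
    return np.vander(np.atleast_1d(x), degree + 1) @ coef

# Estimate the generalization risk R(f_hat) on a fresh draw from D.
X_test = rng.normal(size=100_000)
Y_test = np.sin(X_test) + 0.3 * rng.normal(size=100_000)
print("estimated test risk:", np.mean((f_hat(X_test) - Y_test) ** 2))
```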
Causal/Parameter Estimation: We might instead define an \(M\)-estimator for a parameter \(\beta\) (Newey and McFadden 1994). Consider a moment function \(\psi(\beta; X, Y)\) such that the true parameter \(\beta_0\) satisfies:
\[ \mathbb{E}[\psi(\beta_0; X, Y)] = 0. \]
The empirical \(M\)-estimator solves
\[ \hat{\beta} = \arg \min_\beta \left\| \frac{1}{n} \sum_{i=1}^n \psi(\beta; X_i, Y_i) \right\|, \]
or, when an exact root exists (as in the just-identified case), equivalently sets the sample moment to zero:
\[ \frac{1}{n} \sum_{i=1}^n \psi(\hat{\beta}; X_i, Y_i) = 0. \]
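As a concrete sketch, ordinary least squares fits this template with \(\psi(\beta; X, Y) = X\,(Y - X^\top \beta)\). The code below solves the sample moment equation numerically with a generic root finder; the linear model, dimensions, and true \(\beta_0\) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(3)
n, beta0 = 5_000, np.array([1.5, -0.7])

# Linear model Y = X @ beta0 + noise. With psi(beta; X, Y) = X * (Y - X @ beta),
# the population moment condition E[psi(beta0; X, Y)] = 0 holds exactly.
X = rng.normal(size=(n, 2))
Y = X @ beta0 + rng.normal(size=n)

def sample_moment(beta):
    """(1/n) * sum_i psi(beta; X_i, Y_i), a vector in R^2."""
    return X.T @ (Y - X @ beta) / n

beta_hat = root(sample_moment, x0=np.zeros(2)).x
print("beta_hat:", beta_hat, "  truth:", beta0)
```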
Properties like consistency (\(\hat{\beta} \overset{p}{\to} \beta_0\)) and asymptotic normality (\(\sqrt{n}(\hat{\beta} - \beta_0) \overset{d}{\to} N(0, \Sigma)\)) are central. The emphasis is on recovering the true \(\beta_0\) rather than on purely predictive accuracy.
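A quick Monte Carlo sketch of both properties for the scalar least-squares case; the design (standard normal \(X\) and noise, for which the sandwich formula gives \(\Sigma = \operatorname{Var}(\varepsilon)/\mathbb{E}[X^2] = 1\)) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
beta0, n, reps = 2.0, 500, 2_000

# Repeatedly draw samples and record sqrt(n) * (beta_hat - beta0) for the
# scalar least-squares estimator beta_hat = sum(X*Y) / sum(X**2).
draws = np.empty(reps)
for r in range(reps):
    X = rng.normal(size=n)
    Y = beta0 * X + rng.normal(size=n)
    draws[r] = np.sqrt(n) * ((X @ Y) / (X @ X) - beta0)

# Under this design Sigma = 1, so draws should look like N(0, 1):
# mean near 0 (consistency), standard deviation near 1 (asymptotic normality).
print("mean:", round(draws.mean(), 3), "  std:", round(draws.std(), 3))
```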