20.3 Prediction in Detail
20.3.1 Empirical Risk Minimization and Generalization
In supervised learning, the goal is to find a function f from a class of candidate models F (e.g., linear models, neural networks, tree-based models) that accurately predicts an outcome Y given an input X. This is typically formulated as an Empirical Risk Minimization problem, where we seek to minimize the average loss over the training data:
\hat{f} = \arg \min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), Y_i),
where L(\cdot, \cdot) is a loss function that quantifies the error between predictions and actual values. Common choices include:
- Squared Error (Regression): L(\hat{y}, y) = (\hat{y} - y)^2.
- Absolute Error (Regression): L(\hat{y}, y) = |\hat{y} - y|.
- Logistic Loss (Classification): L(\hat{p}, y) = -[y \log \hat{p} + (1 - y) \log(1 - \hat{p})].
By minimizing empirical risk, we find a function \hat{f} that best fits the observed data. However, minimizing training error does not guarantee good generalization—the ability of \hat{f} to perform well on unseen data.
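To make the objective concrete, here is a minimal sketch in Python (assuming NumPy; the helper names and toy data are illustrative, not a fixed API) that computes the empirical risk under each of the losses above:

```python
import numpy as np

# Illustrative loss functions matching the definitions above.
def squared_error(y_hat, y):
    return (y_hat - y) ** 2

def absolute_error(y_hat, y):
    return np.abs(y_hat - y)

def logistic_loss(p_hat, y, eps=1e-12):
    # Clip predicted probabilities away from 0 and 1 to keep the log finite.
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def empirical_risk(loss, predictions, outcomes):
    # R_hat_n(f) = (1/n) * sum_i L(f(X_i), Y_i)
    return np.mean(loss(predictions, outcomes))

# Example: empirical risk of a constant predictor on toy regression data.
y = np.array([1.0, 2.0, 3.0])
print(empirical_risk(squared_error, np.full(3, 2.0), y))  # -> 0.666...
```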
20.3.1.1 Overfitting and Regularization
If \mathcal{F} is very large or expressive (e.g., deep neural networks with millions of parameters), \hat{f} can become too complex, learning patterns that exist in the training set but do not generalize to new data. This is called overfitting.
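To see overfitting concretely, the following sketch (the data-generating process and polynomial degrees are purely illustrative) fits polynomials of increasing degree to a small sample; the most flexible fit drives training error toward zero while test error deteriorates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
x = np.sort(rng.uniform(-1, 1, n))
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=n)
x_test = rng.uniform(-1, 1, 200)
y_test = np.sin(np.pi * x_test) + rng.normal(scale=0.3, size=200)

for deg in [1, 3, 15]:
    coefs = np.polyfit(x, y, deg)  # least-squares polynomial fit of given degree
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {deg:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The degree-15 fit nearly interpolates the 20 training points, yet its test error is the worst of the three: it has learned the noise, not the signal.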
To mitigate overfitting, we introduce regularization, modifying the optimization objective to penalize complex models:
\hat{f}_\lambda = \arg \min_{f \in \mathcal{F}} \left\{ \hat{\mathcal{R}}_n(f) + \lambda \Omega(f) \right\},
where:
- \hat{\mathcal{R}}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), Y_i) is the empirical risk.
- \Omega(f) is a complexity penalty that discourages overly flexible models.
- \lambda \geq 0 controls the strength of regularization.
Common choices of \Omega(f) include:
- LASSO penalty: \Omega(f) = \|\beta\|_1 (encourages sparse coefficient vectors in linear models).
- Ridge penalty: \Omega(f) = \|\beta\|_2^2 (shrinks coefficients to reduce variance).
- Neural network weight decay: \Omega(f) = \sum_j w_j^2 (keeps weights from growing without bound).
Regularization encourages simpler models, which are more likely to generalize well.
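As a small sketch of the penalized objective (assuming NumPy; `penalized_risk` is an illustrative helper, not a standard routine), for a linear model with squared-error loss:

```python
import numpy as np

def penalized_risk(beta, X, y, lam, penalty="ridge"):
    """Empirical risk plus a complexity penalty, R_hat_n(f) + lam * Omega(f),
    for a linear model f(x) = x @ beta with squared-error loss."""
    risk = np.mean((y - X @ beta) ** 2)
    if penalty == "ridge":
        omega = np.sum(beta ** 2)      # ||beta||_2^2
    elif penalty == "lasso":
        omega = np.sum(np.abs(beta))   # ||beta||_1
    else:
        raise ValueError(penalty)
    # Larger lam puts more weight on simplicity relative to fit.
    return risk + lam * omega
```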
20.3.1.2 Generalization and Statistical Learning Theory
A fundamental question in machine learning is: How well does \hat{f} perform on unseen data? This is captured by the expected risk:
R(f) = \mathbb{E}[L(f(X), Y)].
Ideally, we want the true risk R(\hat{f}) to come close to the best risk attainable within \mathcal{F}, so we study the difference:
R(\hat{f}) - \min_{f \in \mathcal{F}} R(f).
This difference, called the excess risk, measures how well \hat{f} generalizes beyond the training sample. Statistical Learning Theory provides theoretical tools to analyze this gap (Vapnik 2013; Hastie et al. 2009). In particular, it establishes generalization bounds that depend on the capacity of the function class \mathcal{F}.
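For intuition, one representative bound of this type, stated informally for losses bounded in [0, 1] (with the loss absorbed into the function class), says that with probability at least 1 - \delta over the training sample, uniformly over f \in \mathcal{F},
R(f) \leq \hat{\mathcal{R}}_n(f) + 2\,\mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}},
where \mathfrak{R}_n(\mathcal{F}) is the Rademacher complexity introduced in the next subsection. The gap between true and empirical risk is thus controlled by the capacity of \mathcal{F} and shrinks as n grows.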
20.3.1.3 Complexity Measures
Two important ways to quantify the complexity of \mathcal{F} are the VC dimension and the Rademacher complexity.
20.3.1.3.1 VC Dimension
The VC dimension measures the ability of a hypothesis class \mathcal{F} to fit arbitrary labels. Formally, the VC dimension of \mathcal{F}, denoted as \operatorname{VC}(\mathcal{F}), is the largest number of points that can be shattered by some function in \mathcal{F}.
- A set of points is shattered by \mathcal{F} if, for every possible labeling of these points, there exists a function f \in \mathcal{F} that perfectly classifies them.
Example 1: Linear Classifiers in 2D
Consider a set of points in \mathbb{R}^2 (the plane).
- If \mathcal{F} consists of linear decision boundaries, then any three points in general position can be shattered: for every assignment of labels to the three points, some line separates the two classes.
- However, four points cannot always be shattered (e.g., when labeled in an XOR pattern, no single line separates the classes).
- Thus, the VC dimension of linear classifiers in \mathbb{R}^2 is 3.
Key properties:
- A higher VC dimension means a more expressive model class (higher capacity).
- If \operatorname{VC}(\mathcal{F}) is too large, the model can memorize the training set, leading to poor generalization.
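The shattering argument in Example 1 can be checked numerically. The sketch below (assuming NumPy and SciPy; the specific point configurations are illustrative) tests every \pm 1 labeling for linear separability via a small linear-programming feasibility problem:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(X, y):
    """Linear separability as an LP feasibility check:
    does some (w, b) satisfy y_i * (w @ x_i + b) >= 1 for all i?"""
    n, d = X.shape
    # Variables are (w, b); each constraint is -y_i * (w @ x_i + b) <= -1.
    A = -(y[:, None] * np.hstack([X, np.ones((n, 1))]))
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def shattered(X):
    """X is shattered iff every +/-1 labeling is linearly separable."""
    return all(separable(X, np.array(lab))
               for lab in product([-1.0, 1.0], repeat=len(X)))

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # general position
xor_pts = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

print(shattered(triangle))  # True: all 8 labelings are separable
print(shattered(xor_pts))   # False: the XOR labeling is not separable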
20.3.1.3.2 Rademacher Complexity
VC dimension is a combinatorial measure, but Rademacher complexity is a more refined, data-dependent measure of function class flexibility.
Intuition: Rademacher complexity quantifies how well functions in \mathcal{F} can correlate with random noise. If a function class can fit random labels well, it is too flexible and likely to overfit.
Definition:
Given n training samples, let \sigma_1, \dots, \sigma_n be independent Rademacher variables (i.e., random variables taking values \pm 1 with equal probability). The empirical Rademacher complexity of \mathcal{F}, written \hat{\mathfrak{R}}_n(\mathcal{F}) to distinguish it from the empirical risk \hat{\mathcal{R}}_n(f), is:
\hat{\mathfrak{R}}_n(\mathcal{F}) = \mathbb{E}_{\sigma} \left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i) \right].
Interpretation:
- If \hat{\mathfrak{R}}_n(\mathcal{F}) is large, then \mathcal{F} can fit random noise well \Rightarrow high risk of overfitting.
- If \hat{\mathfrak{R}}_n(\mathcal{F}) is small, then \mathcal{F} is more stable \Rightarrow better generalization.
Example 2: Linear Models with Bounded Norm
Suppose \mathcal{F} consists of linear models f(X) = w^\top X with \|w\|_2 \leq C.
- For inputs with bounded norm, the Rademacher complexity of this class scales as \mathcal{O}(C/\sqrt{n}).
- This suggests that controlling the norm of w (e.g., via ridge regression) improves generalization.
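For this bounded-norm linear class the supremum has a closed form, \sup_{\|w\|_2 \leq C} \frac{1}{n} \sum_i \sigma_i w^\top X_i = \frac{C}{n} \big\| \sum_i \sigma_i X_i \big\|_2, which makes a Monte Carlo estimate straightforward. A minimal sketch (assuming NumPy; the sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_linear(X, C, n_draws=2000, rng=rng):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> w @ x : ||w||_2 <= C}, using the closed form
    sup_w (1/n) sum_i s_i * (w @ x_i) = (C/n) * ||sum_i s_i x_i||_2."""
    n = X.shape[0]
    s = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher draws
    return (C / n) * np.linalg.norm(s @ X, axis=1).mean()

for n in [100, 400, 1600]:
    X = rng.normal(size=(n, 5))
    print(n, rademacher_linear(X, C=1.0))  # roughly halves as n quadruples
```

The estimates shrink at the C/\sqrt{n} rate, matching the scaling stated above.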
20.3.2 Bias-Variance Decomposition
For a regression problem with squared-error loss, evaluated at a fixed point x, a classic decomposition is:
\mathbb{E}\big[(\hat{f}(x) - Y)^2 \mid X = x\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f^*(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma_\varepsilon^2}_{\text{Irreducible Error}}
where f^*(X) = \mathbb{E}[Y \mid X], \sigma_\varepsilon^2 = \operatorname{Var}(Y \mid X = x), and the expectations over \hat{f} are taken with respect to the randomness of the training sample. Since the irreducible error cannot be reduced by any model, minimizing the sum of \text{Bias}^2 and Variance is key.
In prediction, a small increase in bias is often acceptable if it yields a large reduction in variance—this can improve out-of-sample performance. However, for causal inference, any added bias is problematic if it distorts the interpretation of parameters.
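The decomposition can be verified by simulation. The sketch below (assuming NumPy; the data-generating process, ridge penalty, and evaluation point are illustrative) refits a ridge estimator on many independent training samples and estimates the squared bias and variance of its prediction at a fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma_eps, lam = 50, 1.0, 10.0
x0 = np.array([1.0, 0.5])          # fixed evaluation point
beta_true = np.array([2.0, -1.0])  # f*(x) = x @ beta_true
f_star = x0 @ beta_true

preds = []
for _ in range(5000):              # many independent training samples
    X = rng.normal(size=(n, 2))
    y = X @ beta_true + rng.normal(scale=sigma_eps, size=n)
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)  # ridge
    preds.append(x0 @ beta_hat)

preds = np.array(preds)
print("Bias^2:            ", (preds.mean() - f_star) ** 2)
print("Variance:          ", preds.var())
print("Irreducible error: ", sigma_eps ** 2)
```

Increasing lam raises the squared bias and lowers the variance, tracing out the trade-off described above.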
20.3.3 Example: Linear Regression for Prediction
Consider a linear predictor:
\hat{y} = x^\top \hat{\beta}.
We choose \hat{\beta} to minimize:
\sum_{i=1}^n (y_i - x_i^\top \beta)^2 \quad \text{or with a penalty:} \quad \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_2^2.
Goal: Achieve minimal prediction error on unseen data (\tilde{x}, \tilde{y}).
The estimated \hat{\beta} might be biased if we use regularization (e.g., ridge). But from a purely predictive standpoint, that bias can be advantageous if it lowers variance substantially and thus lowers expected prediction error.
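The following sketch (assuming NumPy; the design with n barely above p and the choice \lambda = 10 are purely illustrative) compares OLS (\lambda = 0) with ridge on held-out data; in noisy, high-dimensional designs the ridge fit typically attains lower test error despite its bias:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 40, 30, 2.0  # few observations relative to regressors
beta = rng.normal(size=p)
X, X_new = rng.normal(size=(n, p)), rng.normal(size=(1000, p))
y = X @ beta + rng.normal(scale=sigma, size=n)
y_new = X_new @ beta + rng.normal(scale=sigma, size=1000)

for lam in [0.0, 10.0]:    # lam = 0 recovers OLS
    # Closed-form ridge estimator: (X'X + lam * I)^{-1} X'y
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    mse = np.mean((y_new - X_new @ beta_hat) ** 2)
    print(f"lambda = {lam}: test MSE = {mse:.2f}")
```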
20.3.4 Applications in Economics
In economics (and related social sciences), prediction plays an increasingly prominent role (Mullainathan and Spiess 2017; Athey and Imbens 2019):
- Measure Variables: Predicting missing or proxy variables (e.g., predicting income from observable covariates, or predicting individual preferences from online behaviors).
- Embed Prediction Tasks Within Parameter Estimation or Treatment Effects: Sometimes, a first-stage prediction (e.g., imputing missing data or generating prognostic scores) is used as an input for subsequent causal analyses.
- Control for Observed Confounders: Machine learning methods—such as LASSO, random forests, or neural nets—can be used to control for high-dimensional X when doing partial-out adjustments or residualizing outcomes (Belloni, Chernozhukov, and Hansen 2014; Chernozhukov et al. 2018).