20.3 Prediction in Detail
20.3.1 Empirical Risk Minimization and Generalization
In supervised learning, the goal is to find a function f from a class of candidate models F (e.g., linear models, neural networks, tree-based models) that accurately predicts an outcome Y given an input X. This is typically formulated as an Empirical Risk Minimization problem, where we seek to minimize the average loss over the training data:
\hat{f} = \arg \min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), Y_i),
where L(\cdot, \cdot) is a loss function that quantifies the error between predictions and actual values. Common choices include:
- Squared Error (Regression): L(\hat{y}, y) = (\hat{y} - y)^2.
- Absolute Error (Regression): L(\hat{y}, y) = |\hat{y} - y|.
- Logistic Loss (Classification): L(\hat{p}, y) = -[y \log \hat{p} + (1 - y) \log(1 - \hat{p})].
By minimizing empirical risk, we find a function \hat{f} that best fits the observed data. However, minimizing training error does not guarantee good generalization—the ability of \hat{f} to perform well on unseen data.
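To make the objective concrete, here is a minimal sketch in Python (assuming NumPy; the helper names and toy data are illustrative, not a fixed API) that computes the empirical risk under each of the losses above:

```python
import numpy as np

# Illustrative loss functions matching the definitions above.
def squared_error(y_hat, y):
    return (y_hat - y) ** 2

def absolute_error(y_hat, y):
    return np.abs(y_hat - y)

def logistic_loss(p_hat, y, eps=1e-12):
    # Clip predicted probabilities away from 0 and 1 to keep the log finite.
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def empirical_risk(loss, predictions, outcomes):
    # R_hat_n(f) = (1/n) * sum_i L(f(X_i), Y_i)
    return np.mean(loss(predictions, outcomes))

# Example: empirical risk of a constant predictor on toy regression data.
y = np.array([1.0, 2.0, 3.0])
print(empirical_risk(squared_error, np.full(3, 2.0), y))  # -> 0.666...
```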
20.3.1.1 Overfitting and Regularization
If \mathcal{F} is very large or expressive (e.g., deep neural networks with millions of parameters), \hat{f} can become too complex, learning patterns that exist in the training set but do not generalize to new data. This is called overfitting.
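To see overfitting concretely, the following sketch (the data-generating process and polynomial degrees are purely illustrative) fits polynomials of increasing degree to a small sample; the most flexible fit drives training error toward zero while test error deteriorates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
x = np.sort(rng.uniform(-1, 1, n))
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=n)
x_test = rng.uniform(-1, 1, 200)
y_test = np.sin(np.pi * x_test) + rng.normal(scale=0.3, size=200)

for deg in [1, 3, 15]:
    coefs = np.polyfit(x, y, deg)  # least-squares polynomial fit of given degree
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {deg:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The degree-15 fit nearly interpolates the 20 training points, yet its test error is the worst of the three: it has learned the noise, not the signal.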
To mitigate overfitting, we introduce regularization, modifying the optimization objective to penalize complex models:
\hat{f}_\lambda = \arg \min_{f \in \mathcal{F}} \left\{ \hat{\mathcal{R}}_n(f) + \lambda \Omega(f) \right\},
where:
- \hat{\mathcal{R}}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), Y_i) is the empirical risk.
- \Omega(f) is a complexity penalty that discourages overly flexible models.
- \lambda \geq 0 controls the strength of regularization.
Common choices of \Omega(f) include:
- LASSO penalty: \Omega(f) = \|\beta\|_1 (encourages sparse coefficient vectors in linear models).
- Ridge penalty: \Omega(f) = \|\beta\|_2^2 (shrinks coefficients to reduce variance).
- Neural network weight decay: \Omega(f) = \sum_j w_j^2 (keeps weights from growing without bound).
Regularization encourages simpler models, which are more likely to generalize well.
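As a small sketch of the penalized objective (assuming NumPy; `penalized_risk` is an illustrative helper, not a standard routine), for a linear model with squared-error loss:

```python
import numpy as np

def penalized_risk(beta, X, y, lam, penalty="ridge"):
    """Empirical risk plus a complexity penalty, R_hat_n(f) + lam * Omega(f),
    for a linear model f(x) = x @ beta with squared-error loss."""
    risk = np.mean((y - X @ beta) ** 2)
    if penalty == "ridge":
        omega = np.sum(beta ** 2)      # ||beta||_2^2
    elif penalty == "lasso":
        omega = np.sum(np.abs(beta))   # ||beta||_1
    else:
        raise ValueError(penalty)
    # Larger lam puts more weight on simplicity relative to fit.
    return risk + lam * omega
```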
20.3.1.2 Generalization and Statistical Learning Theory
A fundamental question in machine learning is: How well does \hat{f} perform on unseen data? This is captured by the expected risk:
R(f) = \mathbb{E}[L(f(X), Y)].
Ideally, we want the true risk R(\hat{f}) to come close to the best risk attainable within \mathcal{F}, so we study the difference:
R(\hat{f}) - \min_{f \in \mathcal{F}} R(f).
This difference, called the excess risk, measures how well \hat{f} generalizes beyond the training sample. Statistical Learning Theory provides theoretical tools to analyze this gap (Vapnik 2013; Hastie et al. 2009). In particular, it establishes generalization bounds that depend on the capacity of the function class \mathcal{F}.
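For intuition, one representative bound of this type, stated informally for losses bounded in [0, 1] (with the loss absorbed into the function class), says that with probability at least 1 - \delta over the training sample, uniformly over f \in \mathcal{F},
R(f) \leq \hat{\mathcal{R}}_n(f) + 2\,\mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}},
where \mathfrak{R}_n(\mathcal{F}) is the Rademacher complexity introduced in the next subsection. The gap between true and empirical risk is thus controlled by the capacity of \mathcal{F} and shrinks as n grows.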
20.3.1.3 Complexity Measures
Two important ways to quantify the complexity of \mathcal{F} are the VC dimension and the Rademacher complexity.
20.3.1.3.1 VC Dimension
The VC dimension measures the ability of a hypothesis class \mathcal{F} to fit arbitrary labels. Formally, the VC dimension of \mathcal{F}, denoted as \operatorname{VC}(\mathcal{F}), is the largest number of points that can be shattered by some function in \mathcal{F}.
- A set of points is shattered by \mathcal{F} if, for every possible labeling of these points, there exists a function f \in \mathcal{F} that perfectly classifies them.
Example 1: Linear Classifiers in 2D
Consider a set of points in \mathbb{R}^2 (the plane).
- If \mathcal{F} consists of linear decision boundaries, then any three points in general position can be shattered: for every assignment of labels to the three points, some line separates the two classes.
- However, four points cannot always be shattered (e.g., when labeled in an XOR pattern, no single line separates the classes).
- Thus, the VC dimension of linear classifiers in \mathbb{R}^2 is 3.
Key properties:
- A higher VC dimension means a more expressive model class (higher capacity).
- If \operatorname{VC}(\mathcal{F}) is too large, the model can memorize the training set, leading to poor generalization.
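The shattering argument in Example 1 can be checked numerically. The sketch below (assuming NumPy and SciPy; the specific point configurations are illustrative) tests every \pm 1 labeling for linear separability via a small linear-programming feasibility problem:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(X, y):
    """Linear separability as an LP feasibility check:
    does some (w, b) satisfy y_i * (w @ x_i + b) >= 1 for all i?"""
    n, d = X.shape
    # Variables are (w, b); each constraint is -y_i * (w @ x_i + b) <= -1.
    A = -(y[:, None] * np.hstack([X, np.ones((n, 1))]))
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def shattered(X):
    """X is shattered iff every +/-1 labeling is linearly separable."""
    return all(separable(X, np.array(lab))
               for lab in product([-1.0, 1.0], repeat=len(X)))

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # general position
xor_pts = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

print(shattered(triangle))  # True: all 8 labelings are separable
print(shattered(xor_pts))   # False: the XOR labeling is not separable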
20.3.1.3.2 Rademacher Complexity
VC dimension is a combinatorial measure, but Rademacher complexity is a more refined, data-dependent measure of function class flexibility.
Intuition: Rademacher complexity quantifies how well functions in \mathcal{F} can correlate with random noise. If a function class can fit random labels well, it is too flexible and likely to overfit.
Definition:
Given n training samples, let \sigma_1, \dots, \sigma_n be independent Rademacher variables (i.e., random variables taking values \pm 1 with equal probability). The empirical Rademacher complexity of \mathcal{F}, written \hat{\mathfrak{R}}_n(\mathcal{F}) to distinguish it from the empirical risk \hat{\mathcal{R}}_n(f), is:
\hat{\mathfrak{R}}_n(\mathcal{F}) = \mathbb{E}_{\sigma} \left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i) \right].
Interpretation:
- If \hat{\mathfrak{R}}_n(\mathcal{F}) is large, then \mathcal{F} can fit random noise well \Rightarrow high risk of overfitting.
- If \hat{\mathfrak{R}}_n(\mathcal{F}) is small, then \mathcal{F} is more stable \Rightarrow better generalization.
Example 2: Linear Models with Bounded Norm
Suppose \mathcal{F} consists of linear models f(X) = w^\top X with \|w\|_2 \leq C.
- For inputs with bounded norm, the Rademacher complexity of this class scales as \mathcal{O}(C/\sqrt{n}).
- This suggests that controlling the norm of w (e.g., via ridge regression) improves generalization.
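For this bounded-norm linear class the supremum has a closed form, \sup_{\|w\|_2 \leq C} \frac{1}{n} \sum_i \sigma_i w^\top X_i = \frac{C}{n} \big\| \sum_i \sigma_i X_i \big\|_2, which makes a Monte Carlo estimate straightforward. A minimal sketch (assuming NumPy; the sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_linear(X, C, n_draws=2000, rng=rng):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> w @ x : ||w||_2 <= C}, using the closed form
    sup_w (1/n) sum_i s_i * (w @ x_i) = (C/n) * ||sum_i s_i x_i||_2."""
    n = X.shape[0]
    s = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher draws
    return (C / n) * np.linalg.norm(s @ X, axis=1).mean()

for n in [100, 400, 1600]:
    X = rng.normal(size=(n, 5))
    print(n, rademacher_linear(X, C=1.0))  # roughly halves as n quadruples
```

The estimates shrink at the C/\sqrt{n} rate, matching the scaling stated above.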
20.3.2 Bias-Variance Decomposition
For a regression problem with squared-error loss, evaluated at a fixed point x, a classic decomposition is:
\mathbb{E}\big[(\hat{f}(x) - Y)^2 \mid X = x\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f^*(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma_\varepsilon^2}_{\text{Irreducible Error}}
where f^*(X) = \mathbb{E}[Y \mid X], \sigma_\varepsilon^2 = \operatorname{Var}(Y \mid X = x), and the expectations over \hat{f} are taken with respect to the randomness of the training sample. Since the irreducible error cannot be reduced by any model, minimizing the sum of \text{Bias}^2 and Variance is key.
In prediction, a small increase in bias is often acceptable if it yields a large reduction in variance—this can improve out-of-sample performance. However, for causal inference, any added bias is problematic if it distorts the interpretation of parameters.
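The decomposition can be verified by simulation. The sketch below (assuming NumPy; the data-generating process, ridge penalty, and evaluation point are illustrative) refits a ridge estimator on many independent training samples and estimates the squared bias and variance of its prediction at a fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma_eps, lam = 50, 1.0, 10.0
x0 = np.array([1.0, 0.5])          # fixed evaluation point
beta_true = np.array([2.0, -1.0])  # f*(x) = x @ beta_true
f_star = x0 @ beta_true

preds = []
for _ in range(5000):              # many independent training samples
    X = rng.normal(size=(n, 2))
    y = X @ beta_true + rng.normal(scale=sigma_eps, size=n)
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)  # ridge
    preds.append(x0 @ beta_hat)

preds = np.array(preds)
print("Bias^2:            ", (preds.mean() - f_star) ** 2)
print("Variance:          ", preds.var())
print("Irreducible error: ", sigma_eps ** 2)
```

Increasing lam raises the squared bias and lowers the variance, tracing out the trade-off described above.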
20.3.3 Example: Linear Regression for Prediction
Consider a linear predictor:
\hat{y} = x^\top \hat{\beta}.
We choose \hat{\beta} to minimize:
\sum_{i=1}^n (y_i - x_i^\top \beta)^2 \quad \text{or with a penalty:} \quad \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_2^2.
Goal: Achieve minimal prediction error on unseen data (\tilde{x}, \tilde{y}).
The estimated \hat{\beta} might be biased if we use regularization (e.g., ridge). But from a purely predictive standpoint, that bias can be advantageous if it lowers variance substantially and thus lowers expected prediction error.
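The following sketch (assuming NumPy; the design with n barely above p and the choice \lambda = 10 are purely illustrative) compares OLS (\lambda = 0) with ridge on held-out data; in noisy, high-dimensional designs the ridge fit typically attains lower test error despite its bias:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 40, 30, 2.0  # few observations relative to regressors
beta = rng.normal(size=p)
X, X_new = rng.normal(size=(n, p)), rng.normal(size=(1000, p))
y = X @ beta + rng.normal(scale=sigma, size=n)
y_new = X_new @ beta + rng.normal(scale=sigma, size=1000)

for lam in [0.0, 10.0]:    # lam = 0 recovers OLS
    # Closed-form ridge estimator: (X'X + lam * I)^{-1} X'y
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    mse = np.mean((y_new - X_new @ beta_hat) ** 2)
    print(f"lambda = {lam}: test MSE = {mse:.2f}")
```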
20.3.4 Applications in Economics
In economics (and related social sciences), prediction plays an increasingly prominent role (Mullainathan and Spiess 2017; Athey and Imbens 2019):
- Measure Variables: Predicting missing or proxy variables (e.g., predicting income from observable covariates, or predicting individual preferences from online behaviors).
- Embed Prediction Tasks Within Parameter Estimation or Treatment Effects: Sometimes, a first-stage prediction (e.g., imputing missing data or generating prognostic scores) is used as an input for subsequent causal analyses.
- Control for Observed Confounders: Machine learning methods—such as LASSO, random forests, or neural nets—can be used to control for high-dimensional X when doing partial-out adjustments or residualizing outcomes (Belloni, Chernozhukov, and Hansen 2014; Chernozhukov et al. 2018).