20.7 Extended Mathematical Points

We now delve deeper into some mathematical nuances that are especially relevant when distinguishing between predictive and causal modeling.

20.7.1 M-Estimation and Asymptotic Theory

M-estimators unify many approaches: maximum likelihood, method of moments, generalized method of moments, and quasi-likelihood estimators. Let \beta_0 be the true parameter and define the population criterion function:

Q(\beta) = \mathbb{E}[m(\beta; X, Y)],

for some function m. The M-estimator \hat{\beta} solves:

\hat{\beta} = \arg\max_{\beta} \frac{1}{n} \sum_{i=1}^{n} m(\beta; x_i, y_i).

(Or \arg\min, depending on convention.)

Under regularity conditions (Newey and McFadden 1994; White 1980), we have:

  • Consistency: \hat{\beta} \overset{p}{\to} \beta_0.
  • Asymptotic Normality: \sqrt{n}(\hat{\beta} - \beta_0) \overset{d}{\to} N(0, \Sigma),

where \Sigma is derived from derivatives of m(\cdot; \cdot, \cdot) and the distribution of (X, Y).
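One standard formulation (for the arg-max convention, under smoothness conditions of the kind given in Newey and McFadden 1994) is the familiar sandwich form:

\Sigma = A^{-1} B A^{-1}, \quad A = \mathbb{E}\bigl[-\nabla^2_{\beta}\, m(\beta_0; X, Y)\bigr], \quad B = \mathbb{E}\bigl[\nabla_{\beta}\, m(\beta_0; X, Y)\, \nabla_{\beta}\, m(\beta_0; X, Y)^{\top}\bigr].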

For prediction, such classical asymptotic properties may be of less interest unless we want to build confidence intervals around predictions. For causal inference, the entire enterprise revolves around these properties to ensure valid inference about \beta_0.
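As a toy illustration of consistency (not from the text), the following NumPy sketch estimates a single slope by maximizing the sample analogue of Q(\beta) over a grid of candidate values, using negative mean squared error as the criterion m. The simulated data, grid-search routine, and parameter values are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0 = 2.0  # true parameter (illustrative)

def m_hat(b, x, y):
    """Sample criterion: negative mean squared error, to be maximized."""
    return -np.mean((y - b * x) ** 2)

def m_estimate(x, y, grid):
    """Maximize the sample criterion over a grid of candidate slopes."""
    values = [m_hat(b, x, y) for b in grid]
    return grid[int(np.argmax(values))]

grid = np.linspace(0.0, 4.0, 4001)  # step size 0.001
errors = []
for n in (100, 10_000):
    x = rng.normal(size=n)
    y = beta0 * x + rng.normal(size=n)
    errors.append(abs(m_estimate(x, y, grid) - beta0))

# errors for n = 100 and n = 10,000; the larger sample is
# typically far closer to beta0, illustrating consistency
print(errors)
```

Because the criterion is quadratic, the grid maximizer here coincides (up to the grid's resolution) with the ordinary least squares estimate.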

20.7.2 The Danger of Omitted Variables

Consider a structural equation:

Y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon, \quad \mathbb{E}[\varepsilon \mid X_1, X_2] = 0.

If we ignore X_2 and regress Y on X_1 only, the resulting \hat{\beta}_1 can be severely biased:

\hat{\beta}_1 = \arg\min_{b} \sum_{i=1}^n \bigl(y_i - b\,x_{i1}\bigr)^2.

In large samples, \hat{\beta}_1 converges in probability to:

\beta_1 \;+\; \beta_2 \,\frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)}.

This extra term, \beta_2 \,\frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)}, is the omitted-variable bias. For prediction, omitting X_2 may be acceptable if it adds little incremental predictive value or if we only care about accuracy on the population where the model was fit. For inference on \beta_1, however, ignoring a correlated X_2 invalidates the causal interpretation.
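The bias formula can be checked numerically. The NumPy sketch below simulates correlated regressors (the coefficient values and the 0.8 correlation are illustrative assumptions), runs the short regression of Y on X_1 alone, and compares it to \beta_1 + \beta_2\,\mathrm{Cov}(X_1, X_2)/\mathrm{Var}(X_1):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta1, beta2 = 1.0, 0.5  # structural coefficients (illustrative)

# Correlated regressors: X2 = 0.8 * X1 + noise, so Cov(X1, X2) = 0.8.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

# Short regression of Y on X1 alone (no intercept, as in the text).
b1_short = (x1 @ y) / (x1 @ x1)

# Omitted-variable-bias formula: beta1 + beta2 * Cov(X1, X2) / Var(X1).
b1_predicted = beta1 + beta2 * np.cov(x1, x2)[0, 1] / np.var(x1)

print(b1_short, b1_predicted)  # both near 1.4, far from beta1 = 1.0
```

With Cov(X_1, X_2) = 0.8 and Var(X_1) = 1, the bias is 0.5 \times 0.8 = 0.4, so the short regression recovers roughly 1.4 rather than the structural \beta_1 = 1.0.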

20.7.3 Cross-Validation vs. Statistical Testing

  • Cross-Validation: Predominantly used in prediction tasks. We split the data into training and validation sets, measure out-of-sample error, and select hyperparameters that minimize CV error.

  • Statistical Testing: Predominantly used in inference tasks. We compute test statistics (e.g., t-test, Wald test), form confidence intervals, or test hypotheses about parameters (H_0: \beta_j = 0).

They serve different objectives:

  1. CV is about predictive model selection.
  2. Testing is about scientific or policy conclusions on whether \beta_j differs from zero (i.e., “Does a particular variable have a causal effect?”).
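The contrast can be made concrete on a single simulated data set in which the second regressor is truly irrelevant (\beta_2 = 0). The NumPy sketch below (the 5-fold splitter and the no-intercept OLS setup are illustrative choices) computes a cross-validated MSE for model selection and a t-statistic for inference:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=(n, 2))
beta = np.array([1.0, 0.0])  # second coefficient is truly zero
y = x @ beta + rng.normal(size=n)

# --- Cross-validation: which model predicts better out of sample? ---
def cv_mse(X, y, k=5):
    """k-fold cross-validated MSE for OLS without intercept."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((y[f] - X[f] @ b) ** 2))
    return np.mean(errs)

mse_full = cv_mse(x, y)          # both regressors
mse_small = cv_mse(x[:, :1], y)  # irrelevant regressor dropped

# --- Statistical test: is beta_2 distinguishable from zero? ---
b, *_ = np.linalg.lstsq(x, y, rcond=None)
resid = y - x @ b
sigma2 = resid @ resid / (n - 2)
se = np.sqrt(sigma2 * np.linalg.inv(x.T @ x)[1, 1])
t_stat = b[1] / se

print(mse_full, mse_small, t_stat)
```

The two CV errors are close (dropping an irrelevant regressor costs little predictively), while the t-statistic addresses the inferential question directly: since \beta_2 = 0 here, it is typically small in magnitude, and we fail to reject H_0: \beta_2 = 0.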

References

Newey, Whitney K., and Daniel McFadden. 1994. “Large Sample Estimation and Hypothesis Testing.” Handbook of Econometrics 4: 2111–2245.
White, Halbert. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” Econometrica: Journal of the Econometric Society, 817–38.