20.7 Extended Mathematical Points

We now delve deeper into some mathematical nuances that are especially relevant when distinguishing between predictive and causal modeling.

20.7.1 M-Estimation and Asymptotic Theory

M-estimators unify many approaches: maximum likelihood, method of moments, generalized method of moments, and quasi-likelihood estimators. Let \beta_0 be the true parameter and define the population criterion function:

Q(\beta) = \mathbb{E}[m(\beta; X, Y)],

for some function m. The M-estimator \hat{\beta} solves:

\hat{\beta} = \arg\max_{\beta} \frac{1}{n} \sum_{i=1}^{n} m(\beta; x_i, y_i).

(Or \arg\min, depending on convention.)

Under regularity conditions (Newey and McFadden 1994; White 1980), we have:

  • Consistency: \hat{\beta} \overset{p}{\to} \beta_0.
  • Asymptotic Normality: \sqrt{n}(\hat{\beta} - \beta_0) \overset{d}{\to} N(0, \Sigma),

where \Sigma is derived from derivatives of m(\cdot; \cdot, \cdot) and the distribution of (X, Y).
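One standard formulation (for the arg-max convention, under smoothness conditions of the kind given in Newey and McFadden 1994) is the familiar sandwich form:

\Sigma = A^{-1} B A^{-1}, \quad A = \mathbb{E}\bigl[-\nabla^2_{\beta}\, m(\beta_0; X, Y)\bigr], \quad B = \mathbb{E}\bigl[\nabla_{\beta}\, m(\beta_0; X, Y)\, \nabla_{\beta}\, m(\beta_0; X, Y)^{\top}\bigr].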

For prediction, such classical asymptotic properties may be of less interest unless we want to build confidence intervals around predictions. For causal inference, the entire enterprise revolves around these properties to ensure valid inference about \beta_0.
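As a toy illustration of consistency (not from the text), the following NumPy sketch estimates a single slope by maximizing the sample analogue of Q(\beta) over a grid of candidate values, using negative mean squared error as the criterion m. The simulated data, grid-search routine, and parameter values are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0 = 2.0  # true parameter (illustrative)

def m_hat(b, x, y):
    """Sample criterion: negative mean squared error, to be maximized."""
    return -np.mean((y - b * x) ** 2)

def m_estimate(x, y, grid):
    """Maximize the sample criterion over a grid of candidate slopes."""
    values = [m_hat(b, x, y) for b in grid]
    return grid[int(np.argmax(values))]

grid = np.linspace(0.0, 4.0, 4001)  # step size 0.001
errors = []
for n in (100, 10_000):
    x = rng.normal(size=n)
    y = beta0 * x + rng.normal(size=n)
    errors.append(abs(m_estimate(x, y, grid) - beta0))

# errors for n = 100 and n = 10,000; the larger sample is
# typically far closer to beta0, illustrating consistency
print(errors)
```

Because the criterion is quadratic, the grid maximizer here coincides (up to the grid's resolution) with the ordinary least squares estimate.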

20.7.2 The Danger of Omitted Variables

Consider a structural equation:

Y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon, \quad \mathbb{E}[\varepsilon \mid X_1, X_2] = 0.

If we ignore X_2 and regress Y on X_1 only, the resulting \hat{\beta}_1 can be severely biased:

\hat{\beta}_1 = \arg\min_{b} \sum_{i=1}^n \bigl(y_i - b\,x_{i1}\bigr)^2.

In large samples, \hat{\beta}_1 converges in probability to:

\beta_1 \;+\; \beta_2 \,\frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)}.

This extra term, \beta_2 \,\frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)}, is the omitted-variable bias. For prediction, omitting X_2 may be acceptable if it adds little incremental predictive value or if we only care about accuracy on the population where the model was fit. For inference on \beta_1, however, ignoring a correlated X_2 invalidates the causal interpretation.
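The bias formula can be checked numerically. The NumPy sketch below simulates correlated regressors (the coefficient values and the 0.8 correlation are illustrative assumptions), runs the short regression of Y on X_1 alone, and compares it to \beta_1 + \beta_2\,\mathrm{Cov}(X_1, X_2)/\mathrm{Var}(X_1):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta1, beta2 = 1.0, 0.5  # structural coefficients (illustrative)

# Correlated regressors: X2 = 0.8 * X1 + noise, so Cov(X1, X2) = 0.8.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

# Short regression of Y on X1 alone (no intercept, as in the text).
b1_short = (x1 @ y) / (x1 @ x1)

# Omitted-variable-bias formula: beta1 + beta2 * Cov(X1, X2) / Var(X1).
b1_predicted = beta1 + beta2 * np.cov(x1, x2)[0, 1] / np.var(x1)

print(b1_short, b1_predicted)  # both near 1.4, far from beta1 = 1.0
```

With Cov(X_1, X_2) = 0.8 and Var(X_1) = 1, the bias is 0.5 \times 0.8 = 0.4, so the short regression recovers roughly 1.4 rather than the structural \beta_1 = 1.0.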

20.7.3 Cross-Validation vs. Statistical Testing

  • Cross-Validation: Predominantly used in prediction tasks. We split the data into training and validation sets, measure out-of-sample error, and select hyperparameters that minimize CV error.

  • Statistical Testing: Predominantly used in inference tasks. We compute test statistics (e.g., t-test, Wald test), form confidence intervals, or test hypotheses about parameters (H_0: \beta_j = 0).

They serve different objectives:

  1. CV is about predictive model selection.
  2. Testing is about scientific or policy conclusions on whether \beta_j differs from zero (i.e., “Does a particular variable have a causal effect?”).
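The contrast can be made concrete on a single simulated data set in which the second regressor is truly irrelevant (\beta_2 = 0). The NumPy sketch below (the 5-fold splitter and the no-intercept OLS setup are illustrative choices) computes a cross-validated MSE for model selection and a t-statistic for inference:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=(n, 2))
beta = np.array([1.0, 0.0])  # second coefficient is truly zero
y = x @ beta + rng.normal(size=n)

# --- Cross-validation: which model predicts better out of sample? ---
def cv_mse(X, y, k=5):
    """k-fold cross-validated MSE for OLS without intercept."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((y[f] - X[f] @ b) ** 2))
    return np.mean(errs)

mse_full = cv_mse(x, y)          # both regressors
mse_small = cv_mse(x[:, :1], y)  # irrelevant regressor dropped

# --- Statistical test: is beta_2 distinguishable from zero? ---
b, *_ = np.linalg.lstsq(x, y, rcond=None)
resid = y - x @ b
sigma2 = resid @ resid / (n - 2)
se = np.sqrt(sigma2 * np.linalg.inv(x.T @ x)[1, 1])
t_stat = b[1] / se

print(mse_full, mse_small, t_stat)
```

The two CV errors are close (dropping an irrelevant regressor costs little predictively), while the t-statistic addresses the inferential question directly: since \beta_2 = 0 here, it is typically small in magnitude, and we fail to reject H_0: \beta_2 = 0.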

References

Newey, Whitney K., and Daniel McFadden. 1994. “Large Sample Estimation and Hypothesis Testing.” Handbook of Econometrics 4: 2111–2245.
White, Halbert. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” Econometrica: Journal of the Econometric Society, 817–38.