20.6 Illustrative Equations and Mathematical Contrasts

Below, we showcase a few derivations that highlight how predictive modeling and causal inference differ in their mathematical structure and interpretation.

20.6.1 Risk Minimization vs. Consistency

Consider a real-valued outcome Y and predictors X. Let \ell(y, \hat{y}) be a loss function, and define the Bayes regressor f^* as:

f^* = \arg \min_f \mathbb{E}[\ell(Y, f(X))].

For squared error loss, the Bayes regressor is f^*(x) = \mathbb{E}[Y \mid X = x].

A learning algorithm tries to approximate f^*. If we parametrize f_\beta(x) = x^\top \beta and perform empirical risk minimization, then with a large enough sample the estimate \hat{\beta} converges to the minimizer:

\beta^* = \arg \min_\beta \mathbb{E}[(Y - X^\top \beta)^2].

Note that \beta^* solves the normal equations \mathbb{E}[XX^\top] \beta = \mathbb{E}[XY]. If \mathbb{E}[XX^\top] is invertible, then

\beta^* = \mathbb{E}[XX^\top]^{-1} \mathbb{E}[XY],

which, for centered variables (or once an intercept is included), is the familiar \text{Cov}(X, X)^{-1} \text{Cov}(X, Y).

This \beta^* is not necessarily the same as the “true” \beta_0 from a structural equation Y = X^\top \beta_0 + \varepsilon unless \mathbb{E}[\varepsilon \mid X] = 0.

From a predictive standpoint, \beta^* is the best linear predictor in the sense of mean squared error. From a causal standpoint, we want \beta_0 such that \varepsilon is mean-independent of X. If that exogeneity condition fails, \beta^* generally differs from \beta_0.
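
To make the gap concrete, here is a minimal simulation sketch in Python (the data-generating process, the coefficient values, and the use of numpy are illustrative assumptions made for this example, not part of the text): an unobserved confounder U drives both X and \varepsilon, so the best linear predictor coefficient differs from the structural \beta_0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Structural equation: Y = beta_0 * X + eps, but eps and X share an unobserved
# driver U, so E[eps | X] != 0 and exogeneity fails.
beta_0 = 1.0
u = rng.normal(size=n)                    # unobserved confounder
x = 0.8 * u + rng.normal(size=n)          # X depends on U
eps = 0.8 * u + rng.normal(size=n)        # so does the structural error
y = beta_0 * x + eps

# Sample analogue of beta* = Cov(X, X)^{-1} Cov(X, Y) (scalar, mean-zero case).
beta_star = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print("structural beta_0:", beta_0)
print("best linear predictor beta*:", round(beta_star, 3))
```

In this design the population value is \text{Cov}(X, Y)/\text{Var}(X) = 2.28/1.64 \approx 1.39, so the best predictor’s coefficient sits well above the structural value of 1.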

20.6.2 Partial Derivatives vs. Predictions

A powerful way to see the difference is to compare:

  • \frac{\partial}{\partial x} f^*(x) – The partial derivative of the best predictor w.r.t. x. This is about how the model’s prediction changes with x.
  • \frac{\partial}{\partial x} m_\beta(x) – The partial derivative of the structural function m_\beta(\cdot). This is about how the outcome Y itself changes when x is manipulated, i.e., a causal effect if m_\beta is indeed structural.

Unless the causal model is identified and its assumptions hold (exogeneity, no omitted variables, etc.), the partial derivative from a purely predictive model does not represent a causal effect.

In short: “slopes” from a black-box predictive model are not guaranteed to reflect how interventions on X would shift Y.
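
A small sketch of this point (the data-generating process and the choice of scikit-learn’s GradientBoostingRegressor are assumptions made for the example): the structural slope is 2 everywhere, but because X is confounded the best predictor of Y from X alone has slope 3.5 in this design, and a flexible fitted model recovers something close to the latter, not the former.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 50_000

# Structural equation: Y = 2*X + 3*U + noise, so the causal slope dY/dX is 2.
# X is confounded by the unobserved U; here E[Y | X = x] = 3.5 * x.
u = rng.normal(size=n)
x = u + rng.normal(size=n)
y = 2.0 * x + 3.0 * u + rng.normal(size=n)

# Fit a flexible "black-box" predictor of Y from X alone (U is unobserved) ...
model = GradientBoostingRegressor().fit(x.reshape(-1, 1), y)

# ... and take a finite-difference slope of its prediction surface around x = 0.
h = 0.5
slope_hat = (model.predict([[h]])[0] - model.predict([[-h]])[0]) / (2 * h)

print("structural derivative dm/dx:", 2.0)
print("slope of the fitted predictor near x = 0:", round(slope_hat, 2))
```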

20.6.3 Example: High-Dimensional Regularization

Suppose we have a large number of predictors p, possibly p \gg n. A common approach in both prediction and inference is LASSO:

\hat{\beta}_{\text{LASSO}} = \arg \min_\beta \left\{ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_1 \right\}.

  • Prediction: Choose \lambda to optimize out-of-sample MSE. Some bias is introduced in \hat{\beta}, but the final model might predict extremely well, especially if many true coefficients are near zero.
  • Causal Estimation: We must worry about whether the LASSO is shrinking or zeroing out confounders. If a crucial confounder’s coefficient is set to zero, the resulting estimate of a treatment variable’s coefficient will be biased. Therefore, special procedures are used to correct for this selection bias or to conduct valid post-selection inference, such as double/debiased machine learning (Chernozhukov et al. 2018) and post-double-selection methods (Belloni, Chernozhukov, and Hansen 2014).

The mathematics of tuning a model for predictive accuracy and the mathematics of obtaining confidence intervals with valid coverage for parameters diverge significantly; the sketch below illustrates the two workflows.
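
This is a minimal sketch contrasting the two workflows (the simulated design, the use of scikit-learn’s LassoCV, and the simple cross-fitted partialling-out estimator are illustrative assumptions, not a full double/debiased ML implementation): part (a) tunes a LASSO for out-of-sample fit and reads off the coefficient on the treatment d, while part (b) residualizes y and d on the controls using held-out folds and then regresses residual on residual.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
n, p = 200, 400                  # p >> n; sizes are illustrative
theta = 1.0                      # treatment coefficient of interest

X = rng.normal(size=(n, p))
d = 2.0 * X[:, 0] + rng.normal(size=n)              # treatment driven by a confounder
y = theta * d + 1.0 * X[:, 0] + rng.normal(size=n)

# (a) Prediction: LASSO of y on (d, X) with a CV-chosen penalty. Good for
#     out-of-sample MSE, but no guarantee the coefficient on d is unbiased
#     for theta, since the confounder's coefficient may be shrunk or zeroed.
Z = np.column_stack([d, X])
naive = LassoCV(cv=5).fit(Z, y)
print("prediction-tuned LASSO coefficient on d:", round(naive.coef_[0], 3))

# (b) Partialling-out with cross-fitting, in the spirit of double/debiased ML:
#     residualize y and d on X with models fit on held-out folds, then regress
#     the y-residuals on the d-residuals.
y_res, d_res = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    y_res[test] = y[test] - LassoCV(cv=5).fit(X[train], y[train]).predict(X[test])
    d_res[test] = d[test] - LassoCV(cv=5).fit(X[train], d[train]).predict(X[test])
theta_hat = (d_res @ y_res) / (d_res @ d_res)
print("cross-fitted partialling-out estimate:", round(theta_hat, 3))
```

The partialling-out step targets the coefficient \theta directly; valid inference would additionally require the standard errors developed in the cited papers.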

20.6.4 Potential Outcomes Notation

Let D \in \{0, 1\} be a treatment indicator, and define potential outcomes:

Y_i(0), Y_i(1).

The observed outcome is:

Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0).

  • Prediction: One might train a model \hat{Y} = \hat{f}(X, D) to guess Y from (X, D). That model could be a black box, with no guarantee that \hat{f}(x, 1) - \hat{f}(x, 0) is an unbiased estimate of the treatment effect Y_i(1) - Y_i(0).
  • Causal Inference: We want to estimate \mathbb{E}[Y(1) - Y(0)] or \mathbb{E}[Y(1) - Y(0) \mid X = x]. Identification typically requires \{Y(0), Y(1)\} \perp D \mid X, i.e., after conditioning on X, the treatment assignment is as-if random. Under such an assumption, the difference \hat{f}(x, 1) - \hat{f}(x, 0) can be interpreted as a causal effect.
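
As an illustration of the second bullet, here is a sketch (the simulated potential outcomes, the propensity model, and the use of scikit-learn’s GradientBoostingRegressor are assumptions made for this example) in which treatment depends only on X, so unconfoundedness holds by construction and \hat{f}(x, 1) - \hat{f}(x, 0) estimates the conditional treatment effect.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
n = 20_000

# Treatment depends on X only, so {Y(0), Y(1)} are independent of D given X.
x = rng.uniform(-2, 2, size=n)
p_treat = 1 / (1 + np.exp(-x))                   # propensity score
d = rng.binomial(1, p_treat)

y0 = np.sin(x) + rng.normal(scale=0.5, size=n)   # potential outcome Y(0)
y1 = y0 + 1.0 + 0.5 * x                          # Y(1): heterogeneous effect 1 + 0.5*x
y = d * y1 + (1 - d) * y0                        # observed outcome

# Fit f_hat(X, D) and read off f_hat(x, 1) - f_hat(x, 0).
f_hat = GradientBoostingRegressor().fit(np.column_stack([x, d]), y)

x_grid = np.array([-1.0, 0.0, 1.0])
cate_hat = (f_hat.predict(np.column_stack([x_grid, np.ones(3)]))
            - f_hat.predict(np.column_stack([x_grid, np.zeros(3)])))
print("estimated effect at x = -1, 0, 1:", np.round(cate_hat, 2))  # true: 0.5, 1.0, 1.5

ate_hat = np.mean(f_hat.predict(np.column_stack([x, np.ones(n)]))
                  - f_hat.predict(np.column_stack([x, np.zeros(n)])))
print("true ATE:", round(np.mean(y1 - y0), 2), "| estimated ATE:", round(ate_hat, 2))
```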

References

Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. 2014. “High-Dimensional Methods and Inference on Structural and Treatment Effects.” Journal of Economic Perspectives 28 (2): 29–50.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–68.