20.6 Illustrative Equations and Mathematical Contrasts

Below, we showcase a few derivations that highlight how predictive modeling and causal inference differ in their mathematical structure and interpretation.

20.6.1 Risk Minimization vs. Consistency

Consider a real-valued outcome Y and predictors X. Let \ell(y, \hat{y}) be a loss function, and define the Bayes regressor f^* as:

f^* = \arg\min_{f} \mathbb{E}[\ell(Y, f(X))].

For squared error loss, the Bayes regressor is f^*(x) = \mathbb{E}[Y \mid X = x].

A learning algorithm tries to approximate f^*. If we parametrize f_\beta(x) = x'\beta and do empirical risk minimization with a large enough sample, \hat{\beta} converges to the minimizer of:

\beta^* = \arg\min_{\beta} \mathbb{E}[(Y - X'\beta)^2].

Note that \beta^* is the solution to \mathbb{E}[XX']\beta^* = \mathbb{E}[XY]. If \mathrm{Cov}(X, X) is invertible, then

\beta^* = \mathrm{Cov}(X, X)^{-1} \mathrm{Cov}(X, Y).

This \beta^* is not necessarily the same as the “true” \beta_0 from a structural equation Y = X'\beta_0 + \varepsilon unless \mathbb{E}[\varepsilon \mid X] = 0.

From a predictive standpoint, \beta^* is the best linear predictor in the sense of mean squared error. From a causal standpoint, we want \beta_0 such that \varepsilon is mean-independent of X. If that fails, \beta^* \neq \beta_0.
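As a quick illustration, here is a minimal simulation sketch (the coefficients and data-generating process are made up for this example): the structural slope is \beta_0 = 1, but the error shares a component with X, so the population best linear predictor slope \mathrm{Cov}(X, Y)/\mathrm{Var}(X) lands at 1.5 instead.

```python
import numpy as np

# Hypothetical data-generating process: X and the structural error eps
# share a common component v, so E[eps | X] != 0 (endogeneity).
rng = np.random.default_rng(0)
n = 100_000
z = rng.standard_normal(n)
v = rng.standard_normal(n)   # common component driving both X and eps
u = rng.standard_normal(n)
x = z + v                    # Var(X) = 2
eps = v + u                  # Cov(X, eps) = 1
beta0 = 1.0
y = beta0 * x + eps          # structural equation Y = X * beta0 + eps

# Best linear predictor slope: Cov(X, Y) / Var(X) = (beta0 * 2 + 1) / 2 = 1.5
beta_star = np.cov(x, y)[0, 1] / np.var(x)
print(f"beta0 = {beta0}, beta* ~ {beta_star:.3f}")
```

The gap \mathrm{Cov}(X, \varepsilon)/\mathrm{Var}(X) = 0.5 is exactly the endogeneity bias: the best *predictor* of Y soaks up the correlation between X and \varepsilon that a causal estimate must exclude.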

20.6.2 Partial Derivatives vs. Predictions

A powerful way to see the difference is to compare:

  • \partial f^*(x)/\partial x – The partial derivative of the best predictor w.r.t. x. This is about how the model’s prediction changes with x.
  • \partial m_\beta(x)/\partial x – The partial derivative of the structural function m_\beta(\cdot). This is about how the true outcome Y changes with x, i.e., a causal effect if m_\beta is indeed structural.

Unless the model was identified and the assumptions hold (exogeneity, no omitted variables, etc.), the partial derivative from a purely predictive model does not represent the causal effect.

In short: “slopes” from a black-box predictive model are not guaranteed to reflect how interventions on X would shift Y.
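One way to see this on paper, as a sketch with a single omitted variable U (assumed here purely for illustration): suppose the structural model is linear in X and U. Then the best predictor's slope picks up an extra term:

```latex
Y = \beta X + \gamma U + \varepsilon,
\qquad \mathbb{E}[\varepsilon \mid X, U] = 0,
\qquad
f^*(x) = \mathbb{E}[Y \mid X = x]
       = \beta x + \gamma\,\mathbb{E}[U \mid X = x],
\qquad
\frac{\partial f^*(x)}{\partial x}
       = \beta + \gamma\,\frac{\partial}{\partial x}\,\mathbb{E}[U \mid X = x].
```

The predictive slope equals the causal effect \beta only when the second term vanishes, i.e., when U is mean-independent of X.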

20.6.3 Example: High-Dimensional Regularization

Suppose we have a large number of predictors p, possibly p \gg n. A common approach in both prediction and inference is the LASSO:

\hat{\beta}^{\text{LASSO}} = \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + \lambda \|\beta\|_1 \right\}.

  • Prediction: Choose \lambda to optimize out-of-sample MSE. Some bias is introduced in \hat{\beta}, but the final model might predict extremely well, especially if many true coefficients are near zero.
  • Causal Estimation: We must worry about whether the LASSO is shrinking or zeroing out confounders. If a crucial confounder’s coefficient is set to zero, the resulting estimate for a treatment variable’s coefficient will be biased. Therefore, special procedures (like the double/debiased machine learning approach (Chernozhukov et al. 2018)) are introduced to correct for the selection bias or to do post-selection inference (Belloni, Chernozhukov, and Hansen 2014).

The mathematics of choosing a “best subset” for prediction diverges significantly from that of constructing valid coverage intervals for parameters.
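To make the confounder-dropping failure mode concrete, here is a hedged simulation sketch (all coefficients, and the `lasso_cd` helper, are hypothetical and for illustration only, not taken from the cited papers). The first column of X predicts Y only weakly, so selecting controls from the outcome equation alone drops it, even though it strongly drives treatment. Selecting from both equations, in the double-selection spirit of Belloni, Chernozhukov, and Hansen (2014), keeps it and recovers the true (zero) treatment effect:

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=200):
    """Plain coordinate-descent LASSO: min (1/2n)||y - Xb||^2 + alpha*||b||_1.
    A toy solver for illustration, not a production implementation."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]                 # partial residual
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / np.mean(X[:, j] ** 2)
    return b

def ols_slope(D, Y, controls):
    """OLS of Y on (1, D, controls); returns the coefficient on D."""
    Z = np.column_stack([np.ones(len(D)), D, controls])
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return coef[1]

rng = np.random.default_rng(1)
n, p = 5_000, 10
X = rng.standard_normal((n, p))
# X[:, 0] barely predicts Y (coef 0.3) but strongly predicts D (coef 3.0):
# exactly the case where outcome-only selection drops a crucial confounder.
D = 3.0 * X[:, 0] + rng.standard_normal(n)
Y = 0.0 * D + 0.3 * X[:, 0] + rng.standard_normal(n)       # true effect = 0

# Single selection: pick controls from the outcome equation only.
sel_y = np.flatnonzero(lasso_cd(X, Y, alpha=0.5) != 0)     # drops X[:, 0]
# Double selection: also pick controls from the treatment equation.
sel_d = np.flatnonzero(lasso_cd(X, D, alpha=0.5) != 0)     # keeps X[:, 0]
union = np.union1d(sel_y, sel_d)

b_single = ols_slope(D, Y, X[:, sel_y])
b_double = ols_slope(D, Y, X[:, union])
print(f"single selection: {b_single:.3f}, double selection: {b_double:.3f}")
```

The single-selection estimate absorbs the omitted confounder's effect, while the union of the two selected sets restores the control and drives the treatment coefficient back toward zero.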

20.6.4 Potential Outcomes Notation

Let D \in \{0, 1\} be a treatment indicator, and define potential outcomes:

Y_i(0), Y_i(1).

The observed outcome is:

Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0).

  • Prediction: One might train a model \hat{Y} = \hat{f}(X, D) to guess Y from (X, D). That model could be a black box with no guarantee that \hat{f}(X, 1) - \hat{f}(X, 0) is an unbiased estimate of Y_i(1) - Y_i(0).
  • Causal Inference: We want to estimate \mathbb{E}[Y(1) - Y(0)] or \mathbb{E}[Y(1) - Y(0) \mid X = x]. Identification typically requires \{Y(0), Y(1)\} \perp D \mid X, i.e., after conditioning on X, the treatment assignment is as-if random. Under such an assumption, the difference \hat{f}(x, 1) - \hat{f}(x, 0) can be interpreted as a causal effect.
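A small simulation sketch of the last point (the treatment effect \tau = 2 and the propensity function are hypothetical choices for this example): treatment assignment depends on X, so the raw difference in observed group means is badly biased, while a model that conditions on X, here a simple OLS fit of \hat{f}(X, D), recovers the effect via \hat{f}(x, 1) - \hat{f}(x, 0):

```python
import numpy as np

# Unconfoundedness holds by construction: D depends on X only.
rng = np.random.default_rng(2)
n = 20_000
X = rng.standard_normal(n)
prop = 1.0 / (1.0 + np.exp(-2.0 * X))          # propensity score P(D=1 | X)
D = (rng.uniform(size=n) < prop).astype(float)
Y = 2.0 * D + 3.0 * X + rng.standard_normal(n)  # true effect tau = 2

# Naive prediction-style contrast: difference in observed group means.
naive = Y[D == 1].mean() - Y[D == 0].mean()

# Adjusted model f_hat(X, D): OLS of Y on (1, D, X). Under unconfoundedness,
# f_hat(x, 1) - f_hat(x, 0) equals the coefficient on D.
Z = np.column_stack([np.ones(n), D, X])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
adjusted = coef[1]
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}  (true effect = 2.0)")
```

Because treated units tend to have higher X, the naive contrast mixes the treatment effect with the effect of X itself; conditioning on X removes that confounding.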

References

Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. 2014. “High-Dimensional Methods and Inference on Structural and Treatment Effects.” Journal of Economic Perspectives 28 (2): 29–50.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–68.