20.6 Illustrative Equations and Mathematical Contrasts
Below, we showcase a few derivations that highlight how predictive modeling and causal inference differ in their mathematical structure and interpretation.
20.6.1 Risk Minimization vs. Consistency
Consider a real-valued outcome Y and predictors X. Let \ell(y, \hat{y}) be a loss function, and define the Bayes regressor f^* as:
f^* = \arg\min_f \mathbb{E}[\ell(Y, f(X))].
For squared error loss, the Bayes regressor is f^*(x) = \mathbb{E}[Y \mid X = x].
A learning algorithm tries to approximate f^*. If we parametrize f_\beta(x) = x^\top \beta and perform empirical risk minimization, then with a large enough sample \hat{\beta} converges to the minimizer:
\beta^* = \arg\min_\beta \mathbb{E}[(Y - X^\top \beta)^2].
Note that \beta^* solves the normal equations \mathbb{E}[X X^\top] \beta = \mathbb{E}[X Y]. If \mathrm{Cov}(X, X) is invertible, then
\beta^* = \mathrm{Cov}(X, X)^{-1} \mathrm{Cov}(X, Y).
This \beta^* is not necessarily the same as the “true” \beta_0 from a structural equation Y = X^\top \beta_0 + \varepsilon unless \mathbb{E}[\varepsilon \mid X] = 0.
From a predictive standpoint, \beta^* is the best linear predictor in the mean-squared-error sense. From a causal standpoint, we want \beta_0 such that \varepsilon is mean-independent of X. If that condition fails, \beta^* \neq \beta_0.
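A minimal simulation (ours, not drawn from the references) makes the gap between \beta^* and \beta_0 concrete: we generate data from a structural equation whose error is correlated with X (so \mathbb{E}[\varepsilon \mid X] \neq 0) and recover the best linear predictor's slope, which differs from the structural coefficient by exactly \mathrm{Cov}(X, \varepsilon)/\mathrm{Var}(X).

```python
# Simulate Y = X * beta0 + eps where eps is correlated with X,
# so the best linear predictor beta* differs from the structural beta0.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta0 = 2.0

u = rng.normal(size=n)                # unobserved common factor
x = u + rng.normal(size=n)            # X is correlated with u
eps = u + 0.5 * rng.normal(size=n)    # ... hence E[eps | X] != 0
y = beta0 * x + eps

# Best linear predictor's slope: beta* = Cov(X, Y) / Var(X)
beta_star = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
print(beta_star)  # close to beta0 + Cov(X, eps)/Var(X) = 2 + 1/2 = 2.5
```

Here Cov(X, eps) = Var(u) = 1 and Var(X) = 2, so the predictive slope settles near 2.5 even though the structural coefficient is 2.0.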
20.6.2 Partial Derivatives vs. Predictions
A powerful way to see the difference is to compare:
- \frac{\partial}{\partial x} f^*(x) – The partial derivative of the best predictor w.r.t. x. This is about how the model’s prediction changes with x.
- \frac{\partial}{\partial x} m_\beta(x) – The partial derivative of the structural function m_\beta(\cdot). This is about how the true outcome Y changes with x, i.e., a causal effect if m_\beta is indeed structural.
Unless the model was identified and the assumptions hold (exogeneity, no omitted variables, etc.), the partial derivative from a purely predictive model does not represent the causal effect.
In short: “slopes” from a black-box predictive model are not guaranteed to reflect how interventions on X would shift Y.
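The contrast can be simulated directly (an illustrative sketch of our own): we compute the slope of the best predictor \mathbb{E}[Y \mid X = x] from observational data and compare it with the slope obtained by actually intervening on X, holding the unobserved factor fixed.

```python
# Contrast the predictive slope d/dx E[Y | X = x] with the causal slope
# obtained by intervening on X (do(X = x)) in the same structural model.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def structural(x, u, noise):
    return 1.0 * x + 2.0 * u + noise   # true causal slope in x is 1.0

u = rng.normal(size=n)
x_obs = u + rng.normal(size=n)         # observational X depends on u
y_obs = structural(x_obs, u, rng.normal(size=n))

# Predictive slope: the best linear predictor's coefficient on X
pred_slope = np.cov(x_obs, y_obs, ddof=1)[0, 1] / np.var(x_obs, ddof=1)

# Causal slope: set X by intervention, irrespective of u
y_do0 = structural(np.full(n, 0.0), u, rng.normal(size=n))
y_do1 = structural(np.full(n, 1.0), u, rng.normal(size=n))
causal_slope = y_do1.mean() - y_do0.mean()

print(pred_slope, causal_slope)  # roughly 2.0 vs 1.0
```

The predictive slope absorbs the confounding path through u (here it roughly doubles the true effect), while the interventional contrast recovers the structural slope of 1.0.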
20.6.3 Example: High-Dimensional Regularization
Suppose we have a large number of predictors p, possibly p \gg n. A common approach in both prediction and inference is the LASSO:
\hat{\beta}_{\text{LASSO}} = \arg\min_\beta \left\{ \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_1 \right\}.
- Prediction: Choose \lambda to optimize out-of-sample MSE. Some bias is introduced in \hat{\beta}, but the final model might predict extremely well, especially if many true coefficients are near zero.
- Causal Estimation: We must worry about whether the LASSO is shrinking or zeroing out confounders. If a crucial confounder’s coefficient is set to zero, the resulting estimate for a treatment variable’s coefficient will be biased. Therefore, special procedures (like the double/debiased machine learning approach (Chernozhukov et al. 2018)) are introduced to correct for the selection bias or to do post-selection inference (Belloni, Chernozhukov, and Hansen 2014).
The mathematics of choosing a “best” model for prediction and of constructing valid coverage intervals for parameters diverge significantly.
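A stripped-down simulation (ours; it substitutes simple OLS for the LASSO machinery) shows what is at stake when selection zeroes out a confounder: dropping the confounder's column, as an aggressive penalty might do, biases the treatment coefficient upward.

```python
# Show omitted-variable bias in the treatment coefficient when a confounder
# is dropped from the regression, mimicking a penalty zeroing it out.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

w = rng.normal(size=n)                       # confounder
d = 0.8 * w + rng.normal(size=n)             # treatment depends on w
y = 1.0 * d + 2.0 * w + rng.normal(size=n)   # true treatment effect is 1.0

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_full = ols(np.column_stack([d, w]), y)     # confounder kept
b_drop = ols(d[:, None], y)                  # confounder "selected away"

print(b_full[0], b_drop[0])  # near 1.0 vs noticeably larger
```

The omitted-variable formula predicts the short regression's coefficient as 1 + 2 \cdot \mathrm{Cov}(D, W)/\mathrm{Var}(D) \approx 1.98, which is why debiasing or post-selection procedures are needed when the goal is the parameter rather than the prediction.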
20.6.4 Potential Outcomes Notation
Let D \in \{0, 1\} be a treatment indicator, and define potential outcomes:
Y_i(0), Y_i(1).
The observed outcome is:
Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0).
- Prediction: One might train a model \hat{Y} = \hat{f}(X, D) to predict Y from (X, D). That model could be a black box with no guarantee that \hat{f}(X_i, 1) - \hat{f}(X_i, 0) is an unbiased estimate of Y_i(1) - Y_i(0).
- Causal Inference: We want to estimate \mathbb{E}[Y(1) - Y(0)] or \mathbb{E}[Y(1) - Y(0) \mid X = x]. Identification typically requires \{Y(0), Y(1)\} \perp D \mid X, i.e., after conditioning on X, the treatment assignment is as-if random. Under such an assumption, the difference \hat{f}(x, 1) - \hat{f}(x, 0) can be interpreted as a causal effect.
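A small potential-outcomes simulation (our own sketch, using a correctly specified linear model for \hat{f}) illustrates the identification claim: when treatment assignment depends only on X, the fitted contrast \hat{f}(x, 1) - \hat{f}(x, 0) recovers \mathbb{E}[Y(1) - Y(0)].

```python
# Potential outcomes with unconfoundedness given X: treatment probability
# depends only on X, and a regression contrast recovers the ATE.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-x))          # P(D = 1 | X) depends on X only
d = rng.binomial(1, p)
y0 = 2.0 * x + rng.normal(size=n)     # potential outcome Y(0)
y1 = y0 + 3.0                         # constant effect: Y(1) - Y(0) = 3
y = d * y1 + (1 - d) * y0             # observed outcome

# Fit a linear model f_hat(X, D) on the observed data
design = np.column_stack([np.ones(n), x, d])
coef = np.linalg.lstsq(design, y, rcond=None)[0]

# ATE estimate: f_hat(x, 1) - f_hat(x, 0) is the coefficient on D
ate_hat = coef[2]
print(ate_hat)  # near the true ATE of 3.0
```

Note that the naive difference in means \bar{Y}_{D=1} - \bar{Y}_{D=0} would be biased here, because treated units have systematically larger X; conditioning on X in the model is what makes the contrast causal.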