20.6 Illustrative Equations and Mathematical Contrasts
Below, we showcase a few derivations that highlight how predictive modeling and causal inference differ in their mathematical structure and interpretation.
20.6.1 Risk Minimization vs. Consistency
Consider a real-valued outcome Y and predictors X. Let \ell(y, \hat{y}) be a loss function, and define the Bayes regressor f^* as:
f^* = \arg\min_f \mathbb{E}[\ell(Y, f(X))].
For squared error loss, the Bayes regressor is f^*(x) = \mathbb{E}[Y \mid X = x].
A learning algorithm tries to approximate f^*. If we parametrize f_\beta(x) = x^\top \beta and perform empirical risk minimization, then with a large enough sample \hat{\beta} converges to the minimizer:
\beta^* = \arg\min_\beta \mathbb{E}[(Y - X^\top \beta)^2].
Note that \beta^* solves the normal equations \mathbb{E}[X X^\top] \beta = \mathbb{E}[X Y]. If \mathrm{Cov}(X, X) is invertible, then
\beta^* = \mathrm{Cov}(X, X)^{-1} \mathrm{Cov}(X, Y).
This \beta^* is not necessarily the same as the “true” \beta_0 from a structural equation Y = X^\top \beta_0 + \varepsilon unless \mathbb{E}[\varepsilon \mid X] = 0.
From a predictive standpoint, \beta^* is the best linear predictor in the mean-squared-error sense. From a causal standpoint, we want \beta_0 such that \varepsilon is mean-independent of X. If that condition fails, \beta^* \neq \beta_0.
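A minimal simulation (ours, not drawn from the references) makes the gap between \beta^* and \beta_0 concrete: we generate data from a structural equation whose error is correlated with X (so \mathbb{E}[\varepsilon \mid X] \neq 0) and recover the best linear predictor's slope, which differs from the structural coefficient by exactly \mathrm{Cov}(X, \varepsilon)/\mathrm{Var}(X).

```python
# Simulate Y = X * beta0 + eps where eps is correlated with X,
# so the best linear predictor beta* differs from the structural beta0.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta0 = 2.0

u = rng.normal(size=n)                # unobserved common factor
x = u + rng.normal(size=n)            # X is correlated with u
eps = u + 0.5 * rng.normal(size=n)    # ... hence E[eps | X] != 0
y = beta0 * x + eps

# Best linear predictor's slope: beta* = Cov(X, Y) / Var(X)
beta_star = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
print(beta_star)  # close to beta0 + Cov(X, eps)/Var(X) = 2 + 1/2 = 2.5
```

Here Cov(X, eps) = Var(u) = 1 and Var(X) = 2, so the predictive slope settles near 2.5 even though the structural coefficient is 2.0.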
20.6.2 Partial Derivatives vs. Predictions
A powerful way to see the difference is to compare:
- \frac{\partial}{\partial x} f^*(x) – The partial derivative of the best predictor w.r.t. x. This is about how the model’s prediction changes with x.
- \frac{\partial}{\partial x} m_\beta(x) – The partial derivative of the structural function m_\beta(\cdot). This is about how the true outcome Y changes with x, i.e., a causal effect if m_\beta is indeed structural.
Unless the model was identified and the assumptions hold (exogeneity, no omitted variables, etc.), the partial derivative from a purely predictive model does not represent the causal effect.
In short: “slopes” from a black-box predictive model are not guaranteed to reflect how interventions on X would shift Y.
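The contrast can be simulated directly (an illustrative sketch of our own): we compute the slope of the best predictor \mathbb{E}[Y \mid X = x] from observational data and compare it with the slope obtained by actually intervening on X, holding the unobserved factor fixed.

```python
# Contrast the predictive slope d/dx E[Y | X = x] with the causal slope
# obtained by intervening on X (do(X = x)) in the same structural model.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def structural(x, u, noise):
    return 1.0 * x + 2.0 * u + noise   # true causal slope in x is 1.0

u = rng.normal(size=n)
x_obs = u + rng.normal(size=n)         # observational X depends on u
y_obs = structural(x_obs, u, rng.normal(size=n))

# Predictive slope: the best linear predictor's coefficient on X
pred_slope = np.cov(x_obs, y_obs, ddof=1)[0, 1] / np.var(x_obs, ddof=1)

# Causal slope: set X by intervention, irrespective of u
y_do0 = structural(np.full(n, 0.0), u, rng.normal(size=n))
y_do1 = structural(np.full(n, 1.0), u, rng.normal(size=n))
causal_slope = y_do1.mean() - y_do0.mean()

print(pred_slope, causal_slope)  # roughly 2.0 vs 1.0
```

The predictive slope absorbs the confounding path through u (here it roughly doubles the true effect), while the interventional contrast recovers the structural slope of 1.0.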
20.6.3 Example: High-Dimensional Regularization
Suppose we have a large number of predictors p, possibly p \gg n. A common approach in both prediction and inference is the LASSO:
\hat{\beta}_{\text{LASSO}} = \arg\min_\beta \left\{ \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_1 \right\}.
- Prediction: Choose \lambda to optimize out-of-sample MSE. Some bias is introduced in \hat{\beta}, but the final model might predict extremely well, especially if many true coefficients are near zero.
- Causal Estimation: We must worry about whether the LASSO is shrinking or zeroing out confounders. If a crucial confounder’s coefficient is set to zero, the resulting estimate for a treatment variable’s coefficient will be biased. Therefore, special procedures (like the double/debiased machine learning approach (Chernozhukov et al. 2018)) are introduced to correct for the selection bias or to do post-selection inference (Belloni, Chernozhukov, and Hansen 2014).
The mathematics of choosing a “best” model for prediction and of constructing valid coverage intervals for parameters diverge significantly.
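A stripped-down simulation (ours; it substitutes simple OLS for the LASSO machinery) shows what is at stake when selection zeroes out a confounder: dropping the confounder's column, as an aggressive penalty might do, biases the treatment coefficient upward.

```python
# Show omitted-variable bias in the treatment coefficient when a confounder
# is dropped from the regression, mimicking a penalty zeroing it out.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

w = rng.normal(size=n)                       # confounder
d = 0.8 * w + rng.normal(size=n)             # treatment depends on w
y = 1.0 * d + 2.0 * w + rng.normal(size=n)   # true treatment effect is 1.0

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_full = ols(np.column_stack([d, w]), y)     # confounder kept
b_drop = ols(d[:, None], y)                  # confounder "selected away"

print(b_full[0], b_drop[0])  # near 1.0 vs noticeably larger
```

The omitted-variable formula predicts the short regression's coefficient as 1 + 2 \cdot \mathrm{Cov}(D, W)/\mathrm{Var}(D) \approx 1.98, which is why debiasing or post-selection procedures are needed when the goal is the parameter rather than the prediction.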
20.6.4 Potential Outcomes Notation
Let D \in \{0, 1\} be a treatment indicator, and define potential outcomes:
Y_i(0), Y_i(1).
The observed outcome is:
Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0).
- Prediction: One might train a model \hat{Y} = \hat{f}(X, D) to predict Y from (X, D). That model could be a black box with no guarantee that \hat{f}(X_i, 1) - \hat{f}(X_i, 0) is an unbiased estimate of Y_i(1) - Y_i(0).
- Causal Inference: We want to estimate \mathbb{E}[Y(1) - Y(0)] or \mathbb{E}[Y(1) - Y(0) \mid X = x]. Identification typically requires \{Y(0), Y(1)\} \perp D \mid X, i.e., after conditioning on X, the treatment assignment is as-if random. Under such an assumption, the difference \hat{f}(x, 1) - \hat{f}(x, 0) can be interpreted as a causal effect.
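A small potential-outcomes simulation (our own sketch, using a correctly specified linear model for \hat{f}) illustrates the identification claim: when treatment assignment depends only on X, the fitted contrast \hat{f}(x, 1) - \hat{f}(x, 0) recovers \mathbb{E}[Y(1) - Y(0)].

```python
# Potential outcomes with unconfoundedness given X: treatment probability
# depends only on X, and a regression contrast recovers the ATE.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-x))          # P(D = 1 | X) depends on X only
d = rng.binomial(1, p)
y0 = 2.0 * x + rng.normal(size=n)     # potential outcome Y(0)
y1 = y0 + 3.0                         # constant effect: Y(1) - Y(0) = 3
y = d * y1 + (1 - d) * y0             # observed outcome

# Fit a linear model f_hat(X, D) on the observed data
design = np.column_stack([np.ones(n), x, d])
coef = np.linalg.lstsq(design, y, rcond=None)[0]

# ATE estimate: f_hat(x, 1) - f_hat(x, 0) is the coefficient on D
ate_hat = coef[2]
print(ate_hat)  # near the true ATE of 3.0
```

Note that the naive difference in means \bar{Y}_{D=1} - \bar{Y}_{D=0} would be biased here, because treated units have systematically larger X; conditioning on X in the model is what makes the contrast causal.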