20.4 Parameter Estimation and Causal Inference
20.4.1 Estimation in Parametric Models
Consider a simple parametric (linear) model:
\[ Y = X\beta + \varepsilon, \quad \mathbb{E}[\varepsilon \mid X] = 0, \quad \text{Var}(\varepsilon \mid X) = \sigma^2 I. \]
The ordinary least squares (OLS) estimator is:
\[ \hat{\beta}_{\text{OLS}} = \arg \min_\beta \|Y - X\beta\|_2^2 = (X^\top X)^{-1} X^\top Y. \]
Under classical assumptions (e.g., no perfect collinearity, homoskedastic errors), \(\hat{\beta}_{\text{OLS}}\) is BLUE, the best linear unbiased estimator, by the Gauss–Markov theorem.
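As a concrete check of the closed form above, here is a minimal NumPy sketch; the simulated design, sample size, and coefficient values are illustrative assumptions rather than anything from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated design: n observations, p regressors (illustrative choices).
n, p = 500, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
eps = rng.normal(scale=1.0, size=n)          # homoskedastic errors
Y = X @ beta_true + eps

# Closed-form OLS: (X'X)^{-1} X'Y, computed via a linear solve
# rather than an explicit inverse for numerical stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # should be close to beta_true
```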
More generally, parameter estimation aims to recover an estimate \(\hat{\beta}\) of the parameters governing the relationship between \(y\) and \(x\), often with a view toward causality. In many econometric or statistical settings, we write:
\[ y = x^\top \beta + \varepsilon, \]
or more generally \(y = g\bigl(x;\beta\bigr) + \varepsilon,\) where \(\beta\) encodes the structural or causal parameters we wish to recover.
The core aim is consistency—that is, for large \(n\), we want \(\hat{\beta}\) to converge to the true \(\beta\) that defines the underlying relationship. In other words:
\[ \hat{\beta} \xrightarrow{p} \beta, \quad \text{as } n \to \infty. \]
Some texts phrase a related requirement informally as
\[ \mathbb{E}\bigl[\hat{f}\bigr] = f, \]
meaning the estimator is unbiased for the true function or parameters. Strictly speaking, unbiasedness and consistency are distinct properties: a consistent estimator may be biased in finite samples, and an unbiased estimator need not be consistent.
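A quick Monte Carlo sketch (with an assumed single-regressor data-generating process) makes consistency tangible: the OLS slope concentrates around the true value as \(n\) grows:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 2.0  # true slope (illustrative)

for n in [50, 500, 5000, 50000]:
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)
    # OLS slope for a single regressor without intercept: sum(xy)/sum(x^2)
    beta_hat = (x @ y) / (x @ x)
    print(n, beta_hat)
# The printed estimates approach 2.0 as n increases.
```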
However, consistency alone may not suffice for scientific inference. One often also examines the following (illustrated in the sketch after this list):
- Asymptotic Normality: \(\sqrt{n}(\hat{\beta} - \beta) \;\;\xrightarrow{d}\;\; \mathcal{N}(0,\Sigma).\)
- Confidence Intervals: \(\hat{\beta}_j \;\pm\; z_{\alpha/2}\,\mathrm{SE}\bigl(\hat{\beta}_j\bigr).\)
- Hypothesis Tests: \(H_0\colon \beta_j = 0 \quad\text{vs.}\quad H_1\colon \beta_j \neq 0.\)
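As a rough illustration, the sketch below hand-rolls these three quantities under the homoskedastic linear model; the design and the true coefficients are assumed for the example, and in practice a library such as statsmodels would handle this:

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 2
X = rng.normal(size=(n, p))
Y = X @ np.array([1.5, 0.0]) + rng.normal(size=n)  # second coefficient is truly zero

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)               # unbiased error-variance estimate
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))        # SE(beta_hat_j)

z = 1.96                                           # z_{alpha/2} for alpha = 0.05
for j in range(p):
    t_stat = beta_hat[j] / se[j]
    # two-sided p-value from the normal approximation: 2 * (1 - Phi(|t|))
    p_val = 2 * (1 - 0.5 * (1 + math.erf(abs(t_stat) / math.sqrt(2))))
    print(f"beta_{j}: {beta_hat[j]:.3f}  "
          f"95% CI [{beta_hat[j] - z*se[j]:.3f}, {beta_hat[j] + z*se[j]:.3f}]  "
          f"p = {p_val:.3f}")
```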
20.4.2 Causal Inference Fundamentals
To interpret \(\beta\) in \(Y = X\beta + \varepsilon\) as “causal,” we typically require that changes in \(X\) (or at least in one component of \(X\)) lead to changes in \(Y\) that are not confounded by omitted variables or simultaneity. In a prototypical potential-outcomes framework (for a binary treatment \(D\)):
- \(Y_i(1)\): outcome if unit \(i\) receives treatment \(D = 1\).
- \(Y_i(0)\): outcome if unit \(i\) receives no treatment \(D = 0\).
The observed outcome \(Y_i\) is
\[ Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0). \]
The Average Treatment Effect (ATE) is:
\[ \tau = \mathbb{E}[Y(1) - Y(0)]. \]
Identification of \(\tau\) requires an assumption like unconfoundedness:
\[ \{Y(0), Y(1)\} \perp D \mid X, \]
i.e., after conditioning on \(X\), the treatment assignment is as-if random. In practice this is paired with an overlap condition, \(0 < \Pr(D = 1 \mid X = x) < 1\), so that both treated and untreated units exist at every \(x\). Estimation strategies then revolve around properly adjusting for \(X\).
Such assumptions are not necessary for raw prediction of \(Y\): a black-box function can yield \(\hat{Y} \approx Y\) without ensuring that \(\hat{Y}(1) - \hat{Y}(0)\) is an unbiased estimate of \(\tau\).
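A small simulation can make this distinction concrete. In the assumed data-generating process below, a confounder \(X\) drives both treatment and outcome, so the naive difference in means is biased for \(\tau\), while adjusting for \(X\) recovers it (unconfoundedness holds here by construction):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
tau = 1.0                                   # true ATE (by construction)

X = rng.normal(size=n)                      # confounder
p_treat = 1 / (1 + np.exp(-2 * X))          # treatment more likely when X is high
D = rng.binomial(1, p_treat)
Y0 = 3 * X + rng.normal(size=n)             # potential outcome under control
Y1 = Y0 + tau                               # potential outcome under treatment
Y = D * Y1 + (1 - D) * Y0                   # observed outcome

naive = Y[D == 1].mean() - Y[D == 0].mean() # confounded: biased for tau
# Regression adjustment: regress Y on [1, D, X]; the D coefficient targets tau.
Z = np.column_stack([np.ones(n), D, X])
coef = np.linalg.lstsq(Z, Y, rcond=None)[0]
print(f"naive diff-in-means: {naive:.2f}   adjusted estimate: {coef[1]:.2f}")
# Expected: naive well above 1.0; adjusted close to 1.0.
```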
20.4.3 Role of Identification
Identification means that the parameter of interest (\(\beta\) or \(\tau\)) is uniquely pinned down by the distribution of observables (under assumptions). If \(\beta\) is not identified (e.g., because of endogeneity or insufficient variation in \(X\)), no matter how large the sample, we cannot estimate \(\beta\) consistently.
In prediction, “identification” is not usually the main concern. The function \(\hat{f}(x)\) could be a complicated ensemble method that just fits well, without guaranteeing any structural or causal interpretation of its parameters.
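One concrete way to see an identification failure is perfect collinearity. In the assumed setup below, one regressor is an exact multiple of another, so \(X^\top X\) is singular and distinct parameter vectors fit the data identically, no matter how large \(n\) is:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x1 = rng.normal(size=n)
x2 = 2 * x1                       # exact linear dependence: x2 = 2 * x1
X = np.column_stack([x1, x2])
Y = x1 + rng.normal(size=n)

print(np.linalg.matrix_rank(X.T @ X))  # 1, not 2: X'X is singular
# Both parameter vectors below produce identical fitted values,
# so the data cannot distinguish them: beta is not identified.
for beta in ([1.0, 0.0], [0.0, 0.5]):
    print(np.mean((Y - X @ np.array(beta))**2))
```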
20.4.4 Challenges
- High-Dimensional Spaces: With large \(p\) (number of predictors), strong correlation among predictors (multicollinearity) can hamper classical estimation. This is the setting of the well-known bias-variance tradeoff (Hastie et al. 2009; Bishop and Nasrabadi 2006).
- Endogeneity: If \(x\) is correlated with the error term \(\varepsilon\), OLS is biased and inconsistent. Causal inference demands identifying exogenous variation in \(x\), which requires additional assumptions or designs (e.g., randomization); see the simulation sketch after this list.
- Model Misspecification: If the functional form \(g\bigl(x;\beta\bigr)\) is incorrect, parameter estimates can systematically deviate from capturing the true underlying mechanism.
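To make the endogeneity point concrete, here is a stylized omitted-variable simulation (all numeric choices are assumptions for illustration): the regressor and the error share an unobserved component, and the OLS slope stays away from the truth even with a very large sample:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
u = rng.normal(size=n)            # unobserved factor
x = u + rng.normal(size=n)        # regressor contaminated by u: endogenous
eps = u + rng.normal(size=n)      # error also contains u, so Corr(x, eps) > 0
y = 1.0 * x + eps                 # true slope is 1.0

beta_hat = (x @ y) / (x @ x)
print(beta_hat)                   # ~1.5, not 1.0: the bias does not vanish with n
```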