20.4 Parameter Estimation and Causal Inference
20.4.1 Estimation in Parametric Models
Consider a simple parametric (linear) model:
\[ Y = X\beta + \varepsilon, \quad \mathbb{E}[\varepsilon \mid X] = 0, \quad \text{Var}(\varepsilon \mid X) = \sigma^2 I. \]
The ordinary least squares (OLS) estimator is:
\[ \hat{\beta}_{\text{OLS}} = \arg \min_\beta \|Y - X\beta\|_2^2 = (X^\top X)^{-1} X^\top Y. \]
Under classical assumptions (e.g., no perfect collinearity, homoskedastic errors), \(\hat{\beta}_{\text{OLS}}\) is BLUE, the best linear unbiased estimator, by the Gauss–Markov theorem.
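As a concrete check of the closed form above, here is a minimal NumPy sketch; the simulated design, sample size, and coefficient values are illustrative assumptions rather than anything from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated design: n observations, p regressors (illustrative choices).
n, p = 500, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
eps = rng.normal(scale=1.0, size=n)          # homoskedastic errors
Y = X @ beta_true + eps

# Closed-form OLS: (X'X)^{-1} X'Y, computed via a linear solve
# rather than an explicit inverse for numerical stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # should be close to beta_true
```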
More generally, parameter estimation aims to recover an estimate \(\hat{\beta}\) of the parameters governing the relationship between \(y\) and \(x\), often with a view toward causality. In many econometric or statistical settings, we write:
\[ y = x^\top \beta + \varepsilon, \]
or more generally \(y = g\bigl(x;\beta\bigr) + \varepsilon,\) where \(\beta\) encodes the structural or causal parameters we wish to recover.
The core aim is consistency—that is, for large \(n\), we want \(\hat{\beta}\) to converge to the true \(\beta\) that defines the underlying relationship. In other words:
\[ \hat{\beta} \xrightarrow{p} \beta, \quad \text{as } n \to \infty. \]
Some texts phrase a related requirement informally as
\[ \mathbb{E}\bigl[\hat{f}\bigr] = f, \]
meaning the estimator is unbiased for the true function or parameters. Strictly speaking, unbiasedness and consistency are distinct properties: a consistent estimator may be biased in finite samples, and an unbiased estimator need not be consistent.
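A quick Monte Carlo sketch (with an assumed single-regressor data-generating process) makes consistency tangible: the OLS slope concentrates around the true value as \(n\) grows:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 2.0  # true slope (illustrative)

for n in [50, 500, 5000, 50000]:
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)
    # OLS slope for a single regressor without intercept: sum(xy)/sum(x^2)
    beta_hat = (x @ y) / (x @ x)
    print(n, beta_hat)
# The printed estimates approach 2.0 as n increases.
```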
However, consistency alone may not suffice for scientific inference. One often also examines the following (illustrated in the sketch after this list):
- Asymptotic Normality: \(\sqrt{n}(\hat{\beta} - \beta) \;\;\xrightarrow{d}\;\; \mathcal{N}(0,\Sigma).\)
- Confidence Intervals: \(\hat{\beta}_j \;\pm\; z_{\alpha/2}\,\mathrm{SE}\bigl(\hat{\beta}_j\bigr).\)
- Hypothesis Tests: \(H_0\colon \beta_j = 0 \quad\text{vs.}\quad H_1\colon \beta_j \neq 0.\)
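As a rough illustration, the sketch below hand-rolls these three quantities under the homoskedastic linear model; the design and the true coefficients are assumed for the example, and in practice a library such as statsmodels would handle this:

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 2
X = rng.normal(size=(n, p))
Y = X @ np.array([1.5, 0.0]) + rng.normal(size=n)  # second coefficient is truly zero

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)               # unbiased error-variance estimate
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))        # SE(beta_hat_j)

z = 1.96                                           # z_{alpha/2} for alpha = 0.05
for j in range(p):
    t_stat = beta_hat[j] / se[j]
    # two-sided p-value from the normal approximation: 2 * (1 - Phi(|t|))
    p_val = 2 * (1 - 0.5 * (1 + math.erf(abs(t_stat) / math.sqrt(2))))
    print(f"beta_{j}: {beta_hat[j]:.3f}  "
          f"95% CI [{beta_hat[j] - z*se[j]:.3f}, {beta_hat[j] + z*se[j]:.3f}]  "
          f"p = {p_val:.3f}")
```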
20.4.2 Causal Inference Fundamentals
To interpret \(\beta\) in \(Y = X\beta + \varepsilon\) as “causal,” we typically require that changes in \(X\) (or at least in one component of \(X\)) lead to changes in \(Y\) that are not confounded by omitted variables or simultaneity. In a prototypical potential-outcomes framework (for a binary treatment \(D\)):
- \(Y_i(1)\): outcome if unit \(i\) receives treatment \(D = 1\).
- \(Y_i(0)\): outcome if unit \(i\) receives no treatment \(D = 0\).
The observed outcome \(Y_i\) is
\[ Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0). \]
The Average Treatment Effect (ATE) is:
\[ \tau = \mathbb{E}[Y(1) - Y(0)]. \]
Identification of \(\tau\) requires an assumption like unconfoundedness:
\[ \{Y(0), Y(1)\} \perp D \mid X, \]
i.e., after conditioning on \(X\), the treatment assignment is as-if random. In practice this is paired with an overlap condition, \(0 < \Pr(D = 1 \mid X = x) < 1\), so that both treated and untreated units exist at every \(x\). Estimation strategies then revolve around properly adjusting for \(X\).
Such assumptions are not necessary for raw prediction of \(Y\): a black-box function can yield \(\hat{Y} \approx Y\) without ensuring that \(\hat{Y}(1) - \hat{Y}(0)\) is an unbiased estimate of \(\tau\).
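A small simulation can make this distinction concrete. In the assumed data-generating process below, a confounder \(X\) drives both treatment and outcome, so the naive difference in means is biased for \(\tau\), while adjusting for \(X\) recovers it (unconfoundedness holds here by construction):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
tau = 1.0                                   # true ATE (by construction)

X = rng.normal(size=n)                      # confounder
p_treat = 1 / (1 + np.exp(-2 * X))          # treatment more likely when X is high
D = rng.binomial(1, p_treat)
Y0 = 3 * X + rng.normal(size=n)             # potential outcome under control
Y1 = Y0 + tau                               # potential outcome under treatment
Y = D * Y1 + (1 - D) * Y0                   # observed outcome

naive = Y[D == 1].mean() - Y[D == 0].mean() # confounded: biased for tau
# Regression adjustment: regress Y on [1, D, X]; the D coefficient targets tau.
Z = np.column_stack([np.ones(n), D, X])
coef = np.linalg.lstsq(Z, Y, rcond=None)[0]
print(f"naive diff-in-means: {naive:.2f}   adjusted estimate: {coef[1]:.2f}")
# Expected: naive well above 1.0; adjusted close to 1.0.
```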
20.4.3 Role of Identification
Identification means that the parameter of interest (\(\beta\) or \(\tau\)) is uniquely pinned down by the distribution of observables (under assumptions). If \(\beta\) is not identified (e.g., because of endogeneity or insufficient variation in \(X\)), no matter how large the sample, we cannot estimate \(\beta\) consistently.
In prediction, “identification” is not usually the main concern. The function \(\hat{f}(x)\) could be a complicated ensemble method that just fits well, without guaranteeing any structural or causal interpretation of its parameters.
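One concrete way to see an identification failure is perfect collinearity. In the assumed setup below, one regressor is an exact multiple of another, so \(X^\top X\) is singular and distinct parameter vectors fit the data identically, no matter how large \(n\) is:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x1 = rng.normal(size=n)
x2 = 2 * x1                       # exact linear dependence: x2 = 2 * x1
X = np.column_stack([x1, x2])
Y = x1 + rng.normal(size=n)

print(np.linalg.matrix_rank(X.T @ X))  # 1, not 2: X'X is singular
# Both parameter vectors below produce identical fitted values,
# so the data cannot distinguish them: beta is not identified.
for beta in ([1.0, 0.0], [0.0, 0.5]):
    print(np.mean((Y - X @ np.array(beta))**2))
```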
20.4.4 Challenges
- High-Dimensional Spaces: With large \(p\) (number of predictors), strong correlation among predictors (multicollinearity) can hamper classical estimation. This is the setting of the well-known bias-variance tradeoff (Hastie et al. 2009; Bishop and Nasrabadi 2006).
- Endogeneity: If \(x\) is correlated with the error term \(\varepsilon\), OLS is biased and inconsistent. Causal inference demands identifying exogenous variation in \(x\), which requires additional assumptions or designs (e.g., randomization); see the simulation sketch after this list.
- Model Misspecification: If the functional form \(g\bigl(x;\beta\bigr)\) is incorrect, parameter estimates can systematically deviate from capturing the true underlying mechanism.
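To make the endogeneity point concrete, here is a stylized omitted-variable simulation (all numeric choices are assumptions for illustration): the regressor and the error share an unobserved component, and the OLS slope stays away from the truth even with a very large sample:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
u = rng.normal(size=n)            # unobserved factor
x = u + rng.normal(size=n)        # regressor contaminated by u: endogenous
eps = u + rng.normal(size=n)      # error also contains u, so Corr(x, eps) > 0
y = 1.0 * x + eps                 # true slope is 1.0

beta_hat = (x @ y) / (x @ x)
print(beta_hat)                   # ~1.5, not 1.0: the bias does not vanish with n
```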