34.7 Cautions in IV

34.7.1 Negative \(R^2\) in IV

In IV estimation, particularly 2SLS and 3SLS, it is common, and not a problem, to encounter negative \(R^2\) values in the second-stage regression. Unlike Ordinary Least Squares (OLS), where \(R^2\) is often used to assess model fit, in IV regression the primary concern is consistent estimation of the coefficients of interest, not goodness-of-fit.

What Should You Look At Instead of \(R^2\) in IV?

  1. Instrument Relevance (First-stage \(F\)-statistics, Partial \(R^2\))
  2. Weak Instrument Tests (Kleibergen-Paap, Anderson-Rubin tests)
  3. Validity of Instruments (Overidentification tests like Sargan/Hansen J-test)
  4. Endogeneity Tests (Durbin-Wu-Hausman test for endogeneity)
  5. Confidence Intervals and Standard Errors, focusing on inference for \(\hat{\beta}\).

Geometric Intuition

  • In OLS, the fitted values \(\hat{y}\) are the orthogonal projection of \(y\) onto the column space of \(X\).
  • In 2SLS, \(\hat{y}\) is the projection onto the space spanned by \(Z\), not \(X\).
  • As a result, the fitted values no longer minimize the residual sum of squares, and the RSS can be larger than under OLS.

Recall the formula for the coefficient of determination (\(R^2\)) in a regression model:

\[ R^2 = 1 - \frac{RSS}{TSS} = \frac{MSS}{TSS} \]

Where:

  • \(TSS\) is the Total Sum of Squares: \[ TSS = \sum_{i=1}^n (y_i - \bar{y})^2 \]
  • \(MSS\) is the Model Sum of Squares: \[ MSS = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \]
  • \(RSS\) is the Residual Sum of Squares: \[ RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

In OLS, the \(R^2\) measures the proportion of variance in \(Y\) that is explained by the regressors \(X\).
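
To make the decomposition concrete, here is a minimal sketch in Python (numpy only; the simulated data and coefficients are purely illustrative) that fits OLS with an intercept and verifies that \(TSS = MSS + RSS\), so the two \(R^2\) formulas agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # simple linear DGP (illustrative)

# OLS fit with an intercept
X = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_ols

tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
mss = np.sum((y_hat - y.mean()) ** 2)    # model sum of squares
rss = np.sum((y - y_hat) ** 2)           # residual sum of squares

print(tss, mss + rss)                    # equal (up to rounding) under OLS
print(1 - rss / tss, mss / tss)          # the two R^2 formulas agree
```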

Key Properties in OLS:

  • \(R^2 \in [0, 1]\)
  • Adding more regressors (even irrelevant ones) never decreases \(R^2\).
  • \(R^2\) measures in-sample goodness-of-fit, not causal interpretation.

34.7.1.1 Why Does \(R^2\) Lose Its Meaning in IV Regression?

In IV regression, the second stage regression replaces the endogenous variable \(X_2\) with its predicted values from the first stage:

Stage 1:

\[ X_2 = Z \pi + v \]

Stage 2:

\[ Y = X_1 \beta_1 + \hat{X}_2 \beta_2 + \epsilon \]

  • \(\hat{X}_2\) is not the observed \(X_2\), but a proxy constructed from \(Z\).
  • \(\hat{X}_2\) isolates the exogenous variation in \(X_2\) that is independent of \(\epsilon\).
  • This reduces bias, but comes at a cost:
    • The variation in \(\hat{X}_2\) is typically less than that in \(X_2\).
    • The predicted values \(\hat{y}_i\) from the second stage are not necessarily close to \(y_i\).

34.7.1.2 Why \(R^2\) Can Be Negative

  1. \(R^2\) is calculated as \[ R^2 = 1 - \frac{RSS}{TSS}, \] but in IV:
  • The predicted values of \(Y\) are not chosen to minimize RSS, because IV does not minimize residuals in the second stage.
  • Unlike OLS, 2SLS chooses \(\hat{\beta}\) to satisfy moment conditions rather than to minimize the sum of squared errors.
  2. It is possible (and common in IV) for the residual sum of squares to exceed the total sum of squares, \(RSS > TSS\), which makes \[ R^2 = 1 - \frac{RSS}{TSS} < 0 \]

  3. This happens because:

    • The predicted values \(\hat{y}_i\) in IV are not optimized to fit the observed \(y_i\).
    • The residuals can be larger, because IV focuses on identifying causal effects, not prediction.

For example, assume we have:

  • \(TSS = 100\)

  • \(RSS = 120\)

Then: \[ R^2 = 1 - \frac{120}{100} = -0.20 \]

This happens because the IV procedure does not minimize RSS. It prioritizes solving the endogeneity problem over explaining the variance in \(Y\).
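
A small simulation makes this concrete. The sketch below (numpy only; all parameter values are illustrative) builds an endogenous regressor whose covariance with the error is negative, so the variance of \(Y\) is smaller than the error variance. The hand-rolled 2SLS estimate lands near the true coefficient while the second-stage \(R^2\) is negative:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000
beta_true = 0.5

# Endogenous design: x is strongly (negatively) correlated with u
u = rng.normal(size=n)
z = rng.normal(size=n)                      # valid, relevant instrument
x = 0.3 * z - 0.95 * u + 0.3 * rng.normal(size=n)
y = beta_true * x + u

X = np.column_stack([np.ones(n), x])        # regressors (with intercept)
Z = np.column_stack([np.ones(n), z])        # instruments (with intercept)

# 2SLS: beta = (X' P_Z X)^{-1} X' P_Z Y, computed via the first-stage fit
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
beta_iv = np.linalg.lstsq(X_hat, y, rcond=None)[0]

# OLS for comparison
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# R^2 from the *structural* residuals Y - X @ b
def r2(b):
    rss = np.sum((y - X @ b) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

print("beta_iv  =", beta_iv[1], " R2_iv  =", r2(beta_iv))    # near 0.5, R2 < 0
print("beta_ols =", beta_ols[1], " R2_ols =", r2(beta_ols))  # biased, R2 > 0
```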


34.7.1.3 Why We Don’t Care About \(R^2\) in IV

  1. IV Estimates Focus on Consistency, Not Prediction
  • The goal of IV is to obtain a consistent estimate of \(\beta_2\).
  • IV sacrifices fit (larger residuals) to remove endogeneity bias.
  2. \(R^2\) Does Not Reflect the Quality of an IV Estimator
  • A high \(R^2\) in IV may be misleading (for instance, when instruments are weak or invalid).
  • A negative \(R^2\) does not imply a bad IV estimator if the instrument validity assumptions are met.
  3. IV Regression Is About Identification, Not In-Sample Fit
  • IV relies on relevance and exogeneity of instruments, not residual minimization.

34.7.1.4 Technical Details on \(R^2\)

In OLS, \[ \hat{\beta}^{OLS} = (X'X)^{-1} X'Y \] minimizes \[ RSS = (Y - X \hat{\beta}^{OLS})'(Y - X \hat{\beta}^{OLS}) \]

In IV: \[ \hat{\beta}^{IV} = (X'P_Z X)^{-1} X'P_Z Y \]

where:

  • \(P_Z = Z (Z'Z)^{-1} Z'\) is the projection matrix onto \(Z\).

  • The IV estimator instead solves the moment condition \[ Z'(Y - X\hat{\beta}) = 0 \] (exactly so in the just-identified case).

  • No guarantee that this minimizes RSS.

Residuals:

\[ e^{IV} = Y - X \hat{\beta}^{IV} \]

The norm of \(e^{IV}\) is typically larger than its OLS counterpart because the fitted values are constrained to the variation in \(X\) that is predictable from \(Z\).

A Note on \(R^2\) in 3SLS and GMM

  • In 3SLS or GMM IV, \(R^2\) can be similarly misleading.
  • These methods often operate under moment conditions or system estimation, not residual minimization.

34.7.2 Many-Instruments Bias

While IV is powerful, it is also delicate. One critical issue that arises is the many-instruments problem, also known as many-IV bias.

Consider the structural model:

\[ y_i = \beta x_i + u_i \]

where \(x_i\) is endogenous: \(\mathbb{E}[x_i u_i] \neq 0\). To address this, we introduce instruments \(z_i\) such that:

  1. Relevance: \(\mathbb{E}[z_i x_i] \neq 0\)
  2. Exogeneity: \(\mathbb{E}[z_i u_i] = 0\)

The standard 2SLS estimator is given by:

\[ \hat{\beta}_{2SLS} = (X'P_ZX)^{-1} X'P_Zy \]

where \(P_Z = Z(Z'Z)^{-1}Z'\) is the projection matrix onto the column space of \(Z\).

The many-IV problem arises when the number of instruments \(L\) is large relative to the number of observations \(n\). In particular, the issue becomes severe as \(L/n \to \alpha > 0\), leading to several problems:

  • Bias Toward OLS: The 2SLS estimator becomes increasingly biased toward the OLS estimator.
  • Overfitting: The first-stage regression overfits the endogenous variable, capturing noise rather than true variation.
  • Inflated Variance: The second-stage estimates become imprecise, leading to misleading inference.

Traditional IV asymptotics assume \(L\) is fixed as \(n \to \infty\). Bekker (1994) proposed an alternative framework where:

\[ L/n \to \alpha \in (0, \infty) \quad \text{as } n \to \infty \]

Under Bekker asymptotics:

  • 2SLS is biased and inconsistent unless the instruments are very strong.
  • The bias grows with \(\alpha\), approaching that of OLS.

This formalized the intuition that adding more instruments—especially weak ones—does not help, and can actually harm estimation.
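
The mechanics can be seen in a short simulation. This hedged sketch (numpy only; the DGP is illustrative) starts from one relevant instrument and adds pure-noise instruments; the 2SLS estimate drifts from the true \(\beta = 1\) toward the OLS estimate as \(L\) grows:

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta_true = 400, 1.0

# Endogenous regressor: Cov(x, u) > 0, so OLS is biased upward
u = rng.normal(size=n)
z1 = rng.normal(size=n)                      # the only relevant instrument
x = 0.4 * z1 + 0.8 * u + rng.normal(size=n)
y = beta_true * x + u

def tsls(Z):
    """Slope from 2SLS of y on [1, x] using instruments [1, Z]."""
    Zc = np.column_stack([np.ones(n), Z])
    X = np.column_stack([np.ones(n), x])
    X_hat = Zc @ np.linalg.lstsq(Zc, X, rcond=None)[0]   # first-stage fit
    return np.linalg.lstsq(X_hat, y, rcond=None)[0][1]

beta_ols = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)[0][1]
print("OLS:", beta_ols)

for L in [1, 25, 100, 200]:
    junk = rng.normal(size=(n, L - 1))       # L-1 irrelevant instruments
    Z = np.column_stack([z1, junk])
    print(f"L = {L:3d}  2SLS: {tsls(Z):.3f}")  # drifts toward OLS as L grows
```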


34.7.2.1 Sources of Many-IV Bias

  1. Weak Instruments

Many instruments are individually weak (i.e., each contributes little to explaining \(x\)), and collectively they inflate the projection without improving identification.

  2. Overfitting the First Stage

With too many instruments, the first-stage regression captures random noise, leading to poor out-of-sample performance and contamination of the second stage.

  3. Endogeneity Leakage

Overfit first-stage predictions may reintroduce endogeneity through incidental correlation with the structural error term.


34.7.2.2 Diagnostic Tools

  1. First-Stage F-Statistic

A weak-instrument test: the rule of thumb is \(F > 10\) for a single instrument; more stringent thresholds apply with many instruments. (A sketch of this computation follows the list.)

  2. Overidentification Tests
  • Sargan Test: assumes homoskedastic errors.
  • Hansen's J-Test: robust to heteroskedasticity.
  3. Eigenvalue Diagnostics
  • Kleibergen-Paap rk statistic (generalized for clustered or heteroskedastic settings).
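
As referenced in item 1, here is a minimal sketch of the first-stage \(F\)-statistic and partial \(R^2\) (simulated data; with no included exogenous controls, as here, the partial \(R^2\) coincides with the first-stage \(R^2\)):

```python
import numpy as np

rng = np.random.default_rng(7)
n, L = 500, 3
Z = rng.normal(size=(n, L))                  # instruments
x = Z @ np.array([0.4, 0.1, 0.0]) + rng.normal(size=n)  # first-stage DGP

# Unrestricted first stage: x on [1, Z]; restricted: intercept only
X_u = np.column_stack([np.ones(n), Z])
rss_u = np.sum((x - X_u @ np.linalg.lstsq(X_u, x, rcond=None)[0]) ** 2)
rss_r = np.sum((x - x.mean()) ** 2)          # restricted RSS (no instruments)

q, df = L, n - L - 1                         # restrictions, residual dof
F = ((rss_r - rss_u) / q) / (rss_u / df)
partial_r2 = (rss_r - rss_u) / rss_r         # instruments' explanatory share

print(f"First-stage F = {F:.1f}, partial R^2 = {partial_r2:.3f}")
# Rule of thumb: worry about weak instruments when F is below ~10.
```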

34.7.2.3 Remedies and Alternatives

  1. Instrument Selection
  • Lasso IV / Post-Double Selection: use regularization to select valid instruments.
  • Factor-Based Methods: project instruments onto principal components.
  2. Shrinkage Estimators
  • Limited Information Maximum Likelihood (LIML): more robust to many-IV bias.
  • Jackknife IV Estimator (JIVE): adjusts for the overfitting bias in the first stage.
  3. Grouped or Aggregated Instruments
  • Collapse multiple instruments into a smaller number of aggregated measures.

34.7.2.4 Practical Guidelines

  1. Avoid Including All Possible Instruments: Parsimony matters more than volume.
  2. Always Check First-Stage Strength: Even if \(R^2\) is high, individual instrument strength matters.
  3. Report Robustness with Alternative Estimators: LIML or JIVE can serve as robustness checks.
  4. Test for Overidentification: But interpret results cautiously when \(L\) is large.

34.7.3 Heterogeneous Effects in IV Estimation

34.7.3.1 Constant vs. Heterogeneous Treatment Effects

The standard instrumental variables framework assumes that the causal effect of an endogenous regressor \(D_i\) on an outcome \(Y_i\) is constant across individuals, i.e.:

\[ Y_i = \beta_0 + \beta_1 D_i + u_i \]

This is the homogeneous treatment effects model: \(\beta_1\) is a structural parameter that applies uniformly to all individuals \(i\) in the population. It underlies the traditional IV assumptions:

  • Linearity with a constant effect \(\beta_1\).

  • Instrument relevance: \(\mathrm{Cov}(Z_i, D_i) \ne 0\).

  • Instrument exogeneity: \(\mathrm{Cov}(Z_i, u_i) = 0\).

Under these assumptions, the IV estimator \(\hat{\beta}_1^{IV}\) consistently estimates the causal effect \(\beta_1\).

34.7.3.2 Heterogeneous Treatment Effects and the Problem for IV

In practice, treatment effects often vary across individuals. That is, the effect of \(D_i\) on \(Y_i\) depends on the individual’s characteristics or other unobserved factors:

\[ Y_i = \beta_{1i} D_i + u_i \]

Here, \(\beta_{1i}\) represents the individual-specific causal effect, and the population Average Treatment Effect is:

\[ ATE = \mathbb{E}[\beta_{1i}] \]

In the presence of treatment effect heterogeneity, the IV estimator \(\hat{\beta}_1^{IV}\) does not, in general, estimate the ATE. Instead, it estimates a weighted average of the heterogeneous treatment effects, with weights determined by the instrumental variation in the data.

This distinction is critical:

  • OLS estimates a weighted average treatment effect, with weights depending on the variance of \(D_i\).

  • IV estimates a Local Average Treatment Effect (LATE), with the relevant subpopulation determined by the instrument \(Z_i\).

When there is one endogenous regressor \(D_i\) and one instrument \(Z_i\), both binary, we can interpret the IV estimator as the Local Average Treatment Effect under specific assumptions. The setup is:

\[ Y_i = \beta_0 + \beta_{1i} D_i + u_i \]

  • \(D_i \in \{0, 1\}\): The treatment indicator.
  • \(Z_i \in \{0, 1\}\): The binary instrument.

Assumptions for the LATE Interpretation

  1. Instrument Exogeneity

\[ Z_i \perp (u_i, v_i) \]

  • The instrument is as good as randomly assigned: it is independent of both the structural error term \(u_i\) and the unobserved determinants \(v_i\) that affect treatment selection.

  2. Relevance

\[ \mathbb{P}(D_i = 1 | Z_i = 1) \ne \mathbb{P}(D_i = 1 | Z_i = 0) \]

  • The instrument must shift the probability of receiving treatment \(D_i\).

  3. Monotonicity (G. W. Imbens and Angrist 1994)

\[ D_i(1) \ge D_i(0) \quad \forall i \]

  • There are no defiers: no individual takes the treatment when \(Z_i = 0\) but refuses it when \(Z_i = 1\).

  • Monotonicity is not testable and must be defended on theoretical grounds.

Under these assumptions, \(\hat{\beta}_1^{IV}\) estimates the Local Average Treatment Effect:

\[ LATE = \mathbb{E}[\beta_{1i} | \text{Compliers}] \]

  • Compliers are individuals who receive the treatment when \(Z_i = 1\), but not when \(Z_i = 0\).
  • Local refers to the fact that the estimate pertains to this specific subpopulation of compliers.

Implications:

  • The LATE is not the ATE, unless treatment effects are homogeneous, or the complier subpopulation is representative of the entire population.
  • Different instruments define different complier groups, leading to different LATEs.
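
To see the complier logic in numbers, here is a hedged sketch that simulates always-takers, never-takers, and compliers with different individual effects (all shares and effect sizes are illustrative). The Wald (binary-IV) estimate recovers the compliers' average effect, not the ATE:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
z = rng.integers(0, 2, size=n)               # randomized binary instrument

# Principal strata: 20% always-takers, 30% never-takers, 50% compliers
strata = rng.choice(["at", "nt", "co"], p=[0.2, 0.3, 0.5], size=n)
d = np.where(strata == "at", 1, np.where(strata == "nt", 0, z))

# Heterogeneous effects: compliers gain 1.0, always-takers 3.0, never-takers 0.5
effect = np.where(strata == "co", 1.0, np.where(strata == "at", 3.0, 0.5))
y = effect * d + rng.normal(size=n)

wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print("Wald/IV estimate:", wald)             # ~1.0 = complier average effect
print("ATE:", effect.mean())                 # ~1.25, not what IV recovers
```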

34.7.3.3 Multiple Instruments and Multiple LATEs

When we have multiple instruments \(Z_i^{(1)}, Z_i^{(2)}, \dots, Z_i^{(m)}\), each can induce different complier groups:

  • Each instrument has its own LATE, corresponding to its own group of compliers.

  • If heterogeneous treatment effects exist, these LATEs may differ.

In an overidentified model, with more instruments than endogenous regressors (\(m > k\)), the 2SLS estimator imposes the assumption that all instruments identify the same causal effect \(\beta_1\). This leads to the moment conditions:

\[ \mathbb{E}[Z_i^{(j)}(Y_i - D_i \beta_1)] = 0 \quad \forall j = 1, \dots, m \]

If instruments identify different LATEs:

  • These moment conditions can be inconsistent with one another.

  • The Sargan-Hansen J-test may reject, even though each instrument is valid (i.e., exogenous and relevant).

Key Insight: The J-test rejects because the homogeneity assumption is violated—not because instruments are invalid in the exogeneity sense.

34.7.3.4 Illustration: Multiple Instruments, Different LATEs

Consider the following example:

  • \(Z_i^{(1)}\) identifies a LATE of 1.0.

  • \(Z_i^{(2)}\) identifies a LATE of 2.0.

  • If both instruments are included in an overidentified IV model, the 2SLS estimator tries to reconcile these LATEs as if they identified the same \(\beta_1\), leading to:

    • A weighted average of these LATEs.

    • A possible rejection of the overidentification restrictions via the J-test.

This scenario is common in:

  • Labor economics (e.g., different instruments for education identify different populations).

  • Marketing and pricing experiments (e.g., different price instruments impact different customer segments).
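
A hedged simulation of this scenario (the two complier groups and their effects of 1.0 and 2.0 are stipulated, mirroring the example above) shows that each instrument alone recovers its own LATE, while pooling both in an overidentified 2SLS yields a value in between:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
z1 = rng.integers(0, 2, size=n)              # instrument 1
z2 = rng.integers(0, 2, size=n)              # instrument 2

# Two disjoint complier groups, one per instrument; the rest never treated
group = rng.choice(["c1", "c2", "n"], p=[0.3, 0.3, 0.4], size=n)
d = np.where(group == "c1", z1, np.where(group == "c2", z2, 0))

# Group-1 compliers have effect 1.0; group-2 compliers have effect 2.0
effect = np.where(group == "c1", 1.0, np.where(group == "c2", 2.0, 0.0))
y = effect * d + rng.normal(size=n)

def wald(z):
    """Just-identified IV with one binary instrument."""
    return (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

def tsls(Z):
    """Overidentified 2SLS slope of y on [1, d] with instruments [1, Z]."""
    Zc = np.column_stack([np.ones(n), Z])
    X = np.column_stack([np.ones(n), d])
    X_hat = Zc @ np.linalg.lstsq(Zc, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, y, rcond=None)[0][1]

print("LATE via z1:", wald(z1))              # ~1.0
print("LATE via z2:", wald(z2))              # ~2.0
print("Overidentified 2SLS:", tsls(np.column_stack([z1, z2])))  # in between
```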

34.7.3.5 Practical Implications for Empirical Research

  1. Be Clear About Whose Effect You’re Estimating
  • Different instruments often imply different complier groups.
  • Understanding who the compliers are is essential for policy implications.
  2. Interpret the J-Test Carefully
  • A rejection may indicate treatment effect heterogeneity, not necessarily instrument invalidity.
  • Supplement the J-test with:
    • Subgroup analysis.
    • Sensitivity analysis.
    • Local Instrumental Variable or Marginal Treatment Effects frameworks.
  3. Use Structural Models When Needed
  • If you need an ATE, consider parametric or semi-parametric structural models that explicitly model heterogeneity.
  4. Don’t Assume LATE = ATE
  • Be cautious in generalizing LATE estimates beyond the complier subpopulation.

34.7.3.6 Beyond LATE

The presence of heterogeneous treatment effects (\(\beta_{1i}\) varying across individuals) raises a fundamental challenge for causal inference with IV methods. As we have seen, the traditional IV estimator identifies the Local Average Treatment Effect (LATE) under certain assumptions. However, this approach implicitly adopts a reverse engineering strategy: it uses classical linear IV estimators designed under homogeneity, acknowledges their likely misspecification in the presence of unobserved heterogeneity, and interprets the resulting estimate as a LATE.

This strategy has been highly influential and remains central to empirical work. Nevertheless, it comes with limitations:

  • The interpretation depends critically on the specific instrument used (i.e., the definition of the complier group).
  • It cannot recover the Average Treatment Effect (ATE) or other policy-relevant parameters unless strong additional assumptions hold.

34.7.3.7 Forward Engineering: The Marginal Treatment Effect

In contrast, recent work, including Mogstad and Torgovitsky (2024), emphasizes a forward engineering approach. Rather than adapting estimators designed under homogeneity, this strategy builds models and estimators that explicitly allow for unobserved heterogeneity in treatment effects from the outset.

A key framework in this approach is the Marginal Treatment Effect (MTE), originally developed in the context of selection models (Gronau 1974; J. J. Heckman 1979). The idea is to model the treatment decision as the result of a latent index:

\[ D_i = \mathbb{1}[v_i \leq Z_i'\pi] \]

and to let the treatment effect vary with the unobserved selection variable. Normalizing \(U_i = F_v(v_i)\), so that \(U_i\) is uniform on \([0, 1]\), the MTE is defined as:

\[ \text{MTE}(u) = \mathbb{E}[Y_i(1) - Y_i(0) \mid U_i = u] \]

where \(U_i\) is the latent resistance to treatment. This function traces out how the treatment effect varies across individuals with different propensities to receive treatment, and it underlies other average effects such as:

  • ATE: \(\int_0^1 \text{MTE}(u) \, du\)
  • LATE: the average of MTE over the complier margin
  • TT (treatment on the treated) and TUT (treatment on the untreated): other weighted averages of MTE
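
As a toy numerical illustration (the MTE functional form and the complier margin are purely hypothetical), the sketch below averages a declining MTE curve to obtain the ATE and a LATE:

```python
import numpy as np

# Illustrative MTE curve: effects decline in the resistance quantile u
# (the functional form is purely hypothetical)
def mte(u):
    return 2.0 - 1.5 * u

u = np.linspace(0.0, 1.0, 10_001)
ate = mte(u).mean()                  # ~ integral of MTE over [0, 1]

# A LATE averages the same curve over the margin the instrument shifts,
# e.g., a hypothetical complier margin u in [0.4, 0.7]
u_c = np.linspace(0.4, 0.7, 10_001)
late = mte(u_c).mean()

print("ATE  =", ate)                 # 1.25 for this curve
print("LATE =", late)                # 1.175: these compliers gain less than the ATE
```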

Comparison of IV, LATE, and MTE Approaches

Feature                    Traditional IV (LATE)    MTE / Selection Models
Assumes constant effects   Implicitly violated      Explicitly allows heterogeneity
Interpretation             LATE for compliers       MTE curve + all average treatment effects
Data requirements          Instrument + outcome     Richer variation in instrument (e.g., continuous)
Estimation complexity      Low (2SLS)               Higher (requires modeling selection)

The MTE framework also connects to:

  • Control function methods, which account for selection by including first-stage residuals (control functions) in the outcome equation.
  • Partial identification / bounding methods, which avoid strong parametric assumptions and instead seek informative bounds on treatment effects.

These newer strategies reflect a shift in modern econometrics: away from treating unobserved heterogeneity as a nuisance, and toward modeling it directly for richer causal inference.

Understanding these two strategies helps practitioners choose appropriate methods based on:

  • Their identifying assumptions.
  • The richness of their instruments.
  • Their target estimand (e.g., ATE, LATE, MTE).
  • Their willingness to model the selection process.

Researchers should be cautious in interpreting IV estimates as general causal effects, especially when heterogeneous treatment effects are likely and the choice of instrument strongly influences the complier population.


34.7.4 Zero-Valued Outcomes

For outcomes that take zero values, log transformations introduce interpretation issues: the coefficient on a log-transformed outcome does not directly represent a percentage change (J. Chen and Roth 2024). One must distinguish the treatment effect on the intensive margin (e.g., the outcome moving from 10 to 11) from the effect on the extensive margin (e.g., from 0 to 1), and the treatment coefficient in a log-transformed outcome regression cannot readily be interpreted as either. In such cases, researchers use alternative methods:

34.7.4.1 Proportional LATE Estimation

When dealing with zero-valued outcomes, direct log transformations can lead to interpretation issues. To obtain an interpretable percentage change in the outcome due to treatment among compliers, we estimate the proportional Local Average Treatment Effect (LATE), denoted as \(\theta_{ATE\%}\).

Steps to Estimate Proportional LATE:

  1. Estimate LATE using 2SLS:

    We first estimate the treatment effect using a standard Two-Stage Least Squares regression: \[ Y_i = \beta D_i + X_i'\gamma + \epsilon_i, \] where:

    • \(D_i\) is the endogenous treatment variable.
    • \(X_i\) includes any exogenous controls.
    • \(\beta\) represents the LATE in levels, i.e., the average treatment effect for compliers.
  2. Estimate the control complier mean (\(\beta_{cc}\)):

    Using the same 2SLS setup, we estimate the control mean for compliers by transforming the outcome variable (Abadie, Angrist, and Imbens 2002): \[ Y_i^{CC} = -(D_i - 1) Y_i. \] The estimated coefficient from this regression, \(\beta_{cc}\), captures the mean outcome for compliers in the control group.

  3. Compute the proportional LATE:

    The estimated proportional LATE is given by: \[ \theta_{ATE\%} = \frac{\hat{\beta}}{\hat{\beta}_{cc}}, \] which provides a direct percentage change interpretation for the outcome among compliers induced by the instrument.

  4. Obtain standard errors via non-parametric bootstrap:

    Since \(\theta_{ATE\%}\) is a ratio of estimated coefficients, standard errors are best obtained using non-parametric bootstrap methods (see the sketch following this list).

  5. Special case: Binary instrument

    If the instrument is binary, \(\theta_{ATE\%}\) for the intensive margin of compliers can be directly estimated using Poisson IV regression (ivpoisson in Stata).
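
Putting the steps together, here is a hedged sketch of the procedure on simulated data (binary \(Z\) and \(D\), no exogenous controls; the DGP and the particular regressor convention used to recover \(\beta_{cc}\) are one implementation choice, not the only one):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 20_000
z = rng.integers(0, 2, size=n)                     # binary instrument

# Compliers (60%) take treatment iff z = 1; everyone else never takes it
complier = rng.random(n) < 0.6
d = np.where(complier, z, 0)

# Control complier mean is 3.0; treatment adds 0.5 for compliers
y = 1.0 + 2.0 * complier + 0.5 * d + rng.normal(size=n)

def wald_slope(outcome, treat, instr):
    """2SLS slope of outcome on [1, treat] with a single binary instrument."""
    return (outcome[instr == 1].mean() - outcome[instr == 0].mean()) / (
        treat[instr == 1].mean() - treat[instr == 0].mean())

def theta_pct(idx):
    beta = wald_slope(y[idx], d[idx], z[idx])       # LATE in levels (~0.5)
    y_cc = -(d[idx] - 1) * y[idx]                   # transformed outcome Y^CC
    # Control complier mean: 2SLS slope of Y^CC on (1 - D), instrumented by Z
    beta_cc = wald_slope(y_cc, 1 - d[idx], z[idx])  # ~3.0
    return beta / beta_cc                           # proportional LATE

est = theta_pct(np.arange(n))

# Non-parametric bootstrap for the standard error of the ratio
boot = np.array([theta_pct(rng.integers(0, n, size=n)) for _ in range(500)])
print(f"theta_ATE% = {est:.3f}, bootstrap SE = {boot.std():.3f}")  # ~0.167
```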

34.7.4.2 Bounds on Intensive-Margin Effects

Lee (2009) proposed a bounding approach for intensive-margin effects. It requires a monotonicity assumption on selection: compliers must have positive outcomes regardless of treatment status. Under that assumption, the bounds recover the treatment effect on the intensive margin without relying on log transformations.
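
A stylized sketch of Lee-type trimming bounds (simulated data; it assumes treatment weakly increases the share of positive outcomes, and in this DGP selection into positivity is independent of the outcome level):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50_000
d = rng.integers(0, 2, size=n)                 # randomized treatment

# Extensive margin: treatment raises P(Y > 0) from 0.5 to 0.6
positive = rng.random(n) < np.where(d == 1, 0.6, 0.5)
# Intensive margin: positive outcomes are ~1.0 higher under treatment
y = np.where(positive, 5.0 + 1.0 * d + rng.normal(size=n), 0.0)

y1 = y[(d == 1) & (y > 0)]                     # positive treated outcomes
y0 = y[(d == 0) & (y > 0)]                     # positive control outcomes

# Trimming share: excess positives in the treated arm
q1, q0 = (y[d == 1] > 0).mean(), (y[d == 0] > 0).mean()
p = (q1 - q0) / q1

# Lee-type bounds: trim the top / bottom p-share of positive treated outcomes
lower = y1[y1 <= np.quantile(y1, 1 - p)].mean() - y0.mean()
upper = y1[y1 >= np.quantile(y1, p)].mean() - y0.mean()
print(f"Bounds on the intensive-margin effect: [{lower:.2f}, {upper:.2f}]")
```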

References

Abadie, Alberto, Joshua Angrist, and Guido Imbens. 2002. “Instrumental Variables Estimates of the Effect of Subsidized Training on the Quantiles of Trainee Earnings.” Econometrica 70 (1): 91–117.
Bekker, Paul A. 1994. “Alternative Approximations to the Distributions of Instrumental Variable Estimators.” Econometrica: Journal of the Econometric Society, 657–81.
Chen, Jiafeng, and Jonathan Roth. 2024. “Logs with Zeros? Some Problems and Solutions.” The Quarterly Journal of Economics 139 (2): 891–936.
Gronau, Reuben. 1974. “Wage Comparisons–a Selectivity Bias.” Journal of Political Economy 82 (6): 1119–43.
Heckman, James J. 1979. “Sample Selection Bias as a Specification Error.” Econometrica: Journal of the Econometric Society, 153–61.
Imbens, Guido W, and Joshua D Angrist. 1994. “Identification and Estimation of Local Average Treatment Effects.” Econometrica 62 (2): 467–75.
Lee, David S. 2009. “Training, Wages, and Sample Selection: Estimating Sharp Bounds on Treatment Effects.” The Review of Economic Studies, 1071–1102.
Mogstad, Magne, and Alexander Torgovitsky. 2024. “Instrumental Variables with Unobserved Heterogeneity in Treatment Effects.” In Handbook of Labor Economics, 5:1–114. Elsevier.