34.7 Cautions in IV
34.7.1 Negative R2 in IV
In IV estimation, particularly 2SLS and 3SLS, it is common, and not problematic, to encounter negative $R^2$ values in the second-stage regression. Unlike Ordinary Least Squares, where $R^2$ is often used to assess the fit of the model, in IV regression the primary concern is consistent estimation of the coefficients of interest, not goodness-of-fit.
What Should You Look At Instead of $R^2$ in IV?
- Instrument Relevance (First-stage F-statistics, Partial R2)
- Weak Instrument Tests (Kleibergen-Paap, Anderson-Rubin tests)
- Validity of Instruments (Overidentification tests like Sargan/Hansen J-test)
- Endogeneity Tests (Durbin-Wu-Hausman test for endogeneity)
- Confidence Intervals and Standard Errors, focusing on inference for $\hat{\beta}$.
Geometric Intuition
- In OLS, the fitted values $\hat{y}$ are the orthogonal projection of $y$ onto the column space of $X$.
- In 2SLS, $\hat{y}$ is the projection onto the space spanned by $Z$, not $X$.
- As a result, the angle between $y$ and $\hat{y}$ need not minimize the residual variance, and RSS can be larger than in OLS.
Recall the formula for the coefficient of determination ($R^2$) in a regression model:

$$R^2 = 1 - \frac{RSS}{TSS} = \frac{MSS}{TSS}$$

where:

- TSS is the Total Sum of Squares: $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$
- MSS is the Model Sum of Squares: $MSS = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
- RSS is the Residual Sum of Squares: $RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
In OLS, $R^2$ measures the proportion of the variance in $Y$ that is explained by the regressors $X$.
Key Properties in OLS:
- $R^2 \in [0, 1]$.
- Adding more regressors (even irrelevant ones) never decreases $R^2$.
- $R^2$ measures in-sample goodness-of-fit, not causal interpretation.
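The $R^2$ decomposition above can be checked numerically. A minimal sketch with simulated toy data (all numbers here are illustrative assumptions, not from the text), verifying that $TSS = MSS + RSS$ holds in OLS with an intercept:

```python
# Verify the OLS R^2 decomposition TSS = MSS + RSS numerically.
# The decomposition holds exactly when the regression includes an intercept.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # toy DGP (illustrative)

X = np.column_stack([np.ones(n), x])     # intercept + regressor
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_ols

tss = np.sum((y - y.mean()) ** 2)
mss = np.sum((y_hat - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)
r2 = 1 - rss / tss

print(round(r2, 3))                 # R^2 lies in [0, 1]
print(np.isclose(tss, mss + rss))   # decomposition holds
```

Dropping the intercept breaks the decomposition, which is one reason reported $R^2$ values differ across software when models are estimated without a constant.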
34.7.1.1 Why Does $R^2$ Lose Its Meaning in IV Regression?
In IV regression, the second-stage regression replaces the endogenous variable $X_2$ with its predicted values from the first stage:

Stage 1:
$$X_2 = Z\pi + v$$

Stage 2:
$$Y = X_1\beta_1 + \hat{X}_2\beta_2 + \epsilon$$

- $\hat{X}_2$ is not the observed $X_2$, but a proxy constructed from $Z$.
- $\hat{X}_2$ isolates the exogenous variation in $X_2$ that is independent of $\epsilon$.
- This reduces bias, but comes at a cost:
  - The variation in $\hat{X}_2$ is typically less than that in $X_2$.
  - The predicted values $\hat{y}_i$ from the second stage are not necessarily close to $y_i$.
34.7.1.2 Why $R^2$ Can Be Negative
$R^2$ is calculated as $R^2 = 1 - \frac{RSS}{TSS}$. But in IV:
- The predicted values of $Y$ are not chosen to minimize RSS, because IV does not minimize residuals in the second stage.
- Unlike OLS, 2SLS chooses $\hat{\beta}$ to satisfy moment conditions rather than to minimize the sum of squared errors.
It is therefore possible (and common in IV) for the residual sum of squares to exceed the total sum of squares:
$$RSS > TSS \quad \Rightarrow \quad R^2 = 1 - \frac{RSS}{TSS} < 0$$
This happens because:
- The predicted values $\hat{y}_i$ in IV are not optimized to fit the observed $y_i$.
- The residuals can be larger, because IV focuses on identifying causal effects, not on prediction.
For example, suppose $TSS = 100$ and $RSS = 120$. Then:
$$R^2 = 1 - \frac{120}{100} = -0.20$$
This happens because the IV procedure does not minimize RSS. It prioritizes solving the endogeneity problem over explaining the variance in $Y$.
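A simulation makes this concrete. The sketch below uses an illustrative design of my own (a small positive true effect combined with strong negative endogeneity) in which the second-stage $R^2$ of 2SLS comes out negative even though the IV slope is consistent:

```python
# Illustrative simulation: 2SLS recovers the causal slope, yet its
# second-stage R^2 is negative. The DGP parameters are assumptions.
import numpy as np

rng = np.random.default_rng(42)
n = 5000
u = rng.normal(size=n)                        # structural error
z = rng.normal(size=n)                        # instrument: relevant, exogenous
x = z - 0.9 * u + 0.2 * rng.normal(size=n)    # endogenous: Cov(x, u) < 0
y = 0.2 * x + u                               # small positive causal effect

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

Z = np.column_stack([np.ones(n), z])
X = np.column_stack([np.ones(n), x])

# 2SLS: first-stage fitted values, then second-stage regression
x_hat = Z @ ols(Z, x)
b_iv = ols(np.column_stack([np.ones(n), x_hat]), y)

# Evaluate residuals at the *observed* x, as the structural model requires
e_iv = y - X @ b_iv
tss = np.sum((y - y.mean()) ** 2)
r2_iv = 1 - np.sum(e_iv ** 2) / tss

b_ols = ols(X, y)
r2_ols = 1 - np.sum((y - X @ b_ols) ** 2) / tss

print(f"IV  slope {b_iv[1]: .3f}  R^2 {r2_iv: .3f}")   # slope near 0.2, R^2 < 0
print(f"OLS slope {b_ols[1]: .3f}  R^2 {r2_ols: .3f}") # biased slope, R^2 >= 0
```

Here OLS reports the better fit and the wrong sign of the effect; 2SLS reports a negative $R^2$ and the right answer.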
34.7.1.3 Why We Don’t Care About $R^2$ in IV
- IV Estimates Focus on Consistency, Not Prediction
  - The goal of IV is to obtain a consistent estimate of $\beta_2$.
  - IV sacrifices fit (higher variance in $\hat{y}_i$) to remove endogeneity bias.
- $R^2$ Does Not Reflect the Quality of an IV Estimator
  - A high $R^2$ in IV may be misleading (for instance, when instruments are weak or invalid).
  - A negative $R^2$ does not imply a bad IV estimator if the instrument validity assumptions are met.
- IV Regression Is About Identification, Not In-Sample Fit
  - IV relies on relevance and exogeneity of instruments, not residual minimization.
34.7.1.4 Technical Details on $R^2$
In OLS:
$$\hat{\beta}_{OLS} = (X'X)^{-1}X'Y$$
minimizes
$$RSS = (Y - X\hat{\beta}_{OLS})'(Y - X\hat{\beta}_{OLS})$$
In IV:
$$\hat{\beta}_{IV} = (X'P_Z X)^{-1} X'P_Z Y$$
where $P_Z = Z(Z'Z)^{-1}Z'$ is the projection matrix onto $Z$.
The IV estimator solves the moment conditions:
$$Z'(Y - X\hat{\beta}) = 0$$
There is no guarantee that this minimizes RSS.
Residuals:
$$e_{IV} = Y - X\hat{\beta}_{IV}$$
The norm of $e_{IV}$ is typically larger than in OLS because IV uses fewer effective degrees of freedom (variation constrained through $Z$).
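Both properties can be verified directly: the IV solution zeroes the sample moments $Z'(Y - X\hat{\beta})$, while its RSS is never below the OLS RSS. A minimal just-identified sketch with an assumed toy design:

```python
# Check the defining IV moment condition Z'(Y - X b_IV) = 0 numerically,
# and that the IV residual sum of squares is at least the OLS one.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
u = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.8 * z + 0.5 * u + rng.normal(size=n)   # endogenous regressor
y = 1.0 + 1.5 * x + u                        # illustrative DGP

Z = np.column_stack([np.ones(n), z])
X = np.column_stack([np.ones(n), x])

# Just-identified IV: b = (Z'X)^{-1} Z'Y
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

moment = Z.T @ (y - X @ b_iv)            # ~0 by construction
rss_iv = np.sum((y - X @ b_iv) ** 2)
rss_ols = np.sum((y - X @ b_ols) ** 2)   # OLS minimizes RSS over all b

print(np.max(np.abs(moment)))   # numerically zero
print(rss_iv >= rss_ols)        # True: IV does not minimize RSS
```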
A Note on $R^2$ in 3SLS and GMM
- In 3SLS or GMM IV, $R^2$ can be similarly misleading.
- These methods often operate under moment conditions or system estimation, not residual minimization.
34.7.2 Many-Instruments Bias
While IV is powerful, it is also delicate. One critical issue that arises is the many-instruments problem, also known as many-IV bias.
Consider the structural model:
$$y_i = \beta x_i + u_i$$
where $x_i$ is endogenous: $E[x_i u_i] \neq 0$. To address this, we introduce instruments $z_i$ such that:
- Relevance: $E[z_i x_i] \neq 0$
- Exogeneity: $E[z_i u_i] = 0$
The standard 2SLS estimator is given by:
$$\hat{\beta}_{2SLS} = (X'P_Z X)^{-1} X'P_Z y$$
where $P_Z = Z(Z'Z)^{-1}Z'$ is the projection matrix onto the column space of $Z$.
The many-IV problem arises when the number of instruments $L$ is large relative to the number of observations $n$. In particular, the issue becomes severe as $L/n \to \alpha > 0$, leading to several problems:
- Bias Toward OLS: The 2SLS estimator becomes increasingly biased toward the OLS estimator.
- Overfitting: The first-stage regression overfits the endogenous variable, capturing noise rather than true variation.
- Inflated Variance: The second-stage estimates become imprecise, leading to misleading inference.
Traditional IV asymptotics assume $L$ is fixed as $n \to \infty$. Bekker (1994) proposed an alternative framework where:
$$L/n \to \alpha \in (0, \infty) \quad \text{as } n \to \infty$$
Under Bekker asymptotics:
- 2SLS is biased and inconsistent unless the instruments are very strong.
- The bias grows with $\alpha$, approaching that of OLS.
This formalized the intuition that adding more instruments—especially weak ones—does not help, and can actually harm estimation.
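A small Monte Carlo illustrates the point. In the assumed design below (my own illustrative parameters), one relevant instrument is padded with 59 irrelevant ones; the median 2SLS estimate then drifts from the true effect toward the OLS probability limit:

```python
# Monte Carlo sketch of many-instruments bias: padding one relevant
# instrument with many irrelevant ones pulls 2SLS toward OLS.
# All design parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(7)
n, reps = 100, 200
beta = 1.0                                   # true causal effect

def tsls(Z, x, y):
    """2SLS slope for a single regressor with no controls (helper)."""
    pz_x = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # first-stage fit
    return (pz_x @ y) / (pz_x @ x)

est = {"ols": [], "just": [], "many": []}
for _ in range(reps):
    u = rng.normal(size=n)
    z1 = rng.normal(size=n)                  # the one relevant instrument
    junk = rng.normal(size=(n, 59))          # 59 irrelevant instruments
    x = z1 + 0.8 * u + 0.3 * rng.normal(size=n)
    y = beta * x + u
    est["ols"].append((x @ y) / (x @ x))
    est["just"].append(tsls(z1[:, None], x, y))
    est["many"].append(tsls(np.column_stack([z1[:, None], junk]), x, y))

med = {k: float(np.median(v)) for k, v in est.items()}
print({k: round(v, 3) for k, v in med.items()})
```

Medians are used because the just-identified IV estimator has no finite mean; the pattern (just-identified near the truth, many-instrument 2SLS between the truth and OLS) is the Bekker-style bias described above.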
34.7.2.1 Sources of Many-IV Bias
- Individually Weak Instruments

Many instruments are individually weak (i.e., each contributes little to explaining $x$), and collectively they inflate the projection without improving identification.
- Overfitting the First Stage

With too many instruments, the first-stage regression captures random noise, leading to poor out-of-sample performance and contamination of the second stage.
- Endogeneity Leakage

Overfit first-stage predictions may reintroduce endogeneity through incidental correlation with the structural error term.
34.7.2.2 Diagnostic Tools
- First-Stage F-Statistic
  - A weak-instrument check: the rule of thumb is $F > 10$ for a single instrument; more stringent thresholds apply with many instruments.
- Overidentification Tests
  - Sargan test: assumes homoskedastic errors.
  - Hansen’s J-test: robust to heteroskedasticity.
- Eigenvalue Diagnostics
  - Kleibergen-Paap rk statistic (generalized for clustered or heteroskedastic settings).
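The first-stage F-statistic is just the standard F-test that all instrument coefficients in the first-stage regression are zero. A hand-rolled sketch on assumed toy data (two instruments, illustrative coefficients):

```python
# First-stage F-statistic for instrument relevance, computed by hand:
# F = [(RSS_restricted - RSS_unrestricted)/q] / [RSS_unrestricted/(n - k)],
# testing that all instrument coefficients in the first stage are zero.
import numpy as np

rng = np.random.default_rng(3)
n = 500
z = rng.normal(size=(n, 2))                              # two instruments
x = 0.5 * z[:, 0] + 0.3 * z[:, 1] + rng.normal(size=n)   # first stage

ones = np.ones((n, 1))
W_u = np.column_stack([ones, z])    # unrestricted: intercept + instruments
W_r = ones                          # restricted: intercept only

def rss(W, x):
    b = np.linalg.lstsq(W, x, rcond=None)[0]
    return np.sum((x - W @ b) ** 2)

q, k = 2, 3                         # restrictions tested; unrestricted params
F = ((rss(W_r, x) - rss(W_u, x)) / q) / (rss(W_u, x) / (n - k))
print(round(F, 1))                  # well above the rule-of-thumb threshold of 10
```

With heteroskedastic or clustered errors, the robust (Kleibergen-Paap) version should be used instead of this homoskedastic formula.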
34.7.2.3 Remedies and Alternatives
- Instrument Selection
  - Lasso IV / Post-Double Selection: use regularization to select valid instruments.
  - Factor-Based Methods: project instruments onto principal components.
- Shrinkage Estimators
  - Limited Information Maximum Likelihood (LIML): more robust to many-IV bias.
  - Jackknife IV Estimator (JIVE): adjusts for the overfitting bias in the first stage.
- Grouped or Aggregated Instruments
  - Collapse multiple instruments into a smaller number of aggregated measures.
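The factor-based remedy can be sketched in a few lines: replace a large block of correlated instruments with their leading principal components before running 2SLS. The design below (one common factor driving 150 noisy instruments, a Monte Carlo over 100 replications) is an illustrative assumption of mine:

```python
# Sketch of a factor-based remedy: collapse many correlated instruments
# into their first two principal components before 2SLS.
# DGP, dimensions, and number of retained components are illustrative.
import numpy as np

rng = np.random.default_rng(11)
n, L, reps = 300, 150, 100

def tsls(Zmat, x, y):
    x_hat = Zmat @ np.linalg.lstsq(Zmat, x, rcond=None)[0]
    return (x_hat @ y) / (x_hat @ x)

b_all, b_pc = [], []
for _ in range(reps):
    f = rng.normal(size=n)                           # common factor behind all IVs
    Z = f[:, None] + 0.5 * rng.normal(size=(n, L))   # 150 noisy instruments
    u = rng.normal(size=n)
    x = f + 0.7 * u + rng.normal(size=n)             # endogenous regressor
    y = 1.0 * x + u                                  # true effect = 1.0
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    b_all.append(tsls(Zc, x, y))                     # all 150 instruments
    b_pc.append(tsls(Zc @ Vt[:2].T, x, y))           # first 2 components only

print(round(float(np.median(b_all)), 3))   # biased away from 1.0
print(round(float(np.median(b_pc)), 3))    # close to the true effect
```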
34.7.2.4 Practical Guidelines
- Avoid Including All Possible Instruments: Parsimony matters more than volume.
- Always Check First-Stage Strength: Even if $R^2$ is high, individual instrument strength matters.
- Report Robustness with Alternative Estimators: LIML or JIVE can serve as robustness checks.
- Test for Overidentification: But interpret results cautiously when L is large.
34.7.3 Heterogeneous Effects in IV Estimation
34.7.3.1 Constant vs. Heterogeneous Treatment Effects
The standard instrumental variables framework assumes that the causal effect of an endogenous regressor $D_i$ on an outcome $Y_i$ is constant across individuals, i.e.:
$$Y_i = \beta_0 + \beta_1 D_i + u_i$$
This implies homogeneous treatment effects, where $\beta_1$ is a structural parameter that applies uniformly to all individuals $i$ in the population. We refer to this as the homogeneous treatment effects model, and it underlies the traditional IV assumptions:
- Linearity with a constant effect $\beta_1$.
- Instrument relevance: $Cov(Z_i, D_i) \neq 0$.
- Instrument exogeneity: $Cov(Z_i, u_i) = 0$.
Under these assumptions, the IV estimator $\hat{\beta}_1^{IV}$ consistently estimates the causal effect $\beta_1$.
34.7.3.2 Heterogeneous Treatment Effects and the Problem for IV
In practice, treatment effects often vary across individuals. That is, the effect of $D_i$ on $Y_i$ depends on the individual’s characteristics or other unobserved factors:
$$Y_i = \beta_{1i} D_i + u_i$$
Here, $\beta_{1i}$ represents the individual-specific causal effect, and the population Average Treatment Effect is:
$$ATE = E[\beta_{1i}]$$
In the presence of treatment effect heterogeneity, the IV estimator $\hat{\beta}_1^{IV}$ does not, in general, estimate the ATE. Instead, it estimates a weighted average of the heterogeneous treatment effects, with weights determined by the instrument-induced variation in the data.
This distinction is critical:
- OLS estimates a weighted average treatment effect, with weights depending on the variance of $D_i$.
- IV estimates a Local Average Treatment Effect (LATE), which depends on the instrument $Z_i$.
When there is one endogenous regressor $D_i$ and one instrument $Z_i$, both binary, we can interpret the IV estimator as the Local Average Treatment Effect under specific assumptions. The setup is:
$$Y_i = \beta_0 + \beta_{1i} D_i + u_i$$
- $D_i \in \{0, 1\}$: the treatment indicator.
- $Z_i \in \{0, 1\}$: the binary instrument.
Assumptions for the LATE Interpretation
1. Independence: $Z_i \perp (u_i, v_i)$. The instrument is as good as randomly assigned, and is independent of both the structural error term $u_i$ and the unobserved determinants $v_i$ that affect treatment selection.
2. Relevance (first stage): $P(D_i = 1 \mid Z_i = 1) \neq P(D_i = 1 \mid Z_i = 0)$. The instrument must affect the likelihood of receiving treatment $D_i$.
3. Monotonicity: $D_i(1) \geq D_i(0) \; \forall i$. There are no defiers: no individual takes the treatment when $Z_i = 0$ but refuses it when $Z_i = 1$. Monotonicity is not testable and must be defended on theoretical grounds.
Under these assumptions, $\hat{\beta}_1^{IV}$ estimates the Local Average Treatment Effect:
$$LATE = E[\beta_{1i} \mid \text{Compliers}]$$
- Compliers are individuals who receive the treatment when $Z_i = 1$, but not when $Z_i = 0$.
- “Local” refers to the fact that the estimate pertains to this specific subpopulation of compliers.
Implications:
- The LATE is not the ATE, unless treatment effects are homogeneous, or the complier subpopulation is representative of the entire population.
- Different instruments define different complier groups, leading to different LATEs.
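The LATE/ATE gap is easy to see in a simulation. The sketch below assumes illustrative principal-strata shares and effects (compliers gain 2.0, always-takers 6.0, never-takers 0.0); the Wald estimator $\frac{E[Y|Z=1]-E[Y|Z=0]}{E[D|Z=1]-E[D|Z=0]}$ recovers the compliers' effect, not the population ATE:

```python
# Wald/LATE estimator with a binary instrument under heterogeneous effects.
# Strata shares and effect sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(5)
n = 50_000

# Principal strata: compliers ("c"), always-takers ("a"), never-takers ("nt")
types = rng.choice(["c", "a", "nt"], size=n, p=[0.6, 0.2, 0.2])
effect = np.where(types == "c", 2.0, np.where(types == "a", 6.0, 0.0))

z = rng.integers(0, 2, size=n)                                # random instrument
d = np.where(types == "a", 1, np.where(types == "nt", 0, z))  # no defiers
y = rng.normal(size=n) + effect * d

wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
ate = effect.mean()

print(round(wald, 2))   # ~2.0: the compliers' effect (LATE)
print(round(ate, 2))    # ~2.4: the population ATE differs
```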
34.7.3.3 Multiple Instruments and Multiple LATEs
When we have multiple instruments $Z_i^{(1)}, Z_i^{(2)}, \ldots, Z_i^{(m)}$, each can induce a different complier group:
- Each instrument has its own LATE, corresponding to its own group of compliers.
- If treatment effects are heterogeneous, these LATEs may differ.
In an overidentified model, where $m > k$, the 2SLS estimator imposes the assumption that all instruments identify the same causal effect $\beta_1$. This leads to the moment conditions:
$$E[Z_i^{(j)} (Y_i - D_i \beta_1)] = 0 \quad \forall j = 1, \ldots, m$$
If instruments identify different LATEs:
- These moment conditions can be mutually inconsistent.
- The Sargan-Hansen J-test may reject, even though each instrument is valid (i.e., exogenous and relevant).
Key Insight: The J-test rejects because the homogeneity assumption is violated—not because instruments are invalid in the exogeneity sense.
34.7.3.4 Illustration: Multiple Instruments, Different LATEs
Consider the following example:
- $Z_i^{(1)}$ identifies a LATE of 1.0.
- $Z_i^{(2)}$ identifies a LATE of 2.0.
If both instruments are included in an overidentified IV model, the 2SLS estimator tries to reconcile these LATEs as if they identified the same $\beta_1$, leading to:
- a weighted average of these LATEs;
- a possible rejection of the overidentification restrictions via the J-test.
This scenario is common in:
- Labor economics (e.g., different instruments for education identify different populations).
- Marketing and pricing experiments (e.g., different price instruments affect different customer segments).
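The reconciliation behavior can be simulated directly. In the assumed design below (my own illustration), one subpopulation complies with $Z^{(1)}$ and has an effect of 1.0, the other complies with $Z^{(2)}$ and has an effect of 2.0; pooling both instruments in 2SLS yields an estimate between the two LATEs:

```python
# Two binary instruments with different complier groups and different LATEs.
# Pooled overidentified 2SLS lands between the two. Design is illustrative.
import numpy as np

rng = np.random.default_rng(9)
n = 40_000

group = rng.integers(0, 2, size=n)        # two complier populations
z1 = rng.integers(0, 2, size=n)
z2 = rng.integers(0, 2, size=n)
d = np.where(group == 0, z1, z2)          # group 0 complies with z1, group 1 with z2
effect = np.where(group == 0, 1.0, 2.0)   # heterogeneous LATEs
y = effect * d + rng.normal(size=n)

def tsls(Z, X, y):
    """2SLS: regress y on the first-stage fitted values of X (helper)."""
    Xh = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(Xh, y, rcond=None)[0]

ones = np.ones(n)
X = np.column_stack([ones, d])
late1 = tsls(np.column_stack([ones, z1]), X, y)[1]       # ~1.0
late2 = tsls(np.column_stack([ones, z2]), X, y)[1]       # ~2.0
pooled = tsls(np.column_stack([ones, z1, z2]), X, y)[1]  # between the two

print(round(late1, 2), round(late2, 2), round(pooled, 2))
```

In this design a J-test would tend to reject in large samples even though both instruments are exogenous, for exactly the reason stated above.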
34.7.3.5 Practical Implications for Empirical Research
- Be Clear About Whose Effect You’re Estimating
  - Different instruments often imply different complier groups.
  - Understanding who the compliers are is essential for policy implications.
- Interpret the J-Test Carefully
  - A rejection may indicate treatment effect heterogeneity, not necessarily instrument invalidity.
  - Supplement the J-test with:
    - Subgroup analysis.
    - Sensitivity analysis.
    - Local Instrumental Variable or Marginal Treatment Effects frameworks.
- Use Structural Models When Needed
  - If you need an ATE, consider parametric or semi-parametric structural models that explicitly model heterogeneity.
- Don’t Assume LATE = ATE
  - Be cautious in generalizing LATE estimates beyond the complier subpopulation.
34.7.3.6 Beyond LATE
The presence of heterogeneous treatment effects ($\beta_{1i}$ varying across individuals) raises a fundamental challenge for causal inference using IV methods. As we have seen, the traditional IV estimator identifies the Local Average Treatment Effect (LATE) under certain assumptions. However, this approach implicitly adopts a reverse engineering strategy: it uses classical linear IV estimators designed under homogeneity, acknowledges their likely misspecification in the presence of unobserved heterogeneity, and interprets the resulting estimate in terms of a LATE.
This strategy has been highly influential and remains central to empirical work. Nevertheless, it comes with limitations:
- The interpretation depends critically on the specific instrument used (i.e., the definition of the complier group).
- It cannot recover the Average Treatment Effect (ATE) or other policy-relevant parameters unless strong additional assumptions hold.
34.7.3.7 Forward Engineering: The Marginal Treatment Effect
In contrast, recent work—including that of Mogstad and Torgovitsky (2024) —emphasizes a forward engineering approach. Rather than adapting estimators designed under homogeneity, this strategy builds models and estimators that explicitly allow for unobserved heterogeneity in treatment effects from the outset.
A key framework in this approach is the Marginal Treatment Effect (MTE), originally developed in the context of selection models (Gronau 1974; J. J. Heckman 1979). The idea is to model the treatment decision as the result of a latent index:
$$D_i = \mathbf{1}[v_i \leq Z_i'\pi + \eta_i]$$
and to let the treatment effect vary with unobserved selection variables. The MTE is then defined as:
$$MTE(u) = E[Y_i(1) - Y_i(0) \mid U_i = u]$$
where $U_i$ is the latent variable governing treatment selection. This function traces out how the treatment effect varies across individuals with different propensities to receive treatment, and it underlies other average effects such as:
- ATE: $\int_0^1 MTE(u)\,du$
- LATE: the average of $MTE(u)$ over the complier margin
- TT and TUT: other weighted averages of $MTE(u)$
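These aggregation formulas can be illustrated numerically with an assumed MTE curve. Taking $MTE(u) = 3 - 2u$ (a purely illustrative functional form, declining in $u$) and an assumed complier margin $[0.2, 0.5]$:

```python
# Recovering average effects from an assumed MTE curve by integration.
# MTE(u) = 3 - 2u and the complier margin [0.2, 0.5] are illustrative.
import numpy as np

def mte(u):
    return 3.0 - 2.0 * u

u = np.linspace(0.0, 1.0, 100_001)
ate = np.mean(mte(u))                 # ATE = integral of MTE over [0, 1]

a, b = 0.2, 0.5                       # assumed complier margin
u_c = np.linspace(a, b, 100_001)
late = np.mean(mte(u_c))              # LATE = average of MTE over [a, b]

print(round(ate, 3))    # 2.0
print(round(late, 3))   # 2.3
```

Different instruments shift the margin $[a, b]$ and hence the LATE, while the ATE is fixed by the whole curve, which is exactly the distinction drawn above.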
Comparison of IV, LATE, and MTE Approaches

| Feature | Traditional IV (LATE) | MTE / Selection Models |
|---|---|---|
| Assumes constant effects | Implicitly violated | Explicitly allows heterogeneity |
| Interpretation | LATE for compliers | MTE curve + all average treatment effects |
| Data requirements | Instrument + outcome | Richer variation in instrument (e.g., continuous) |
| Estimation complexity | Low (2SLS) | Higher (requires modeling selection) |
The MTE framework also connects to:
- Control function methods: which account for selection via inclusion of latent variables (e.g., residuals) in outcome equations.
- Partial identification / bounding methods: which avoid strong parametric assumptions and seek informative bounds on treatment effects.
These newer strategies reflect a shift in modern econometrics: away from treating unobserved heterogeneity as a nuisance, and toward modeling it directly for richer causal inference.
Understanding these two strategies helps practitioners choose appropriate methods based on:
- Their identifying assumptions.
- The richness of their instruments.
- Their target estimand (e.g., ATE, LATE, MTE).
- Their willingness to model the selection process.
Researchers should be cautious in interpreting IV estimates as general causal effects, especially when heterogeneous treatment effects are likely and the choice of instrument strongly influences the complier population.
34.7.4 Zero-Valued Outcomes
For outcomes that can take the value zero, log transformations introduce interpretation issues: the coefficient on a log-transformed outcome does not directly represent a percentage change (J. Chen and Roth 2024). We must distinguish the treatment effect on the intensive margin (e.g., an outcome moving from 10 to 11) from the effect on the extensive margin (e.g., from 0 to 1), and the treatment coefficient in a log-transformed outcome regression cannot readily be interpreted as a percentage change. In such cases, researchers turn to alternative methods:
34.7.4.1 Proportional LATE Estimation
When dealing with zero-valued outcomes, direct log transformations can lead to interpretation issues. To obtain an interpretable percentage change in the outcome due to treatment among compliers, we estimate the proportional Local Average Treatment Effect (LATE), denoted $\theta_{ATE\%}$.
Steps to Estimate Proportional LATE:
1. Estimate LATE using 2SLS.
We first estimate the treatment effect using a standard two-stage least squares regression:
$$Y_i = \beta D_i + X_i'\gamma + \epsilon_i$$
where:
- $D_i$ is the endogenous treatment variable.
- $X_i$ includes any exogenous controls.
- $\beta$ represents the LATE in levels, relative to the mean of the control group’s compliers.
2. Estimate the control complier mean ($\beta_{cc}$).
Using the same 2SLS setup, we estimate the control mean for compliers by transforming the outcome variable (Abadie, Angrist, and Imbens 2002):
$$Y_i^{CC} = -(D_i - 1) Y_i$$
The estimated coefficient from this regression, $\beta_{cc}$, captures the mean outcome for compliers in the control group.
3. Compute the proportional LATE.
The estimated proportional LATE is given by:
$$\theta_{ATE\%} = \frac{\hat{\beta}}{\hat{\beta}_{cc}}$$
which provides a direct percentage-change interpretation for the outcome among compliers induced by the instrument.
4. Obtain standard errors via non-parametric bootstrap.
Since $\theta_{ATE\%}$ is a ratio of estimated coefficients, standard errors are best obtained using the non-parametric bootstrap.
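A minimal sketch of the whole procedure on simulated data follows. The DGP (strata shares, a control complier mean of 2.0, a complier effect of 1.0, so the true $\theta_{ATE\%} = 0.5$) and the particular regression used for $\beta_{cc}$ (IV of $(1-D_i)Y_i$ on $(1-D_i)$, one way to implement the Abadie, Angrist, and Imbens transformation) are illustrative assumptions:

```python
# Sketch: proportional LATE theta = beta / beta_cc with a bootstrap SE.
# DGP and the control-complier-mean regression are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(13)
n = 5_000

types = rng.choice(["c", "a", "nt"], size=n, p=[0.5, 0.25, 0.25])
z = rng.integers(0, 2, size=n)
d = np.where(types == "a", 1, np.where(types == "nt", 0, z))   # no defiers
y0 = 2.0 + 0.5 * rng.normal(size=n)                            # positive control outcome
y = y0 + np.where(types == "c", 1.0, 0.5) * d                  # complier effect = 1.0

def iv_slope(z, w, y):
    """Just-identified IV slope of y on w (with intercept), instrument z."""
    zc, wc, yc = z - z.mean(), w - w.mean(), y - y.mean()
    return (zc @ yc) / (zc @ wc)

def theta(idx):
    zi, di, yi = z[idx], d[idx], y[idx]
    beta = iv_slope(zi, di, yi)                      # LATE in levels
    beta_cc = iv_slope(zi, 1 - di, (1 - di) * yi)    # control complier mean
    return beta / beta_cc

theta_hat = theta(np.arange(n))
boot = [theta(rng.integers(0, n, size=n)) for _ in range(200)]
se = np.std(boot)

print(round(theta_hat, 2))   # ~0.5: compliers' outcome rises ~50%
print(round(se, 3))          # bootstrap standard error
```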
Special case: binary instrument.
If the instrument is binary, $\theta_{ATE\%}$ for the intensive margin of compliers can be directly estimated using Poisson IV regression (`ivpoisson` in Stata).
34.7.4.2 Bounds on Intensive-Margin Effects
Lee (2009) proposed a bounding approach for intensive-margin effects. These bounds estimate treatment effects without relying on log transformations, but they require a monotonicity assumption for compliers: compliers must have positive outcomes regardless of treatment status, so that the effect operates purely on the intensive margin.