36.1 Endogenous Treatment
36.1.1 Measurement Error
Data errors can stem from:
- Coding errors
- Reporting errors
Two forms of measurement error:
- Random (stochastic, indeterminate) error (Classical Measurement Error): noise that does not show up in a consistent or predictable way.
- Systematic (determinate) error (Non-classical Measurement Error): measurement error that is consistent and predictable across observations.
  - Instrument errors (e.g., faulty scale) -> calibration or adjustment
  - Method errors (e.g., sampling errors) -> better method development + study design
  - Human errors (e.g., judgment)
Usually, systematic measurement error is the bigger issue because it introduces bias into our estimates, while random error introduces noise:
- Noise -> attenuates the regression estimate toward 0
- Bias -> can pull the estimate upward or downward
36.1.1.1 Classical Measurement Errors
36.1.1.1.1 Right-hand side
- Right-hand side measurement error: when the measurement error is in the covariates, we have an endogeneity problem.
Say you know the true model is

$$
Y_i = \beta_0 + \beta_1 X_i + u_i
$$

But you don't observe $X_i$; instead, you observe

$$
\tilde{X}_i = X_i + e_i
$$

which is known as classical measurement error, where we assume $e_i$ is uncorrelated with $X_i$ (i.e., $E(X_i e_i) = 0$).

Then, when you estimate with your observed variables, you have (substituting $X_i$ with $\tilde{X}_i - e_i$):

$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 (\tilde{X}_i - e_i) + u_i \\
&= \beta_0 + \beta_1 \tilde{X}_i + u_i - \beta_1 e_i \\
&= \beta_0 + \beta_1 \tilde{X}_i + v_i
\end{aligned}
$$
In words, the measurement error in $X_i$ is now part of the error term $v_i$ in the regression equation. Hence, we have endogeneity bias.
Endogeneity arises because

$$
E(\tilde{X}_i v_i) = E((X_i + e_i)(u_i - \beta_1 e_i)) = -\beta_1 Var(e_i) \neq 0
$$
Since $\tilde{X}_i$ and $e_i$ are positively correlated, this leads to
- a negative bias in $\hat{\beta}_1$ if the true $\beta_1$ is positive
- a positive bias if $\beta_1$ is negative

In other words, measurement error causes attenuation bias, which in turn pushes the coefficient toward 0.
As $Var(e_i)$ increases, or $Var(e_i)/Var(\tilde{X}_i) \to 1$, $\tilde{X}_i$ becomes pure noise and $\hat{\beta}_1 \to 0$ (a purely random $\tilde{X}_i$ should have no relation to $Y_i$).
Technical note:

The size of the bias in the OLS estimator is

$$
\hat{\beta}_{OLS} = \frac{cov(\tilde{X}, Y)}{var(\tilde{X})} = \frac{cov(X + e, \beta X + u)}{var(X + e)}
$$

then

$$
plim \hat{\beta}_{OLS} = \beta \frac{\sigma^2_X}{\sigma^2_X + \sigma^2_e} = \beta \lambda
$$

where $\lambda$ is the reliability (signal-to-total-variance) ratio, also called the attenuation factor.

Reliability determines the extent to which measurement error attenuates $\hat{\beta}$. The attenuation bias is

$$
\hat{\beta}_{OLS} - \beta = -(1 - \lambda)\beta
$$

Thus, $|\hat{\beta}_{OLS}| < |\beta|$ (unless $\lambda = 1$, in which case there is no measurement error).
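To see the attenuation factor at work, here is a minimal simulation (all numbers are illustrative): with $\sigma^2_X = \sigma^2_e = 1$, the attenuation factor is $\lambda = 0.5$, so OLS on the mis-measured regressor should recover roughly half of the true $\beta$.

```r
# Classical measurement error simulation: true beta = 2, lambda = 0.5
set.seed(1)
n     <- 1e5
beta  <- 2
x     <- rnorm(n)                # true regressor, var = 1
x_obs <- x + rnorm(n)            # observed with classical error, var(e) = 1
y     <- 1 + beta * x + rnorm(n)

lambda   <- 1 / (1 + 1)          # sigma_X^2 / (sigma_X^2 + sigma_e^2)
beta_hat <- unname(coef(lm(y ~ x_obs))["x_obs"])

c(attenuated = lambda * beta, estimated = beta_hat)  # both near 1
```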
Note:

Data transformations worsen (magnify) the measurement error. For

$$
y = \beta x + \gamma x^2 + \epsilon
$$

the attenuation factor for $\hat{\gamma}$ is the square of the attenuation factor for $\hat{\beta}$ (i.e., $\lambda_{\hat{\gamma}} = \lambda^2_{\hat{\beta}}$).

Adding covariates increases attenuation bias.
To fix the classical measurement error problem, we can
- Find estimates of $\sigma^2_X$, $\sigma^2_\epsilon$, or $\lambda$ from validation studies or survey data.
- Use an instrument $Z$ correlated with $X$ but uncorrelated with $\epsilon$ (see Instrumental Variable).
- Abandon your project.
36.1.1.1.2 Left-hand side
When the measurement error is in the outcome variable, econometricians or causal scientists tend not to worry because the coefficient estimates are still unbiased (the zero conditional mean assumption is not violated, so there is no endogeneity). However, statisticians might care because it can inflate our uncertainty about the coefficient estimates (i.e., higher standard errors).
$$
\tilde{Y} = Y + v
$$

then the model you estimate is

$$
\tilde{Y} = \beta X + u + v
$$

Since $v$ is uncorrelated with $X$, $\hat{\beta}$ is consistently estimated by OLS.

If we have measurement error in $Y_i$, it is simply absorbed into the error term.
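A quick simulation (illustrative numbers) confirms this asymmetry: noise in $Y$ leaves $\hat{\beta}$ centered on the truth but inflates its standard error.

```r
# Measurement error in the outcome: beta stays consistent, SE grows
set.seed(2)
n     <- 1e5
x     <- rnorm(n)
y     <- 1 + 2 * x + rnorm(n)       # true outcome
y_obs <- y + rnorm(n, sd = 2)       # outcome measured with error v

fit_true <- summary(lm(y ~ x))$coefficients
fit_obs  <- summary(lm(y_obs ~ x))$coefficients

beta_obs <- fit_obs["x", "Estimate"]     # still close to 2
se_ratio <- fit_obs["x", "Std. Error"] / fit_true["x", "Std. Error"]

c(beta_obs = beta_obs, se_ratio = se_ratio)   # se_ratio > 1
```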
36.1.1.2 Non-classical Measurement Errors
Relaxing the assumption that $X$ and $\epsilon$ (the measurement error) are uncorrelated, recall that

$$
\hat{\beta} = \frac{cov(X + \epsilon, \beta X + u)}{var(X + \epsilon)}
$$

Then, without the above assumption, we have

$$
plim \hat{\beta} = \frac{\beta (\sigma^2_X + \sigma_{X\epsilon})}{\sigma^2_X + \sigma^2_\epsilon + 2\sigma_{X\epsilon}} = \left(1 - \frac{\sigma^2_\epsilon + \sigma_{X\epsilon}}{\sigma^2_X + \sigma^2_\epsilon + 2\sigma_{X\epsilon}}\right)\beta = (1 - b_{\epsilon \tilde{X}})\beta
$$

where $b_{\epsilon \tilde{X}}$ is the regression coefficient of a regression of $\epsilon$ on $\tilde{X}$.

Hence, classical measurement error is just the special case of non-classical measurement error in which $b_{\epsilon \tilde{X}} = 1 - \lambda$.

Starting from $\sigma_{X\epsilon} = 0$ (classical measurement error), increasing the covariance $\sigma_{X\epsilon}$ increases the attenuation factor if more than half of the variance in $\tilde{X}$ is measurement error, and decreases it otherwise. This is also known as mean-reverting measurement error (Bound, Brown, and Mathiowetz 2001).
A general framework for both right-hand side and left-hand side measurement error is (Bound, Brown, and Mathiowetz 2001): consider the true model

$$
Y = X \beta + \epsilon
$$

then

$$
\begin{aligned}
\hat{\beta} &= (\tilde{X}' \tilde{X})^{-1} \tilde{X}' \tilde{Y} \\
&= (\tilde{X}' \tilde{X})^{-1} \tilde{X}' (\tilde{X}\beta - U\beta + v + \epsilon) \\
&= \beta + (\tilde{X}' \tilde{X})^{-1} \tilde{X}' (-U\beta + v + \epsilon) \\
plim \hat{\beta} &= \beta + plim \left( (\tilde{X}' \tilde{X})^{-1} \tilde{X}' (-U\beta + v) \right) \\
&= \beta + plim \left( (\tilde{X}' \tilde{X})^{-1} \tilde{X}' W \right) \begin{bmatrix} -\beta \\ 1 \end{bmatrix}
\end{aligned}
$$

where we collect the measurement errors in a matrix $W = [U | v]$, so that

$$
(-U\beta + v) = W \begin{bmatrix} -\beta \\ 1 \end{bmatrix}
$$

Hence, in general, the biases in the coefficients $\beta$ are regression coefficients from regressing the measurement errors on the mis-measured $\tilde{X}$.
Notes:
Instrumental Variable can help fix this problem
There can also be measurement error in dummy variables and you can still use Instrumental Variable to fix it.
36.1.1.3 Solution to Measurement Errors
36.1.1.3.1 Correlation
$$
\underbrace{P(\rho | \text{data})}_{\text{Posterior}} = \frac{P(\text{data} | \rho) P(\rho)}{P(\text{data})} \propto \text{Likelihood} \times \text{Prior}
$$

where
- $\rho$ is a correlation coefficient
- $P(\text{data} | \rho)$ is the likelihood function evaluated at $\rho$
- $P(\rho)$ is the prior probability
- $P(\text{data})$ is the normalizing constant
With sample correlation coefficient $r$:

$$
r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}
$$

the posterior density approximation of $\rho$ is (Schisterman et al. 2003, 3)

$$
P(\rho | x, y) \propto P(\rho) \frac{(1 - \rho^2)^{(n-1)/2}}{(1 - \rho r)^{n - 3/2}}
$$

where
- $\rho = \tanh \xi$, where $\xi \sim N(z, 1/n)$
- $r = \tanh z$
Then the posterior density follows a normal distribution with

Mean:

$$
\mu_{posterior} = \sigma^2_{posterior} \times (n_{prior} \tanh^{-1} r_{prior} + n_{likelihood} \tanh^{-1} r_{likelihood})
$$

Variance:

$$
\sigma^2_{posterior} = \frac{1}{n_{prior} + n_{likelihood}}
$$

To simplify the integration process, we choose a prior of the form

$$
P(\rho) \propto (1 - \rho^2)^c
$$

where $c$ is the weight the prior will have in estimation (e.g., $c = 0$ if there is no prior information, hence $P(\rho) \propto 1$).
Example:

Current study: $r_{xy} = 0.5$, $n = 200$

Previous study: $r_{xy} = 0.2765$, $n = 50205$

Combining the two, the posterior follows a normal distribution with variance

$$
\sigma^2_{posterior} = \frac{1}{n_{prior} + n_{likelihood}} = \frac{1}{200 + 50205} = 0.0000198393
$$

and mean

$$
\mu_{posterior} = \sigma^2_{posterior} \times (n_{prior} \tanh^{-1} r_{prior} + n_{likelihood} \tanh^{-1} r_{likelihood}) = 0.0000198393 \times (50205 \times \tanh^{-1} 0.2765 + 200 \times \tanh^{-1} 0.5) = 0.2849415
$$

Hence, the posterior of $\xi$ is $N(0.2849415, 0.0000198393)$, which means the posterior mean correlation coefficient is $\tanh(0.2849415) = 0.2774723$. The 95% CI for $\xi$ is

$$
\mu_{posterior} \pm 1.96 \times \sqrt{\sigma^2_{posterior}} = 0.2849415 \pm 1.96 \times (0.0000198393)^{1/2} = (0.2762115, 0.2936714)
$$

Transforming back with $\tanh$, the 95% interval for the posterior $\rho$ is $(0.2693952, 0.2855105)$.
If future authors suspect that they have
- Large sampling variation
- Measurement error in either measures in the correlation, which attenuates the relationship between the two variables
Applying this Bayesian correction can give them a better estimate of the correlation between the two.
To implement this calculation in R, see below
```r
n_new <- 200
r_new <- 0.5
alpha <- 0.05

update_correlation <- function(n_new, r_new, alpha) {
    n_meta <- 50205
    r_meta <- 0.2765
    
    # Variance
    var_xi <- 1 / (n_new + n_meta)
    
    # Mean
    mu_xi <- var_xi * (n_meta * atanh(r_meta) + n_new * atanh(r_new))
    
    # Confidence interval
    upper_xi <- mu_xi + qnorm(1 - alpha / 2) * sqrt(var_xi)
    lower_xi <- mu_xi - qnorm(1 - alpha / 2) * sqrt(var_xi)
    
    # rho
    mean_rho  <- tanh(mu_xi)
    upper_rho <- tanh(upper_xi)
    lower_rho <- tanh(lower_xi)
    
    # Return a list
    return(
        list(
            "mu_xi"     = mu_xi,
            "var_xi"    = var_xi,
            "upper_xi"  = upper_xi,
            "lower_xi"  = lower_xi,
            "mean_rho"  = mean_rho,
            "upper_rho" = upper_rho,
            "lower_rho" = lower_rho
        )
    )
}

# Old confidence interval
r_new + qnorm(1 - alpha / 2) * sqrt(1 / n_new)
#> [1] 0.6385904
r_new - qnorm(1 - alpha / 2) * sqrt(1 / n_new)
#> [1] 0.3614096

testing <- update_correlation(n_new = n_new, r_new = r_new, alpha = alpha)

# Updated rho
testing$mean_rho
#> [1] 0.2774723

# Updated confidence interval
testing$upper_rho
#> [1] 0.2855105
testing$lower_rho
#> [1] 0.2693952
```
36.1.2 Simultaneity
When independent variables ($X$'s) are jointly determined with the dependent variable $Y$, typically through an equilibrium mechanism, the second condition for causality (temporal order) is violated.
Examples: quantity and price by demand and supply, investment and productivity, sales and advertisement
General Simultaneous (Structural) Equations:

$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_i + u_i \\
X_i &= \alpha_0 + \alpha_1 Y_i + v_i
\end{aligned}
$$

Hence, the solutions are

$$
\begin{aligned}
Y_i &= \frac{\beta_0 + \beta_1 \alpha_0}{1 - \alpha_1 \beta_1} + \frac{\beta_1 v_i + u_i}{1 - \alpha_1 \beta_1} \\
X_i &= \frac{\alpha_0 + \alpha_1 \beta_0}{1 - \alpha_1 \beta_1} + \frac{v_i + \alpha_1 u_i}{1 - \alpha_1 \beta_1}
\end{aligned}
$$

If we run only one regression, we will have biased estimators (because of simultaneity bias):

$$
Cov(X_i, u_i) = Cov\left(\frac{v_i + \alpha_1 u_i}{1 - \alpha_1 \beta_1}, u_i\right) = \frac{\alpha_1}{1 - \alpha_1 \beta_1} Var(u_i)
$$
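This bias is easy to verify by simulation; with $\alpha_1 = \beta_1 = 0.5$ and unit-variance errors (all values illustrative), the covariance above implies that OLS converges to 0.8 rather than the true $\beta_1 = 0.5$.

```r
# Simultaneity bias: Y = 0.5 X + u, X = 0.5 Y + v, so true beta_1 = 0.5
set.seed(3)
n  <- 1e5
u  <- rnorm(n)
v  <- rnorm(n)
b1 <- 0.5
a1 <- 0.5
denom <- 1 - a1 * b1
Y <- (b1 * v + u) / denom    # reduced-form solution for Y
X <- (v + a1 * u) / denom    # reduced-form solution for X

beta_hat <- unname(coef(lm(Y ~ X))["X"])
beta_hat    # about 0.8, not 0.5
```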
In an even more general model

$$
\begin{cases}
Y_i = \beta_0 + \beta_1 X_i + \beta_2 T_i + u_i \\
X_i = \alpha_0 + \alpha_1 Y_i + \alpha_2 Z_i + v_i
\end{cases}
$$

where
- $X_i, Y_i$ are endogenous variables determined within the system
- $T_i, Z_i$ are exogenous variables

the reduced form of the model is

$$
\begin{cases}
Y_i = \frac{\beta_0 + \beta_1 \alpha_0}{1 - \alpha_1 \beta_1} + \frac{\beta_1 \alpha_2}{1 - \alpha_1 \beta_1} Z_i + \frac{\beta_2}{1 - \alpha_1 \beta_1} T_i + \tilde{u}_i = B_0 + B_1 Z_i + B_2 T_i + \tilde{u}_i \\
X_i = \frac{\alpha_0 + \alpha_1 \beta_0}{1 - \alpha_1 \beta_1} + \frac{\alpha_2}{1 - \alpha_1 \beta_1} Z_i + \frac{\alpha_1 \beta_2}{1 - \alpha_1 \beta_1} T_i + \tilde{v}_i = A_0 + A_1 Z_i + A_2 T_i + \tilde{v}_i
\end{cases}
$$

Now we can get consistent estimates of the reduced-form parameters, and, to recover the original structural parameters,

$$
\begin{aligned}
\frac{B_1}{A_1} &= \beta_1 & B_2 \left(1 - \frac{B_1 A_2}{A_1 B_2}\right) &= \beta_2 \\
\frac{A_2}{B_2} &= \alpha_1 & A_1 \left(1 - \frac{B_1 A_2}{A_1 B_2}\right) &= \alpha_2
\end{aligned}
$$
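A sketch of this indirect least squares recovery in R (structural parameters chosen arbitrarily for the simulation): estimate the two reduced forms by OLS, then apply the ratio formulas above.

```r
# Indirect least squares: recover beta_1 = B1/A1 and alpha_1 = A2/B2
set.seed(4)
n  <- 1e5
Z  <- rnorm(n); T_ <- rnorm(n)       # exogenous variables
u  <- rnorm(n); v  <- rnorm(n)
b0 <- 1; b1 <- 0.5; b2 <- 1.5        # Y equation parameters
a0 <- 2; a1 <- 0.4; a2 <- 2.0        # X equation parameters
d  <- 1 - a1 * b1

# Reduced-form solutions of the simultaneous system
Y <- (b0 + b1 * a0 + b1 * a2 * Z + b2 * T_ + b1 * v + u) / d
X <- (a0 + a1 * b0 + a2 * Z + a1 * b2 * T_ + v + a1 * u) / d

B <- coef(lm(Y ~ Z + T_))            # B0, B1, B2
A <- coef(lm(X ~ Z + T_))            # A0, A1, A2

b1_hat <- unname(B["Z"]  / A["Z"])   # recovers beta_1 (0.5)
a1_hat <- unname(A["T_"] / B["T_"])  # recovers alpha_1 (0.4)
c(b1_hat = b1_hat, a1_hat = a1_hat)
```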
Rules for Identification

Order Condition (necessary but not sufficient):

$$
K - k \ge m - 1
$$

where
- $M$ = number of endogenous variables in the model
- $K$ = number of exogenous variables in the model
- $m$ = number of endogenous variables in a given equation
- $k$ = number of exogenous variables in a given equation

This is actually the general framework for instrumental variables.
36.1.3 Endogenous Treatment Solutions
Using the OLS estimates as a reference point
```r
library(AER)
library(REndo)
set.seed(421)
data("CASchools")
school <- CASchools
school$stratio <- with(CASchools, students / teachers)

m1.ols <-
    lm(read ~ stratio + english + lunch
       + grades + income + calworks + county,
       data = school)
summary(m1.ols)$coefficients[1:7, ]
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 683.45305948 9.56214469 71.4748711 3.011667e-218
#> stratio -0.30035544 0.25797023 -1.1643027 2.450536e-01
#> english -0.20550107 0.03765408 -5.4576041 8.871666e-08
#> lunch -0.38684059 0.03700982 -10.4523759 1.427370e-22
#> gradesKK-08 -1.91291321 1.35865394 -1.4079474 1.599886e-01
#> income 0.71615378 0.09832843 7.2832829 1.986712e-12
#> calworks -0.05273312 0.06154758 -0.8567863 3.921191e-01
```
36.1.3.1 Instrumental Variable
[A3a] requires $\epsilon_i$ to be uncorrelated with $x_i$:

$$
plim(\hat{\beta}_{OLS}) = \beta + [E(x_i' x_i)]^{-1} E(x_i' \epsilon_i)
$$

[A3a] is the weakest assumption needed for OLS to be consistent.
A3 fails when $x_{ik}$ is correlated with $\epsilon_i$:
- Omitted Variables Bias: $\epsilon_i$ includes other factors that may influence the dependent variable (linearly)
- Simultaneity: e.g., demand and prices are simultaneously determined
- Endogenous Sample Selection: we do not have an iid sample
- Measurement Error
Note
- Omitted Variable: a variable omitted from the model (and thus part of $\epsilon_i$) that is unobserved and has predictive power for the outcome.
- Omitted Variable Bias: the bias (and inconsistency, when looking at large-sample properties) of the OLS estimator when an omitted variable is correlated with an included regressor.
- We can have both positive and negative selection bias (it depends on what our story is).
The structural equation is used to emphasize that we are interested in understanding a causal relationship:

$$
y_{i1} = \beta_0 + z_{i1} \beta_1 + y_{i2} \beta_2 + \epsilon_i
$$

where
- $y_{i1}$ is the outcome variable (inherently correlated with $\epsilon_i$)
- $y_{i2}$ is the endogenous covariate (presumed to be correlated with $\epsilon_i$)
- $\beta_2$ represents the causal effect of $y_{i2}$ on $y_{i1}$
- $z_{i1}$ is a vector of exogenous controls (uncorrelated with $\epsilon_i$, i.e., $E(z_{i1}' \epsilon_i) = 0$)
OLS is an inconsistent estimator of the causal effect $\beta_2$.

If there were no endogeneity
- $E(y_{i2}' \epsilon_i) = 0$
- the exogenous variation in $y_{i2}$ is what identifies the causal effect

If there is endogeneity
- any wiggle in $y_{i2}$ shifts simultaneously with $\epsilon_i$

$$
plim(\hat{\beta}_{OLS}) = \beta + [E(x_i' x_i)]^{-1} E(x_i' \epsilon_i)
$$

where
- $\beta$ is the causal effect
- $[E(x_i' x_i)]^{-1} E(x_i' \epsilon_i)$ is the endogenous effect

Hence, $\hat{\beta}_{OLS}$ can be either larger or smaller than the true causal effect.
Motivation for Two-Stage Least Squares (2SLS):

$$
y_{i1} = \beta_0 + z_{i1} \beta_1 + y_{i2} \beta_2 + \epsilon_i
$$

We want to understand how movement in $y_{i2}$ affects movement in $y_{i1}$, but whenever we move $y_{i2}$, $\epsilon_i$ also moves.

Solution: we need a way to move $y_{i2}$ independently of $\epsilon_i$; then we can analyze the response in $y_{i1}$ as a causal effect.

Find an instrumental variable $z_{i2}$ such that
- Instrument Relevance: when $z_{i2}$ moves, $y_{i2}$ also moves
- Instrument Exogeneity: when $z_{i2}$ moves, $\epsilon_i$ does not move

$z_{i2}$ is the exogenous variation that identifies the causal effect $\beta_2$.
Finding an instrumental variable:
- Random assignment: effect of class size on educational outcomes; the instrument is initial random assignment
- Relation's choice: effect of education on fertility; the instrument is the parent's educational level
- Eligibility: trade-off between IRA and 401(k) retirement savings; the instrument is 401(k) eligibility
Example: Return to College
- education is correlated with ability, hence endogenous
- proximity to a 4-year college (near) as an instrument
  - Instrument Relevance: when near moves, education also moves
  - Instrument Exogeneity: when near moves, $\epsilon_i$ does not move
- Other potential instruments: proximity to a 2-year college, parent's education, owning a library card
$$
y_{i1} = \beta_0 + z_{i1} \beta_1 + y_{i2} \beta_2 + \epsilon_i
$$

First Stage (Reduced Form) Equation:

$$
y_{i2} = \pi_0 + z_{i1} \pi_1 + z_{i2} \pi_2 + v_i
$$

where $\pi_0 + z_{i1} \pi_1 + z_{i2} \pi_2$ is the exogenous variation and $v_i$ is the endogenous variation.

This is called a reduced-form equation:
- We are not interested in the causal interpretation of $\pi_1$ or $\pi_2$
- It is a linear projection of $z_{i1}$ and $z_{i2}$ on $y_{i2}$ (simple correlations)
- The projection coefficients $\pi_1$ and $\pi_2$ guarantee that $E(z_{i1}' v_i) = 0$ and $E(z_{i2}' v_i) = 0$

Instrumental variable $z_{i2}$:
- Instrument Relevance: $\pi_2 \neq 0$
- Instrument Exogeneity: $E(z_{i2} \epsilon_i) = 0$

Moving only the exogenous part of $y_{i2}$ means moving

$$
\tilde{y}_{i2} = \pi_0 + z_{i1} \pi_1 + z_{i2} \pi_2
$$
Two-Stage Least Squares (2SLS):

$$
\begin{aligned}
y_{i1} &= \beta_0 + z_{i1} \beta_1 + y_{i2} \beta_2 + \epsilon_i \\
y_{i2} &= \pi_0 + z_{i1} \pi_1 + z_{i2} \pi_2 + v_i
\end{aligned}
$$

Equivalently,

$$
y_{i1} = \beta_0 + z_{i1} \beta_1 + \tilde{y}_{i2} \beta_2 + u_i
$$

where
- $\tilde{y}_{i2} = \pi_0 + z_{i1} \pi_1 + z_{i2} \pi_2$
- $u_i = v_i \beta_2 + \epsilon_i$

A2 holds if the instrument is relevant ($\pi_2 \neq 0$):

$$
y_{i1} = \beta_0 + z_{i1} \beta_1 + (\pi_0 + z_{i1} \pi_1 + z_{i2} \pi_2) \beta_2 + u_i
$$

[A3a] holds if the instrument is exogenous ($E(z_{i2} \epsilon_i) = 0$):

$$
\begin{aligned}
E(\tilde{y}_{i2}' u_i) &= E((\pi_0 + z_{i1} \pi_1 + z_{i2} \pi_2)(v_i \beta_2 + \epsilon_i)) \\
&= E((\pi_0 + z_{i1} \pi_1 + z_{i2} \pi_2) \epsilon_i) \\
&= E(\epsilon_i) \pi_0 + E(\epsilon_i z_{i1}) \pi_1 + E(\epsilon_i z_{i2}) \pi_2 = 0
\end{aligned}
$$

Hence, the estimator based on (36.1) is consistent.
The 2SLS Estimator

1. Estimate the first stage using OLS:

$$
y_{i2} = \pi_0 + z_{i1} \pi_1 + z_{i2} \pi_2 + v_i
$$

and obtain the fitted values $\hat{y}_{i2}$.

2. Estimate the altered equation using OLS:

$$
y_{i1} = \beta_0 + z_{i1} \beta_1 + \hat{y}_{i2} \beta_2 + \epsilon_i
$$
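A sketch of these two steps by hand on the CASchools example used throughout this section, with expenditure as the instrument (data reloaded so the snippet is self-contained); the point estimate matches `AER::ivreg` exactly, though the manual second-stage standard errors are not valid.

```r
library(AER)                      # provides ivreg() and the CASchools data
data("CASchools")
school <- CASchools
school$stratio <- with(CASchools, students / teachers)

# Stage 1: regress the endogenous stratio on the instrument (expenditure)
# plus all exogenous controls, and keep the fitted values
stage1 <- lm(stratio ~ expenditure + english + lunch + grades +
                 income + calworks + county, data = school)
school$stratio_hat <- fitted(stage1)

# Stage 2: replace stratio with its fitted values
stage2 <- lm(read ~ stratio_hat + english + lunch + grades +
                 income + calworks + county, data = school)

# Same point estimate as ivreg (but manual stage-2 SEs are not correct)
m.iv <- ivreg(read ~ stratio + english + lunch + grades + income +
                  calworks + county | expenditure + english + lunch +
                  grades + income + calworks + county, data = school)
all.equal(unname(coef(stage2)["stratio_hat"]),
          unname(coef(m.iv)["stratio"]))
```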
Properties of the 2SLS Estimator
- Under A1, A2, [A3a] (for $z_{i1}$), A5, and if the instrument satisfies the following two conditions:
  - Instrument Relevance: $\pi_2 \neq 0$
  - Instrument Exogeneity: $E(z_{i2}' \epsilon_i) = 0$
  then the 2SLS estimator is consistent.
- It can handle more than one endogenous variable and more than one instrumental variable:

$$
\begin{aligned}
y_{i1} &= \beta_0 + z_{i1} \beta_1 + y_{i2} \beta_2 + y_{i3} \beta_3 + \epsilon_i \\
y_{i2} &= \pi_0 + z_{i1} \pi_1 + z_{i2} \pi_2 + z_{i3} \pi_3 + z_{i4} \pi_4 + v_{i2} \\
y_{i3} &= \gamma_0 + z_{i1} \gamma_1 + z_{i2} \gamma_2 + z_{i3} \gamma_3 + z_{i4} \gamma_4 + v_{i3}
\end{aligned}
$$

  - **IV estimator**: one endogenous variable with a single instrument
  - **2SLS estimator**: one endogenous variable with multiple instruments
  - **GMM estimator**: multiple endogenous variables with multiple instruments
Standard errors produced in the second step are not correct:
- Because we do not know $\tilde{y}$ perfectly and need to estimate it in the first step, we are introducing additional variation.
- We did not have this problem with FGLS because "the first stage was orthogonal to the second stage." This is generally not true for most multi-step procedures.
- If A4 does not hold, we need to report robust standard errors.

2SLS is less efficient than OLS and will always have larger standard errors:
- First, $Var(u_i) = Var(v_i \beta_2 + \epsilon_i) > Var(\epsilon_i)$.
- Second, $\hat{y}_{i2}$ is generally highly collinear with $z_{i1}$.
The number of instruments must be at least as large as the number of endogenous variables.
Note
- 2SLS can be combined with FGLS to make the estimator more efficient: you keep the same first stage, and in the second stage, instead of OLS, you use FGLS with the weight matrix $\hat{w}$.
- Generalized Method of Moments can be more efficient than 2SLS.
- In the second stage of 2SLS, you can also use MLE, but then you are making assumptions about the distribution of the outcome variable, the endogenous variable, and their relationship (joint distribution).
36.1.3.1.1 Testing Assumptions
- Endogeneity Test: is $y_{i2}$ truly endogenous (i.e., can we just use OLS instead of 2SLS)?
- Exogeneity: cannot always be tested, and when it can, the test might not be informative
- Relevancy: need to avoid "weak instruments"
36.1.3.1.1.1 Endogeneity Test
2SLS is generally so inefficient that we may prefer OLS if there is not much endogeneity: biased but efficient (OLS) vs. consistent but inefficient (2SLS).

We want a sense of "how endogenous" $y_{i2}$ is:
- if "very" endogenous, use 2SLS
- if not "very" endogenous, perhaps prefer OLS
Invalid test of endogeneity: $y_{i2}$ is endogenous if it is correlated with $\epsilon_i$,

$$
\epsilon_i = \gamma_0 + y_{i2} \gamma_1 + error_i
$$

where $\gamma_1 \neq 0$ would imply endogeneity. Since $\epsilon_i$ is not observed, using the residuals instead,

$$
e_i = \gamma_0 + y_{i2} \gamma_1 + error_i
$$

is NOT a valid test of endogeneity:
- The OLS residual $e$ is mechanically uncorrelated with $y_{i2}$ (by the first-order conditions for OLS).
- In every situation, $\hat{\gamma}_1$ will be essentially 0, and you will never be able to reject the null of no endogeneity.
Valid test of endogeneity: if $y_{i2}$ is not endogenous, then $\epsilon_i$ and $v_i$ are uncorrelated:

$$
\begin{aligned}
y_{i1} &= \beta_0 + z_{i1} \beta_1 + y_{i2} \beta_2 + \epsilon_i \\
y_{i2} &= \pi_0 + z_{i1} \pi_1 + z_{i2} \pi_2 + v_i
\end{aligned}
$$

Variable Addition Test: include the first-stage residuals as an additional variable,

$$
y_{i1} = \beta_0 + z_{i1} \beta_1 + y_{i2} \beta_2 + \hat{v}_i \theta + error_i
$$

Then the usual t-test of significance is a valid test of the following hypotheses (note that this test requires the instrument to be valid):

$$
H_0: \theta = 0 \text{ (not endogenous)}; \quad H_1: \theta \neq 0 \text{ (endogenous)}
$$
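A sketch of the variable addition test on the running CASchools example, again using expenditure as the instrument for stratio (data reloaded so the snippet is self-contained); the t-test on the first-stage residuals is the endogeneity test.

```r
library(AER)
data("CASchools")
school <- CASchools
school$stratio <- with(CASchools, students / teachers)

# First stage: obtain the residuals v_hat
stage1 <- lm(stratio ~ expenditure + english + lunch + grades +
                 income + calworks + county, data = school)
school$v_hat <- resid(stage1)

# Structural equation augmented with v_hat;
# H0: theta = 0 (stratio is not endogenous)
m.va <- lm(read ~ stratio + english + lunch + grades + income +
               calworks + county + v_hat, data = school)
summary(m.va)$coefficients["v_hat", ]
```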
36.1.3.1.1.2 Exogeneity
Why does exogeneity matter?

$$
E(z_{i2}' \epsilon_i) = 0
$$

- If [A3a] fails, then 2SLS is also inconsistent.
- If the instrument is not exogenous, we need to find a new one.
- As with the endogeneity test, when there is a single instrument,

$$
e_i = \gamma_0 + z_{i2} \gamma_1 + error_i; \quad H_0: \gamma_1 = 0
$$

is NOT a valid test of instrument exogeneity: the OLS residual $e$ is mechanically uncorrelated with $z_{i2}$, so $\hat{\gamma}_1$ will be essentially 0, and you will never be able to determine whether the instrument is endogenous.
Solution: Testing Instrument Exogeneity in an Over-identified Model

When there is more than one exogenous instrument (per endogenous variable), we can test for instrument exogeneity. With multiple instruments, the model is said to be over-identified: we could estimate the same model several ways (i.e., identify/estimate $\beta_1$ in more than one way).

Idea behind the test: if the controls and instruments are truly exogenous, then OLS estimation of the regression

$$
\epsilon_i = \gamma_0 + z_{i1} \gamma_1 + z_{i2} \gamma_2 + error_i
$$

should have a very low $R^2$:
- if the model is just identified (one instrument per endogenous variable), then $R^2 = 0$
Steps:
1. Estimate the structural equation by 2SLS (using all available instruments) and obtain the residuals $e$.
2. Regress $e$ on all controls and instruments and obtain the $R^2$.
3. Under the null hypothesis (all instruments are uncorrelated with $\epsilon_i$), $nR^2 \sim \chi^2(q)$, where $q$ is the number of instrumental variables minus the number of endogenous variables.
   - If the model is just identified (one instrument per endogenous variable), then $q = 0$, and the distribution under the null collapses.

A low p-value means you reject the null of exogenous instruments; hence, you would like a high p-value in this test.
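A sketch of these steps on the CASchools data, treating expenditure and computer as two (purely illustrative) instruments for stratio so that the model is over-identified ($q = 2 - 1 = 1$); AER's `summary(..., diagnostics = TRUE)` reports the closely related Sargan statistic.

```r
library(AER)
data("CASchools")
school <- CASchools
school$stratio <- with(CASchools, students / teachers)

# 2SLS with two illustrative instruments (expenditure, computer) for stratio
m.over <- ivreg(read ~ stratio + english + lunch + grades + income +
                    calworks | expenditure + computer + english + lunch +
                    grades + income + calworks, data = school)
e <- resid(m.over)

# Regress residuals on all controls and instruments; form n * R^2
aux <- lm(e ~ expenditure + computer + english + lunch + grades +
              income + calworks, data = school)
nR2 <- length(e) * summary(aux)$r.squared
q   <- 2 - 1                    # instruments minus endogenous variables
pchisq(nR2, df = q, lower.tail = FALSE)   # p-value of the overid test
```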
Pitfalls of the overid test:
- The test is essentially compiling the following information:
  - Conditional on the first instrument being exogenous, is the other instrument exogenous?
  - Conditional on the other instrument being exogenous, is the first instrument exogenous?
- If all instruments are endogenous, then neither comparison is valid.
- It is really only useful if one instrument is thought to be truly exogenous (randomly assigned). Even if you do reject the null, the test does not tell you which instrument is exogenous and which is endogenous.

| Result | Implication |
|---|---|
| Reject the null | You can be pretty sure there is an endogenous instrument, but you don't know which one. |
| Fail to reject | Could be either (1) both are exogenous, or (2) both are endogenous. |
36.1.3.1.1.3 Relevancy
Why does relevance matter?

$$
\pi_2 \neq 0
$$

is used to show that A2 holds:
- If $\pi_2 = 0$ (the instrument is not relevant), then A2 fails: perfect multicollinearity.
- If $\pi_2$ is close to 0 (a weak instrument), then there is near-perfect multicollinearity, and 2SLS is highly inefficient (large standard errors).
- A weak instrument will exacerbate any inconsistency due to an instrument being (even slightly) endogenous. In the simple case with no controls and a single endogenous variable and single instrumental variable,

$$
plim(\hat{\beta}_{2, 2SLS}) = \beta_2 + \frac{E(z_{i2} \epsilon_i)}{E(z_{i2} y_{i2})}
$$
Testing for weak instruments: we can use a t-test (or an F-test for over-identified models) in the first stage to determine whether there is a weak instrument problem.

J. Stock and Yogo (2005): a statistical rejection of the null hypothesis in the first stage at the 5% (or even 1%) level is not enough to ensure the instrument is not weak.
- Rule of thumb: you need an F-statistic of at least 10 (or a t-statistic of at least 3.2) to reject the null hypothesis that the instrument is weak.
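On the running example, the first-stage F-statistic for the excluded instrument can be computed by comparing first-stage regressions with and without expenditure (data reloaded so the snippet is self-contained):

```r
library(AER)
data("CASchools")
school <- CASchools
school$stratio <- with(CASchools, students / teachers)

# First stage with and without the excluded instrument (expenditure)
full    <- lm(stratio ~ expenditure + english + lunch + grades +
                  income + calworks + county, data = school)
reduced <- lm(stratio ~ english + lunch + grades + income +
                  calworks + county, data = school)

# F-test on the excluded instrument; rule of thumb requires F > 10
f_stat <- anova(reduced, full)$F[2]
f_stat
```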
Summary of the 2SLS Estimator

$$
\begin{aligned}
y_{i1} &= \beta_0 + z_{i1} \beta_1 + y_{i2} \beta_2 + \epsilon_i \\
y_{i2} &= \pi_0 + z_{i1} \pi_1 + z_{i2} \pi_2 + v_i
\end{aligned}
$$

- When [A3a] does not hold, i.e.,

$$
E(y_{i2}' \epsilon_i) \neq 0
$$

  the OLS estimator is no longer unbiased or consistent. If we have valid instruments $z_{i2}$
  - Exogeneity: $E(z_{i2}' \epsilon_i) = 0$
  - Relevancy (need to avoid "weak instruments"): $\pi_2 \neq 0$
  then the 2SLS estimator is consistent under A1, A2, [A5a], and the above two conditions.
- When [A3a] does hold, i.e.,

$$
E(y_{i2}' \epsilon_i) = 0
$$

  and we have valid instruments, then both the OLS and 2SLS estimators are consistent.
  - The OLS estimator is always more efficient.
  - We can use the variable addition test to determine whether 2SLS is needed (A3a fails) or OLS is valid (A3a holds).

Sometimes we can test the assumptions for an instrument to be valid:
- Exogeneity: only testable when there are more instruments than endogenous variables.
- Relevancy (need to avoid "weak instruments"): always testable; the F-statistic needs to be greater than 10 to rule out a weak instrument.
Application: expenditure as an observed instrument.

```r
m2.2sls <-
    ivreg(
        read ~ stratio + english + lunch
        + grades + income + calworks + county |
            expenditure + english + lunch
        + grades + income + calworks + county,
        data = school
    )
summary(m2.2sls)$coefficients[1:7, ]
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 700.47891593 13.58064436 51.5792106 8.950497e-171
#> stratio -1.13674002 0.53533638 -2.1234126 3.438427e-02
#> english -0.21396934 0.03847833 -5.5607753 5.162571e-08
#> lunch -0.39384225 0.03773637 -10.4366757 1.621794e-22
#> gradesKK-08 -1.89227865 1.37791820 -1.3732881 1.704966e-01
#> income 0.62487986 0.11199008 5.5797785 4.668490e-08
#> calworks -0.04950501 0.06244410 -0.7927892 4.284101e-01
```
36.1.3.1.2 Checklist
- Regress the dependent variable on the instrument (reduced form). Since OLS gives an unbiased estimate here, the coefficient estimate should be significant (make sure the sign makes sense).
- Report the F-statistic on the excluded instruments. F-stat < 10 means you have a weak instrument (J. H. Stock, Wright, and Yogo 2002).
- Present $R^2$ before and after including the instrument (Rossi 2014).
- For models with multiple instruments, present first- and second-stage results for each instrument separately. An overid test should be conducted (e.g., Sargan-Hansen J).
- Hausman test between OLS and 2SLS (don't mistake this test for evidence that endogeneity is irrelevant; under an invalid IV, the test is useless).
- Compare the 2SLS estimates with limited-information ML. If they differ, you have evidence of weak instruments.
36.1.3.2 Good Instruments
Exogeneity and relevancy are necessary but not sufficient for IV to produce consistent estimates. Without theory or a plausible explanation, you can always create a new variable that is correlated with $X$ and uncorrelated with $\epsilon$.

For example, suppose we want to estimate the effect of price on quantity (Reiss 2011, 960):

$$
\begin{aligned}
Q &= \beta_1 P + \beta_2 X + \epsilon \\
P &= \pi_1 X + \eta
\end{aligned}
$$

where $\epsilon$ and $\eta$ are jointly determined and $X \perp \epsilon, \eta$.

Without theory, we can just create a new variable $Z = X + u$, where $E(u) = 0$ and $u \perp X, \epsilon, \eta$. Then $Z$ satisfies both conditions:
- Relevancy: $X$ correlates with $P$, so $Z$ correlates with $P$
- Exogeneity: $u \perp \epsilon$ (random noise)

But obviously $Z$ is not a valid instrument (intuitively). Theoretically, relevance and exogeneity are not sufficient to identify $\beta$ because the rank condition for identification is not satisfied.

Moreover, the functional form of the instrument also plays a role when choosing a good instrument. Hence, we always need to check the robustness of our instrument.

IV methods, even with valid instruments, can still have poor sampling properties (finite-sample bias, large sampling errors) (Rossi 2014).

When you have a weak instrument, it's important to report it appropriately. This problem will be exacerbated if you have multiple instruments (Larcker and Rusticus 2010).
36.1.3.2.1 Lagged dependent variable
In time-series data sets, we can use the lagged dependent variable as an instrument because it is not influenced by current shocks. For example, Chetty, Friedman, and Rockoff (2014) used the lagged dependent variable in economics.
36.1.3.2.2 Lagged explanatory variable
A common practice in applied economics is to replace a suspected simultaneously determined explanatory variable with its lagged value (Bellemare, Masaki, and Pepinsky 2017). However:
- This practice does not avoid simultaneity bias.
- Estimates using this method are still inconsistent.
- Hypothesis testing becomes invalid under this approach.
- Lagging variables changes how endogeneity bias operates, adding a "no dynamics among unobservables" assumption to the "selection on observables" assumption.
Key conditions for appropriate use (Bellemare, Masaki, and Pepinsky 2017):
- Under unobserved confounding:
  - No dynamics among unobservables.
  - The lagged variable $X$ is a stationary autoregressive process.
- Under no unobserved confounding:
  - No reverse causality; the causal effect operates with a one-period lag ($X_{t-1} \to Y$, $X_t \not\to Y$).
  - Reverse causality is contemporaneous, with a one-period lag effect.
  - Reverse causality is contemporaneous; no dynamics in $Y$, but dynamics exist in $X$ ($X_{t-1} \to X$).
Alternative approach: use lagged values of the endogenous variable in IV estimation. However, this is only effective if (Reed 2015):
- the lagged values do not belong in the estimating equation, and
- the lagged values are sufficiently correlated with the simultaneously determined explanatory variable.
Lagged IVs help mitigate endogeneity if they only violate the independence assumption. However, if lagged IVs violate both the independence assumption and exclusion restriction, they may aggravate endogeneity (Yu Wang and Bellemare 2019).
36.1.3.3 Internal instrumental variable
(also known as instrument-free methods). This section is based on Raluca Gui's guide.

These are alternatives to external instrumental variable approaches. All approaches here assume a continuous dependent variable.
36.1.3.3.1 Non-hierarchical Data (Cross-classified)
$$
Y_t = \beta_0 + \beta_1 P_t + \beta_2 X_t + \epsilon_t
$$

where
- $t = 1, \dots, T$ (indexes either time or cross-sectional units)
- $Y_t$ is a $k \times 1$ response variable
- $X_t$ is a $k \times n$ exogenous regressor
- $P_t$ is a $k \times 1$ continuous endogenous regressor
- $\epsilon_t$ is a structural error term with $\mu_\epsilon = 0$ and $E(\epsilon^2) = \sigma^2$
- $\beta$ are model parameters

The endogeneity problem arises from the correlation of $P_t$ and $\epsilon_t$:

$$
P_t = \gamma Z_t + v_t
$$

where
- $Z_t$ is an $l \times 1$ vector of internal instrumental variables
- $v_t$ is a random error with $\mu_{v_t} = 0$, $E(v^2) = \sigma^2_v$, and $E(\epsilon v) = \sigma_{\epsilon v}$
- $Z_t$ is assumed to be stochastic with distribution $G$
- $v_t$ is assumed to have density $h(\cdot)$
36.1.3.3.1.1 Latent Instrumental Variable
Assume $Z_t$ (unobserved) to be uncorrelated with $\epsilon_t$, similar to an instrumental variable. Hence, $Z_t$ and $v_t$ can't be identified without distributional assumptions.

The distributions of $Z_t$ and $v_t$ need to be specified such that:
- the endogeneity of $P_t$ is corrected
- the distribution of $P_t$ is empirically close to the integral that expresses the amount of overlap of $Z$ as it is shifted over $v$ (i.e., the convolution between $Z_t$ and $v_t$)

When the density $h(\cdot)$ is normal, then $G$ cannot be normal because the parameters would not be identified (Ebbes et al. 2005).

Hence,
- in the LIV model, the distribution of $Z_t$ is discrete
- in the Higher Moments Method and Joint Estimation Using Copula methods, the distribution of $Z_t$ is taken to be skewed

$Z_t$ is assumed unobserved, discrete, and exogenous, with
- an unknown number of groups $m$
- $\gamma$ a vector of group means

Identification of the parameters relies on the distributional assumptions of
- $P_t$: a non-Gaussian distribution
- $Z_t$: discrete with $m \ge 2$

Note:
- If $Z_t$ is continuous, the model is unidentified.
- If $P_t \sim N$, you have inefficient estimates.
```r
m3.liv <- latentIV(read ~ stratio, data = school)
summary(m3.liv)$coefficients[1:7, ]
#> Estimate Std. Error z-score Pr(>|z|)
#> (Intercept) 6.996014e+02 2.686165e+02 2.604462e+00 9.529035e-03
#> stratio -2.272673e+00 1.367747e+01 -1.661618e-01 8.681097e-01
#> pi1 -4.896363e+01 NaN NaN NaN
#> pi2 1.963920e+01 9.225351e-02 2.128830e+02 0.000000e+00
#> theta5 6.939432e-152 3.143456e-161 2.207581e+09 0.000000e+00
#> theta6 3.787512e+02 4.249436e+01 8.912976e+00 1.541010e-17
#> theta7 -1.227543e+00 4.885237e+01 -2.512761e-02 9.799651e-01
```
Since this specification includes only the single endogenous regressor (no controls), it returns a coefficient very different from the other methods.
36.1.3.3.1.2 Joint Estimation Using Copula
Assume $Z_t$ (unobserved) to be uncorrelated with $\epsilon_t$, similar to an instrumental variable; hence, $Z_t$ and $v_t$ can't be identified without distributional assumptions.

S. Park and Gupta (2012) allow joint estimation of the continuous $P_t$ and $\epsilon_t$ using Gaussian copulas, where a copula is a function that maps several conditional distribution functions (CDFs) into their joint CDF.

The underlying idea is that, using information contained in the observed data, one selects marginal distributions for $P_t$ and $\epsilon_t$. Then, the copula model constructs a flexible multivariate joint distribution that allows a wide range of correlations between the two marginals.

The method allows both continuous and discrete $P_t$:
- In the special case of one continuous $P_t$, estimation is based on MLE.
- Otherwise, augmented OLS estimation based on Gaussian copulas is used.
Assumptions:
- a skewed $P_t$ (required for recovery of the correct parameter estimates)
- $\epsilon_t$ has a normal marginal distribution; the marginal distribution of $P_t$ is obtained using the Epanechnikov kernel density estimator

$$
\hat{h}_p = \frac{1}{T \cdot b} \sum_{t=1}^{T} K\left(\frac{p - P_t}{b}\right)
$$

where $P_t$ is the endogenous variable,

$$
K(x) = 0.75 (1 - x^2) I(|x| \le 1)
$$

and

$$
b = 0.9 T^{-1/5} \times \min(s, IQR/1.34)
$$

with
- $IQR$ = interquartile range
- $s$ = sample standard deviation
- $T$ = number of time periods observed in the data
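These quantities are straightforward to compute directly; a sketch in R, using stratio from the running CASchools example as the endogenous $P_t$:

```r
# Epanechnikov kernel density estimate with the bandwidth from the text
data("CASchools", package = "AER")
P <- CASchools$students / CASchools$teachers   # endogenous regressor P_t

K  <- function(x) 0.75 * (1 - x^2) * (abs(x) <= 1)  # Epanechnikov kernel
T_ <- length(P)
b  <- 0.9 * T_^(-1/5) * min(sd(P), IQR(P) / 1.34)   # bandwidth

h_hat <- function(p) sum(K((p - P) / b)) / (T_ * b) # density at point p
h_hat(median(P))
```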
In augmented OLS and MLE, the inference procedure occurs in two stages:
1. The empirical distribution of $P_t$ is computed.
2. This empirical distribution is used in constructing the likelihood function.

Hence, the standard errors would not be correct, so we use the sampling distributions (from bootstrapping) to get standard errors and the variance-covariance matrix. Since the distribution of the bootstrapped parameters is highly skewed, reporting percentile confidence intervals is preferable.
```r
set.seed(110)
m4.cc <-
    copulaCorrection(
        read ~ stratio + english + lunch + calworks +
            grades + income + county |
            continuous(stratio),
        data = school,
        optimx.args = list(method = c("Nelder-Mead"),
                           itnmax = 60000),
        num.boots = 2,
        verbose = FALSE
    )
summary(m4.cc)$coefficients[1:7, ]
#> Point Estimate Boots SE Lower Boots CI (95%) Upper Boots CI (95%)
#> (Intercept) 682.25380724 2.80554213 NA NA
#> stratio -0.35704030 0.02075999 NA NA
#> english -0.21753937 0.01450666 NA NA
#> lunch -0.35642639 0.01902052 NA NA
#> calworks -0.06930202 0.02076781 NA NA
#> gradesKK-08 -2.02155911 0.25684614 NA NA
#> income 0.80137171 0.04725700 NA NA
```
We run this model with only one endogenous continuous regressor (stratio). Sometimes the code will not converge, in which case you can use a different
- optimization algorithm
- set of starting values
- maximum number of iterations
36.1.3.3.1.3 Higher Moments Method
suggested by (Lewbel 1997) to identify \epsilon_t caused by measurement error.
Identification is achieved by using third moments of the data, with no restrictions on the distribution of \epsilon_t
The following instruments can be used with 2SLS estimation to obtain consistent estimates:
\begin{aligned} q_{1t} &= (G_t - \bar{G}) \\ q_{2t} &= (G_t - \bar{G})(P_t - \bar{P}) \\ q_{3t} &= (G_t - \bar{G})(Y_t - \bar{Y})\\ q_{4t} &= (Y_t - \bar{Y})(P_t - \bar{P}) \\ q_{5t} &= (P_t - \bar{P})^2 \\ q_{6t} &= (Y_t - \bar{Y})^2 \\ \end{aligned}
where
- G_t = G(X_t) for any given function G that has finite third own and cross moments
- X_t = exogenous variable
q_{5t} and q_{6t} can be used only when the measurement error and \epsilon_t are symmetrically distributed. The remaining instruments do not require any distributional assumptions for \epsilon_t.
Since the regressors G(X) = X are already included as exogenous variables, G(X) cannot be a linear function of X in q_{1t}
Since this method relies on very strong assumptions, the Higher Moments Method should only be used in the overidentified case
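The instruments above are straightforward to construct by hand. A minimal sketch on simulated data, taking G(X) = X^2 (any nonlinear G with finite third own and cross moments would do; the data-generating process is purely illustrative):

```r
set.seed(3)
X <- rnorm(200)                      # exogenous regressor
P <- 0.5 * X + rnorm(200)            # endogenous regressor (toy)
Y <- 1 + 2 * P + X + rnorm(200)      # outcome (toy)
G  <- X^2                            # G(X); must not be linear in X for q1
q1 <- G - mean(G)
q2 <- (G - mean(G)) * (P - mean(P))
q3 <- (G - mean(G)) * (Y - mean(Y))
q4 <- (Y - mean(Y)) * (P - mean(P))
q5 <- (P - mean(P))^2                # requires symmetrically distributed errors
q6 <- (Y - mean(Y))^2                # requires symmetrically distributed errors
```

These q's would then be supplied as instruments to a standard 2SLS routine.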
set.seed(111)
m5.hetEr <-
hetErrorsIV(
read ~ stratio + english + lunch + calworks + income +
grades + county |
stratio | IIV(income, english),
data = school
)
summary(m5.hetEr)$coefficients[1:7,]
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 662.78791557 27.90173069 23.7543657 2.380436e-76
#> stratio 0.71480686 1.31077325 0.5453322 5.858545e-01
#> english -0.19522271 0.04057527 -4.8113717 2.188618e-06
#> lunch -0.37834232 0.03927793 -9.6324402 9.760809e-20
#> calworks -0.05665126 0.06302095 -0.8989273 3.692776e-01
#> income 0.82693755 0.17236557 4.7975797 2.335271e-06
#> gradesKK-08 -1.93795843 1.38723186 -1.3969968 1.632541e-01
We recommend using this approach to create additional instruments to be used alongside external ones for better efficiency.
36.1.3.3.1.4 Heteroskedastic Error Approach
- This method uses means of variables that are uncorrelated with the product of heteroskedastic errors to identify structural parameters.
- It can be used either when you don't have external instruments or when you want additional instruments to improve the efficiency of the IV estimator (Lewbel 2012)
- The instruments are constructed as simple functions of data
- Model’s assumptions:
\begin{aligned} E(X \epsilon) &= 0 \\ E(X v ) &= 0 \\ cov(Z, \epsilon v) &= 0 \\ cov(Z, v^2) &\neq 0 \text{ (for identification)} \end{aligned}
Structural parameters are identified by a 2SLS regression of Y on X and P, using X and [Z - E(Z)]v as instruments.
\text{instrument's strength} \propto cov((Z-\bar{Z})v,v)
where cov((Z-\bar{Z})v, v) is the degree of heteroskedasticity of v with respect to Z (Lewbel 2012), which can be tested empirically.
If it is zero or close to zero (i.e., the instrument is weak), you might get imprecise estimates with large standard errors.
- Under homoskedasticity, the parameters of the model are unidentified.
- Under heteroskedasticity related to at least some elements of X, the parameters of the model are identified.
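A by-hand sketch of the construction on simulated data. All names and the data-generating process are made up for illustration: the endogeneity comes from a common factor shared by \epsilon and v, while v is heteroskedastic in Z = X. Only point estimates are shown; proper 2SLS software (or bootstrapping) would be needed for valid standard errors.

```r
set.seed(4)
n   <- 5000
X   <- rnorm(n)                        # exogenous regressor, also plays the role of Z
cf  <- rnorm(n)                        # common factor creating endogeneity
v   <- cf + sqrt(exp(X)) * rnorm(n)    # first-stage error, heteroskedastic in X
eps <- cf + rnorm(n)                   # structural error, correlated with v via cf
P   <- 1 + X + v                       # endogenous regressor
Y   <- 1 + 2 * P - X + eps             # true coefficient on P is 2
v_hat <- resid(lm(P ~ X))              # first-stage residuals
Z_iv  <- (X - mean(X)) * v_hat         # Lewbel-style internal instrument
P_hat <- fitted(lm(P ~ X + Z_iv))      # "first stage" of 2SLS by hand
coef(lm(Y ~ X + P_hat))["P_hat"]       # point estimate of the coefficient on P
```

Note that cov(X, \epsilon v) = 0 here because the common factor is independent of X, while cov(X, v^2) \neq 0 because of the exp(X) term, so the identification conditions above are satisfied.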
36.1.3.3.2 Hierarchical Data
Hierarchical (multilevel) models rely on multiple independence assumptions involving random components at different levels; hence, even a moderate correlation between some predictors and a random component or error term can substantially bias the coefficients and the variance components. (J.-S. Kim and Frees 2007) proposed a generalized method of moments that uses both the between and within variations of the exogenous variables, but only assumes the within variation of the variables to be endogenous.
Assumptions
- the errors at each level are \sim iid N
- the slope variables are exogenous
- the level-1 \epsilon \perp X, P. If this is not the case, additional external instruments are necessary
Hierarchical Model
\begin{aligned} Y_{cst} &= Z_{cst}^1 \beta_{cs}^1 + X_{cst}^1 \beta_1 + \epsilon_{cst}^1 \\ \beta^1_{cs} &= Z_{cs}^2 \beta_{c}^2 + X_{cs}^2 \beta_2 + \epsilon_{cs}^2 \\ \beta^2_{c} &= X^3_c \beta_3 + \epsilon_c^3 \end{aligned}
Bias could stem from:
- errors at the two higher levels (\epsilon_c^3, \epsilon_{cs}^2) are correlated with some of the regressors
- only third level errors (\epsilon_c^3) are correlated with some of the regressors
(J.-S. Kim and Frees 2007) proposed
- When all variables are assumed exogenous, the proposed estimator equals the random effects estimator
- When all variables are assumed endogenous, it equals the fixed effects estimator
- They also propose an omitted variable test (based on the Hausman test (J. A. Hausman 1978) for panel data), which allows comparing a robust estimator against an estimator that is efficient under the null hypothesis of no omitted variables, or comparing two robust estimators at different levels.
set.seed(113)
school$gr08 <- school$grades == "KK-06"
m7.multilevel <-
multilevelIV(read ~ stratio + english + lunch + income + gr08 +
calworks + (1 | county) | endo(stratio),
data = school)
summary(m7.multilevel)$coefficients[1:7,]
Another example using simulated data
- level-1 regressors: X_{11}, X_{12}, X_{13}, X_{14}, X_{15}, where X_{15} is correlated with the level-2 error (i.e., endogenous).
- level-2 regressors: X_{21}, X_{22}, X_{23}, X_{24}
- level-3 regressors: X_{31}, X_{32}, X_{33}
We estimate a three-level model with X_{15} assumed endogenous. With a three-level hierarchy, multilevelIV() returns five estimators, from the most robust to omitted variables (FE_L2) to the most efficient (REF) (i.e., lowest mean squared error).
- The random effects estimator (REF) is efficient assuming no omitted variables
- The fixed effects estimator (FE) is unbiased and asymptotically normal even in the presence of omitted variables.
- Because of its efficiency, the random effects estimator is preferable if you think there are no omitted variables.
- A robust estimator is preferable if you think there are omitted variables.
data(dataMultilevelIV)
set.seed(114)
formula1 <-
y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 +
X31 + X32 + X33 + (1 | CID) + (1 | SID) | endo(X15)
m8.multilevel <-
multilevelIV(formula = formula1, data = dataMultilevelIV)
coef(m8.multilevel)
summary(m8.multilevel, "REF")
The true \beta_{X_{15}} = -1. We can see that some estimators are biased because X_{15} is correlated with the level-2 error; only FE_L2 and GMM_L2 are robust to this.
To select the appropriate estimator, we use the omitted variable test.
In a three-level setting, we can have different estimator comparisons:
- Fixed effects vs. random effects estimators: to test for level-2 and level-3 omitted effects simultaneously, compare FE_L2 to REF. However, this will not tell us at which level the omitted variables exist.
- Fixed effects vs. GMM estimators: once the existence of omitted effects is established but the level is unknown, we test for level-2 omitted effects by comparing FE_L2 vs. GMM_L3. If you reject the null, the omitted variables are at level-2. The same is accomplished by testing FE_L2 vs. GMM_L2, since the latter is consistent only if there are no omitted effects at level-2.
- Fixed effects vs. fixed effects estimators: We can test for omitted level-2 effects, while allowing for omitted level-3 effects by comparing FE_L2 vs. FE_L3 since FE_L2 is robust against both level-2 and level-3 omitted effects while FE_L3 is only robust to level-3 omitted variables.
In summary, use the omitted variable test comparing REF vs. FE_L2 first.
If the null hypothesis is rejected, then there are omitted variables either at level-2 or level-3.
Next, test whether there are level-2 omitted effects, since testing for omitted level-3 effects relies on the assumption that there are no level-2 omitted effects. You can use either of these comparisons:
- FE_L2 vs. FE_L3
- FE_L2 vs. GMM_L2
If no omitted variables at level-2 are found, test for omitted level-3 effects by comparing either
- FE_L3 vs. GMM_L3
- GMM_L2 vs. GMM_L3
summary(m8.multilevel, "REF")
# Compare REF with all the other estimators. Testing REF (the most efficient estimator)
# against FE_L2 (the most robust estimator) is equivalent to testing simultaneously
# for level-2 and level-3 omitted effects.
Since the null hypothesis is rejected (p = 0.000139), there is bias in the random effects estimator.
To test for level-2 omitted effects (regardless of level-3 omitted effects), we compare FE_L2 versus FE_L3
The null hypothesis of no omitted level-2 effects is rejected (p = 3.92e-05). Hence, there are omitted effects at level-2. We should use FE_L2, which is consistent with the underlying data that we generated (the level-2 error is correlated with X_{15}, which leads to biased FE_L3 coefficients).
The omitted variable test between FE_L2 and GMM_L2 should reject the null hypothesis of no omitted level-2 effects (p-value is 0).
If we assume an endogenous variable as exogenous, the RE and GMM estimators will be biased because of the wrong set of internal instrumental variables. To increase our confidence, we should compare the omitted variable tests when the variable is considered endogenous vs. exogenous to get a sense whether the variable is truly endogenous.
36.1.3.4 Proxy Variables
A proxy variable:
- can be used in place of the omitted variable
- will not allow us to estimate the effect of the omitted variable itself
- will reduce some of the endogeneity caused by the omitted variable
- can itself suffer from Measurement Error. Hence, you have to be extremely careful when using proxies.
Criteria for a proxy variable:
- The proxy is correlated with the omitted variable.
- Having the omitted variable in the regression will solve the problem of endogeneity
- The variation of the omitted variable unexplained by the proxy is uncorrelated with all independent variables, including the proxy.
An IQ test score can serve as a proxy for ability in a regression of wage on education.
For the third criterion, we need
ability = \gamma_0 + \gamma_1 IQ + \epsilon
where \epsilon is uncorrelated with education and the IQ score.
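A simulated sketch of the wage example (all names and coefficients are made up): ability is unobserved, and the proxy is constructed so that the part of ability it misses is independent of education (the third criterion). Including the proxy then removes the omitted-variable bias in the return to education.

```r
set.seed(5)
n  <- 5000
IQ <- rnorm(n)                          # standardized IQ score (the proxy)
e  <- rnorm(n)                          # part of ability the proxy misses, independent of educ
ability <- IQ + e                       # ability = gamma0 + gamma1 * IQ + e, with gamma1 = 1
educ <- 12 + IQ + rnorm(n)              # education related to ability only through IQ
wage <- 5 + educ + 2 * ability + rnorm(n)  # true return to education is 1
coef(lm(wage ~ educ))["educ"]           # biased upward: ability is omitted
coef(lm(wage ~ educ + IQ))["educ"]      # near the true value 1
```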