Chapter 10 Instrumental Variables: Using Exogenous Variation to Fight Endogeneity
In this chapter we will learn to use R to instrumental variables and two–stage least squares models. We will use the libraries below.
10.1 2 Stage Least Squares
To estimate a 2SLS, use ivreg
from the AER
package. ivreg
, at a minimum, requires a formula that specifies the dependent and independent variables, instruments that identify instrumental variables, and the data. So the form of the call is, for example, ivreg
(Y ~ X1 + X2 | Z1 + Z2 + X2, dataframe).28 Where X1 is the endogenous variable, X2 is exogenous and Z1 and Z2 are instruments for X1.
The classic example of endogeneity in economics is that of a demand equation, that is of quantity demanded as a function of price, \(Q=Q(P)\). There is no reason we can’t write \(P=P(Q)\) because a price determines quantity demanded, but we can’t have a quantity without a price. That is, price depends on quantity demanded which depends on price. To solve this problem we need an instrument that is exogenous to the demand equation but related to supply. This variable will induce changes in supply along the demand curve and thus changes in price. Since changes in supply will be correlated (cause) with changes in price, this new variable can serve as an instrument for price.
Let the demand equation be given by \[q_d=\beta_0+\beta_1p+u,\] supply by \[q_s=\alpha_0+\alpha_1p+v,\] and the market clearing equation by \[q_d=q_s=q\] These are known as the structural equations. Solving for \(p\) and \(q\) separately gives us the reduced form equations. Using the market clearing equation we know: \[\beta_0+\beta_1p+u=\alpha_0+\alpha_1p+v\] so, \[p=\frac{\alpha_0-\beta_0}{\beta_1-\alpha_1}+\frac{v-u}{\beta_1-\alpha_1}=\lambda_0+\epsilon_1\] and \[q=\frac{\beta_1\alpha_0-\beta_0\alpha_1}{\beta_1-\alpha_1}+\frac{\beta_1v-\alpha_1u}{\beta_1-\alpha_1}=\mu_0+\epsilon_2\]
Notice that we have two estimable equations now. We can obtain OLS estimates for the reduced form parameters as \(\hat\lambda_0\) and \(\hat\mu_0\) as \[\hat\lambda_0=\bar p = \frac{\alpha_0-\beta_0}{\beta_1-\alpha_1}\] and \[\hat\mu_o=\bar q=\frac{\beta_1\alpha_0-\beta_0\alpha_1}{\beta_1-\alpha_1}\] where \(\bar p\) and \(\bar q\) are the sample means of \(p\) and \(q\).
What we want, however, are estimates of the structural parameters \(\beta_0\), \(\beta_1\), \(\alpha_0\), and \(\alpha_1\). We have two equations and four unknowns; we cannot estimate the four parameters from the the two OLS estimates, \(\hat\lambda_0\) and \(\hat\mu_0\). That is, we cannot derive unique values for structural parameters from our estimates of the reduced form parameters. This is the essence of what’s known as the identification problem. If we can find a unique solution to the structural parameters from the OLS estimates of the reduced form parameters, then the equation is identified. The parameters of an identified equation are estimable.
Suppose the supply is now given by \[q_s=\alpha_0+\alpha_1p+\alpha_2r+v\] where r is an exogenous variable. Solving for p and q yields the reduced form equations \[p=\lambda_0+\lambda_1r+\epsilon_1\] and \[q=\mu_0+\mu_1r+\epsilon_2\] where \(\lambda_0=\frac{\alpha_0-\beta_0}{\beta_1-\alpha_1}\), \(\lambda_1=\frac{\alpha_2}{\beta_1-\alpha_1}\), \(\mu_0=\beta_0+\beta_1\lambda_0\), and \(\mu_1=\beta_1\lambda_1\). We can solve for unique values of \(\hat\beta_0=\hat\mu_0-\frac{\hat\mu_1}{\hat\lambda_1}\hat\lambda_0\) and \(\hat\beta_1=\frac{\hat\mu_1}{\hat\lambda_1}\). So the demand equation is identified. We can not obtain unique parameter estimates for the supply equation, however, so because \(\hat\mu_0=\frac{\hat\beta_1\alpha_1-\hat\beta_0\alpha_1}{\hat\beta_1-\alpha_1}\) and \(\hat\mu_1=\frac{\hat\beta_1\alpha_2}{\hat\beta_1-\alpha_1}\) are only two equations with three unknowns. If we add an exogenous variable to the demand equation, both equations would be identified.29
This method for obtaining parameter estimates is called indirect least squares (ILS). Let’s use the truffles data set from the PoEdata
package.30 Truffles is a data frame with 30 observations on 5 variables. p is the price per ounce of premium truffles in $, q is the quantity of truffles traded in ounces, ps is the price per ounce of choice truffles in $, di is monthly per capita disposable income in $1000 per month, and pf is the hourly rental fee in $ of a truffle pig.
Let the demand function be \[q=\beta_0+\beta_1p+u\] and the supply function be \[q=\alpha_0+\alpha_1p+\alpha_2pf+v\]
Estimate the two reduced form equations as follows:
Call:
lm(formula = p ~ pf)
Coefficients:
(Intercept) pf
4.34 2.57
Call:
lm(formula = q ~ pf)
Coefficients:
(Intercept) pf
21.501 -0.134
The reduced form parameter estimates are \(\hat\lambda_0=3.343\), \(\hat\lambda_1=2.566\), \(\hat\mu_0=21.5006\), and \(\hat\mu_1=-0.1337\). The structural from parameter estimates for the demand equation are \(\hat\beta_1=\frac{-0.1337}{2.566}=-0.0521\) and \(\hat\beta_0=21.5006-(-0.0521)*4.343=21.7269\). So are demand equation is \(q_d=21.7269-0.0521p\).
Below we see the two stage least square estimates are the same.
Call:
ivreg(formula = q ~ p | pf)
Residuals:
Min 1Q Median 3Q Max
-13.350 -2.662 0.148 3.931 9.152
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.7269 4.6046 4.72 0.00006 ***
p -0.0521 0.0718 -0.73 0.47
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.29 on 28 degrees of freedom
Multiple R-Squared: -0.268, Adjusted R-squared: -0.313
Wald test: 0.527 on 1 and 28 DF, p-value: 0.474
10.2 Explanatory power of the instruments
Now, let the demand for premium truffles be a function of the price premium truffles, disposable income, and the price of choice truffles. Let the supply of premium truffles be a function the price of premium truffles and the rental rate of a truffle pig. Suppose we’d like to estimate the demand equation. In this case, pf is the lone instrument for p. Assess the explanatory power of pf as an instrument as follows:
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -32.5 7.98 -4.07 0.000387
2 pf 1.35 0.299 4.54 0.000115
3 di 7.60 1.72 4.41 0.000160
4 ps 1.71 0.351 4.87 0.0000476
The t statistic exceeds 3, so pf is a good instrument for p.
Similarly we can estimate the supply of premium truffles as a function of the price of premium truffles and the rental rate of a truffle pig. Using the demand function from above, we now have two instruments for p in the supply equation, ps and di. Since there is only one exogenous variable in the supply equation, the F test for the instruments is simply the F test for overall significance for the regression \(pf = \beta_0+\beta_1ps+\beta_2di+\epsilon\).
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.407 0.363 4.25 9.27 8.63e-4 2 -84.4 177. 182.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The F statistic is 9.27 which is slightly below the rule of thumb of 10 for multiple instruments.
10.3 Estimating Simultaneous Equation Model
We can estimate the model posed above by estimating each equation as follows:
Call:
ivreg(formula = q ~ p + ps + di | p + ps + di + pf)
Residuals:
Min 1Q Median 3Q Max
-7.155 -1.936 -0.374 2.396 6.335
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.0910 3.7116 0.29 0.7711
p 0.0233 0.0768 0.30 0.7642
ps 0.7100 0.2143 3.31 0.0027 **
di 0.0764 1.1909 0.06 0.9493
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.46 on 26 degrees of freedom
Multiple R-Squared: 0.496, Adjusted R-squared: 0.438
Wald test: 8.52 on 3 and 26 DF, p-value: 0.000416
Call:
ivreg(formula = q ~ p + pf | p + ps + di + pf)
Residuals:
Min 1Q Median 3Q Max
-3.783 -0.853 0.227 0.758 3.347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.0328 1.2220 16.4 0.0000000000000015 ***
p 0.3380 0.0217 15.5 0.0000000000000054 ***
pf -1.0009 0.0764 -13.1 0.0000000000003235 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.5 on 27 degrees of freedom
Multiple R-Squared: 0.902, Adjusted R-squared: 0.895
Wald test: 124 on 2 and 27 DF, p-value: 0.0000000000000245
The data argument can be called with the expose operator %$%.↩︎
The reader can verify that be saying adding the exogenous variable y to the demand equation to yield \(q_d=\beta_0+\beta_1p+\beta_2y+u\) and solving for the reduced form equations.↩︎
Install the
PoEdata
package as follows: Install theremotes
package withinstall.packages("remotes")
. The remotes package allows you to install R packages from remote repositories such as GitHub. Install thePoEdata
package by callingremotes::install_github("ccolonescu/PoEdata")
. Finally, load the truffles data by callingdata("truffles")
.↩︎