Chapter 10 Instrumental Variables: Using Exogenous Variation to Fight Endogeneity

In this chapter we will learn to use R to estimate instrumental variables and two-stage least squares models. We will use the libraries below.

library(tidyverse)
library(magrittr)
library(broom)

10.1 Two-Stage Least Squares

To estimate a 2SLS model, use ivreg from the AER package. At a minimum, ivreg requires a formula and the data. The formula lists the dependent and independent variables to the left of the | and the instruments to the right. The call takes the form, for example, ivreg(Y ~ X1 + X2 | Z1 + Z2 + X2, dataframe),28 where X1 is the endogenous variable, X2 is exogenous, and Z1 and Z2 are instruments for X1. Note that the exogenous variable X2 appears on both sides of the |, because exogenous regressors serve as their own instruments.

library(AER)
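
To see what the two stages actually do, the following minimal sketch carries them out by hand with lm on simulated data. It uses base R only; the variable names (y, x1, x2, z1, z2), the sample size, and the true coefficients are all assumptions made for illustration and are not part of any data set used in this chapter.

```r
# Simulated data: every name and coefficient here is hypothetical.
set.seed(42)
n  <- 5000
z1 <- rnorm(n)            # instrument 1
z2 <- rnorm(n)            # instrument 2
x2 <- rnorm(n)            # exogenous regressor
e  <- rnorm(n)            # unobserved factor that makes x1 endogenous
x1 <- 0.8 * z1 + 0.5 * z2 + 0.3 * x2 + e
y  <- 1 + 2 * x1 - 1 * x2 + 1.5 * e + rnorm(n)   # true coefficient on x1 is 2

# Naive OLS: biased because x1 is correlated with the error term
ols <- lm(y ~ x1 + x2)

# Stage 1: regress the endogenous variable on the instruments
# and on the exogenous regressor
stage1 <- lm(x1 ~ z1 + z2 + x2)
x1_hat <- fitted(stage1)

# Stage 2: replace x1 with its first-stage fitted values
stage2 <- lm(y ~ x1_hat + x2)

b_ols  <- unname(coef(ols)["x1"])
b_2sls <- unname(coef(stage2)["x1_hat"])
c(ols = b_ols, two_stage = b_2sls)
```

The by-hand coefficient should land close to the true value of 2, while OLS overshoots. The standard errors reported by the second-stage lm are not valid, because they ignore that x1_hat is itself estimated; this is one reason to prefer ivreg in practice.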

The classic example of endogeneity in economics is the demand equation, which gives quantity demanded as a function of price, \(Q=Q(P)\). We could just as well write \(P=P(Q)\): price determines quantity demanded, but quantity demanded also determines price. That is, price depends on quantity demanded, which in turn depends on price. To solve this problem we need an instrument that is exogenous to the demand equation but related to supply. Such a variable shifts the supply curve along the demand curve and thereby changes price. Because these shifts in supply cause changes in price, the variable can serve as an instrument for price.

Let the demand equation be given by \[q_d=\beta_0+\beta_1p+u,\] supply by \[q_s=\alpha_0+\alpha_1p+v,\] and the market clearing equation by \[q_d=q_s=q\] These are known as the structural equations. Solving for \(p\) and \(q\) separately gives us the reduced form equations. Using the market clearing equation we know: \[\beta_0+\beta_1p+u=\alpha_0+\alpha_1p+v\] so, \[p=\frac{\alpha_0-\beta_0}{\beta_1-\alpha_1}+\frac{v-u}{\beta_1-\alpha_1}=\lambda_0+\epsilon_1\] and \[q=\frac{\beta_1\alpha_0-\beta_0\alpha_1}{\beta_1-\alpha_1}+\frac{\beta_1v-\alpha_1u}{\beta_1-\alpha_1}=\mu_0+\epsilon_2\]

Notice that we now have two estimable equations. The OLS estimates of the reduced form parameters are \[\hat\lambda_0=\bar p\] and \[\hat\mu_0=\bar q,\] where \(\bar p\) and \(\bar q\) are the sample means of \(p\) and \(q\). These are estimates of \(\lambda_0=\frac{\alpha_0-\beta_0}{\beta_1-\alpha_1}\) and \(\mu_0=\frac{\beta_1\alpha_0-\beta_0\alpha_1}{\beta_1-\alpha_1}\), respectively.

What we want, however, are estimates of the structural parameters \(\beta_0\), \(\beta_1\), \(\alpha_0\), and \(\alpha_1\). With two equations and four unknowns, we cannot derive unique values for the structural parameters from the two OLS estimates, \(\hat\lambda_0\) and \(\hat\mu_0\). This is the essence of what’s known as the identification problem. If we can find a unique solution for the structural parameters of an equation from the OLS estimates of the reduced form parameters, then that equation is identified. The parameters of an identified equation are estimable.
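
A small simulation can make the identification problem concrete. The sketch below generates equilibrium prices and quantities from the two structural equations and then regresses q on p. The parameter values, the error standard deviations, and the assumption that u and v are uncorrelated are all choices made for illustration.

```r
# Hypothetical structural parameters (chosen for illustration only)
set.seed(1)
n <- 10000
beta0  <- 10; beta1  <- -1   # demand: q = beta0 + beta1 * p + u
alpha0 <- 2;  alpha1 <- 1    # supply: q = alpha0 + alpha1 * p + v
u <- rnorm(n, sd = 1)        # demand shock
v <- rnorm(n, sd = 2)        # supply shock

# Equilibrium price from the reduced form, then quantity from demand
p <- (alpha0 - beta0) / (beta1 - alpha1) + (v - u) / (beta1 - alpha1)
q <- beta0 + beta1 * p + u

# OLS of q on p recovers neither the demand slope nor the supply slope
slope <- unname(coef(lm(q ~ p))["p"])
slope

# Large-sample value under these assumptions: a variance-weighted
# average of the two slopes, (beta1 * 4 + alpha1 * 1) / (4 + 1) = -0.6
```

With uncorrelated shocks, the OLS slope settles near a weighted average of \(\beta_1\) and \(\alpha_1\), with weights given by the variances of v and u. Without outside information there is no way to tell from this single number what either slope is.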

Suppose supply is now given by \[q_s=\alpha_0+\alpha_1p+\alpha_2r+v,\] where \(r\) is an exogenous variable. Solving for \(p\) and \(q\) yields the reduced form equations \[p=\lambda_0+\lambda_1r+\epsilon_1\] and \[q=\mu_0+\mu_1r+\epsilon_2,\] where \(\lambda_0=\frac{\alpha_0-\beta_0}{\beta_1-\alpha_1}\), \(\lambda_1=\frac{\alpha_2}{\beta_1-\alpha_1}\), \(\mu_0=\beta_0+\beta_1\lambda_0\), and \(\mu_1=\beta_1\lambda_1\). We can solve for unique values of \(\hat\beta_1=\frac{\hat\mu_1}{\hat\lambda_1}\) and \(\hat\beta_0=\hat\mu_0-\frac{\hat\mu_1}{\hat\lambda_1}\hat\lambda_0\), so the demand equation is identified. We cannot obtain unique parameter estimates for the supply equation, however, because \(\hat\mu_0=\frac{\hat\beta_1\alpha_0-\hat\beta_0\alpha_1}{\hat\beta_1-\alpha_1}\) and \(\hat\mu_1=\frac{\hat\beta_1\alpha_2}{\hat\beta_1-\alpha_1}\) are only two equations in the three unknowns \(\alpha_0\), \(\alpha_1\), and \(\alpha_2\). If we added an exogenous variable to the demand equation, both equations would be identified.29
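
The algebra above can be checked numerically. The following sketch simulates a market with an exogenous supply shifter r, estimates the two reduced form equations by OLS, and backs out the demand parameters from the ratio of reduced form slopes. All parameter values and the distribution of r are assumptions made for illustration.

```r
# Hypothetical structural parameters (chosen for illustration only)
set.seed(2)
n <- 10000
beta0  <- 10; beta1  <- -1                 # demand
alpha0 <- 2;  alpha1 <- 1; alpha2 <- -1    # supply, shifted by r
r <- rnorm(n, mean = 3, sd = 2)            # exogenous supply shifter
u <- rnorm(n)                              # demand shock
v <- rnorm(n)                              # supply shock

# Equilibrium price and quantity implied by market clearing
p <- (alpha0 - beta0 + alpha2 * r + v - u) / (beta1 - alpha1)
q <- beta0 + beta1 * p + u

# Reduced form regressions
rf_p <- coef(lm(p ~ r))   # lambda0_hat, lambda1_hat
rf_q <- coef(lm(q ~ r))   # mu0_hat, mu1_hat

# Indirect least squares: solve for the demand parameters
beta1_hat <- unname(rf_q["r"] / rf_p["r"])
beta0_hat <- unname(rf_q["(Intercept)"] - beta1_hat * rf_p["(Intercept)"])
c(beta0 = beta0_hat, beta1 = beta1_hat)
```

The recovered values should be close to the demand parameters used to generate the data (10 and -1), even though a direct regression of q on p would not recover them.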

This method for obtaining parameter estimates is called indirect least squares (ILS). Let’s apply it to the truffles data set from the PoEdata package.30 truffles is a data frame with 30 observations on 5 variables: p is the price per ounce of premium truffles in $, q is the quantity of truffles traded in ounces, ps is the price per ounce of choice truffles in $, di is monthly per capita disposable income in $1000, and pf is the hourly rental fee in $ of a truffle pig.

library(PoEdata)
data("truffles")

Let the demand function be \[q=\beta_0+\beta_1p+u\] and the supply function be \[q=\alpha_0+\alpha_1p+\alpha_2pf+v\]

Estimate the two reduced form equations as follows:

truffles %$%
  lm(p ~ pf)

Call:
lm(formula = p ~ pf)

Coefficients:
(Intercept)           pf  
       4.34         2.57  

truffles %$%
  lm(q ~ pf)

Call:
lm(formula = q ~ pf)

Coefficients:
(Intercept)           pf  
     21.501       -0.134  

The reduced form parameter estimates are \(\hat\lambda_0=4.343\), \(\hat\lambda_1=2.566\), \(\hat\mu_0=21.5006\), and \(\hat\mu_1=-0.1337\). The structural form parameter estimates for the demand equation are \(\hat\beta_1=\frac{-0.1337}{2.566}=-0.0521\) and \(\hat\beta_0=21.5006-(-0.0521)\times 4.343=21.7269\). So our demand equation is \(\hat q_d=21.7269-0.0521p\).
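
The same arithmetic can be done in R. This short sketch simply plugs in the reduced form estimates quoted above (typed in by hand, rounded as reported) rather than extracting them from the fitted models; the object names are chosen for illustration.

```r
# Reduced form estimates as reported in the text (rounded)
lambda0 <- 4.343;   lambda1 <- 2.566     # from lm(p ~ pf)
mu0     <- 21.5006; mu1     <- -0.1337   # from lm(q ~ pf)

# Indirect least squares estimates of the demand equation
beta1_hat <- mu1 / lambda1
beta0_hat <- mu0 - beta1_hat * lambda0
round(c(beta0 = beta0_hat, beta1 = beta1_hat), 4)
```

Rounded to four decimals, this gives 21.7269 for the intercept and -0.0521 for the slope.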

Below we see that the two-stage least squares estimates are the same as the ILS estimates.

truffles %$%
  ivreg(q ~ p | pf) %>% 
  summary()

Call:
ivreg(formula = q ~ p | pf)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.350  -2.662   0.148   3.931   9.152 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  21.7269     4.6046    4.72  0.00006 ***
p            -0.0521     0.0718   -0.73     0.47    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.29 on 28 degrees of freedom
Multiple R-Squared: -0.268,	Adjusted R-squared: -0.313 
Wald test: 0.527 on 1 and 28 DF,  p-value: 0.474 

10.2 Explanatory power of the instruments

Now, let the demand for premium truffles be a function of the price of premium truffles, disposable income, and the price of choice truffles. Let the supply of premium truffles be a function of the price of premium truffles and the rental rate of a truffle pig. Suppose we’d like to estimate the demand equation. In this case, pf is the lone instrument for p. Assess the explanatory power of pf as an instrument as follows:

truffles %$%
  lm(p ~ pf + di + ps) %>% 
  tidy()
# A tibble: 4 x 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   -32.5      7.98      -4.07 0.000387 
2 pf              1.35     0.299      4.54 0.000115 
3 di              7.60     1.72       4.41 0.000160 
4 ps              1.71     0.351      4.87 0.0000476

The t statistic on pf is 4.54, which exceeds the rule-of-thumb threshold of 3, so pf is a sufficiently strong instrument for p.
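
With a single instrument, the first-stage t statistic and the F statistic for excluding that instrument carry the same information, because the F statistic equals the square of the t statistic. A t of 3 therefore corresponds to an F of 9, close to the threshold of 10 often quoted for F tests. The sketch below checks the identity on simulated data; the variable names (x, z, w) and coefficients are assumptions for illustration.

```r
# Simulated first stage with one instrument z and one exogenous control w
set.seed(3)
n <- 200
z <- rnorm(n)
w <- rnorm(n)
x <- 0.5 * z + 0.4 * w + rnorm(n)   # endogenous regressor

first_stage <- lm(x ~ z + w)   # includes the instrument
restricted  <- lm(x ~ w)       # drops the instrument

# t statistic on the instrument
t_z <- summary(first_stage)$coefficients["z", "t value"]

# F statistic for excluding the instrument (nested model comparison)
F_z <- anova(restricted, first_stage)$F[2]

c(t = t_z, t_squared = t_z^2, F = F_z)
```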

Similarly we can estimate the supply of premium truffles as a function of the price of premium truffles and the rental rate of a truffle pig. Using the demand function from above, we now have two instruments for p in the supply equation, ps and di. Since there is only one exogenous variable in the supply equation, the F test for the instruments is simply the F test for overall significance for the regression \(pf = \beta_0+\beta_1ps+\beta_2di+\epsilon\).

truffles %$%
  lm(pf ~ ps + di) %>% 
  glance()
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.407         0.363  4.25      9.27 8.63e-4     2  -84.4  177.  182.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The F statistic is 9.27, which is slightly below the rule-of-thumb threshold of 10 for multiple instruments.
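
A common alternative way to gauge the joint strength of several instruments is to regress the endogenous variable on all exogenous variables, with and without the excluded instruments, and compare the two fits with an F test. The sketch below shows the mechanics on simulated data with base R. The names (x, z1, z2, w) and coefficients are assumptions for illustration, so the resulting statistic says nothing about the truffles data.

```r
# Simulated first stage with two excluded instruments
set.seed(4)
n  <- 200
z1 <- rnorm(n)                   # excluded instrument 1
z2 <- rnorm(n)                   # excluded instrument 2
w  <- rnorm(n)                   # exogenous regressor that stays in the equation
x  <- 0.3 * z1 + 0.2 * z2 + 0.5 * w + rnorm(n)   # endogenous regressor

with_iv    <- lm(x ~ z1 + z2 + w)   # first stage including the instruments
without_iv <- lm(x ~ w)             # first stage without them

# Joint F test that the coefficients on z1 and z2 are both zero
joint   <- anova(without_iv, with_iv)
F_joint <- joint$F[2]
F_joint
```

The statistic in F_joint is the one usually compared with the threshold of 10.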

10.3 Estimating a Simultaneous Equations Model

We can estimate the model posed above one equation at a time, first the demand equation and then the supply equation:

truffles %$%
  ivreg(q ~ p + ps + di | p + ps + di + pf) %>% 
  summary()

Call:
ivreg(formula = q ~ p + ps + di | p + ps + di + pf)

Residuals:
   Min     1Q Median     3Q    Max 
-7.155 -1.936 -0.374  2.396  6.335 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   1.0910     3.7116    0.29   0.7711   
p             0.0233     0.0768    0.30   0.7642   
ps            0.7100     0.2143    3.31   0.0027 **
di            0.0764     1.1909    0.06   0.9493   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.46 on 26 degrees of freedom
Multiple R-Squared: 0.496,	Adjusted R-squared: 0.438 
Wald test: 8.52 on 3 and 26 DF,  p-value: 0.000416 

truffles %$%
  ivreg(q ~ p + pf | p + ps + di + pf) %>% 
  summary()

Call:
ivreg(formula = q ~ p + pf | p + ps + di + pf)

Residuals:
   Min     1Q Median     3Q    Max 
-3.783 -0.853  0.227  0.758  3.347 

Coefficients:
            Estimate Std. Error t value           Pr(>|t|)    
(Intercept)  20.0328     1.2220    16.4 0.0000000000000015 ***
p             0.3380     0.0217    15.5 0.0000000000000054 ***
pf           -1.0009     0.0764   -13.1 0.0000000000003235 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.5 on 27 degrees of freedom
Multiple R-Squared: 0.902,	Adjusted R-squared: 0.895 
Wald test:  124 on 2 and 27 DF,  p-value: 0.0000000000000245 
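
One point is worth keeping in mind when writing the instrument list. In two-stage least squares, any regressor that also appears among the instruments is predicted perfectly in the first stage, so it is treated as exogenous. If the endogenous regressor itself is listed after the |, the estimate therefore coincides with OLS. Because p appears on both sides of the | in the calls above, it may be worth comparing those results with a specification that lists only ps, di, and pf as instruments. The sketch below illustrates the general point by hand on simulated data with base R; the names (y, x, z) and coefficients are assumptions for illustration, and x_copy is a hypothetical stand-in for listing x among its own instruments.

```r
# Simulated data with one endogenous regressor x and one valid instrument z
set.seed(5)
n <- 2000
z <- rnorm(n)                        # valid instrument
e <- rnorm(n)                        # source of endogeneity
x <- z + e + rnorm(n)                # endogenous regressor
y <- 1 + 2 * x + 2 * e + rnorm(n)    # true coefficient on x is 2

# Second stage by hand: regress y on the first-stage fitted values
two_stage <- function(first_stage) {
  x_hat <- fitted(first_stage)
  unname(coef(lm(y ~ x_hat))["x_hat"])
}

x_copy <- x   # stands in for listing x itself among the instruments

b_ols       <- unname(coef(lm(y ~ x))["x"])
b_valid     <- two_stage(lm(x ~ z))            # instrument list: z only
b_collapsed <- two_stage(lm(x ~ z + x_copy))   # instrument list: z and x

c(ols = b_ols, valid_iv = b_valid, x_in_instruments = b_collapsed)
```

With only z as an instrument the estimate should be near the true value of 2, while the version that includes x among the instruments reproduces the biased OLS coefficient.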

  1. The data argument can instead be supplied with the exposition operator %$% from the magrittr package.↩︎

  2. The reader can verify this by adding an exogenous variable, say y, to the demand equation to yield \(q_d=\beta_0+\beta_1p+\beta_2y+u\) and solving for the reduced form equations.↩︎

  3. Install the PoEdata package as follows: Install the remotes package with install.packages("remotes"). The remotes package allows you to install R packages from remote repositories such as GitHub. Install the PoEdata package by calling remotes::install_github("ccolonescu/PoEdata"). Finally, load the truffles data by calling data("truffles").↩︎