Chapter 10 Instrumental Variables: Using Exogenous Variation to Fight Endogeneity

In this chapter we will learn to use R to estimate instrumental variables and two-stage least squares models. We will use the libraries below.

library(tidyverse)
library(magrittr)
library(broom)

10.1 Two-Stage Least Squares

To estimate a 2SLS model, use ivreg from the AER package. At a minimum, ivreg requires a formula that specifies the dependent and independent variables, the instruments, and the data. The instruments are listed in the second part of the formula, after the | symbol, so the form of the call is, for example, ivreg(Y ~ X1 + X2 | Z1 + Z2 + X2, dataframe).28 Here X1 is the endogenous variable, X2 is exogenous, and Z1 and Z2 are instruments for X1. The exogenous regressor X2 appears on both sides of | because it serves as its own instrument.

library(AER)
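
As a rough illustration of how the two parts of the formula map onto the estimator, the sketch below simulates data under the same names used in the call above (Y, X1, X2, Z1, Z2) and computes the 2SLS coefficients from the textbook matrix formula in base R. The coefficient values, sample size, and error structure are made up for illustration. If AER is installed, ivreg with the same formula should return matching coefficients.

```r
set.seed(10)
n  <- 5000
Z1 <- rnorm(n); Z2 <- rnorm(n)   # instruments (exogenous by construction)
X2 <- rnorm(n)                   # exogenous regressor
e  <- rnorm(n)                   # structural error
# X1 is endogenous: it depends on the structural error e
X1 <- 1 + Z1 + 0.5 * Z2 + 0.5 * X2 + 0.8 * e + rnorm(n)
Y  <- 2 + 1.5 * X1 - 1 * X2 + e
dataframe <- data.frame(Y, X1, X2, Z1, Z2)

# Regressors (left of |) and instruments (right of |)
X <- model.matrix(~ X1 + X2, dataframe)
Z <- model.matrix(~ Z1 + Z2 + X2, dataframe)

# 2SLS: project X on Z, then solve (Xhat'X) b = Xhat'y
Xhat   <- Z %*% solve(crossprod(Z), crossprod(Z, X))
b_2sls <- drop(solve(crossprod(Xhat, X), crossprod(Xhat, dataframe$Y)))
names(b_2sls) <- colnames(X)

b_ols <- coef(lm(Y ~ X1 + X2, dataframe))
rbind(b_2sls, b_ols)   # true values in this simulation: 2, 1.5, -1

# Optional cross-check against ivreg when AER is available
if (requireNamespace("AER", quietly = TRUE)) {
  b_ivreg <- coef(AER::ivreg(Y ~ X1 + X2 | Z1 + Z2 + X2, data = dataframe))
  print(all.equal(unname(b_ivreg), unname(b_2sls)))
}
```

In this simulated setting the OLS coefficient on X1 drifts away from 1.5 because X1 is correlated with the error, while the 2SLS coefficient stays close to it.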

The classic example of endogeneity in economics is the demand equation, which expresses quantity demanded as a function of price, Q=Q(P). We could just as well write P=P(Q): price determines quantity demanded, but quantity demanded also determines price, because the two are set jointly in the market. To solve this problem we need an instrument that is exogenous to the demand equation but related to supply. Changes in this variable shift the supply curve along the demand curve and thus change price. Since these shifts in supply are correlated with (indeed, cause) changes in price, the new variable can serve as an instrument for price.

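Let the demand equation be given by $q_d = \beta_0 + \beta_1 p + u$, supply by $q_s = \alpha_0 + \alpha_1 p + v$, and the market clearing equation by $q_d = q_s = q$. These are known as the structural equations. Solving for $p$ and $q$ separately gives us the reduced form equations. Using the market clearing equation we know
$$\beta_0 + \beta_1 p + u = \alpha_0 + \alpha_1 p + v,$$
so
$$p = \frac{\alpha_0 - \beta_0}{\beta_1 - \alpha_1} + \frac{v - u}{\beta_1 - \alpha_1} = \lambda_0 + \epsilon_1$$
and
$$q = \frac{\beta_1 \alpha_0 - \beta_0 \alpha_1}{\beta_1 - \alpha_1} + \frac{\beta_1 v - \alpha_1 u}{\beta_1 - \alpha_1} = \mu_0 + \epsilon_2.$$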

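Notice that we now have two estimable equations. The OLS estimates of the reduced form parameters are the sample means, $\hat\lambda_0 = \bar p$ and $\hat\mu_0 = \bar q$, which estimate $\lambda_0 = \frac{\alpha_0 - \beta_0}{\beta_1 - \alpha_1}$ and $\mu_0 = \frac{\beta_1 \alpha_0 - \beta_0 \alpha_1}{\beta_1 - \alpha_1}$, respectively.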

What we want, however, are estimates of the structural parameters $\beta_0$, $\beta_1$, $\alpha_0$, and $\alpha_1$. We have two equations and four unknowns, so we cannot recover the four parameters from the two OLS estimates $\hat\lambda_0$ and $\hat\mu_0$. That is, we cannot derive unique values for the structural parameters from our estimates of the reduced form parameters. This is the essence of what’s known as the identification problem. If we can find a unique solution for the structural parameters of an equation from the OLS estimates of the reduced form parameters, then that equation is identified. The parameters of an identified equation are estimable.
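
To see the problem numerically, here is a minimal simulation sketch. The parameter values, sample size, and standard normal errors are assumptions chosen only for illustration. Prices and quantities are generated from the market-clearing solution, and a naive regression of q on p is then compared with the true demand slope.

```r
set.seed(1)
n <- 10000
beta0  <- 10; beta1  <- -1   # demand: q = beta0 + beta1 * p + u
alpha0 <- 2;  alpha1 <- 1    # supply: q = alpha0 + alpha1 * p + v
u <- rnorm(n); v <- rnorm(n)

# Market-clearing price and quantity (the reduced form)
p <- (alpha0 - beta0 + v - u) / (beta1 - alpha1)
q <- beta0 + beta1 * p + u

# The reduced form only pins down the two means
lambda0 <- (alpha0 - beta0) / (beta1 - alpha1)
c(mean_p = mean(p), lambda0 = lambda0)

# A naive regression of q on p recovers neither slope
ols_slope <- coef(lm(q ~ p))[["p"]]
c(ols_slope = ols_slope, demand_slope = beta1, supply_slope = alpha1)
```

With equal error variances, as assumed here, the OLS slope settles between the demand and supply slopes, so it estimates neither structural parameter.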

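Suppose the supply is now given by $q_s = \alpha_0 + \alpha_1 p + \alpha_2 r + v$, where $r$ is an exogenous variable. Solving for $p$ and $q$ yields the reduced form equations $p = \lambda_0 + \lambda_1 r + \epsilon_1$ and $q = \mu_0 + \mu_1 r + \epsilon_2$, where $\lambda_0 = \frac{\alpha_0 - \beta_0}{\beta_1 - \alpha_1}$, $\lambda_1 = \frac{\alpha_2}{\beta_1 - \alpha_1}$, $\mu_0 = \beta_0 + \beta_1 \lambda_0$, and $\mu_1 = \beta_1 \lambda_1$. We can solve for the unique values $\hat\beta_1 = \frac{\hat\mu_1}{\hat\lambda_1}$ and $\hat\beta_0 = \hat\mu_0 - \frac{\hat\mu_1}{\hat\lambda_1}\hat\lambda_0$, so the demand equation is identified. We cannot obtain unique parameter estimates for the supply equation, however, because $\hat\mu_0 = \frac{\hat\beta_1 \alpha_0 - \hat\beta_0 \alpha_1}{\hat\beta_1 - \alpha_1}$ and $\hat\mu_1 = \frac{\hat\beta_1 \alpha_2}{\hat\beta_1 - \alpha_1}$ are only two equations in the three unknowns $\alpha_0$, $\alpha_1$, and $\alpha_2$. If we add an exogenous variable to the demand equation, both equations would be identified.29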
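
The identification argument can be checked with a small simulation. The sketch below assumes illustrative parameter values and standard normal errors (none of them from the text), generates data from the system with the supply shifter r, estimates the two reduced forms by OLS, and backs out the demand parameters from the ratios given above.

```r
set.seed(2)
n <- 5000
beta0  <- 20; beta1  <- -1                 # demand: q = beta0 + beta1 * p + u
alpha0 <- 2;  alpha1 <- 1; alpha2 <- -1    # supply: q = alpha0 + alpha1 * p + alpha2 * r + v
r <- rnorm(n, mean = 5)                    # exogenous supply shifter
u <- rnorm(n); v <- rnorm(n)

# Market-clearing price and quantity
p <- (alpha0 - beta0 + alpha2 * r + v - u) / (beta1 - alpha1)
q <- beta0 + beta1 * p + u

lambda <- coef(lm(p ~ r))   # reduced form for price:    lambda0, lambda1
mu     <- coef(lm(q ~ r))   # reduced form for quantity: mu0, mu1

# Back out the demand parameters from the reduced form estimates
beta1_ils <- mu[["r"]] / lambda[["r"]]
beta0_ils <- mu[["(Intercept)"]] - beta1_ils * lambda[["(Intercept)"]]
c(beta0 = beta0_ils, beta1 = beta1_ils)   # compare with the assumed 20 and -1
```

The supply parameters cannot be backed out the same way, since the two reduced form slopes and intercepts leave alpha0, alpha1, and alpha2 underdetermined.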

This method for obtaining parameter estimates is called indirect least squares (ILS). Let’s use the truffles data set from the PoEdata package.30 truffles is a data frame with 30 observations on 5 variables: p is the price per ounce of premium truffles in dollars, q is the quantity of truffles traded in ounces, ps is the price per ounce of choice truffles in dollars, di is monthly per capita disposable income in thousands of dollars, and pf is the hourly rental fee in dollars of a truffle pig.

library(PoEdata)
data("truffles")

Let the demand function be $q = \beta_0 + \beta_1 p + u$ and the supply function be $q = \alpha_0 + \alpha_1 p + \alpha_2\,\mathit{pf} + v$.

Estimate the two reduced form equations as follows:

truffles %$%
  lm(p ~ pf)

Call:
lm(formula = p ~ pf)

Coefficients:
(Intercept)           pf  
       4.34         2.57  

truffles %$%
  lm(q ~ pf)

Call:
lm(formula = q ~ pf)

Coefficients:
(Intercept)           pf  
     21.501       -0.134  

The reduced form parameter estimates are $\hat\lambda_0 = 4.343$, $\hat\lambda_1 = 2.566$, $\hat\mu_0 = 21.5006$, and $\hat\mu_1 = -0.1337$. The structural form parameter estimates for the demand equation are $\hat\beta_1 = \frac{-0.1337}{2.566} = -0.0521$ and $\hat\beta_0 = 21.5006 - (-0.0521)(4.343) = 21.7269$. So our demand equation is $\hat q_d = 21.7269 - 0.0521\,p$.
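
The arithmetic can be reproduced directly. The sketch below simply plugs in the four-decimal reduced form values quoted in the text, taking the price intercept as 4.343 (the value used in the calculation and shown rounded in the lm output) and the slope of q on pf as negative, as in the lm output.

```r
# Reduced form estimates as reported above
lambda0 <- 4.343;   lambda1 <- 2.566     # from lm(p ~ pf)
mu0     <- 21.5006; mu1     <- -0.1337   # from lm(q ~ pf)

# Indirect least squares estimates of the demand parameters
beta1_hat <- mu1 / lambda1
beta0_hat <- mu0 - beta1_hat * lambda0
round(c(beta0 = beta0_hat, beta1 = beta1_hat), 4)
```

In practice the same quantities could be taken from coef() of the two fitted models rather than typed in by hand, which avoids rounding.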

Below we see that the two-stage least squares estimates are the same.

truffles %$%
  ivreg(q ~ p | pf) %>% 
  summary()

Call:
ivreg(formula = q ~ p | pf)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.350  -2.662   0.148   3.931   9.152 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  21.7269     4.6046    4.72  0.00006 ***
p            -0.0521     0.0718   -0.73     0.47    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.29 on 28 degrees of freedom
Multiple R-Squared: -0.268,	Adjusted R-squared: -0.313 
Wald test: 0.527 on 1 and 28 DF,  p-value: 0.474 
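
The agreement between ILS and 2SLS is not specific to these data: with one endogenous regressor and one instrument, the two estimators are algebraically identical. The sketch below illustrates this on simulated data (the variable r, the coefficients, and the sample size are assumptions for illustration) by running the two stages by hand with lm.

```r
set.seed(3)
n <- 2000
r <- rnorm(n, mean = 5)            # instrument (supply shifter)
u <- rnorm(n); v <- rnorm(n)
p <- 9 + 0.5 * r + 0.5 * (u - v)   # price depends on the demand shock u
q <- 20 - 1 * p + u                # demand equation, true slope -1

# Stage 1: project the endogenous price on the instrument
p_hat <- fitted(lm(p ~ r))
# Stage 2: regress quantity on the fitted price
stage2 <- lm(q ~ p_hat)

# Indirect least squares: ratio of the reduced form slopes
ils_slope <- coef(lm(q ~ r))[["r"]] / coef(lm(p ~ r))[["r"]]

all.equal(coef(stage2)[["p_hat"]], ils_slope)
```

The hand-rolled second stage gives the right coefficients but not the right standard errors, because its residuals are computed from the fitted price rather than the observed price. ivreg makes that correction, which is one reason to prefer it.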

10.2 Explanatory power of the instruments

Now, let the demand for premium truffles be a function of the price of premium truffles, disposable income, and the price of choice truffles. Let the supply of premium truffles be a function of the price of premium truffles and the rental rate of a truffle pig. Suppose we’d like to estimate the demand equation. In this case, pf is the lone instrument for p. Assess the explanatory power of pf as an instrument by regressing p on pf and the exogenous demand variables di and ps:

truffles %$%
  lm(p ~ pf + di + ps) %>% 
  tidy()

# A tibble: 4 x 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   -32.5      7.98      -4.07 0.000387 
2 pf              1.35     0.299      4.54 0.000115 
3 di              7.60     1.72       4.41 0.000160 
4 ps              1.71     0.351      4.87 0.0000476

The t statistic on pf is 4.54, which exceeds 3, so pf is a good instrument for p.
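
With a single instrument, the squared first-stage t statistic equals the F statistic for excluding that instrument, so the reported t of 4.54 corresponds to an F of about 20.6. The sketch below checks the identity on simulated stand-in data; the variables reuse the names pf, di, and ps for readability, but the values are randomly generated and are not the truffles data.

```r
set.seed(4)
n  <- 200
pf <- rnorm(n); di <- rnorm(n); ps <- rnorm(n)   # simulated stand-ins
p  <- 1 + 0.5 * pf + 0.8 * di + 0.6 * ps + rnorm(n)

# First stage: endogenous variable on the instrument and exogenous controls
first_stage <- lm(p ~ pf + di + ps)
t_pf <- coef(summary(first_stage))["pf", "t value"]

# F test for dropping the single instrument pf
restricted <- lm(p ~ di + ps)
F_pf <- anova(restricted, first_stage)$F[2]

c(t = t_pf, t_squared = t_pf^2, F = F_pf)
```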

Similarly, we can estimate the supply of premium truffles as a function of the price of premium truffles and the rental rate of a truffle pig. Using the demand function from above, we now have two instruments for p in the supply equation, ps and di. Since there is only one exogenous variable in the supply equation, the F test for the instruments is simply the F test for overall significance of the regression $\mathit{pf} = \beta_0 + \beta_1\,\mathit{ps} + \beta_2\,\mathit{di} + \epsilon$.

truffles %$%
  lm(pf ~ ps + di) %>% 
  glance()

# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.407         0.363  4.25      9.27 8.63e-4     2  -84.4  177.  182.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The F statistic is 9.27, which is slightly below the rule of thumb of 10 for multiple instruments.
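
Another common way to judge instrument strength with several instruments is to regress the endogenous variable on all exogenous variables and test the excluded instruments jointly. The sketch below shows that test on simulated stand-in data, again borrowing the names p, pf, ps, and di only for readability. It is an alternative to the regression above, not a reproduction of it, and it implies nothing about the truffles results.

```r
set.seed(5)
n  <- 200
pf <- rnorm(n); ps <- rnorm(n); di <- rnorm(n)   # simulated stand-ins
p  <- 1 + 0.5 * pf + 0.3 * ps + 0.3 * di + rnorm(n)

unrestricted <- lm(p ~ pf + ps + di)   # first stage with every exogenous variable
restricted   <- lm(p ~ pf)             # drops the excluded instruments ps and di

# Joint F test that the coefficients on ps and di are both zero
F_joint <- anova(restricted, unrestricted)$F[2]

# The same statistic from the residual sums of squares (2 restrictions)
rss_r <- deviance(restricted)
rss_u <- deviance(unrestricted)
F_manual <- ((rss_r - rss_u) / 2) / (rss_u / df.residual(unrestricted))

c(F_joint = F_joint, F_manual = F_manual)
```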

10.3 Estimating a Simultaneous Equations Model

We can estimate the model posed above by estimating each equation in turn, first demand and then supply, as follows:

truffles %$%
  ivreg(q ~ p + ps + di | p + ps + di + pf) %>% 
  summary()

Call:
ivreg(formula = q ~ p + ps + di | p + ps + di + pf)

Residuals:
   Min     1Q Median     3Q    Max 
-7.155 -1.936 -0.374  2.396  6.335 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   1.0910     3.7116    0.29   0.7711   
p             0.0233     0.0768    0.30   0.7642   
ps            0.7100     0.2143    3.31   0.0027 **
di            0.0764     1.1909    0.06   0.9493   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.46 on 26 degrees of freedom
Multiple R-Squared: 0.496,	Adjusted R-squared: 0.438 
Wald test: 8.52 on 3 and 26 DF,  p-value: 0.000416 

truffles %$%
  ivreg(q ~ p + pf | p + ps + di + pf) %>% 
  summary()

Call:
ivreg(formula = q ~ p + pf | p + ps + di + pf)

Residuals:
   Min     1Q Median     3Q    Max 
-3.783 -0.853  0.227  0.758  3.347 

Coefficients:
            Estimate Std. Error t value           Pr(>|t|)    
(Intercept)  20.0328     1.2220    16.4 0.0000000000000015 ***
p             0.3380     0.0217    15.5 0.0000000000000054 ***
pf           -1.0009     0.0764   -13.1 0.0000000000003235 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.5 on 27 degrees of freedom
Multiple R-Squared: 0.902,	Adjusted R-squared: 0.895 
Wald test:  124 on 2 and 27 DF,  p-value: 0.0000000000000245 
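
One point about the two-part formula is worth checking when reading these results. ivreg treats every variable to the right of | as exogenous, and the calls above list p on both sides, so p acts as its own instrument. In that situation the 2SLS coefficients coincide with OLS. To treat p as endogenous it would be left out of the instrument list, for example q ~ p + ps + di | ps + di + pf for demand and q ~ p + pf | ps + di + pf for supply. The sketch below demonstrates the difference on simulated stand-in data (not the truffles data) using a small hypothetical helper, tsls, that implements the 2SLS matrix formula in base R.

```r
set.seed(6)
n  <- 1000
ps <- rnorm(n); di <- rnorm(n); pf <- rnorm(n)   # simulated stand-ins
u  <- rnorm(n)                                   # demand shock
p  <- 5 + 0.5 * ps + 0.5 * di + 1 * pf + 0.7 * u + rnorm(n)  # price responds to u
q  <- 10 - 1 * p + 0.8 * ps + 0.6 * di + u                   # demand, true price slope -1

# Hypothetical helper: 2SLS coefficients from regressors X and instruments Z
tsls <- function(y, X, Z) {
  Xhat <- Z %*% solve(crossprod(Z), crossprod(Z, X))   # first-stage fitted values
  b <- drop(solve(crossprod(Xhat, X), crossprod(Xhat, y)))
  names(b) <- colnames(X)
  b
}

X <- model.matrix(~ p + ps + di)
b_with_p    <- tsls(q, X, model.matrix(~ p + ps + di + pf))  # p is its own instrument
b_without_p <- tsls(q, X, model.matrix(~ ps + di + pf))      # p instrumented by pf
b_ols       <- coef(lm(q ~ p + ps + di))

all.equal(b_with_p, b_ols)   # identical to OLS when p is on both sides
rbind(b_with_p, b_without_p, b_ols)
```

In this simulation only the specification that excludes p from the instrument list recovers a price slope near the assumed value of -1.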

  28. The data argument can instead be supplied with the exposition operator %$% from the magrittr package.

  29. The reader can verify this by adding the exogenous variable y to the demand equation to yield $q_d = \beta_0 + \beta_1 p + \beta_2 y + u$ and solving for the reduced form equations.

  30. Install the PoEdata package as follows: Install the remotes package with install.packages("remotes"). The remotes package allows you to install R packages from remote repositories such as GitHub. Install the PoEdata package by calling remotes::install_github("ccolonescu/PoEdata"). Finally, load the truffles data by calling data("truffles").