12.2 The General IV Regression Model

The simple IV regression model is easily extended to a multiple regression model which we refer to as the general IV regression model. In this model we distinguish between four types of variables: the dependent variable, included exogenous variables, included endogenous variables and instrumental variables. Key Concept 12.1 summarizes the model and the common terminology. See Chapter 12.2 of the book for a more comprehensive discussion of the individual components of the general model.

Key Concept 12.1

The General Instrumental Variables Regression Model and Terminology

$$Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki} + \beta_{k+1} W_{1i} + \dots + \beta_{k+r} W_{ri} + u_i, \tag{12.5}$$

with $i=1,\dots,n$ is the general instrumental variables regression model where

  • $Y_i$ is the dependent variable

  • $\beta_0,\dots,\beta_{k+r}$ are $1+k+r$ unknown regression coefficients

  • $X_{1i},\dots,X_{ki}$ are $k$ endogenous regressors

  • $W_{1i},\dots,W_{ri}$ are $r$ exogenous regressors which are uncorrelated with $u_i$

  • $u_i$ is the error term

  • $Z_{1i},\dots,Z_{mi}$ are $m$ instrumental variables

The coefficients are overidentified if m>k. If m<k, the coefficients are underidentified and when m=k they are exactly identified. For estimation of the IV regression model we require exact identification or overidentification.
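
The counting rule above can be written down as a small R helper. This is only a sketch: `identification_status()` is a hypothetical function introduced here for illustration and is not part of any package.

```r
# Classify the identification status from the number of instruments (m)
# and the number of endogenous regressors (k).
# identification_status() is a hypothetical helper, not a package function.
identification_status <- function(m, k) {
  stopifnot(m >= 0, k >= 1)
  if (m > k) {
    "overidentified"
  } else if (m == k) {
    "exactly identified"
  } else {
    "underidentified"
  }
}

identification_status(m = 3, k = 2)  # more instruments than endogenous regressors
identification_status(m = 1, k = 1)  # as many instruments as endogenous regressors
identification_status(m = 1, k = 2)  # too few instruments, estimation not possible
```

Note that only the instruments $Z$ enter the count for $m$. The exogenous regressors $W$ do not help to identify the coefficients on the endogenous regressors.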

While computing both stages of TSLS individually is not a big deal in (12.1), the simple regression model with a single endogenous regressor, Key Concept 12.2 clarifies why resorting to TSLS functions like ivreg() is more convenient when the set of potentially endogenous regressors (and instruments) is large.

Estimating regression models with TSLS using multiple instruments by means of ivreg() is straightforward. There are, however, some subtleties in correctly specifying the regression formula.

Assume that you want to estimate the model $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 W_{1i} + u_i$ where $X_{1i}$ and $X_{2i}$ are endogenous regressors that shall be instrumented by $Z_{1i}$, $Z_{2i}$ and $Z_{3i}$, and $W_{1i}$ is an exogenous regressor. The corresponding data is available in a data.frame with column names y, x1, x2, w1, z1, z2 and z3. It might be tempting to specify the argument formula in your call of ivreg() as y ~ x1 + x2 + w1 | z1 + z2 + z3, which is wrong. As explained in the documentation of ivreg() (see ?ivreg), it is necessary to list all exogenous variables as instruments too, that is, to join them with +'s on the right of the vertical bar: y ~ x1 + x2 + w1 | w1 + z1 + z2 + z3, where w1 is "instrumenting itself".

If there is a large number of exogenous variables it may be convenient to provide an update formula with a . (this includes all variables except for the dependent variable) right after the | and to exclude all endogenous variables using a -. For example, if there is one exogenous regressor w1 and one endogenous regressor x1 with instrument z1, the appropriate formula would be y ~ w1 + x1 | w1 + z1 which is equivalent to y ~ w1 + x1 | . - x1 + z1.
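
The equivalence of the two ways of writing the formula can be checked on simulated data. The following is a minimal sketch: the data generating process, the seed and the object names (dat, fit_explicit, fit_update) are assumptions made for illustration only.

```r
library(AER)  # provides ivreg()

# simulated data (illustrative assumption, not from the book)
set.seed(1)
n  <- 500
z1 <- rnorm(n)                          # instrument
w1 <- rnorm(n)                          # exogenous regressor
v  <- rnorm(n)                          # shock that makes x1 endogenous
x1 <- 0.8 * z1 + 0.5 * w1 + v + rnorm(n)
y  <- 1 + 2 * x1 - 1 * w1 + v + rnorm(n)
dat <- data.frame(y, x1, w1, z1)

# explicit instrument list: w1 "instruments itself"
fit_explicit <- ivreg(y ~ w1 + x1 | w1 + z1, data = dat)

# update formula: start from all regressors, drop x1, add z1
fit_update <- ivreg(y ~ w1 + x1 | . - x1 + z1, data = dat)

# both specifications should give identical coefficient estimates
all.equal(coef(fit_explicit), coef(fit_update))
```

With many exogenous regressors the update form is shorter and less error-prone, since an exogenous variable cannot be forgotten on the right of the vertical bar.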

Key Concept 12.2

Two-Stage Least Squares

Similarly to the simple IV regression model, the general IV model (12.5) can be estimated using the two-stage least squares estimator:

  1. First-stage regression(s)

    Run an OLS regression for each of the endogenous variables ($X_{1i},\dots,X_{ki}$) on all instrumental variables ($Z_{1i},\dots,Z_{mi}$), all exogenous variables ($W_{1i},\dots,W_{ri}$) and an intercept. Compute the fitted values ($\widehat{X}_{1i},\dots,\widehat{X}_{ki}$).

  2. Second-stage regression

    Regress the dependent variable on the predicted values of all endogenous regressors, all exogenous variables and an intercept using OLS. This gives $\widehat{\beta}_0^{TSLS},\dots,\widehat{\beta}_{k+r}^{TSLS}$, the TSLS estimates of the model coefficients.
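
The two stages can be carried out by hand with lm(). The sketch below uses simulated data with one endogenous regressor, two instruments and one exogenous regressor. The data generating process and all object names are assumptions chosen for illustration.

```r
# simulated data (illustrative assumption): true coefficient on x1 is 2
set.seed(42)
n  <- 1000
z1 <- rnorm(n)                          # instrument 1
z2 <- rnorm(n)                          # instrument 2
w1 <- rnorm(n)                          # exogenous regressor
v  <- rnorm(n)                          # shock shared by x1 and the error term
x1 <- 0.7 * z1 + 0.4 * z2 + 0.3 * w1 + v + rnorm(n)
y  <- 1 + 2 * x1 - 1 * w1 + v + rnorm(n)

# stage 1: regress the endogenous regressor on all instruments,
# all exogenous regressors and an intercept, then keep the fitted values
stage1 <- lm(x1 ~ z1 + z2 + w1)
x1_hat <- fitted(stage1)

# stage 2: regress y on the fitted values and the exogenous regressor
stage2 <- lm(y ~ x1_hat + w1)
coef(stage2)

# OLS ignores the endogeneity of x1 and is biased in this design
ols <- lm(y ~ x1 + w1)
coef(ols)
```

The coefficients of the second-stage regression are the TSLS estimates. The standard errors reported by lm() for the second stage are not valid, however, because they ignore that x1_hat is itself an estimate. This is one reason to prefer a dedicated function such as ivreg().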

In the general IV regression model, the instrument relevance and instrument exogeneity assumptions are the same as in the simple regression model with a single endogenous regressor and only one instrument. See Key Concept 12.3 for a recap using the terminology of general IV regression.

Key Concept 12.3

Two Conditions for Valid Instruments

For $Z_{1i},\dots,Z_{mi}$ to be a set of valid instruments, the following two conditions must be fulfilled:

  1. Instrument Relevance:

    If there are $k$ endogenous variables, $r$ exogenous variables and $m \geq k$ instruments $Z$, and the $\widehat{X}_{1i},\dots,\widehat{X}_{ki}$ are the predicted values from the $k$ population first-stage regressions, it must hold that $(\widehat{X}_{1i},\dots,\widehat{X}_{ki}, W_{1i},\dots,W_{ri}, 1)$ are not perfectly multicollinear. Here $1$ denotes the constant regressor which equals 1 for all observations.

    Note: If there is only one endogenous regressor $X_i$, at least one of the instruments $Z$ must have a non-zero coefficient in the population first-stage regression for this condition to hold: if all coefficients on the $Z$ are zero, the $\widehat{X}_i$ are a linear combination of the $W$ and the constant, so there is perfect multicollinearity.

  2. Instrument Exogeneity:

    All $m$ instruments must be uncorrelated with the error term,

    $$\rho_{Z_{1i},u_i} = 0, \dots, \rho_{Z_{mi},u_i} = 0.$$

One can show that if the IV regression assumptions presented in Key Concept 12.4 hold, the TSLS estimator of the coefficients in (12.5) is consistent and normally distributed when the sample size is large. Appendix 12.3 of the book provides a proof for the special case with a single regressor, a single instrument and no exogenous variables. The reasoning behind this carries over to the general IV model. Chapter 18 of the book gives a more involved treatment of the general case.

For our purposes it is sufficient to bear in mind that validity of the assumptions stated in Key Concept 12.4 allows us to obtain valid statistical inference using R functions which compute t-Tests, F-Tests and confidence intervals for model coefficients.
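
As an illustration of such inference, the sketch below computes robust t-tests and confidence intervals for a TSLS fit on simulated data. The data generating process and the object names (dat, fit, robust_vcov, ci) are assumptions made for this example. The functions ivreg(), vcovHC(), coeftest() and coefci() come from the packages AER, sandwich and lmtest.

```r
library(AER)  # attaches ivreg() as well as lmtest and sandwich

# simulated data (illustrative assumption)
set.seed(123)
n  <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
w1 <- rnorm(n)
v  <- rnorm(n)                          # makes x1 endogenous
x1 <- 0.6 * z1 + 0.6 * z2 + 0.5 * w1 + v + rnorm(n)
y  <- 1 + 2 * x1 - w1 + v + rnorm(n)
dat <- data.frame(y, x1, w1, z1, z2)

# TSLS with two instruments and one exogenous regressor
fit <- ivreg(y ~ x1 + w1 | w1 + z1 + z2, data = dat)

# heteroskedasticity-robust covariance matrix (HC1)
robust_vcov <- vcovHC(fit, type = "HC1")

# robust t-tests for all coefficients
coeftest(fit, vcov. = robust_vcov)

# robust 95% confidence intervals
ci <- coefci(fit, vcov. = robust_vcov, level = 0.95)
ci
```

Passing the covariance matrix explicitly keeps the t-tests and the confidence intervals based on the same robust standard errors.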

Key Concept 12.4

The IV Regression Assumptions

For the general IV regression model in Key Concept 12.1 we assume the following:

  1. $E(u_i \vert W_{1i},\dots,W_{ri}) = 0$.

  2. $(X_{1i},\dots,X_{ki}, W_{1i},\dots,W_{ri}, Z_{1i},\dots,Z_{mi})$ are i.i.d. draws from their joint distribution.

  3. All variables have nonzero finite fourth moments, i.e., outliers are unlikely.

  4. The Zs are valid instruments (see Key Concept 12.3).

Application to the Demand for Cigarettes

The estimated elasticity of the demand for cigarettes in (12.1) is $-1.08$. Although (12.1) was estimated using IV regression, it is plausible that this IV estimate is biased: in this model, the TSLS estimator is inconsistent for the true $\beta_1$ if the instrument (the real sales tax per pack) correlates with the error term. This is likely to be the case since there are economic factors, like state income, which impact the demand for cigarettes and correlate with the sales tax. States with high personal income tend to raise tax revenue through income taxes rather than sales taxes. Consequently, state income should be included in the regression model.

$$\log(Q_i^{cigarettes}) = \beta_0 + \beta_1 \log(P_i^{cigarettes}) + \beta_2 \log(income_i) + u_i \tag{12.6}$$

Before estimating (12.6) using ivreg() we define income as real per capita income rincome and append it to the data set CigarettesSW.

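# assumes the AER package is loaded and that rprice and salestax
# were added to CigarettesSW earlier in this chapter
# add rincome to the dataset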
CigarettesSW$rincome <- with(CigarettesSW, income / population / cpi)

c1995 <- subset(CigarettesSW, year == "1995")
# estimate the model
cig_ivreg2 <- ivreg(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + 
                    salestax, data = c1995)

coeftest(cig_ivreg2, vcov = vcovHC, type = "HC1")
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   9.43066    1.25939  7.4883 1.935e-09 ***
## log(rprice)  -1.14338    0.37230 -3.0711  0.003611 ** 
## log(rincome)  0.21452    0.31175  0.6881  0.494917    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We obtain

$$\widehat{\log(Q_i^{cigarettes})} = \underset{(1.26)}{9.43} - \underset{(0.37)}{1.14} \log(P_i^{cigarettes}) + \underset{(0.31)}{0.21} \log(income_i). \tag{12.7}$$

Following the book we add the cigarette-specific taxes ($cigtax_i$) as a further instrumental variable and estimate again using TSLS.

# add cigtax to the data set
CigarettesSW$cigtax <- with(CigarettesSW, tax/cpi)

c1995 <- subset(CigarettesSW, year == "1995")
# estimate the model
cig_ivreg3 <- ivreg(log(packs) ~ log(rprice) + log(rincome) | 
                    log(rincome) + salestax + cigtax, data = c1995)

coeftest(cig_ivreg3, vcov = vcovHC, type = "HC1")
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   9.89496    0.95922 10.3157 1.947e-13 ***
## log(rprice)  -1.27742    0.24961 -5.1177 6.211e-06 ***
## log(rincome)  0.28040    0.25389  1.1044    0.2753    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using the two instruments $salestax_i$ and $cigtax_i$ we have $m=2$ and $k=1$, so the coefficient on the endogenous regressor $\log(P_i^{cigarettes})$ is overidentified. The TSLS estimate of (12.6) is

$$\widehat{\log(Q_i^{cigarettes})} = \underset{(0.96)}{9.89} - \underset{(0.25)}{1.28} \log(P_i^{cigarettes}) + \underset{(0.25)}{0.28} \log(income_i). \tag{12.8}$$

Should we trust the estimates presented in (12.7) or rather rely on (12.8)? The estimates obtained using both instruments are more precise since all standard errors reported in (12.8) are smaller than in (12.7). In fact, the standard error for the estimate of the demand elasticity is only two thirds of the standard error when the sales tax is the only instrument used. This is because more information is used in estimating (12.8). If the instruments are valid, (12.8) can be considered more reliable.
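
The "two thirds" statement can be checked with a short calculation. The numbers below are the robust standard errors of the coefficient on log(rprice), copied from the two coeftest() outputs above. The object names are chosen for illustration.

```r
# robust standard errors of the coefficient on log(rprice),
# copied from the coeftest() outputs reported above
se_one_instrument  <- 0.37230  # sales tax as the only instrument, (12.7)
se_two_instruments <- 0.24961  # sales tax and cigarette tax, (12.8)

# ratio of the two standard errors
se_ratio <- se_two_instruments / se_one_instrument
round(se_ratio, 2)
```

The ratio is about 0.67, which is close to two thirds.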

However, without insights regarding the validity of the instruments it is not sensible to make such a statement. This stresses why checking instrument validity is essential. Chapter 12.3 briefly discusses guidelines for checking instrument validity and presents approaches that allow us to test for instrument relevance and exogeneity under certain conditions. These are then used in an application to the demand for cigarettes in Chapter 12.4.