Chapter 10 Random Regressors

rm(list=ls()) #Removes all items in Environment!
library(AER) #for `ivreg()`
library(lmtest) #for `coeftest()` and `bptest()`.
library(broom) #for `glance()` and `tidy()`
library(PoEdata) #for PoE4 datasets
library(car) #for `hccm()` robust standard errors
library(knitr) #for making neat tables with `kable()`
library(stargazer) #for side-by-side model tables with `stargazer()`

In most data coming from natural, non-controlled phenomena, both the dependent and independent variables are random. In many cases, some of the independent variables are also correlated with the error term in a regression model, which makes the OLS estimator biased and inconsistent. Regressors (\(x\)) that are correlated with the error term are called endogenous; those that are not are called exogenous. The remedy for this violation of the linear regression assumptions is the use of instrumental variables, or instruments, which are variables (\(z\)) that do not directly influence the response and are uncorrelated with the error term, but are correlated with the endogenous regressor in question.
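The effect of endogeneity can be seen in a small simulation. The sketch below is only an illustration on artificial data: the variable names, the sample size, and the true slope of 1 are all assumptions made for the example. It compares the OLS slope with the simple IV estimator \(cov(z,y)/cov(z,x)\).

```r
# Artificial data; every name and number here is an assumption
set.seed(123)
n <- 10000
z <- rnorm(n)                  # instrument: moves x, is not in the error
ability <- rnorm(n)            # unobserved factor hidden in the error term
x <- z + ability + rnorm(n)    # endogenous regressor
e <- ability + rnorm(n)        # error term, correlated with x
y <- 1 + 1*x + e               # the true slope is 1

b_ols <- unname(coef(lm(y ~ x))["x"])  # inconsistent: converges to 1 + 1/3
b_iv  <- cov(z, y) / cov(z, x)         # simple IV estimator, consistent
c(OLS = b_ols, IV = b_iv)
```

In this design \(cov(x,e)=1\) and \(var(x)=3\), so the OLS slope overstates the true effect by about one third, while the IV estimate should stay close to 1.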

10.1 The Instrumental Variables (IV) Method

A strong instrument, one that is highly correlated with the endogenous regressor it concerns, reduces the variance of the estimated coefficient. Assume the multiple regression model in Equation \ref{eq:multreg10}, where regressors \(x_{2}\) to \(x_{K-1}\) are exogenous and \(x_{K}\) is endogenous. The IV method consists of two stages: first, regress \(x_{K}\) on all the other regressors and all the instruments and create the fitted values series, \(\hat{x}_{K}\); second, estimate the initial equation with \(x_{K}\) replaced by \(\hat{x}_{K}\). For this reason, the IV method is often called two-stage least squares, or 2SLS.

\[\begin{equation} y=\beta_{1}+\beta_{2}x_{2}+...+\beta_{K}x_{K}+e \label{eq:multreg10} \end{equation}\]

Consider the \(wage\) model in Equation \ref{eq:wagelm10} using the \(mroz\) dataset. The notorious difficulty with this model is that the error term may include some unobserved attributes, such as personal ability, that determine both wage and education. In other words, the independent variable \(educ\) is correlated with the error term; it is endogenous.

\[\begin{equation} log(wage)=\beta_{1}+\beta_{2}educ+\beta_{3}exper+\beta_{4}exper^2+e \label{eq:wagelm10} \end{equation}\]

An instrument that may address the endogeneity of \(educ\) is \(mothereduc\): we can reasonably assume that a mother’s education does not directly influence her daughter’s wage, but does influence the daughter’s education.

Let us first carry out an explicit two-stage model with only one instrument, \(mothereduc\). The first stage is to regress \(educ\) on other regressors and the instrument, as Equation \ref{eq:firstStageEduc10} shows.

\[\begin{equation} educ=\gamma_{1}+\gamma_{2}exper+\gamma_{3}exper^2+\theta_{1}mothereduc+\nu_{educ} \label{eq:firstStageEduc10} \end{equation}\]
data("mroz", package="PoEdata")
mroz1 <- mroz[mroz$lfp==1,] #restricts sample to lfp=1
educ.ols <- lm(educ~exper+I(exper^2)+mothereduc, data=mroz1)
kable(tidy(educ.ols), digits=4, align='c',caption=
  "First stage in the 2SLS model for the 'wage' equation")
Table 10.1: First stage in the 2SLS model for the ‘wage’ equation
term          estimate   std.error   statistic   p.value
(Intercept)     9.7751      0.4239     23.0605    0.0000
exper           0.0489      0.0417      1.1726    0.2416
I(exper^2)     -0.0013      0.0012     -1.0290    0.3040
mothereduc      0.2677      0.0311      8.5992    0.0000

The \(p\)-value for \(mothereduc\) is very low (see Table 10.1), indicating a strong correlation between this instrument and the endogenous variable \(educ\) even after controlling for other variables. The second stage in the two-stage procedure is to create the fitted values of \(educ\) from the first stage (Equation \ref{eq:firstStageEduc10}) and plug them into the model of interest, Equation \ref{eq:wagelm10}, to replace the original variable \(educ\).

educHat <- fitted(educ.ols)
wage.2sls <- lm(log(wage)~educHat+exper+I(exper^2), data=mroz1)
kable(tidy(wage.2sls), digits=4, align='c',caption=
  "Second stage in the 2SLS model for the 'wage' equation")
Table 10.2: Second stage in the 2SLS model for the ‘wage’ equation
term          estimate   std.error   statistic   p.value
(Intercept)     0.1982      0.4933      0.4017    0.6881
educHat         0.0493      0.0391      1.2613    0.2079
exper           0.0449      0.0142      3.1668    0.0017
I(exper^2)     -0.0009      0.0004     -2.1749    0.0302

The results of the explicit \(2SLS\) procedure are shown in Table 10.2. Keep in mind, however, that the standard errors calculated in this way are incorrect, because the second-stage residuals are computed with the fitted values of \(educ\) instead of the observed ones. The correct method is to use a dedicated software function to estimate an instrumental variable model. In \(R\), such a function is ivreg().
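To see where the second-stage standard errors go wrong, the following sketch repeats the two stages on artificial data and then recomputes the standard errors by hand. It is a minimal illustration under assumptions of my own (simulated variables x, w, z and one instrument), not the internal code of ivreg(): the error variance is estimated from residuals that use the observed regressor, while the cross-product matrix uses the fitted one.

```r
# Artificial data; all names and coefficients are assumptions
set.seed(42)
n <- 2000
z <- rnorm(n); w <- rnorm(n)     # z: instrument, w: exogenous regressor
u <- rnorm(n)                    # unobserved factor causing endogeneity
x <- 0.8*z + 0.5*w + u + rnorm(n)
y <- 1 + 0.5*x + 0.3*w + u + rnorm(n)

stage1 <- lm(x ~ w + z)
xHat   <- fitted(stage1)
stage2 <- lm(y ~ xHat + w)       # correct coefficients, wrong std. errors
b      <- coef(stage2)

X    <- cbind(1, x, w)           # observed regressors
XHat <- cbind(1, xHat, w)        # regressors with x replaced by xHat
res  <- drop(y - X %*% b)        # residuals must use x, not xHat
sigma2     <- sum(res^2) / (n - ncol(X))
se_correct <- sqrt(diag(sigma2 * solve(crossprod(XHat))))
se_naive   <- sqrt(diag(vcov(stage2)))
round(cbind(estimate = b, se_naive, se_correct), 4)
```

The two sets of standard errors differ only by a common factor, the ratio of the two estimated error standard deviations, which is why the coefficients agree while the inference does not.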

data("mroz", package="PoEdata")
mroz1 <- mroz[mroz$lfp==1,] #restricts sample to lfp=1.
mroz1.ols <- lm(log(wage)~educ+exper+I(exper^2), data=mroz1)
mroz1.iv <- ivreg(log(wage)~educ+exper+I(exper^2)|
            exper+I(exper^2)+mothereduc, data=mroz1)
mroz1.iv1 <- ivreg(log(wage)~educ+exper+I(exper^2)|
            exper+I(exper^2)+mothereduc+fathereduc, data=mroz1)
stargazer(mroz1.ols, wage.2sls, mroz1.iv, mroz1.iv1,
  title="Wage equation: OLS, 2SLS, and IV models compared",
  type=.stargazertype, # "html" or "latex" (in index.Rmd) 
  keep.stat="n",  # what statistics to print
#  single.row=TRUE,
  intercept.bottom=FALSE, #moves the intercept coef to top
  column.labels=c("OLS","explicit 2SLS", "IV mothereduc", 
                  "IV mothereduc and fathereduc"),
  dep.var.labels.include = FALSE,
  model.numbers = FALSE,
  dep.var.caption="Dependent variable: wage",
  star.char=NULL) #suppresses the stars
Wage equation: OLS, 2SLS, and IV models compared
Dependent variable: wage
               OLS        explicit 2SLS   IV mothereduc   IV mothereduc and fathereduc
Constant       -0.5220    0.1982          0.1982          0.0481
               (0.1986)   (0.4933)        (0.4729)        (0.4003)
educ           0.1075                     0.0493          0.0614
               (0.0141)                   (0.0374)        (0.0314)
educHat                   0.0493
                          (0.0391)
exper          0.0416     0.0449          0.0449          0.0442
               (0.0132)   (0.0142)        (0.0136)        (0.0134)
I(exper^2)     -0.0008    -0.0009         -0.0009         -0.0009
               (0.0004)   (0.0004)        (0.0004)        (0.0004)
Observations   428        428             428             428

The table titled “Wage equation: OLS, 2SLS, and IV models compared” shows that the estimated effect of education on wage is smaller in the IV models than in the OLS model. It also shows that the explicit 2SLS model and the IV model with only the \(mothereduc\) instrument yield the same coefficients (the \(educ\) in the IV model is equivalent to the \(educHat\) in 2SLS), but the standard errors are different. The correct ones are those provided by the IV model.

A few observations are in order concerning the above code sequence. First, since some of the individuals are not in the labor force, their wages are zero and the log cannot be calculated. I excluded those observations by keeping only those for which \(lfp\) is equal to 1. Second, the instrument list in the command ivreg() includes both the instrument itself (\(mothereduc\)) and all exogenous regressors, which are, so to speak, their own instruments. The vertical bar character | separates the proper regressor list from the instrument list.

To test for weak instruments in the \(wage\) equation, we just test the joint significance of the instruments in an \(educ\) model as shown in Equation \ref{eq:educlm10}.

\[\begin{equation} educ=\gamma_{1}+\gamma_{2}exper+\gamma_{3} exper^2+\theta_{1} mothereduc+ \theta_{2}fathereduc+\nu \label{eq:educlm10} \end{equation}\]
educ.ols <- lm(educ~exper+I(exper^2)+mothereduc+fathereduc, 
               data=mroz1)
tab <- tidy(educ.ols)
kable(tab, digits=4,
      caption="The 'educ' first-stage equation")
Table 10.3: The ‘educ’ first-stage equation
term          estimate   std.error   statistic   p.value
(Intercept)     9.1026      0.4266     21.3396    0.0000
exper           0.0452      0.0403      1.1236    0.2618
I(exper^2)     -0.0010      0.0012     -0.8386    0.4022
mothereduc      0.1576      0.0359      4.3906    0.0000
fathereduc      0.1895      0.0338      5.6152    0.0000
linearHypothesis(educ.ols, c("mothereduc=0", "fathereduc=0"))
Res.Df       RSS   Df   Sum of Sq         F   Pr(>F)
   425   2219.22   NA          NA        NA       NA
   423   1758.58    2     460.641   55.4003        0

The test rejects the null hypothesis that both the \(mothereduc\) and \(fathereduc\) coefficients are zero, indicating that at least one instrument is strong. As a rule of thumb, an instrument is considered strong only if the null hypothesis is soundly rejected, with an \(F\)-statistic greater than 10 or, for only one instrument, a \(t\)-statistic greater than 3.16 in absolute value.
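The two thresholds in the rule of thumb are the same criterion, since with a single instrument the \(F\)-statistic is the square of the \(t\)-statistic and \(\sqrt{10}\approx 3.16\). The sketch below checks this on artificial data and uses base R's anova() to compare restricted and unrestricted first-stage models, as an alternative to linearHypothesis(); the variables and coefficients are assumptions made for the example.

```r
# Artificial first-stage data; all names and numbers are assumptions
set.seed(1)
n  <- 500
w  <- rnorm(n)                     # exogenous regressor
z1 <- rnorm(n); z2 <- rnorm(n)     # two candidate instruments
x  <- 1 + 0.5*w + 0.4*z1 + 0.3*z2 + rnorm(n)   # endogenous regressor

# Joint test of both instruments: restricted vs. unrestricted first stage
fs_r <- lm(x ~ w)
fs_u <- lm(x ~ w + z1 + z2)
F_joint <- anova(fs_r, fs_u)$F[2]

# With one instrument, the F-statistic equals the squared t-statistic
fs_1 <- lm(x ~ w + z1)
t_z1 <- summary(fs_1)$coefficients["z1", "t value"]
F_z1 <- anova(fs_r, fs_1)$F[2]
c(F_joint = F_joint, F_z1 = F_z1, t_squared = t_z1^2)
```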

For a model to be identified, the number of instruments should be at least equal to the number of endogenous variables. If there are more instruments than endogenous variables, the model is said to be overidentified.

10.2 Specification Tests

We have seen before how to test for weak instruments with only one instrument. This test can be extended to several instruments. The null hypothesis is \(H_{0}\): “All instruments are weak”.

Since using IV when it is not necessary worsens our estimates, we would like to test whether the variables that worry us are indeed endogenous. This problem is addressed by the Hausman test for endogeneity, where the null hypothesis is \(H_{0}:\;Cov(x,e)=0\). Thus, rejecting the null hypothesis indicates the existence of endogeneity and the need for instrumental variables.
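One common way to carry out this test is the regression-based version: save the first-stage residuals, add them to the equation of interest, and test the significance of their coefficient. The sketch below does this on artificial data in which the regressor is endogenous by construction; the names and numbers are assumptions, and the code is an illustration rather than the routine used by ivreg().

```r
# Artificial data; x is endogenous by construction (assumption)
set.seed(7)
n <- 1000
z <- rnorm(n)                  # instrument
u <- rnorm(n)                  # unobserved factor shared by x and the error
x <- z + u + rnorm(n)
y <- 1 + 0.5*x + u + rnorm(n)

vHat <- resid(lm(x ~ z))       # first-stage residuals
aug  <- lm(y ~ x + vHat)       # equation of interest augmented with vHat
hausman <- summary(aug)$coefficients["vHat", ]
hausman   # a significant vHat coefficient signals endogeneity
```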

The test for the validity of instruments (whether the instruments are correlated with the error term) can only be performed for the extra instruments, those that are in excess of the number of endogenous variables. This test is sometimes called a test for overidentifying restrictions, or the Sargan test. The null hypothesis is that the covariance between the instrument and the error term is zero, \(H_{0}:Cov(z,e)=0\). Thus, rejecting the null indicates that at least one of the extra instruments is not valid.
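A textbook form of the Sargan statistic is \(N R^2\) from a regression of the IV residuals on all instruments, compared with a \(\chi^2\) distribution whose degrees of freedom equal the number of extra instruments. The sketch below computes it by hand on artificial data with two instruments and one endogenous regressor; the setup is an assumption made for illustration, and the result need not match the output of a packaged routine digit for digit.

```r
# Artificial data: two instruments, one endogenous regressor (assumption)
set.seed(11)
n  <- 1000
z1 <- rnorm(n); z2 <- rnorm(n)
u  <- rnorm(n)
x  <- 0.7*z1 + 0.7*z2 + u + rnorm(n)
y  <- 1 + 0.5*x + u + rnorm(n)

xHat <- fitted(lm(x ~ z1 + z2))            # first stage
b    <- unname(coef(lm(y ~ xHat)))         # 2SLS coefficients
e_iv <- y - b[1] - b[2]*x                  # IV residuals use x, not xHat
r2   <- summary(lm(e_iv ~ z1 + z2))$r.squared
sargan <- n * r2                           # N times R-squared
df     <- 2 - 1                            # instruments minus endogenous
p_val  <- pchisq(sargan, df, lower.tail = FALSE)
c(statistic = sargan, df = df, p.value = p_val)
```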

\(R\) performs these three tests automatically and reports the results when the summary() function is applied to an ivreg() model with the argument diagnostics=TRUE.

summary(mroz1.iv1, diagnostics=TRUE)
## Call:
## ivreg(formula = log(wage) ~ educ + exper + I(exper^2) | exper + 
##     I(exper^2) + mothereduc + fathereduc, data = mroz1)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0986 -0.3196  0.0551  0.3689  2.3493 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.048100   0.400328    0.12   0.9044   
## educ         0.061397   0.031437    1.95   0.0515 . 
## exper        0.044170   0.013432    3.29   0.0011 **
## I(exper^2)  -0.000899   0.000402   -2.24   0.0257 * 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   2 423     55.40  <2e-16 ***
## Wu-Hausman         1 423      2.79   0.095 .  
## Sargan             1  NA      0.38   0.539    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.675 on 424 degrees of freedom
## Multiple R-Squared: 0.136,   Adjusted R-squared: 0.13 
## Wald test: 8.14 on 3 and 424 DF,  p-value: 0.0000279

The results for the wage equation are as follows:

  • Weak instruments test: rejects the null, meaning that at least one instrument is strong
  • (Wu-)Hausman test for endogeneity: rejects the null that the variable of concern is uncorrelated with the error term only at the 10% level (\(p\)-value 0.095), indicating that \(educ\) is marginally endogenous
  • Sargan overidentifying restrictions: does not reject the null, meaning that the extra instruments are valid (are uncorrelated with the error term).

The test for weak instruments might be unreliable with more than one endogenous regressor, though, because there is one \(F\)-statistic for each endogenous regressor. An alternative is the Cragg-Donald test based on the statistic shown in Equation \ref{eq:CraggDonaldStat10}, where \(N\) is the number of observations, \(G\) is the number of exogenous regressors, \(B\) is the number of endogenous regressors, \(L\) is the number of external instruments, and \(r_{B}\) is the lowest canonical correlation between the endogenous regressors and the external instruments, after the influence of the exogenous regressors has been removed from both sets (canonical correlations are calculated by the function cancor() in \(R\)).

\[\begin{equation} F=\frac{N-G-B}{L} \frac{r_{B}^2}{1-r_{B}^2} \label{eq:CraggDonaldStat10} \end{equation}\]
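The arithmetic of Equation \ref{eq:CraggDonaldStat10} can be wrapped in a small helper. The function name cragg_donald_F() is hypothetical (it does not belong to any package), and the canonical correlation of 0.1 used below is an invented value chosen only to show the calculation.

```r
# Hypothetical helper implementing the Cragg-Donald F formula
cragg_donald_F <- function(N, G, B, L, rB) {
  ((N - G - B) / L) * (rB^2 / (1 - rB^2))
}

# Illustrative inputs: N = 428 observations, G = 2, B = 2, L = 2,
# and an invented lowest canonical correlation rB = 0.1
F_example <- cragg_donald_F(N = 428, G = 2, B = 2, L = 2, rB = 0.1)
F_example   # (424/2) * (0.01/0.99), about 2.14
```

Because \(r_{B}^2/(1-r_{B}^2)\) is close to \(r_{B}^2\) when the correlation is small, a low canonical correlation translates almost directly into a low \(F\)-statistic.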

Let us look at the hours equation with two endogenous variables, \(mtr\) and \(educ\), and two external instruments, \(mothereduc\) and \(fathereduc\). The variable \(mtr\) is the wife’s marginal tax rate. The two exogenous regressors are \(kidsl6\), the number of children under six years of age, and \(nwifeinc\), the family income net of the wife’s income. Equation \ref{eq:HoursIVeqn10} shows this model; the dataset is mroz, restricted to women that are in the labor force.

\[\begin{equation} hours=\beta_{1}+\beta_{2}mtr+\beta_{3}educ+\beta_{4}kidsl6+\beta_{5}nwifeinc+e \label{eq:HoursIVeqn10} \end{equation}\]

The next code sequence uses the \(R\) function cancor() to calculate the lowest of two canonical correlations, \(r_{B}\), which is needed for the Cragg-Donald \(F\)-statistic in Equation \ref{eq:CraggDonaldStat10}.

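data("mroz", package="PoEdata")
mroz1 <- mroz[which(mroz$wage>0),] #women in the labor force
#family income net of the wife's income, divided by 1000
nwifeinc <- (mroz1$faminc-mroz1$wage*mroz1$hours)/1000
#G exogenous regressors, B endogenous regressors, L instruments
G<-2; B<-2; L<-2; N<-nrow(mroz1)
#remove the influence of the exogenous regressors
x1 <- resid(lm(mtr~kidsl6+nwifeinc, data=mroz1))
x2 <- resid(lm(educ~kidsl6+nwifeinc, data=mroz1))
z1 <- resid(lm(mothereduc~kidsl6+nwifeinc, data=mroz1))
z2 <- resid(lm(fathereduc~kidsl6+nwifeinc, data=mroz1))
X <- cbind(x1,x2) #endogenous regressors, partialled out
Y <- cbind(z1,z2) #external instruments, partialled out
rB <- min(cancor(X,Y)$cor) #lowest canonical correlation
CraggDonaldF <- ((N-G-B)/L)*(rB^2/(1-rB^2))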

The result is the Cragg-Donald \(F=0.100806\), which is much smaller than the critical value of \(4.58\) given in Table 10E.1 of the textbook (Hill, Griffiths, and Lim 2011). The test therefore fails to reject the null hypothesis of weak instruments, contradicting my previous result.


Hill, R.C., W.E. Griffiths, and G.C. Lim. 2011. Principles of Econometrics. Wiley.