Chapter 8 Heteroskedasticity

rm(list=ls()) #Removes all items in Environment!
library(lmtest) #for coeftest() and bptest().
library(broom) #for glance() and tidy()
library(PoEdata) #for PoE4 datasets
library(car) #for hccm() robust standard errors
library(sandwich) #for vcovHC() robust covariance matrices
library(knitr)
library(stargazer)

Reference for the package sandwich (Lumley and Zeileis 2015).

One of the assumptions of the Gauss-Markov theorem is homoskedasticity, which requires that all observations of the response (dependent) variable come from distributions with the same variance \(\sigma^2\). In many economic applications, however, the spread of \(y\) tends to depend on one or more of the regressors \(x\). For example, in the simple food regression model (Equation \ref{eq:foodagain8}), expenditure on food stays close to its mean (the regression line) at lower incomes but is more spread about its mean at higher incomes. Intuitively, people with higher incomes have more choice about whether to spend their extra income on food or on something else.

\[\begin{equation} food\_exp_{i}=\beta_{1}+\beta_{2}income_{i}+e_{i} \label{eq:foodagain8} \end{equation}\]

In the presence of heteroskedasticity, the coefficient estimators are still unbiased, but their variance is incorrectly calculated by the usual OLS method, which makes confidence intervals and hypothesis testing incorrect as well. Thus, new methods need to be applied to correct the variances.

8.1 Spotting Heteroskedasticity in Scatter Plots

When the variance of \(y\), or of \(e\), which is the same thing, is not constant, we say that the response or the residuals are heteroskedastic. Figure 8.1 shows, again, a scatter diagram of the food dataset with the regression line to show how the observations tend to be more spread at higher income.
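Before plotting the actual data, here is a tiny simulated illustration (my own construction, not from the PoE datasets) of what such a pattern looks like: when the standard deviation of the errors grows with \(x\), the scatter fans out around the regression line.

# Simulated heteroskedasticity: the error standard deviation
# is proportional to x, producing a funnel-shaped scatter
set.seed(12345)
x <- runif(200, 1, 10)
y <- 100 + 10*x + rnorm(200, mean=0, sd=2*x)
plot(x, y)
abline(lm(y~x))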

data("food",package="PoEdata")
mod1 <- lm(food_exp~income, data=food)
plot(food$income,food$food_exp, type="p",
     xlab="income", ylab="food expenditure")
abline(mod1)

Figure 8.1: Heteroskedasticity in the ‘food’ data

Another useful method to visualize possible heteroskedasticity is to plot the residuals against the regressors suspected of creating heteroskedasticity, or, more generally, against the fitted values of the regression. Figure 8.2 shows both these options for the simple food_exp model.

res <- residuals(mod1)
yhat <- fitted(mod1)
plot(food$income,res, xlab="income", ylab="residuals")
plot(yhat,res, xlab="fitted values", ylab="residuals")

Figure 8.2: Residual plots in the ‘food’ model

8.2 Heteroskedasticity Tests

Suppose the regression model we want to test for heteroskedasticity is the one in Equation \ref{eq:multireggen8}. \[\begin{equation} y_{i}=\beta_{1}+\beta_{2}x_{i2}+...+\beta_{K}x_{iK}+ e_{i} \label{eq:multireggen8} \end{equation}\]

The test we are constructing assumes that the variance of the errors is a function \(h\) of a number of regressors \(z_{s}\), which may or may not be present in the initial regression model that we want to test. Equation \ref{eq:hetfctn8} shows the general form of the variance function.

\[\begin{equation} var(y_{i})=E(e_{i}^2)=h(\alpha_{1}+\alpha_{2}z_{i2}+...+\alpha_{S}z_{iS}) \label{eq:hetfctn8} \end{equation}\]

The variance \(var(y_{i})\) is constant only if all the coefficients of the regressors \(z\) in Equation \ref{eq:hetfctn8} are zero, which provides the null hypothesis of our heteroskedasticity test, shown in Equation \ref{eq:hetHo8}.

\[\begin{equation} H_{0}: \alpha_{2}=\alpha_{3}=\,...\,=\alpha_{S}=0 \label{eq:hetHo8} \end{equation}\]

Since we do not know the true variances of the response variable \(y_{i}\), all we can do is estimate the residuals from the initial model in Equation \ref{eq:multireggen8} and replace \(e_{i}^2\) in Equation \ref{eq:hetfctn8} with the squared estimated residuals. For a linear function \(h()\), the test equation is, finally, Equation \ref{eq:hetres8}.

\[\begin{equation} \hat{e}_{i}^2=\alpha_{1}+\alpha_{2}z_{i2}+...+\alpha_{S}z_{iS}+\nu_{i} \label{eq:hetres8} \end{equation}\]

The relevant test statistic is \(\chi ^2\), given by Equation \ref{eq:chisq8}, where \(R^2\) is the coefficient of determination resulting from Equation \ref{eq:hetres8}.

\[\begin{equation} \chi ^2=N\times R^2 \sim \chi ^{2}_{(S-1)} \label{eq:chisq8} \end{equation}\]

The Breusch-Pagan heteroskedasticity test uses the method we have just described, where the regressors \(z_{s}\) are the variables \(x_{k}\) in the initial model. Let us apply this test to the food model. The function that determines a critical value of the \(\chi ^2\) distribution for a significance level \(\alpha\) and \(S-1\) degrees of freedom is qchisq(1-alpha, S-1).

alpha <- 0.05
mod1 <- lm(food_exp~income, data=food)
ressq <- resid(mod1)^2
#The test equation:
modres <- lm(ressq~income, data=food)
N <- nobs(modres)
gmodres <- glance(modres)
S <- gmodres$df #Number of Betas in model
#Chi-square is always a right-tail test
chisqcr <- qchisq(1-alpha, S-1)
Rsqres <- gmodres$r.squared
chisq <- N*Rsqres
pval <- 1-pchisq(chisq,S-1)

Our test yields a value of the test statistic \(\chi ^2\) of \(7.38\), which is to be compared to the critical \(\chi^{2}_{cr}\) having \(S-1=1\) degrees of freedom and \(\alpha = 0.05\). This critical value is \(\chi ^{2}_{cr}=3.84\). Since the calculated \(\chi ^2\) exceeds the critical value, we reject the null hypothesis of homoskedasticity, which means there is heteroskedasticity in our data and model. Alternatively, we can find the \(p\)-value corresponding to the calculated \(\chi^{2}\), \(p=0.007\).

Let us now do the same test, but using a White version of the residuals equation, in its quadratic form.

modres <- lm(ressq~income+I(income^2), data=food)
gmodres <- glance(modres)
Rsq <- gmodres$r.squared
S <- gmodres$df #Number of Betas in model
chisq <- N*Rsq
pval <- 1-pchisq(chisq, S-1)

The calculated \(p\)-value in this version is \(p=0.023\), which also implies rejection of the null hypothesis of homoskedasticity.

The function bptest() in package lmtest does (the robust version of) the Breusch-Pagan test in \(R\). The following code applies this function to the basic food equation, showing the results in Table 8.1, where ‘statistic’ is the calculated \(\chi^2\).

mod1 <- lm(food_exp~income, data=food)
kable(tidy(bptest(mod1)), 
caption="Breusch-Pagan heteroskedasticity test")
Table 8.1: Breusch-Pagan heteroskedasticity test

| statistic | p.value  | parameter | method                         |
|----------:|---------:|----------:|:-------------------------------|
|   7.38442 | 0.006579 |         1 | studentized Breusch-Pagan test |

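As a cross-check, the White variant computed manually above can also be obtained from bptest() by supplying the auxiliary regressors through its varformula argument; a minimal sketch, assuming mod1 and the food data are still in the workspace:

# White version of the test via bptest(); the p-value should
# match the 0.023 computed by hand above
bptest(mod1, varformula=~income+I(income^2), data=food)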
The Goldfeld-Quandt heteroskedasticity test is useful when the regression model to be tested includes an indicator variable among its regressors. The test compares the variance of one group of the indicator variable (say group \(1\)) to the variance of the benchmark group (say group \(0\)), as the null hypothesis in Equation \ref{eq:gqnull8} shows. \[\begin{equation} H_{0}:\hat{\sigma}^{2}_{1}=\hat{\sigma}^{2}_{0},\;\;\;\; H_{A}:\hat{\sigma}^{2}_{1}\neq \hat{\sigma}^{2}_{0} \label{eq:gqnull8} \end{equation}\]

The test statistic when the null hypothesis is true, given in Equation \ref{eq:gqf8}, has an \(F\) distribution with its two degrees of freedom equal to the degrees of freedom of the two subsamples, respectively \(N_{1}-K\) and \(N_{0}-K\).

\[\begin{equation} F=\frac{\hat{\sigma}^{2}_{1}}{\hat{\sigma}^{2}_{0}} \label{eq:gqf8} \end{equation}\]

Let us apply this test to a \(wage\) equation based on the dataset \(cps2\), where \(metro\) is an indicator variable equal to \(1\) if the individual lives in a metropolitan area and \(0\) for a rural area. I will split the dataset in two based on the indicator variable \(metro\) and apply the regression model (Equation \ref{eq:hetwage8}, but without the \(metro\) term, which is constant within each group) separately to each group.

\[\begin{equation} wage=\beta_{1}+\beta_{2}educ+\beta_{3}exper+\beta_{4}metro+e \label{eq:hetwage8} \end{equation}\]
alpha <- 0.05 #two tail, will take alpha/2
data("cps2", package="PoEdata")
#Create the two groups, m (metro) and r (rural)
m <- cps2[which(cps2$metro==1),]
r <- cps2[which(cps2$metro==0),]
wg1 <- lm(wage~educ+exper, data=m)
wg0 <- lm(wage~educ+exper, data=r)
df1 <- wg1$df.residual #Numerator degrees of freedom
df0 <- wg0$df.residual #Denominator df
sig1squared <- glance(wg1)$sigma^2
sig0squared <- glance(wg0)$sigma^2
fstat <- sig1squared/sig0squared
Flc <- qf(alpha/2, df1, df0)#Left (lower) critical F
Fuc <- qf(1-alpha/2, df1, df0) #Right (upper) critical F

The results of these calculations are as follows: calculated \(F\) statistic \(F=2.09\), lower-tail critical value \(F_{lc}=0.81\), and upper-tail critical value \(F_{uc}=1.26\). Since the calculated statistic is greater than the upper critical value, we reject the hypothesis that the two variances are equal and thus face a heteroskedasticity problem. If one expects the variance in the metropolitan area to be higher and wants to test the (alternative) hypothesis \(H_{0}:\sigma^{2}_{1}\leq \sigma^{2}_{0},\;\;\;\; H_{A}:\sigma^{2}_{1}>\sigma^{2}_{0}\), one needs to re-calculate the critical value for \(\alpha=0.05\) as follows:

Fc <- qf(1-alpha, df1, df0) #Right-tail test

The critical value for the right tail test is \(F_{c}=1.22\), which still implies rejecting the null hypothesis.

The Goldfeld-Quandt test can be used even when there is no indicator variable in the model or in the dataset. One can split the dataset in two using an arbitrary rule. Let us apply the method to the basic \(food\) equation, with the data split into low-income (\(li\)) and high-income (\(hi\)) halves. The cutoff point is, in this case, the median income, and the hypothesis to be tested is \[H_{0}: \sigma^{2}_{hi}\le \sigma^{2}_{li},\;\;\;\;H_{A}:\sigma^{2}_{hi} > \sigma^{2}_{li}\]

alpha <- 0.05
data("food", package="PoEdata")
medianincome <- median(food$income)
li <- food[which(food$income<=medianincome),]
hi <- food[which(food$income>=medianincome),]
eqli <- lm(food_exp~income, data=li)
eqhi <- lm(food_exp~income, data=hi)
dfli <- eqli$df.residual
dfhi <- eqhi$df.residual
sigsqli <- glance(eqli)$sigma^2
sigsqhi <- glance(eqhi)$sigma^2
fstat <- sigsqhi/sigsqli #The larger var in numerator
Fc <- qf(1-alpha, dfhi, dfli)
pval <- 1-pf(fstat, dfhi, dfli)

The resulting \(F\) statistic in the \(food\) example is \(F=3.61\), which is greater than the critical value \(F_{cr}=2.22\), rejecting the null hypothesis in favour of the alternative hypothesis that variance is higher at higher incomes. The \(p\)-value of the test is \(p=0.0046\).

In the package lmtest, \(R\) has a specialized function to perform Goldfeld-Quandt tests, the function gqtest(), which takes, among other arguments, the formula describing the model to be tested, a break point specifying how the data should be split (a fraction of the number of observations), the alternative hypothesis (“greater”, “two.sided”, or “less”), how the data should be ordered (order.by=), and data=. Let us apply gqtest() to the \(food\) example with the same partition as we used before.

foodeq <- lm(food_exp~income, data=food)
tst <- gqtest(foodeq, point=0.5, alternative="greater",
       order.by=food$income)
kable(tidy(tst), 
  caption="R function `gqtest()` with the 'food' equation")
Table 8.2: R function gqtest() with the ‘food’ equation

| df1 | df2 | statistic | p.value  | method               |
|----:|----:|----------:|---------:|:---------------------|
|  18 |  18 |   3.61476 | 0.004596 | Goldfeld-Quandt test |

Please note that the results from applying gqtest() (Table 8.2) are the same as those we have already calculated.

8.3 Heteroskedasticity-Consistent Standard Errors

Since the presence of heteroskedasticity makes the least-squares standard errors incorrect, another method of calculating them is needed. White's robust standard errors are one such method.

The \(R\) function that does this job is hccm(), which is part of the car package and yields a heteroskedasticity-robust coefficient covariance matrix. This matrix can then be used with other functions, such as coeftest() (instead of summary), waldtest() (instead of anova), or linearHypothesis() to perform hypothesis testing. The function hccm() takes several arguments, among which are the model for which we want the robust standard errors and the type of standard errors we wish to calculate. type can be “constant” (the regular homoskedastic errors), “hc0”, “hc1”, “hc2”, “hc3”, or “hc4”; “hc1” is the default type in some statistical software packages. Let us compute robust standard errors for the basic \(food\) equation and compare them with the regular (incorrect) ones.

foodeq <- lm(food_exp~income,data=food)
kable(tidy(foodeq),caption=
 "Regular standard errors in the 'food' equation")
Table 8.3: Regular standard errors in the ‘food’ equation

| term        | estimate | std.error | statistic | p.value  |
|:------------|---------:|----------:|----------:|---------:|
| (Intercept) |  83.4160 |  43.41016 |   1.92158 | 0.062182 |
| income      |  10.2096 |   2.09326 |   4.87738 | 0.000019 |

cov1 <- hccm(foodeq, type="hc1") #needs package 'car'
food.HC1 <- coeftest(foodeq, vcov.=cov1)
kable(tidy(food.HC1),caption=
  "Robust (HC1) standard errors in the 'food' equation")
Table 8.4: Robust (HC1) standard errors in the ‘food’ equation

| term        | estimate | std.error | statistic | p.value  |
|:------------|---------:|----------:|----------:|---------:|
| (Intercept) |  83.4160 |  27.46375 |   3.03731 | 0.004299 |
| income      |  10.2096 |   1.80908 |   5.64356 | 0.000002 |
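To see how much the choice of type matters, one can collect the robust standard errors produced by each hccm() variant side by side; a quick sketch, reusing the foodeq model estimated above:

# Robust standard errors for each hccm() type
types <- c("hc0", "hc1", "hc2", "hc3", "hc4")
sapply(types, function(t) sqrt(diag(hccm(foodeq, type=t))))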

When comparing Tables 8.3 and 8.4, it can be observed that the robust standard errors are smaller and, since the coefficients are the same, the \(t\)-statistics are higher and the \(p\)-values are smaller. Finding lower \(p\)-values with robust standard errors is, however, the exception rather than the rule.

Next is an example of using robust standard errors when performing a fictitious linear hypothesis test on the basic ‘andy’ model, testing the hypothesis \(H_{0}: \beta_{2}+\beta_{3}=0\).

data("andy", package="PoEdata")
andy.eq <- lm(sales~price+advert, data=andy)
bp <- bptest(andy.eq) #Heteroskedasticity test
b2 <- coef(andy.eq)[["price"]]
b3 <- coef(andy.eq)[["advert"]]
H0 <- "price+advert=0"
kable(tidy(linearHypothesis(andy.eq, H0,
  vcov=hccm(andy.eq, type="hc1"))),
  caption="Linear hypothesis with robust standard errors")
Table 8.5: Linear hypothesis with robust standard errors

| res.df | df | statistic | p.value |
|-------:|---:|----------:|--------:|
|     73 | NA |        NA |      NA |
|     72 |  1 |    23.387 |   7e-06 |
kable(tidy(linearHypothesis(andy.eq, H0)), 
  caption="Linear hypothesis with regular standard errors")
Table 8.6: Linear hypothesis with regular standard errors

| res.df |     rss | df |   sumsq | statistic | p.value  |
|-------:|--------:|---:|--------:|----------:|---------:|
|     73 | 2254.71 | NA |      NA |        NA |       NA |
|     72 | 1718.94 |  1 | 535.772 |   22.4415 | 0.000011 |

This example demonstrates how to introduce robust standard errors in a linearHypothesis() test. It also shows that, when heteroskedasticity is not significant (bptest() does not reject the homoskedasticity hypothesis), the robust and regular standard errors (and therefore the \(F\) statistics of the tests) are very similar.

Just for completeness, I should mention that a similar function, with similar uses, is vcovHC(), which can be found in the package sandwich.
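A minimal sketch of the sandwich equivalent of Table 8.4 (note that vcovHC() capitalizes its type names, e.g. "HC1"):

# Robust inference using sandwich::vcovHC() instead of car::hccm()
coeftest(foodeq, vcov.=vcovHC(foodeq, type="HC1"))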

8.4 GLS: Known Form of Variance

Let us consider the regression equation given in Equation \ref{eq:genheteq8}, where the errors are assumed heteroskedastic.

\[\begin{equation} y_{i}=\beta_{1}+\beta_{2}x_{i}+e_{i},\;\;\;var(e_{i})=\sigma_{i}^{2} \label{eq:genheteq8} \end{equation}\]

Heteroskedasticity implies a different variance of the error term for each observation. Ideally, one should estimate all \(N\) variances in order to obtain reliable standard errors, but this is not possible. The second-best solution in the absence of such estimates is to make an assumption about how the variance depends on one or several of the regressors. The estimator obtained under such an assumption is called a generalized least squares (GLS) estimator. One possibility is the error structure proposed in Equation \ref{eq:glsvardef8}, which assumes a linear relationship between the variance and the regressor \(x_{i}\), with the unknown parameter \(\sigma^2\) as a proportionality factor.

\[\begin{equation} var(e_{i})=\sigma_{i}^2=\sigma ^2 x_{i} \label{eq:glsvardef8} \end{equation}\]

One way to circumvent guessing a proportionality factor in Equation \ref{eq:glsvardef8} is to transform the initial model in Equation \ref{eq:genheteq8} such that the error variance in the new model is constant. This can be achieved by dividing the initial model through by \(\sqrt{x_{i}}\) and estimating the new model shown in Equation \ref{eq:glsstar8}, where \(y_{i}^{*}=y_{i}/\sqrt{x_{i}}\), \(x_{i1}^{*}=1/\sqrt{x_{i}}\), and \(x_{i2}^{*}=x_{i}/\sqrt{x_{i}}=\sqrt{x_{i}}\). If the variance structure in Equation \ref{eq:glsvardef8} is correct, then the resulting estimator is BLUE.

\[\begin{equation} y_{i}^{*}=\beta_{1}x_{i1}^{*}+\beta_{2}x_{i2}^{*}+e_{i}^{*} \label{eq:glsstar8} \end{equation}\]

In general, if the initial variables are multiplied by quantities that are specific to each observation, the resulting estimator is called a weighted least squares estimator, wls. Unlike the robust standard errors method for heteroskedasticity correction, gls or wls methods change the estimates of regression coefficients.

The function lm() can do wls estimation if the argument weights is provided in the form of a vector of the same size as the other variables in the model. \(R\) takes the square roots of the weights provided to multiply the variables in the regression. Thus, if you wish to multiply the model by \(\frac{1}{\sqrt {x_{i}}}\), the weights should be \(w_{i}=\frac{1}{x_{i}}\).
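As a sanity check on this convention, transforming the food model by hand with \(1/\sqrt{income}\) should reproduce the coefficients that lm() returns with weights=1/income; a small sketch (the names ystar, x1star, and x2star are mine):

# Manual WLS: divide the model through by sqrt(income); the
# transformed regression has no separate intercept
ystar <- food$food_exp/sqrt(food$income)
x1star <- 1/sqrt(food$income) #carries beta1
x2star <- sqrt(food$income)   #carries beta2
coef(lm(ystar~0+x1star+x2star))
# should match:
coef(lm(food_exp~income, weights=1/income, data=food))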

Let us apply these ideas to re-estimate the \(food\) equation, which we have determined to be affected by heteroskedasticity.

w <- 1/food$income
food.wls <- lm(food_exp~income, weights=w, data=food)
vcvfoodeq <- coeftest(foodeq, vcov.=cov1)
kable(tidy(foodeq), 
  caption="OLS estimates for the 'food' equation")
Table 8.7: OLS estimates for the ‘food’ equation

| term        | estimate | std.error | statistic | p.value  |
|:------------|---------:|----------:|----------:|---------:|
| (Intercept) |  83.4160 |  43.41016 |   1.92158 | 0.062182 |
| income      |  10.2096 |   2.09326 |   4.87738 | 0.000019 |

kable(tidy(food.wls),
  caption="WLS estimates for the 'food' equation" )
Table 8.8: WLS estimates for the ‘food’ equation

| term        | estimate | std.error | statistic | p.value  |
|:------------|---------:|----------:|----------:|---------:|
| (Intercept) |  78.6841 |  23.78872 |   3.30762 | 0.002064 |
| income      |  10.4510 |   1.38589 |   7.54100 | 0.000000 |

kable(tidy(vcvfoodeq),caption=
"OLS estimates for the 'food' equation with robust standard errors" )
Table 8.9: OLS estimates for the ‘food’ equation with robust standard errors

| term        | estimate | std.error | statistic | p.value  |
|:------------|---------:|----------:|----------:|---------:|
| (Intercept) |  83.4160 |  27.46375 |   3.03731 | 0.004299 |
| income      |  10.2096 |   1.80908 |   5.64356 | 0.000002 |

Tables 8.7, 8.8, and 8.9 compare the ordinary least squares model to a weighted least squares model and to OLS with robust standard errors. The WLS model multiplies the variables by \(1 \, / \, \sqrt{income}\), so the weights provided have to be \(w=1\,/\, income\). The effect of introducing the weights is a slightly lower intercept and, more importantly, different standard errors. Please note that the WLS standard errors are closer to the robust (HC1) standard errors than to the OLS ones.

8.5 Grouped Data

We have seen already (Equation \ref{eq:gqnull8}) how a dichotomous indicator variable splits the data in two groups that may have different variances. The generalized least squares method can account for group heteroskedasticity by choosing appropriate weights for each group: if the variables are transformed by multiplying them by \(1/\sigma_{j}\) for group \(j\), the resulting model is homoskedastic. Since \(\sigma_{j}\) is unknown, we replace it with its estimate \(\hat \sigma_{j}\). This method is named feasible generalized least squares.

data("cps2", package="PoEdata")
rural.lm <- lm(wage~educ+exper, data=cps2, subset=(metro==0))
sigR <- summary(rural.lm)$sigma
metro.lm <- lm(wage~educ+exper, data=cps2, subset=(metro==1))
sigM <- summary(metro.lm)$sigma
cps2$wght <- rep(0, nrow(cps2))
# Create a vector of weights
for (i in 1:1000) 
{
  if (cps2$metro[i]==0){cps2$wght[i] <- 1/sigR^2}
  else{cps2$wght[i] <- 1/sigM^2}
}
wge.fgls <- lm(wage~educ+exper+metro, weights=wght, data=cps2)
wge.lm <- lm(wage~educ+exper+metro, data=cps2)
wge.hce <- coeftest(wge.lm, vcov.=hccm(wge.lm, data=cps2))
stargazer(rural.lm, metro.lm, wge.fgls,wge.hce,
  header=FALSE, 
  title="OLS vs. FGLS estimates for the 'cps2' data",
  type=.stargazertype, # "html" or "latex" (in index.Rmd) 
  keep.stat="n",  # what statistics to print
  omit.table.layout="n",
  star.cutoffs=NA,
  digits=3, 
#  single.row=TRUE,
  intercept.bottom=FALSE, #moves the intercept coef to top
  column.labels=c("Rural","Metro","FGLS", "HC1"),
  dep.var.labels.include = FALSE,
  model.numbers = FALSE,
  dep.var.caption="Dependent variable: wage",
  model.names=FALSE,
  star.char=NULL) #suppresses the stars
OLS vs. FGLS estimates for the ‘cps2’ data

Dependent variable: wage

|              | Rural   | Metro   | FGLS    | HC1     |
|:-------------|--------:|--------:|--------:|--------:|
| Constant     | -6.166  | -9.052  | -9.398  | -9.914  |
|              | (1.899) | (1.189) | (1.020) | (1.218) |
| educ         | 0.956   | 1.282   | 1.196   | 1.234   |
|              | (0.133) | (0.080) | (0.069) | (0.084) |
| exper        | 0.126   | 0.135   | 0.132   | 0.133   |
|              | (0.025) | (0.018) | (0.015) | (0.016) |
| metro        |         |         | 1.539   | 1.524   |
|              |         |         | (0.346) | (0.346) |
| Observations | 192     | 808     | 1,000   |         |

The table titled “OLS vs. FGLS estimates for the ‘cps2’ data” compares the coefficients and standard errors of four models: OLS for the rural area, OLS for the metro area, feasible GLS on the whole dataset but with two types of weights (one for each area), and, finally, OLS with heteroskedasticity-consistent (HC1) standard errors. Please be reminded that the regular OLS standard errors are not to be trusted in the presence of heteroskedasticity.

The previous code sequence needs some explanation. It runs two regression models, rural.lm and metro.lm, just to estimate \(\hat \sigma_{R}\) and \(\hat \sigma_{M}\), which are needed to calculate the weights for each group. The subsets, this time, were selected directly in the lm() function through the argument subset=, which takes a logical expression that may involve one or more variables in the dataset. Then, I create a new vector of length equal to the number of observations in the dataset, to be populated over the next few code lines with weights; I choose to create this vector as a new column of the dataset cps2, named wght. With this, the hard part is done; I just need to run an lm() model with the option weights=wght, which gives the FGLS coefficients and standard errors.

The next lines make a for loop running through each observation. If observation \(i\) is a rural-area observation, it receives a weight equal to \(1/\sigma_{R}^2\); otherwise, it receives the weight \(1/\sigma_{M}^2\). Why did I square those \(\sigma\)s? Because, remember, the argument weights in the lm() function requires the square of the factor multiplying the regression model in the WLS method.
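The same weights can be built without a loop; a vectorized equivalent using ifelse():

# Vectorized construction of the same group weights
cps2$wght <- ifelse(cps2$metro==0, 1/sigR^2, 1/sigM^2)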

The remaining part of the code repeats models we ran before and places them in one table for making comparison easier.

8.6 GLS: Unknown Form of Variance

Suppose we wish to estimate the model in Equation \ref{eq:genericeq8}, where the errors are known to be heteroskedastic but their variance is an unknown function of \(S\) variables \(z_{s}\), which could be among the regressors in our model or other variables. \[\begin{equation} y_{i}=\beta_{1}+\beta_{2}x_{i2}+...+\beta_{K}x_{iK}+e_{i} \label{eq:genericeq8} \end{equation}\]

Equation \ref{eq:varfuneq8} uses the residuals from Equation \ref{eq:genericeq8} as estimates of the variances of the error terms and serves to estimate the functional form of the variance. If the assumed functional form of the variance is the exponential function \(var(e_{i})=\sigma_{i}^{2}=\sigma ^2 x_{i}^{\gamma}\), then the regressors \(z_{is}\) in Equation \ref{eq:varfuneq8} are the logs of the initial regressors \(x_{is}\), \(z_{is}=log(x_{is})\).

\[\begin{equation} ln(\hat{e}_{i}^{2})=\alpha_{1}+\alpha_{2}z_{i2}+...+\alpha_{S}z_{iS}+\nu_{i} \label{eq:varfuneq8} \end{equation}\]

The variance estimate for each error term in Equation \ref{eq:genericeq8} is \(\hat \sigma_{i}^2\), the exponential of the fitted value from Equation \ref{eq:varfuneq8}, which can then be used to construct a vector of weights for the regression model in Equation \ref{eq:genericeq8}. Let us follow these steps on the basic \(food\) equation, where we assume that the variance of error term \(i\) is an unknown exponential function of income. So, the purpose of the following code fragment is to determine the weights and to supply them to the lm() function. Remember, lm() multiplies each observation by the square root of the weight you supply. For instance, if you want to multiply the observations by \(1/\sigma_{i}\), you should supply the weight \(w_{i}=1/\sigma_{i}^2\).

data("food", package="PoEdata")
food.ols <- lm(food_exp~income, data=food)
ehatsq <- resid(food.ols)^2
sighatsq.ols  <- lm(log(ehatsq)~log(income), data=food)
vari <- exp(fitted(sighatsq.ols))
food.fgls <- lm(food_exp~income, weights=1/vari, data=food)
stargazer(food.ols, food.HC1, food.wls, food.fgls,
  header=FALSE, 
  title="Comparing various 'food' models",
  type=.stargazertype, # "html" or "latex" (in index.Rmd) 
  keep.stat="n",  # what statistics to print
  omit.table.layout="n",
  star.cutoffs=NA,
  digits=3, 
#  single.row=TRUE,
  intercept.bottom=FALSE, #moves the intercept coef to top
  column.labels=c("OLS","HC1","WLS","FGLS"),
  dep.var.labels.include = FALSE,
  model.numbers = FALSE,
  dep.var.caption="Dependent variable: 'food expenditure'",
  model.names=FALSE,
  star.char=NULL) #suppresses the stars
Comparing various ‘food’ models

Dependent variable: ‘food expenditure’

|              | OLS      | HC1      | WLS      | FGLS    |
|:-------------|---------:|---------:|---------:|--------:|
| Constant     | 83.416   | 83.416   | 78.684   | 76.054  |
|              | (43.410) | (27.464) | (23.789) | (9.713) |
| income       | 10.210   | 10.210   | 10.451   | 10.633  |
|              | (2.093)  | (1.809)  | (1.386)  | (0.972) |
| Observations | 40       |          | 40       | 40      |

The table titled “Comparing various ‘food’ models” shows that the FGLS with unknown variances model substantially lowers the standard errors of the coefficients, which in turn increases the \(t\)-ratios (since the point estimates of the coefficients remain about the same), making an important difference for hypothesis testing.

For a few classes of variance functions, the weights in a GLS model can be calculated in \(R\) using the varFunc() and varWeights() functions in the package nlme.
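For instance (a sketch, not part of the original example), a GLS fit of the food equation with error variance proportional to income can be specified with the varFixed() variance function:

# GLS via nlme::gls(); varFixed(~income) fixes var(e_i)
# proportional to income_i
library(nlme)
food.gls <- gls(food_exp~income, data=food, weights=varFixed(~income))
summary(food.gls)$tTable #coefficients with GLS standard errors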

8.7 Heteroskedasticity in the Linear Probability Model

As we have already seen, the linear probability model is, by definition, heteroskedastic, with the variance of the error term determined by the binomial probability \(p\) that \(y\) equals \(1\): \(var(y)=p(1-p)\), where \(p\) is defined in Equation \ref{eq:binomialp8}.

\[\begin{equation} p=E(y)=\beta_{1}+\beta_{2}x_{2}+...+\beta_{K}x_{K} \label{eq:binomialp8} \end{equation}\]

Thus, the linear probability model provides a known form of the variance to be used with GLS, taking care that none of the estimated variances is negative. One way to avoid negative (or greater-than-one) estimated probabilities is to artificially limit them to the interval \((0,1)\).

Let us revisit the \(coke\) model in dataset coke using this structure of the variance.

data("coke", package="PoEdata")
coke.ols <- lm(coke~pratio+disp_coke+disp_pepsi, data=coke)
coke.hc1 <- coeftest(coke.ols, vcov.=hccm(coke.ols, type="hc1"))
p <- fitted(coke.ols)
# Truncate negative or >1 values of p
pt<-p
pt[pt<0.01] <- 0.01
pt[pt>0.99] <- 0.99
sigsq <- pt*(1-pt)
wght <- 1/sigsq
coke.gls.trunc <- lm(coke~pratio+disp_coke+disp_pepsi, 
                  data=coke, weights=wght)
# Eliminate negative or >1 values of p
p1 <- p
p1[p1<0.01 | p1>0.99] <- NA
sigsq <- p1*(1-p1)
wght <- 1/sigsq
coke.gls.omit <- lm(coke~pratio+disp_coke+disp_pepsi, 
                  data=coke, weights=wght)
stargazer(coke.ols, coke.hc1, coke.gls.trunc, coke.gls.omit,
  header=FALSE, 
  title="Comparing various 'coke' models",
  type=.stargazertype, # "html" or "latex" (in index.Rmd) 
  keep.stat="n",  # what statistics to print
  omit.table.layout="n",
  star.cutoffs=NA,
  digits=4, 
#  single.row=TRUE,
  intercept.bottom=FALSE, #moves the intercept coef to top
  column.labels=c("OLS","HC1","GLS-trunc","GLS-omit"),
  dep.var.labels.include = FALSE,
  model.numbers = FALSE,
  dep.var.caption="Dependent variable: 'choice of coke'",
  model.names=FALSE,
  star.char=NULL) #suppresses the stars
Comparing various ‘coke’ models

Dependent variable: ‘choice of coke’

|              | OLS      | HC1      | GLS-trunc | GLS-omit |
|:-------------|---------:|---------:|----------:|---------:|
| Constant     | 0.8902   | 0.8902   | 0.6505    | 0.8795   |
|              | (0.0655) | (0.0653) | (0.0568)  | (0.0594) |
| pratio       | -0.4009  | -0.4009  | -0.1652   | -0.3859  |
|              | (0.0613) | (0.0604) | (0.0444)  | (0.0527) |
| disp_coke    | 0.0772   | 0.0772   | 0.0940    | 0.0760   |
|              | (0.0344) | (0.0339) | (0.0399)  | (0.0353) |
| disp_pepsi   | -0.1657  | -0.1657  | -0.1314   | -0.1587  |
|              | (0.0356) | (0.0344) | (0.0354)  | (0.0360) |
| Observations | 1,140    |          | 1,140     | 1,124    |

References

Lumley, Thomas, and Achim Zeileis. 2015. Sandwich: Robust Covariance Matrix Estimators. https://CRAN.R-project.org/package=sandwich.