Chapter 6 Further Inference in Multiple Regression

rm(list=ls())
library(PoEdata) #for PoE datasets
library(knitr) #for referenced tables with kable()
library(xtable) #makes data frame for kable
library(printr) #automatically prints output nicely
library(effects)
library(car)
library(AER)
library(broom) #for tidy lm output and function glance()
library(stats)

New package: broom (Robinson 2016).

6.1 Joint Hypotheses and the F-statistic

A joint hypothesis is a set of relationships among regression parameters, relationships that need to be simultaneously true according to the null hypothesis. Joint hypotheses can be tested using the \(F\)-statistic that we have already met. Its formula is given by Equation \ref{eq:FstatFormula6}. The \(F\)-statistic has an \(F\) distribution with the degrees of freedom \(J\) and \(N-K\). \[\begin{equation} F=\frac{(SSE_{R}-SSE_{U})\,/\,J}{SSE_{U}\,/\,(N-K)} \sim F_{(J,\;N-K)} \label{eq:FstatFormula6} \end{equation}\]

In Equation \ref{eq:FstatFormula6} the subscript \(U\) stands for “unrestricted,” that is, the initial regression equation; the “restricted” equation is a new equation, obtained from the initial one, with the relationships in the null hypothesis assumed to hold. For example, if the initial equation is Equation \ref{eq:initeqF6} and the null hypothesis is Equation \ref{eq:nullHforF6}, then the restricted equation is Equation \ref{eq:restreqforF6}.

\[\begin{equation} y=\beta_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}+\beta_{4}x_{4}+e \label{eq:initeqF6} \end{equation}\] \[\begin{equation} H_{0}:\beta_{2}=0\;\;AND\;\;\beta_{3}=0; \;\;\;H_{A}:\beta_{2}\neq 0\;\;OR\;\;\beta_{3}\neq 0 \label{eq:nullHforF6} \end{equation}\] \[\begin{equation} y=\beta_{1}+\beta_{4}x_{4}+e \label{eq:restreqforF6} \end{equation}\]

The symbol \(J\) in the \(F\) formula (Equation \ref{eq:FstatFormula6}) is the first (numerator) degrees of freedom of the \(F\) statistic and is equal to the number of simultaneous restrictions in the null hypothesis (Equation \ref{eq:nullHforF6}); the second (the denominator) degrees of freedom of the \(F\)-statistic is \(N-K\), which is the usual degrees of freedom of the unrestricted regression model (Equation \ref{eq:initeqF6}). The practical procedure to test a joint hypothesis like the one in Equation \ref{eq:nullHforF6} is to estimate the two regressions (unrestricted and restricted) and to calculate the \(F\)-statistic.
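As a minimal sketch of this procedure (assuming a hypothetical data frame dat with columns y, x2, x3, and x4, which is not part of any PoE dataset), the calculation would look as follows:

unrestricted <- lm(y~x2+x3+x4, data=dat)  # Equation \ref{eq:initeqF6}; 'dat' is hypothetical
restricted <- lm(y~x4, data=dat)          # Equation \ref{eq:restreqforF6}
SSEu <- sum(resid(unrestricted)^2)
SSEr <- sum(resid(restricted)^2)
J <- 2                          # two restrictions in the null hypothesis
N <- nobs(unrestricted)         # number of observations
K <- length(coef(unrestricted)) # number of betas in the unrestricted model
Fstat <- ((SSEr-SSEu)/J) / (SSEu/(N-K))
pval <- 1-pf(Fstat, J, N-K)     # reject H0 at level alpha if pval < alpha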

6.2 Testing Simultaneous Hypotheses

Let’s look, again, at the quadratic form of the andy equation (Equation \ref{eq:quadraticandyagain6}).

\[\begin{equation} sales=\beta_{1}+\beta_{2}price+\beta_{3}advert+\beta_{4}advert^2+e \label{eq:quadraticandyagain6} \end{equation}\]

Equation \ref{eq:quadraticandyagain6} has two terms that involve the regressor \(advert\), of which at least one needs to be significant for a relationship between \(advert\) and \(sales\) to be established. To test if such a relationship exists, we can formulate the following test:

\[\begin{equation} H_{0}:\beta_{3}=0\;\; AND \;\;\beta_{4}=0;\;\;\;H_{A}:\beta_{3}\neq 0\;\; OR\;\; \beta_{4}\neq 0 \label{jointhypandysq6} \end{equation}\]

I have already mentioned that \(R\) can do an \(F\) test quite easily (remember the function linearHypothesis?), but for learning purposes let us calculate the \(F\)-statistic in steps. The next code sequence uses information from the anova-type object, which, remember, can be visualized simply by typing the name of the object in RStudio’s Console window.

alpha <- 0.05
data("andy", package="PoEdata")
N <- NROW(andy) #Number of observations in dataset
K <- 4 #Four Betas in the unrestricted model
J <- 2 #Because Ho has two restrictions
fcr <- qf(1-alpha, J, N-K)
mod1 <- lm(sales~price+advert+I(advert^2), data=andy)
anov <- anova(mod1)
anov # prints 'anova' table for the unrestricted model
Df Sum Sq Mean Sq F value Pr(>F)
price 1 1219.091 1219.0910 56.49523 0.000000
advert 1 177.448 177.4479 8.22331 0.005441
I(advert^2) 1 186.858 186.8585 8.65941 0.004393
Residuals 71 1532.084 21.5787 NA NA
SSEu <- anov[4, 2] # SSE of the unrestricted model (the 'Residuals' row)
mod2 <- lm(sales~price, data=andy) # restricted
anov <- anova(mod2)
anov # prints the 'anova' table for the restricted model
Df Sum Sq Mean Sq F value Pr(>F)
price 1 1219.09 1219.091 46.9279 0
Residuals 73 1896.39 25.978 NA NA
SSEr <- anov[2, 2] # SSE of the restricted model
fval <- ((SSEr-SSEu)/J) / (SSEu/(N-K))
pval <- 1-pf(fval, J, N-K)

The calculated \(F\)-statistic is \(fval=8.441\), while the critical value corresponding to a significance level \(\alpha =0.05\) is \(3.126\). Since the calculated value exceeds the critical value, we reject the null hypothesis that both \(\beta_{3}\) and \(\beta_{4}\) are zero. The \(p\)-value of the test is \(p=0.0005\).

Using the linearHypothesis() function should produce the same result:

Hnull <- c("advert=0", "I(advert^2)=0")
linearHypothesis(mod1,Hnull)
Res.Df RSS Df Sum of Sq F Pr(>F)
73 1896.39 NA NA NA NA
71 1532.08 2 364.306 8.44136 0.000514

The table generated by the linearHypothesis() function shows the same values of the \(F\)-statistic and \(p\)-value that we have calculated before, as well as the residual sum of squares for the restricted and unrestricted models. Please note how I formulate the joint hypothesis as a vector of character values in which the names of the variables perfectly match those in the unrestricted model.

Testing the overall significance of a model amounts to testing the joint hypothesis that all the slope coefficients are zero. \(R\) performs this test automatically, and the resulting \(F\)-statistic and \(p\)-value are reported in the regression output.

summary(mod1)
## 
## Call:
## lm(formula = sales ~ price + advert + I(advert^2), data = andy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.255  -3.143  -0.012   2.851  11.805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  109.719      6.799   16.14  < 2e-16 ***
## price         -7.640      1.046   -7.30  3.2e-10 ***
## advert        12.151      3.556    3.42   0.0011 ** 
## I(advert^2)   -2.768      0.941   -2.94   0.0044 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.65 on 71 degrees of freedom
## Multiple R-squared:  0.508,  Adjusted R-squared:  0.487 
## F-statistic: 24.5 on 3 and 71 DF,  p-value: 5.6e-11
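As a check (we do not need it in practice, since summary() already reports the result), the overall \(F\)-statistic can be reproduced as a joint linearHypothesis() test on all slope coefficients:

Hnullall <- c("price=0", "advert=0", "I(advert^2)=0")
linearHypothesis(mod1, Hnullall) # F should match the 24.5 reported by summary(mod1)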

The \(F\)-statistic can be retrieved from summary(mod1) or by using the function glance(modelname) in package broom, as shown in the following code lines. The function tidy, also from package broom, organizes regression output (mainly the coefficients and their statistics) in a neat table. Both glance and tidy return data.frame objects, which makes their output suitable for use by other functions such as kable and ggplot2. Please also note that tidy(mod1) and tidy(summary(mod1)) produce the same result, as shown in Tables 6.1 and 6.2. As always, we can use the function names to obtain a list of the quantities available in the output of the glance function.

smod1 <- summary(mod1)
kable(tidy(smod1), caption="Tidy 'summary(mod1)' output")
Table 6.1: Tidy ‘summary(mod1)’ output
term estimate std.error statistic p.value
(Intercept) 109.71904 6.799045 16.13742 0.000000
price -7.64000 1.045939 -7.30444 0.000000
advert 12.15124 3.556164 3.41695 0.001052
I(advert^2) -2.76796 0.940624 -2.94269 0.004393
fval <- smod1$fstatistic
library(broom)
kable(tidy(mod1), caption="'Tidy(mod1)' output")
Table 6.2: ‘Tidy(mod1)’ output
term estimate std.error statistic p.value
(Intercept) 109.71904 6.799045 16.13742 0.000000
price -7.64000 1.045939 -7.30444 0.000000
advert 12.15124 3.556164 3.41695 0.001052
I(advert^2) -2.76796 0.940624 -2.94269 0.004393
glance(mod1)$statistic #Retrieves the F-statistic
## [1] 24.4593
names(glance(mod1)) #Shows what is available in 'glance'
##  [1] "r.squared"     "adj.r.squared" "sigma"         "statistic"    
##  [5] "p.value"       "df"            "logLik"        "AIC"          
##  [9] "BIC"           "deviance"      "df.residual"
kable(glance(mod1), 
 caption="Function 'glance(mod1)' output", digits=2, 
 col.names=(c("Rsq","AdjRsq","sig","F",
 "pF","K","logL","AIC","BIC","dev","df.res")))
Table 6.3: Function ‘glance(mod1)’ output
Rsq AdjRsq sig F pF K logL AIC BIC dev df.res
0.51 0.49 4.65 24.46 0 4 -219.55 449.11 460.7 1532.08 71

Table 6.3 shows a summary of the quadratic andy model (mod1), where I have changed the names of various items so that the table fits the width of the page. When retrieving these variables, make sure you use the original names as indicated by the names(glance(mod1)) command.

When testing a two-tail single (not joint) null hypothesis, the \(t\) and \(F\) tests are equivalent. However, one-tail tests, whether single or joint, cannot be easily performed with an \(F\) test.
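A quick sketch of this equivalence for the single hypothesis \(H_{0}:\beta_{3}=0\) in the quadratic andy model: the \(F\)-statistic from linearHypothesis() should equal the squared \(t\)-statistic from summary(mod1).

lh1 <- linearHypothesis(mod1, "advert=0")
lh1$F[2]                                   # F-statistic for the single restriction
coef(summary(mod1))["advert", "t value"]^2 # squared t-statistic; the two should match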

Let us solve one more exercise involving a joint hypothesis with linear combinations of regression coefficients. Suppose we want to test the simultaneous hypotheses that the monthly advertising expenditure \(advert_{0}=\$1900\) in the quadratic andy model (Equation \ref{eq:quadraticandyagain6}) satisfies the profit-maximizing condition \(\beta_{3}+2\beta_{4}advert_{0}=1\), and that, when \(price=\$6\) and \(advert=\$1900\), sales revenue is \(\$80\,000\).

hyp <- c("advert+3.8*I(advert^2)=1", 
"(Intercept)+6*price+1.9*advert+3.61*I(advert^2)=80")
lhout <- tidy(linearHypothesis(mod1,hyp))
kable(lhout,
 caption="Joint hypotheses with the 'linearHypothesis' function")
Table 6.4: Joint hypotheses with the ‘linearHypothesis’ function
res.df rss df sumsq statistic p.value
73 1779.86 NA NA NA NA
71 1532.08 2 247.776 5.74123 0.004885

Table 6.4 includes the \(F\)-statistic of the test, \(F=5.741229\), and its \(p\)-value, \(p=0.004885\). Please be aware that the function tidy changes the column names in the output of linearHypothesis. A useful exercise is to compare the raw output of linearHypothesis with the output generated by tidy(linearHypothesis(mod1, hyp)). There are several other ways in \(R\) to compare two regression models, such as a restricted and an unrestricted one, for instance the anova() function or Wald tests; these will be mentioned in later chapters.
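As a preview of the anova() approach, comparing the two nested andy models directly should reproduce the joint test of \(H_{0}:\beta_{3}=\beta_{4}=0\) that we calculated step by step at the beginning of this section:

anova(mod2, mod1) # restricted vs. unrestricted; F and p-value should match those above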

6.3 Omitted Variable Bias

Consider the general model with two regressors in Equation \ref{eq:omittedgeneral}, a model that I will call the true model. \[\begin{equation} y=\beta_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}+e \label{eq:omittedgeneral} \end{equation}\]

Suppose we are only interested in estimating \(\beta_{2}\), but there is no data available for \(x_{3}\), or for other reasons \(x_{3}\) is omitted from the model in Equation \ref{eq:omittedgeneral}. What is the error in the estimate of \(\beta_{2}\) introduced by omitting \(x_{3}\)? Equation \ref{eq:omittedequation6} shows what is left of the true model after omitting \(x_{3}\).

\[\begin{equation} y=\beta_{1}+\beta_{2}x_{2}+u \label{eq:omittedequation6} \end{equation}\]

Let \(b_{2}^*\) be the estimate of \(\beta_{2}\) when \(x_{3}\) is omitted. Equation \ref{eq:omittedbiasformula6} gives the bias in this simple, two-regressor case. The formula shows that the bias depends on the direct effect of the omitted regressor on the response, \(\beta_{3}\), as well as on the covariance between the omitted and the included regressors. When \(\beta_{3}\) and \(cov(x_{2},\,x_{3})\) are both positive or both negative, the bias is positive (the incorrect model overestimates the true \(\beta_{2}\)); when they have opposite signs, the bias is negative (\(\beta_{2}\) is underestimated).

\[\begin{equation} bias(b_{2}^*)=E(b_{2}^*)-\beta_{2}=\beta_{3}\frac{\widehat{cov(x_{2},x_{3})}}{var(x_{2})} \label{eq:omittedbiasformula6} \end{equation}\]

The example in this section uses the dataset \(edu\_inc\), and two models: one where the response variable family income (\(faminc\)) is explained by the regressors husband’s education (\(he\)) and wife’s education (\(we\)), and another model, where the \(we\) regressor is omitted. The purpose is to compare the estimates coming from the two models and see if there is a significant difference between them.

data("edu_inc", package="PoEdata")
mod1 <- lm(faminc~he+we, data=edu_inc)
mod2 <- lm(faminc~he, data=edu_inc)
kable(tidy(mod1), caption="The correct model")
Table 6.5: The correct model
term estimate std.error statistic p.value
(Intercept) -5533.63 11229.533 -0.492775 0.622426
he 3131.51 802.908 3.900209 0.000112
we 4522.64 1066.327 4.241328 0.000027
kable(tidy(mod2), 
  caption="The incorrect model ('we' omitted)")
Table 6.6: The incorrect model (‘we’ omitted)
term estimate std.error statistic p.value
(Intercept) 26191.27 8541.108 3.06650 0.002304
he 5155.48 658.457 7.82964 0.000000

The marginal effect of husband’s education is much lower in the incorrect model. Let us apply the logic of Equation \ref{eq:omittedbiasformula6} to the \(edu\_inc\) model. The direct effect of the omitted regressor (\(we\)) on the response (\(faminc\)) is likely to be positive in theory (higher education generates higher income); the correlation between husband’s and wife’s education is also likely to be positive if we believe that people generally marry persons within their social circle. Thus, we should expect that omitting the regressor \(we\) produces an overestimated marginal effect of \(he\). Our data happen to confirm this expectation, although in any particular sample the estimates need not move in the expected direction.
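As a numerical sketch of Equation \ref{eq:omittedbiasformula6}, note that the ratio \(\widehat{cov(he,we)}\,/\,var(he)\) is simply the slope from regressing \(we\) on \(he\); the OLS algebra then reproduces the coefficient of the incorrect model exactly.

delta <- coef(lm(we~he, data=edu_inc))["he"] # estimates cov(he,we)/var(he)
coef(mod2)["he"]                             # 'he' coefficient in the incorrect model
coef(mod1)["he"]+coef(mod1)["we"]*delta      # correct-model coefficient plus the bias term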

Understanding the problem of omitted variables is very important because it can justify the choice of a particular model. If one is not interested in the effect of variable \(x_{3}\) on \(y\) and can argue convincingly that \(x_{3}\) is uncorrelated with \(x_{2}\), one can counter criticism about omitting the regressor \(x_{3}\), since Equation \ref{eq:omittedbiasformula6} shows that no bias arises in that case.

6.4 Irrelevant Variables

We have seen the effect of omitting a relevant regressor: the estimates of the included regressors become biased, although their variances decrease. But what happens if irrelevant variables are incorrectly included? Not surprisingly, this increases the variances (lowers the precision) of the estimates of the other variables in the model. The next example uses the same (\(edu\_inc\)) dataset as above, but includes two artificially generated variables, \(xtra\_x5\) and \(xtra\_x6\), which are correlated with \(he\) and \(we\) but, obviously, play no role in determining \(faminc\). Let us compare two models, of which one includes these irrelevant variables.

mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
mod4 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
kable(tidy(mod3), caption="Correct 'faminc' model")
Table 6.7: Correct ‘faminc’ model
term estimate std.error statistic p.value
(Intercept) -7755.33 11162.935 -0.694739 0.487599
he 3211.53 796.703 4.031022 0.000066
we 4776.91 1061.164 4.501574 0.000009
kl6 -14310.92 5003.928 -2.859937 0.004447
kable(tidy(mod4), 
 caption="Incorrect 'faminc' with irrelevant variables")
Table 6.8: Incorrect ‘faminc’ with irrelevant variables
term estimate std.error statistic p.value
(Intercept) -7558.613 11195.41 -0.675153 0.499948
he 3339.792 1250.04 2.671750 0.007838
we 5868.677 2278.07 2.576165 0.010329
kl6 -14200.184 5043.72 -2.815419 0.005100
xtra_x5 888.843 2242.49 0.396364 0.692037
xtra_x6 -1067.186 1981.69 -0.538524 0.590499

A comparison of the two models shown in Tables 6.7 and 6.8 indicates that the inclusion of the two irrelevant variables has increased the marginal effects, standard errors, and the \(p\)-values of \(he\) and \(we\). Thus, including irrelevant variables may incorrectly diminish the significance of the “true” regressors.

6.5 Model Selection Criteria

The main tools for building a model should be economic theory, sound reasoning based on economic principles, and making sure that the model satisfies the Gauss-Markov assumptions. One should also consider the possibility of omitted-variable bias and avoid including irrelevant variables, which may increase the variability of the estimates. After all these aspects have been considered and a model established, a few quantities can help compare different models. These are \(R^2\), adjusted \(R^2\) (\(\bar R ^2\)), the Akaike information criterion (\(AIC\)), and the Schwarz (or Bayesian information) criterion (\(SC\) or \(BIC\)).

We have already seen how to calculate the coefficient of determination, \(R^2\), and how it measures the distance between the regression line and the observation points. A major disadvantage of \(R^2\) is that it increases every time a new regressor is included in the model, whether the regressor is relevant or not. The idea of counterbalancing this property of \(R^2\) has led to a new measure, adjusted \(R^2\), denoted by \(\bar R ^2\) and given by Equation \ref{eq:adjrsq6}. \[\begin{equation} \bar R^2=1-\frac{SSE\,/\,(N-K)}{SST\,/\,(N-1)} \label{eq:adjrsq6} \end{equation}\] Adjusted \(R^2\), while addressing the problem with \(R^2\), introduces other problems. In general, no single goodness-of-fit measure is perfect. The Akaike information criterion (\(AIC\)) and the Schwarz criterion use the same idea of penalizing the introduction of extra regressors. Their formulas are given in Equations \ref{eq:aicformula6} and \ref{eq:scformula6}. \[\begin{equation} AIC=ln \left( \frac{SSE}{N}\right) +\frac{2K}{N} \label{eq:aicformula6} \end{equation}\] \[\begin{equation} SC=ln \left( \frac{SSE}{N} \right) + \frac{K\,ln(N)}{N} \label{eq:scformula6} \end{equation}\]

Among several models, the best fit is the one that maximizes \(R^2\) or \(\bar R^2\). In contrast, the best model minimizes \(AIC\) or \(BIC\). Some computer packages, \(R\) included, calculate \(AIC\) and \(BIC\) differently than Equations \ref{eq:aicformula6} and \ref{eq:scformula6} indicate; however, the ranking of the various models is the same.
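As a minimal sketch, Equations \ref{eq:aicformula6} and \ref{eq:scformula6} can be computed by hand, here for mod3 (the \(faminc\) model with \(he\), \(we\), and \(kl6\) estimated above); the resulting numbers differ from those produced by R's AIC() and BIC() functions, but they rank the same set of models identically.

N <- nobs(mod3)
K <- length(coef(mod3))          # number of betas
sse <- sum(resid(mod3)^2)
aic.poe <- log(sse/N)+2*K/N      # Equation \ref{eq:aicformula6}
sc.poe <- log(sse/N)+K*log(N)/N  # Equation \ref{eq:scformula6}
c(AIC=aic.poe, SC=sc.poe)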

The following code sequence needs some explanation. Function as.numeric extracts only the numbers from an object such as glance(mod1), which also contains row and column names. The purpose is to put together a table with information coming from several models. Function rbind gathers several rows in a matrix, which is then made into a data.frame and given the name tab. The piece of code [,c(1,2,8,9)] at the end of rbind instructs \(R\) to pick all rows, but only columns 1, 2, 8, and 9 from the glance table. Function row.names assigns or changes the row names in a data frame; finally, kable, which we have already encountered several times, prints the table, assigns column names, and gives a caption to the table. While there are many ways to create a table in \(R\), I use kable from package knitr because it allows me to cross-reference tables within this book. kable only works with data frames.

mod1 <- lm(faminc~he, data=edu_inc)
mod2 <- lm(faminc~he+we, data=edu_inc)
mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
mod4 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
r1 <- as.numeric(glance(mod1))
r2 <- as.numeric(glance(mod2))
r3 <- as.numeric(glance(mod3))
r4 <- as.numeric(glance(mod4))
tab <- data.frame(rbind(r1, r2, r3, r4))[,c(1,2,8,9)]
row.names(tab) <- c("he","he, we","he, we, kl6",
                    "he, we, kl6, xtra_x5, xtra_x6")
kable(tab, 
 caption="Model comparison, 'faminc' ", digits=4, 
 col.names=c("Rsq","AdjRsq","AIC","BIC"))
Table 6.9: Model comparison, ‘faminc’
Rsq AdjRsq AIC BIC
he 0.1258 0.1237 10316.7 10328.8
he, we 0.1613 0.1574 10300.9 10317.1
he, we, kl6 0.1772 0.1714 10294.7 10315.0
he, we, kl6, xtra_x5, xtra_x6 0.1778 0.1681 10298.4 10326.8

Table 6.9 shows the four model selection criteria for four different models based on the \(edu\_inc\) dataset, with the first column showing the variables included in each model. It is noticeable that three of the criteria indicate the third model as the best fit, while one, namely \(R^2\), prefers the model that includes the irrelevant variables \(xtra\_x5\) and \(xtra\_x6\).

As a side note, a quick way of extracting the information criteria from an lm() object is illustrated in the following code fragment.

library(stats)
smod1 <- summary(mod1)
Rsq <- smod1$r.squared
AdjRsq <- smod1$adj.r.squared
aic <- AIC(mod1)
bic <- BIC(mod1)
c(Rsq, AdjRsq, aic, bic)
## [1]     0.125801     0.123749 10316.651535 10328.828905

Another potentially useful tool for building an appropriate model is the Ramsey specification test, RESET. This method automatically adds higher-order polynomial terms to your model and tests the joint hypothesis that their coefficients are all zero. Thus, the null hypothesis of the test is \(H_{0}\): “No higher-order polynomial terms are necessary”; if we reject the null hypothesis we need to consider including such terms.

The \(R\) function that performs a RESET test is resettest, which requires the following arguments: formula, the formula of the model to be tested or the name of an already estimated lm object; power, a set of integers indicating the powers of the terms to be included; type, which can be one of “fitted”, “regressor”, or “princomp”, indicating whether the additional terms should be powers of the fitted values, of the regressors, or of the first principal component of the regressor matrix; and, finally, data, which specifies the dataset to be used if a formula has been provided rather than a model object. The following code applies the test to the complete \(faminc\) model, first using only quadratic terms of the fitted values, then using both quadratic and cubic terms.

mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
resettest(mod3, power=2, type="fitted")
## 
##  RESET test
## 
## data:  mod3
## RESET = 5.984, df1 = 1, df2 = 423, p-value = 0.0148
resettest(mod3, power=2:3, type="fitted")
## 
##  RESET test
## 
## data:  mod3
## RESET = 3.123, df1 = 2, df2 = 422, p-value = 0.0451

The number labeled RESET in the output is the \(F\)-statistic of the test under the null hypothesis, followed by the two degrees of freedom of the \(F\) distribution and the \(p\)-value. In our case both \(p\)-values are slightly lower than \(0.05\), indicating that the model marginally fails the specification test and some higher-order terms may be necessary.
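For learning purposes, here is a sketch of the RESET test with power 2 done “by hand”: add the squared fitted values as an extra regressor and test its significance; the squared \(t\)-statistic of the added term should equal the RESET \(F\)-statistic reported above.

yhat2 <- fitted(mod3)^2                       # squared fitted values
reset.manual <- lm(faminc~he+we+kl6+yhat2, data=edu_inc)
summary(reset.manual)$coefficients["yhat2", ] # t^2 should equal the RESET statistic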

6.6 Collinearity

There is collinearity among regressors when two or more regressors move closely together or display little variability. A consequence of collinearity is large variance of the estimated parameters, which increases the chances of not finding them significantly different from zero. The estimates are, however, unbiased, since (imperfect) collinearity does not technically violate the Gauss-Markov assumptions. Collinearity tends to produce insignificant coefficients even when measures of goodness-of-fit such as \(R^2\) or the overall significance of the model (the \(F\)-statistic) are quite large.

Let us consider the example of the dataset cars, where mpg is miles per gallon, cyl is number of cylinders, eng is engine displacement in cubic inches, and wgt is the weight of the vehicle in pounds.

data("cars", package="PoEdata")
mod1 <- lm(mpg~cyl, data=cars)
kable(tidy(mod1), caption="A simple linear 'mpg' model")
Table 6.10: A simple linear ‘mpg’ model
term estimate std.error statistic p.value
(Intercept) 42.91551 0.834867 51.4040 0
cyl -3.55808 0.145676 -24.4247 0

This naive model suggests a strong effect of the number of cylinders on fuel economy, but if we introduce more terms in the equation this result changes substantially.

mod2 <- lm(mpg~cyl+eng+wgt, data=cars)
kable(tidy(mod2), caption="Multivariate 'mpg' model")
Table 6.11: Multivariate ‘mpg’ model
term estimate std.error statistic p.value
(Intercept) 44.370962 1.480685 29.966509 0.000000
cyl -0.267797 0.413067 -0.648313 0.517166
eng -0.012674 0.008250 -1.536225 0.125298
wgt -0.005708 0.000714 -7.995143 0.000000

In the model summarized in Table 6.11 the number of cylinders becomes altogether insignificant, a sharp change with respect to the previous specification. This high sensitivity of the estimates when other variables are introduced is also a sign of collinearity. Indeed, it is reasonable to believe that the characteristics of the vehicles vary together: heavier vehicles have more cylinders and bigger engines.
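A quick look at the pairwise correlations among the regressors supports this suspicion:

cor(cars[, c("cyl", "eng", "wgt")]) # pairwise correlations among the regressors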

A test that may be useful in detecting collinearity is to calculate the variance inflation factor, \(VIF\), for each regressor. The rule of thumb is that a regressor produces collinearity if its \(VIF\) is greater than \(10\). Equation \ref{eq:vifformula6} shows the formula for the variance inflation factor, where \(R_{k}^2\) is the \(R^2\) from regressing the variable \(x_{k}\) on all the remaining regressors. \[\begin{equation} VIF_{k}=\frac{1}{1-R_{k}^2} \label{eq:vifformula6} \end{equation}\]
mod2 <- lm(mpg~cyl+eng+wgt, data=cars)
tab <- tidy(vif(mod2))
kable(tab, 
caption="Variance inflation factors for the 'mpg' regression model",
  col.names=c("regressor","VIF"))
Table 6.12: Variance inflation factors for the ‘mpg’ regression model
regressor VIF
cyl 10.51551
eng 15.78646
wgt 7.78872

The results in Table 6.12 show that the regressors cyl and eng fail the collinearity test, having \(VIF\)s greater than 10.
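As a sketch, the \(VIF\) for cyl can be reproduced directly from Equation \ref{eq:vifformula6} by regressing cyl on the remaining regressors:

Rk2 <- summary(lm(cyl~eng+wgt, data=cars))$r.squared
1/(1-Rk2) # should match the VIF reported for 'cyl' in Table 6.12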

6.7 Prediction and Forecasting

We have previously discussed the semantic difference between prediction and forecasting: prediction means estimating the expected value of the response, while forecasting means estimating a particular value of the response. We mentioned that, for the same vector of regressors, prediction has a narrower confidence interval than forecasting, because forecasting includes, besides the uncertainty in the expected value of the response, the variability of a particular observation about its mean. I essentially repeat the same procedure here for the quadratic andy model, which regresses \(sales\) on \(price\), \(advert\), and \(advert^2\). The key \(R\) function to calculate both predictions and forecasts is predict, with the following arguments: object, a fitted model (lm) object; newdata, a data frame containing the new data points where prediction is desired (if newdata is missing, predictions are calculated for all observations in the dataset); interval, which can be “none”, “confidence”, or “prediction”, and tells \(R\) whether we want only a point estimate of the response, a prediction with its confidence interval, or a forecast with its confidence interval; and level, the confidence level we want (\(95\%\) if missing). See ?predict.lm for other arguments.

predpoint <- data.frame(price=6, advert=1.9)
mod3 <- lm(sales~price+advert+I(advert^2), data=andy)
kable(tidy(predict(mod3, newdata=predpoint,
    interval="prediction")), 
    caption="Forecasting in the quadratic 'andy' model")
Table 6.13: Forecasting in the quadratic ‘andy’ model
fit lwr upr
76.974 67.5326 86.4155

Table 6.13 displays the point estimate and forecast interval estimate for the data point \(price=\$6\) and \(advert=1.9\), which, remember, stands for advertising expenditure of \(\$1900\).
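For comparison, calling predict with interval="confidence" at the same point gives the (narrower) interval estimate of expected sales, rather than a forecast of a particular sales value:

predict(mod3, newdata=predpoint, interval="confidence")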

References

Robinson, David. 2016. Broom: Convert Statistical Analysis Objects into Tidy Data Frames. https://CRAN.R-project.org/package=broom.