# Chapter 6 Further Inference in Multiple Regression

```
rm(list=ls())
library(PoEdata) #for PoE datasets
library(knitr) #for referenced tables with kable()
library(xtable) #makes data frame for kable
library(printr) #automatically prints output nicely
library(effects)
library(car)
library(AER)
library(broom) #for tidy lm output and function glance()
library(stats)
```

New package: `broom` (Robinson 2016).

## 6.1 Joint Hypotheses and the F-statistic

A joint hypothesis is a set of relationships among regression parameters, relationships that need to be simultaneously true according to the null hypothesis. Joint hypotheses can be tested using the \(F\)-statistic that we have already met. Its formula is given by Equation \ref{eq:FstatFormula6}. The \(F\)-statistic has an \(F\) distribution with the degrees of freedom \(J\) and \(N-K\). \[\begin{equation} F=\frac{(SSE_{R}-SSE_{U})\,/\,J}{SSE_{U}\,/\,(N-K)} \sim F_{(J,\;N-K)} \label{eq:FstatFormula6} \end{equation}\]In Equation \ref{eq:FstatFormula6} the subscript \(U\) stands for “unrestricted,” that is, the initial regression equation; the “restricted” equation is a new equation, obtained from the initial one, with the relationships in the null hypothesis assumed to hold. For example, if the initial equation is Equation \ref{eq:initeqF6} and the null hypothesis is Equation \ref{eq:nullHforF6}, then the restricted equation is Equation \ref{eq:restreqforF6}.

\[\begin{equation} y=\beta_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}+\beta_{4}x_{4}+e \label{eq:initeqF6} \end{equation}\] \[\begin{equation} H_{0}:\beta_{2}=0\;\;AND\;\;\beta_{3}=0; \;\;\;H_{A}:\beta_{2}\neq 0\;\;OR\;\;\beta_{3}\neq 0 \label{eq:nullHforF6} \end{equation}\] \[\begin{equation} y=\beta_{1}+\beta_{4}x_{4}+e \label{eq:restreqforF6} \end{equation}\]The symbol \(J\) in the \(F\) formula (Equation \ref{eq:FstatFormula6}) is the first (*numerator*) degrees of freedom of the \(F\) statistic and is equal to the number of simultaneous restrictions in the null hypothesis (Equation \ref{eq:nullHforF6}); the second (the *denominator*) degrees of freedom of the \(F\)-statistic is \(N-K\), which is the usual degrees of freedom of the unrestricted regression model (Equation \ref{eq:initeqF6}). The practical procedure to test a joint hypothesis like the one in Equation \ref{eq:nullHforF6} is to estimate the two regressions (unrestricted and restricted) and to calculate the \(F\)-statistic.
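To fix ideas, here is a minimal sketch of this procedure on simulated data (the variable names and coefficient values are invented for illustration; this is not the *andy* model):

```
# Joint F-test by hand on simulated data (arbitrary coefficients)
set.seed(42)
N <- 100
x2 <- rnorm(N); x3 <- rnorm(N); x4 <- rnorm(N)
y <- 1 + 0.5*x2 + 0.5*x3 + 2*x4 + rnorm(N)
unrestricted <- lm(y ~ x2 + x3 + x4)
restricted <- lm(y ~ x4)          # H0: beta2 = 0 AND beta3 = 0 imposed
SSEu <- sum(resid(unrestricted)^2)
SSEr <- sum(resid(restricted)^2)
J <- 2; K <- 4                    # two restrictions, four betas
Fstat <- ((SSEr - SSEu)/J) / (SSEu/(N - K))
pval <- 1 - pf(Fstat, J, N - K)
```

The same statistic is produced automatically by `anova(restricted, unrestricted)`, which is a quick way to check the hand calculation.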

## 6.2 Testing Simultaneous Hypotheses

Let’s look, again, at the quadratic form of the *andy* equation (Equation \ref{eq:quadraticandyagain6}).

\[\begin{equation} sales=\beta_{1}+\beta_{2}price+\beta_{3}advert+\beta_{4}advert^{2}+e \label{eq:quadraticandyagain6} \end{equation}\]

Equation \ref{eq:quadraticandyagain6} has two terms that involve the regressor \(advert\), of which at least one needs to be significant for a relationship between \(advert\) and \(sales\) to be established. To test if such a relationship exists, we can formulate the following test:

\[\begin{equation} H_{0}:\beta_{3}=0\;\; AND \;\;\beta_{4}=0;\;\;\;H_{A}:\beta_{3}\neq 0\;\; OR\;\; \beta_{4}\neq 0 \label{jointhypandysq6} \end{equation}\]I have already mentioned that \(R\) can do an \(F\) test quite easily (remember the function `linearHypothesis`?), but for learning purposes let us calculate the \(F\)-statistic in steps. The next code sequence uses information in the `anova`-type object, which, remember, can be visualized simply by typing the name of the object in RStudio’s `Console` window.

```
alpha <- 0.05
data("andy", package="PoEdata")
N <- NROW(andy) #Number of observations in dataset
K <- 4 #Four Betas in the unrestricted model
J <- 2 #Because Ho has two restrictions
fcr <- qf(1-alpha, J, N-K)
mod1 <- lm(sales~price+advert+I(advert^2), data=andy)
anov <- anova(mod1)
anov # prints 'anova' table for the unrestricted model
```

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| price | 1 | 1219.091 | 1219.0910 | 56.49523 | 0.000000 |
| advert | 1 | 177.448 | 177.4479 | 8.22331 | 0.005441 |
| I(advert^2) | 1 | 186.858 | 186.8585 | 8.65941 | 0.004393 |
| Residuals | 71 | 1532.084 | 21.5787 | NA | NA |

```
SSEu <- anov[4, 2] # residual sum of squares, unrestricted model
mod2 <- lm(sales~price, data=andy) # restricted
anov <- anova(mod2)
anov # prints the 'anova' table for the restricted model
```

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| price | 1 | 1219.09 | 1219.091 | 46.9279 | 0 |
| Residuals | 73 | 1896.39 | 25.978 | NA | NA |

```
SSEr <- anov[2,2] # residual sum of squares, restricted model
fval <- ((SSEr-SSEu)/J) / (SSEu/(N-K))
pval <- 1-pf(fval, J, N-K)
```

The calculated \(F\)-statistic is \(fval=8.441\), while the critical value corresponding to a significance level of \(\alpha =0.05\) is \(3.126\). Since the calculated value exceeds the critical one, we reject the null hypothesis that both \(\beta_{3}\) and \(\beta_{4}\) are zero. The \(p\)-value of the test is \(p=0.0005\).
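The same numbers can be reproduced by hand from the residual sums of squares printed in the two `anova` tables above:

```
# Recomputing fval, the critical value, and the p-value from the
# residual sums of squares shown in the two anova tables
SSEu <- 1532.084; SSEr <- 1896.39
J <- 2; N <- 75; K <- 4
fval <- ((SSEr - SSEu)/J) / (SSEu/(N - K))
fcr <- qf(0.95, J, N - K)
pval <- 1 - pf(fval, J, N - K)
round(c(fval, fcr, pval), 4)
```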

Using the `linearHypothesis()` function should produce the same result:

```
Hnull <- c("advert=0", "I(advert^2)=0")
linearHypothesis(mod1,Hnull)
```

| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 73 | 1896.39 | NA | NA | NA | NA |
| 71 | 1532.08 | 2 | 364.306 | 8.44136 | 0.000514 |

The table generated by the `linearHypothesis()` function shows the same values of the \(F\)-statistic and \(p\)-value that we have calculated before, as well as the residual sums of squares for the restricted and unrestricted models. Please note how I formulate the joint hypothesis as a vector of character values in which the names of the variables perfectly match those in the unrestricted model.

Testing the overall significance of a model amounts to testing the joint hypothesis that all the slope coefficients are zero. \(R\) performs this test automatically, and the resulting \(F\)-statistic and \(p\)-value are reported in the regression output.

`summary(mod1)`

```
##
## Call:
## lm(formula = sales ~ price + advert + I(advert^2), data = andy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.255 -3.143 -0.012 2.851 11.805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.719 6.799 16.14 < 2e-16 ***
## price -7.640 1.046 -7.30 3.2e-10 ***
## advert 12.151 3.556 3.42 0.0011 **
## I(advert^2) -2.768 0.941 -2.94 0.0044 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.65 on 71 degrees of freedom
## Multiple R-squared: 0.508, Adjusted R-squared: 0.487
## F-statistic: 24.5 on 3 and 71 DF, p-value: 5.6e-11
```

The \(F\)-statistic can be retrieved from `summary(mod1)` or by using the function `glance(modelname)` in package `broom`, as shown in the following code lines. The function `tidy`, also from package `broom`, organizes regression output (mainly the coefficients and their statistics) in a neat table. Both `glance` and `tidy` create output in the form of a `data.frame`, which makes it suitable for use by other functions such as `kable` and `ggplot2`. Please also note that `tidy(mod1)` and `tidy(summary(mod1))` produce the same result, as shown in Tables 6.1 and 6.2. As always, we can use the function `names` to obtain a list of the quantities available in the output of the `glance` function.

```
smod1 <- summary(mod1)
kable(tidy(smod1), caption="Tidy 'summary(mod1)' output")
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 109.71904 | 6.799045 | 16.13742 | 0.000000 |
| price | -7.64000 | 1.045939 | -7.30444 | 0.000000 |
| advert | 12.15124 | 3.556164 | 3.41695 | 0.001052 |
| I(advert^2) | -2.76796 | 0.940624 | -2.94269 | 0.004393 |

`fval <- smod1$fstatistic`

```
library(broom)
kable(tidy(mod1), caption="'Tidy(mod1)' output")
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 109.71904 | 6.799045 | 16.13742 | 0.000000 |
| price | -7.64000 | 1.045939 | -7.30444 | 0.000000 |
| advert | 12.15124 | 3.556164 | 3.41695 | 0.001052 |
| I(advert^2) | -2.76796 | 0.940624 | -2.94269 | 0.004393 |

`glance(mod1)$statistic #Retrieves the F-statistic`

`## [1] 24.4593`

`names(glance(mod1)) #Shows what is available in 'glance'`

```
## [1] "r.squared" "adj.r.squared" "sigma" "statistic"
## [5] "p.value" "df" "logLik" "AIC"
## [9] "BIC" "deviance" "df.residual"
```

```
kable(glance(mod1),
caption="Function 'glance(mod1)' output", digits=2,
col.names=(c("Rsq","AdjRsq","sig","F",
"pF","K","logL","AIC","BIC","dev","df.res")))
```

| Rsq | AdjRsq | sig | F | pF | K | logL | AIC | BIC | dev | df.res |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.51 | 0.49 | 4.65 | 24.46 | 0 | 4 | -219.55 | 449.11 | 460.7 | 1532.08 | 71 |

Table 6.3 shows a summary of the quadratic *andy* model (`mod1`), where I have changed the names of various items so that the table fits the width of the page. When retrieving these variables, make sure you use the original names as indicated by the `names(glance(mod1))` command.

When testing a two-tail single (not joint) null hypothesis, the \(t\) and \(F\) tests are equivalent. However, one-tail tests, whether single or joint, cannot be easily performed by an \(F\) test.

Let us solve one more exercise involving a joint hypothesis with linear combinations of regression coefficients. Suppose we want to test the simultaneous hypotheses that the monthly advertising expenditure \(advert_{0}=\$1900\) in the quadratic *andy* model (Equation \ref{eq:quadraticandyagain6}) satisfies the profit-maximizing condition \(\beta_{3}+2\beta_{4}advert_{0}=1\), and that, when \(price=\$6\) and \(advert=\$1900\), sales revenue is \(\$80\,000\). Since \(advert\) is measured in thousands of dollars, \(advert_{0}=1.9\), so the first restriction reads \(\beta_{3}+3.8\beta_{4}=1\); the second reads \(\beta_{1}+6\beta_{2}+1.9\beta_{3}+3.61\beta_{4}=80\), with \(3.61=1.9^2\).

```
hyp <- c("advert+3.8*I(advert^2)=1",
"(Intercept)+6*price+1.9*advert+3.61*I(advert^2)=80")
lhout <- tidy(linearHypothesis(mod1,hyp))
kable(lhout,
caption="Joint hypotheses with the 'linearHypothesis' function")
```

| res.df | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|
| 73 | 1779.86 | NA | NA | NA | NA |
| 71 | 1532.08 | 2 | 247.776 | 5.74123 | 0.004885 |

Table 6.4 includes the \(F\)-statistic of the test, \(F=5.741229\), and its \(p\)-value, \(p=0.004885\). Please be aware that the function `tidy` changes the names in the output of `linearHypothesis`. A useful exercise is to compare the raw output of `linearHypothesis` with the output generated by `tidy(linearHypothesis(mod1, hyp))`. There are several other possibilities in \(R\) to compare two regression models, such as a restricted and an unrestricted one, for instance the `anova()` function or Wald tests. These are going to be mentioned in later chapters.
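As a preview, here is a sketch of the `anova()` comparison on simulated data (the coefficients and sample are invented to mimic the *andy* setup, so the numbers differ from those above):

```
# Comparing restricted and unrestricted models with anova()
set.seed(1)
N <- 75
price <- runif(N, 4, 7); advert <- runif(N, 0.5, 3)
sales <- 110 - 7*price + 12*advert - 2.8*advert^2 + rnorm(N, sd = 4)
unrestricted <- lm(sales ~ price + advert + I(advert^2))
restricted <- lm(sales ~ price)
anova(restricted, unrestricted)  # same F-test as the one done in steps
```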

## 6.3 Omitted Variable Bias

Consider the general model with two regressors in Equation \ref{eq:omittedgeneral}, a model that I will call the *true* model. \[\begin{equation} y=\beta_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}+e \label{eq:omittedgeneral} \end{equation}\]

Suppose we are only interested in estimating \(\beta_{2}\), but there is no data available for \(x_{3}\), or for other reasons \(x_{3}\) is omitted from the model in Equation \ref{eq:omittedgeneral}. What is the error in the estimate of \(\beta_{2}\) introduced by omitting \(x_{3}\)? Equation \ref{eq:omittedequation6} shows what is left of the *true* model after omitting \(x_{3}\), with the error term \(u\) now absorbing the effect of the omitted regressor.

\[\begin{equation} y=\beta_{1}+\beta_{2}x_{2}+u \label{eq:omittedequation6} \end{equation}\]

Let \(b_{2}^*\) be the estimate of \(\beta_{2}\) when \(x_{3}\) is omitted. Equation \ref{eq:omittedbiasformula6} gives the bias in this simple, two-regressor case. The formula shows that the bias depends on the direct relationship between the omitted regressor and the response through \(\beta_{3}\), as well as on the correlation between the omitted and the included regressors. When \(\beta_{3}\) and \(cov(x_{2},\,x_{3})\) are both positive or both negative, the bias is positive (the incorrect model overestimates the true \(\beta_{2}\)); when they are of opposite signs, the bias is negative (\(\beta_{2}\) is underestimated).

\[\begin{equation} bias(b_{2}^*)=E(b_{2}^{*})-\beta_{2}=\beta_{3}\frac{\widehat{cov(x_{2},x_{3})}}{var(x_{2})} \label{eq:omittedbiasformula6} \end{equation}\]The example in this section uses the dataset \(edu\_inc\) and two models: one where the response variable *family income* (\(faminc\)) is explained by the regressors *husband’s education* (\(he\)) and *wife’s education* (\(we\)), and another where the \(we\) regressor is omitted. The purpose is to compare the estimates coming from the two models and see if there is a significant difference between them.

```
data("edu_inc", package="PoEdata")
mod1 <- lm(faminc~he+we, data=edu_inc)
mod2 <- lm(faminc~he, data=edu_inc)
kable(tidy(mod1), caption="The correct model")
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -5533.63 | 11229.533 | -0.492775 | 0.622426 |
| he | 3131.51 | 802.908 | 3.900209 | 0.000112 |
| we | 4522.64 | 1066.327 | 4.241328 | 0.000027 |

```
kable(tidy(mod2),
caption="The incorrect model ('we' omitted)")
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 26191.27 | 8541.108 | 3.06650 | 0.002304 |
| he | 5155.48 | 658.457 | 7.82964 | 0.000000 |

The marginal effect of husband’s education is much larger in the incorrect model. Let us apply the logic of Equation \ref{eq:omittedbiasformula6} to the \(edu\_inc\) model. The direct effect of the omitted regressor (\(we\)) on the response (\(faminc\)) is likely positive in theory (higher education generates higher income); the correlation between husband’s and wife’s education is also likely positive if we believe that people generally marry within their own social circles. Thus, we should expect that omitting the regressor \(we\) produces an overestimated marginal effect of \(he\). Our data happen to confirm this prediction, although a single sample cannot guarantee it.
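A small simulation can illustrate the mechanics of Equation \ref{eq:omittedbiasformula6}; the parameter values below are arbitrary and unrelated to the \(edu\_inc\) data:

```
# Omitted-variable bias: simulated bias vs. the bias formula
set.seed(66)
N <- 10000
x2 <- rnorm(N)
x3 <- 0.5*x2 + rnorm(N)            # included and omitted regressors correlate
y <- 1 + 2*x2 + 3*x3 + rnorm(N)    # true beta2 = 2, beta3 = 3
b2.star <- coef(lm(y ~ x2))["x2"]  # estimate with x3 omitted
bias.pred <- 3*cov(x2, x3)/var(x2) # beta3 * cov-hat / var, the formula
c(simulated = unname(b2.star) - 2, predicted = bias.pred)
```

With \(\beta_{3}>0\) and \(cov(x_{2},x_{3})>0\), the omitted-variable estimate overshoots the true \(\beta_{2}=2\) by roughly the amount the formula predicts.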

Understanding the problem of omitted variables is very important because it can justify the choice of a particular model. If one is not interested in the effect of variable \(x_{3}\) on \(y\) and can convincingly argue that \(x_{3}\) is uncorrelated with \(x_{2}\), one can defend the omission of \(x_{3}\) against criticism.

## 6.4 Irrelevant Variables

We have seen the effect of omitting a relevant regressor: biased estimates and lower variances of the included regressors. But what happens if irrelevant variables are incorrectly included? Not surprisingly, this increases the variances (lowers the precision) of the other variables in the model. The next example uses the same \(edu\_inc\) dataset as above, but includes two artificially generated variables, \(xtra\_x5\) and \(xtra\_x6\), that are correlated with \(he\) and \(we\) but, obviously, have no role in determining \(faminc\). Let us compare two models, one of which includes these irrelevant variables.

```
mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
mod4 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
kable(tidy(mod3), caption="Correct 'faminc' model")
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -7755.33 | 11162.935 | -0.694739 | 0.487599 |
| he | 3211.53 | 796.703 | 4.031022 | 0.000066 |
| we | 4776.91 | 1061.164 | 4.501574 | 0.000009 |
| kl6 | -14310.92 | 5003.928 | -2.859937 | 0.004447 |

```
kable(tidy(mod4),
caption="Incorrect 'faminc' with irrelevant variables")
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -7558.613 | 11195.41 | -0.675153 | 0.499948 |
| he | 3339.792 | 1250.04 | 2.671750 | 0.007838 |
| we | 5868.677 | 2278.07 | 2.576165 | 0.010329 |
| kl6 | -14200.184 | 5043.72 | -2.815419 | 0.005100 |
| xtra_x5 | 888.843 | 2242.49 | 0.396364 | 0.692037 |
| xtra_x6 | -1067.186 | 1981.69 | -0.538524 | 0.590499 |

A comparison of the two models shown in Tables 6.7 and 6.8 indicates that the inclusion of the two irrelevant variables has increased the marginal effects, standard errors, and the \(p\)-values of \(he\) and \(we\). Thus, including irrelevant variables may incorrectly diminish the significance of the “true” regressors.

## 6.5 Model Selection Criteria

The main tools for building a model should be economic theory, sound reasoning based on economic principles, and making sure that the model satisfies the Gauss-Markov assumptions. One should also consider the possibility of omitted-variable bias and of including irrelevant variables that increase variability in the estimates. After all these aspects have been considered and a model established, a few quantities help compare different models. These are \(R^2\), adjusted \(R^2\) (\(\bar R ^2\)), the Akaike information criterion (\(AIC\)), and the Schwarz (or Bayesian information) criterion (\(SC\) or \(BIC\)).

We have already seen how to calculate the coefficient of determination, \(R^2\), and how it measures the distance between the regression line and the observation points. A major disadvantage of \(R^2\) is that it increases every time a new regressor is included in the model, whether the regressor is relevant or not. The idea of counterbalancing this property of \(R^2\) has led to a new measure, adjusted \(R^2\), denoted by \(\bar R ^2\) and given by Equation \ref{eq:adjrsq6}. \[\begin{equation} \bar R^2=1-\frac{SSE\,/\,(N-K)}{SST\,/\,(N-1)} \label{eq:adjrsq6} \end{equation}\] Adjusted \(R^2\), while addressing the problem with \(R^2\), introduces other problems; in general, no single goodness-of-fit measure is perfect. The Akaike information criterion (\(AIC\)) and the Schwarz criterion use the same idea of penalizing the introduction of extra regressors. Their formulas are given in Equations \ref{eq:aicformula6} and \ref{eq:scformula6}. \[\begin{equation} AIC=ln \left( \frac{SSE}{N}\right) +\frac{2K}{N} \label{eq:aicformula6} \end{equation}\] \[\begin{equation} SC=ln \left( \frac{SSE}{N} \right) + \frac{K\,ln(N)}{N} \label{eq:scformula6} \end{equation}\]Among several models, the best fit is the one that maximizes \(R^2\) or \(\bar R^2\); in contrast, the best model must **minimize** \(AIC\) or \(BIC\). Some computer packages, \(R\) included, calculate \(AIC\) and \(BIC\) differently than Equations \ref{eq:aicformula6} and \ref{eq:scformula6} indicate. However, the ranking of the various models is the same.
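A minimal sketch of these formulas as an \(R\) function, on simulated data (the model and coefficients are invented); the built-in `AIC()` and `BIC()` use a different constant but rank the models identically:

```
# Textbook AIC and SC formulas applied to two candidate models
set.seed(3)
N <- 100
x <- rnorm(N); z <- rnorm(N)
y <- 1 + 0.8*x + rnorm(N)   # z is irrelevant by construction
ic <- function(mod, N) {
  sse <- sum(resid(mod)^2); K <- length(coef(mod))
  c(AIC = log(sse/N) + 2*K/N, SC = log(sse/N) + K*log(N)/N)
}
m1 <- lm(y ~ x); m2 <- lm(y ~ x + z)
rbind(small = ic(m1, N), large = ic(m2, N))
```

Since \(R\)'s versions differ from these only by a constant (for a fixed sample size), whichever model minimizes one also minimizes the other.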

The following code sequence needs some explanation. Function `as.numeric` extracts only the numbers from an object such as `glance(mod1)`, which also contains row and column names. The purpose is to put together a table with information coming from several models. Function `rbind` gathers several rows in a matrix, which is then made into a `data.frame` and given the name `tab`. The piece of code `[,c(1,2,8,9)]` at the end of `rbind` instructs \(R\) to pick all rows, but only columns `1, 2, 8, and 9` from the `glance` table. Function `row.names` assigns or changes the row names in a data frame; finally, `kable`, which we have already encountered several times, prints the table, assigns column names, and gives a caption to the table. While there are many ways to create a table in \(R\), I use `kable` from package `knitr` because it allows me to cross-reference tables within this book. `kable` only works with data frames.

```
mod1 <- lm(faminc~he, data=edu_inc)
mod2 <- lm(faminc~he+we, data=edu_inc)
mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
mod4 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
r1 <- as.numeric(glance(mod1))
r2 <- as.numeric(glance(mod2))
r3 <- as.numeric(glance(mod3))
r4 <- as.numeric(glance(mod4))
tab <- data.frame(rbind(r1, r2, r3, r4))[,c(1,2,8,9)]
row.names(tab) <- c("he","he, we","he, we, kl6",
"he, we, kl6, xtra_x5, xtra_x6")
kable(tab,
caption="Model comparison, 'faminc' ", digits=4,
col.names=c("Rsq","AdjRsq","AIC","BIC"))
```

| | Rsq | AdjRsq | AIC | BIC |
|---|---|---|---|---|
| he | 0.1258 | 0.1237 | 10316.7 | 10328.8 |
| he, we | 0.1613 | 0.1574 | 10300.9 | 10317.1 |
| he, we, kl6 | 0.1772 | 0.1714 | 10294.7 | 10315.0 |
| he, we, kl6, xtra_x5, xtra_x6 | 0.1778 | 0.1681 | 10298.4 | 10326.8 |

Table 6.9 shows the four model selection criteria for four different models based on the \(edu\_inc\) dataset, with the first column showing the variables included in each model. It is noticeable that three of the criteria indicate the third model as the best fit, while one, namely \(R^2\), prefers the model that includes the irrelevant variables \(xtra\_x5\) and \(xtra\_x6\).

As a side note, a quick way of extracting the model selection criteria from an `lm()` object is illustrated in the following code fragment.

```
library(stats)
smod1 <- summary(mod1)
Rsq <- smod1$r.squared
AdjRsq <- smod1$adj.r.squared
aic <- AIC(mod1)
bic <- BIC(mod1)
c(Rsq, AdjRsq, aic, bic)
```

`## [1] 0.125801 0.123749 10316.651535 10328.828905`

Another potentially useful tool for building an appropriate model is the Ramsey specification test, RESET. This method automatically adds higher-order polynomial terms to your model and tests the joint hypothesis that their coefficients are all zero. Thus, the null hypothesis of the test is \(H_{0}\): “No higher-order polynomial terms are necessary”; if we reject the null hypothesis we need to consider including such terms.
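Here is a sketch, on simulated data, of what the test does internally when a single quadratic term is added: augment the model with the squared fitted values and \(F\)-test the added term (the variable names and coefficients below are invented):

```
# Manual RESET-style check: F-test on squared fitted values
set.seed(5)
N <- 200
x <- runif(N, 1, 5)
y <- 2 + 3*x^2 + rnorm(N)           # the true relationship is quadratic
mis <- lm(y ~ x)                    # misspecified linear model
aug <- lm(y ~ x + I(fitted(mis)^2)) # augmented with squared fitted values
anova(mis, aug)                     # tests H0: no higher-order terms needed
```

Because the data are quadratic by construction, the added term is strongly significant and the null of correct specification is rejected.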

The \(R\) function that performs a RESET test is `resettest`, which requires the following arguments: `formula`, the formula of the model to be tested or the name of an already calculated `lm` object; `power`, a set of integers indicating the powers of the polynomial terms to be included; `type`, which can be one of “fitted”, “regressor”, or “princomp”, indicating whether the additional terms should be powers of the fitted values, of the regressors, or of the first principal component of the regressor matrix; and, finally, `data`, which specifies the dataset to be used if a formula has been provided rather than a model object. The following code applies the test to the \(faminc\) model, first using only quadratic terms of the fitted values, then using both quadratic and cubic terms.

```
mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
resettest(mod3, power=2, type="fitted")
```

```
##
## RESET test
##
## data: mod3
## RESET = 5.984, df1 = 1, df2 = 423, p-value = 0.0148
```

`resettest(mod3, power=2:3, type="fitted")`

```
##
## RESET test
##
## data: mod3
## RESET = 3.123, df1 = 2, df2 = 422, p-value = 0.0451
```

The number labeled RESET in the output is the \(F\)-statistic of the test under the null hypothesis, followed by the two degrees of freedom of the \(F\) distribution and the \(p\)-value. In our case both \(p\)-values are slightly lower than \(0.05\), indicating that the model marginally fails the specification test and some higher-order terms may be necessary.

## 6.6 Collinearity

There is collinearity among regressors when two or more regressors move closely together or display little variability. A consequence of collinearity is large variance in the estimated parameters, which increases the chances of not finding them significantly different from zero. The estimates are, however, unbiased, since (imperfect) collinearity does not technically violate the Gauss-Markov assumptions. Collinearity tends to produce insignificant coefficients even when measures of goodness-of-fit such as \(R^2\) or the overall significance of the model (the \(F\)-statistic) may be quite large.

Let us consider the example of the dataset *cars*, where *mpg* is miles per gallon, *cyl* is number of cylinders, *eng* is engine displacement in cubic inches, and *wgt* is the weight of the vehicle in pounds.

```
data("cars", package="PoEdata")
mod1 <- lm(mpg~cyl, data=cars)
kable(tidy(mod1), caption="A simple linear 'mpg' model")
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 42.91551 | 0.834867 | 51.4040 | 0 |
| cyl | -3.55808 | 0.145676 | -24.4247 | 0 |

This naive model suggests a strong effect of the number of cylinders on fuel economy, but if we introduce more terms in the equation this result changes substantially.

```
mod2 <- lm(mpg~cyl+eng+wgt, data=cars)
kable(tidy(mod2), caption="Multivariate 'mpg' model")
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 44.370962 | 1.480685 | 29.966509 | 0.000000 |
| cyl | -0.267797 | 0.413067 | -0.648313 | 0.517166 |
| eng | -0.012674 | 0.008250 | -1.536225 | 0.125298 |
| wgt | -0.005708 | 0.000714 | -7.995143 | 0.000000 |

In the model summarized in Table 6.11 the number of cylinders becomes insignificant altogether, a sharp change with respect to the previous specification. This high sensitivity of the estimates when other variables are introduced is also a sign of collinearity. Indeed, it is reasonable to believe that the characteristics of the vehicles vary together: heavier vehicles have more cylinders and bigger engines.

A test that may be useful in detecting collinearity is to calculate the *variance inflation factor*, \(VIF\), for each regressor. The rule of thumb is that a regressor produces collinearity if its \(VIF\) is greater than \(10\). Equation \ref{eq:vifformula6} shows the formula for the variance inflation factor, where \(R_{k}^2\) is the \(R^2\) from regressing the variable \(x_{k}\) on all the remaining regressors. \[\begin{equation} VIF_{k}=\frac{1}{1-R_{k}^2} \label{eq:vifformula6} \end{equation}\]
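A sketch of this calculation from its definition, on simulated data (the regressors are invented, with near-collinearity between \(x_{2}\) and \(x_{3}\) built in on purpose):

```
# VIF by hand: auxiliary regression of one regressor on the others
set.seed(7)
N <- 100
x2 <- rnorm(N)
x3 <- 0.9*x2 + 0.3*rnorm(N)   # x3 nearly collinear with x2
x4 <- rnorm(N)
Rk2 <- summary(lm(x2 ~ x3 + x4))$r.squared  # auxiliary R-squared
VIF2 <- 1/(1 - Rk2)
VIF2  # large, signalling collinearity in x2
```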

```
mod2 <- lm(mpg~cyl+eng+wgt, data=cars)
tab <- tidy(vif(mod2))
kable(tab,
caption="Variance inflation factors for the 'mpg' regression model",
col.names=c("regressor","VIF"))
```

| regressor | VIF |
|---|---|
| cyl | 10.51551 |
| eng | 15.78646 |
| wgt | 7.78872 |

The results in Table 6.12 show that the regressors `cyl` and `eng` fail the collinearity test, having \(VIF\)s greater than 10.

## 6.7 Prediction and Forecasting

We have previously discussed the semantic difference between prediction and forecasting: prediction means estimating the expected value of the response, while forecasting means estimating a particular value of the response. We mentioned that, for the same vector of regressors, prediction has a narrower confidence interval than forecasting, because forecasting includes, besides the uncertainty of the expected value of the response, the variability of a particular observation about its mean. I essentially repeat the same procedure here for the quadratic *andy* model, which regresses \(sales\) on \(price\), \(advert\), and \(advert^2\). The key \(R\) function to calculate both predictions and forecasts is `predict`, with the following arguments: `model`, the name of a model object; `newdata`, which contains the new data points where prediction is desired (if `newdata` is missing, predictions are calculated for all observations in the dataset); `interval`, which can be “none”, “confidence”, or “prediction”, and tells \(R\) whether we want only a point estimate of the response, a prediction with its confidence interval, or a forecast with its confidence interval; `level`, the confidence level we want (if missing, `level` is \(95\%\)); and other arguments (see `?predict` for more details).

```
predpoint <- data.frame(price=6, advert=1.9)
mod3 <- lm(sales~price+advert+I(advert^2), data=andy)
kable(tidy(predict(mod3, newdata=predpoint,
interval="prediction")),
caption="Forecasting in the quadratic 'andy' model")
```

| fit | lwr | upr |
|---|---|---|
| 76.974 | 67.5326 | 86.4155 |

Table 6.13 displays the point estimate and forecast interval estimate for the data point \(price=\$6\) and \(advert=1.9\), which, remember, stands for advertising expenditure of \(\$1900\).
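The difference between the two kinds of intervals can be illustrated on simulated data mimicking the *andy* setup (the coefficients are invented); the forecast interval is necessarily the wider one:

```
# Confidence (expected value) vs. prediction (forecast) intervals
set.seed(9)
N <- 75
price <- runif(N, 4, 7); advert <- runif(N, 0.5, 3)
sales <- 110 - 7*price + 12*advert - 2.8*advert^2 + rnorm(N, sd = 4)
m <- lm(sales ~ price + advert + I(advert^2))
pt <- data.frame(price = 6, advert = 1.9)
conf <- predict(m, newdata = pt, interval = "confidence")
fore <- predict(m, newdata = pt, interval = "prediction")
rbind(confidence = conf[1, ], forecast = fore[1, ])
```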

### References

Robinson, David. 2016. *Broom: Convert Statistical Analysis Objects into Tidy Data Frames*. https://CRAN.R-project.org/package=broom.