Example
Model selection for Motor Trend Car Road Tests data
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). A full description of the data can be found with the R command ?mtcars.
These data contain 32 observations on 11 (numeric) variables.
mpg: Miles/(US) gallon
cyl: Number of cylinders
disp: Displacement (cu.in.)
hp: Gross horsepower
drat: Rear axle ratio
wt: Weight (1000 lbs)
qsec: 1/4 mile time
vs: Engine (0 = V-shaped, 1 = straight)
am: Transmission (0 = automatic, 1 = manual)
gear: Number of forward gears
carb: Number of carburetors
We want to model mpg using the remaining variables as explanatory variables.
Initially, you may just fit a multiple linear regression, checking model assumptions, as follows.
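The summary output below can be reproduced with a call along the following lines; the object name fit_full is our own choice, but the formula matches the Call line shown in the output.
fit_full <- lm(mpg ~ ., data = mtcars)  # regress mpg on all remaining variables
summary(fit_full)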
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
The first thing to note is that none of the explanatory variables is significant. What does this mean? Does it suggest that our set of explanatory variables are poor predictors of miles per gallon? The answer to these questions is no. To see this, let's consider the pairwise correlations and relationships between this set of explanatory variables and our response variable.
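The plot discussed next is a scatterplot matrix; it is not reproduced here, but a plot of this kind, together with the pairwise correlations, can be generated in base R:
pairs(mtcars)          # scatterplot matrix of all pairwise relationships
round(cor(mtcars), 2)  # matrix of pairwise correlations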
There is a lot of information in the plot above. The main thing to note is that mpg seems to be linearly related to some of the possible explanatory variables, and many of the explanatory variables are related to each other.
To illustrate the issue that we face, suppose we now just looked at horsepower, hp, and fit a simple linear regression.
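As before, the summary below can be reproduced with a call along these lines (the object name fit_hp is our own):
fit_hp <- lm(mpg ~ hp, data = mtcars)  # simple linear regression on hp alone
summary(fit_hp)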
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
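The hp rows of the two coefficient tables, shown next (full model first, then the simple model), can be pulled out directly for comparison using the fitted objects defined above:
coef(summary(fit_full))["hp", ]  # hp row from the full model
coef(summary(fit_hp))["hp", ]    # hp row from the simple model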
## Estimate Std. Error t value Pr(>|t|)
## -0.02148212 0.02176858 -0.98684065 0.33495531
## Estimate Std. Error t value Pr(>|t|)
## -6.822828e-02 1.011930e-02 -6.742389e+00 1.787835e-07
In both models, we find hp to be negatively related to mpg, given the negative coefficient estimates. In the full model, however, the estimated standard error (0.0218) is roughly twice that of the simple linear regression (0.0101).
Multicollinearity
When we have many explanatory variables, the problem of multicollinearity often arises. This is when two or more explanatory variables are themselves related in an approximately linear way. The more correlation there is between the explanatory variables, the more difficult it becomes to assess how much of the variation in the response variable is explained by each explanatory variable.
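One standard diagnostic, not used above but useful as an illustrative sketch, is the variance inflation factor VIF_j = 1/(1 - R_j^2), where R_j^2 is the R-squared from regressing the j-th explanatory variable on all the others:
# Variance inflation factors: regress each predictor on the rest
predictors <- setdiff(names(mtcars), "mpg")
vifs <- sapply(predictors, function(v) {
  r2 <- summary(lm(reformulate(setdiff(predictors, v), response = v),
                   data = mtcars))$r.squared
  1 / (1 - r2)  # large values indicate strong multicollinearity
})
round(vifs, 1)
A common rule of thumb is that values much larger than 10 indicate problematic multicollinearity.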
In the miles per gallon example, we began by fitting a model with the full set of explanatory variables, which resulted in finding no significant relationship between the explanatory variables and the response variable. The issue was not with our coefficient estimates, but with the estimated standard errors: these tended to be inflated, making it difficult to infer significant relationships.
Consider the Church of England dataset from the previous lecture. We have already seen that there is a strong linear relationship between electoral roll and % attendance. The effect of this is that the regression parameter estimates can change dramatically when both variables are included in the model, while the fit to the data, as measured by the sum of squares, changes very little. When this effect is extreme, estimation can become numerically unstable. An effective solution is to construct a simpler model by omitting variables, as we did in the Church of England example.
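For the mtcars example, one off-the-shelf way to arrive at such a simpler model (a sketch only, not necessarily the procedure developed in this course) is backward stepwise selection by AIC using R's step() function:
fit_full <- lm(mpg ~ ., data = mtcars)
fit_step <- step(fit_full, direction = "backward", trace = FALSE)
summary(fit_step)  # for these data, backward selection retains wt, qsec and am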