Example

Model selection for Motor Trend Car Road Tests data

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). A full description of the data can be found with the R command

?mtcars

These data contain 32 observations on 11 (numeric) variables.

  1. mpg Miles/(US) gallon
  2. cyl Number of cylinders
  3. disp Displacement (cu.in.)
  4. hp Gross horsepower
  5. drat Rear axle ratio
  6. wt Weight (1000 lbs)
  7. qsec 1/4 mile time
  8. vs Engine (0 = V-shaped, 1 = straight)
  9. am Transmission (0 = automatic, 1 = manual)
  10. gear Number of forward gears
  11. carb Number of carburetors
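
A quick structural check in R (using only base functions) confirms this:

# Dimensions and variable types of the built-in mtcars data frame
dim(mtcars)
str(mtcars)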

We want to model mpg using the remaining variables as explanatory variables.

Initially, you might simply fit a multiple linear regression using all of the explanatory variables (checking the model assumptions as usual), as follows.

model <- lm(mpg ~ ., data = mtcars)
summary(model)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
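
Before interpreting the coefficients, the model assumptions mentioned above can be checked with the standard diagnostic plots; a minimal base-R sketch is:

# Residuals vs fitted, normal Q-Q, scale-location and residuals vs
# leverage, on a 2-by-2 grid
par(mfrow = c(2, 2))
plot(model)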

The first thing to note is that none of the explanatory variables is significant at the usual 5% level. What does this mean? Does it suggest that the explanatory variables we have are not good predictors of miles per gallon? The answer to both questions is no. To see this, let's consider the pairwise correlations and relationships among this set of explanatory variables and our response variable.

library(GGally)
ggpairs(mtcars, upper = list(continuous = wrap("cor", size = 3, col = "black")))

There is a lot of information in the plot above. The main thing to note is that mpg seems to be linearly related to some of the possible explanatory variables, and many of the explanatory variables are related to each other.
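
To attach numbers to these impressions, here is a minimal base-R sketch (our own illustration, not part of the lecture code):

# Correlation of each explanatory variable with mpg, strongest first
sort(abs(cor(mtcars)[-1, "mpg"]), decreasing = TRUE)

# Pairwise correlations among the explanatory variables themselves
round(cor(mtcars[, -1]), 2)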

To illustrate the issue we face, suppose we instead look only at gross horsepower, hp, and fit a simple linear regression.

model2 <- lm(mpg ~ hp, data = mtcars)
summary(model2)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
# hp coefficient estimate from the full model
summary(model)$coefficients["hp", ]
##    Estimate  Std. Error     t value    Pr(>|t|) 
## -0.02148212  0.02176858 -0.98684065  0.33495531
# hp coefficient estimate from the simple linear regression
summary(model2)$coefficients["hp", ]
##      Estimate    Std. Error       t value      Pr(>|t|) 
## -6.822828e-02  1.011930e-02 -6.742389e+00  1.787835e-07

In both models, we find hp to be negatively related to mpg, given the negative coefficient estimates. In the full model, however, the estimated standard error (0.0218) is roughly twice that from the simple linear regression (0.0101), which is why hp no longer appears significant there.

Multicollinearity

When we have many explanatory variables, the problem of multicollinearity often arises. This occurs when two or more explanatory variables are themselves related in an approximately linear way. The stronger the correlation between explanatory variables, the more difficult it becomes to assess how much of the variation in the response variable is explained by each individual explanatory variable.

In the miles per gallon example, we began by fitting a model with the full set of explanatory variables, which resulted in finding no significant relationships between the explanatory variables and the response variable. The issue was not with our coefficient estimates, but with their estimated standard errors: these were inflated by the correlations among the explanatory variables, which made it difficult to detect significant relationships.
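
A standard way to quantify this inflation is the variance inflation factor, VIF_j = 1/(1 - R_j^2), where R_j^2 is the R-squared from regressing the j-th explanatory variable on all the others. A sketch, assuming the car package is installed (it is not used elsewhere in these notes):

# Variance inflation factors for the full model; a common rule of
# thumb treats values above 5 or 10 as signs of multicollinearity
library(car)
vif(model)

# The same quantity for hp by hand: regress hp on the other
# explanatory variables and apply 1 / (1 - R^2)
r2_hp <- summary(lm(hp ~ . - mpg, data = mtcars))$r.squared
1 / (1 - r2_hp)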

Consider the Church of England dataset from the previous lecture. We have already seen that there is a strong linear relationship between electoral roll and % attendance. The effect of this is that the regression parameter estimates can change dramatically when both variables are included in the model, while the fit to the data, as measured by the residual sum of squares, changes very little. When this effect is extreme, estimation can become numerically unstable. An effective solution is to construct a simpler model by omitting variables, as we did in the Church of England example.
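
As a sketch of how such a simpler model might be found for the mtcars data (one standard approach, not necessarily the one taken in the lecture), base R's step() performs stepwise selection by AIC, starting from the full model:

# Stepwise selection by AIC; trace = 0 suppresses the step-by-step
# printout of candidate models
reduced <- step(model, trace = 0)
summary(reduced)

With strongly correlated competitors dropped, the coefficients that remain typically have much smaller estimated standard errors than in the full model.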