Example

Model selection for Motor Trend Car Road Tests data

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). A full description of the data can be found with the R command

?mtcars

These data contain 32 observations on 11 (numeric) variables.

  1. mpg Miles/(US) gallon
  2. cyl Number of cylinders
  3. disp Displacement (cu.in.)
  4. hp Gross horsepower
  5. drat Rear axle ratio
  6. wt Weight (1000 lbs)
  7. qsec 1/4 mile time
  8. vs Engine (0 = V-shaped, 1 = straight)
  9. am Transmission (0 = automatic, 1 = manual)
  10. gear Number of forward gears
  11. carb Number of carburetors

Our goal is to select the variable which predicts the variable mpg (miles per gallon) from four possible predictors disp, hp, wt and qsec.

The below table shows all possible combinations of variables for the model and the values of various criteria for variable selection.

library(knitr)
library(kableExtra)
library(olsrr)
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
model.selection<-ols_step_all_possible(model)
model.selection
##    Index N      Predictors  R-Square Adj. R-Square Mallow's Cp
## 3      1 1              wt 0.7528328     0.7445939  0.70869536
## 1      2 1            disp 0.7183433     0.7089548  0.67512054
## 2      3 1              hp 0.6024373     0.5891853  0.50969578
## 4      4 1            qsec 0.1752963     0.1478062  0.07541973
## 8      5 2           hp wt 0.8267855     0.8148396  0.78108710
## 10     6 2         wt qsec 0.8264161     0.8144448  0.77856272
## 6      7 2         disp wt 0.7809306     0.7658223  0.72532105
## 5      8 2         disp hp 0.7482402     0.7308774  0.69454380
## 7      9 2       disp qsec 0.7215598     0.7023571  0.66395284
## 9     10 2         hp qsec 0.6368769     0.6118339  0.52014395
## 14    11 3      hp wt qsec 0.8347678     0.8170643  0.78199548
## 11    12 3      disp hp wt 0.8268361     0.8082829  0.76789526
## 13    13 3    disp wt qsec 0.8264170     0.8078189  0.76988533
## 12    14 3    disp hp qsec 0.7541953     0.7278591  0.68301440
## 15    15 4 disp hp wt qsec 0.8351443     0.8107212  0.77102968

You can plot these results to shows the information from the table, with the red triangle indicating the best scoring combination for each value of N (the number of predictors included in the model). The number next to the best values is the row number of that model in the above table. Please try this in R using the code provided.

plot(model.selection)

Notice in the model.selection output, each model contains between one and four variables. ols_best_subset picks out the best model for each variable size (i.e. a model with one variable, a model with two variable, a model with three variables and a model with four variables) only uses those models and compares across the best models of different size. The following output shows only one model from each size and compares them.

model <- lm(mpg ~ disp + hp + wt +
    qsec, data = mtcars)
model.selection <- ols_step_best_subset(model)

library(data.table)
data.table(Model = model.selection$predictors,
    AIC = model.selection$aic,
    Cp = model.selection$cp, BIC = model.selection$sbc,
    R2 = model.selection$rsquare,
    R2adj = model.selection$adjr)
## Null data.table (0 rows and 0 cols)

Now we can see that different criteria chooses different models as the best. For example the \(R^2\)(adj) criteria chooses Model 3 (highest \(R^2\)(adj) value) i.e. the predictors hp, wt and qsec, whereas AIC picks Model 2 (lowest AIC) with predictors hp and wt. \(C_p\) suggest Model 2 or Model 3 (notice this is not an exact rule here) and BIC picks Model 2.