Example
Model selection for Motor Trend Car Road Tests data
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). A full description of the data can be found with the R command ?mtcars.
These data contain 32 observations on 11 (numeric) variables:

mpg: Miles/(US) gallon
cyl: Number of cylinders
disp: Displacement (cu.in.)
hp: Gross horsepower
drat: Rear axle ratio
wt: Weight (1000 lbs)
qsec: 1/4 mile time
vs: Engine (0 = V-shaped, 1 = straight)
am: Transmission (0 = automatic, 1 = manual)
gear: Number of forward gears
carb: Number of carburetors
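A quick way to load and inspect the data is sketched below, using only base R since mtcars ships with R:

data(mtcars)   # built-in Motor Trend data set
str(mtcars)    # 32 observations on 11 numeric variables
head(mtcars)   # first few rows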
Our goal is to select the variables that best predict mpg (miles per gallon) from four possible predictors: disp, hp, wt and qsec.
The table below shows all possible combinations of these predictors and the corresponding values of various criteria for variable selection.
library(knitr)
library(kableExtra)
library(olsrr)

# Fit the full model with all four candidate predictors
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)

# Evaluate every possible subset of the predictors
model.selection <- ols_step_all_possible(model)
model.selection
##    Index N Predictors  R-Square Adj. R-Square Mallow's Cp
## 3 1 1 wt 0.7528328 0.7445939 0.70869536
## 1 2 1 disp 0.7183433 0.7089548 0.67512054
## 2 3 1 hp 0.6024373 0.5891853 0.50969578
## 4 4 1 qsec 0.1752963 0.1478062 0.07541973
## 8 5 2 hp wt 0.8267855 0.8148396 0.78108710
## 10 6 2 wt qsec 0.8264161 0.8144448 0.77856272
## 6 7 2 disp wt 0.7809306 0.7658223 0.72532105
## 5 8 2 disp hp 0.7482402 0.7308774 0.69454380
## 7 9 2 disp qsec 0.7215598 0.7023571 0.66395284
## 9 10 2 hp qsec 0.6368769 0.6118339 0.52014395
## 14 11 3 hp wt qsec 0.8347678 0.8170643 0.78199548
## 11 12 3 disp hp wt 0.8268361 0.8082829 0.76789526
## 13 13 3 disp wt qsec 0.8264170 0.8078189 0.76988533
## 12 14 3 disp hp qsec 0.7541953 0.7278591 0.68301440
## 15 15 4 disp hp wt qsec 0.8351443 0.8107212 0.77102968
You can plot these results to show the information from the table, with a red triangle indicating the best-scoring combination for each value of N (the number of predictors included in the model). The number next to each best value is the row number of that model in the table above. Please try this in R using the code below.
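A minimal sketch of the plotting call, assuming the model.selection object created above (olsrr provides a plot method for the result of ols_step_all_possible):

# Plot the selection criteria against the number of predictors,
# highlighting the best model of each size
plot(model.selection)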
Notice that in the model.selection output, each model contains between one and four variables. ols_step_best_subset picks out the best model of each size (i.e. the best model with one variable, the best with two variables, the best with three and the best with four) and compares only those best models across the different sizes. The following output shows just one model of each size and compares them.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
model.selection <- ols_step_best_subset(model)
library(data.table)

# Collect the selection criteria for the best model of each size in one table.
# Note: in recent versions of olsrr these fields may instead live under
# model.selection$metrics; if the columns below come back NULL, adjust the
# accessors accordingly or simply print model.selection.
data.table(Model = model.selection$predictors,
           AIC = model.selection$aic,
           Cp = model.selection$cp,
           BIC = model.selection$sbc,
           R2 = model.selection$rsquare,
           R2adj = model.selection$adjr)
Now we can see that different criteria choose different models as the best. For example, the \(R^2\)(adj) criterion chooses Model 3 (highest \(R^2\)(adj) value), i.e. the predictors hp, wt and qsec, whereas AIC picks Model 2 (lowest AIC) with predictors hp and wt. \(C_p\) suggests Model 2 or Model 3 (the usual guideline is a \(C_p\) value close to the number of parameters in the model, so this is not an exact rule here), and BIC picks Model 2.
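As a quick cross-check, AIC and BIC can also be computed directly with base R on the best model of each size (a sketch assuming the mtcars data as above; the rankings should match those reported by olsrr, although the absolute values may differ by an additive constant depending on how each criterion is defined):

# Best model of each size, refit by hand
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ hp + wt, data = mtcars)
m3 <- lm(mpg ~ hp + wt + qsec, data = mtcars)
m4 <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)

# Lower values are better for both criteria
AIC(m1, m2, m3, m4)
BIC(m1, m2, m3, m4)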