Example
Model selection for Motor Trend Car Road Tests data
The data were extracted from the 1974 Motor Trend US magazine, and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). A full description of the data can be found with the R command ?mtcars.
These data contain 32 observations on 11 (numeric) variables.
mpg   Miles/(US) gallon
cyl   Number of cylinders
disp  Displacement (cu.in.)
hp    Gross horsepower
drat  Rear axle ratio
wt    Weight (1000 lbs)
qsec  1/4 mile time
vs    Engine (0 = V-shaped, 1 = straight)
am    Transmission (0 = automatic, 1 = manual)
gear  Number of forward gears
carb  Number of carburetors
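A quick look at the data (a minimal sketch using only base R, since mtcars ships with R) confirms this structure:

```r
# Inspect the built-in mtcars data set
head(mtcars)   # first six rows
dim(mtcars)    # 32 observations on 11 variables
str(mtcars)    # all 11 variables are stored as numeric
```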
Our goal is to select the subset of variables that best predicts mpg (miles per gallon) from four possible predictors: disp, hp, wt and qsec.
The table below shows every possible combination of these predictors and the values of several variable-selection criteria for each resulting model.
library(knitr)
library(kableExtra)
library(olsrr)
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
model.selection <- ols_step_all_possible(model)
model.selection
## Index N Predictors R-Square Adj. R-Square Mallow's Cp
## 3 1 1 wt 0.7528328 0.7445939 0.70869536
## 1 2 1 disp 0.7183433 0.7089548 0.67512054
## 2 3 1 hp 0.6024373 0.5891853 0.50969578
## 4 4 1 qsec 0.1752963 0.1478062 0.07541973
## 8 5 2 hp wt 0.8267855 0.8148396 0.78108710
## 10 6 2 wt qsec 0.8264161 0.8144448 0.77856272
## 6 7 2 disp wt 0.7809306 0.7658223 0.72532105
## 5 8 2 disp hp 0.7482402 0.7308774 0.69454380
## 7 9 2 disp qsec 0.7215598 0.7023571 0.66395284
## 9 10 2 hp qsec 0.6368769 0.6118339 0.52014395
## 14 11 3 hp wt qsec 0.8347678 0.8170643 0.78199548
## 11 12 3 disp hp wt 0.8268361 0.8082829 0.76789526
## 13 13 3 disp wt qsec 0.8264170 0.8078189 0.76988533
## 12 14 3 disp hp qsec 0.7541953 0.7278591 0.68301440
## 15 15 4 disp hp wt qsec 0.8351443 0.8107212 0.77102968
You can plot these results to show the information from the table, with a red triangle indicating the best-scoring combination for each value of N (the number of predictors included in the model). The number next to each best value is the row number of that model in the table above. Please try this in R using the code provided.
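As a sketch (assuming the olsrr package is installed), olsrr provides a plot method for the object returned by ols_step_all_possible, which draws one panel per criterion:

```r
library(olsrr)

# Fit the full model and enumerate all predictor subsets
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
model.selection <- ols_step_all_possible(model)

# One panel per criterion; the best model for each subset
# size N is highlighted with a red triangle
plot(model.selection)
```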
Notice in the model.selection output that each model contains between one and four variables. ols_step_best_subset picks out the best model for each model size (i.e. the best model with one variable, with two variables, with three variables and with four variables), and compares only across those best models of different sizes. The following code shows one model of each size and compares them.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
model.selection <- ols_step_best_subset(model)
library(data.table)
# Depending on your olsrr version, the criteria are stored either in
# the $metrics data frame or directly on the returned object; accessing
# the wrong one silently yields NULLs (and an empty data.table).
metrics <- if (!is.null(model.selection$metrics)) model.selection$metrics else model.selection
data.table(Model = metrics$predictors,
           AIC = metrics$aic,
           Cp = metrics$cp, BIC = metrics$sbc,
           R2 = metrics$rsquare,
           R2adj = metrics$adjr)
Now we can see that different criteria choose different models as the best. For example, the \(R^2\)(adj) criterion chooses Model 3 (highest \(R^2\)(adj) value), i.e. the predictors hp, wt and qsec, whereas AIC picks Model 2 (lowest AIC) with predictors hp and wt. \(C_p\) suggests Model 2 or Model 3 (notice this is not an exact rule here) and BIC picks Model 2.