10.4 Random Forests

Random forests improve on bagged trees by way of a small tweak that de-correlates the trees. As in bagging, the algorithm builds a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of mtry predictors is chosen as split candidates from the full set of p predictors. A fresh sample of mtry predictors is taken at each split. Typically \(mtry \approx \sqrt{p}\). Bagged trees are thus a special case of random forests where mtry = p.
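
To make the mtry = p connection concrete, here is a minimal sketch using the ranger package directly (outside the caret workflow below, and assuming the oj_train data frame created earlier in this chapter): the default mtry is roughly \(\sqrt{p}\), while setting mtry to the full predictor count reproduces bagging.

library(ranger)
p   <- ncol(oj_train) - 1                                            # 17 predictors in the OJ data
rf  <- ranger(Purchase ~ ., data = oj_train, mtry = floor(sqrt(p)))  # random forest
bag <- ranger(Purchase ~ ., data = oj_train, mtry = p)               # equivalent to bagging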

10.4.0.1 Bagging Classification Example

Again using the OJ data set to predict Purchase, this time I’ll use the bagging method by specifying method = "treebag". I’ll use tuneLength = 5 and not worry about tuneGrid anymore; caret has no hyperparameters to tune for this model, so the tuning settings have no real effect here.

oj.bag = train(Purchase ~ ., 
               data = oj_train, 
               method = "treebag",  # for bagging
               tuneLength = 5,  # evaluate up to 5 values of each tuning parameter
               metric = "ROC",  # evaluate hyperparameter combinations with ROC
               trControl = trainControl(
                 method = "cv",  # k-fold cross validation
                 number = 10,  # k=10 folds
                 savePredictions = "final",  # save predictions for the optimal tuning parameters
                 classProbs = TRUE,  # return class probabilities in addition to predicted values
                 summaryFunction = twoClassSummary  # for binary response variable
                 )
               )
oj.bag
## Bagged CART 
## 
## 857 samples
##  17 predictor
##   2 classes: 'CH', 'MM' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 771, 772, 771, 771, 771, 772, ... 
## Resampling results:
## 
##   ROC   Sens  Spec
##   0.85  0.82  0.72
oj.pred <- predict(oj.bag, oj_test, type = "raw")
plot(oj_test$Purchase, oj.pred, 
     main = "Bagging Classification: Predicted vs. Actual",
     xlab = "Actual",
     ylab = "Predicted")

(oj.conf <- confusionMatrix(data = oj.pred, 
                            reference = oj_test$Purchase))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CH  MM
##         CH 110  17
##         MM  20  66
##                                           
##                Accuracy : 0.826           
##                  95% CI : (0.769, 0.875)  
##     No Information Rate : 0.61            
##     P-Value [Acc > NIR] : 0.00000000000712
##                                           
##                   Kappa : 0.637           
##                                           
##  Mcnemar's Test P-Value : 0.742           
##                                           
##             Sensitivity : 0.846           
##             Specificity : 0.795           
##          Pos Pred Value : 0.866           
##          Neg Pred Value : 0.767           
##              Prevalence : 0.610           
##          Detection Rate : 0.516           
##    Detection Prevalence : 0.596           
##       Balanced Accuracy : 0.821           
##                                           
##        'Positive' Class : CH              
## 
oj.bag.acc <- as.numeric(oj.conf$overall[1])
rm(oj.pred)
rm(oj.conf)
plot(varImp(oj.bag), main = "Variable Importance with Bagging Classification")

10.4.0.2 Random Forest Classification Example

Now I’ll try it with the random forest method by specifying method = "ranger". I’ll stick with tuneLength = 5. Caret tunes three hyperparameters:

  • mtry: number of randomly selected predictors. Default is sqrt(p).
  • splitrule: splitting rule. For classification, options are “gini” (default) and “extratrees”.
  • min.node.size: minimal node size. Default is 1 for classification.
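
Instead of tuneLength, these three parameters could also be specified explicitly through tuneGrid. A hedged sketch (the grid values below are illustrative, not the combinations caret ends up selecting):

rf.grid <- expand.grid(
  mtry = c(2, 5, 9),
  splitrule = c("gini", "extratrees"),
  min.node.size = 1
)
# pass tuneGrid = rf.grid to train() below in place of tuneLength
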
oj.frst = train(Purchase ~ ., 
               data = oj_train, 
               method = "ranger",  # for random forest
               tuneLength = 5,  # evaluate up to 5 values of each tuning parameter
               metric = "ROC",  # evaluate hyperparamter combinations with ROC
               trControl = trainControl(
                 method = "cv",  # k-fold cross validation
                 number = 10,  # 10 folds
                 savePredictions = "final",       # save predictions for the optimal tuning parameter1
                 classProbs = TRUE,  # return class probabilities in addition to predicted values
                 summaryFunction = twoClassSummary  # for binary response variable
                 )
               )
oj.frst
## Random Forest 
## 
## 857 samples
##  17 predictor
##   2 classes: 'CH', 'MM' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 772, 771, 772, 770, 772, 772, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   ROC   Sens  Spec
##    2    gini        0.86  0.87  0.69
##    2    extratrees  0.85  0.88  0.63
##    5    gini        0.87  0.85  0.72
##    5    extratrees  0.86  0.86  0.69
##    9    gini        0.87  0.84  0.73
##    9    extratrees  0.87  0.85  0.69
##   13    gini        0.86  0.83  0.74
##   13    extratrees  0.87  0.83  0.71
##   17    gini        0.86  0.81  0.74
##   17    extratrees  0.86  0.83  0.71
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 9, splitrule = gini
##  and min.node.size = 1.
plot(oj.frst)

oj.pred <- predict(oj.frst, oj_test, type = "raw")
plot(oj_test$Purchase, oj.pred, 
     main = "Random Forest Classification: Predicted vs. Actual",
     xlab = "Actual",
     ylab = "Predicted")

(oj.conf <- confusionMatrix(data = oj.pred, 
                            reference = oj_test$Purchase))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CH  MM
##         CH 109  16
##         MM  21  67
##                                           
##                Accuracy : 0.826           
##                  95% CI : (0.769, 0.875)  
##     No Information Rate : 0.61            
##     P-Value [Acc > NIR] : 0.00000000000712
##                                           
##                   Kappa : 0.639           
##                                           
##  Mcnemar's Test P-Value : 0.511           
##                                           
##             Sensitivity : 0.838           
##             Specificity : 0.807           
##          Pos Pred Value : 0.872           
##          Neg Pred Value : 0.761           
##              Prevalence : 0.610           
##          Detection Rate : 0.512           
##    Detection Prevalence : 0.587           
##       Balanced Accuracy : 0.823           
##                                           
##        'Positive' Class : CH              
## 
oj.frst.acc <- as.numeric(oj.conf$overall[1])
rm(oj.pred)
rm(oj.conf)
#plot(varImp(oj.frst), main="Variable Importance with Random Forest Classification")

The model algorithm explains “ROC was used to select the optimal model using the largest value. The final values used for the model were mtry = 9, splitrule = gini and min.node.size = 1.” You can see the results of the tuning grid combinations in the associated plot of ROC AUC vs mtry grouped by splitting rule.
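
The quoted sentence comes from caret’s print method; the same information can be pulled from the train object directly (standard caret fields, shown here as a quick sketch):

oj.frst$bestTune   # winning combination of mtry, splitrule, and min.node.size
oj.frst$results    # resampled ROC/Sens/Spec for every grid combination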

The bagging (accuracy = 0.826) and random forest (accuracy = 0.826) models fared pretty well, but the manual classification tree is still in first place. There’s still gradient boosting to investigate!

rbind(data.frame(model = "Manual Class", Accuracy = round(oj_model_1b_cm$overall["Accuracy"], 5)), 
      data.frame(model = "Caret w.tuneGrid", Accuracy = round(oj_model_3_cm$overall["Accuracy"], 5)),
      data.frame(model = "Bagging", Accuracy = round(oj.bag.acc, 5)),
      data.frame(model = "Random Forest", Accuracy = round(oj.frst.acc, 5))
) %>% arrange(desc(Accuracy))
##              model Accuracy
## 1     Manual Class     0.86
## 2 Caret w.tuneGrid     0.85
## 3          Bagging     0.83
## 4    Random Forest     0.83

10.4.0.3 Bagging Regression Example

Again using the Carseats data set to predict Sales, this time I’ll use the bagging method by specifying method = "treebag". I’ll use tuneLength = 5 and not worry about tuneGrid anymore; caret has no hyperparameters to tune for this model, so the tuning settings have no real effect here.

carseats.bag = train(Sales ~ ., 
               data = carseats_train, 
               method = "treebag",  # for bagging
               tuneLength = 5,  # evaluate up to 5 values of each tuning parameter
               metric = "RMSE",  # evaluate hyperparamter combinations with RMSE
               trControl = trainControl(
                 method = "cv",  # k-fold cross validation
                 number = 10,  # 10 folds
                 savePredictions = "final"       # save predictions for the optimal tuning parameter1
                 )
               )
carseats.bag
## Bagged CART 
## 
## 321 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 289, 289, 289, 288, 289, 289, ... 
## Resampling results:
## 
##   RMSE  Rsquared  MAE
##   1.7   0.65      1.4
#plot(carseats.bag$finalModel)
carseats.pred <- predict(carseats.bag, carseats_test, type = "raw")
plot(carseats_test$Sales, carseats.pred, 
     main = "Bagging Regression: Predicted vs. Actual",
     xlab = "Actual",
     ylab = "Predicted")
abline(0, 1)

(carseats.bag.rmse <- RMSE(pred = carseats.pred,
                           obs = carseats_test$Sales))
## [1] 1.9
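# Sanity check (a sketch, not part of the original workflow): caret's RMSE()
# is ordinary root mean squared error, so this should reproduce the value above.
sqrt(mean((carseats.pred - carseats_test$Sales)^2))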
rm(carseats.pred)
plot(varImp(carseats.bag), main="Variable Importance with Regression Bagging")

10.4.0.4 Random Forest Regression Example

Now I’ll try it with the random forest method by specifying method = "ranger". I’ll stick with tuneLength = 5. Caret tunes three hyperparameters:

  • mtry: number of randomly selected predictors
  • splitrule: splitting rule. For regression, options are “variance” (default), “extratrees”, and “maxstat”.
  • min.node.size: minimal node size. Default is 5 for regression.
carseats.frst = train(Sales ~ ., 
               data = carseats_train, 
               method = "ranger",  # for random forest
               tuneLength = 5,  # evaluate up to 5 values of each tuning parameter
               metric = "RMSE",  # evaluate hyperparamter combinations with RMSE
               trControl = trainControl(
                 method = "cv",  # k-fold cross validation
                 number = 10,  # 10 folds
                 savePredictions = "final"       # save predictions for the optimal tuning parameter1
                 )
               )
carseats.frst
## Random Forest 
## 
## 321 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 289, 289, 289, 289, 289, 288, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE  Rsquared  MAE
##    2    variance    1.8   0.70      1.4
##    2    extratrees  1.9   0.65      1.5
##    4    variance    1.6   0.73      1.3
##    4    extratrees  1.7   0.69      1.4
##    6    variance    1.5   0.73      1.2
##    6    extratrees  1.6   0.70      1.3
##    8    variance    1.5   0.72      1.2
##    8    extratrees  1.6   0.71      1.3
##   11    variance    1.6   0.72      1.2
##   11    extratrees  1.6   0.71      1.3
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 6, splitrule = variance
##  and min.node.size = 5.
plot(carseats.frst)

carseats.pred <- predict(carseats.frst, carseats_test, type = "raw")
plot(carseats_test$Sales, carseats.pred, 
     main = "Random Forest Regression: Predicted vs. Actual",
     xlab = "Actual",
     ylab = "Predicted")
abline(0, 1)

(carseats.frst.rmse <- RMSE(pred = carseats.pred,
                           obs = carseats_test$Sales))
## [1] 1.8
rm(carseats.pred)
#plot(varImp(carseats.frst), main="Variable Importance with Regression Random Forest")
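
The varImp() call above (and the one in the classification example) is commented out because ranger does not compute variable importance unless asked to. One way to get the plot, sketched here on the assumption that extra arguments in train() are forwarded to ranger(), is to refit with importance = "impurity":

carseats.frst.imp <- train(Sales ~ ., data = carseats_train, method = "ranger",
                           tuneLength = 5, metric = "RMSE",
                           importance = "impurity",  # forwarded to ranger()
                           trControl = trainControl(method = "cv", number = 10))
plot(varImp(carseats.frst.imp), main = "Variable Importance with Regression Random Forest")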

The model algorithm explains “RMSE was used to select the optimal model using the smallest value. The final values used for the model were mtry = 6, splitrule = variance and min.node.size = 5.” You can see the results of the tuning grid combinations in the associated plot of RMSE vs mtry grouped by splitting rule.

The bagging and random forest models fared very well - they took over first and second place!

rbind(data.frame(model = "Manual ANOVA", RMSE = round(carseats_model_1_pruned_rmse, 5)), 
      data.frame(model = "ANOVA w.tuneGrid", RMSE = round(carseats_model_3_pruned_rmse, 5)),
      data.frame(model = "Bagging", RMSE = round(carseats.bag.rmse, 5)),
      data.frame(model = "Random Forest", RMSE = round(carseats.frst.rmse, 5))
) %>% arrange(RMSE)
##              model RMSE
## 1    Random Forest  1.8
## 2          Bagging  1.9
## 3 ANOVA w.tuneGrid  2.3
## 4     Manual ANOVA  2.4