Chapter 27 Random Forest

Random forests are a more advanced application of decision trees/CART. In fact, they are technically an ensemble algorithm because they combine the results of multiple decision trees to classify data. Because random forests are a more complex algorithm, they require a few more hyperparameters. We set three here (there are others): mtry, the number of variables randomly sampled as candidate splitting variables at each split; splitrule, the rule used to decide how to split the data at each node (you can learn more about the math from the original article); and min.node.size, the minimum number of observations in a terminal node, which indirectly controls tree depth (each tree keeps branching until its nodes reach this minimum size; we use 5 data points below).
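
A common rule of thumb sets mtry to the square root of the number of predictors, which is what the floor(sqrt(...)) expression below computes. With the 1,035 predictors in our training data, that works out to 32 (the value you will see held constant in the model output):

floor(sqrt(1035))
## [1] 32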

To use random forests, we will change the method argument to "ranger" and add a tuneGrid.

rf_mod <- train(x = tw_to_train,
                y = as.factor(conservative_code),
                method = "ranger",  # random forests via the ranger package
                trControl = trctrl,
                tuneGrid = data.frame(mtry = floor(sqrt(dim(tw_to_train)[2])),  # sqrt of the number of predictors
                                      splitrule = "extratrees",
                                      min.node.size = 5))

print(rf_mod)
## Random Forest 
## 
##   92 samples
## 1035 predictors
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 92, 92, 92, 92, 92, 92, ... 
## Resampling results:
## 
##   Accuracy   Kappa     
##   0.5180799  0.01320287
## 
## Tuning parameter 'mtry' was held constant at a value of 32
## Tuning
##  parameter 'splitrule' was held constant at a value of extratrees
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
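
As the output notes, each tuning parameter was held constant at a single value. If you instead want caret to compare settings, pass a tuneGrid with more than one row; train() will fit a model for each combination and keep the one with the best resampled accuracy. A minimal sketch, reusing the same objects (the candidate values below are illustrative, not recommendations):

rf_grid <- expand.grid(mtry = c(16, 32, 64),
                       splitrule = c("gini", "extratrees"),
                       min.node.size = c(1, 5, 10))

rf_tuned <- train(x = tw_to_train,
                  y = as.factor(conservative_code),
                  method = "ranger",
                  trControl = trctrl,
                  tuneGrid = rf_grid)

Keep in mind that this grid fits 18 candidate models (3 × 2 × 3 combinations), each across 25 bootstrap resamples, so it takes proportionally longer to run.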

Then, we test the model on our test set.

27.1 Testing the Model

rf_predict <- predict(rf_mod, newdata = tw_to_test)
rf_confusion_matrix <- caret::confusionMatrix(rf_predict,
                                              conservative_data$conservative[-trainIndex],
                                              mode = "prec_recall")
rf_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  1  1
##          1 16 20
##                                          
##                Accuracy : 0.5526         
##                  95% CI : (0.383, 0.7138)
##     No Information Rate : 0.5526         
##     P-Value [Acc > NIR] : 0.566866       
##                                          
##                   Kappa : 0.0122         
##                                          
##  Mcnemar's Test P-Value : 0.000685       
##                                          
##               Precision : 0.50000        
##                  Recall : 0.05882        
##                      F1 : 0.10526        
##              Prevalence : 0.44737        
##          Detection Rate : 0.02632        
##    Detection Prevalence : 0.05263        
##       Balanced Accuracy : 0.50560        
##                                          
##        'Positive' Class : 0              
## 
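
The printed summary is handy, but if you want to pull individual metrics out of the confusionMatrix object (for example, to build a comparison table across the algorithms in this book), they are stored as named elements. A quick sketch using the object created above:

# overall statistics are a named vector
rf_confusion_matrix$overall["Accuracy"]

# per-class statistics, including the precision/recall family
rf_confusion_matrix$byClass[c("Precision", "Recall", "F1")]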

Though the F1 score is better than what we saw with the decision tree algorithm, it is still low compared to the other algorithms we have tested.

Learn more about Random Forests:

* Analytics Vidhya explanation & Python tutorial
* Stack Overflow thread on node sizes, useful for hyperparameters
* This UC Business Analytics tutorial does a great job discussing the hyperparameters, including the ones not discussed in this tutorial.
* A good tutorial on random forests that includes plots.
* Another good tutorial on random forests, using the randomForest() function rather than random forests in the ranger package.