Chapter 26 Decision Trees / CART

Let’s go now to our third algorithm, decision trees. Decision trees are the most basic type of “tree-based model” (more complex tree-based models build on the decision tree). In tree-based models, the algorithm classifies data points using a series of decisions: at each “node” of the tree, the data are split, and the splits continue until every data point ends up in a leaf and receives a class. Tree-based algorithms are popular because they can be used for both classification and regression tasks, but simple decision trees are prone to overfitting. Sometimes you’ll see the name CART used instead of decision trees (CART stands for “Classification And Regression Trees”).
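
If you’d like to see a single tree in action before we fit one to our data, here is a minimal sketch using the rpart package on R’s built-in iris data (this example is an aside and not part of the tutorial’s dataset):

library(rpart)

# Fit a small classification tree: each internal node splits on one
# predictor, and splitting continues until the leaves are (mostly) pure
toy_tree <- rpart(Species ~ ., data = iris, method = "class")

# Print the sequence of splits the tree learned
print(toy_tree)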

For the model we will use the rpart method, which takes one tunable hyperparameter, cp (the “complexity parameter”: splits that do not improve the fit by at least a factor of cp are pruned away). We won’t set it ourselves in this tutorial, but you can learn more about cp here.

# Train a decision tree with caret; by default, caret tries a few
# values of cp and keeps the one with the best resampled accuracy
dt_mod <- train(x = tw_to_train,
                y = as.factor(conservative_code),
                method = "rpart",
                trControl = trctrl
                )

print(dt_mod)
## CART 
## 
##   92 samples
## 1035 predictors
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 92, 92, 92, 92, 92, 92, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa      
##   0.00000000  0.4884660  -0.03241605
##   0.03658537  0.4871756  -0.03533007
##   0.12195122  0.5051809  -0.02564487
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.1219512.
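
Caret tuned cp for us and kept the tree with the highest resampled accuracy. If you want to inspect that tree directly, it is stored in the finalModel slot of the train object; as a quick sketch, you could also draw it with the rpart.plot package (not used elsewhere in this tutorial, so install it first if you want the plot):

# The fitted rpart tree that caret kept after tuning cp
print(dt_mod$finalModel)

# Optional: visualize the splits (requires the rpart.plot package)
# install.packages("rpart.plot")
rpart.plot::rpart.plot(dt_mod$finalModel)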

Now to check with the test data!

26.1 Testing the Model

You should be familiar with this process by now! First, we’ll use predict()

dt_predict <- predict(dt_mod, newdata = tw_to_test)

… and then assess with confusionMatrix()!

dt_confusion_matrix <- caret::confusionMatrix(dt_predict,
                                              conservative_data$conservative[-trainIndex],
                                              mode = "prec_recall")
dt_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  0  0
##          1 17 21
##                                          
##                Accuracy : 0.5526         
##                  95% CI : (0.383, 0.7138)
##     No Information Rate : 0.5526         
##     P-Value [Acc > NIR] : 0.5668659      
##                                          
##                   Kappa : 0              
##                                          
##  Mcnemar's Test P-Value : 0.0001042      
##                                          
##               Precision :     NA         
##                  Recall : 0.0000         
##                      F1 :     NA         
##              Prevalence : 0.4474         
##          Detection Rate : 0.0000         
##    Detection Prevalence : 0.0000         
##       Balanced Accuracy : 0.5000         
##                                          
##        'Positive' Class : 0              
## 

Unfortunately, this algorithm does not work very well with our data: the confusion matrix shows that the final tree predicts class 1 for every document in the test set, which is why recall for the positive class (‘0’) is 0 and precision and F1 are NA. But try tuning the cp parameter and see if you can get a better result!
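
As a sketch of what that tuning could look like, you can hand train() an explicit grid of cp values through its tuneGrid argument. The grid values and the names dt_grid and dt_mod_tuned below are just for illustration:

# Search a finer grid of cp values than caret's default three
dt_grid <- expand.grid(cp = seq(0, 0.05, by = 0.005))

dt_mod_tuned <- train(x = tw_to_train,
                      y = as.factor(conservative_code),
                      method = "rpart",
                      trControl = trctrl,
                      tuneGrid = dt_grid)

# Compare the resampled accuracy across the new cp values
print(dt_mod_tuned)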

Learn more about decision trees:
* Learn about the algorithm behind decision trees in this Medium article.
* Learn about other tree-based models in this explanation (her posts, in general, are quite good).