Chapter 26 Decision Trees / CART
Let’s go now to our third algorithm, decision trees. Decision trees are the most basic type of “tree-based model” (more complex tree-based models build on the decision tree). In tree-based models, the algorithm decides how to classify data points using a series of decisions. At each “node” of a decision tree, the data are split further and further until each data point is classified. Tree-based algorithms are popular because they can be used for both classification and regression tasks, but simple decision trees can be prone to over-fitting. Sometimes, you’ll see the name CART used instead of decision trees (CART stands for “Classification And Regression Trees”).
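To see what these node-by-node splits look like in practice, here is a toy sketch that fits a standalone tree with rpart. It uses R’s built-in iris data, not our Twitter data, purely for illustration:
# Toy illustration on the built-in iris data (not our tutorial data):
# at each node, rpart picks the split that best separates the classes
library(rpart)
toy_tree <- rpart(Species ~ ., data = iris)
print(toy_tree)  # text view of the split rule chosen at each node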
For this model we will use rpart, which takes one optional hyperparameter, cp. We won’t tune it in this tutorial, but you can learn more about cp here.
dt_mod <- train(x = tw_to_train,
                y = as.factor(conservative_code),
                method = "rpart",
                trControl = trctrl
                )
print(dt_mod)
## CART
##
## 92 samples
## 1035 predictors
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 92, 92, 92, 92, 92, 92, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.00000000 0.4884660 -0.03241605
## 0.03658537 0.4871756 -0.03533007
## 0.12195122 0.5051809 -0.02564487
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.1219512.
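Before testing, you can also inspect the tree that caret kept: the fitted rpart object is stored in dt_mod$finalModel. As a sketch (the rpart.plot package is an extra assumption here; it isn’t used elsewhere in this tutorial):
print(dt_mod$finalModel)                     # text view of the splits at each node
# rpart.plot::rpart.plot(dt_mod$finalModel)  # graphical view, if rpart.plot is installed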
Now to check with the test data!
26.1 Testing the Model
You should be familiar with this process by now! First, we’ll use predict()
…
dt_predict <- predict(dt_mod, newdata = tw_to_test)
… and then assess with confusionMatrix()
!
dt_confusion_matrix <- caret::confusionMatrix(dt_predict,
                                              conservative_data$conservative[-trainIndex],
                                              mode = "prec_recall")
dt_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 17 21
##
## Accuracy : 0.5526
## 95% CI : (0.383, 0.7138)
## No Information Rate : 0.5526
## P-Value [Acc > NIR] : 0.5668659
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.0001042
##
## Precision : NA
## Recall : 0.0000
## F1 : NA
## Prevalence : 0.4474
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
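If you’d rather pull these statistics out programmatically than read them off the printed summary, the confusionMatrix object stores them in named components:
dt_confusion_matrix$overall["Accuracy"]  # overall accuracy
dt_confusion_matrix$byClass["Recall"]    # recall for the positive class ('0')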
Unfortunately, this algorithm does not work very well with our data: the fitted tree predicts class ‘1’ for every tweet, so recall for the positive class (‘0’) is zero and precision is undefined. But try tuning the cp
parameter and see if you can get a better result!
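One way to do that is to hand train() an explicit grid of cp values via tuneGrid. Here is a minimal sketch, reusing the objects from above; the grid values are arbitrary choices, not recommendations:
# Try a range of cp values instead of caret's default grid
dt_tuned <- train(x = tw_to_train,
                  y = as.factor(conservative_code),
                  method = "rpart",
                  trControl = trctrl,
                  tuneGrid = expand.grid(cp = seq(0, 0.2, by = 0.02)))
print(dt_tuned)  # compare resampled accuracy across the cp grid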
Learn more about decision trees:
* Learn about the algorithm behind decision trees in this Medium article.
* Learn about other tree-based models in this explanation (her posts, in general, are quite good).