Chapter 5 R Lab 4 - 05/05/2022
In this lecture we will learn how to implement classification trees and their extensions (bagging and random forests). We will also introduce the ROC curve and the AUC, which are useful for evaluating classification algorithms.
The following packages are required: rpart, rpart.plot, randomForest, MASS, tidyverse and pROC.
library(rpart) #for CART
library(rpart.plot) #for plotting CART
library(randomForest) #for bagging+RF
library(MASS) # for QDA
library(tidyverse)
library(pROC)
5.1 Credit data and classification tree
The file credit.csv contains information about the default status (yes or no) of more than 500 individuals. Together with the response variable, named default, 4 regressors are available. Below is a preview of the data:
credit = read.csv("/Users/marco/Google Drive Personale/Formazione/3_Unibg/2021-22/ML_tutor/RLabs/Lab4/credit.csv")
glimpse(credit)
## Rows: 522
## Columns: 5
## $ months_loan_duration <int> 48, 42, 24, 36, 30, 12, 48, 12, 24, 15, 24, 24, 6…
## $ percent_of_income <int> 2, 2, 3, 2, 4, 3, 3, 1, 4, 2, 4, 4, 2, 1, 3, 1, 3…
## $ years_at_residence <int> 2, 4, 4, 2, 2, 1, 4, 1, 4, 4, 2, 2, 3, 3, 4, 2, 3…
## $ age <int> 22, 45, 53, 35, 28, 25, 24, 22, 60, 28, 32, 44, 4…
## $ default <chr> "yes", "no", "yes", "no", "yes", "yes", "yes", "n…
We transform the response variable into a factor:
credit$default = factor(credit$default)
levels(credit$default)
## [1] "no" "yes"
The percentage distribution of the response variable is the following:
#method 1
prop.table(table(credit$default))
##
## no yes
## 0.5574713 0.4425287
#method 2
credit %>%
  count(default) %>%
  mutate(perc = n/sum(n)*100)
## default n perc
## 1 no 291 55.74713
## 2 yes 231 44.25287
As usual let’s try to summarise the data with tables and plots:
summary(credit)
## months_loan_duration percent_of_income years_at_residence age
## Min. : 6.00 Min. :1.000 Min. :1.000 Min. :19.00
## 1st Qu.:12.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:26.00
## Median :18.00 Median :3.000 Median :3.000 Median :31.50
## Mean :21.34 Mean :2.946 Mean :2.826 Mean :34.89
## 3rd Qu.:26.75 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:41.00
## Max. :72.00 Max. :4.000 Max. :4.000 Max. :75.00
## default
## no :291
## yes:231
##
##
##
##
credit %>%
  ggplot() +
  geom_boxplot(aes(x=default, y=months_loan_duration))
# higher median duration value in case of default
credit %>%
  ggplot() +
  geom_boxplot(aes(x=default, y=age))
# similar distribution between the two classes for age
credit %>%
  ggplot() +
  geom_bar(aes(x=percent_of_income, fill=default),
           position="fill") +
  facet_wrap(~years_at_residence) +
  labs(title = "percent_of_income % distribution by default",
       subtitle = "Analysis stratified by years_at_residence")
# some small differences in the distribution of the default are visible across the different categories of percent_of_income and years_at_residence
We split the dataset into a training set (70%) and a test set (30%) by using the approach described in Section 3.3:
# Create a vector of indexes with a 70% random sample
set.seed(13, sample.kind="Rejection")
train_index = sample(1:nrow(credit), 0.7*nrow(credit))

# Subset the credit data frame to training indexes only
credit_train = credit[train_index, ]
credit_train %>%
  count(default) %>%
  mutate(n/sum(n))
## default n n/sum(n)
## 1 no 204 0.5589041
## 2 yes 161 0.4410959
# Exclude the training indexes to create the test set
credit_test = credit[-train_index, ]
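As a quick sanity check (a small sketch, not part of the original analysis), we can verify that the two subsets partition the data:
# The training and test sizes should sum to the full sample size
nrow(credit_train) + nrow(credit_test) == nrow(credit)
c(train = nrow(credit_train), test = nrow(credit_test))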
5.2 Classification tree
To estimate the classification tree we use the rpart function, already used before for the grade data. Here the selected method is class, as we are dealing with a classification problem. The Gini index will be used for splitting:
set.seed(13, sample.kind="Rejection")
credit_tree = rpart(formula = default ~ .,
                    data = credit_train,
                    method = "class")
credit_tree
## n= 365
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 365 161 no (0.5589041 0.4410959)
## 2) months_loan_duration< 47.5 336 137 no (0.5922619 0.4077381)
## 4) months_loan_duration< 17 159 50 no (0.6855346 0.3144654)
## 8) months_loan_duration< 7.5 33 5 no (0.8484848 0.1515152) *
## 9) months_loan_duration>=7.5 126 45 no (0.6428571 0.3571429)
## 18) months_loan_duration>=9.5 106 35 no (0.6698113 0.3301887) *
## 19) months_loan_duration< 9.5 20 10 no (0.5000000 0.5000000)
## 38) age>=32.5 9 2 no (0.7777778 0.2222222) *
## 39) age< 32.5 11 3 yes (0.2727273 0.7272727) *
## 5) months_loan_duration>=17 177 87 no (0.5084746 0.4915254)
## 10) age>=26.5 126 57 no (0.5476190 0.4523810)
## 20) months_loan_duration>=37.5 8 1 no (0.8750000 0.1250000) *
## 21) months_loan_duration< 37.5 118 56 no (0.5254237 0.4745763)
## 42) percent_of_income< 3.5 59 24 no (0.5932203 0.4067797) *
## 43) percent_of_income>=3.5 59 27 yes (0.4576271 0.5423729)
## 86) age< 30.5 21 9 no (0.5714286 0.4285714)
## 172) months_loan_duration< 22.5 7 1 no (0.8571429 0.1428571) *
## 173) months_loan_duration>=22.5 14 6 yes (0.4285714 0.5714286) *
## 87) age>=30.5 38 15 yes (0.3947368 0.6052632)
## 174) months_loan_duration>=25.5 11 4 no (0.6363636 0.3636364) *
## 175) months_loan_duration< 25.5 27 8 yes (0.2962963 0.7037037) *
## 11) age< 26.5 51 21 yes (0.4117647 0.5882353) *
## 3) months_loan_duration>=47.5 29 5 yes (0.1724138 0.8275862) *
It is possible to obtain the size of the tree by counting the number of nodes which are classified as leaf nodes:
sum(credit_tree$frame$var=="<leaf>")
## [1] 12
Here below we plot the classification tree, trying also different options of the rpart.plot function (see also http://www.milbo.org/rpart-plot/prp.pdf):
rpart.plot(x = credit_tree)
rpart.plot(x = credit_tree, extra=3)
rpart.plot(x = credit_tree, extra=4)
rpart.plot(x = credit_tree, type=0, extra=3)
rpart.plot(x = credit_tree, type=0, extra=0) #super simplified version
We compute now the predicted category for the observations in the test set:
set.seed(13, sample.kind="Rejection")
credit_treepred = predict(credit_tree,
                          newdata = credit_test,
                          type = "class")
head(credit_treepred)
## 3 6 8 10 11 21
## no no no no yes no
## Levels: no yes
Given the predicted category, it is possible to create the confusion matrix and to compute the performance indexes as described in Section 3.2. Here we compute only the accuracy for the sake of simplicity:
# Calculate the confusion matrix for the test set
table(credit_treepred,
      credit_test$default)
##
## credit_treepred no yes
## no 65 38
## yes 22 32
acctree = mean(credit_treepred==credit_test$default)
acctree
## [1] 0.6178344
We see that 61.78% of the test observations are correctly classified. This is quite a poor performance.
As described in Section 4.3, we evaluate the possibility of pruning the tree by using the information contained in the cptable:
cptab = data.frame(credit_tree$cptable)
cptab
## CP nsplit rel.error xerror xstd
## 1 0.11801242 0 1.0000000 1.0000000 0.05891905
## 2 0.02795031 1 0.8819876 0.9192547 0.05826241
## 3 0.01552795 3 0.8260870 0.9440994 0.05849777
## 4 0.01242236 7 0.7577640 0.9627329 0.05865475
## 5 0.01035197 8 0.7453416 0.9440994 0.05849777
## 6 0.01000000 11 0.7142857 0.9440994 0.05849777
bestcp = cptab$CP[which.min(cptab$xerror)]
bestcp
## [1] 0.02795031
In this case the minimum value of the cross-validation error (xerror) corresponds to a very small number of splits (only 1). There is no way to simplify the tree further, unless we choose nsplit=0, a choice which is not interesting for us because it would mean using no predictor at all for the credit model. Thus, in this specific case it doesn't make sense to implement the 1-SE approach described in Section 4.3: we directly consider the value of CP corresponding to the lowest value of xerror and to nsplit=1.
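For reference, here is a minimal sketch of how the 1-SE rule of Section 4.3 could be coded on the cptable (in this case it would select the same value of CP; the name cp_1se is ours):
# 1-SE rule (sketch): largest CP whose xerror is within one standard
# error of the minimum cross-validation error
se_threshold = min(cptab$xerror) + cptab$xstd[which.min(cptab$xerror)]
cp_1se = max(cptab$CP[cptab$xerror <= se_threshold])
cp_1se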
Let’s prune the tree by using the chosen value of CP:
credit_tree_prun = prune(credit_tree, cp=bestcp)
rpart.plot(x = credit_tree_prun, extra=3)
We can now compute the accuracy for the pruned tree:
credit_treepred_prun = predict(credit_tree_prun,
                               newdata = credit_test,
                               type = "class")
table(credit_treepred_prun,
      credit_test$default)
##
## credit_treepred_prun no yes
## no 87 62
## yes 0 8
acctree_prun = mean(credit_treepred_prun==credit_test$default)
acctree_prun
## [1] 0.6050955
acctree
## [1] 0.6178344
The pruned tree is basically a stump, and its accuracy is very similar to that of the original tree. Given the similar accuracy, we prefer the pruned tree because it has the advantage of being simpler (less complex).
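A simple way to quantify "simpler" is to reuse the leaf-count trick shown above (a quick sketch):
# Number of leaves of the full and pruned trees
c(full = sum(credit_tree$frame$var == "<leaf>"),
  pruned = sum(credit_tree_prun$frame$var == "<leaf>"))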
We now compute for the selected tree the predicted probabilities that will be required later for the ROC curve:
# Calculate predicted probabilities
credit_treepred_prun_prob = predict(credit_tree_prun,
                                    newdata = credit_test,
                                    type = "prob")
head(credit_treepred_prun_prob)
## no yes
## 3 0.5922619 0.4077381
## 6 0.5922619 0.4077381
## 8 0.5922619 0.4077381
## 10 0.5922619 0.4077381
## 11 0.5922619 0.4077381
## 21 0.5922619 0.4077381
Note that in this case we obtain a matrix with 2 columns (one for each category) and a number of rows given by the number of observations in the test set. For each row the prediction (also reported in the vector credit_treepred_prun) is given by the majority category (i.e. the category with the highest probability).
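For a two-class problem, the majority rule is equivalent to thresholding the "yes" probability at 0.5. A minimal sketch, assuming no exact ties at 0.5 (pred_manual is a name of ours):
# Rebuild the class predictions from the probability matrix
pred_manual = ifelse(credit_treepred_prun_prob[,"yes"] > 0.5, "yes", "no")
all(pred_manual == as.character(credit_treepred_prun))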
5.3 Bagging
Bagging has already been described in Section 4.4 for the case of a regression tree. It works similarly for a classification tree:
set.seed(1, sample.kind="Rejection")
credit_bag = randomForest(formula = default ~ .,
                          data = credit_train,
                          mtry = ncol(credit_train)-1, #minus the response
                          importance = T, #to estimate predictors importance
                          ntree = 500) #500 by default
credit_bag
##
## Call:
## randomForest(formula = default ~ ., data = credit_train, mtry = ncol(credit_train) - 1, importance = T, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 46.58%
## Confusion matrix:
## no yes class.error
## no 126 78 0.3823529
## yes 92 69 0.5714286
The output provides the out-of-bag (OOB) estimate of the error rate (an unbiased estimate of the test set error) and the corresponding confusion matrix. The OOB error could be used for discriminating between different ensemble classifiers (estimated, for example, with different values of ntree).
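As a sketch of such a comparison (the object credit_bag100 is introduced here only for illustration), we could refit bagging with fewer trees and compare the OOB error rates stored in the err.rate component:
set.seed(1, sample.kind="Rejection")
credit_bag100 = randomForest(default ~ ., data = credit_train,
                             mtry = ncol(credit_train)-1, ntree = 100)
# err.rate contains the OOB error computed after each additional tree
credit_bag100$err.rate[100, "OOB"]
credit_bag$err.rate[500, "OOB"]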
We now analyze graphically the importance of the regressors (this output is obtained by setting importance = T during the training phase):
varImpPlot(credit_bag)
The figure displays:
- in the left plot, the mean decrease (across all \(B\) trees) in prediction accuracy on the OOB samples when the values of a given variable are randomly permuted. This measures how much accuracy the model loses when the information carried by that variable is destroyed.
- in the right plot, the mean decrease (across all \(B\) trees) in node impurity resulting from splits on a given variable. For classification, node impurity is measured by the Gini index. This measures how much each variable contributes to the homogeneity of the nodes of the resulting forest.
In general, the higher the mean decrease in accuracy/Gini, the higher the importance of the variable in the model.
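The numeric values behind the plot can be extracted with the importance function (for a classification forest fitted with importance = T it also reports the per-class accuracy decreases):
importance(credit_bag)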
For the considered case we see that months_loan_duration and age are the two most important regressors (the former was also used in the single pruned tree).
We now compute the predictions and the test accuracy given by bagging:
credit_bag_pred = predict(credit_bag,
                          newdata=credit_test,
                          type="class")
head(credit_bag_pred)
## 3 6 8 10 11 21
## no no no no yes no
## Levels: no yes
accbag = mean(credit_test$default == credit_bag_pred)
acctree_prun
## [1] 0.6050955
accbag
## [1] 0.5605096
Comparing accuracies, bagging performs slightly worse than the single pruned tree.
We also compute the predicted probabilities that will be needed later for ROC:
credit_bag_pred_prob = predict(credit_bag,
                               newdata=credit_test,
                               type = "prob")
5.4 Random forest
Random forest is very similar to bagging; the only difference is in the specification of the mtry argument. For a classification problem, the number of regressors randomly selected at each split is set to \(\sqrt{p}\). Note that this value needs to be changed for regression trees (see ?randomForest):
set.seed(1, sample.kind="Rejection")
credit_rf = randomForest(formula = default ~ .,
                         data = credit_train,
                         mtry = sqrt(ncol(credit_train)-1),
                         importance = T,
                         ntree = 500)
print(credit_rf)
##
## Call:
## randomForest(formula = default ~ ., data = credit_train, mtry = sqrt(ncol(credit_train) - 1), importance = T, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 46.03%
## Confusion matrix:
## no yes class.error
## no 131 73 0.3578431
## yes 95 66 0.5900621
As before, we plot the importance measures for the regressors:
varImpPlot(credit_rf)
Finally, we compute the predictions of default by using the random forest model:
credit_rfpred = predict(object = credit_rf,
                        newdata = credit_test,
                        type = "class")
# Calculate the confusion matrix for the test set
table(credit_test$default, credit_rfpred)
## credit_rfpred
## no yes
## no 65 22
## yes 42 28
accrf = mean(credit_test$default == credit_rfpred)
acctree_prun #pruned
## [1] 0.6050955
accbag #bagged
## [1] 0.5605096
accrf #random forest
## [1] 0.5923567
# Compute predicted probabilities
credit_rf_pred_prob = predict(object = credit_rf,
                              newdata = credit_test,
                              type = "prob")
Comparing accuracies, we conclude that the random forest performs better than bagging and similarly to the single pruned tree. However, all the accuracies are very low.
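As a compact recap (a small sketch using tibble from the tidyverse), the three test accuracies can be collected in a table:
tibble(model = c("pruned tree", "bagging", "random forest"),
       accuracy = c(acctree_prun, accbag, accrf))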
5.5 Model comparison by using the ROC curves
To obtain the ROC curve we use the function roc contained in the pROC package. It is necessary to specify as arguments the vector of observed categories (response) and the vector of probabilities for the “yes” category (predictor).
By using the ROC curve introduced in Section ??, we compare the models implemented for the credit problem. For the sake of comparison, we will include another classification algorithm, QDA. This is useful to check whether the CART methodology works well enough on our data. Of course, the statistical assumptions behind each algorithm should be valid!
qdamod <- MASS::qda(default ~ .,
                    data=credit_train)
credit_qda_pred <- predict(qdamod, newdata=credit_test)
# We are interested in this object -> credit_qda_pred$posterior[,"yes"]
Given the ROC object, it is possible to plot the curve and to compute the area under the curve (AUC). The higher the AUC, the better the performance of our algorithm: a high AUC means our algorithm is close to the (utopian) perfect classifier, which has both sensitivity and specificity equal to 1 (in other words, 100% accuracy).
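To make the construction concrete, here is a sketch of a single ROC point computed by hand at threshold 0.5, assuming "yes" is treated as the positive class (qda_class05 is a name of ours):
qda_class05 = ifelse(credit_qda_pred$posterior[,"yes"] > 0.5, "yes", "no")
c(sensitivity = mean(qda_class05[credit_test$default == "yes"] == "yes"),
  specificity = mean(qda_class05[credit_test$default == "no"] == "no"))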
roc_qda = roc(credit_test$default,
              credit_qda_pred$posterior[,"yes"])
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
ggroc(roc_qda)
auc(roc_qda)
## Area under the curve: 0.729
Now let’s plot all the ROC curves together, in order to compare their performances. First of all, we need to create a different “roc” object for each trained algorithm:
roc_tree = roc(credit_test$default,
               credit_treepred_prun_prob[,"yes"])
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
roc_bag = roc(credit_test$default,
              credit_bag_pred_prob[,"yes"])
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
roc_rf = roc(credit_test$default,
             credit_rf_pred_prob[,"yes"])
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
Then we plot them together by passing a list. Some interactivity can help with the analysis:
ggroc(list(qda= roc_qda,
tree = roc_tree,
bag = roc_bag,
rf = roc_rf)) %>% plotly::ggplotly()
Note that the ROC curve for the single tree is practically a straight line. This is due to the fact that the single tree is a stump, so sensitivity and specificity (computed from the predictions) change very little as the probability threshold is modified. The bagging and random forest ROC curves are very similar. At the same time, if the QDA statistical assumptions hold, QDA would be the algorithm of our choice.
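This can be checked directly: the pruned stump returns only a couple of distinct “yes” probabilities, so the ROC curve has very few corners (a quick sketch):
table(round(credit_treepred_prun_prob[,"yes"], 4))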
Finally, we compute the area under the curve (AUC) for the four methods:
auc(roc_qda)
## Area under the curve: 0.729
auc(roc_tree)
## Area under the curve: 0.5571
auc(roc_bag)
## Area under the curve: 0.6071
auc(roc_rf)
## Area under the curve: 0.6099
The AUC values of the tree-based methods are very similar and very low, denoting a poor performance of these classifiers; QDA does somewhat better. This is probably a problem mainly related to the regressors used in the analysis, which are not informative enough for describing the pattern of the default variable.
Extra homework for you: starting from this point, why don’t you try to evaluate other classification algorithms (such as LDA, KNN and logistic regression)?
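A possible starting point for the logistic case (a sketch only; logmod, log_prob and roc_log are names of ours):
# Logistic regression on the training data, then ROC/AUC on the test data
logmod = glm(default ~ ., data = credit_train, family = "binomial")
log_prob = predict(logmod, newdata = credit_test, type = "response")
roc_log = roc(credit_test$default, log_prob)
auc(roc_log)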
5.6 Exercises Lab 4
5.6.1 Exercise 1
We use the OJ dataset which is part of the ISLR library. The data refer to 1070 purchases where the customer purchased either Citrus Hill (CH) or Minute Maid (MM) orange juice. The response variable is Purchase. A number of characteristics of the customers are recorded (see ?OJ).
library(ISLR)
library(tidyverse)
dim(OJ)
## [1] 1070 18
glimpse(OJ)
## Rows: 1,070
## Columns: 18
## $ Purchase <fct> CH, CH, CH, MM, CH, CH, CH, CH, CH, CH, CH, CH, CH, CH,…
## $ WeekofPurchase <dbl> 237, 239, 245, 227, 228, 230, 232, 234, 235, 238, 240, …
## $ StoreID <dbl> 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1, 2, 2…
## $ PriceCH <dbl> 1.75, 1.75, 1.86, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75, 1…
## $ PriceMM <dbl> 1.99, 1.99, 2.09, 1.69, 1.69, 1.99, 1.99, 1.99, 1.99, 1…
## $ DiscCH <dbl> 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ DiscMM <dbl> 0.00, 0.30, 0.00, 0.00, 0.00, 0.00, 0.40, 0.40, 0.40, 0…
## $ SpecialCH <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SpecialMM <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0…
## $ LoyalCH <dbl> 0.500000, 0.600000, 0.680000, 0.400000, 0.956535, 0.965…
## $ SalePriceMM <dbl> 1.99, 1.69, 2.09, 1.69, 1.69, 1.99, 1.59, 1.59, 1.59, 1…
## $ SalePriceCH <dbl> 1.75, 1.75, 1.69, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75, 1…
## $ PriceDiff <dbl> 0.24, -0.06, 0.40, 0.00, 0.00, 0.30, -0.10, -0.16, -0.1…
## $ Store7 <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
## $ PctDiscMM <dbl> 0.000000, 0.150754, 0.000000, 0.000000, 0.000000, 0.000…
## $ PctDiscCH <dbl> 0.000000, 0.000000, 0.091398, 0.000000, 0.000000, 0.000…
## $ ListPriceDiff <dbl> 0.24, 0.24, 0.23, 0.00, 0.00, 0.30, 0.30, 0.24, 0.24, 0…
## $ STORE <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2…
- Create a training set containing a random sample of 800 observations (use as seed the value 4455) and a test set containing the remaining observations.
- Consider Purchase as the response variable. Provide and plot its frequency distribution for the training observations.
- Fit a tree to the training data for Purchase. Use all the other variables as predictors. Plot the tree. Which is the most important variable? Try different specifications of the extra and type arguments of the rpart.plot function (see ?rpart.plot).
- Extract all the information available for observation n. 146 in the training data set. Which is the Purchase prediction for this observation using the estimated tree?
- Predict the response for the test data and produce a confusion matrix. What is the test error rate?
- Do you think it is convenient to prune the tree? If yes, estimate the pruned tree and plot it.
- If you decided to prune the tree, compute the test error rate for the pruned tree. Compare the test error rates of the pruned and unpruned trees. Which is better?
5.6.2 Exercise 2
Consider the BreastCancer dataset included in the mlbench library.
library(mlbench)
data(BreastCancer)
glimpse(BreastCancer)
## Rows: 699
## Columns: 11
## $ Id <chr> "1000025", "1002945", "1015425", "1016277", "1017023",…
## $ Cl.thickness <ord> 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, 8, 7, 4, 4, …
## $ Cell.size <ord> 1, 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, 1, 7, 4, 1, 1,…
## $ Cell.shape <ord> 1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1, 5, 6, 1, 1,…
## $ Marg.adhesion <ord> 1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, 10, 4, 1, 1,…
## $ Epith.c.size <ord> 2, 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2, 7, 6, 2, 2, …
## $ Bare.nuclei <fct> 1, 10, 2, 4, 1, 10, 10, 1, 1, 1, 1, 1, 3, 3, 9, 1, 1, …
## $ Bl.cromatin <fct> 3, 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3, 5, 4, 2, 3, …
## $ Normal.nucleoli <fct> 1, 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, 5, 3, 1, 1, …
## $ Mitoses <fct> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 4, 1, 1, 1, …
## $ Class <fct> benign, benign, benign, benign, benign, malignant, ben…
- Remove missing values and the variable named Id (identifier), which is useless and would confuse the machine learning algorithms.
- Partition the data set into 80% training and 20% test observations (seed=2).
- Use the training data to grow a tree for predicting Class as a function of all the other variables. Plot the tree.
- Compute the predictions (classes and probabilities, separately) for the test observations. Compute the accuracy.
- Implement bagging and compare its performance (in terms of accuracy) with the full classification tree implemented before. Compute also the probabilities that will be used at the end for the AUC.
- Implement random forest and compare its performance (in terms of accuracy) with the methods implemented before. Compute also the probabilities that will be used at the end for the AUC.
- For all the considered models, compute the AUC and plot the ROC curves.