Chapter 3 Modeling
In this chapter, we will use two different metrics to build and assess our models:
- Accuracy
- Receiver Operating Characteristic (ROC)
3.1 Data splitting and data balancing
We have seen in the EDA that the variables age, amount and duration are right-skewed. Therefore, we apply a log transformation to those features and then scale them to normalize them.
Then, we split the dataset into a training set and a test set. The training set is used to find the optimal hyperparameters during tuning; for this process, we apply a 5-fold cross-validation on the training set. The test set is used to compare and evaluate the models in order to know how well they generalize to new data.
Therefore, we split the database into two parts:
- Training set (80%)
- Test set (20%)
#First we log-transform the skewed variables and scale them to normalize the data
german <- german %>% mutate(LogAge = log(age),
                            Log1pAmount = log1p(amount),
                            Log1pDuration = log1p(duration))
german$LogAgestd <- scale(german$LogAge)
german$Log1pAmountstd <- scale(german$Log1pAmount)
german$Log1pDurationstd <- scale(german$Log1pDuration)
german <- german %>% select(-age, -amount, -obs, -duration, -LogAge, -Log1pAmount, -Log1pDuration)
#Splitting the dataset into a training (80%) and a test set (20%)
set.seed(245156)
index <- sample(x = 1:2, size = nrow(german), replace = TRUE, prob = c(0.8, 0.2))
german.tr <- german[index == 1,]
german.te <- german[index == 2,]
In the EDA, we observed an unbalanced dataset: there are many more credits classified as good (700) than bad (300). In order to obtain correct predictions, we need to balance the number of good and bad credits in our training set, which will in turn balance sensitivity and specificity. Because the priority is to predict risky credits well in order to avoid heavy losses for the company, we need a good specificity, but the balance between both measures remains important.
As we can see with the risk variable in the training set, the data are unbalanced.
#As we can see with the risk variable, the data are unbalanced
table(german.tr$risk) %>% kable(align = "c", col.names = c("Variable", "Frequency"))
Variable | Frequency |
---|---|
bad | 236 |
good | 569 |
Below, we display the number of bad and good credits after having balanced the training set.
#Balancing the data
n.bad <- sum(german.tr$risk == "bad")
n.good <- sum(german.tr$risk == "good")

set.seed(245156)
index.bad <- which(german.tr$risk == "bad")
index.good <- sample(x = which(german.tr$risk == "good"), size = n.bad, replace = FALSE)

german.tr.bal <- german.tr[c(index.bad, index.good),]
table(german.tr.bal$risk) %>% kable(align = "c", col.names = c("Variable", "Frequency"))
Variable | Frequency |
---|---|
bad | 236 |
good | 236 |
3.2 Accuracy metric: training and fitting the models
We use the caret package to train our models. We first define some training parameters, which will be reused for every model assessed with the accuracy metric.
As mentioned before, we use a 5-fold cross-validation on the training set (creating validation folds) in order to assess how well each model will generalize to new data before evaluating it on the test set.
<- trainControl(method = "cv", number = 5)
train_control <- "Accuracy" metric
3.2.1 Logistic regression
A logistic regression is a binary classifier, used when the output has two classes.
We use a stepwise AIC selection in order to keep a model with mostly significant variables.
set.seed(123)
fit_glm_AIC <- train(
  form = risk ~ .,
data = german.tr.bal,
trControl = train_control,
method = "glmStepAIC",
metric = metric,
family = "binomial"
)
Our confusion matrix plot shows that sensitivity and specificity are quite well balanced. However, accuracy is quite low.
pred_glm_aic <- predict(fit_glm_AIC, newdata = german.te)
cm_glm_aic <- confusionMatrix(data = pred_glm_aic, reference = german.te$risk)
draw_confusion_matrix(cm_glm_aic)
Analysing the model more closely, we observe that it still contains insignificant variables according to the p-values. We decided to remove them from the model.
pvalues <- fit_glm_AIC$finalModel
pvalues %>% tidy() %>%
  select(-statistic) %>%
  mutate(p.value = paste(round(p.value, 4), pval_star(p.value))) %>%
  kable(digits = 3,
        col.names = c("Parameter", "Estimate", "Standard Error", "P.value")
  ) %>%
  kable_styling(bootstrap_options = "striped")
Parameter | Estimate | Standard Error | P.value |
---|---|---|---|
(Intercept) | -1.547 | 0.547 | 0.0047 ** |
chk_acct | 0.516 | 0.095 | 0 *** |
history | 0.449 | 0.122 | 2e-04 *** |
used_car | 1.571 | 0.432 | 3e-04 *** |
radio.tv | 0.431 | 0.270 | 0.1101 |
education | -1.215 | 0.581 | 0.0363 * |
sav_acct | 0.199 | 0.079 | 0.0119 * |
employment | 0.202 | 0.098 | 0.0395 * |
male_single | 0.537 | 0.241 | 0.0256 * |
guarantor | 0.955 | 0.491 | 0.0519 |
prop_unkn_none | -0.657 | 0.328 | 0.045 * |
other_install | -0.508 | 0.284 | 0.0733 |
rent | -0.651 | 0.301 | 0.0303 * |
num_credits | -0.342 | 0.219 | 0.1182 |
job | -0.449 | 0.193 | 0.02 * |
telephone | 0.487 | 0.258 | 0.0593 |
foreign | 1.405 | 0.705 | 0.0462 * |
Log1pDurationstd | -0.538 | 0.122 | 0 *** |
Unlike the other models we will use, the logistic regression has no hyperparameter to tune.
#Logistic regression
set.seed(123)
fit_glm_AIC <- train(
  form = risk ~ chk_acct + history + used_car + education + sav_acct + employment + male_single +
    prop_unkn_none + rent + job + foreign + Log1pDurationstd,
  data = german.tr.bal,
trControl = train_control,
method = "glmStepAIC",
metric = metric,
family = "binomial"
)
pred_glm_aic <- predict(fit_glm_AIC, newdata = german.te)
cm_glm_aic <- confusionMatrix(data = pred_glm_aic, reference = german.te$risk)
draw_confusion_matrix(cm_glm_aic)
Although the accuracy of our final model decreased, the gap between sensitivity and specificity is smaller. Moreover, because we removed the insignificant variables, the final model makes more sense.
Finally, we compute the VIF coefficients to check for multicollinearity; as shown below, there is none (all VIF coefficients are below 5).
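As a reminder, the VIF of a predictor is obtained from the $R^2$ of the regression of that predictor on all the other predictors:
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$
so values close to 1 indicate that a predictor shares little variance with the others.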
vif(pvalues) %>% kable(col.names = "VIF Coefficient") %>% kable_styling(bootstrap_options = "striped")
 | VIF Coefficient |
---|---|
chk_acct | 1.101125 |
history | 1.267096 |
used_car | 1.200954 |
radio.tv | 1.124241 |
education | 1.085464 |
sav_acct | 1.057604 |
employment | 1.142739 |
male_single | 1.169750 |
guarantor | 1.126230 |
prop_unkn_none | 1.189223 |
other_install | 1.078695 |
rent | 1.094990 |
num_credits | 1.317337 |
job | 1.372632 |
telephone | 1.308709 |
foreign | 1.061026 |
Log1pDurationstd | 1.155109 |
3.2.2 Nearest neighbour classification (KNN)
The K-NN model computes the distance between observations in order to classify them: for a new observation, it looks at the classes of its K nearest neighbours and assigns it the most frequent class among them.
#KNN
set.seed(456)
fit_knn_tuned <- train(
  risk ~ .,
  data = german.tr.bal,
method = "knn",
metric = metric,
trControl = train_control,
tuneGrid = expand.grid(k = seq(1, 101, by = 1))
)
We are looking for the optimal K.
plot(fit_knn_tuned)
K <- fit_knn_tuned$finalModel$k
paste("The optimal K of our model is", K) %>% kable(col.names = NULL, align = "l")
The optimal K of our model is 33 |
pred_knn <- predict(fit_knn_tuned, newdata = german.te)
cmknn <- confusionMatrix(data = pred_knn, reference = german.te$risk)
draw_confusion_matrix(cmknn)
The KNN model makes better predictions than the logistic regression. In addition, the scores for sensitivity and specificity are very well balanced.
3.2.3 Support Vector Machine (SVM)
The SVM is a separation method: it looks for the optimal hyperplane that linearly separates the observations in order to classify them.
With the caret package, we can only tune the cost hyperparameter. The cost controls the tolerance to misclassification and corresponds to the smoothness of the decision boundary: the larger the cost, the less smooth the boundary and the fewer misclassifications are allowed. A cost that is too large can lead to overfitting.
#SVM
hp_svm <- expand.grid(cost = 10 ^ ((-2):1))
set.seed(1953)
fit_svm <- train(
  form = risk ~ .,
data = german.tr.bal,
trControl = train_control,
tuneGrid = hp_svm,
method = "svmLinear2",
metric = metric
)
C <- fit_svm$finalModel$cost
paste("The optimal cost for our model is", C) %>% kable(col.names = NULL, align = "l")
The optimal cost for our model is 0.01 |
pred_svm <- predict(fit_svm, newdata = german.te)
cmsvm <- confusionMatrix(data = pred_svm, reference = german.te$risk)
draw_confusion_matrix(cmsvm)
The score for sensitivity is much higher than for specificity. Sensitivity measures the proportion of actual positives (here, the good credits) that are correctly identified. Thus, the model predicts good credits accurately but has difficulty identifying bad credits. Therefore, this model does not really meet our expectations.
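For reference, both measures can be read from the caret confusionMatrix object. A minimal sketch, assuming (as in our confusion matrix plots) that the positive class is the good credits; note that caret's default positive class is the first factor level, so we set it explicitly here:
#Illustrative sketch only: extract sensitivity and specificity for the SVM
cm_svm_good <- confusionMatrix(data = pred_svm, reference = german.te$risk,
                               positive = "good")
cm_svm_good$byClass["Sensitivity"] #share of good credits correctly identified
cm_svm_good$byClass["Specificity"] #share of bad credits correctly identified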
3.2.4 Neural network
A neural network combines the predictions of many small nodes and contains a large number of coefficients (weights) and parameters to tune. Neural networks are over-parametrized by nature, so the solution can be quite unstable. In order to avoid overfitting, we use a weight decay that penalises large weights. We also tune the size of the hidden layer to find the optimal one.
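In the nnet engine used by caret here, weight decay adds a quadratic penalty on the weights to the fitting criterion, roughly
$$E_{\text{pen}} = E + \lambda \sum_i w_i^2,$$
where $E$ is the unpenalised fitting criterion, $w_i$ are the network weights and $\lambda$ is the decay value tuned below.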
#Neural network
hp_nn <- expand.grid(size = 2:10,
                     decay = seq(0, 0.5, 0.05))
set.seed(2006)
fit_nn <- train(
  form = risk ~ .,
data = german.tr.bal,
trControl = train_control,
tuneGrid = hp_nn,
method = "nnet",
metric = metric,
verbose=F
)
Below, we display our neural network.
plotnet(fit_nn$finalModel)
decay <- fit_nn$finalModel$decay
paste("The best neural network has a decay of", decay) %>% kable(col.names = NULL, align = "l")
The best neural network has a decay of 0.45 |
arch <- fit_nn$finalModel$n
cat("The model architecture is the following:", "\n", "- Number of layers =", length(arch), "\n",
"- Nodes in the first layer =", arch[1], "\n",
"- Nodes in the second layer =", arch[2], "\n",
"- Nodes in the third layer (output) =", arch[3])
## The model architecture is the following:
## - Number of layers = 3
## - Nodes in the first layer = 30
## - Nodes in the second layer = 9
## - Nodes in the third layer (output) = 1
pred_nn <- predict(fit_nn, newdata = german.te)
cmnn <- confusionMatrix(data = pred_nn, reference = german.te$risk)
draw_confusion_matrix(cmnn)
Scores of the neural network model do not outperform the scores of the K-NN model.
3.2.5 Linear discriminant analysis (LDA)
The aim of linear discriminant analysis is to explain and predict the class of an observation using a linear combination of the features that best characterizes the classes.
There is no tuning parameter for this model.
#LDA
set.seed(1839)
fit_LDA <- train(risk ~ .,
                 data = german.tr.bal,
method = "lda",
metric = metric,
trControl = train_control)
pred_lda <- predict(fit_LDA, newdata = german.te)
cmlda <- confusionMatrix(data = pred_lda, reference = german.te$risk)
draw_confusion_matrix(cmlda)
Like the SVM model, there is a huge gap between sensitivity and specificity. This model therefore does not meet our needs, as it is only good at predicting good credits.
3.2.6 Random Forest
A random forest is composed of a multitude of decision trees acting as an ensemble. Each individual tree predicts a class and the class receiving the most votes becomes the model's prediction; in other words, the predictions of the individual trees are aggregated into a final prediction. Because the trees are largely uncorrelated and operate as a set, the ensemble usually outperforms the individual trees, and this low correlation brings stability to the final prediction.
To tune the model, we vary the mtry hyperparameter, i.e. the number of variables randomly sampled as candidates at each split.
#Random forest
hp_rf <- expand.grid(.mtry = (1:15))
set.seed(531)
fit_rf <- train(
  risk ~ .,
  data = german.tr.bal,
method = 'rf',
metric = metric,
trControl = train_control,
tuneGrid = hp_rf
)
Here, we display the importance of the variables. We note that the most important variables are not necessarily the same as those chosen in the logistic regression with AIC and p-values.
varImp(fit_rf) %>% plot()
mtry <- fit_rf$finalModel$mtry
paste("The optimal mtry for our random forest is", mtry) %>% kable(col.names = NULL, align = "l")
The optimal mtry for our random forest is 12 |
pred_rf <- predict(fit_rf, newdata = german.te)
cmrf <- confusionMatrix(data = pred_rf, reference = german.te$risk)
draw_confusion_matrix(cmrf)
Although the accuracy is close to the score obtained with the KNN model, the gap between specificity and sensitivity is too large.
3.2.7 Decision tree
A decision tree is a graphical representation of a set of rules: the importance of each feature is explicit and the relationships can be interpreted.
For the decision tree model, we decided to tune the complexity parameter “cp” over the range from 0.03 down to 0.
The complexity parameter controls the optimal size of the tree: a split is only attempted if it improves the fit of the model by at least the value of cp. In other words, the complexity parameter is the minimum improvement required at each node.
Because we tune cp with the caret package, we do not need to prune the tree afterwards, as the selected tree is already simplified. An equivalent manual workflow with rpart is sketched below.
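For illustration only (this is not part of our pipeline), the same control could be exercised directly with rpart: printcp reports the cross-validated error for each cp value, and prune cuts the tree back at a chosen cp (the 0.01 below is a hypothetical value for the example):
#Illustrative sketch only: growing a full tree with rpart and pruning it by cp
library(rpart)
tree_full <- rpart(risk ~ ., data = german.tr.bal,
                   method = "class", control = rpart.control(cp = 0))
printcp(tree_full)                          #cross-validated error for each cp value
tree_pruned <- prune(tree_full, cp = 0.01)  #hypothetical cp chosen for the example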
#Decision tree
hp_ct <- expand.grid(cp = seq(from = 0.03, to = 0, by = -0.003))

set.seed(1851) #allows reproducibility of the results
fit_ct <- train(
  form = risk ~ .,
data = german.tr.bal,
trControl = train_control,
tuneGrid = hp_ct,
method = "rpart",
metric = metric
)
Below, we display our final tree. The most important variable is the one used for the first split. In our tree, it is “chk_acct”, which corresponds to the checking account status.
fancyRpartPlot(fit_ct$finalModel, main = "Classification tree", caption = NULL)
pred_ct <- predict(fit_ct, newdata = german.te)
cmct <- confusionMatrix(data = pred_ct, reference = german.te$risk)
draw_confusion_matrix(cmct)
The confusion matrix demonstrates that the classification tree gets the worst scores so far. The gap between the specificity and the sensitivity is considerable.
3.2.8 Naive Bayes
The naive Bayes model uses a probabilistic approach and allows an interpretation of the features.
The aim is to predict the class of an instance by choosing the class that maximizes the conditional probability given the features.
In order to find the optimal model, we try the model both with and without kernel density estimation and set the Laplace smoother to zero. In addition, the adjust parameter scales the bandwidth of the kernel density estimate: the larger the value, the smoother (and therefore less flexible) the density estimate. The sketch below illustrates this effect.
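A quick illustration of the adjust parameter (assuming, as in the naive_bayes engine, that it is passed through to R's density() as a bandwidth multiplier), applied to one of our standardized features:
#Illustrative sketch only: effect of the adjust parameter on a kernel density estimate
x <- as.numeric(german.tr.bal$Log1pDurationstd)
d_narrow <- density(x, adjust = 0.1) #small adjust: narrow bandwidth, wiggly estimate
d_smooth <- density(x, adjust = 5)   #large adjust: wide bandwidth, smooth estimate
plot(d_smooth, main = "Kernel density of Log1pDurationstd")
lines(d_narrow, lty = 2)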
#Naive Bayes
hp_nb <- expand.grid(
  usekernel = c(TRUE, FALSE),
  laplace = 0,
  adjust = seq(from = 0.1, to = 5, by = 0.5)
)
set.seed(2013)
fit_nb <- train(
  form = risk ~ .,
data = german.tr.bal,
trControl = train_control,
tuneGrid = hp_nb,
method = "naive_bayes",
metric = metric
)
Our best model uses kernel density as we can see on the following graph.
plot(fit_nb)
pred_nb <- predict(fit_nb, newdata = german.te)
cmnb <- confusionMatrix(data = pred_nb, reference = german.te$risk)
draw_confusion_matrix(cmnb)
Unlike the previous models, the naive Bayes model is better at predicting the bad credits than the good ones. Detecting bad-risk credits is an important goal for the bank. Therefore, this model can be useful despite its rather low accuracy.
3.2.9 Evaluation of the models
A summary of the scores of each model can be found in the following table.
accuracy <- c(
  confusionMatrix(predict.train(fit_glm_AIC, newdata = german.te),
                  german.te$risk)$overall[1],
  confusionMatrix(predict.train(fit_nb, newdata = german.te),
                  german.te$risk)$overall[1],
  confusionMatrix(predict.train(fit_nn, newdata = german.te),
                  german.te$risk)$overall[1],
  confusionMatrix(predict.train(fit_rf, newdata = german.te),
                  german.te$risk)$overall[1],
  confusionMatrix(predict.train(fit_svm, newdata = german.te),
                  german.te$risk)$overall[1],
  confusionMatrix(predict.train(fit_LDA, newdata = german.te),
                  german.te$risk)$overall[1],
  confusionMatrix(predict.train(fit_ct, newdata = german.te),
                  german.te$risk)$overall[1],
  confusionMatrix(predict.train(fit_knn_tuned, newdata = german.te),
                  german.te$risk)$overall[1]
)

model <- c("fit_glm_AIC", "fit_nb", "fit_nn", "fit_rf", "fit_svm", "fit_LDA", "fit_ct", "fit_knn_tuned")

results <- tibble(accuracy, model) %>% arrange(desc(accuracy))

results %>% kable() %>% kable_styling()
accuracy | model |
---|---|
0.7743590 | fit_knn_tuned |
0.7692308 | fit_nb |
0.7692308 | fit_rf |
0.7384615 | fit_LDA |
0.7333333 | fit_nn |
0.7333333 | fit_svm |
0.7179487 | fit_glm_AIC |
0.7025641 | fit_ct |
best <- results %>%
  slice_max(accuracy)
paste("According to the accuracy, the best model is", best$model, "with accuracy =", round(best$accuracy, digits = 4)) %>% kable(col.names = NULL, align = "l")
According to the accuracy, the best model is fit_knn_tuned with accuracy = 0.7744 |
Although the KNN model has the best accuracy, it does less well than the naive Bayes model at predicting bad-risk credits, as the comparison below makes explicit.
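A minimal sketch of such a comparison, reusing the class predictions computed above and assuming again that the good credits are treated as the positive class:
#Illustrative sketch only: sensitivity and specificity of each model on the test set
preds <- list(glm_AIC = pred_glm_aic, knn = pred_knn, svm = pred_svm, nn = pred_nn,
              lda = pred_lda, rf = pred_rf, ct = pred_ct, nb = pred_nb)
sens_spec <- sapply(preds, function(p) {
  cm <- confusionMatrix(data = p, reference = german.te$risk, positive = "good")
  cm$byClass[c("Sensitivity", "Specificity")]
})
t(sens_spec) %>% kable(digits = 3) %>% kable_styling()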
3.3 ROC metric: training and fitting the models
In this part, we reproduce the same models as before but instead of the accuracy, we will use the ROC metric to evaluate the models. We will also use the Leave-One-Out Cross-Validation (LOOCV) method to assess the tuned hyperparameters.
Due to the huge computational time required for the tuning of the hyperparameters, we use, for most models, the hyperparameters that have obtained the highest accuracy in the previous section.
<- trainControl(method = "LOOCV",
train_control classProbs = TRUE,
summaryFunction = twoClassSummary)
<- "ROC" metric
We first train the model on the training set, then we make the predictions on the test set.
#Logistic regression
set.seed(123)
fit_glm_AIC <- train(
  form = risk ~ chk_acct + history + used_car + education + sav_acct + employment + male_single + prop_unkn_none + rent + job + foreign + Log1pDurationstd,
data = german.tr.bal,
trControl = train_control,
method = "glmStepAIC",
metric = metric,
family = "binomial"
)
#KNN
set.seed(456)
fit_knn_tuned <- train(
  risk ~ .,
  data = german.tr.bal,
method = "knn",
metric = metric,
trControl = train_control,
tuneGrid = expand.grid(k = 33)
)
#SVM
hp_svm <- expand.grid(cost = 0.01)
set.seed(1953)
fit_svm <- train(
  form = risk ~ .,
data = german.tr.bal,
trControl = train_control,
tuneGrid = hp_svm,
method = "svmLinear2",
metric = metric
)
#Neural network
hp_nn <- expand.grid(size = 9,
                     decay = 0.45)
set.seed(2006)
fit_nn <- train(
  form = risk ~ .,
data = german.tr.bal,
trControl = train_control,
tuneGrid = hp_nn,
method = "nnet",
metric = metric
)
#Discriminant analysis
set.seed(1839)
fit_LDA <- train(risk ~ .,
                 data = german.tr.bal,
method = "lda",
metric = metric,
trControl = train_control)
#Random forest
hp_rf <- expand.grid(.mtry = 12)
set.seed(531)
fit_rf <- train(
  risk ~ .,
  data = german.tr.bal,
method = 'rf',
tuneGrid = hp_rf,
metric = metric,
trControl = train_control
)
#Decision tree
hp_ct <- expand.grid(cp = seq(from = 0.03, to = 0, by = -0.003))
set.seed(1851)
fit_ct <- train(
  form = risk ~ .,
data = german.tr.bal,
trControl = train_control,
tuneGrid = hp_ct,
method = "rpart",
metric = metric
)
#Naive Bayes
hp_nb <- expand.grid(
  usekernel = c(TRUE, FALSE),
  laplace = 0,
  adjust = seq(from = 0.1, to = 5, by = 0.5)
)
set.seed(2013)
fit_nb <- train(
  form = risk ~ .,
data = german.tr.bal,
trControl = train_control,
tuneGrid = hp_nb,
method = "naive_bayes",
metric = metric
)
pred_glm_AIC_roc <- predict(fit_glm_AIC, newdata = german.te, type = "prob")
pred_knn_tuned_roc <- predict(fit_knn_tuned, newdata = german.te, type = "prob")
pred_svm_roc <- predict(fit_svm, newdata = german.te, type = "prob")
pred_nn_roc <- predict(fit_nn, newdata = german.te, type = "prob")
pred_lda_roc <- predict(fit_LDA, newdata = german.te, type = "prob")
pred_rf_roc <- predict(fit_rf, newdata = german.te, type = "prob")
pred_ct_roc <- predict(fit_ct, newdata = german.te, type = "prob")
pred_nb_roc <- predict(fit_nb, newdata = german.te, type = "prob")
3.3.1 Evaluation of the models
We plot the ROC curve, which is simply a plot of sensitivity against one minus specificity as the cut-point “c” is increased from 0 to 1. A perfect model would reach a sensitivity of 100% (all positives correctly identified) while making no mistakes on the negatives (1 - specificity equal to 0).
As a result, a perfect model would reach the upper-left corner of the ROC graph. We can therefore estimate model quality from the area under the ROC curve: the further the curve lies above the diagonal representing a purely random model, the larger this area and the better the discrimination. For a model at least as good as random, the area takes values between 0.5 and 1, and a value above 0.8 is generally considered good. The sketch below shows how a single point of the curve is obtained.
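A minimal sketch, using the random forest probabilities and a hypothetical cut-point of 0.5 on the predicted probability of a good credit:
#Illustrative sketch only: one point of the ROC curve at a cut-point of 0.5
cutoff <- 0.5
pred_class <- ifelse(pred_rf_roc[, "good"] >= cutoff, "good", "bad")
tpr <- mean(pred_class[german.te$risk == "good"] == "good") #sensitivity
fpr <- mean(pred_class[german.te$risk == "bad"] == "good")  #1 - specificity
c(TPR = tpr, FPR = fpr)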
#List of predictions
preds_list <- list(
  pred_glm_AIC_roc[,2],
  pred_knn_tuned_roc[,2],
  pred_svm_roc[,2],
  pred_nn_roc[,2],
  pred_lda_roc[,2],
  pred_rf_roc[,2],
  pred_ct_roc[,2],
  pred_nb_roc[,2])
#List of actual values (same for all)
m <- length(preds_list)
actuals_list <- rep(list(german.te$risk), m)
#Plot the ROC curves
pred <- prediction(preds_list, actuals_list)
rocs <- performance(pred, "tpr", "fpr")
plot(rocs, col = as.list(1:m), main = "Test Set ROC Curves")
legend(x = "bottomright",
       legend = c("glm_AIC", "knn_tuned", "svm", "nn", "lda", "rf", "ct", "nb"),
       fill = 1:m)
abline(coef = c(0, 1))
In terms of ROC, we see that the naive Bayes model does not achieve the results suggested by the first part. According to the curves, the random forest and the SVM obtain the best results. In order to interpret these results more precisely, we use the area under the ROC curve (AUC). A model is considered good if it obtains an AUC of more than 80%.
#AUC metric
auc <- performance(pred, measure = "auc")
auc <- auc@y.values
auc <- tibble(auc)
auc$model <- c("glm_AIC", "knn_tuned", "svm", "nn", "lda", "rf", "ct", "nb")
results_auc <- as.data.frame(lapply(auc, unlist))
results_auc <- results_auc %>% arrange(desc(auc))
results_auc %>% kable() %>% kable_styling()
auc | model |
---|---|
0.8457777 | rf |
0.8392176 | svm |
0.8377863 | lda |
0.8272304 | knn_tuned |
0.8268130 | glm_AIC |
0.8040315 | nn |
0.7858421 | ct |
0.7566794 | nb |
best_auc <- results_auc %>%
  slice_max(auc)
paste("According to the AUC, the best model is", best_auc$model, "with AUC =", round(best_auc$auc, digits = 4)) %>% kable(col.names = NULL, align = "l")
According to the AUC, the best model is rf with AUC = 0.8458 |
The AUC values confirm what the ROC curves suggested. However, the ranking is quite different from the one obtained in the first part with the accuracy metric, so it is difficult to designate a single best prediction model.
In order to look beyond these results, we will use the DALEX tool in the next chapter.