10.3 Multiple Models Comparison

Multiple-model comparison is also called cross-model validation. Here, "model" refers to models built by completely different algorithms. The idea is to construct multiple models from the same training dataset, validate them on the same verification dataset, and compare their performance.

We have already used this technique to compare our decision tree and random forest models. Cross-model validation has a broader meaning: it refers to comparison between models produced by different algorithms or completely different approaches, such as a decision tree against a random forest, or a decision tree against a Support Vector Machine (SVM).

To demonstrate cross-model validation, let us produce a few more models with completely different algorithms, keeping the predictors the same as far as possible. We use Sex, Fare_pp, Pclass, Title, Age_group, Group_size, Ticket_class and Embarked as predictors.

Regression Model for Titanic

Logistic regression is a classification algorithm, not a regression algorithm. It predicts the probability of occurrence of an event by fitting data to a logit function, hence it is also known as logit regression (Vidhya 2015). Since it predicts probabilities, its output values lie between 0 and 1, and we can convert them into binary classes by setting a threshold such as 0.5.
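
For reference, the model estimates the survival probability with the standard logistic (sigmoid) function:

P(Survived = 1 | x) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))

where b0, ..., bk are the coefficients that glm() estimates for our k predictors; a predicted probability above the 0.5 threshold is then classed as survived.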

LR_Model <- glm(formula = Survived ~ Pclass + Title + Sex + Age_group +
                  Group_size + Ticket_class + Fare_pp + Embarked,
                family = binomial, data = trainData)

#summary(LR_Model)
### Validate on trainData
Valid_trainData <- predict(LR_Model, newdata = trainData, type = "response")  # predicted probabilities
Valid_trainData <- ifelse(Valid_trainData > 0.5, 1, 0)  # set binary prediction threshold at 0.5
# produce confusion matrix (confusionMatrix() comes from the caret package)
library(caret)
confusion_Mat <- confusionMatrix(as.factor(trainData$Survived), as.factor(Valid_trainData))

# accuracy on trainData
Regression_Acc_Train <- round(confusion_Mat$overall["Accuracy"]*100, 2)
paste('Model Train Accuracy =', Regression_Acc_Train)
## [1] "Model Train Accuracy = 84.97"
### Validate on validData
validData_Survived_predicted <- predict(LR_Model, newdata = validData, type = "response")
validData_Survived_predicted <- ifelse(validData_Survived_predicted > 0.5, 1, 0)  # set binary prediction threshold
conMat <- confusionMatrix(as.factor(validData$Survived), as.factor(validData_Survived_predicted))

Regression_Acc_Valid <- round(conMat$overall["Accuracy"]*100, 2)
paste('Model Valid Accuracy =', Regression_Acc_Valid) 
## [1] "Model Valid Accuracy = 79.89"
### compute AUROC on the training predictions
library(pROC)
auc(roc(trainData$Survived, Valid_trainData))  # area under the ROC curve
## Area under the curve: 0.832
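# Optionally visualise the ROC curve; this is a sketch using pROC's plot method,
# with print.auc = TRUE adding the AUC value to the plot
plot(roc(trainData$Survived, Valid_trainData), print.auc = TRUE)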
### produce a prediction on test data
test$Survived <- predict(LR_Model, newdata = test, type = "response")  
test$Survived <- ifelse(test$Survived > 0.5, 1, 0)  # set binary prediction threshold
submit <- data.frame(PassengerId = test$PassengerId, Survived = as.factor(test$Survived))

write.csv(submit, file = "./data/LG_model1_CV.CSV", row.names = FALSE)
# Kaggle test accuracy score: 0.76555

# record accuracy (Kaggle score expressed as a percentage for consistency)
Regr_Acc <- c(Regression_Acc_Train, Regression_Acc_Valid, 76.56)

acc_names <- c("Train Accu", "Valid Accu", "Test Accu")
names(Regr_Acc) <- acc_names
Regr_Acc
## Train Accu Valid Accu  Test Accu 
##      84.97      79.89      76.56

Support Vector Machine Model for Titanic

Let us also consider a support vector machine (SVM) model (Cortes and Vapnik 1995). We use the C-classification mode. Again, we fit a model with almost the same set of attributes as in the logistic regression model, adding Deck and HasCabinNum. We use the function svm() from the e1071 package (https://cran.r-project.org/web/packages/e1071/e1071.pdf).

We could tune the two key parameters of the SVM model, gamma and cost, to find and select the best combination (see exercise); a minimal tuning sketch follows.
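
Below is a minimal sketch using tune.svm() from the e1071 package; the gamma and cost grids are illustrative assumptions rather than recommended values, and the search can take a while on trainData.

# grid-search gamma and cost; tune.svm() uses 10-fold cross-validation by default
tuned <- tune.svm(Survived ~ Pclass + Title + Sex + Age_group + Group_size +
                    Ticket_class + Fare_pp + Deck + HasCabinNum + Embarked,
                  data = trainData, type = "C-classification", kernel = "radial",
                  gamma = 10^(-3:0), cost = 10^(-1:2))
summary(tuned)                   # reports the best gamma/cost pair found
# best_SVM <- tuned$best.model   # the model refitted with the best parameters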

We then use the best model to make predictions. The results of the model are collected for comparison.

#load library
library(e1071)

# fit the model using default parameters
SVM_model <- svm(Survived ~ Pclass + Title + Sex + Age_group + Group_size +
                   Ticket_class + Fare_pp + Deck + HasCabinNum + Embarked,
                 data = trainData, kernel = "radial", type = "C-classification")

#summary(SVM_model)
### Validate on trainData
Valid_trainData <- predict(SVM_model, trainData) 
#produce confusion matrix
confusion_Mat<- confusionMatrix(as.factor(trainData$Survived), as.factor(Valid_trainData))

# output accuracy
SVM_Acc_Train <- round(confusion_Mat$overall["Accuracy"]*100, 4)
paste('Model Train Accuracy =', SVM_Acc_Train)
## [1] "Model Train Accuracy = 84.4101"
### Validate on validData
validData_Survived_predicted <- predict(SVM_model, validData)
# produce confusion matrix
conMat <- confusionMatrix(as.factor(validData$Survived), as.factor(validData_Survived_predicted))
# output accuracy
SVM_Acc_Valid <- round(conMat$overall["Accuracy"]*100, 4)
paste('Model Valid Accuracy =', SVM_Acc_Valid)
## [1] "Model Valid Accuracy = 78.2123"
### make prediction on test
# predict() fails on test because its Survived column is NA;
# a workaround is to fill it with a dummy value such as 1.
test$Survived <- 1

# predict results on test
Survived <- predict(SVM_model, test)
solution <- data.frame(PassengerId=test$PassengerId, Survived =Survived)
write.csv(solution, file = './data/svm_predicton.csv', row.names = F)

# prediction accuracy on test (Kaggle score 0.78947 as a percentage)
SVM_Acc <- c(SVM_Acc_Train, SVM_Acc_Valid, 78.95)
names(SVM_Acc) <- acc_names

# print out
SVM_Acc
## Train Accu Valid Accu  Test Accu 
##    84.4101    78.2123    78.9500

Neural Network Models

Neural networks are a rapidly developing paradigm for information processing based loosely on how neurons in the brain process information. A neural network consists of multiple layers of nodes, where each node performs a unit of computation and passes the result onto the next node. Multiple nodes can pass inputs to a single node and vice versa.

The neural network also contains a set of weights, which can be refined over time as the network learns from sample data. The weights are used to describe and refine the connection strengths between nodes.
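
Concretely, each node computes a weighted sum of its inputs and passes the result through an activation function:

output = f(w1*x1 + w2*x2 + ... + wn*xn + b)

where the w values are the learned connection weights, b is a bias term, and f is an activation function (a logistic sigmoid for the hidden units fitted by nnet below).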

Let us fit a neural network with one hidden layer using all the features.

# load library
library(nnet)

# train the model
xTrain <- train[, c("Survived", "Pclass", "Title", "Sex", "Age_group", "Group_size",
                    "Ticket_class", "Fare_pp", "Deck", "HasCabinNum", "Embarked")]

# one hidden layer with 10 units, up to 500 training iterations, silent output
NN_model1 <- nnet(Survived ~ ., data = xTrain, size = 10, maxit = 500, trace = FALSE)

#How do we do on the training data?
nn_pred_train_class <- predict(NN_model1, xTrain, type = "class")  # yields "0", "1"
nn_train_pred <- as.numeric(nn_pred_train_class)  # transform to 0, 1
confusion_Mat <- confusionMatrix(as.factor(nn_train_pred), train$Survived)
# output accuracy
NN_Acc_Train <- round(confusion_Mat$overall["Accuracy"]*100,4)
paste('Model Train Accuracy =', NN_Acc_Train)
## [1] "Model Train Accuracy = 89.2256"
#How do we do on the valid data?
nn_pred_valid_class <- predict(NN_model1, validData, type = "class")  # yields "0", "1"
nn_valid_pred <- as.numeric(nn_pred_valid_class)  # transform to 0, 1
confusion_Mat <- confusionMatrix(as.factor(nn_valid_pred), validData$Survived)
# output accuracy
NN_Acc_Valid <- round(confusion_Mat$overall["Accuracy"]*100,4)
paste('Model valid Accuracy =', NN_Acc_Valid)
## [1] "Model valid Accuracy = 87.7095"
#make a prediction on test data
nn_pred_test_class <- predict(NN_model1, test, type = "class")  # yields "0", "1"
nn_pred_test <- as.numeric(nn_pred_test_class)  # transform to 0, 1
solution <- data.frame(PassengerId = test$PassengerId, Survived = nn_pred_test)
write.csv(solution, file = './data/NN_predicton.csv', row.names = F)

# record accuracy (Kaggle test score 0.71052 as a percentage)
NN_Acc <- c(NN_Acc_Train, NN_Acc_Valid, 71.05)
names(NN_Acc) <- acc_names
NN_Acc
## Train Accu Valid Accu  Test Accu 
##    89.2256    87.7095    71.0500

Comparison among Different Models

Let us compare the different models we have produced and see which one has the best prediction accuracy on the test dataset. For the decision tree and random forest models, we use the best test accuracies obtained earlier.

library(tidyr)
library(ggplot2)
Model <- c("Regression","SVM","NN", "Decision tree", "Random Forest")
Train <- c(Regression_Acc_Train, SVM_Acc_Train, NN_Acc_Train, 82.72, 83.16)
Valid <- c(Regression_Acc_Valid, SVM_Acc_Valid, NN_Acc_Valid, 81.01, 92)
Test <- c(76.56, 78.95, 71.05, 77.75, 78.95)
df1 <- data.frame(Model, Train, Valid, Test)

knitr::kable(df1, longtable = TRUE, booktabs = TRUE, digits = 2,
  col.names = c("Models", "Accuracy on Train", "Accuracy on Valid", "Accuracy on Test"),
  caption = 'Comparison among Five Machine Learning Models'
)
Table 10.2: Comparison among Five Machine Learning Models

Models          Accuracy on Train   Accuracy on Valid   Accuracy on Test
Regression                  84.97               79.89              76.56
SVM                         84.41               78.21              78.95
NN                          89.23               87.71              71.05
Decision tree               82.72               81.01              77.75
Random Forest               83.16               92.00              78.95
df.long <- gather(df1, Dataset, Accuracy, -Model, factor_key = TRUE)
ggplot(data = df.long, aes(x = Model, y = Accuracy, fill = Dataset)) +
  geom_col(position = position_dodge())

Figure 10.6: Accuracy comparison among different machine learning models.

From the above table and plot, we can see that multiple-model cross-validation does not give a conclusive answer on which model to use for a real application or in production. Ideally, we would choose the model with the highest accuracy on both trainData and validData. From Table 10.2, we should then choose the NN model, since it has the highest training accuracy (89.23%) and the second highest validation accuracy (87.71%); however, it has the lowest test accuracy (71.05%). Another possible logic is to choose the highest validation accuracy and ignore the training accuracy; in that case, we would choose the Random Forest model, since it has the highest validation accuracy (92.00%) and shares the highest test accuracy among the models (78.95%). However, the SVM model achieves the same 78.95% test accuracy despite having the lowest validation accuracy.

This reveals an unpleasant fact: no model's performance on unseen data can be guaranteed by cross-validation. Cross-validation can only be used to spot problems, not to solve them.

References

Cortes, Corinna, and Vladimir Vapnik. 1995. “Support-Vector Networks.” Machine Learning 20. https://link.springer.com/article/10.1007/BF00994018.

Vidhya, Analytics. 2015. "Simple Guide to Logistic Regression in R and Python." https://www.analyticsvidhya.com/blog/2015/11/beginners-guide-on-logistic-regression-in-r/.