Summary

So far, we have built four prediction models with different sets of predictors. Because we do not have the survival values for the test dataset, we used these four models to generate four submissions to the Kaggle competition for evaluation.

The prediction accuracies reported by Kaggle were 76.56%, 76.56%, 77.03%, and 75.12% respectively.

We have noticed differences between each model's assessed accuracy (on the training data) and its accuracy on the test data:

  • Model1’s assessed accuracy was 78.68%, but its accuracy on the test data was 76.56%;

  • Model2’s assessed accuracy was 81.48%, yet its accuracy on the test data was still 76.56%;

  • Model3’s assessed accuracy was 85.19%, and its accuracy on the test data was 77.03%;

  • Model4’s assessed accuracy was 85.41%, but its accuracy on the test data was 75.12%.

Let us plot these models' accuracy so we can compare them.

library(tidyr)
library(ggplot2)
# Tree models comparison
Model <- c("Model1", "Model2", "Model3", "Model4")
Pre <- c("Sex", "Sex, Pclass, HasCabinNum, Deck, Fare_pp", "Sex, Fare_pp, Pclass, Title, Age_group, Group_size, Ticket_class, Embarked", "All")
Train <- c(78.68, 81.48, 85.19, 85.41)
Test <- c(76.56, 76.56, 77.03, 75.12)
df1 <- data.frame(Model, Pre, Train, Test)
df2 <- data.frame(Model, Train, Test)
knitr::kable(df1, longtable = TRUE, booktabs = TRUE, digits = 2,
  col.names = c("Models", "Predictors", "Accuracy on Train", "Accuracy on Test"),
  caption = 'The Comparison among 4 decision tree models'
)
Table 8.1: The Comparison among 4 decision tree models

Models | Predictors | Accuracy on Train | Accuracy on Test
Model1 | Sex | 78.68 | 76.56
Model2 | Sex, Pclass, HasCabinNum, Deck, Fare_pp | 81.48 | 76.56
Model3 | Sex, Fare_pp, Pclass, Title, Age_group, Group_size, Ticket_class, Embarked | 85.19 | 77.03
Model4 | All | 85.41 | 75.12

Let us plot a bar graph to visualize the comparison.

# Reshape df2 to long format so both accuracies can be plotted side by side
df.long <- gather(df2, Dataset, Accuracy, -Model, factor_key = TRUE)
ggplot(data = df.long, aes(x = Model, y = Accuracy, fill = Dataset)) +
  geom_col(position = position_dodge())

Figure 8.9: Decision Tree models’ accuracy on Train dataset and Test dataset.

From the plot we can see that:

  1. All four models perform better when predicting survival on the training dataset than on the test dataset. In other words, every model loses prediction accuracy when facing new, unseen data.

  2. The accuracy on both the training and the test dataset is affected by the number of predictors used in the model.

  3. It is not true that more predictors always lead to better accuracy: Model4 uses all the predictors yet has the worst accuracy on the test dataset.

  4. Model4 has the biggest drop in prediction accuracy: just over 10 percentage points, from 85.41% on the training data to 75.12% on the test data, as the short calculation after this list confirms.
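To quantify these drops, we can compute the difference between the training and the test accuracy directly from the df2 data frame created above (a quick sketch; the column name Drop is only introduced here for illustration).

# Accuracy drop (in percentage points) from training data to test data
df2$Drop <- df2$Train - df2$Test
df2

Model1 drops by about 2 percentage points, Model2 by about 5, Model3 by about 8, and Model4 by just over 10, so the gap grows as more predictors are added.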

The issue demonstrated by Model4 is called overfitting: a model achieves high prediction accuracy on the training dataset but loses accuracy on unseen data. Overfitting can be a consequence of many factors. One of them is illustrated by Model4: the more predictors a model uses, the greater the chance of overfitting, because the model fits the training dataset too closely. This shows up as a high prediction accuracy on the training dataset. A model that is fitted too closely to the training data does not generalize to unseen data, so it performs poorly on the test dataset.
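One simple way to expose this gap without submitting to Kaggle is to hold back part of the training data and treat it as unseen. The sketch below illustrates the idea with rpart; it assumes a prepared data frame named train that contains the factor Survived together with the engineered predictors listed earlier (the object name and column names depend on your earlier preprocessing, so adjust them to match your own workspace).

library(rpart)

# Hold back 20% of the training data to act as "unseen" data
set.seed(1234)
idx <- sample(nrow(train), size = floor(0.8 * nrow(train)))
train_part <- train[idx, ]    # used to fit the tree
valid_part <- train[-idx, ]   # held out, not seen by the tree

# Fit a tree with many predictors (similar in spirit to Model3)
fit <- rpart(Survived ~ Sex + Fare_pp + Pclass + Title + Age_group +
               Group_size + Ticket_class + Embarked,
             data = train_part, method = "class")

# Accuracy on the data the tree was fitted on ...
acc_seen <- mean(predict(fit, train_part, type = "class") == train_part$Survived)
# ... and on the held-out data the tree has never seen
acc_unseen <- mean(predict(fit, valid_part, type = "class") == valid_part$Survived)

round(c(train = acc_seen, holdout = acc_unseen) * 100, 2)

The accuracy on train_part is typically noticeably higher than on valid_part, which is the same pattern we observed between the assessed accuracies and the Kaggle test accuracies. Cross-validation, introduced in Chapter 10, repeats this idea systematically instead of relying on a single split.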

We will investigate overfitting with a method called “Cross-Validation” in Chapter 10.