8.5 The Decision Tree with Full Predictors

Let us use all the attributes (apart from name and ticket number):

# The full-house classifier: all attributes apart from name and ticket number
# (the shorthand formula Survived ~ . would include every remaining column)
model4 <- rpart(Survived ~ Sex + Pclass + Age + SibSp + Parch + Embarked + HasCabinNum + Friend_size + Fare_pp + Title + Deck + Ticket_class + Family_size + Group_size + Age_group,
              data=train,
              method="class")
#assess prediction accuracy on train data
Predict_model4_train <- predict(model4, train, type = "class")
conMat <- confusionMatrix(as.factor(Predict_model4_train), as.factor(train$Survived))
conMat$table
##           Reference
## Prediction   0   1
##          0 513  94
##          1  36 248
#conMat$overall
predict4_train_accuracy <- conMat$overall["Accuracy"]
predict4_train_accuracy
## Accuracy 
##   0.8541
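
The accuracy is simply the proportion of correct predictions, i.e. the diagonal of the confusion matrix divided by its total. A quick check with the counts printed above:

# accuracy by hand: (513 + 248) correct out of 891 passengers
sum(diag(conMat$table)) / sum(conMat$table)
## [1] 0.8540965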

The assessment shows that model4 achieves an accuracy of 85% on the train data. Let us use this model to make a prediction on the test dataset.

# make prediction on test dataset
Prediction4 <- predict(model4, test, type = "class")
submit4 <- data.frame(PassengerId = test$PassengerId, Survived = Prediction4)
write.csv(submit4, file = "./data/Tree_model4.CSV", row.names = FALSE)
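
Kaggle expects exactly two columns, PassengerId and Survived; reading the file back is a quick way to confirm it was written as intended (the path assumes the same working directory as above):

# read the submission back to confirm its two columns
head(read.csv("./data/Tree_model4.CSV"))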

We have produced a new prediction with the new model. You can submit it to Kaggle for evaluation; you may find it gets a rather poor score (0.75119). Let us examine our classifier again by plotting it as a graph.

# plot our full-house classifier with prp() from rpart.plot and fancyRpartPlot() from rattle
prp(model4, type = 0, extra = 1, under = TRUE)
fancyRpartPlot(model4)

Figure 8.8: Decision trees.

We can see from the tree that the first test condition is "Title = Mr". We know that a large number of the passengers were adult males and that most of them perished. However, the left branch of the tree applies two further test conditions, on "Ticket_class" and "Deck". This branch ends with two survived leaves and one dead leaf. The purity of the leaves is not very high: the highest is 39:366 (90%:10%) and the lowest is 7:14 (33%:67%).
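
If you prefer to read the leaf purities as text rather than off the plot, rpart.plot (version 3.0 or later), which we already loaded for prp(), provides rpart.rules(), which prints each leaf as a rule together with its fitted probability:

# print each leaf as a rule with its fitted survival probability
rpart.rules(model4)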

On the right-hand side of the tree, although some leaves have higher purity, each covers only a small percentage of the passengers. The most powerful predictors, Sex and Pclass, are used only close to the leaves of the tree. That could explain the poor performance. We will demonstrate the detailed interpretation of the model in a later chapter on cross-validation and reporting.
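
One way to check this numerically is rpart's built-in variable importance, which aggregates each predictor's contribution over all splits (including surrogate splits) rather than only the split where the predictor first appears:

# importance scores stored on the fitted rpart object; Sex and Pclass
# can still rank highly even though the tree uses them only near the leaves
model4$variable.importance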

Let us look into the differences between the last two predictions:

# build a comparison data frame to record each model's predictions
compare <- data.frame(test$PassengerId, predict3 = Prediction3, predict4 = Prediction4)
# keep only the rows where the two models disagree
dif2 <- compare[compare$predict3 != compare$predict4, ]
# show the differences
print.data.frame(dif2, row.names = FALSE)
##  test.PassengerId predict3 predict4
##               898        1        0
##               916        1        0
##               933        0        1
##               945        1        0
##               956        1        0
##               961        1        0
##               964        1        0
##               965        0        1
##              1038        0        1
##              1050        0        1
##              1073        0        1
##              1128        0        1
##              1134        0        1
##              1137        0        1
##              1183        1        0
##              1247        0        1
##              1251        1        0
##              1259        0        1
##              1296        0        1
##              1304        1        0

There are 20 predictions that differ from those of the third model, model3, which is the best decision tree model we have built so far.
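
We can confirm the count directly:

# number of passengers on which the two models disagree
nrow(dif2)
## [1] 20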