8.3 The Decision Tree with Core Predictors

In the previous chapter, we identified the correlation between the dependent variable Survived and the other attributes. Let us try to improve the basic Sex model by introducing more predictors. From that analysis, we know that the five most predictive attributes are Sex, Pclass, HasCabinNum, Deck, and Fare_pp. Let us see if they can produce a good model.

# A tree model with the top five attributes 
set.seed(1234)
model2 <- rpart(Survived ~ Sex + Pclass + HasCabinNum + Deck + Fare_pp, data = train, method="class")

# Assess model's accuracy with train data
Predict_model2_train <- predict(model2, train, type = "class")
conMat <- confusionMatrix(as.factor(Predict_model2_train), as.factor(train$Survived))
conMat$table

##           Reference
## Prediction   0   1
##          0 524 140
##          1  25 202

#conMat$overall
predict2_train_accuracy <- conMat$overall["Accuracy"]
predict2_train_accuracy

## Accuracy 
##   0.8148
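
The accuracy value can be checked by hand from the confusion matrix: it is the number of correctly classified passengers (the diagonal) divided by the total number of passengers. A minimal sketch in base R, with the four counts typed in from the table above (the object name cm is only for illustration):

```r
# Confusion matrix counts copied from the output above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(524, 25, 140, 202), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Accuracy = correctly classified cases / all cases = (524 + 202) / 891
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 4)
## [1] 0.8148
```

The off-diagonal cells show where the errors come from: 140 passengers who survived were predicted as dead, while only 25 who died were predicted as survivors.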

We assess model2’s accuracy by making predictions on the train data and comparing the predicted values with the actual values. The result shows that model2’s accuracy on the train data is about 81%. This is great! Let us use this model to make a prediction on the test dataset.

Prediction2 <- predict(model2, test, type = "class")
submit2 <- data.frame(PassengerId = test$PassengerId, Survived = Prediction2)
write.csv(submit2, file = "./data/Tree_model2.CSV", row.names = FALSE)

We have produced our second prediction and can submit the results to the Kaggle website for a second evaluation.

This time, the score is still 0.76555. The accuracy on the test dataset has not improved.

Let us examine our classifier again by plotting it as a graph.

# Plot the classifier with prp() from rpart.plot
prp(model2, type = 0, extra = 1, under = TRUE)
# Plot the same classifier with fancyRpartPlot() from rattle
fancyRpartPlot(model2, caption = "")

Figure 8.6: Decision trees with core predictors.

The above decision tree appears much more complicated than the first one, and it goes a lot deeper than the tree that tests only on Sex. Note that both trees are binary trees (every test has two branches). Test conditions with more than two possible answers are converted automatically into binary ones by choosing a split point. For example, the predictor Fare_pp takes real values and has many possible outcomes; our model simply splits it with test conditions such as "Fare_pp >= 8" and "Fare_pp < 7.2". Conditions have been set automatically for other attributes as well, such as "Pclass >= 2.5". These auto-generated test conditions are not necessarily ideal; they can be changed if you know how to optimize the decision tree. For the moment, we will take the default settings.
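
To see where such test conditions come from and how they can be influenced, a minimal sketch follows. It uses the kyphosis data set that ships with rpart rather than our Titanic data, so that it runs on its own. The control values (maxdepth = 2, minsplit = 20, cp = 0.01) are arbitrary choices for illustration, not tuned settings, and the object names fit_default and fit_shallow are hypothetical.

```r
library(rpart)

set.seed(1234)
# Tree grown with rpart's default settings on the bundled kyphosis data
fit_default <- rpart(Kyphosis ~ Age + Number + Start,
                     data = kyphosis, method = "class")

# Same formula, but the tree is limited to two levels of splits
fit_shallow <- rpart(Kyphosis ~ Age + Number + Start,
                     data = kyphosis, method = "class",
                     control = rpart.control(maxdepth = 2, minsplit = 20, cp = 0.01))

# Printing a tree lists every binary test condition that rpart chose
print(fit_shallow)

# Count the internal (non-leaf) nodes, i.e. the number of splits, in each tree
n_splits_default <- sum(fit_default$frame$var != "<leaf>")
n_splits_shallow <- sum(fit_shallow$frame$var != "<leaf>")
c(default = n_splits_default, shallow = n_splits_shallow)
```

The same control argument could be passed to the rpart() call that builds model2; whether a shallower or deeper tree scores better on the test dataset would have to be checked by experiment.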

We can also look into the differences between our two prediction models.

# Build a comparison data frame that records each model's predictions
Tree_compare <- data.frame(test$PassengerId, predict1=Prediction1, predict2=Prediction2)
# Find the rows where the two predictions differ
dif <- Tree_compare[Tree_compare$predict1 != Tree_compare$predict2, ]
# Show the differences
print.data.frame(dif, row.names = FALSE)

##  test.PassengerId predict1 predict2
##               893        1        0
##               896        1        0
##               911        1        0
##               924        1        0
##               925        1        0
##               928        1        0
##               929        1        0
##               941        1        0
##               979        1        0
##               982        1        0
##               996        1        0
##              1009        1        0
##              1017        1        0
##              1024        1        0
##              1030        1        0
##              1032        1        0
##              1045        1        0
##              1051        1        0
##              1061        1        0
##              1080        1        0
##              1091        1        0
##              1117        1        0
##              1141        1        0
##              1155        1        0
##              1160        1        0
##              1172        1        0
##              1175        1        0
##              1176        1        0
##              1183        1        0
##              1201        1        0
##              1225        1        0
##              1246        1        0
##              1257        1        0
##              1259        1        0
##              1268        1        0
##              1275        1        0
##              1301        1        0

We can see that the second classifier has produced 37 predictions that differ from those of the first classifier. Interestingly, all the differences go in one direction: passengers predicted to survive by model1 are now predicted to be dead by model2.
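
The direction of the disagreements can also be summarised in a single cross-tabulation instead of reading the rows one by one. The sketch below uses two short made-up vectors (p1 and p2 are hypothetical stand-ins for Prediction1 and Prediction2) so that it runs on its own:

```r
# Hypothetical stand-ins for the two prediction vectors
p1 <- factor(c(1, 0, 1, 1, 0, 1), levels = c(0, 1))
p2 <- factor(c(0, 0, 1, 0, 0, 1), levels = c(0, 1))

# Rows: first model, columns: second model.
# Off-diagonal cells count the cases where the two models disagree.
agreement <- table(predict1 = p1, predict2 = p2)
agreement

# Cases flipped from 1 (survived) to 0 (dead), and the other way round
agreement["1", "0"]
agreement["0", "1"]
```

Applied to our data with table(Tree_compare$predict1, Tree_compare$predict2), the listing above implies 37 cases in the cell for predict1 = 1 and predict2 = 0, and none in the opposite cell.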