8.4 The Decision Tree with More Predictors

We have seen that our 5-key-predictor decision tree model improved on the sex-only prediction model. However, we know that our re-engineered data has more dimensions that contain useful information. Let us see if we can improve the decision tree model further by adding predictors suggested by the correlation analysis and PCA results. This time we use Group_size (travel in groups), Age_group, Embarked (embarkation port), Ticket_class and Title together with the attributes Sex, Pclass and Fare_pp.

# model3 construction: a decision tree using more predictors
model3 <- rpart(Survived ~ Sex + Fare_pp + Pclass + Title + Age_group + Group_size + Ticket_class + Embarked,
              data = train,
              method = "class")
# this model will be used in later chapters, so save it to a file
# that can be loaded back into memory later
save(model3, file = "model3.rda")
# Assess prediction accuracy on the train data;
# confusionMatrix() comes from the caret package
Predict_model3_train <- predict(model3, train, type = "class")
conMat <- confusionMatrix(as.factor(Predict_model3_train), as.factor(train$Survived))
conMat$table
##           Reference
## Prediction   0   1
##          0 517 100
##          1  32 242
# conMat$overall   # uncomment to see all overall statistics
predict3_train_accuracy <- conMat$overall["Accuracy"]
predict3_train_accuracy
## Accuracy 
##   0.8519
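As a quick sanity check, the same figure can be recovered directly from the confusion matrix table, since the correct predictions sit on its diagonal:

# accuracy = correctly classified / total observations
sum(diag(conMat$table)) / sum(conMat$table)
# (517 + 242) / 891 = 0.8519, matching the accuracy reported above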

Our assessment of model3's accuracy on the train data shows that the accuracy has increased to 85%, a clear improvement over model2's 82%. Let us use this model to make another prediction on the test dataset and see whether the accuracy on the test dataset also increases.

Prediction3 <- predict(model3, test, type = "class")
submit3<- data.frame(PassengerId = test$PassengerId, Survived = Prediction3)
write.csv(submit3, file = "./data/tree_model3.CSV", row.names = FALSE)

After submitting it to Kaggle, the feedback score was 0.77033. This is a big improvement on the test dataset.
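The gap between the training accuracy and the Kaggle score is also worth noting. A minimal check, with the Kaggle score hard-coded from the feedback above:

# rough measure of the generalization gap; the Kaggle score is
# taken from the leaderboard feedback, not computed locally
kaggle_score <- 0.77033
as.numeric(predict3_train_accuracy) - kaggle_score
# about 0.08: the tree fits the training data noticeably better
# than unseen data, which hints at some overfitting

With that in mind, let us plot the fuller tree.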

# plot the fuller classifier with prp() from rpart.plot
prp(model3, type = 0, extra = 1, under = TRUE)
# a fancier rendering of the same tree with rattle's fancyRpartPlot()
fancyRpartPlot(model3)

Figure 8.7: Decision trees with more predictors.
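The plotted tree can also be inspected in text form with base rpart functions:

# text listing of the tree: each node shows its split rule,
# observation count, loss and predicted class
print(model3)
# complexity parameter table: cross-validated error at each tree
# size, useful if we later decide to prune the tree
printcp(model3)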

Now, let us look into the differences between the two models' predicted values on the test dataset.

# build a comparison data frame to record each model's predictions
compare <- data.frame(test$PassengerId, predict2 = Prediction2, predict3 = Prediction3)
# find the rows where the two predictions disagree
dif <- compare[compare$predict2 != compare$predict3, ]
# show the differing rows
print.data.frame(dif, row.names = FALSE)
##  test.PassengerId predict2 predict3
##               896        0        1
##               913        0        1
##               925        0        1
##               956        0        1
##               972        0        1
##               981        0        1
##               982        0        1
##               996        0        1
##              1009        0        1
##              1051        0        1
##              1053        0        1
##              1084        0        1
##              1086        0        1
##              1088        0        1
##              1093        0        1
##              1098        1        0
##              1106        1        0
##              1117        0        1
##              1136        0        1
##              1141        0        1
##              1155        0        1
##              1173        0        1
##              1175        0        1
##              1176        0        1
##              1183        0        1
##              1199        0        1
##              1205        1        0
##              1225        0        1
##              1231        0        1
##              1236        0        1
##              1239        1        0
##              1246        0        1
##              1284        0        1
##              1301        0        1
##              1309        0        1

There are 35 passengers for whom the two models disagree.
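A quick tabulation confirms the count and the direction of the disagreements:

# how many predictions changed, and in which direction
nrow(dif)  # 35 passengers differ
table(model2 = dif$predict2, model3 = dif$predict3)
# model3 flips 31 passengers from 0 (perished) to 1 (survived)
# and 4 passengers (1098, 1106, 1205 and 1239) from 1 to 0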