9.4 Random Forest with All Variables

Now let us use random forest to build a model with the maximum number of predictors available from the attributes. We cannot use all of them, because the randomForest function cannot handle a categorical (factor) predictor with more than 53 levels. So we will not use the attribute Ticket.
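To see which attributes would hit that limit, we can count the number of levels of every factor column. This is a sketch: `count_levels` is a hypothetical helper, and the toy data frame below stands in for the real train data.

```r
# Sketch: count the levels of each factor column in a data frame,
# to spot predictors exceeding randomForest's 53-level limit
count_levels <- function(df) {
  factor_cols <- vapply(df, is.factor, logical(1))
  vapply(df[factor_cols], nlevels, integer(1))
}

# Toy example; on the real data, count_levels(train) would reveal
# that Ticket has far more than 53 distinct levels
toy <- data.frame(Sex = factor(c("male", "female")),
                  Fare = c(7.25, 71.28))
count_levels(toy)
## Sex 
##   2
```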

# RF_model3 construction with the maximum predictors
set.seed(2233)
# RF_model3 <- randomForest(Survived ~ Sex + Pclass + Age 
#                           + SibSp + Parch + Embarked +
#                             HasCabinNum + Friend_size +
#                             Fare_pp + Title + Deck +
#                             Ticket_class + Family_size +
#                             Group_size + Age_group, 
#                           data = train, importance=TRUE)
# save(RF_model3, file = "./data/RF_model3.rda")

We can assess the new model,

# Display RF_model3's details
load("./data/RF_model3.rda")
RF_model3
## 
## Call:
##  randomForest(formula = Survived ~ Sex + Pclass + Age + SibSp +      Parch + Embarked + HasCabinNum + Friend_size + Fare_pp +      Title + Deck + Ticket_class + Family_size + Group_size +      Age_group, data = train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 16.95%
## Confusion matrix:
##     0   1 class.error
## 0 485  64   0.1165756
## 1  87 255   0.2543860

Notice the default parameters mtry = 3 and ntree = 500: the number of variables tried at each split is 3, and 500 trees are built. The model's estimated OOB error rate is 17%, a decrease from model2's 18%. So the overall accuracy of the model has reached 83%.
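The value mtry = 3 is not arbitrary: for classification, randomForest defaults to the square root of the number of predictors, rounded down. With the 15 predictors in RF_model3's formula, that gives 3, matching the output above.

```r
# Default mtry for classification is floor(sqrt(p)),
# where p is the number of predictors in the formula
p <- 15           # predictors used in RF_model3
floor(sqrt(p))    # 3, as reported in the model summary
```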

Let us make a prediction on the train data to verify the model's training accuracy.

# Make a prediction on Train
RF_prediction3 <- predict(RF_model3, train)
#check up
conMat<- confusionMatrix(RF_prediction3, train$Survived)
conMat$table
##           Reference
## Prediction   0   1
##          0 536  38
##          1  13 304
# Misclassification error
paste('Accuracy =', round(conMat$overall["Accuracy"],2))
## [1] "Accuracy = 0.94"
paste('Error =', round(mean(train$Survived != RF_prediction3), 2)) 
## [1] "Error = 0.06"

We can see the accuracy on the train dataset has reached 94%. The prediction of death (class 0) has 536 correct and 38 wrong predictions; the prediction of survival (class 1) has 304 correct and 13 wrong predictions. The overall accuracy of 94% is again higher than the model's OOB accuracy of 83%.
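The 94% figure can be recomputed directly from the four counts in the confusion matrix above, as correct predictions over all training passengers:

```r
# Recompute training accuracy from the confusion matrix counts
correct <- 536 + 304             # predicted 0/actual 0 + predicted 1/actual 1
total   <- 536 + 38 + 13 + 304   # all 891 training passengers
round(correct / total, 2)
## [1] 0.94
```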

Let us make another submission to Kaggle to see if the prediction on unseen data has improved.

# produce a submission file for Kaggle
test$Pclass <- as.factor(test$Pclass)
test$Group_size <- as.factor(test$Group_size)

#make prediction
RF_prediction <- predict(RF_model3, test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = RF_prediction)
# Write it into a file "RF_Result3.CSV"
write.csv(submit, file = "./data/RF_Result3.CSV", row.names = FALSE)

The score is 0.77033. It shows a decrease in accuracy on the unseen test data.

Let us record these accuracy figures.

# Record RF_model3's accuracies (%): OOB, train, and Kaggle test
RF_model3_accuracy <- c(83, 94, 77)
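The three figures are easier to read back later if they carry names. The labels below are our own choice, not from the original code:

```r
# Name the entries so later model comparisons are self-explanatory
RF_model3_accuracy <- c(OOB = 83, Train = 94, Kaggle = 77)
RF_model3_accuracy["Kaggle"]
## Kaggle 
##     77
```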