9.2 Titanic prediction with a Random Forest

Let us now look at how we can implement the random forest algorithm for our Titanic prediction. R provides the 'randomForest' package; you can check the package documentation for full usage details. We will start with a direct function call using its default settings, and we may change the settings later. We will also use the original attributes first and then the re-engineered attributes, to see if we can improve the model.
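If the packages are not already installed, they can be added and loaded in the usual way. The short sketch below also loads caret, which we will use later in this section for its confusionMatrix() function:

# install.packages("randomForest")   # one-off installation if needed
# install.packages("caret")
library(randomForest)  # the random forest implementation used in this chapter
library(caret)         # provides confusionMatrix() for evaluating predictions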

Random Forest with Key Predictors

The process of using the randomForest package to build an RF model is the same as with the decision tree package rpart. Note also that if the dependent (response) variable is a factor, classification is assumed; otherwise, regression is assumed. So to use randomForest for classification, we need to convert the dependent variable into a factor.

# convert variables into factor
train$Survived <- as.factor(train$Survived)
# convert other attributes which are really categorical data but coded as numbers
train$Pclass <- as.factor(train$Pclass)
train$Group_size <- as.factor(train$Group_size)
#confirm types
sapply(train, class)
##  PassengerId     Survived       Pclass          Sex          Age        SibSp 
##    "integer"     "factor"     "factor"     "factor"    "numeric"    "integer" 
##        Parch       Ticket     Embarked  HasCabinNum  Friend_size      Fare_pp 
##    "integer"     "factor"     "factor"     "factor"    "integer"    "numeric" 
##        Title         Deck Ticket_class  Family_size   Group_size    Age_group 
##     "factor"     "factor"     "factor"    "integer"     "factor"     "factor"

Let us use the same five most relevant attributes that were used in the decision tree model2: Pclass, Sex, HasCabinNum, Deck and Fare_pp. We use all the default parameters of randomForest.

# Build the random forest model using Pclass, Sex, HasCabinNum, Deck and Fare_pp
set.seed(1234) # for reproducibility
# RF_model1 <- randomForest(Survived ~ Sex + Pclass + HasCabinNum + Deck + Fare_pp, data=train, importance=TRUE)
# save(RF_model1, file = "./data/RF_model1.rda")

Let us load the saved model and check its estimated accuracy.

load("./data/RF_model1.rda")
RF_model1
## 
## Call:
##  randomForest(formula = Survived ~ Sex + Pclass + HasCabinNum +      Deck + Fare_pp, data = train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 19.3%
## Confusion matrix:
##     0   1 class.error
## 0 505  44  0.08014572
## 1 128 214  0.37426901

We can see that the model used the default parameters: ntree = 500 and mtry = 2. The model's estimated accuracy is 80.7%, that is, 1 - 19.3% (the OOB error rate).
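The OOB error can also be read from the fitted model object directly, which is convenient when recording results. A minimal sketch, assuming RF_model1 is loaded as above:

# err.rate has one row per tree and columns OOB, 0 and 1;
# the last row holds the cumulative OOB error after all 500 trees
oob_error <- RF_model1$err.rate[RF_model1$ntree, "OOB"]
paste('OOB accuracy =', round(1 - oob_error, 3))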

Let us make a prediction on the training dataset and check the accuracy.

# Make a prediction on the training dataset
RF_prediction1 <- predict(RF_model1, train)
# check up
conMat <- confusionMatrix(RF_prediction1, train$Survived)
conMat$table
# Misclassification error
paste('Accuracy =', round(conMat$overall["Accuracy"], 2))
paste('Error =', round(mean(train$Survived != RF_prediction1), 2))

We can see that the prediction on the training dataset achieves 84% accuracy. The prediction of death has 516 correct predictions and 107 wrong ones; the prediction of survival has 235 correct predictions and 33 wrong ones.

The model's estimated accuracy after learning is about 80%, but our evaluation on the training dataset achieves 84%; the increase is expected because the model has already seen these records. Compared with the decision tree model2, in which the same attributes were used and the prediction accuracy on the training data was 81%, the accuracy has also increased. Let us make a prediction on the test dataset and submit it to Kaggle to obtain an accuracy score.

# produce a submission in Kaggle's required format, which has only two attributes: PassengerId and Survived
test$Pclass <- as.factor(test$Pclass)
test$Group_size <- as.factor(test$Group_size)

#make prediction
RF_prediction <- predict(RF_model1, test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = RF_prediction)
# Write it into the file "RF_Result1.CSV"
write.csv(submit, file = "./data/RF_Result1.CSV", row.names = FALSE)

We can see our random forest model scored 0.76555 in the Kaggle competition. It is interesting that the random forest model has not improved on the test dataset compared with the decision tree model with the same predictors: the accuracy was also 0.76555.

Let us record these accuracies:

# Record the results: c(OOB estimated accuracy, training accuracy, Kaggle score)
RF_model1_accuracy <- c(80, 84, 76.555)

Random Forest with More Variables

Now let us see if we can build a better model using more predictors. The predictors we use are identical to those in the decision tree model3.

### RF_model2 with more predictors
set.seed(2222)
# RF_model2 <- randomForest(as.factor(Survived) ~ Sex + Fare_pp + Pclass + Title + Age_group + Group_size + Ticket_class  + Embarked, 
#                           data = train, 
#                           importance=TRUE)
# # This model will be used in later chapters, so save it in a file and it can be loaded later.
# save(RF_model2, file = "./data/RF_model2.rda")

We can now assess the new model:

load("./data/RF_model2.rda")
RF_model2
## 
## Call:
##  randomForest(formula = as.factor(Survived) ~ Sex + Fare_pp +      Pclass + Title + Age_group + Group_size + Ticket_class +      Embarked, data = train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 16.84%
## Confusion matrix:
##     0   1 class.error
## 0 499  50  0.09107468
## 1 100 242  0.29239766

Notice the default parameters mtry = 2 and ntree = 500: the number of variables tried at each split is 2 and the number of trees built is 500. The model's estimated OOB error rate is 16.84%, a decrease compared with the first model's 19.3%. So the overall estimated accuracy of the model has reached 83.16%.
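Since the model was built with importance = TRUE, we can also inspect which predictors contribute most. A quick sketch using the randomForest package's importance functions:

# mean decrease in accuracy and in Gini impurity for each predictor
importance(RF_model2)
# dot chart of the same two measures
varImpPlot(RF_model2, main = "RF_model2 variable importance")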

Let us make a prediction on the training data to verify the model's training accuracy.

# RF_model2 Prediction on train
RF_prediction2 <- predict(RF_model2, train)
#check up
conMat<- confusionMatrix(RF_prediction2, train$Survived)
conMat$table
##           Reference
## Prediction   0   1
##          0 529  55
##          1  20 287
# Misclassification error
paste('Accuracy =', round(conMat$overall["Accuracy"],2))
## [1] "Accuracy = 0.92"
paste('Error =', round(mean(train$Survived != RF_prediction2), 2)) 
## [1] "Error = 0.08"

We can see the model's accuracy on the training dataset has reached 92%. The result shows that the prediction of death has 529 correct predictions and 55 wrong ones, while the prediction of survival has 287 correct predictions and 20 wrong ones. The overall accuracy is again higher than the model's estimated learning accuracy of 83.16%.

It has also increased a bit compared with RF_model1's estimated accuracy of 80.7% and its training accuracy of 84%. Compared with the decision tree model3, which has identical predictors, the accuracy on the training dataset was 85%.
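The gap between the training accuracy and the OOB estimate is expected: predicting on records the forest was trained on is optimistic. As a check, calling predict() without a newdata argument returns the out-of-bag predictions instead; a minimal sketch:

# OOB predictions: each passenger is predicted only by the trees
# that did not see that passenger during training
OOB_prediction2 <- predict(RF_model2)
paste('OOB accuracy =', round(mean(OOB_prediction2 == train$Survived), 2))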

Let us make another submission to Kaggle to see if the prediction on unseen data has been improved.

# produce a submission and submit to Kaggle 
test$Pclass <- as.factor(test$Pclass)
test$Group_size <- as.factor(test$Group_size)

#make prediction
RF_prediction <- predict(RF_model2, test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = RF_prediction)
# Write it into the file "RF_Result2.CSV"
write.csv(submit, file = "./data/RF_Result2.CSV", row.names = FALSE)

The feedback shows the score has increased to 0.78947! It has improved on both RF_model1 (0.76555) and the decision tree model3 (0.77033).

Let us record these accuracies.

# Record RF_model2's results: c(OOB estimated accuracy, training accuracy, Kaggle score)
RF_model2_accuracy <- c(83.16, 92, 78.95)

Random Forest with All Variables

Now let us use the random forest to build a model with the maximum number of predictors that can be used from the attributes. We cannot use all of them, since the randomForest function cannot handle a categorical attribute with more than 53 levels. So we will not use the attribute Ticket.
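We can verify which factor attributes would hit this limit with a short check:

# count the levels of every factor attribute; randomForest rejects
# categorical predictors with more than 53 levels (Ticket is one)
sapply(train[, sapply(train, is.factor)], nlevels)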

# RF_model3 construction with the maximum predictors
set.seed(2233)
# RF_model3 <- randomForest(Survived ~ Sex + Pclass + Age 
#                           + SibSp + Parch + Embarked +
#                             HasCabinNum + Friend_size +
#                             Fare_pp + Title + Deck +
#                             Ticket_class + Family_size +
#                             Group_size + Age_group, 
#                           data = train, importance=TRUE)
# save(RF_model3, file = "./data/RF_model3.rda")

We can now assess the new model:

# Display RF_model3's details
load("./data/RF_model3.rda")
RF_model3
## 
## Call:
##  randomForest(formula = Survived ~ Sex + Pclass + Age + SibSp +      Parch + Embarked + HasCabinNum + Friend_size + Fare_pp +      Title + Deck + Ticket_class + Family_size + Group_size +      Age_group, data = train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 16.95%
## Confusion matrix:
##     0   1 class.error
## 0 485  64   0.1165756
## 1  87 255   0.2543860

Notice the default parameters mtry = 3 and ntree = 500: the number of variables tried at each split is now 3 and the number of trees built is 500. The model's estimated OOB error rate is 16.95%, a slight increase compared with RF_model2's 16.84%. So the overall estimated accuracy of the model is about 83%.
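To see how the OOB error settles down as trees are added, we can plot the fitted model; a minimal sketch:

# the black line is the OOB error; the other lines are the per-class errors
plot(RF_model3, main = "RF_model3 error rate vs number of trees")
legend("topright", colnames(RF_model3$err.rate), col = 1:3, lty = 1:3)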

Let us make a prediction on the training data to verify the model's training accuracy.

# Make a prediction on Train
RF_prediction3 <- predict(RF_model3, train)
#check up
conMat<- confusionMatrix(RF_prediction3, train$Survived)
conMat$table
##           Reference
## Prediction   0   1
##          0 536  38
##          1  13 304
# Misclassification error
paste('Accuracy =', round(conMat$overall["Accuracy"],2))
## [1] "Accuracy = 0.94"
paste('Error =', round(mean(train$Survived != RF_prediction3), 2)) 
## [1] "Error = 0.06"

We can see the accuracy on the training dataset has reached 94%. The result shows that the prediction of death has 536 correct predictions and 38 wrong ones, while the prediction of survival has 304 correct predictions and 13 wrong ones. It is again much higher than the model's estimated learning accuracy of 83%.

Let us make another submission to Kaggle to see if the prediction on unseen data has been improved.

# produce a submission for Kaggle
test$Pclass <- as.factor(test$Pclass)
test$Group_size <- as.factor(test$Group_size)

#make prediction
RF_prediction <- predict(RF_model3, test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = RF_prediction)
# Write it into the file "RF_Result3.CSV"
write.csv(submit, file = "./data/RF_Result3.CSV", row.names = FALSE)

The score is 0.77033. It shows a decrease in accuracy compared with RF_model2's 0.78947.

Let us record these accuracies.

# Record RF_model3's results: c(OOB estimated accuracy, training accuracy, Kaggle score)
RF_model3_accuracy <- c(83, 94, 77)
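Finally, a quick way to put the three records side by side for comparison; a sketch assuming the three accuracy vectors recorded above:

# assemble the recorded accuracies into one comparison table
RF_compare <- data.frame(rbind(RF_model1_accuracy,
                               RF_model2_accuracy,
                               RF_model3_accuracy))
names(RF_compare) <- c("OOB", "Train", "Kaggle")
RF_compare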