9.2 Random Forest with Key Predictors

The process of using the randomForest package to build an RF model is the same as with the decision tree package rpart. Note that if the dependent (response) variable is a factor, classification is assumed; otherwise, regression is assumed. So to use randomForest for classification, we need to convert the dependent variable into a factor.

# convert variables into factors
# convert other attributes that are really categorical but are stored as numbers
train$Group_size <- as.factor(train$Group_size)
#confirm types
sapply(train, class)
##  PassengerId     Survived       Pclass          Sex          Age        SibSp
##    "integer"     "factor"     "factor"     "factor"    "numeric"    "integer"
##        Parch       Ticket     Embarked  HasCabinNum  Friend_size      Fare_pp
##    "integer"     "factor"     "factor"     "factor"    "integer"    "numeric"
##        Title         Deck Ticket_class  Family_size   Group_size    Age_group
##     "factor"     "factor"     "factor"    "integer"     "factor"     "factor"

Let us use the same five most related attributes as in the decision tree model2: Pclass, Sex, HasCabinNum, Deck and Fare_pp. We use all the default parameters of randomForest.

# Build the random forest model using Pclass, Sex, HasCabinNum, Deck and Fare_pp
library(randomForest)
set.seed(1234) # for reproducibility
RF_model1 <- randomForest(as.factor(Survived) ~ Sex + Pclass + HasCabinNum + Deck + Fare_pp, data=train, importance=TRUE)
save(RF_model1, file = "./data/RF_model1.rda")

Let us check the model’s prediction accuracy.

RF_model1
##
## Call:
##  randomForest(formula = as.factor(Survived) ~ Sex + Pclass + HasCabinNum +      Deck + Fare_pp, data = train, importance = TRUE)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
##
##         OOB estimate of  error rate: 19.3%
## Confusion matrix:
##     0   1 class.error
## 0 505  44  0.08014572
## 1 128 214  0.37426901

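We can see that the model uses the default parameters: ntree = 500 and mtry = 2 (for classification, the default mtry is the floor of the square root of the number of predictors, here 5). The model’s estimated accuracy is 80.7%, that is, 1 - 19.3% (the OOB error).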
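
The figures in the printed summary can be reproduced by hand. The sketch below is an illustration rather than part of the original workflow: it copies the OOB confusion matrix from the output above into a hypothetical matrix named `oob_confusion` and recomputes the OOB error, the per-class errors and the default mtry for five predictors.

```r
# Default mtry for classification: floor(sqrt(number of predictors)), at least 1
n_predictors <- 5
mtry_default <- max(floor(sqrt(n_predictors)), 1)

# OOB confusion matrix copied from the printed model (rows = actual class)
oob_confusion <- matrix(c(505,  44,
                          128, 214),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(actual = c("0", "1"),
                                        predicted = c("0", "1")))

# Overall OOB error: off-diagonal counts divided by all counts
oob_error <- 1 - sum(diag(oob_confusion)) / sum(oob_confusion)

# Per-class error: misclassified cases in a row divided by the row total
class_error <- 1 - diag(oob_confusion) / rowSums(oob_confusion)

mtry_default
round(100 * oob_error, 1)   # OOB error in percent
round(class_error, 4)
```

The per-class errors show that the model is much better at recognising passengers who died (class 0) than passengers who survived (class 1).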

Let us make a prediction on the training dataset and check the accuracy.

# Make predictions on the training dataset
RF_prediction1 <- predict(RF_model1, train)
# check the result; confusionMatrix() comes from the caret package
conMat <- confusionMatrix(RF_prediction1, train$Survived)
conMat$table
##           Reference
## Prediction   0   1
##          0 521 112
##          1  28 230
# Misclassification error
paste('Accuracy =', round(conMat$overall["Accuracy"],2))
##  "Accuracy = 0.84"
paste('Error =', round(mean(train$Survived != RF_prediction1), 2))
##  "Error = 0.16"

We can see that the prediction on the training dataset has achieved 84% accuracy. Of the 633 passengers predicted to have died, 521 predictions are correct and 112 are wrong. Of the 258 passengers predicted to have survived, 230 predictions are correct and 28 are wrong.
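
The table printed by confusionMatrix() puts predictions in rows and reference (actual) values in columns, which is the opposite orientation to the matrix printed by randomForest. As a small check, the sketch below copies the counts from the table above into a hypothetical matrix named `train_table` and recomputes the accuracy and the share of correct predictions in each row.

```r
# Training confusion matrix copied from conMat$table
# (rows = prediction, columns = reference)
train_table <- matrix(c(521, 112,
                         28, 230),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(Prediction = c("0", "1"),
                                      Reference = c("0", "1")))

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(train_table)) / sum(train_table)

# Share of correct predictions within each predicted class
correct_share <- diag(train_table) / rowSums(train_table)

round(accuracy, 2)
round(correct_share, 2)
```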

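The model has an OOB-estimated accuracy of about 80%, but our evaluation on the training dataset achieves 84%. This increase is expected: the OOB estimate scores each passenger using only the trees that did not see that passenger during training, whereas predicting on the training data uses all the trees, including those fitted on that passenger. Compared with the decision tree model2, in which the same attributes were used and the prediction accuracy on the training data was 81%, the accuracy has also increased. Let us make a prediction on the test dataset and submit it to Kaggle to obtain an accuracy score.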
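
The gap between the OOB estimate and the accuracy measured on the training data can be reproduced on any dataset. The sketch below is only an illustration and assumes the randomForest package is installed; it uses the built-in iris data as a stand-in for train, and the names `rf_demo`, `oob_pred` and `train_pred` are hypothetical. It relies on the fact that predict() on a randomForest object returns the out-of-bag predictions when no new data is supplied.

```r
library(randomForest)

set.seed(1234)
# iris is a stand-in dataset, not the Titanic data used in this chapter
rf_demo <- randomForest(Species ~ ., data = iris)

# Without newdata, predict() returns the out-of-bag predictions
oob_pred <- predict(rf_demo)

# With the training data as newdata, every tree votes on every row
train_pred <- predict(rf_demo, iris)

oob_acc   <- mean(oob_pred == iris$Species)
train_acc <- mean(train_pred == iris$Species)

c(OOB = oob_acc, Training = train_acc)
```

With the chapter's own objects, the same comparison would be between predict(RF_model1) and predict(RF_model1, train). The training figure is normally the higher of the two, which is why it should not be read as an estimate of performance on unseen passengers.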

# produce a submission in the format Kaggle requires: only two attributes, PassengerId and Survived
test$Pclass <- as.factor(test$Pclass)
test$Group_size <- as.factor(test$Group_size)

#make prediction
RF_prediction <- predict(RF_model1, test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = RF_prediction)
# Write it into a file "RF_Result1.CSV"
write.csv(submit, file = "./data/RF_Result1.CSV", row.names = FALSE)
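
The conversions of test$Pclass and test$Group_size above matter because the predictors in the test data should have the same types as in the training data, and a factor in the test data should not carry levels the model has never seen. A minimal sketch of one way to guarantee matching levels, using hypothetical toy data frames `train_toy` and `test_toy` rather than the real data, is to build the test factor from the training levels:

```r
# Toy stand-ins for the real train/test frames (hypothetical values)
train_toy <- data.frame(Pclass = factor(c(1, 2, 3, 3)))
test_toy  <- data.frame(Pclass = c(3, 1, 1))   # numeric codes; class 2 is absent

# Reuse the training levels so both factors share the same level set
test_toy$Pclass <- factor(test_toy$Pclass, levels = levels(train_toy$Pclass))

# Both checks should hold before calling predict()
identical(levels(test_toy$Pclass), levels(train_toy$Pclass))
!anyNA(test_toy$Pclass)   # an NA here would mean a value unseen in training
```

Calling as.factor() on the test column alone would only create the levels that happen to occur in the test data, so reusing the training levels is the more defensive choice.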

We can see that our random forest model scored 0.76555 in the Kaggle competition. Interestingly, the random forest model has not improved on the test dataset compared with the decision tree model with the same predictors, whose accuracy was also 0.76555.

Let us record these accuracies:

# Record the results
RF_model1_accuracy <- c(80, 84, 76.555)