9.5 Comparison of the Three Random Forest Models

We have produced three random forest models, and each performs differently in terms of prediction accuracy on the test dataset. Let us make a quick comparison among them (Table 9.1).

library(tidyr)    # gather()
library(ggplot2)  # ggplot(), geom_col()
Model <- c("RF_Model1","RF_Model2","RF_Model3")
Pre <- c("Sex, Pclass, HasCabinNum, Deck, Fare_pp", "Sex, Fare_pp, Pclass, Title, Age_group, Group_size, Ticket_class, Embarked", "Sex, Pclass, Age, SibSp, Parch, Embarked, HasCabinNum, Friend_size, Fare_pp, Title, Deck, Ticket_class, Family_size, Group_size, Age_group")

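# Accuracy in percent: the model's own estimate while learning,
# then validation on the training dataset and on the test dataset
Learn <- c(80.0, 83.16, 83.0)
Train <- c(84, 92, 78)
Test <- c(76.555, 78.95, 77.03)
# df1 feeds the table; df2 drops the predictor list and feeds the plot
df1 <- data.frame(Model, Pre, Learn, Train, Test)
df2 <- data.frame(Model, Learn, Train, Test)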
knitr::kable(df1, longtable = TRUE, booktabs = TRUE, digits = 2,
  col.names = c("Models", "Predictors", "Accuracy on Learn", "Accuracy on Train", "Accuracy on Test"),
  caption = 'The Comparison among 3 Random Forest models'
)

Table 9.1: The Comparison among 3 Random Forest models
Models	Predictors	Accuracy on Learn	Accuracy on Train	Accuracy on Test
RF_Model1	Sex, Pclass, HasCabinNum, Deck, Fare_pp	80.00	84	76.56
RF_Model2	Sex, Fare_pp, Pclass, Title, Age_group, Group_size, Ticket_class, Embarked	83.16	92	78.95
RF_Model3	Sex, Pclass, Age, SibSp, Parch, Embarked, HasCabinNum, Friend_size, Fare_pp, Title, Deck, Ticket_class, Family_size, Group_size, Age_group	83.00	78	77.03

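# Reshape to long format: one row per model and dataset
df.long <- gather(df2, Dataset, Accuracy, -Model, factor_key = TRUE)
# Grouped bar chart of accuracy by model and dataset
ggplot(data = df.long, aes(x = Model, y = Accuracy, fill = Dataset)) +
  geom_col(position = position_dodge())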

Figure 9.2: Random Forest models’ accuracy on model learning, the Train dataset and the Test dataset.

The table and the figure suggest the following observations:
More predictors do not necessarily mean better performance with random forest models. RF_Model3 uses 15 predictors but scores lower on the test dataset (77.03%) than RF_Model2, which uses 8 (78.95%).
Validation on the training dataset is not a reliable indicator. Higher accuracy on the training dataset does not imply higher accuracy on the test dataset: RF_Model1 scores 84% on the training dataset against 78% for RF_Model3, yet RF_Model3 does slightly better on the test dataset.
All the models show some degree of overfitting. That is, the accuracy on the test dataset is lower than on the training dataset, and even lower than the accuracy the model itself estimated while it was being constructed.
The cause of the overfitting is complicated. It may be related to any of these factors: the number of predictors used to build the model, the dataset used to build the model, and the model's default parameters.
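The overfitting described above can be quantified directly from the figures in Table 9.1. What follows is only a minimal sketch and not part of the original analysis: the names `N_pred`, `gaps`, and `best` are introduced here for illustration, and the predictor counts are assumed to be the number of entries in the Predictors column of the table.

```r
# Accuracy figures (percent) copied from Table 9.1
Model <- c("RF_Model1", "RF_Model2", "RF_Model3")
Learn <- c(80.0, 83.16, 83.0)
Train <- c(84, 92, 78)
Test  <- c(76.555, 78.95, 77.03)

# Assumption: predictor counts obtained by counting the names
# listed for each model in the Predictors column
N_pred <- c(5, 8, 15)

# A positive gap means the model looked better before it met the test data
gaps <- data.frame(
  Model,
  N_pred,
  Learn_minus_Test = Learn - Test,
  Train_minus_Test = Train - Test
)
print(gaps, digits = 3)

# Model with the highest test accuracy, and model with the most predictors
best <- Model[which.max(Test)]
most_predictors <- Model[which.max(N_pred)]
```

With these figures every gap is positive, and the model with the most predictors (`most_predictors`) is not the model with the best test accuracy (`best`).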

Compared with the four decision tree models built in the previous chapter, the random forest models match or outperform all of them on the test dataset. The lowest accuracy among the random forest models is the same as the highest accuracy among the decision tree models (76.55%).