Summary

In this chapter, we demonstrated the use of random forest models for the Titanic problem. We built models with different numbers of predictors; their accuracy on the test dataset is shown in Figure 9.2.

Despite the feature engineering and the careful selection of predictors, the random forest models achieve higher accuracy on the training dataset, but their accuracy falls dramatically on the test dataset. This demonstrates a practical problem in data science projects: overfitting. Overfitting is serious because it is performance on the test dataset (unseen data) that matters in real practice. Overfitting can be detected and reduced with cross-validation, which we discuss in the next chapter.
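The gap between training and test accuracy, and the way cross-validation exposes it, can be illustrated with a small self-contained sketch. This is not the chapter's Titanic model. It is written in Python, uses synthetic data with deliberately noisy labels, and uses a 1-nearest-neighbour classifier as a stand-in for the random forest, because that classifier memorises its training data and so overfits in an obvious way. All function names and parameters here (`make_data`, `predict_1nn`, `cross_val_accuracy`, the 30% noise rate, 5 folds) are assumptions made for illustration.

```python
import random


def make_data(n, rng, noise=0.3):
    """Synthetic binary data: the label depends on x[0], then is flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        y = int(x[0] > 0.5)
        if rng.random() < noise:
            y = 1 - y  # label noise that a model should not try to memorise
        data.append((x, y))
    return data


def predict_1nn(train, x):
    """Return the label of the training point closest to x (squared Euclidean distance)."""
    nearest = min(train, key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2)
    return nearest[1]


def accuracy(train, eval_set):
    """Fraction of points in eval_set whose label is predicted correctly from train."""
    hits = sum(predict_1nn(train, x) == y for x, y in eval_set)
    return hits / len(eval_set)


def cross_val_accuracy(data, k=5):
    """k-fold cross-validation: each fold is held out once and scored by a model built on the rest."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        rest = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(accuracy(rest, held_out))
    return sum(scores) / k


rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_data(300, rng)
test = make_data(200, rng)

train_acc = accuracy(train, train)    # scored on the data the model memorised
test_acc = accuracy(train, test)      # scored on unseen data
cv_acc = cross_val_accuracy(train)    # estimated from the training data alone

print(f"train accuracy:           {train_acc:.2f}")
print(f"test accuracy:            {test_acc:.2f}")
print(f"cross-validated accuracy: {cv_acc:.2f}")
```

The training accuracy is perfect because every training point is its own nearest neighbour, while the test accuracy is clearly lower. The cross-validated estimate is computed without touching the test data, yet it also falls well below the training accuracy, which is how cross-validation reveals overfitting before a model meets unseen data.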