Chapter 10 Model Cross-Validation

“We cannot solve our problems with the same thinking we used when we created them.”

                                   -- Albert Einstein

In the previous two chapters, we demonstrated how to build prediction models with two popular algorithms, the Decision Tree and the Random Forest. The models achieved different levels of prediction accuracy. A major problem with both models is that their accuracy dropped on the test dataset. An even bigger problem is that the size of this drop differs between models and is hard to predict in advance. Together, these issues make it difficult to choose which model should be used in a real application.

We are fortunate that the Kaggle competition provides us with a test dataset and feedback on our model’s performance. In real applications, as the Titanic competition simulates, the test dataset contains no values for the response variable (survival status), so we have nothing to compare our predictions against and no direct way to evaluate a model’s accuracy.

We could fall back on the approach used in Chapter 8, where we applied the model to the training dataset and compared its predictions with the known values to estimate its prediction accuracy. A related method, the out-of-bag (OOB) error estimate, was used for the random forest models in Chapter 9. However, we already know that accuracy estimated in these ways is not reliable.

Data science offers a systematic method for evaluating a prediction model’s performance. It is called “Cross-Validation (CV)”. In this chapter, we will demonstrate how to use CV to evaluate a model’s performance.
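To make the idea concrete before we walk through it in detail, the following is a minimal, hypothetical sketch of k-fold cross-validation on the Titanic training data, written here with Python’s scikit-learn. The file name (train.csv), feature choices, and model settings are illustrative assumptions and are not taken from this chapter.

# A minimal sketch: training-set accuracy versus 10-fold cross-validation
# accuracy for a decision tree on the Titanic training data.
# Illustrative assumptions: "train.csv" in Kaggle Titanic format and a
# simple numeric feature set.
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Kaggle Titanic training data and prepare a few features.
train = pd.read_csv("train.csv")
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Age"] = train["Age"].fillna(train["Age"].median())

X = train[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]]
y = train["Survived"]

# Accuracy measured on the training data itself is optimistic.
tree = DecisionTreeClassifier(random_state=1).fit(X, y)
print("Training-set accuracy: %.3f" % accuracy_score(y, tree.predict(X)))

# 10-fold cross-validation: the data are split into 10 folds; each fold
# is held out once as a test set while the model is fitted on the other
# 9 folds, so every row is predicted by a model that never saw it.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=1),
                         X, y, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean CV accuracy: %.3f" % scores.mean())

The mean of the fold accuracies is usually a far more realistic estimate of how the model will perform on unseen data than the training-set accuracy printed first, which is the point this chapter develops.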