Summary

Recall that the initial motivation for introducing cross-validation was to detect overfitting in a prediction model, so that an overfitted model is not used in a real application. To identify overfitting, a single-model CV is sufficient. After splitting the training dataset into trainData and validData, the model's overfitting can be clearly seen on validData. For example, random forest model4 in Table 10.1 has an estimated accuracy of 83.99% at model construction and 96.63% accuracy on the training dataset, but only 77.09% accuracy on the validation dataset, which the model has never seen before. This large drop in accuracy indicates that the model overfits. It is confirmed by the even worse prediction accuracy on the test dataset, which was 75.12%.
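
The single-model check can be sketched in a few lines of R. This is a minimal illustration, assuming the caret and randomForest packages; the iris data, the 70/30 split ratio, and the object names here are illustrative stand-ins, not the chapter's own dataset or models.

```r
library(caret)
library(randomForest)

set.seed(2022)
# Illustrative 70/30 split of the labelled data into trainData and validData
idx       <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
trainData <- iris[idx, ]
validData <- iris[-idx, ]

model <- randomForest(Species ~ ., data = trainData, ntree = 500)

# Accuracy at construction (out-of-bag), on trainData, and on validData
oobAcc   <- 1 - model$err.rate[model$ntree, "OOB"]
trainAcc <- mean(predict(model, trainData) == trainData$Species)
validAcc <- mean(predict(model, validData) == validData$Species)

# A large gap between trainAcc and validAcc signals overfitting
round(c(construction = oobAcc, train = trainAcc, validation = validAcc), 4)
```

The pattern to look for is the same as in Table 10.1: training accuracy far above validation accuracy means the model has memorised the training sample rather than learned a generalisable rule.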

Cross-validation with multiple models was expected to provide a conclusive judgment on whether a model is the best. It can provide some evidence, but it does not deliver a definite conclusion. Perhaps it confirms the point that each model has its own conditions and use cases. When multiple models can all be applied to a problem, each model's performance can be affected by the data samples, by the model's parameters, or by the combination of the two (see the sketch below). In data science, the way to drill down further into this issue is called model fine-tuning, which is what we will discuss in the next chapter.
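
As a sketch of the multi-model comparison, the following assumes caret's train() and resamples(); the two candidate methods (rf and lda) and the iris data are illustrative choices, not the chapter's models. Fitting all candidates on the same folds shows how much each model's accuracy varies with the data sample, rather than crowning a single winner.

```r
library(caret)

set.seed(2022)
# Use the same 10 folds for every model so the comparison is fair
folds <- createFolds(iris$Species, k = 10, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)

# Two illustrative candidates: a random forest and linear discriminant analysis
rfFit  <- train(Species ~ ., data = iris, method = "rf",  trControl = ctrl)
ldaFit <- train(Species ~ ., data = iris, method = "lda", trControl = ctrl)

# Per-fold accuracies: the spread shows how much each model depends on the sample
summary(resamples(list(rf = rfFit, lda = ldaFit)))
```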