6.23 Resampling methods (2): Cross-validation
- Conceptual confusion… training, validation and test data sets
- Training error rate vs. test error rate (James et al. 2013, Chap. 2.2.3)
- “test error is the average error that results from using a statistical learning method to predict the response on a new observation – that is, a measurement that was not used in training the method” (James et al. 2013, 176)
- Objective: Estimate this test error
- i.e., the prediction errors our model would make for new data, e.g., for prisoners whose data were not used to train the model
- In the absence of a large designated test set, a number of techniques can be used to estimate the test error rate using the available training data (James et al. 2013, 176)
- Cross-validation (CV) (James et al. 2013, Chap. 5.1)
- Class of methods that estimate the test error rate by holding out a subset of the training observations from the fitting process and then applying the fitted statistical learning method to those held-out observations
- James et al. (2013) discuss CV for regression [Chap. 5.1.1-5.1.4] and for classification [Chap. 5.1.5]. Here we focus on the latter!
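
The hold-out idea above can be illustrated with a minimal Python sketch of k-fold cross-validation for a classification problem. Everything in it is an assumption made for illustration and does not come from James et al. (2013): the data are synthetic, the classifier is a simple nearest-centroid rule on one feature, and the function names (`fit_centroids`, `predict`, `error_rate`, `k_fold_cv_error`) are hypothetical. The error in each fold is the share of misclassified held-out observations, and the CV estimate is the average over folds.

```python
import random


def fit_centroids(xs, ys):
    """Training step (illustrative classifier): mean of the feature per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def error_rate(centroids, xs, ys):
    """Share of misclassified observations, i.e. the mean of I(y != y_hat)."""
    wrong = sum(predict(centroids, x) != y for x, y in zip(xs, ys))
    return wrong / len(xs)


def k_fold_cv_error(xs, ys, k=5, seed=0):
    """Estimate the test error rate by k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    # Split the shuffled indices into k disjoint folds.
    folds = [idx[i::k] for i in range(k)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit only on the observations that are NOT held out ...
        model = fit_centroids([xs[i] for i in train], [ys[i] for i in train])
        # ... and evaluate only on the held-out observations.
        fold_errors.append(
            error_rate(model, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    return sum(fold_errors) / k


# Synthetic two-class data (an assumption, purely for illustration).
rng = random.Random(42)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(1.5 * y, 1.0) for y in ys]

# Training error: the model is evaluated on the same data it was fitted to.
training_error = error_rate(fit_centroids(xs, ys), xs, ys)
# CV error: every observation is predicted by a model that never saw it.
cv_error = k_fold_cv_error(xs, ys, k=5)

print(f"training error rate: {training_error:.3f}")
print(f"5-fold CV estimate of test error rate: {cv_error:.3f}")
```

The key design point is that each observation is predicted exactly once, by a model fitted without it, so the averaged error mimics prediction on new data rather than on the data used for training.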
References
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer.