6.23 Resampling methods (2): Cross-validation

  • Conceptual confusion… training, validation, and test data sets
  • Training error rate vs. test error rate (James et al. 2013, Chap. 2.2.3)
    • “test error is the average error that results from using a statistical learning method to predict the response on a new observation – that is, a measurement that was not used in training the method” (James et al. 2013, 176)
  • Objective: Estimate this test error
    • i.e., the prediction errors our model would make on new data, e.g., on prisoners whose records were not used to train the model
  • In the absence of a large designated test set, a number of techniques can be used to estimate the test error rate using the available training data (James et al. 2013, 176)
  • Cross-validation (CV) (James et al. 2013, Chap. 5.1)
    • Class of methods that estimate the test error rate by holding out a subset of the training observations from the fitting process and then applying the statistical learning method to those held-out observations
    • James et al. (2013) discuss CV for regression [Chap. 5.1.1-5.1.4] and for classification [Chap. 5.1.5]; here we focus on the latter
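
The hold-out idea behind CV for classification can be sketched in a few lines of Python. This is only an illustrative sketch, not code from James et al. (2013): the data are synthetic, the classifier is a simple 1-nearest-neighbour rule chosen for brevity, and the helper names (`one_nn_predict`, `error_rate`, `k_fold_cv_error`) are hypothetical.

```python
import random


def one_nn_predict(train_X, train_y, x):
    """Predict the label of x as the label of its nearest training point."""
    nearest = min(range(len(train_X)), key=lambda i: (train_X[i] - x) ** 2)
    return train_y[nearest]


def error_rate(train_X, train_y, eval_X, eval_y):
    """Share of evaluation observations that the classifier gets wrong."""
    wrong = sum(
        one_nn_predict(train_X, train_y, x) != label
        for x, label in zip(eval_X, eval_y)
    )
    return wrong / len(eval_X)


def k_fold_cv_error(X, y, k=5, seed=0):
    """Estimate the test error rate by k-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)          # random assignment to folds
    folds = [idx[i::k] for i in range(k)]     # k roughly equal-sized folds
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the other k-1 folds, evaluate on the held-out fold only.
        fold_errors.append(
            error_rate(
                [X[i] for i in train], [y[i] for i in train],
                [X[i] for i in held_out], [y[i] for i in held_out],
            )
        )
    return sum(fold_errors) / k               # average over the k folds


# Synthetic data (an assumption for illustration): one predictor, noisy binary label.
rng = random.Random(42)
X = [rng.uniform(0, 1) for _ in range(100)]
y = [int(x + rng.gauss(0, 0.2) > 0.5) for x in X]

train_err = error_rate(X, y, X, y)   # evaluated on the data used for fitting
cv_err = k_fold_cv_error(X, y, k=5)  # evaluated on held-out observations

print(f"training error rate: {train_err:.3f}")
print(f"5-fold CV error rate: {cv_err:.3f}")
```

Because every observation is its own nearest neighbour, the training error rate here is zero, while the CV estimate is computed only on observations that were held out from fitting. That contrast is the reason the training error rate is a poor stand-in for the test error rate.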

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer.