6.29 Lab: Resampling & cross-validation

In this lab we will use our example from above – predicting recidivism – to illustrate how we can use different cross-validation methods to estimate the test error rate.

We first import the data into R:
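The import step itself is not reproduced here; a minimal sketch could look as follows (the file name recidivism.csv and the use of read.csv() are assumptions – adapt them to wherever your copy of the data is stored):

```r
# A sketch of the import step; "recidivism.csv" is a placeholder file name
library(dplyr)

data <- read.csv("recidivism.csv")
```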

6.29.1 Simple sampling

To start, sampling from a dataset in R is quite easy. In case you ever need to choose a random subset of observations, there are convenient dplyr functions. Importantly, make sure that the data contains a unique identifier for your units, as it allows you to check afterwards which units/individuals your samples are made up of. Here the unique identifier is called id.
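One way to draw such a sample is dplyr's sample_n() – a sketch, where the sample size of 1000 is inferred from the output below:

```r
# Draw a random subset of 1000 observations; sample_n() keeps whole rows,
# so the unique identifier id travels with every sampled unit
data.sample <- sample_n(data, 1000)

nrow(data.sample)  # size of the sample
nrow(data)         # size of the full dataset
```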

## [1] 1000
## [1] 3607

6.29.2 Validation set approach

We used the validation set approach in Lab: Predicting recidivism (Classification) but repeat it here for completeness. First, we assign a random number to each observation in our dataset. Then we split the dataset into training data and test data by taking the subsets of units that lie below/above a cutoff on this vector of random numbers:
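As a sketch, the split could be coded as follows (the seed and the 0.5 cutoff are assumptions – any cutoff that yields the desired share of training data works):

```r
set.seed(100)  # assumption: fixing a seed makes the split reproducible

# Assign a uniform random number to each observation...
data <- data %>% mutate(random = runif(n()))

# ...and split below/above the cutoff
data.train <- data %>% filter(random <= 0.5)
data.test  <- data %>% filter(random > 0.5)
```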

Subsequently, we would use data.train to calculate the training error and data.test to calculate the test error, as described in Lab: Predicting recidivism (Classification). Below we provide a quick summary of the code:
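The summary below is a sketch rather than the exact code from that lab: the outcome name recidivate and the formula recidivate ~ . are placeholders for the outcome and predictors used there.

```r
# Fit a logistic regression on the training data
fit <- glm(recidivate ~ ., data = data.train, family = binomial)

# Training error: classify with a 0.5 threshold and count misclassifications
pred.train <- as.numeric(predict(fit, data.train, type = "response") > 0.5)
sum(pred.train != data.train$recidivate)   # number of errors
mean(pred.train != data.train$recidivate)  # training error rate

# Test error: the same calculation on the held-out data
pred.test <- as.numeric(predict(fit, data.test, type = "response") > 0.5)
sum(pred.test != data.test$recidivate)     # number of errors
mean(pred.test != data.test$recidivate)    # test error rate
```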

## [1] 1197
## [1] 0.3318547
## [1] 1139
## [1] 0.3157749

Importantly, if we repeated the steps above we would obtain different estimates, because the random splits would differ.

6.29.3 Leave-one-out cross-validation (LOOCV)

LOOCV for classification problems is described in James et al. (2013, Ch. 5.1.5), with a lab for linear models in Ch. 5.3. Here we focus on a logistic regression model.

The LOOCV estimate can be computed automatically for any generalized linear model using the glm() and cv.glm() functions (the latter is part of the boot library). Importantly, this time we start by fitting the model on the complete dataset data. Then we use cv.glm() for the cross-validation. Remember that this takes time: it is effectively a loop running through all n possible splits!
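In code, these steps could look like this (the formula is the same placeholder as above):

```r
library(boot)

# Step 1: fit the logistic regression on the complete dataset
fit <- glm(recidivate ~ ., data = data, family = binomial)

# Step 2: LOOCV; without a K argument, cv.glm() refits the model n times,
# leaving out one observation each time, so this can take a while
cv.err <- cv.glm(data, fit)
cv.err$delta
```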

## [1] 0.2131228 0.2131227

The cv.glm() function produces a list with several components. The two numbers in the delta vector contain the cross-validation results: the first is the raw leave-one-out estimate and the second a bias-corrected version. Here the first one is around 0.213.

6.29.4 k-Fold Cross-Validation

cv.glm() can also be used to implement k-fold CV. Below we use \(k = 5\) (you could also use \(k = 10\)) on our original dataset. At the start we set a random seed and create a vector to store the results.
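A sketch of this step is given below. It assumes a seed value, and that the two numbers printed afterwards are the two components of delta for \(k = 5\) (raw and bias-corrected); to use \(k = 10\) instead, simply change the K argument.

```r
set.seed(100)  # assumption: the actual seed is not shown in the text

# Vector to store the results (raw and bias-corrected estimates)
cv.error <- rep(0, 2)

fit <- glm(recidivate ~ ., data = data, family = binomial)

# 5-fold cross-validation: the data is split into 5 folds, and each fold
# serves once as the hold-out set
cv.error <- cv.glm(data, fit, K = 5)$delta

cv.error[1]  # raw k-fold estimate
cv.error[2]  # bias-corrected version
```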

## [1] 0.2129951
## [1] 0.2130808

You may notice that k-fold cross-validation is far quicker to compute than LOOCV. The reason is that we split the data fewer times and hence estimate fewer models (k instead of n).

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer.