9.10 Lab: Resampling & cross-validation
In this lab we will use our example from above – predicting recidivism – to illustrate how we can use different cross-validation methods to estimate the test error rate.
We first import the data into R:
library(tidyverse)
set.seed(0)
# data <- read.csv("https://github.com/propublica/compas-analysis/raw/master/compas-scores-two-years.csv")
# write_csv(data, "compas-scores-two-years.csv")
data <- read_csv("compas-scores-two-years.csv")
nrow(data)
## [1] 7214
9.10.1 Simple sampling
To start, sampling from a dataset in R is quite easy. In case you ever need to choose a random subset of observations, there are convenient dplyr functions. Importantly, make sure that the data contains a unique identifier for your units, as it allows you to check afterwards which units/individuals your samples are made up of. Here the unique identifier is called id (we use it below to inspect the samples).
# Take a random sample of size 1000
data1 <- sample_n(data, size = 1000)
nrow(data1)
## [1] 1000
# Take a random sample of 50% of the observations
data2 <- sample_frac(data, size = 0.5)
nrow(data2)
## [1] 3607
# Q: What can we use the argument "replace = FALSE" for?
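Because the data contains the unique identifier id, we can check afterwards which individuals ended up in a given sample. Below is a minimal sketch that simply compares the id columns of the samples drawn above (it assumes the id column is present, as mentioned earlier):
# Which individuals (ids) made it into the sample of 1000?
head(data1$id)
# How many individuals appear in both random samples?
length(intersect(data1$id, data2$id))
# How many individuals were NOT drawn into data1?
nrow(anti_join(data, data1, by = "id"))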
9.10.2 Validation set approach
We used the validation set approach in Lab: Predicting recidivism (Classification) but repeat it here for completeness. First, we assign a random number to each observation in our dataset. Then we split the dataset into training data and test data by taking the subsets of units that lie below/above a cutoff on this vector of random numbers:
# Add random vector
data <- data %>% mutate(random = sample(1:nrow(data)))
# Q: What happens here?
# First half of observations as training data
data_train <- data %>% filter(random <= nrow(data)/2)
# Second half of observations as test data
data_test <- data %>% filter(random > nrow(data)/2)
Subsequently, we would use data_train to calculate the training error and data_test to calculate the test error, as described in Lab: Predicting recidivism (Classification). Below we provide a quick summary of the code:
# TRAINING DATA
# Fit the model using training data
fit <- glm(is_recid ~ age + priors_count,
           family = binomial,
           data = data_train)

# Add predicted probabilities to training data and convert to classifier
data_train <- data_train %>%
  mutate(probability = predict(fit, type = "response"),
         classified = if_else(probability >= 0.5, 1, 0))
# Calculate training errors: number and rate
sum(data_train$is_recid!=data_train$classified)
## [1] 1134
mean(data_train$is_recid!=data_train$classified)
## [1] 0.3143887
# TEST DATA
data_test <- data_test %>%
  mutate(probability = predict(fit, data_test, type = "response"),
         classified = if_else(probability >= 0.5, 1, 0))
# Calculate test errors: number and rate
sum(data_test$is_recid!=data_test$classified)
## [1] 1191
mean(data_test$is_recid!=data_test$classified)
## [1] 0.3301913
Importantly, if we repeated the steps above, we would obtain different estimates (because the splits would differ each time).
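To see this variability directly, we can wrap the split and the error calculation in a loop and repeat it a few times. The code below is a minimal sketch (object names such as test_errors and fit_tmp are just for illustration) that re-splits the data five times with replicate() and returns the five test error rates:
# Repeat the validation set approach five times
test_errors <- replicate(5, {
  # Draw a new random split
  shuffled <- data %>% mutate(random = sample(1:nrow(data)))
  train <- shuffled %>% filter(random <= nrow(data)/2)
  test <- shuffled %>% filter(random > nrow(data)/2)
  # Fit on the training half and classify the test half
  fit_tmp <- glm(is_recid ~ age + priors_count, family = binomial, data = train)
  classified <- if_else(predict(fit_tmp, test, type = "response") >= 0.5, 1, 0)
  # Test error rate for this split
  mean(test$is_recid != classified)
})
test_errors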
9.10.3 Leave-one-out cross-validation (LOOCV)
LOOCV for classification problems is described in James et al. (2013, Ch. 5.1.5), with a lab for linear models in Ch. 5.3. Here we focus on a logistic regression model.
The LOOCV estimate can be automatically computed for any generalized linear model using the glm() and cv.glm() functions (the latter is part of the boot library). Importantly, this time we start by fitting the model on the complete dataset data. Then we use cv.glm() for the cross-validation. Remember that it takes time (it is essentially a loop running through all \(n\) leave-one-out splits)!
library(boot)
# We drop other variables because cv.glm struggles with variables
# containing missings
data <- data %>% select(is_recid, age, priors_count)
# Fit the model using the original data
fit <- glm(is_recid ~ age + priors_count,
           family = binomial,
           data = data)
# Use cv.glm for cross-validation
cv.err <- cv.glm(data, fit)
cv.err$delta
## [1] 0.2131228 0.2131227
The cv.glm() function produces a list with several components. The two numbers in the delta vector contain the cross-validation results: the first is the raw leave-one-out estimate and the second a bias-corrected version (see ?cv.glm). Here the first one is around 0.213.
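Note that, by default, cv.glm() measures the average squared difference between the observed outcome and the predicted probability, not the misclassification rate we computed with the validation set approach, which is why 0.213 is not directly comparable to the error rates above. If we want a cross-validated error rate instead, we can pass a cost function. The sketch below follows the example in ?cv.glm (the function name cost_error_rate is our own, and the 0.5 threshold mirrors the classifier above); like before, it loops over every observation and therefore takes a while:
# Cost function: share of observations where the outcome and the
# predicted probability fall on different sides of the 0.5 cutoff
cost_error_rate <- function(observed, predicted_prob) {
  mean(abs(observed - predicted_prob) > 0.5)
}
cv.glm(data, fit, cost = cost_error_rate)$delta[1]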
9.10.4 k-fold cross-validation
cv.glm() can also be used to implement k-fold CV. Below we use \(k = 5\) (you could also use \(k = 10\)) on our original dataset. Since the folds are assigned at random, the seed we set at the start of the lab keeps the results reproducible.
fit <- glm(is_recid ~ age + priors_count,
           family = binomial,
           data = data)

# K = number of folds; delta[1] is the non-corrected error
cv.err <- cv.glm(data, fit, K = 5)$delta[1]
cv.err
## [1] 0.2129951
# We could also pick a higher number for K
# But the results are very similar:
cv.err <- cv.glm(data, fit, K = 10)$delta[1]
cv.err
## [1] 0.2130808
You may notice that k-fold cross-validation is far quicker to compute than LOOCV. The reason is that we split the data fewer times and therefore estimate far fewer models (5 or 10 instead of one per observation).
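To make explicit what cv.glm() does with K folds, here is a manual sketch of 5-fold cross-validation (the objects folds and fold_errors are just for illustration). It assigns each observation to one of five folds, fits the model five times while leaving one fold out, and reports the misclassification rate on the held-out fold rather than the squared error that cv.glm() uses by default:
# Manual 5-fold cross-validation
k <- 5
# Randomly assign each observation to one of the k folds
folds <- sample(rep(1:k, length.out = nrow(data)))
fold_errors <- sapply(1:k, function(i) {
  # Fold i is held out; the remaining folds are used for fitting
  train <- data[folds != i, ]
  test <- data[folds == i, ]
  fit_k <- glm(is_recid ~ age + priors_count, family = binomial, data = train)
  classified <- if_else(predict(fit_k, test, type = "response") >= 0.5, 1, 0)
  mean(test$is_recid != classified)
})
# Average error rate across the five folds
mean(fold_errors)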
9.10.5 Comparing models
In Lab: Predicting recidivism (Classification) we compared test error rates for two different models (using the validation set approach). Below we do the same using k-fold cross-validation.
# We create separate datasets so that no variables with missing values are included
data1 <- data %>% select(is_recid, age, priors_count)
data2 <- data %>% select(is_recid, age)
# Two models (one with and one without priors_count)
fit1 <- glm(is_recid ~ age + priors_count,
            family = binomial,
            data = data1)
fit2 <- glm(is_recid ~ age,
            family = binomial,
            data = data2)
# k-fold cross-validation for the two models
cv.glm(data1, fit1, K = 5)$delta[1]
## [1] 0.2130422
cv.glm(data2, fit2, K = 5)$delta[1]
## [1] 0.2400051
The cross-validation error for the second model (without priors_count) is clearly higher than for the first, so including priors_count improves predictive performance.
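Because the folds are assigned at random, both numbers change slightly every time we rerun cv.glm(). For a more stable comparison we can repeat the k-fold cross-validation several times per model and compare the averages; the two lines below are a small sketch (ten repetitions each, again using the default squared-error measure):
# Repeat 5-fold cross-validation ten times per model and average the estimates
mean(replicate(10, cv.glm(data1, fit1, K = 5)$delta[1]))
mean(replicate(10, cv.glm(data2, fit2, K = 5)$delta[1]))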