## 9.10 Lab: Resampling & cross-validation

In this lab we will use our example from above – predicting recidivism – to illustrate how different cross-validation methods can be used to estimate the test error rate.

We first import the data into R:

```
library(tidyverse)
set.seed(0)
# data <- read.csv("https://github.com/propublica/compas-analysis/raw/master/compas-scores-two-years.csv")
# write_csv(data, "compas-scores-two-years.csv")
data <- read_csv("compas-scores-two-years.csv")
nrow(data)
```

`## [1] 7214`

### 9.10.1 Simple sampling

To start, sampling from a dataset in R is quite easy: dplyr provides convenient functions for drawing random subsets of observations. Importantly, make sure the data contains a unique identifier for your units, as this lets you check afterwards which units/individuals your samples are made up of. Here the unique identifier is called `id`.

```
# Take a random sample of size 1000
data1 <- sample_n(data, size = 1000)
nrow(data1)
```

`## [1] 1000`

```
# Take a random sample of 50% of the observations
data2 <- sample_frac(data, size = 0.5)
nrow(data2)
```

`## [1] 3607`

`# Q: What can we use the argument "replace = FALSE" for?`
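
One way to explore this question is a small sketch that uses the `id` column mentioned above to inspect which individuals ended up in a sample:

```
# The unique identifier lets us check which individuals a sample contains
head(data1$id)

# With the default replace = FALSE each unit is drawn at most once,
# so all 1000 sampled ids are distinct
n_distinct(data1$id)

# With replace = TRUE the same unit can be drawn multiple times
data3 <- sample_n(data, size = 1000, replace = TRUE)
n_distinct(data3$id) # typically fewer than 1000 distinct ids
```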

### 9.10.2 Validation set approach

We used the *validation set approach* in Lab: Predicting recidivism (Classification) but repeat it here for completeness. First, we assign a random number to each observation in our dataset. Then we split the dataset into training data and test data by taking subsets of units that lie below/above a cutoff on this vector of random numbers:

```
# Add random vector
data <- data %>% mutate(random = sample(1:nrow(data)))
# Q: What happens here?

# First half of observations as training data
data_train <- data %>% filter(random <= nrow(data)/2)

# Second half of observations as test data
data_test <- data %>% filter(random > nrow(data)/2)
```

Subsequently, we would use `data_train` to calculate the training error and `data_test` to calculate the test error, as described in Lab: Predicting recidivism (Classification). Below we provide a quick summary of the code:

```
# TRAINING DATA
# Fit the model using training data
fit <- glm(is_recid ~ age + priors_count,
           family = binomial,
           data = data_train)

# Add predicted probabilities to training data and convert to classifier
data_train <- data_train %>%
  mutate(probability = predict(fit, type = "response"),
         classified = if_else(probability >= 0.5, 1, 0))

# Calculate training errors: number and rate
sum(data_train$is_recid != data_train$classified)
```

`## [1] 1134`

`mean(data_train$is_recid != data_train$classified)`

`## [1] 0.3143887`

```
# TEST DATA
data_test <- data_test %>%
  mutate(probability = predict(fit, data_test, type = "response"),
         classified = if_else(probability >= 0.5, 1, 0))

# Calculate test errors: number and rate
sum(data_test$is_recid != data_test$classified)
```

`## [1] 1191`

`mean(data_test$is_recid != data_test$classified)`

`## [1] 0.3301913`

Importantly, if we repeated the steps above we would obtain different estimates, because each repetition produces a different random split.
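
To see this variability directly, here is a minimal sketch (assuming the same model specification as above) that repeats the split-and-evaluate procedure five times and collects the resulting test error rates:

```
# Repeat the validation set approach five times, storing the test error rates
test_errors <- replicate(5, {
  shuffled <- data %>% mutate(random = sample(1:nrow(data)))
  train <- shuffled %>% filter(random <= nrow(data)/2)
  test <- shuffled %>% filter(random > nrow(data)/2)
  fit <- glm(is_recid ~ age + priors_count, family = binomial, data = train)
  classified <- if_else(predict(fit, test, type = "response") >= 0.5, 1, 0)
  mean(test$is_recid != classified)
})
test_errors # five (slightly) different estimates of the test error rate
```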

### 9.10.3 Leave-one-out cross-validation (LOOCV)

LOOCV for classification problems is described in James et al. (2013, Ch. 5.1.5) with a lab for linear models described in Ch. 5.3. Here we focus on a logistic regression model.

The LOOCV estimate can be computed automatically for any generalized linear model using the `glm()` and `cv.glm()` functions (the latter is part of the `boot` library). Importantly, this time we start by fitting the model on the complete dataset `data`. Then we use `cv.glm()` for the cross-validation. Remember that it takes time (it's like a loop running through all possible splits)!

```
library(boot)

# We drop other variables because cv.glm struggles with variables
# containing missings
data <- data %>% select(is_recid, age, priors_count)

# Fit the model using the original data
fit <- glm(is_recid ~ age + priors_count,
           family = binomial,
           data = data)

# Use cv.glm for cross-validation
cv.err <- cv.glm(data, fit)
cv.err$delta
```

`## [1] 0.2131228 0.2131227`

The `cv.glm()` function produces a list with several components. The two numbers in the `delta` vector contain the cross-validation results: the first is the raw leave-one-out estimate, the second a bias-corrected version (see `?cv.glm`). The first one is around 0.213.
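
Note that by default `cv.glm()` uses the average squared error between the observed outcomes and the predicted probabilities as its cost function, so the 0.213 above is not directly comparable to the misclassification rates we computed earlier. If you want a cross-validated error *rate* instead, you can pass your own cost function; a sketch along the lines of the example in `?cv.glm`:

```
# Cost function for the misclassification rate: r = observed response,
# pi = predicted probability; count an error when they differ by more than 0.5
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

# LOOCV with this cost function (slow: one model fit per observation!)
# cv.err.class <- cv.glm(data, fit, cost)
# cv.err.class$delta[1]
```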

### 9.10.4 k-Fold Cross-Validation

`cv.glm()` can also be used to implement k-fold CV. Below we use \(k = 5\) (you could also use \(k = 10\)) on our original dataset and store the result in `cv.err`. The random seed set at the start of the lab keeps the fold assignment reproducible.

```
fit <- glm(is_recid ~ age + priors_count,
           family = binomial,
           data = data)

# K = number of folds; delta[1] is the raw (non-corrected) error
cv.err <- cv.glm(data, fit, K = 5)$delta[1]
cv.err
```

`## [1] 0.2129951`

```
# We could also pick a higher number for K
# But the results are very similar:
cv.err <- cv.glm(data, fit, K = 10)$delta[1]
cv.err
```

`## [1] 0.2130808`

You may notice that k-fold cross-validation is far quicker to compute than LOOCV. The reason is that we split the data fewer times and therefore estimate far fewer models: only \(K\) of them, rather than one per observation.
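
If you want to check this yourself, a quick sketch using base R's `system.time()` to time the same calls as above:

```
# Time both approaches: LOOCV fits nrow(data) models, 5-fold CV only five
system.time(cv.glm(data, fit))        # LOOCV: slow
system.time(cv.glm(data, fit, K = 5)) # 5-fold CV: much faster
```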

### 9.10.5 Comparing models

In Lab: Predicting recidivism (Classification) we compared test error rates for two different models (using the validation set approach). Below we do the same using k-fold cross-validation.

```
# We only create new datasets to avoid having variables with missings in there
data1 <- data %>% select(is_recid, age, priors_count)
data2 <- data %>% select(is_recid, age)

# Two models (one with and one without priors_count)
fit1 <- glm(is_recid ~ age + priors_count,
            family = binomial,
            data = data1)
fit2 <- glm(is_recid ~ age,
            family = binomial,
            data = data2)

# k-fold cross-validation for the two models
cv.glm(data1, fit1, K = 5)$delta[1]
```

`## [1] 0.2130422`

`cv.glm(data2, fit2, K = 5)$delta[1]`

`## [1] 0.2400051`

The cross-validated error for the second model (without `priors_count`) is higher than for the first, suggesting that the number of prior offenses adds predictive value beyond age alone.

### References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. *An Introduction to Statistical Learning: With Applications in R*. Springer Texts in Statistics. New York: Springer.