6.29 Lab: Resampling & cross-validation
In this lab we will use our example from above (predicting recidivism) to illustrate how different cross-validation methods can be used to estimate the test error rate.
We first import the data into R:
library(tidyverse)
set.seed(0)
# data <- read.csv("https://github.com/propublica/compas-analysis/raw/master/compas-scores-two-years.csv")
# write_csv(data, "compas-scores-two-years.csv")
data <- read_csv("compas-scores-two-years.csv")
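Before moving on, it can be useful to check how many observations and variables were imported; this quick sanity check is not part of the original lab code:
# Number of observations (rows) and variables (columns) in the dataset
dim(data)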
6.29.1 Simple sampling
To start, sampling from a dataset in R is quite easy: dplyr provides convenient functions for drawing a random subset of observations. Importantly, make sure that the data contain a unique identifier for your units, as this allows you to check afterwards which units/individuals your samples are made up of. Here the unique identifier is called id.
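The code chunk that produced the two counts below did not survive into this version of the lab. A minimal sketch using dplyr's slice_sample(), which draws either a fixed number or a fixed share of rows, would yield counts of this form (the 3607 assumes the full COMPAS file with 7214 rows):
# Draw 1000 observations at random and count them
data %>% slice_sample(n = 1000) %>% nrow()
# Draw a random half of the observations and count them
data %>% slice_sample(prop = 0.5) %>% nrow()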
## [1] 1000
## [1] 3607
6.29.2 Validation set approach
We used the validation set approach in Lab: Predicting recidivism (Classification) but repeat it here for completeness. First, we assign a random number to each observation in our dataset. Then we split the dataset into training data and test data by taking subsets of units that lie below/above a cutoff on this vector of random numbers:
# Add random vector
data <- data %>% mutate(random = sample(1:nrow(data)))
# Q: What happens here?
# First half of observations as training data
data.train <- data %>% filter(random <= nrow(data)/2)
# Second half of observations as test data
data.test <- data %>% filter(random > nrow(data)/2)
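One way to answer the question in the comment above, and to make use of the id variable as suggested earlier, is to check that the two halves jointly cover the whole dataset and do not overlap. This small sanity check is a sketch and not part of the original lab:
# The two halves together should contain every observation exactly once
nrow(data.train) + nrow(data.test) == nrow(data)
# No individual (identified by id) should appear in both halves
length(intersect(data.train$id, data.test$id))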
Subsequently, we would use data.train to calculate the training error and data.test to calculate the test error, as described in Lab: Predicting recidivism (Classification). Below we provide a quick summary of the code:
# TRAINING DATA
# Fit the model using training data
fit <- glm(is_recid ~ age + priors_count,
family = binomial,
data = data.train)
# Add predicted probabilities to training data and convert to classifier
data.train <- data.train %>%
mutate(probs = predict(fit, type="response"),
classified = if_else(probs >= 0.5, 1, 0))
# Calculate training errors: number and rate
sum(data.train$is_recid != data.train$classified)
mean(data.train$is_recid != data.train$classified)
## [1] 1197
## [1] 0.3318547
# TEST DATA
data.test <- data.test %>%
mutate(probs = predict(fit, data.test, type = "response"),
classified = if_else(probs >= 0.5, 1, 0))
# Calculate test errors: number and rate
sum(data.test$is_recid != data.test$classified)
mean(data.test$is_recid != data.test$classified)
## [1] 1139
## [1] 0.3157749
Importantly, if we repeated the steps above with a new random split, we would obtain different error estimates (because the training and test sets would differ).
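To see this variability directly, one could wrap the split-fit-predict steps in a short loop and compare the resulting test error rates. The sketch below uses purrr's map_dbl() and is not part of the original lab:
# Repeat the validation set approach five times with different random splits
val_errors <- map_dbl(1:5, function(i) {
  shuffled <- data %>% mutate(random = sample(1:nrow(data)))
  train <- shuffled %>% filter(random <= nrow(data)/2)
  test <- shuffled %>% filter(random > nrow(data)/2)
  fit_i <- glm(is_recid ~ age + priors_count, family = binomial, data = train)
  probs <- predict(fit_i, test, type = "response")
  mean(test$is_recid != if_else(probs >= 0.5, 1, 0))
})
val_errors # five (slightly) different test error estimates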
6.29.3 Leave-one-out cross-validation (LOOCV)
LOOCV for classification problems is described in James et al. (2013, Ch. 5.1.5), with a corresponding lab for linear models in Ch. 5.3. Here we focus on a logistic regression model.
The LOOCV estimate can be computed automatically for any generalized linear model using the glm() and cv.glm() functions (the latter is part of the boot package). Importantly, this time we start by fitting the model on the complete dataset data. Then we use cv.glm() for the cross-validation. Remember that it takes time (it is essentially a loop running through all leave-one-out splits)!
library(boot)
# We drop the other variables because cv.glm struggles with variables
# that contain missing values
data <- data %>% select(is_recid, age, priors_count)
# Fit the model using the original data
fit <- glm(is_recid ~ age + priors_count,
family = binomial,
data = data)
# Use cv.glm for cross-validation
cv.err <- cv.glm(data, fit)
cv.err$delta
## [1] 0.2131228 0.2131227
The cv.glm() function produces a list with several components. The two numbers in the delta vector contain the cross-validation results: the first is the raw leave-one-out estimate and the second is a bias-corrected version. With the default cost function this is the average squared prediction error rather than the misclassification rate, and here it is around 0.213.
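The reason is that cv.glm() uses a squared-error cost by default. If you want a cross-validated misclassification rate instead, the boot documentation shows how to pass a custom cost function; the sketch below follows that example (and, like the call above, loops through all leave-one-out splits, so it is slow):
# Share of observations whose predicted probability falls on the wrong
# side of the 0.5 cutoff, i.e. the misclassification rate
cost <- function(actual, predicted_prob) mean(abs(actual - predicted_prob) > 0.5)
cv.glm(data, fit, cost = cost)$delta[1]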
6.29.4 k-Fold Cross-Validation
cv.glm() can also be used to implement k-fold CV. Below we use \(k = 5\) (you could also use \(k = 10\)) on our dataset. The random seed set at the start of the lab makes the assignment of observations to folds reproducible.
fit <- glm(is_recid ~ age + priors_count,
family = binomial,
data = data)
# K = number of folds; the first delta value is the raw (non-corrected) error
cv.err <- cv.glm(data, fit, K = 5)$delta[1]
cv.err
## [1] 0.2129951
# We could also pick a higher number for K
# But the results are very similar:
cv.err <- cv.glm(data, fit, K = 10)$delta[1]
cv.err
## [1] 0.2130808
You may notice that k-fold cross-validation is far quicker to compute. The reason is that we split the data fewer times and therefore estimate far fewer models (k of them instead of one per observation).
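If you want to verify this on your own machine, system.time() gives a rough comparison; exact timings will of course differ (the LOOCV line is commented out because it refits the model once per observation):
# k-fold CV only fits K models
system.time(cv.glm(data, fit, K = 5))
# LOOCV fits one model per observation and takes much longer
# system.time(cv.glm(data, fit))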
6.29.5 Comparing models
In Lab: Predicting recidivism (Classification) we compared test error rates for two different models (using the validation set approach). Below we do the same using k-fold cross-validation.
# We create new datasets that only contain the model variables,
# so that no variables with missing values are included
data1 <- data %>% select(is_recid, age, priors_count)
data2 <- data %>% select(is_recid, age)
# Two models (one with and one without priors_count)
fit1 <- glm(is_recid ~ age + priors_count,
family = binomial,
data = data1)
fit2 <- glm(is_recid ~ age,
family = binomial,
data = data2)
# k-fold cross-validation for the two models
cv.glm(data1, fit1, K = 5)$delta[1]
## [1] 0.2130422
cv.glm(data2, fit2, K = 5)$delta[1]
## [1] 0.2400051
The cross-validation error for the second model (without priors_count) is clearly higher than for the first, so we would prefer the model that includes priors_count.
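If you wanted to compare more than two specifications, you could loop over a list of formulas instead of fitting each model by hand. The sketch below is not part of the original lab and reuses data1, which contains all variables used by either model:
# Compare several candidate models with 5-fold cross-validation
formulas <- list(
  with_priors = is_recid ~ age + priors_count,
  without_priors = is_recid ~ age
)
map_dbl(formulas, function(f) {
  m <- glm(f, family = binomial, data = data1)
  cv.glm(data1, m, K = 5)$delta[1]
})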
References
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer.