In this lab we will use the our example from above – predicting recidivism to illustrate how we can use different cross-validation methods to estimate the test error rate.
We first import the data into R:
library(tidyverse) set.seed(0) # data <- read.csv("https://github.com/propublica/compas-analysis/raw/master/compas-scores-two-years.csv") # write_csv(data, "compas-scores-two-years.csv") <- read_csv("compas-scores-two-years.csv") data nrow(data)
##  7214
To start, sampling from a dataset in R is quite easy. In case you ever need to choose a random subset of observations there are convenient dplyr functions. Importantly, make sure that the data contains a unique identifier for your units as it allows you to check afterwards which units/individuals your samples are made up of. Here the unique identifier is called
# Take a random sample of size 1000 <- sample_n(data, size = 1000) data1 nrow(data1)
##  1000
# Take a random sample of size 50% <- sample_frac(data, size = 0.5) data2 nrow(data2)
##  3607
# Q: What can we use the argument "replace = FALSE" for?
We used the validation set approach in Lab: Predicting recidvism (Classification) but repeated it here for completeness. First, we assign a random number to each observation in our dataset. Then we split the dataset into training data and test data by taking subsets of units that lie below/above a cutoff on this vector of random numbers:
# Add random vector <- data %>% mutate(random = sample(1:nrow(data))) data # Q: What happens here? # First half of observations as training data <- data %>% filter(random <= nrow(data)/2) data_train # Second half of observations as test data <- data %>% filter(random > nrow(data)/2)data_test
Subsequently, we would use
data_train to calculate the training error and
data_test to calculate the test error as described in Lab: Predicting recidvism (Classification). Below we provide a quick summary of the code:
# TRAINING DATA # Fit the model using training data <- glm(is_recid ~ age + priors_count, fit family = binomial, data = data_train) # Add predicted probabilites to training data and convert to classifier <- data_train %>% data_train mutate(probability = predict(fit, type="response"), classified = if_else(probability >= 0.5, 1, 0)) # Calculate training errors: number and rate sum(data_train$is_recid!=data_train$classified)
##  1134
##  0.3143887
# TEST DATA <- data_test %>% data_test mutate(probability = predict(fit, data_test ,type="response"), classified = if_else(probability >= 0.5, 1, 0)) # Calculate training errors: number and rate sum(data_test$is_recid!=data_test$classified)
##  1191
##  0.3301913
Importantly, we could repeat the steps above and we would obtain different estimates (because the splits would differ).
LOOCV for classification problems is described in James et al. (2013, Ch. 5.1.5) with a lab for linear models described in Ch. 5.3. Here we focus on a logistic regression model.
The LOOCV estimate can be automatically computed for any generalized linear model using the
cv.glm() functions (part of the
boot library). Importantly, this time we start by fitting the model on the complete dataset
data. Then we use
cv.glm for the cross-validation. Remember that it takes time (it’s like a look running through all possible splits)!
library(boot) # We drop other variables because cv.glm struggles with variables # containing missings <- data %>% select(is_recid, age, priors_count) data # Fit the model using the original data <- glm(is_recid ~ age + priors_count, fit family = binomial, data = data) # Use cv.glm for cross-validation <- cv.glm(data, fit) cv.err $deltacv.err
##  0.2131228 0.2131227
cv.glm() function produces a list with several components. The two numbers in the delta vector contain the cross-validation results. The first number is the raw leave-one-out result and the second a bias-corrected version (see
?cv.glm). The first one is at around 0.213.
cv.glm() can also be used to implement k-fold CV. Below we use \(k = 5\) (you could also use \(k = 10\)) on our original dataset. At the start we set a random seed and create a vector to store results.
<- glm(is_recid ~ age + priors_count, fit family = binomial, data = data) <- cv.glm(data, fit, K = 5)$delta # the first one is the non-corrected error cv.err # K = number of folds cv.err
##  0.2129951
# We could also pick a higher number for K # But the results are very similar: <- cv.glm(data, fit, K = 10)$delta cv.err cv.err
##  0.2130808
You may notice that k-fold cross-validation computation time is far quicker. The reason is that we have to split the data fewer times/estimate fewer models.
In Lab: Predicting recidvism (Classification) we compared test error rates for two different models (using the validation set approach). Below we do the same using k-fold cross-validation.
# We only create new datasets to avoid having variables with missings in there <- data %>% select(is_recid, age, priors_count) data1 <- data %>% select(is_recid, age) data2 # Three models (one with and one without priors_count) <- glm(is_recid ~ age + priors_count, fit1 family = binomial, data = data1) <- glm(is_recid ~ age, fit2 family = binomial, data = data2) # k-fold cross-validation for the two models cv.glm(data1, fit1, K = 5)$delta
##  0.2130422
cv.glm(data2, fit2, K = 5)$delta
##  0.2400051
The error for the second model is higher than for the first.