## 9.10 Lab: Resampling & cross-validation

In this lab we will use our example from above – predicting recidivism – to illustrate how different cross-validation methods can be used to estimate the test error rate.

We first import the data into R:

```
library(tidyverse)
set.seed(0)
# data <- read.csv("https://github.com/propublica/compas-analysis/raw/master/compas-scores-two-years.csv")
# write_csv(data, "compas-scores-two-years.csv")
data <- read_csv("compas-scores-two-years.csv")
nrow(data)
```

`## [1] 7214`

### 9.10.1 Simple sampling

To start, sampling from a dataset in R is quite easy: dplyr provides convenient functions for drawing random subsets of observations. Importantly, make sure the data contains a unique identifier for your units, as this lets you check afterwards which units/individuals your samples are made up of. Here the unique identifier is called `id`.

```
# Take a random sample of size 1000
data1 <- sample_n(data, size = 1000)
nrow(data1)
```

`## [1] 1000`

```
# Take a random sample of 50% of the observations
data2 <- sample_frac(data, size = 0.5)
nrow(data2)
```

`## [1] 3607`

`# Q: What can we use the argument "replace = FALSE" for?`
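
One way to explore this question is a small sketch that uses the `id` column mentioned above to inspect which individuals ended up in a sample:

```
# The unique identifier lets us check which individuals a sample contains
head(data1$id)

# With the default replace = FALSE each unit is drawn at most once,
# so all 1000 sampled ids are distinct
n_distinct(data1$id)

# With replace = TRUE the same unit can be drawn multiple times
data3 <- sample_n(data, size = 1000, replace = TRUE)
n_distinct(data3$id) # typically fewer than 1000 distinct ids
```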

### 9.10.2 Validation set approach

We used the *validation set approach* in Lab: Predicting recidivism (Classification) but repeat it here for completeness. First, we assign a random number to each observation in our dataset. Then we split the dataset into training data and test data by taking subsets of units that lie below/above a cutoff on this vector of random numbers:

```
# Add random vector
data <- data %>% mutate(random = sample(1:nrow(data)))
# Q: What happens here?

# First half of observations as training data
data_train <- data %>% filter(random <= nrow(data)/2)

# Second half of observations as test data
data_test <- data %>% filter(random > nrow(data)/2)
```

Subsequently, we would use `data_train` to calculate the training error and `data_test` to calculate the test error, as described in Lab: Predicting recidivism (Classification). Below we provide a quick summary of the code:

```
# TRAINING DATA
# Fit the model using training data
fit <- glm(is_recid ~ age + priors_count,
           family = binomial,
           data = data_train)

# Add predicted probabilities to training data and convert to classifier
data_train <- data_train %>%
  mutate(probability = predict(fit, type = "response"),
         classified = if_else(probability >= 0.5, 1, 0))

# Calculate training errors: number and rate
sum(data_train$is_recid != data_train$classified)
```

`## [1] 1134`

`mean(data_train$is_recid != data_train$classified)`

`## [1] 0.3143887`

```
# TEST DATA
data_test <- data_test %>%
  mutate(probability = predict(fit, data_test, type = "response"),
         classified = if_else(probability >= 0.5, 1, 0))

# Calculate test errors: number and rate
sum(data_test$is_recid != data_test$classified)
```

`## [1] 1191`

`mean(data_test$is_recid != data_test$classified)`

`## [1] 0.3301913`

Importantly, if we repeated the steps above we would obtain different estimates, because each repetition produces a different random split.
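
To see this variability directly, here is a minimal sketch (assuming the same model specification as above) that repeats the split-and-evaluate procedure five times and collects the resulting test error rates:

```
# Repeat the validation set approach five times, storing the test error rates
test_errors <- replicate(5, {
  shuffled <- data %>% mutate(random = sample(1:nrow(data)))
  train <- shuffled %>% filter(random <= nrow(data)/2)
  test <- shuffled %>% filter(random > nrow(data)/2)
  fit <- glm(is_recid ~ age + priors_count, family = binomial, data = train)
  classified <- if_else(predict(fit, test, type = "response") >= 0.5, 1, 0)
  mean(test$is_recid != classified)
})
test_errors # five (slightly) different estimates of the test error rate
```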

### 9.10.3 Leave-one-out cross-validation (LOOCV)

LOOCV for classification problems is described in James et al. (2013, Ch. 5.1.5) with a lab for linear models described in Ch. 5.3. Here we focus on a logistic regression model.

The LOOCV estimate can be computed automatically for any generalized linear model using the `glm()` and `cv.glm()` functions (the latter is part of the `boot` library). Importantly, this time we start by fitting the model on the complete dataset `data`. Then we use `cv.glm()` for the cross-validation. Remember that it takes time (it's like a loop running through all possible splits)!

```
library(boot)

# We drop other variables because cv.glm struggles with variables
# containing missings
data <- data %>% select(is_recid, age, priors_count)

# Fit the model using the original data
fit <- glm(is_recid ~ age + priors_count,
           family = binomial,
           data = data)

# Use cv.glm for cross-validation
cv.err <- cv.glm(data, fit)
cv.err$delta
```

`## [1] 0.2131228 0.2131227`

The `cv.glm()` function produces a list with several components. The two numbers in the `delta` vector contain the cross-validation results: the first is the raw leave-one-out estimate, the second a bias-corrected version (see `?cv.glm`). The first one is around 0.213.
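
Note that by default `cv.glm()` uses the average squared error between the observed outcomes and the predicted probabilities as its cost function, so the 0.213 above is not directly comparable to the misclassification rates we computed earlier. If you want a cross-validated error *rate* instead, you can pass your own cost function; a sketch along the lines of the example in `?cv.glm`:

```
# Cost function for the misclassification rate: r = observed response,
# pi = predicted probability; count an error when they differ by more than 0.5
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

# LOOCV with this cost function (slow: one model fit per observation!)
# cv.err.class <- cv.glm(data, fit, cost)
# cv.err.class$delta[1]
```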

### 9.10.4 k-Fold Cross-Validation

`cv.glm()` can also be used to implement k-fold CV. Below we use \(k = 5\) (you could also use \(k = 10\)) on our original dataset and store the result in `cv.err`. The random seed set at the start of the lab keeps the fold assignment reproducible.

```
fit <- glm(is_recid ~ age + priors_count,
           family = binomial,
           data = data)

# K = number of folds; delta[1] is the raw (non-corrected) error
cv.err <- cv.glm(data, fit, K = 5)$delta[1]
cv.err
```

`## [1] 0.2129951`

```
# We could also pick a higher number for K
# But the results are very similar:
cv.err <- cv.glm(data, fit, K = 10)$delta[1]
cv.err
```

`## [1] 0.2130808`

You may notice that k-fold cross-validation is far quicker to compute than LOOCV. The reason is that we split the data fewer times and therefore estimate far fewer models: only \(K\) of them, rather than one per observation.
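
If you want to check this yourself, a quick sketch using base R's `system.time()` to time the same calls as above:

```
# Time both approaches: LOOCV fits nrow(data) models, 5-fold CV only five
system.time(cv.glm(data, fit))        # LOOCV: slow
system.time(cv.glm(data, fit, K = 5)) # 5-fold CV: much faster
```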

### 9.10.5 Comparing models

In Lab: Predicting recidivism (Classification) we compared test error rates for two different models (using the validation set approach). Below we do the same using k-fold cross-validation.

```
# We only create new datasets to avoid having variables with missings in there
data1 <- data %>% select(is_recid, age, priors_count)
data2 <- data %>% select(is_recid, age)

# Two models (one with and one without priors_count)
fit1 <- glm(is_recid ~ age + priors_count,
            family = binomial,
            data = data1)
fit2 <- glm(is_recid ~ age,
            family = binomial,
            data = data2)

# k-fold cross-validation for the two models
cv.glm(data1, fit1, K = 5)$delta[1]
```

`## [1] 0.2130422`

`cv.glm(data2, fit2, K = 5)$delta[1]`

`## [1] 0.2400051`

The cross-validated error for the second model (without `priors_count`) is higher than for the first, suggesting that the number of prior offenses adds predictive value beyond age alone.

### References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. *An Introduction to Statistical Learning: With Applications in R*. Springer Texts in Statistics. New York: Springer.