5 Resampling Methods
Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ. Such an approach may allow us to obtain information that would not be available from fitting the model only once using the original training sample.
In this chapter, we discuss two of the most commonly used resampling methods, cross-validation and the bootstrap.
Cross-validation is most often used to estimate the test error associated with a statistical learning method, whereas the bootstrap is most commonly used to provide a measure of accuracy of a parameter estimate or of a given statistical learning method.
The process of evaluating a model’s performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection.
5.1 Cross-Validation
Sometimes we want to estimate the test error rate using the available training data. A number of approaches can be used for this. In this section we consider methods which involve holding out a subset of the training data from the fitting process, then applying the model to that hold-out set for model assessment.
5.1.1 The Validation Set Approach
This simple strategy involves randomly dividing available observations into training and validation sets. The model is fit on the training set, and used to make predictions on the validation set. The corresponding metric from the validation set predictions – usually MSE in the case of a quantitative response – provides an estimate of the test error rate.
To illustrate this, load the Auto data set and R packages:
auto <- ISLR2::Auto
library(tidyverse)
library(tidymodels)
library(broom)
library(gt)
library(patchwork)
library(tictoc)
# Load my R package and set the ggplot theme
library(dunnr)
extrafont::loadfonts(device = "win", quiet = TRUE)
theme_set(theme_td())
set_geom_fonts()
set_palette()
Randomly split the data into 50% training and 50% validation, fit on the training set, and compute the MSE on the validation set.
Since I’ll be repeating this 10 times to reproduce the figure, make a couple functions.
First, a function to set a random seed and split the data into training and validation set (called assessment
set in the rsample
package):
auto_validation_split <- function(seed) {
  set.seed(seed)
  validation_split(auto, prop = 0.5)
}
auto_splits <- auto_validation_split(seed = 10)
auto_splits
## # Validation Set Split (0.5/0.5)
## # A tibble: 1 × 2
## splits id
## <list> <chr>
## 1 <split [196/196]> validation
Similar to the initial_split
workflow, I can access the training and validation sets as follows:
training(auto_splits$splits[[1]]) %>% head()
## mpg cylinders displacement horsepower weight acceleration year origin
## 139 14.0 8 318 150 4457 13.5 74 1
## 333 29.8 4 89 62 1845 15.3 80 2
## 373 27.0 4 151 90 2735 18.0 82 1
## 73 15.0 8 304 150 3892 12.5 72 1
## 213 16.5 8 350 180 4380 12.1 76 1
## 348 37.0 4 85 65 1975 19.4 81 3
## name
## 139 dodge coronet custom (sw)
## 333 vokswagen rabbit
## 373 pontiac phoenix
## 73 amc matador (sw)
## 213 cadillac seville
## 348 datsun 210 mpg
assessment(auto_splits$splits[[1]]) %>% head()
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## 7 14 8 454 220 4354 9.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 5 ford torino
## 6 ford galaxie 500
## 7 chevrolet impala
Second, a function to fit 10 models (1 to 10 polynomial degrees of freedom) on the training set, and evaluate on the validation set with fit_resamples():1
auto_rec <- recipe(mpg ~ horsepower, data = auto)
lm_workflow <- workflow() %>% add_model(linear_reg())

evaluate_poly_fits <- function(auto_splits) {
  tibble(poly_df = 1:10) %>%
    mutate(
      lm_rec = map(
        poly_df,
        ~ auto_rec %>% step_poly(horsepower, degree = .x)
      ),
      lm_fit = map(
        lm_rec,
        ~ lm_workflow %>% add_recipe(.x) %>% fit_resamples(auto_splits)
      ),
      lm_metrics = map(lm_fit, collect_metrics)
    ) %>%
    unnest(lm_metrics) %>%
    filter(.metric == "rmse") %>%
    select(poly_df, rmse = mean)
}
auto_poly_fits_validation <- evaluate_poly_fits(auto_splits)
auto_poly_fits_validation
## # A tibble: 10 × 2
## poly_df rmse
## <int> <dbl>
## 1 1 5.14
## 2 2 4.46
## 3 3 4.50
## 4 4 4.49
## 5 5 4.39
## 6 6 4.36
## 7 7 4.33
## 8 8 4.33
## 9 9 4.32
## 10 10 4.60
Run it an additional 9 times:
auto_poly_fits_validation_10 <- bind_rows(
  # 9 additional sets of fits
  map_dfr(
    1:9,
    function(seed) {
      auto_splits <- auto_validation_split(seed)
      evaluate_poly_fits(auto_splits)
    },
    .id = "rep"
  ),
  auto_poly_fits_validation %>% mutate(rep = "10")
)
Now reproduce Figure 5.2:
# Use a 10 color palette from the MetBrewer package
pal <- MetBrewer::met.brewer("Veronese", 10, type = "continuous")

p2 <- auto_poly_fits_validation_10 %>%
  ggplot(aes(x = poly_df, y = rmse^2, fill = rep)) +
  geom_line(aes(color = rep), size = 1) +
  expand_limits(y = c(15, 30)) +
  scale_x_continuous(breaks = seq(2, 10, 2)) +
  scale_color_manual(values = pal) +
  scale_fill_manual(values = pal) +
  labs(x = "Degree of polynomial", y = NULL) +
  theme(legend.position = "none")
p1 <- p2 %+% filter(auto_poly_fits_validation_10, rep == "1") +
  geom_point(aes(fill = rep), shape = 21, color = "white", size = 4) +
  labs(y = "MSE")
p1 | p2
As is clear from the right-hand panel, this approach is highly variable: the estimated test error depends strongly on which observations end up in the training and validation sets. Another downside is that, because the model is fit on only about half of the observations, the validation set approach tends to overestimate the test error rate for a model fit on the entire data set.
5.1.2 Leave-One-Out Cross-Validation
Leave-one-out cross validation (LOOCV) attempts to address the shortcomings of the validation set approach. It still involves splitting the \(n\) observations into two parts, but it repeats it \(n\) times, with a single observation \((x_i, y_i)\) as the hold-out “set” and the remaining \(n-1\) observations as the training set. The MSE for each iteration is simply \(\text{MSE}_i = (y_i - \hat{y}_i)^2\). Then the LOOCV estimate of the MSE is the average over all observations:
\[ \text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \text{MSE}_i. \]
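To make this definition concrete, here is a minimal base R sketch (not part of the original workflow) of LOOCV for a simple linear fit of mpg on horsepower, written as an explicit loop over observations:
# LOOCV by hand: leave out observation i, fit, and record the squared error
loocv_mse <- map_dbl(
  1:nrow(auto),
  function(i) {
    fit_i <- lm(mpg ~ horsepower, data = auto[-i, ])
    (auto$mpg[i] - predict(fit_i, newdata = auto[i, ]))^2
  }
)
# The LOOCV estimate is the average of the n squared errors
mean(loocv_mse)
This should match the poly_df = 1 estimate from boot::cv.glm() further below (about 24.2).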
The LOOCV approach has a few advantages over the validation set approach:
- First, it has far less bias. In LOOCV, we repeatedly fit the statistical learning method using training sets that contain \(n − 1\) observations, almost as many as are in the entire data set. This is in contrast to the validation set approach, in which the training set is typically around half the size of the original data set. Consequently, the LOOCV approach tends not to overestimate the test error rate as much as the validation set approach does.
- Second, in contrast to the validation approach which will yield different results when applied repeatedly due to randomness in the training/validation set splits, performing LOOCV multiple times will always yield the same results: there is no randomness in the training/validation set splits.
Split the data by LOOCV:2
auto_splits <- loo_cv(auto)
auto_splits
## # Leave-one-out cross-validation
## # A tibble: 392 × 2
## splits id
## <list> <chr>
## 1 <split [391/1]> Resample1
## 2 <split [391/1]> Resample2
## 3 <split [391/1]> Resample3
## 4 <split [391/1]> Resample4
## 5 <split [391/1]> Resample5
## 6 <split [391/1]> Resample6
## 7 <split [391/1]> Resample7
## 8 <split [391/1]> Resample8
## 9 <split [391/1]> Resample9
## 10 <split [391/1]> Resample10
## # … with 382 more rows
## # ℹ Use `print(n = ...)` to see more rows
392 splits, which is the number of observations, as expected. Fit:
auto_poly_fits_loo_cv <- evaluate_poly_fits(auto_splits)
## Error in `mutate()`:
## ! Problem while computing `lm_fit = map(lm_rec, ~lm_workflow %>%
## add_recipe(.x) %>% fit_resamples(auto_splits))`.
## Caused by error in `check_rset()`:
## ! Leave-one-out cross-validation is not currently supported with tune.
As the error says, LOOCV is not supported. Here’s an explanation I found online:
Leave-one-out methods are deficient compared to almost any other method. For anything but pathologically small samples, LOO is computationally excessive and it may not have good statistical properties. Although rsample contains a
loo_cv()
function, these objects are not generally integrated into the broader tidymodels frameworks.
Fair enough.
For completeness, I will still get the LOOCV estimates using boot::cv.glm():
auto_poly_fits_loo_cv <- tibble(
  poly_df = 1:10,
  mse = map_dbl(
    poly_df,
    function(poly_df) {
      glm_fit <- glm(mpg ~ poly(horsepower, degree = poly_df), data = auto)
      boot::cv.glm(auto, glm_fit)$delta[1]
    }
  )
)
auto_poly_fits_loo_cv
## # A tibble: 10 × 2
## poly_df mse
## <int> <dbl>
## 1 1 24.2
## 2 2 19.2
## 3 3 19.3
## 4 4 19.4
## 5 5 19.0
## 6 6 19.0
## 7 7 18.8
## 8 8 19.0
## 9 9 19.1
## 10 10 19.5
5.1.3 \(k\)-fold Cross-Validation
\(k\)-fold CV involves randomly dividing the observations into \(k\) groups/folds of approximately equal size. The first fold is used as the validation/assessment set, and the remaining \(k-1\) folds used to fit the model. This is repeated \(k\) times, with each fold being used as the assessment set once. The \(k\)-fold CV estimate of the test error is then the average:
\[ \text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^k \text{MSE}_i. \]
This should look familiar:
It is not hard to see that LOOCV is a special case of \(k\)-fold CV in which \(k\) is set to equal \(n\). In practice, one typically performs \(k\)-fold CV using \(k\) = 5 or \(k\) = 10. What is the advantage of using \(k\) = 5 or \(k\) = 10 rather than \(k\) = \(n\)? The most obvious advantage is computational. LOOCV requires fitting the statistical learning method \(n\) times. This has the potential to be computationally expensive (except for linear models fit by least squares, in which case formula (5.2) can be used). But cross-validation is a very general approach that can be applied to almost any statistical learning method. Some statistical learning methods have computationally intensive fitting procedures, and so performing LOOCV may pose computational problems, especially if \(n\) is extremely large. In contrast, performing 10-fold CV requires fitting the learning procedure only ten times, which may be much more feasible. As we see in Section 5.1.4, there also can be other non-computational advantages to performing 5-fold or 10-fold CV, which involve the bias-variance trade-off.
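As an aside, formula (5.2) mentioned in that passage is the leverage-based shortcut
\[ \text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2, \]
where \(h_i\) is the leverage of observation \(i\). For a least squares fit, it gives the exact LOOCV error from a single model fit. A quick sketch of my own (not part of the lab) for the degree-1 model:
# Exact LOOCV MSE from one fit, via the leverage shortcut
lm_fit <- lm(mpg ~ horsepower, data = auto)
h <- hatvalues(lm_fit)
mean((residuals(lm_fit) / (1 - h))^2)
This should agree with the poly_df = 1 LOOCV estimate (about 24.2) computed with boot::cv.glm() above.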
Fit the polynomial models with 10-fold CV:
auto_10_fold_cv <- function(seed) {
  set.seed(seed)
  vfold_cv(auto, v = 10)
}
auto_splits <- auto_10_fold_cv(seed = 10)
auto_splits
## # 10-fold cross-validation
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [352/40]> Fold01
## 2 <split [352/40]> Fold02
## 3 <split [353/39]> Fold03
## 4 <split [353/39]> Fold04
## 5 <split [353/39]> Fold05
## 6 <split [353/39]> Fold06
## 7 <split [353/39]> Fold07
## 8 <split [353/39]> Fold08
## 9 <split [353/39]> Fold09
## 10 <split [353/39]> Fold10
auto_poly_fits_10_fold_cv <- evaluate_poly_fits(auto_splits)
auto_poly_fits_10_fold_cv
## # A tibble: 10 × 2
## poly_df rmse
## <int> <dbl>
## 1 1 4.90
## 2 2 4.35
## 3 3 4.35
## 4 4 4.37
## 5 5 4.31
## 6 6 4.31
## 7 7 4.28
## 8 8 4.29
## 9 9 4.30
## 10 10 4.37
Now repeat this another 9 times. Though it won't take too long to run, I'll make use of parallel to speed it up a bit:
n_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(n_cores - 1)
registerDoParallel(cl)
tic()
auto_poly_fits_10_fold_cv <- bind_rows(
  # 9 additional sets of fits
  map_dfr(
    1:9,
    function(seed) {
      auto_splits <- auto_10_fold_cv(seed)
      evaluate_poly_fits(auto_splits)
    },
    .id = "rep"
  ),
  auto_poly_fits_10_fold_cv %>% mutate(rep = "10")
)
toc()
## 93.12 sec elapsed
Now reproduce Figure 5.4:
p2 <- auto_poly_fits_10_fold_cv %>%
  mutate(mse = rmse^2) %>%
  ggplot(aes(x = poly_df, y = mse, fill = rep)) +
  geom_line(aes(color = rep), size = 1) +
  expand_limits(y = c(15, 30)) +
  scale_x_continuous(breaks = seq(2, 10, 2)) +
  scale_color_manual(values = pal) +
  scale_fill_manual(values = pal) +
  labs(x = "Degree of polynomial", y = NULL) +
  theme(legend.position = "none")
p1 <- p2 %+% mutate(auto_poly_fits_loo_cv, rep = "1") +
  geom_point(aes(fill = rep), shape = 21, color = "white", size = 4) +
  labs(y = "MSE")
p1 | p2
The \(k\)-fold CV approach (right panel) still has some variability due to random splitting, but much less than the validation set approach.
As a reminder, with cross-validation we are trying to approximate the true test MSE, which we cannot know for certain unless the data are simulated (like in Figure 5.6). The true error value itself is important if we want to know how a model will perform on independent data. However, if all we care about is parameter(s) that give the minimum error (like the degree of polynomial in these examples), then the CV estimate will usually come close to the true answer.
5.1.4 Bias-Variance Trade-Off for \(k\)-Fold Cross-Validation
In addition to being computationally more efficient than LOOCV, \(k\)-fold CV with \(k < n\) also often gives more accurate estimates of the test error rate due to the bias-variance tradeoff.
- \(k\)-fold CV has moderate bias in comparison to LOOCV, which is approximately unbiased.
- \(k\)-fold CV (average over \(k\) fitted models) tends to have lower variance than LOOCV (average over \(n\) fitted models, highly correlated with each other).
To summarize, there is a bias-variance trade-off associated with the choice of \(k\) in \(k\)-fold cross-validation. Typically, given these considerations, one performs \(k\)-fold cross-validation using \(k\) = 5 or \(k\) = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.
5.1.5 Cross-Validation on Classification Problems
Rather than quantitative \(Y\), cross-validation works just as well with qualitative \(Y\). Instead of MSE, we use the number of misclassified observations, \(\text{Err}_i = I(y_i \neq \hat{y}_i)\). The LOOCV error rate:
\[ \text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \text{Err}_i. \]
The \(k\)-fold CV and validation set error rates are defined analogously.
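As a quick illustration (not part of the text's lab), the same fit_resamples() workflow estimates a classification error rate, just with accuracy in place of RMSE. A sketch using the Default data, which also appears in the exercises below:
set.seed(1)
default_folds <- vfold_cv(ISLR2::Default, v = 10)
logistic_cv <- workflow() %>%
  add_model(logistic_reg()) %>%
  add_recipe(recipe(default ~ income + balance, data = ISLR2::Default)) %>%
  fit_resamples(default_folds)
# The 10-fold CV error rate is one minus the average accuracy across folds
collect_metrics(logistic_cv) %>%
  filter(.metric == "accuracy") %>%
  mutate(cv_error_rate = 1 - mean)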
5.2 The Bootstrap
The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method. As a simple example, the bootstrap can be used to estimate the standard errors of the coefficients from a linear regression fit. In the specific case of linear regression, this is not particularly useful, since we saw in Chapter 3 that standard statistical software such as R outputs such standard errors automatically. However, the power of the bootstrap lies in the fact that it can be easily applied to a wide range of statistical learning methods, including some for which a measure of variability is otherwise difficult to obtain and is not automatically output by statistical software.
The toy example in this section is about investment in two assets \(X\) and \(Y\). We wish to choose the fraction \(\alpha\) of the investment to put into \(X\) (and \(1 - \alpha\) into \(Y\)) that minimizes the total variance (risk) of the investment. It can be shown that the optimal value is given by:
\[ \alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}, \]
where \(\sigma_X^2 = \text{Var}(X)\), \(\sigma_Y^2 = \text{Var}(Y)\), and \(\sigma_{XY} = \text{Cov}(X, Y)\). In reality, we don’t know these variances and covariance, so we estimate them (e.g. \(\hat{\sigma}_X^2\)) using observations of \(X\) and \(Y\) to get an estimated \(\hat{\alpha}\).
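To fill in the "it can be shown" step: writing the portfolio variance as a function of \(\alpha\) and minimizing,
\[ \begin{align} \text{Var}(\alpha X + (1 - \alpha) Y) &= \alpha^2 \sigma_X^2 + (1 - \alpha)^2 \sigma_Y^2 + 2 \alpha (1 - \alpha) \sigma_{XY}, \\ \frac{d}{d\alpha} \text{Var}(\alpha X + (1 - \alpha) Y) &= 2 \alpha \sigma_X^2 - 2 (1 - \alpha) \sigma_Y^2 + 2 (1 - 2 \alpha) \sigma_{XY} = 0, \end{align} \]
and solving for \(\alpha\) gives the expression above.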
To illustrate, we simulate 100 pairs of \(X\) and \(Y\), and compute the estimated \(\alpha\).
The simulation parameters are \(\sigma_X^2 = 1\), \(\sigma_Y^2 = 1.25\), and \(\sigma_{XY} = 0.5\).
Presumably, these are bivariate normally distributed, which I can simulate with the mvtnorm
package:
library(mvtnorm)
sigma_x <- 1
sigma_y <- sqrt(1.25)
sigma_xy <- 0.5
# The variance-covariance matrix
sigma <- matrix(c(sigma_x^2, sigma_xy, sigma_xy, sigma_y^2), nrow = 2)
# Generate 5 observations
rmvnorm(n = 5, mean = c(0, 0), sigma = sigma)
## [,1] [,2]
## [1,] -0.6198186 0.2723180
## [2,] -0.4510171 -1.4568201
## [3,] -0.0302161 -1.3962636
## [4,] -0.3405990 -0.3004827
## [5,] -0.3552014 1.1948117
Write a function to compute \(\alpha\) from the simulated \(X\) and \(Y\):
sim_alpha <- function(xy = NULL) {
  if (is.null(xy)) {
    xy <- rmvnorm(n = 100, mean = c(0, 0), sigma = sigma)
  }
  x <- xy[, 1]; y <- xy[, 2]
  (var(y) - cov(x, y)) / (var(x) + var(y) - 2 * cov(x, y))
}
sim_alpha()
## [1] 0.577909
Run four simulations and plot \(X\) vs \(Y\) for Figure 5.9
set.seed(319)
d4 <- tibble(sim = 1:4) %>%
  rowwise() %>%
  mutate(xy = list(rmvnorm(n = 100, mean = c(0, 0), sigma = sigma))) %>%
  ungroup() %>%
  mutate(alpha = map_dbl(xy, sim_alpha) %>% round(3),
         x = map(xy, ~ .x[, 1]), y = map(xy, ~ .x[, 2])) %>%
  unnest(cols = c(x, y))
d4 %>%
  ggplot(aes(x, y)) +
  geom_point(color = td_colors$nice$emerald, size = 2) +
  facet_wrap(~alpha, labeller = "label_both") +
  add_facet_borders()
For Figure 5.10, I first simulate 1000 data sets from the true population and estimate \(\alpha\) for each:
d <- bind_rows(
  d4 %>% distinct(sim, alpha),
  tibble(sim = 5:1000) %>%
    rowwise() %>%
    mutate(alpha = sim_alpha()) %>%
    ungroup()
)
d
## # A tibble: 1,000 × 2
## sim alpha
## <int> <dbl>
## 1 1 0.555
## 2 2 0.622
## 3 3 0.615
## 4 4 0.453
## 5 5 0.588
## 6 6 0.605
## 7 7 0.483
## 8 8 0.598
## 9 9 0.602
## 10 10 0.633
## # … with 990 more rows
## # ℹ Use `print(n = ...)` to see more rows
And then, using just the first simulated data set, use the rsample::bootstraps() function to generate 1000 bootstrap resamples:
d1_boot <- d4 %>% filter(sim == 1) %>% select(x, y) %>%
  rsample::bootstraps(times = 1000)
d1_boot
## # Bootstrap sampling
## # A tibble: 1,000 × 2
## splits id
## <list> <chr>
## 1 <split [100/34]> Bootstrap0001
## 2 <split [100/37]> Bootstrap0002
## 3 <split [100/32]> Bootstrap0003
## 4 <split [100/35]> Bootstrap0004
## 5 <split [100/39]> Bootstrap0005
## 6 <split [100/38]> Bootstrap0006
## 7 <split [100/43]> Bootstrap0007
## 8 <split [100/39]> Bootstrap0008
## 9 <split [100/31]> Bootstrap0009
## 10 <split [100/44]> Bootstrap0010
## # … with 990 more rows
## # ℹ Use `print(n = ...)` to see more rows
Compute \(\hat{\alpha}\) from each split:
d1_boot_alpha <- map_dbl(
  d1_boot$splits,
  function(split) {
    xy <- as.data.frame(split)
    sim_alpha(as.matrix(xy))
  }
)
Now reproduce the figure:
d <- bind_rows(
  d %>% mutate(population = "true"),
  tibble(alpha = d1_boot_alpha, population = "bootstrap resamples")
) %>%
  mutate(population = fct_rev(population))
true_alpha <- (sigma_y^2 - sigma_xy) / (sigma_x^2 + sigma_y^2 - 2 * sigma_xy)
p1 <- d %>%
  ggplot(aes(x = alpha, fill = population)) +
  geom_histogram(binwidth = 0.05, show.legend = FALSE, color = "black") +
  geom_vline(xintercept = true_alpha,
             color = td_colors$nice$light_coral, size = 1.5) +
  facet_wrap(~ population, nrow = 1) +
  scale_fill_manual(values = c(td_colors$nice$soft_orange,
                               td_colors$nice$strong_blue))
p2 <- d %>%
  ggplot(aes(y = alpha, x = population)) +
  geom_boxplot(aes(fill = population), show.legend = FALSE) +
  geom_hline(yintercept = true_alpha,
             color = td_colors$nice$light_coral, size = 1.5) +
  scale_fill_manual(values = c(td_colors$nice$soft_orange,
                               td_colors$nice$strong_blue))
p1 | p2
Note that the histogram looks very similar to the left-hand panel, which displays the idealized histogram of the estimates of α obtained by generating 1,000 simulated data sets from the true population. In particular the bootstrap estimate \(\text{SE}(\hat{\alpha})\) from (5.8) is 0.087, very close to the estimate of 0.083 obtained using 1,000 simulated data sets. The right-hand panel displays the information in the center and left panels in a different way, via boxplots of the estimates for \(\alpha\) obtained by generating 1,000 simulated data sets from the true population and using the bootstrap approach. Again, the boxplots have similar spreads, indicating that the bootstrap approach can be used to effectively estimate the variability associated with \(\hat{\alpha}\).
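For reference, equation (5.8) mentioned in that passage is just the sample standard deviation of the \(B\) bootstrap estimates,
\[ \text{SE}_B(\hat{\alpha}) = \sqrt{\frac{1}{B - 1} \sum_{r=1}^B \left( \hat{\alpha}^{*r} - \frac{1}{B} \sum_{r'=1}^B \hat{\alpha}^{*r'} \right)^2}, \]
which here is simply sd(d1_boot_alpha).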
5.3 Lab: Cross-Validation and the Bootstrap
5.3.1 The Validation Set Approach
Here are the MSE values as computed in 5.1.1:
auto_poly_fits_validation_10 %>%
  filter(poly_df <= 3, rep == 1) %>%
  mutate(mse = rmse^2)
## # A tibble: 3 × 4
## rep poly_df rmse mse
## <chr> <int> <dbl> <dbl>
## 1 1 1 4.82 23.3
## 2 1 2 4.33 18.7
## 3 1 3 4.34 18.8
5.3.2 Leave-One-Out Cross-Validation
Because LOOCV is deterministic (not random), the process in 5.1.2 produces the same MSE values:
auto_poly_fits_loo_cv
## # A tibble: 10 × 2
## poly_df mse
## <int> <dbl>
## 1 1 24.2
## 2 2 19.2
## 3 3 19.3
## 4 4 19.4
## 5 5 19.0
## 6 6 19.0
## 7 7 18.8
## 8 8 19.0
## 9 9 19.1
## 10 10 19.5
5.3.3 \(k\)-fold Cross-Validation
The MSE values as computed in 5.1.3:
auto_poly_fits_10_fold_cv %>%
  filter(rep == 1) %>%
  mutate(mse = rmse^2)
## # A tibble: 10 × 4
## rep poly_df rmse mse
## <chr> <int> <dbl> <dbl>
## 1 1 1 4.87 23.7
## 2 1 2 4.31 18.6
## 3 1 3 4.31 18.6
## 4 1 4 4.32 18.7
## 5 1 5 4.28 18.3
## 6 1 6 4.28 18.3
## 7 1 7 4.24 18.0
## 8 1 8 4.26 18.1
## 9 1 9 4.26 18.2
## 10 1 10 4.32 18.7
5.3.4 The Bootstrap
Estimating the Accuracy of a Statistic of Interest
portfolio <- ISLR2::Portfolio
glimpse(portfolio)
## Rows: 100
## Columns: 2
## $ X <dbl> -0.89525089, -1.56245433, -0.41708988, 1.04435573, -0.31556841, -1.7…
## $ Y <dbl> -0.2349235, -0.8851760, 0.2718880, -0.7341975, 0.8419834, -2.0371910…
The tidymodels
approach to bootstrap estimates, as in 5.2:
portfolio_boot <- bootstraps(portfolio, times = 1000)
portfolio_boot_alpha <- map_dbl(
  portfolio_boot$splits,
  function(split) {
    xy <- as.data.frame(split)
    (var(xy$Y) - cov(xy$X, xy$Y)) /
      (var(xy$X) + var(xy$Y) - 2 * cov(xy$X, xy$Y))
  }
)
mean(portfolio_boot_alpha); sd(portfolio_boot_alpha)
## [1] 0.5784967
## [1] 0.08952573
Estimating the Accuracy of a Regression Model
The bootstrap approach can be used to assess the variability of the coefficient estimates and predictions from a statistical learning method. Here we use the bootstrap approach in order to assess the variability of the estimates for \(\beta_0\) and \(\beta_1\), the intercept and slope terms for the linear regression model that uses horsepower to predict mpg in the
Auto
data set. We will compare the estimates obtained using the bootstrap to those obtained using the formulas for \(\text{SE}(\hat{\beta}_0)\) and \(\text{SE}(\hat{\beta}_1)\) described in Section 3.1.2.
auto_boot <- bootstraps(auto, times = 1000)
By default, the tune::fit_resamples()
function does not keep the model object from each fit.
Usually, all we care about is the performance metrics and (if we're tuning) the best hyperparameters.
So in order to get the actual regression coefficients, I need to provide a control_grid()
object to the control
argument of fit_resamples()
(see this article for more explanation):
get_lm_coefs <- function(x) {
  x %>% extract_fit_engine() %>% broom::tidy()
}
tidy_ctrl <- control_grid(extract = get_lm_coefs)
Now fit the resamples with this control object:
tic()
auto_boot_fit <- workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(recipe(mpg ~ horsepower, data = auto)) %>%
  fit_resamples(auto_boot, control = tidy_ctrl)
toc()
## 72.1 sec elapsed
auto_boot_fit %>% head()
## # A tibble: 6 x 5
## splits id .metrics .notes .extracts
## <list> <chr> <list> <list> <list>
## 1 <split [392/152]> Bootstrap0001 <tibble [2 x 4]> <tibble [0 x 1]> <tibble [1 ~
## 2 <split [392/145]> Bootstrap0002 <tibble [2 x 4]> <tibble [0 x 1]> <tibble [1 ~
## 3 <split [392/152]> Bootstrap0003 <tibble [2 x 4]> <tibble [0 x 1]> <tibble [1 ~
## 4 <split [392/154]> Bootstrap0004 <tibble [2 x 4]> <tibble [0 x 1]> <tibble [1 ~
## 5 <split [392/139]> Bootstrap0005 <tibble [2 x 4]> <tibble [0 x 1]> <tibble [1 ~
## 6 <split [392/138]> Bootstrap0006 <tibble [2 x 4]> <tibble [0 x 1]> <tibble [1 ~
The .extracts
column contains the tidy model objects.
It is a bit clunky to work with – it is a list column with a list of tibbles, so requires tidyr::unnest()
twice:
# A single `.extracts` element, which contains a tibble
auto_boot_fit$.extracts[[1]]
## # A tibble: 1 × 2
## .extracts .config
## <list> <chr>
## 1 <tibble [2 × 5]> Preprocessor1_Model1
# Using `unnest` to get all `.extracts`
auto_boot_fit_coefs <- auto_boot_fit %>%
  unnest(.extracts) %>%
  unnest(.extracts) %>%
  select(id, term, estimate, std.error)
auto_boot_fit_coefs %>% head()
## # A tibble: 6 × 4
## id term estimate std.error
## <chr> <chr> <dbl> <dbl>
## 1 Bootstrap0001 (Intercept) 39.4 0.695
## 2 Bootstrap0001 horsepower -0.154 0.00601
## 3 Bootstrap0002 (Intercept) 37.6 0.665
## 4 Bootstrap0002 horsepower -0.139 0.00587
## 5 Bootstrap0003 (Intercept) 39.6 0.688
## 6 Bootstrap0003 horsepower -0.154 0.00620
Then take the mean of the estimates, and apply equation 5.8 to get the standard error:
auto_boot_fit_coefs %>%
  group_by(term) %>%
  summarise(
    mean_estimate = mean(estimate),
    se_estimate = sd(estimate),
    .groups = "drop"
  ) %>%
  gt() %>%
  fmt_number(columns = -term, decimals = 4)
term | mean_estimate | se_estimate |
---|---|---|
(Intercept) | 39.9855 | 0.8707 |
horsepower | −0.1583 | 0.0075 |
Not exactly the same as those in the text due to the difference in the random bootstrap samples. And here are the SE estimates using the formula from section 3.1.2:
tidy(lm(mpg ~ horsepower, data = auto)) %>%
select(term, estimate, se_estimate = std.error) %>%
gt() %>%
fmt_number(columns = -term, decimals = 4)
term | estimate | se_estimate |
---|---|---|
(Intercept) | 39.9359 | 0.7175 |
horsepower | −0.1578 | 0.0064 |
These estimates are similar, but wouldn’t be considered close.
Does this indicate a problem with the bootstrap? In fact, it suggests the opposite. Recall that the standard formulas given in Equation 3.8 on page 66 rely on certain assumptions. For example, they depend on the unknown parameter \(\sigma^2\), the noise variance. We then estimate \(\sigma^2\) using the RSS. Now although the formulas for the standard errors do not rely on the linear model being correct, the estimate for \(\sigma^2\) does. We see in Figure 3.8 on page 91 that there is a non-linear relationship in the data, and so the residuals from a linear fit will be inflated, and so will \(\hat{\sigma}^2\). Secondly, the standard formulas assume (somewhat unrealistically) that the \(x_i\) are fixed, and all the variability comes from the variation in the errors \(\epsilon_i\). The bootstrap approach does not rely on any of these assumptions, and so it is likely giving a more accurate estimate of the standard errors of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) than is the
summary()
function.
We can find better correspondence between bootstrap and regression estimates if we use the quadratic model because it better fits the data:
tic()
auto_boot_quad_fit <- workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(
    recipe(mpg ~ horsepower, data = auto) %>%
      # Need to set `raw = TRUE` to not use orthogonal polynomials
      step_poly(horsepower, degree = 2, options = list(raw = TRUE))
  ) %>%
  fit_resamples(auto_boot, control = tidy_ctrl)
toc()
## 76.58 sec elapsed
auto_boot_quad_fit %>%
  unnest(.extracts) %>%
  unnest(.extracts) %>%
  group_by(term) %>%
  summarise(
    mean_estimate = mean(estimate),
    se_estimate = sd(estimate),
    .groups = "drop"
  ) %>%
  gt() %>%
  fmt_number(columns = -term, decimals = 4)
term | mean_estimate | se_estimate |
---|---|---|
(Intercept) | 56.9755 | 2.1490 |
horsepower_poly_1 | −0.4673 | 0.0340 |
horsepower_poly_2 | 0.0012 | 0.0001 |
# Need to set `raw` = TRUE to not use orthogonal polynomials
tidy(lm(mpg ~ poly(horsepower, 2, raw = TRUE), data = auto)) %>%
select(term, estimate, se_estimate = std.error) %>%
gt() %>%
fmt_number(columns = -term, decimals = 4)
term | estimate | se_estimate |
---|---|---|
(Intercept) | 56.9001 | 1.8004 |
poly(horsepower, 2, raw = TRUE)1 | −0.4662 | 0.0311 |
poly(horsepower, 2, raw = TRUE)2 | 0.0012 | 0.0001 |
5.4 Exercises
Applied
5. Logistic regression with default - validation set
default <- ISLR2::Default
- Fit a logistic regression model that uses income and balance to predict default.
default_workflow <- workflow() %>%
  add_model(logistic_reg(mode = "classification", engine = "glm"))
default_rec <- recipe(default ~ income + balance, data = default)

default_fit <- default_workflow %>%
  add_recipe(default_rec) %>%
  fit(data = default)
extract_fit_parsnip(default_fit)
## parsnip model object
##
##
## Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) income balance
## -1.154e+01 2.081e-05 5.647e-03
##
## Degrees of Freedom: 9999 Total (i.e. Null); 9997 Residual
## Null Deviance: 2921
## Residual Deviance: 1579 AIC: 1585
- Using the validation set approach, estimate the test error of this model.
Set seed and split:
set.seed(940)
default_splits <- validation_split(default, prop = 0.5)
Fit and evaluate the model with the hold out:
default_fit_validation <- default_workflow %>%
  add_recipe(default_rec) %>%
  fit_resamples(default_splits)
collect_metrics(default_fit_validation)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.973 1 NA Preprocessor1_Model1
## 2 roc_auc binary 0.949 1 NA Preprocessor1_Model1
The accuracy of the model is 97.26% and so the test error is 2.74%.
The tidymodels
workflow made fitting and assessing this model very easy – just 5 lines of code.
For comparison purposes, here is how you could do it without tidymodels
(excluding the data splitting part):
default_train <- training(default_splits$splits[[1]])
default_assess <- assessment(default_splits$splits[[1]])

default_glm <- glm(default ~ income + balance, data = default_train,
                   family = "binomial")
augment(default_glm, newdata = default_assess, type.predict = "response") %>%
  mutate(default_pred = ifelse(.fitted > 0.5, "Yes", "No")) %>%
  summarise(error_rate = mean(default != default_pred)) %>%
  .$error_rate %>% scales::percent(accuracy = 0.01)
## [1] "2.74%"
- Repeat the process in (b) three times, using three different splits of the observations. Comment on the results.
set.seed(109)
bind_rows(
  collect_metrics(default_fit_validation) %>% mutate(rep = 1),
  map_dfr(
    2:4,
    function(rep) {
      default_workflow %>%
        add_recipe(default_rec) %>%
        fit_resamples(validation_split(default, prop = 0.5)) %>%
        collect_metrics() %>%
        mutate(rep = rep)
    }
  )
) %>%
  filter(.metric == "accuracy") %>%
  transmute(
    rep,
    accuracy = scales::percent(mean, accuracy = 0.01),
    test_error = scales::percent(1 - mean, accuracy = 0.01)
  ) %>%
  gt()
rep | accuracy | test_error |
---|---|---|
1 | 97.26% | 2.74% |
2 | 97.52% | 2.48% |
3 | 97.46% | 2.54% |
4 | 97.26% | 2.74% |
The estimates are actually quite stable.
- Add the dummy variable student to the logistic regression model. Comment on whether or not this leads to a reduction in the test error rate.
default_rec_student <-
  recipe(default ~ income + balance + student, data = default) %>%
  step_dummy(student)
default_fit_validation_student <- default_workflow %>%
  add_recipe(default_rec_student) %>%
  fit_resamples(default_splits)
collect_metrics(default_fit_validation_student)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.972 1 NA Preprocessor1_Model1
## 2 roc_auc binary 0.949 1 NA Preprocessor1_Model1
Adding the student variable resulted in a test error rate of 2.78%, and so did not lead to a reduction compared to the original model.
6. Logistic regression with default - standard error estimation
- Determine the estimated standard errors for the coefficients using summary() and glm().
Here it is in base R:
glm(default ~ income + balance, data = default, family = "binomial") %>%
summary()
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
In tidymodels
, I can use the extract_fit_engine()
function to get the same:
logistic_reg() %>%
fit(default ~ income + balance, data = default) %>%
extract_fit_engine() %>%
summary()
##
## Call:
## stats::glm(formula = default ~ income + balance, family = stats::binomial,
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
- Write a function that takes as input the default data set as well as the index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.
Instead of the workflow I used in 5.2 (with fit_resamples()), I'll show a different way.
First, a function that takes a split object (a bootstrap sample), fits the model and returns the tidy regression coefficients:
get_tidy_default_fit <- function(split) {
  logistic_reg() %>%
    fit(default ~ income + balance, data = as.data.frame(split)) %>%
    broom::tidy()
}
- Get the bootstrap estimates for the standard errors.
Now I use purrr::map()
to fit the models using the function from (b):
default_boot <- bootstraps(default, times = 100)
tic()
default_fits <- map_dfr(
  default_boot$splits, get_tidy_default_fit,
  .id = "split"
)
toc()
## 291.56 sec elapsed
Then estimate the standard errors of the coefficients:
default_fits %>%
  group_by(term) %>%
  summarise(mean_estimate = mean(estimate), se_estimate = sd(estimate)) %>%
  gt() %>%
  fmt_number(columns = -term, n_sigfig = 3)
term | mean_estimate | se_estimate |
---|---|---|
(Intercept) | −11.6 | 0.443 |
balance | 0.00566 | 0.000235 |
income | 0.0000210 | 0.00000492 |
- Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.
glm(default ~ income + balance, data = default, family = "binomial") %>%
  broom::tidy() %>%
  select(term, estimate, se_estimate = std.error) %>%
  gt() %>%
  fmt_number(columns = -term, n_sigfig = 3)
term | estimate | se_estimate |
---|---|---|
(Intercept) | −11.5 | 0.435 |
income | 0.0000208 | 0.00000499 |
balance | 0.00565 | 0.000227 |
The estimates are essentially the same between the two methods.
7. LOOCV with weekly
weekly <- ISLR2::Weekly
- Fit a logistic regression model that predicts Direction using Lag1 and Lag2.
direction_fit <- logistic_reg() %>%
  fit(Direction ~ Lag1 + Lag2, data = weekly)
tidy(direction_fit)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.221 0.0615 3.60 0.000319
## 2 Lag1 -0.0387 0.0262 -1.48 0.140
## 3 Lag2 0.0602 0.0265 2.27 0.0232
- Fit the same model using all but the first observation.
A shortcut to getting all but the first row of a data frame is tail(df, -1):
direction_fit_1 <- logistic_reg() %>%
  fit(Direction ~ Lag1 + Lag2, data = tail(weekly, -1))
tidy(direction_fit_1)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.223 0.0615 3.63 0.000283
## 2 Lag1 -0.0384 0.0262 -1.47 0.143
## 3 Lag2 0.0608 0.0266 2.29 0.0220
- Use the model from (b) to predict the direction of the first observation. Was this observation correctly classified?
augment(direction_fit_1, new_data = head(weekly, 1)) %>%
select(Lag1, Lag2, Direction, .pred_class, .pred_Down, .pred_Up)
## # A tibble: 1 × 6
## Lag1 Lag2 Direction .pred_class .pred_Down .pred_Up
## <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 0.816 1.57 Down Up 0.429 0.571
It was predicted as “Up”, which is incorrect.
- Write a for loop from \(i = 1\) to \(i = n\), where \(n\) is the number of observations in the data set, that fits a model using all but the \(i\)th observation to predict Direction. Then use that fit to predict the Direction of the held out observation.
Instead of a for loop, I'll use purrr:
tic()
direction_loocv_preds <-
  map_dfr(
    1:nrow(weekly),
    ~ logistic_reg() %>%
      # Fit with `slice` to exclude the single row (`-.x`)
      fit(Direction ~ Lag1 + Lag2, data = slice(weekly, -.x)) %>%
      # Predict with `slice` to get just that row (`.x`)
      augment(new_data = slice(weekly, .x)) %>%
      select(Lag1, Lag2, Direction, starts_with(".pred")),
    .id = "i"
  )
toc()
## 296.35 sec elapsed
When running long computations in an iterative way (with a for loop or mapping with purrr
), it is very easy to take advantage of parallel computing with future
and furrr
:
library(future)
# Use all but one of my cores
plan(multisession(workers = availableCores() - 1))
tic()
direction_loocv_preds <-
  furrr::future_map_dfr(
    1:nrow(weekly),
    ~ logistic_reg() %>%
      fit(Direction ~ Lag1 + Lag2, data = slice(weekly, -.x)) %>%
      augment(new_data = slice(weekly, .x)) %>%
      select(Lag1, Lag2, Direction, starts_with(".pred")),
    .id = "i"
  )
toc()
## 23.19 sec elapsed
- Take the average of the \(n\) numbers obtained in (d) in order to obtain the LOOCV estimate for the test error. Comment on the results.
I can compute the error, as well as some other metrics, manually like this:
direction_loocv_preds %>%
  summarise(
    n_correct = sum(Direction == .pred_class),
    accuracy = n_correct / n(),
    error_rate = 1 - accuracy,
    # Defining "Up" as positive, compute sensitivity and specificity
    sensitivity = sum(Direction == "Up" & .pred_class == "Up") /
      sum(Direction == "Up"),
    specificity = sum(Direction == "Down" & .pred_class == "Down") /
      sum(Direction == "Down"),
    across(c(accuracy, error_rate, sensitivity, specificity), scales::percent)
  ) %>%
  gt()
n_correct | accuracy | error_rate | sensitivity | specificity |
---|---|---|---|---|
599 | 55% | 45% | 93% | 7% |
The yardstick
package has a helpful metrics()
function for quickly computing various metrics:
metrics(direction_loocv_preds, truth = Direction, estimate = .pred_class)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.550
## 2 kap binary 0.00451
For classification, this returns by default the accuracy and the Kappa statistic. The latter is a metric that compares the observed accuracy with that of a random chance classifier. It may be familiar in the context of inter-rater reliability, which is an important aspect of test/scale/measurement validity. Kappa ranges from -1 to 1, with values at or below 0 indicating no agreement beyond chance and higher values indicating better agreement. The threshold for an acceptable value will vary by context, but the value here (0.0045) is undoubtedly poor.
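To make the chance comparison concrete, here is a minimal sketch (not part of the original lab) of the kappa calculation from the confusion matrix of the LOOCV predictions:
# Cohen's kappa by hand: (observed - chance agreement) / (1 - chance agreement)
cm <- table(truth = direction_loocv_preds$Direction,
            estimate = direction_loocv_preds$.pred_class)
p_o <- sum(diag(cm)) / sum(cm)                     # observed accuracy
p_e <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2  # expected agreement by chance
(p_o - p_e) / (1 - p_e)
This should reproduce the kap value from metrics() above.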
I can define a custom metric_set() to get my desired classification metrics for this model:3
classification_metrics <- metric_set(accuracy, sensitivity, specificity)
classification_metrics(direction_loocv_preds,
                       truth = Direction, estimate = .pred_class) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  transmute(
    error_rate = 1 - accuracy,
    across(c(accuracy, error_rate, sensitivity, specificity), scales::percent)
  ) %>%
  gt()
error_rate | accuracy | sensitivity | specificity |
---|---|---|---|
45% | 55% | 7% | 93% |
Note that the labels are switched from my manual calculation: sensitivity is swapped for specificity.
This is because the “positive” level (called the event_level
in yardstick
) was manually chosen as “Up” by me, but automatically chosen as “Down” by yardstick
.
I can switch this by changing the event_level
argument in the sensitivity()
and specificity()
functions:
sensitivity(direction_loocv_preds, truth = Direction, estimate = .pred_class,
event_level = "second")
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 sensitivity binary 0.934
But ideally this is something I would set before fitting the model – by setting “Up” as the first level of the Direction
factor, tidymodels
will interpret it as the “event” or “positive” level.
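For example, a minimal sketch of that relevelling (the weekly_releveled name is just for illustration):
# Make "Up" the first factor level so it is treated as the event level
weekly_releveled <- weekly %>%
  mutate(Direction = forcats::fct_relevel(Direction, "Up"))
levels(weekly_releveled$Direction)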
8. Cross-validation on simulated data
- Generate a simulated data set. What is \(n\) and \(p\)? Write out the model used to generate the data in equation form.
set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)
The number of observations is \(n\) = 100, and the number of predictors is \(p\) = 1.
\[ \begin{align} Y &= \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon \\ X &\sim \text{Normal}(0, 1) \\ \epsilon &\sim \text{Normal}(0, 1) \\ \beta_0 &= 0 \\ \beta_1 &= 1 \\ \beta_2 &= -2 \end{align} \]
- Create a scatterplot of \(X\) against \(Y\).
d <- tibble(x, y)
ggplot(d, aes(x, y)) +
  geom_point()
This is obviously a very non-linear relationship, with a negative parabola shape. A model without a polynomial term of \(X\) will definitely be a poor fit.
- Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:
\[ \begin{align} Y &= \beta_0 + \beta_1 X + \epsilon \\ Y &= \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon \\ Y &= \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon \\ Y &= \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \epsilon \end{align} \]
set.seed(927)
tic()
sim_fits_c <- crossing(i = 1:nrow(d), poly_df = 1:4) %>%
  mutate(
    fit = furrr::future_map2(
      i, poly_df,
      ~ linear_reg() %>%
        fit(y ~ poly(x, degree = .y), data = slice(d, -.x))
    ),
    resid_i = furrr::future_map2_dbl(
      i, fit,
      ~ augment(.y, new_data = slice(d, .x)) %>% pull(.resid)
    )
  )
toc()
## 18.86 sec elapsed
sim_fits_c %>%
  group_by(poly_df) %>%
  summarise(mse = sum(resid_i^2)) %>%
  gt() %>%
  fmt_number(mse, decimals = 2)
poly_df | mse |
---|---|
1 | 728.82 |
2 | 93.74 |
3 | 95.66 |
4 | 95.39 |
- Repeat (c) using another random seed. Are your results the same? Why?
set.seed(213)
sim_fits_d <- crossing(i = 1:nrow(d), poly_df = 1:4) %>%
  mutate(
    fit = furrr::future_map2(
      i, poly_df,
      ~ linear_reg() %>%
        fit(y ~ poly(x, degree = .y), data = slice(d, -.x))
    ),
    resid_i = furrr::future_map2_dbl(
      i, fit,
      ~ augment(.y, new_data = slice(d, .x)) %>% pull(.resid)
    )
  )
sim_fits_d %>%
  group_by(poly_df) %>%
  summarise(mse = sum(resid_i^2)) %>%
  gt() %>%
  fmt_number(mse, decimals = 2)
poly_df | mse |
---|---|
1 | 728.82 |
2 | 93.74 |
3 | 95.66 |
4 | 95.39 |
Despite setting different random seeds, the results are exactly the same because none of the above code involved the generation of random numbers. LOOCV is an entirely deterministic process – each observation is used a single time as the holdout/evaluation data set.
- Which of the models had the smallest LOOCV error?
The lowest MSE was the second model (\(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon\)) which is unsurprising because this is also the model used to generate the data.
- Comment on the statistical significance of the coefficient estimates that result from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?
sim_fits_c %>%
  # For this, use the model fits that excluded the first (i = 1) observation
  filter(i == 1) %>%
  mutate(fit_tidy = map(fit, tidy)) %>%
  unnest(fit_tidy) %>%
  transmute(
    poly_df = paste0("polynomial degree = ", poly_df),
    term = factor(term,
                  labels = c("beta_0", "beta_1", "beta_1",
                             "beta_2", "beta_3", "beta_4")),
    estimate = round(estimate, 2), std.error = round(std.error, 2),
    p.value = scales::pvalue(p.value)
  ) %>%
  group_by(poly_df) %>%
  gt()
term | estimate | std.error | p.value |
---|---|---|---|
polynomial degree = 1 | |||
beta_0 | -1.55 | 0.26 | <0.001 |
beta_1 | 6.17 | 2.61 | 0.020 |
polynomial degree = 2 | |||
beta_0 | -1.55 | 0.10 | <0.001 |
beta_1 | 6.17 | 0.96 | <0.001 |
beta_2 | -23.96 | 0.96 | <0.001 |
polynomial degree = 3 | |||
beta_0 | -1.55 | 0.10 | <0.001 |
beta_1 | 6.17 | 0.97 | <0.001 |
beta_2 | -23.96 | 0.97 | <0.001 |
beta_3 | 0.32 | 0.97 | 0.737 |
polynomial degree = 4 | |||
beta_0 | -1.55 | 0.10 | <0.001 |
beta_1 | 6.17 | 0.96 | <0.001 |
beta_2 | -23.96 | 0.96 | <0.001 |
beta_3 | 0.32 | 0.96 | 0.736 |
beta_4 | 1.24 | 0.96 | 0.201 |
This is aligned with the cross-validation results: the model with the 2nd degree polynomial (\(\beta_1\) and \(\beta_2\)) has statistically significant coefficients. The 3rd and 4th degree terms (\(\beta_3\) and \(\beta_4\)) are not significant in either model.
9. Bootstrap with boston
boston <- ISLR2::Boston
- Provide an estimate for the population mean of
medv
. Call this estimate \(\hat{\mu}\).
mean(boston$medv)
## [1] 22.53281
- Provide an estimate for the standard error of \(\hat{\mu}\).
sd(boston$medv) / sqrt(nrow(boston))
## [1] 0.4088611
- Now estimate the standard error of \(\hat{\mu}\) using the bootstrap. How does this compare to your answer from (b)?
set.seed(214)
boston_boot <- bootstraps(boston, times = 1000)
boston_boot_mu <- map_dbl(
  boston_boot$splits, ~ mean(as.data.frame(.x)$medv)
)
mean_mu <- mean(boston_boot_mu)
se_mu <- sd(boston_boot_mu)
mean_mu; se_mu
## [1] 22.54254
## [1] 0.4066961
For comparison, here is the implementation with the boot
package:
set.seed(214)
medv_mean <- function(data, i) mean(data$medv[i])
boot::boot(boston, statistic = medv_mean, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot::boot(data = boston, statistic = medv_mean, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.009737747 0.4181119
The results are slightly different, even with the same seed, because the inner workings of the functions use random numbers in different ways, but both are very close to the estimate from (b).
- Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(boston$medv).
Approximate interval using the bootstrap estimates:
mean_mu - 2 * se_mu; mean_mu + 2 * se_mu
## [1] 21.72915
## [1] 23.35594
t.test(boston$medv)
##
## One Sample t-test
##
## data: boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
The 95% confidence intervals are essentially equal.
- Provide an estimate, \(\hat{\mu}_{\text{med}}\), for the median value of
medv
in the population.
median(boston$medv)
## [1] 21.2
- We now would like to estimate the standard error of \(\hat{\mu}_{\text{med}}\). Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap.
set.seed(25)
boston_boot_median <- map_dbl(
  boston_boot$splits, ~ median(as.data.frame(.x)$medv)
)
mean_median <- mean(boston_boot_median)
se_median <- sd(boston_boot_median)
mean_median; se_median
## [1] 21.19575
## [1] 0.3749333
- Provide an estimate, \(\hat{\mu}_{0.1}\), for the tenth percentile of
medv
.
quantile(boston$medv, probs = 0.1)
## 10%
## 12.75
- Use the bootstrap to estimate the standard error of \(\hat{\mu}_{0.1}\).
set.seed(6929)
boston_boot_p10 <- map_dbl(
  boston_boot$splits, ~ quantile(as.data.frame(.x)$medv, probs = 0.1)
)
mean_p10 <- mean(boston_boot_p10)
se_p10 <- sd(boston_boot_p10)
mean_p10; se_p10
## [1] 12.78025
## [1] 0.4978406
Reproducibility
Sys.time()
## [1] "2022-09-19 10:30:18 AST"
if ("git2r" %in% installed.packages()) {
if (git2r::in_repository()) {
::repository()
git2r
} }
## Local: main C:/Users/tdunn/Documents/learning/islr-tidy
## Remote: main @ origin (https://github.com/taylordunn/islr-tidy)
## Head: [b60974c] 2022-09-19: Working on the empirical examples in Chapter 4 (issue #1)
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.1 (2022-06-23 ucrt)
## os Windows 10 x64 (build 19044)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_Canada.utf8
## ctype English_Canada.utf8
## tz America/Curacao
## date 2022-09-19
## pandoc 2.18 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.0)
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.1)
## backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
## base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.2.0)
## bayestestR 0.12.1 2022-05-02 [1] CRAN (R 4.2.1)
## bit 4.0.4 2020-08-04 [1] CRAN (R 4.2.1)
## bit64 4.0.5 2020-08-30 [1] CRAN (R 4.2.1)
## bookdown 0.27 2022-06-14 [1] CRAN (R 4.2.1)
## boot 1.3-28 2021-05-03 [2] CRAN (R 4.2.1)
## broom * 1.0.0 2022-07-01 [1] CRAN (R 4.2.1)
## bslib 0.4.0 2022-07-16 [1] CRAN (R 4.2.1)
## cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.1)
## car 3.1-0 2022-06-15 [1] CRAN (R 4.2.1)
## carData 3.0-5 2022-01-06 [1] CRAN (R 4.2.1)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.1)
## checkmate 2.1.0 2022-04-21 [1] CRAN (R 4.2.1)
## class 7.3-20 2022-01-16 [2] CRAN (R 4.2.1)
## cli 3.3.0 2022-04-25 [1] CRAN (R 4.2.1)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.1)
## colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.1)
## combinat 0.0-8 2012-10-29 [1] CRAN (R 4.2.0)
## corrr * 0.4.4 2022-08-16 [1] CRAN (R 4.2.1)
## crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.1)
## datawizard 0.4.1 2022-05-16 [1] CRAN (R 4.2.1)
## DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.1)
## dbplyr 2.2.1 2022-06-27 [1] CRAN (R 4.2.1)
## dials * 1.0.0 2022-06-14 [1] CRAN (R 4.2.1)
## DiceDesign 1.9 2021-02-13 [1] CRAN (R 4.2.1)
## digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.1)
## discrim * 1.0.0 2022-06-23 [1] CRAN (R 4.2.1)
## distill 1.4 2022-05-12 [1] CRAN (R 4.2.1)
## distributional 0.3.0 2022-01-05 [1] CRAN (R 4.2.1)
## doParallel * 1.0.17 2022-02-07 [1] CRAN (R 4.2.1)
## downlit 0.4.2 2022-07-05 [1] CRAN (R 4.2.1)
## dplyr * 1.0.9 2022-04-28 [1] CRAN (R 4.2.1)
## dunnr * 0.2.6 2022-08-07 [1] Github (taylordunn/dunnr@e2a8213)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.1)
## equatiomatic 0.3.1 2022-01-30 [1] CRAN (R 4.2.1)
## evaluate 0.15 2022-02-18 [1] CRAN (R 4.2.1)
## extrafont 0.18 2022-04-12 [1] CRAN (R 4.2.0)
## extrafontdb 1.0 2012-06-11 [1] CRAN (R 4.2.0)
## fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.1)
## farver 2.1.1 2022-07-06 [1] CRAN (R 4.2.1)
## fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.1)
## forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.2.1)
## foreach * 1.5.2 2022-02-02 [1] CRAN (R 4.2.1)
## fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.1)
## furrr 0.3.0 2022-05-04 [1] CRAN (R 4.2.1)
## future 1.27.0 2022-07-22 [1] CRAN (R 4.2.1)
## future.apply 1.9.0 2022-04-25 [1] CRAN (R 4.2.1)
## gargle 1.2.0 2021-07-02 [1] CRAN (R 4.2.1)
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.1)
## GGally 2.1.2 2021-06-21 [1] CRAN (R 4.2.1)
## ggdist * 3.2.0 2022-07-19 [1] CRAN (R 4.2.1)
## ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.2.1)
## ggrepel 0.9.1 2021-01-15 [1] CRAN (R 4.2.1)
## git2r 0.30.1 2022-03-16 [1] CRAN (R 4.2.1)
## globals 0.15.1 2022-06-24 [1] CRAN (R 4.2.1)
## glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.1)
## googledrive 2.0.0 2021-07-08 [1] CRAN (R 4.2.1)
## googlesheets4 1.0.0 2021-07-21 [1] CRAN (R 4.2.1)
## gower 1.0.0 2022-02-03 [1] CRAN (R 4.2.0)
## GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.2.1)
## gt * 0.6.0 2022-05-24 [1] CRAN (R 4.2.1)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 4.2.1)
## hardhat 1.2.0 2022-06-30 [1] CRAN (R 4.2.1)
## haven 2.5.0 2022-04-15 [1] CRAN (R 4.2.1)
## here * 1.0.1 2020-12-13 [1] CRAN (R 4.2.1)
## highr 0.9 2021-04-16 [1] CRAN (R 4.2.1)
## hms 1.1.1 2021-09-26 [1] CRAN (R 4.2.1)
## htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.2.1)
## httpuv 1.6.5 2022-01-05 [1] CRAN (R 4.2.1)
## httr 1.4.3 2022-05-04 [1] CRAN (R 4.2.1)
## igraph 1.3.4 2022-07-19 [1] CRAN (R 4.2.1)
## infer * 1.0.2 2022-06-10 [1] CRAN (R 4.2.1)
## insight 0.18.0 2022-07-05 [1] CRAN (R 4.2.1)
## ipred 0.9-13 2022-06-02 [1] CRAN (R 4.2.1)
## ISLR2 * 1.3-1 2022-01-10 [1] CRAN (R 4.2.1)
## iterators * 1.0.14 2022-02-05 [1] CRAN (R 4.2.1)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.1)
## jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.2.1)
## kknn 1.3.1 2016-03-26 [1] CRAN (R 4.2.1)
## klaR 1.7-1 2022-06-27 [1] CRAN (R 4.2.1)
## knitr 1.39 2022-04-26 [1] CRAN (R 4.2.1)
## labeling 0.4.2 2020-10-20 [1] CRAN (R 4.2.0)
## labelled 2.9.1 2022-05-05 [1] CRAN (R 4.2.1)
## later 1.3.0 2021-08-18 [1] CRAN (R 4.2.1)
## lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.1)
## lava 1.6.10 2021-09-02 [1] CRAN (R 4.2.1)
## lhs 1.1.5 2022-03-22 [1] CRAN (R 4.2.1)
## lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.1)
## listenv 0.8.0 2019-12-05 [1] CRAN (R 4.2.1)
## lubridate 1.8.0 2021-10-07 [1] CRAN (R 4.2.1)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.1)
## MASS 7.3-57 2022-04-22 [2] CRAN (R 4.2.1)
## Matrix 1.4-1 2022-03-23 [2] CRAN (R 4.2.1)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.1)
## MetBrewer 0.2.0 2022-03-21 [1] CRAN (R 4.2.1)
## mgcv 1.8-40 2022-03-29 [2] CRAN (R 4.2.1)
## mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.1)
## modeldata * 1.0.0 2022-07-01 [1] CRAN (R 4.2.1)
## modelr 0.1.8 2020-05-19 [1] CRAN (R 4.2.1)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.1)
## mvtnorm * 1.1-3 2021-10-08 [1] CRAN (R 4.2.0)
## nlme 3.1-157 2022-03-25 [2] CRAN (R 4.2.1)
## nnet 7.3-17 2022-01-16 [2] CRAN (R 4.2.1)
## parallelly 1.32.1 2022-07-21 [1] CRAN (R 4.2.1)
## parsnip * 1.0.0 2022-06-16 [1] CRAN (R 4.2.1)
## patchwork * 1.1.1 2020-12-17 [1] CRAN (R 4.2.1)
## performance 0.9.1 2022-06-20 [1] CRAN (R 4.2.1)
## pillar 1.8.0 2022-07-18 [1] CRAN (R 4.2.1)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.1)
## plyr 1.8.7 2022-03-24 [1] CRAN (R 4.2.1)
## poissonreg * 1.0.0 2022-06-15 [1] CRAN (R 4.2.1)
## prodlim 2019.11.13 2019-11-17 [1] CRAN (R 4.2.1)
## promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.1)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.2.1)
## questionr 0.7.7 2022-01-31 [1] CRAN (R 4.2.1)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.1)
## RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.2.0)
## Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.1)
## readr * 2.1.2 2022-01-30 [1] CRAN (R 4.2.1)
## readxl 1.4.0 2022-03-28 [1] CRAN (R 4.2.1)
## recipes * 1.0.1 2022-07-07 [1] CRAN (R 4.2.1)
## repr 1.1.4 2022-01-04 [1] CRAN (R 4.2.1)
## reprex 2.0.1 2021-08-05 [1] CRAN (R 4.2.1)
## reshape 0.8.9 2022-04-12 [1] CRAN (R 4.2.1)
## rlang 1.0.4 2022-07-12 [1] CRAN (R 4.2.1)
## rmarkdown 2.14 2022-04-25 [1] CRAN (R 4.2.1)
## rpart 4.1.16 2022-01-24 [2] CRAN (R 4.2.1)
## rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.2.1)
## rsample * 1.1.0 2022-08-08 [1] CRAN (R 4.2.1)
## rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.2.1)
## Rttf2pt1 1.3.8 2020-01-10 [1] CRAN (R 4.2.1)
## rvest 1.0.2 2021-10-16 [1] CRAN (R 4.2.1)
## sass 0.4.2 2022-07-16 [1] CRAN (R 4.2.1)
## scales * 1.2.0 2022-04-13 [1] CRAN (R 4.2.1)
## see 0.7.1 2022-06-20 [1] CRAN (R 4.2.1)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.1)
## shiny 1.7.2 2022-07-19 [1] CRAN (R 4.2.1)
## skimr 2.1.4 2022-04-15 [1] CRAN (R 4.2.1)
## stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.1)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.2.1)
## survival 3.3-1 2022-03-03 [2] CRAN (R 4.2.1)
## tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.1)
## tictoc * 1.0.1 2021-04-19 [1] CRAN (R 4.2.0)
## tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.1)
## tidyr * 1.2.0 2022-02-01 [1] CRAN (R 4.2.1)
## tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.1)
## tidyverse * 1.3.2 2022-07-18 [1] CRAN (R 4.2.1)
## timeDate 4021.104 2022-07-19 [1] CRAN (R 4.2.1)
## tune * 1.0.0 2022-07-07 [1] CRAN (R 4.2.1)
## tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.1)
## usethis 2.1.6 2022-05-25 [1] CRAN (R 4.2.1)
## utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.1)
## vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.1)
## vroom 1.5.7 2021-11-30 [1] CRAN (R 4.2.1)
## withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.1)
## workflows * 1.0.0 2022-07-05 [1] CRAN (R 4.2.1)
## workflowsets * 1.0.0 2022-07-12 [1] CRAN (R 4.2.1)
## xfun 0.31 2022-05-10 [1] CRAN (R 4.2.1)
## xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.1)
## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.1)
## yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
## yardstick * 1.0.0 2022-06-06 [1] CRAN (R 4.2.1)
##
## [1] C:/Users/tdunn/AppData/Local/R/win-library/4.2
## [2] C:/Program Files/R/R-4.2.1/library
##
## ──────────────────────────────────────────────────────────────────────────────
In the text, they compute the MSE but here I am computing RMSE then squaring it. This is because yardstick has an rmse() function but not an mse() function. If I wanted to, I could define a custom metric like this.↩︎
Since there is no random splitting with this approach, I don't need to use a function that sets the random seed.↩︎
The list of available metrics in yardstick can be found in this article.↩︎