1 Bootstrap
1.1 Introduction
Bootstrap was proposed by Efron in 1979 and it mimics the sampling process by sampling with replacement the original sample. This process replicates the sample variation and allows the calculation of standard errors. It can also be used to refine more complex machine learning algorithms, as we will see later.
From the original sample of size \(n\), we create many (e.g. 10,000) sample of size \(n\) by sampling the sample with replacement. Notice that the replacement allows the same value to be included multiple times in the same sample. This is why you can create many different samples. For each of the bootstrapped sample we compute the statistics of interest. The standard deviance of all these computed statistics, is the standard error.
1.3 Practice session
Task - Confidence intervals with bootstrap
We will be using the bmd.csv dataset to plot the histogram, compute the median and 95% confidence interval for the “waiting_time”
Let’s first read the data and create the variable BMI that will be used later.
library(caret) #library for Machine Learning
## Warning: package 'ggplot2' was built under R version 4.1.2
library(boot) #library for bootstrap
library(pROC) #library for the ROC curve
library(Rmisc) #CI() function to compute conf interval
set.seed(1974)
#the option stringsAsFactors = TRUE in the command below converts
#string variables as sex into factor variables
<- read.csv("https://www.dropbox.com/s/c6mhgatkotuze8o/bmd.csv?dl=1",
bmd.data stringsAsFactors = TRUE)
#computes the BMI
$bmi <- bmd.data$weight_kg / (bmd.data$height_cm/100)^2 bmd.data
Let’s plot the histogram for waiting time and compute the median
hist(bmd.data$waiting_time)
median(bmd.data$waiting_time)
## [1] 14
And now use the function boot(), from the library boot, to bootstrap the median. We will also compute it manually
#using the boot function
median(bmd.data$waiting_time)
## [1] 14
<- function(x, d) { #need to define the function to use bootstrap
samplemedian return(median(x[d])) #d is the index for the bootstrap
}<- boot(bmd.data$waiting_time, statistic=samplemedian, R=10000)
bootresults # get 95% confidence interval
boot.ci(bootresults, type="perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootresults, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (13, 18 )
## Calculations and Intervals on Original Scale
#manual bootstrap
<- NA
median.bs for (i in 1:10000){ # change to 10000
<- sample(bmd.data$waiting_time, 169, replace = TRUE)
sample.bs <- median(sample.bs)
median.bs[i]
}median(median.bs) #median (could use mean) of all the bootstrapped medians
## [1] 14
sd(median.bs) #Std error of the median
## [1] 1.34752
quantile(median.bs, c(0.025, 0.975)) #95% empirical confidence interval
## 2.5% 97.5%
## 12 18
TRY IT YOURSELF:
- Compute the mean for waiting_time and the usual 95% confidence interval using the CI() function.
See the solution code
#library(Rmisc) #CI() function to compute the conf interval for the mean
CI(bmd.data$waiting_time)
- Compute the 95% confidence interval for the mean using the boot function
See the solution code
CI(bmd.data$waiting_time)
<- function(x, d) { #need to define the function to use bootstrap
samplemean return(mean(x[d])) #d is the index for the bootstrap
}
<- boot(bmd.data$waiting_time, statistic=samplemean, R=10000)
bootresults boot.ci(bootresults, type="perc")
1.4 Exercises
Solve the following exercises:
The diabetes data were provided by Hastie and Tibshirani (1990, p. 6). The observations arise from a study of the factors affecting patterns of insulin-dependent diabetes mellitus in 43 children (Sockett et al., 1987). The aim was to investigate the dependence of serum C-peptide on other factors, better to understand the patterns of residual insulin secretion.
The response, cpep, is the log of C-peptide concentration at diagnosis, and the selected covariates are age, the child’s age at diagnosis, and base, minus their base deficit. Base deficit is a measure of acidity.
1) Calculate the Pearson correlation coefficient between cpep and base with the respective 95% confidence interval obtained by the Fisher’s z-transformation (the usual way of getting the confidence interval for the Pearson’s correlation)
Note: you can use the function CIr in the “psychometric” package
2) Write your own function to compute the 95% confidence interval for the above correlation, using bootstrap.
3) Use the function boot() from the boot package to compute the 95% confidence interval through bootstrap
4) Plot the histogram with the correlations obtained in the bootstrap samples.
See the solution code
#install.packages("psychometric") # install package with function for
# the correlation confidence interval
#install.packages("boot") # install package for bootstrap function
library(boot)
library(psychometric)
set.seed(1001)
<- read.csv("https://www.dropbox.com/s/6rc00ealjtyp3qi/diabetes.csv?dl=1")
myData
<- dim(myData)[1] #nr of observations
sample.size
#1 - the Person correlation
cor(myData$base, myData$cpep)
CIr( r = cor(myData$base, myData$cpep), #conf interval using
n = sample.size, level = .95) # Fisher's z-transformations
#2 - Bootstrap
<- 25000 #choose how many bootstraps
n.boot <- NULL # to store the bootstrap correlations
cor.boot
#manually implementing bootstrap
for (i in 1:n.boot) {
<- sample.int(sample.size, #bootstrap the
id.bs #original sample
sample.size, replace = TRUE)
<- cor(myData[id.bs, ])[2,3] #Compute correlation
cor.boot[i] #between
}quantile(cor.boot, c(0.025, 0.975)) #95% confidence interval
#3 - Using the boot() function
<- function(data, d) {
sample.corr return(cor(data$base[d], data$cpep[d])) #d is the index for the bootstrap
}
<- boot(myData,
bootcorr statistic=sample.corr,
R=n.boot)
# get 95% confidence interval
boot.ci(bootcorr, type="perc")
#4 - Histogram
#the correlations for the bootstrap samples
#are stored in bootcorr$t
hist(bootcorr$t)
#or
#hist(cor.boot)