# 1 Bootstrap

## 1.1 Introduction

Bootstrap was proposed by Efron in 1979 and it mimics the sampling process by sampling with replacement the original sample. This process replicates the sample variation and allows the calculation of standard errors. It can also be used to refine more complex machine learning algorithms, as we will see later.

From the original sample of size \(n\), we create many (e.g. 10,000) sample of size \(n\) by sampling the sample with replacement. Notice that the replacement allows the same value to be included multiple times in the same sample. This is why you can create many different samples. For each of the bootstrapped sample we compute the statistics of interest. The standard deviance of all these computed statistics, is the standard error.

## 1.2 Readings

Read the following chapter of *An introduction to statistical learning*:

## 1.3 Practice session

### Task - Confidence intervals with bootstrap

We will be using the bmd.csv dataset to plot the histogram, compute the median and 95% confidence interval for the “waiting_time”

Let’s first read the data and create the variable BMI that will be used later.

```
library(caret) #library for Machine Learning
library(boot) #library for bootstrap
library(pROC) #library for the ROC curve
library(Rmisc) #CI() function to compute conf interval
set.seed(1974)
#the option stringsAsFactors = TRUE in the command below converts
#string variables as sex into factor variables
<- read.csv("https://www.dropbox.com/s/c6mhgatkotuze8o/bmd.csv?dl=1",
bmd.data stringsAsFactors = TRUE)
#computes the BMI
$bmi <- bmd.data$weight_kg / (bmd.data$height_cm/100)^2 bmd.data
```

Let’s plot the histogram for *waiting time* and compute the median

`hist(bmd.data$waiting_time)`

`median(bmd.data$waiting_time)`

`## [1] 14`

And now use the function **boot()**, from the library **boot**, to bootstrap the median. We will also compute it manually

```
#using the boot function
median(bmd.data$waiting_time)
```

`## [1] 14`

```
<- function(x, d) { #need to define the function to use bootstrap
samplemedian return(median(x[d])) #d is the index for the bootstrap
}<- boot(bmd.data$waiting_time, statistic=samplemedian, R=10000)
bootresults # get 95% confidence interval
boot.ci(bootresults, type="perc")
```

```
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootresults, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (13, 18 )
## Calculations and Intervals on Original Scale
```

```
#manual bootstrap
<- NA
median.bs for (i in 1:10000){ # change to 10000
<- sample(bmd.data$waiting_time, 169, replace = TRUE)
sample.bs <- median(sample.bs)
median.bs[i]
}median(median.bs) #median (could use mean) of all the bootstrapped medians
```

`## [1] 14`

`sd(median.bs) #Std error of the median`

`## [1] 1.34752`

`quantile(median.bs, c(0.025, 0.975)) #95% empirical confidence interval`

```
## 2.5% 97.5%
## 12 18
```

**TRY IT YOURSELF:**

- Compute the mean for
*waiting_time*and the usual 95% confidence interval using the**CI()**function.

##
*See the solution code*

```
#library(Rmisc) #CI() function to compute the conf interval for the mean
CI(bmd.data$waiting_time)
```

- Compute the 95% confidence interval for the mean using the boot function

##
*See the solution code*

```
CI(bmd.data$waiting_time)
<- function(x, d) { #need to define the function to use bootstrap
samplemean return(mean(x[d])) #d is the index for the bootstrap
}
<- boot(bmd.data$waiting_time, statistic=samplemean, R=10000)
bootresults boot.ci(bootresults, type="perc")
```

## 1.4 Exercises

Solve the following exercises:

The diabetes data were provided by Hastie and Tibshirani (1990, p. 6). The observations arise from a study of the factors affecting patterns of insulin-dependent diabetes mellitus in 43 children (Sockett et al., 1987). The aim was to investigate the dependence of serum C-peptide on other factors, better to understand the patterns of residual insulin secretion.

The response, *cpep*, is the log of C-peptide concentration at diagnosis, and
the selected covariates are *age*, the child’s age at diagnosis, and *base*,
minus their base deficit. Base deficit is a measure of acidity.

**1) Calculate the Pearson correlation coefficient between cpep and base with
the respective 95% confidence interval obtained by the Fisher’s
z-transformation (the
usual way of getting the confidence interval for the Pearson’s correlation)**

*Note: you can use the function CIr in the “psychometric” package*

**2) Write your own function to compute the 95% confidence interval
for the above correlation, using bootstrap.**

**3) Use the function boot() from the boot package to compute
the 95% confidence interval through bootstrap**

**4) Plot the histogram with the correlations obtained in
the bootstrap samples.**

##
*See the solution code*

```
#install.packages("psychometric") # install package with function for
# the correlation confidence interval
#install.packages("boot") # install package for bootstrap function
library(boot)
library(psychometric)
set.seed(1001)
<- read.csv("https://www.dropbox.com/s/6rc00ealjtyp3qi/diabetes.csv?dl=1")
myData
<- dim(myData)[1] #nr of observations
sample.size
#1 - the Person correlation
cor(myData$base, myData$cpep)
CIr( r = cor(myData$base, myData$cpep), #conf interval using
n = sample.size, level = .95) # Fisher's z-transformations
#2 - Bootstrap
<- 25000 #choose how many bootstraps
n.boot <- NULL # to store the bootstrap correlations
cor.boot
#manually implementing bootstrap
for (i in 1:n.boot) {
<- sample.int(sample.size, #bootstrap the
id.bs #original sample
sample.size, replace = TRUE)
<- cor(myData[id.bs, ])[2,3] #Compute correlation
cor.boot[i] #between
}quantile(cor.boot, c(0.025, 0.975)) #95% confidence interval
#3 - Using the boot() function
<- function(data, d) {
sample.corr return(cor(data$base[d], data$cpep[d])) #d is the index for the bootstrap
}
<- boot(myData,
bootcorr statistic=sample.corr,
R=n.boot)
# get 95% confidence interval
boot.ci(bootcorr, type="perc")
#4 - Histogram
#the correlations for the bootstrap samples
#are stored in bootcorr$t
hist(bootcorr$t)
```

```
#or
#hist(cor.boot)
```