# 1 Bootstrap

## 1.1 Introduction

Bootstrap was proposed by Efron in 1979 and it mimics the sampling process by sampling with replacement the original sample. This process replicates the sample variation and allows the calculation of standard errors. It can also be used to refine more complex machine learning algorithms, as we will see later.

From the original sample of size $$n$$, we create many (e.g. 10,000) sample of size $$n$$ by sampling the sample with replacement. Notice that the replacement allows the same value to be included multiple times in the same sample. This is why you can create many different samples. For each of the bootstrapped sample we compute the statistics of interest. The standard deviance of all these computed statistics, is the standard error.

Read the following chapter of An introduction to statistical learning:

## 1.3 Practice session

### Task - Confidence intervals with bootstrap

We will be using the bmd.csv dataset to plot the histogram, compute the median and 95% confidence interval for the “waiting_time”

Let’s first read the data and create the variable BMI that will be used later.

library(caret) #library for Machine Learning
library(boot)  #library for bootstrap
library(pROC)  #library for the ROC curve
library(Rmisc) #CI() function to compute conf interval
set.seed(1974)

#the option stringsAsFactors = TRUE in the command below converts
#string variables as sex into factor variables
stringsAsFactors = TRUE)
#computes the BMI
bmd.data$bmi <- bmd.data$weight_kg / (bmd.data$height_cm/100)^2 Let’s plot the histogram for waiting time and compute the median hist(bmd.data$waiting_time) median(bmd.data$waiting_time) ##  14 And now use the function boot(), from the library boot, to bootstrap the median. We will also compute it manually #using the boot function median(bmd.data$waiting_time)
##  14
samplemedian <- function(x, d) { #need to define the function to use bootstrap
return(median(x[d]))           #d is the index for the bootstrap
}
bootresults <- boot(bmd.data$waiting_time, statistic=samplemedian, R=10000) # get 95% confidence interval boot.ci(bootresults, type="perc") ## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS ## Based on 10000 bootstrap replicates ## ## CALL : ## boot.ci(boot.out = bootresults, type = "perc") ## ## Intervals : ## Level Percentile ## 95% (13, 18 ) ## Calculations and Intervals on Original Scale #manual bootstrap median.bs <- NA for (i in 1:10000){ # change to 10000 sample.bs <- sample(bmd.data$waiting_time, 169, replace = TRUE)
median.bs[i] <- median(sample.bs)
}
median(median.bs)  #median (could use mean) of all the bootstrapped medians
##  14
sd(median.bs)      #Std error of the median
##  1.34752
quantile(median.bs, c(0.025, 0.975))  #95% empirical confidence interval
##  2.5% 97.5%
##    12    18

TRY IT YOURSELF:

1. Compute the mean for waiting_time and the usual 95% confidence interval using the CI() function.
See the solution code
#library(Rmisc) #CI() function to compute the conf interval for the mean
CI(bmd.data$waiting_time) 1. Compute the 95% confidence interval for the mean using the boot function See the solution code CI(bmd.data$waiting_time)

samplemean <- function(x, d) { #need to define the function to use bootstrap
return(mean(x[d]))           #d is the index for the bootstrap
}

bootresults <- boot(bmd.data$waiting_time, statistic=samplemean, R=10000) boot.ci(bootresults, type="perc") ## 1.4 Exercises Solve the following exercises: The diabetes data were provided by Hastie and Tibshirani (1990, p. 6). The observations arise from a study of the factors affecting patterns of insulin-dependent diabetes mellitus in 43 children (Sockett et al., 1987). The aim was to investigate the dependence of serum C-peptide on other factors, better to understand the patterns of residual insulin secretion. The response, cpep, is the log of C-peptide concentration at diagnosis, and the selected covariates are age, the child’s age at diagnosis, and base, minus their base deficit. Base deficit is a measure of acidity. 1) Calculate the Pearson correlation coefficient between cpep and base with the respective 95% confidence interval obtained by the Fisher’s z-transformation (the usual way of getting the confidence interval for the Pearson’s correlation) Note: you can use the function CIr in the “psychometric” package 2) Write your own function to compute the 95% confidence interval for the above correlation, using bootstrap. 3) Use the function boot() from the boot package to compute the 95% confidence interval through bootstrap 4) Plot the histogram with the correlations obtained in the bootstrap samples. See the solution code #install.packages("psychometric") # install package with function for # the correlation confidence interval #install.packages("boot") # install package for bootstrap function library(boot) library(psychometric) set.seed(1001) myData <- read.csv("https://www.dropbox.com/s/6rc00ealjtyp3qi/diabetes.csv?dl=1") sample.size <- dim(myData) #nr of observations #1 - the Person correlation cor(myData$base, myData$cpep) CIr( r = cor(myData$base, myData$cpep), #conf interval using n = sample.size, level = .95) # Fisher's z-transformations #2 - Bootstrap n.boot <- 25000 #choose how many bootstraps cor.boot <- NULL # to store the bootstrap correlations #manually implementing bootstrap for (i in 1:n.boot) { id.bs <- sample.int(sample.size, #bootstrap the sample.size, #original sample replace = TRUE) cor.boot[i] <- cor(myData[id.bs, ])[2,3] #Compute correlation #between } quantile(cor.boot, c(0.025, 0.975)) #95% confidence interval #3 - Using the boot() function sample.corr <- function(data, d) { return(cor(data$base[d], data$cpep[d])) #d is the index for the bootstrap } bootcorr <- boot(myData, statistic=sample.corr, R=n.boot) # get 95% confidence interval boot.ci(bootcorr, type="perc") #4 - Histogram #the correlations for the bootstrap samples #are stored in bootcorr$t
hist(bootcorr\$t) #or
#hist(cor.boot)