Chapter 9 The Bootstrap

This lecture is based on:

  • L. Wasserman. All of Statistics. Springer, 2010.
  • G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning with Applications in R. Chapter 5. Springer, 2013.
  • B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 2010.
  • A.W. van der Vaart. Asymptotic Statistics. Chapter 23. Cambridge University Press, 1998.
  • A. DasGupta. Asymptotic Theory of Statistics and Probability. Chapter 29. Springer, 2008.

The Bootstrap belongs to the wider class of resampling methods.

  • It was introduced in 1979 by Bradley Efron and offers a principled, simulation-based approach to assessing the accuracy of statistical estimates: estimating biases and variances, and obtaining appropriate confidence intervals.

  • It applies to both parametric (when the distribution of the data is known and parameterized by a finite-dimensional parameter) and nonparametric settings.

9.1 Motivating Examples

9.1.1 Confidence interval for the median of a population

Let $X_1, X_2, \dots, X_n$ be independent identically distributed (iid) random variables from a cumulative distribution function (cdf) $F$, and suppose we are interested in the median $m = F^{-1}(0.5)$ of the distribution. We consider the following estimator
$$\hat{M}_n = \begin{cases} X_{((n+1)/2)} & \text{if } n \text{ odd},\\ \tfrac{1}{2}\left(X_{(n/2)} + X_{(n/2+1)}\right) & \text{if } n \text{ even},\end{cases}$$
where $X_{(r)}$ is the $r$th order statistic of the random sample $(X_1,\dots,X_n)$.

Example 1 (Wages). We are interested in the median wage in the mid-Atlantic states of the USA in 2005. We consider the dataset from the R package ISLR, which contains the wages of $n = 447$ workers in that region in 2005.

library(ISLR)
data(Wage)
wage2005 = Wage$wage[Wage$year==2005]
length(wage2005)
## [1] 447
hist(wage2005)

mhat = median(wage2005)
mhat
## [1] 104.9215

The median estimate is just a point estimate of the population median $m$. How do we get a confidence interval for $m$? If $F$ is differentiable with probability density function (pdf) $f$, then the asymptotic distribution of $\hat{M}_n$ satisfies
$$\sqrt{n}\left(\hat{M}_n - m\right) \xrightarrow{d} N\!\left(0, \frac{1}{4f^2(m)}\right) \quad \text{as } n \to \infty.$$

  • We cannot evaluate the variance term because the density $f$ is unknown.
  • If we could evaluate $\sigma^2 = V(\hat{M}_n)$, then we could use the normal approximation to obtain an approximate $1-\alpha$ confidence interval. The bootstrap allows us to obtain an estimate of the variance $\sigma^2$ of the estimator.
  • Alternatively, if we knew the quantiles $q_{\alpha/2}$ and $q_{1-\alpha/2}$ of the distribution of $\hat{M}_n - m$, where $P(\hat{M}_n - m \le q_x) = x$ for $x \in [0,1]$, then we could use these quantiles to build a $1-\alpha$ confidence interval $\left[\hat{M}_n - q_{1-\alpha/2},\ \hat{M}_n - q_{\alpha/2}\right]$ for $m$, since $P\left(q_{\alpha/2} \le \hat{M}_n - m \le q_{1-\alpha/2}\right) = 1-\alpha$. These quantiles are also in general impossible to obtain analytically. The bootstrap provides estimates of the quantiles $q_{\alpha/2}$ and $q_{1-\alpha/2}$, and thus approximate confidence intervals.

9.1.2 Bias and Variance of an estimator

Let $X_1,\dots,X_n$ be iid random variables from a pdf $f(x;\theta)$ parametrized by $\theta$. Let $\hat{\theta}_n = t(X_1,\dots,X_n)$ be some estimator of $\theta$.

  • In order to evaluate the accuracy of the estimator, we may be interested in evaluating

    • Bias: $E[\hat{\theta}_n] - \theta$
    • Variance: $\mathrm{Var}(\hat{\theta}_n)$.
  • The bootstrap is a general strategy for obtaining estimates of these quantities, as well as confidence intervals.

9.2 Background

9.2.1 Statistical functional

A statistical functional $T(F)$ is any function of the cdf $F$. Examples include

  • the mean $E(X) = \int x \, dF(x)$
  • the variance $\mathrm{Var}(X) = \int x^2 \, dF(x) - \left(\int x \, dF(x)\right)^2$
  • the median $m = F^{-1}(0.5)$.

9.2.2 Empirical distribution function

Let $X_1,\dots,X_n$ be iid real-valued random variables with cdf $F$. The empirical cumulative distribution function is defined as
$$F_n(x) = \frac{1}{n}\sum_{i=1}^n I(X_i \le x) = \frac{\#\{i : X_i \le x\}}{n}.$$

The empirical cdf $F_n(x)$ has the following attractive theoretical properties:

  1. Unbiasedness:
     $E[F_n(x)] = F(x)$ for all $x \in \mathbb{R}$.

  2. Strong consistency:
     $F_n(x) \xrightarrow{a.s.} F(x)$ as $n \to \infty$.

  3. Asymptotic normality:
     $\sqrt{n}\left(F_n(x) - F(x)\right) \xrightarrow{d} N\left(0, F(x)\{1-F(x)\}\right)$.

Proof. For any $x \in \mathbb{R}$, the binary random variables $I(X_i \le x)$, $i = 1,\dots,n$, are independent and identically Bernoulli distributed with success probability $F(x)$. Hence the discrete random variable $nF_n(x) \sim \mathrm{Binomial}(n, F(x))$, and (1) follows. (2) follows from the strong law of large numbers, and (3) from the central limit theorem.

ecdf.wages = ecdf(wage2005)
plot(ecdf.wages, xlab = 'x', ylab = 'Fn(x)', main = 'Empirical cdf', col='lightblue2')
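
For intuition, here is a small simulation sketch of the strong consistency property; the standard normal data are an illustrative assumption. The largest gap between $F_n$ and $F$ over the sample points shrinks as $n$ grows.

# Illustration of strong consistency: Fn approaches F as n grows
set.seed(1)
for (n in c(20, 200, 2000)) {
  x <- rnorm(n) # simulated N(0,1) data, so the true cdf is pnorm
  # largest gap between the empirical and true cdf at the sample points
  print(max(abs(ecdf(x)(x) - pnorm(x))))
}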

9.2.3 Monte Carlo Integration

The Monte Carlo method is a way to approximate potentially high-dimensional integrals via simulation.

Let $Y \in \mathbb{R}^p$ be a continuous or discrete random variable with cdf $G$. Consider
$$\eta = E(\phi(Y)) = \int_{\mathbb{R}^p} \phi(y)\, dG(y),$$
where $\phi : \mathbb{R}^p \to \mathbb{R}$. Let $Y^{(1)},\dots,Y^{(B)}$ be iid random variables with cdf $G$. Then
$$\hat{\eta}_B = \frac{1}{B}\sum_{j=1}^B \phi\left(Y^{(j)}\right)$$
is called the Monte Carlo estimator of the expectation $\eta$.

The Monte Carlo algorithm, which only requires the ability to simulate from $G$, is as follows:


Algorithm 1: Monte Carlo Algorithm

  1. Simulate independent $Y^{(1)},\dots,Y^{(B)}$ with cdf $G$.

  2. Return $\hat{\eta}_B = \frac{1}{B}\sum_{j=1}^B \phi\left(Y^{(j)}\right)$.


Here are the properties of the Monte Carlo estimator, which follow from the law of large numbers and the central limit theorem.

Proposition 5. The Monte Carlo estimator is

  1. Unbiased:
     $E(\hat{\eta}_B) = \eta$ for any $B \ge 1$.

  2. (Strongly) consistent:
     $\hat{\eta}_B \xrightarrow{a.s.} \eta$ as $B \to \infty$.

  3. Asymptotically normal:
     if $\sigma^2 = V(\phi(Y))$ exists, then $\sqrt{B}\left(\hat{\eta}_B - \eta\right) \xrightarrow{d} N(0, \sigma^2)$ as $B \to \infty$.

For example, the Monte Carlo estimators of the mean and variance of $Y$ are:

$$\hat{\mu}_B = \frac{1}{B}\sum_{j=1}^B Y^{(j)} \xrightarrow{a.s.} E(Y),$$

$$\hat{\sigma}^2_B = \frac{1}{B}\sum_{j=1}^B \left(Y^{(j)} - \hat{\mu}_B\right)^2 = \frac{1}{B}\sum_{j=1}^B \left(Y^{(j)}\right)^2 - \left(\frac{1}{B}\sum_{j=1}^B Y^{(j)}\right)^2 \xrightarrow{a.s.} V(Y).$$
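
As a quick illustration, here is a minimal sketch of the Monte Carlo estimator with the illustrative choices $\phi(y) = y^2$ and $Y \sim N(0,1)$, so that $\eta = E(Y^2) = 1$:

# Monte Carlo estimate of eta = E(Y^2) for Y ~ N(0,1); the true value is 1
set.seed(1)
B <- 100000
Y <- rnorm(B)
mean(Y^2) # Monte Carlo estimator of eta
sqrt(var(Y^2)/B) # Monte Carlo standard error suggested by the CLT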

9.3 Bootstrap

Let $X_1,\dots,X_n$ be iid random variables with cdf $F$. The distribution $F$ of the data is unknown, and we cannot even simulate from $F$. Consider the estimator $\hat{\theta}_n = t(X_1,\dots,X_n)$ for estimating $\theta = T(F)$, a statistical functional of $F$. This setup is represented below:

Real World: $F \Longrightarrow X_1,\dots,X_n \Longrightarrow \hat{\theta}_n = t(X_1,\dots,X_n)$

9.3.1 Variance Estimation

We are first interested in estimating the variance $\sigma^2 = V_F(\hat{\theta}_n)$ of the estimator $\hat{\theta}_n$. We write $V_F$ to emphasize the fact that the variance depends on the unknown cdf $F$.

9.3.1.1 An idealized variance estimator

Suppose first that we could simulate from the cdf $F$, that is, we could reproduce the Real World and generate new datasets. Then we could use the Monte Carlo method to obtain a Monte Carlo estimator for $\sigma^2$ as follows.

  • For $j = 1,\dots,B$:
    • Let $X_1^{(j)},\dots,X_n^{(j)}$ be iid from $F$ and $\hat{\theta}_n^{(j)} = t\left(X_1^{(j)},\dots,X_n^{(j)}\right)$
  • The Monte Carlo variance estimator is
    $$\hat{\sigma}^2_{n,B} = \frac{1}{B}\sum_{j=1}^B \left(\hat{\theta}_n^{(j)} - \frac{1}{B}\sum_{k=1}^B \hat{\theta}_n^{(k)}\right)^2.$$

However, the cdf F is unknown and we cannot simulate from it, so we cannot implement this estimator.
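
In a simulation study, however, $F$ is known by construction and the idealized estimator can serve as a gold standard. Here is a minimal sketch, with the illustrative choices $F = N(0.4, 1)$ and the median as the statistic (anticipating the example of Section 9.3.1.3):

# Idealized Monte Carlo variance estimator; feasible here only because
# we know F = N(0.4, 1) by construction and can draw fresh datasets from it
set.seed(2)
n <- 500
B <- 10000
theta.mc <- replicate(B, median(rnorm(n, mean = 0.4)))
var(theta.mc) # "gold standard" variance of the median estimator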

9.3.1.2 The bootstrap for variance estimation

The idea of the bootstrap is to:

  1. Replace the unknown cdf F by its (known) empirical cdf Fn
  2. Use the Monte Carlo method.

More formally, conditionally on $X_1,\dots,X_n$, let $X_1^*,\dots,X_n^*$ be iid random variables with cdf $F_n$ and $\hat{\theta}_n^* = t(X_1^*,\dots,X_n^*)$. The bootstrap mimics the Real World by using the empirical cdf $F_n$:

  • Real World: $F \Longrightarrow X_1,\dots,X_n \Longrightarrow \hat{\theta}_n = t(X_1,\dots,X_n)$
  • Bootstrap World: $F_n \Longrightarrow X_1^*,\dots,X_n^* \Longrightarrow \hat{\theta}_n^* = t(X_1^*,\dots,X_n^*)$

The bootstrap relies on two approximations:

  1. Approximate $V_F(\hat{\theta}_n)$ by $V_{F_n}(\hat{\theta}_n^*)$, using the empirical cdf $F_n$.
  2. Approximate $V_{F_n}(\hat{\theta}_n^*)$ by Monte Carlo to obtain the bootstrap estimate $\hat{\sigma}^2_{n,B}$.

In a nutshell:

$$V_F(\hat{\theta}_n) \xrightarrow{\text{empirical cdf}} V_{F_n}(\hat{\theta}_n^*) \xrightarrow{\text{Monte Carlo}} \hat{\sigma}^2_{n,B}$$

  • Simulating $X_1^*,\dots,X_n^*$ from $F_n$ is easy: $F_n$ puts mass $1/n$ at each of the observed values $X_1,\dots,X_n$.

  • $X_1^*,\dots,X_n^*$ are thus discrete random variables, and we can simulate from $F_n$ by sampling with replacement from the original dataset $\{X_1,\dots,X_n\}$.


Algorithm 2: Bootstrap algorithm for estimating the variance

Let $X_1,\dots,X_n$ be some data and $\hat{\theta}_n = t(X_1,\dots,X_n)$.

  • For $j = 1,\dots,B$:
    • Simulate $X_1^{(j)},\dots,X_n^{(j)} \overset{iid}{\sim} F_n$ by sampling with replacement from $\{X_1,\dots,X_n\}$
    • Evaluate $\hat{\theta}_n^{(j)} = t\left(X_1^{(j)},\dots,X_n^{(j)}\right)$
  • Return the bootstrap variance estimate
    $$\hat{\sigma}^2_{n,B} = \frac{1}{B}\sum_{j=1}^B \left(\hat{\theta}_n^{(j)} - \frac{1}{B}\sum_{k=1}^B \hat{\theta}_n^{(k)}\right)^2$$

9.3.1.3 Example

Consider the median estimator $\hat{M}_n$; we use the bootstrap to estimate its variance $V_F(\hat{M}_n)$.

bootstrap_variance_median <- function(X, B) {
  # Bootstrap variance estimate for the median estimator
  # X: Data
  # B: Number of Monte Carlo samples
  n <- length(X)
  mhat.boot <- numeric(B) 

  for (j in 1:B){
    X.boot <- sample(X,n,replace=TRUE) # Sample bootstrap data from Fn
    mhat.boot[j] <- median(X.boot) # Median bootstrap samples 
  }
  var.boot <- var(mhat.boot) # Evaluate the bootstrap variance estimate
  return(list(var.boot = var.boot, mhat.boot = mhat.boot)) 
}

We start with a synthetic normal dataset with $n = 500$:

  • $X_1,\dots,X_n$ are normally distributed with mean 0.4 and variance 1.

n <- 500
X <- rnorm(n, mean=0.4) # Simulate synthetic Gaussian data
# Median estimate 
mhat.gauss <- median(X)

We now estimate the variance of the median estimator using the bootstrap.

B <- 10000
results.gauss = bootstrap_variance_median(X,B)
mhat.boot.gauss = results.gauss$mhat.boot
hist(mhat.boot.gauss, xlab='mhat.boot', main='',col='orange', breaks=20)
points(mhat.gauss,0,pch = 22,col='red',bg='red')

This is the histogram of the $B = 10000$ bootstrap samples $\hat{M}_n^{(1)},\dots,\hat{M}_n^{(B)}$ of the median estimator; the red square is $\hat{M}_n$ for the Gaussian synthetic dataset.

var.boot.gauss = results.gauss$var.boot
var.boot.gauss
## [1] 0.002706442

DIY: Compare it with the theoretical value. Because the data are synthetic, we know the true distribution $F$, so we can evaluate how accurate the bootstrap variance estimate is.
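
One possible check uses the asymptotic variance $1/(4 n f^2(m))$ from Section 9.1.1, with $f$ the $N(0.4, 1)$ density and true median $m = 0.4$:

# Theoretical (asymptotic) variance of the median: 1 / (4 * n * f(m)^2)
var.asym <- 1/(4 * n * dnorm(0.4, mean = 0.4, sd = 1)^2)
var.asym # about 0.00314 for n = 500; compare with var.boot.gauss above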

We now consider the wages dataset and use the bootstrap estimator to estimate the variance of the sample median.

B <- 10000
results.wages <- bootstrap_variance_median(wage2005,B)
mhat.boot.wages <- results.wages$mhat.boot
hist(mhat.boot.wages,xlab='mhat.boot', main='',col='orange', breaks=20)
points(mhat,0,pch = 22,col='red',bg='red')

This is the histogram of the $B = 10000$ bootstrap samples $\hat{M}_n^{(1)},\dots,\hat{M}_n^{(B)}$ of the median estimator; the red square is $\hat{M}_n$ for the wages dataset.

var.boot.wages = results.wages$var.boot
var.boot.wages
## [1] 5.543813

9.3.1.4 Confidence Intervals

Normal Approximation

If the central limit theorem holds for the estimator $\hat{\theta}_n$ (iid data, large $n$), then
$$\frac{\hat{\theta}_n - \theta}{\sqrt{V_F(\hat{\theta}_n)}} \xrightarrow{d} N(0,1).$$

We can now approximate $V_F(\hat{\theta}_n)$ by $\hat{\sigma}^2_{n,B}$. The $(1-\alpha)\times 100\%$ bootstrap normal confidence interval follows from

$$P\left(z_{\alpha/2} < \frac{\hat{\theta}_n - \theta}{\sqrt{\hat{\sigma}^2_{n,B}}} < z_{1-\alpha/2}\right) = 1 - \alpha.$$

Thus, the confidence interval can be expressed as
$$\left(\hat{\theta}_n - z_{1-\alpha/2}\sqrt{\hat{\sigma}^2_{n,B}},\ \hat{\theta}_n + z_{1-\alpha/2}\sqrt{\hat{\sigma}^2_{n,B}}\right).$$

Now, we can construct the 95% confidence interval for the wages dataset.

alpha = 0.05
cm = qnorm(1-alpha/2) # Confidence multiplier
se = sqrt(var.boot.wages) # standard error of the statistic
me = cm * se # margin of error 
ci.bootnormal.wages = c(mhat - me,mhat + me)
ci.bootnormal.wages
## [1] 100.3067 109.5363

Bootstrap Pivotal Confidence Interval

  • A pivotal quantity is a statistic whose distribution does not depend (or depends minimally) on unknown parameters.
  • Consider the random variable (pivot) $R_n = \hat{\theta}_n - \theta$.
  • Let $H_n(r) = P(R_n \le r)$ be the (unknown) cdf of $R_n$.
  • $H_n^{-1}$ is the associated quantile function: $q_\alpha := H_n^{-1}(\alpha) = \inf\{x : H_n(x) > \alpha\}$.
  • To construct a $1-\alpha$ confidence interval, we can consider
    $$1-\alpha = P\left(q_{\alpha/2} \le \hat{\theta}_n - \theta \le q_{1-\alpha/2}\right) = P\left(\hat{\theta}_n - q_{1-\alpha/2} \le \theta \le \hat{\theta}_n - q_{\alpha/2}\right).$$
  • However, we do not know these quantiles.
  • Here, we can use the bootstrap to obtain an approximation of these quantiles.
  • Bootstrap approximation: $R_n = \hat{\theta}_n - \theta$ is approximated by $R_n^* = \hat{\theta}_n^* - \hat{\theta}_n$.
  • Consider the distribution $H_n^*(r) = P_{F_n}(R_n^* \le r)$.
  • This can be approximated by Monte Carlo methods to obtain the bootstrap estimate
    $$\hat{H}_{n,B}(r) = \frac{1}{B}\sum_{j=1}^B I\left(R_n^{(j)} \le r\right), \quad \text{where } R_n^{(j)} = \hat{\theta}_n^{(j)} - \hat{\theta}_n.$$
  • In a nutshell:

$$H_n \xrightarrow{\text{empirical cdf}} H_n^* \xrightarrow{\text{Monte Carlo}} \hat{H}_{n,B}$$

  • We can obtain the quantiles from these approximations, $\hat{q}_\alpha = \hat{H}_{n,B}^{-1}(\alpha)$, and build the bootstrap pivotal $1-\alpha$ confidence interval
    $$C_n = \left[\hat{\theta}_n - \hat{q}_{1-\alpha/2},\ \hat{\theta}_n - \hat{q}_{\alpha/2}\right].$$

  • As $R_n^{(j)} = \hat{\theta}_n^{(j)} - \hat{\theta}_n$, we have $\hat{q}_\alpha = \hat{q}^{\theta}_\alpha - \hat{\theta}_n$, where $\hat{q}^{\theta}_\alpha$ is the $\alpha$ quantile of the bootstrap samples $(\hat{\theta}^{(1)},\dots,\hat{\theta}^{(B)})$.

  • Thus, in terms of $(\hat{\theta}^{(1)},\dots,\hat{\theta}^{(B)})$, the $1-\alpha$ bootstrap pivotal confidence interval is given by
    $$C_n = \left[2\hat{\theta}_n - \hat{q}^{\theta}_{1-\alpha/2},\ 2\hat{\theta}_n - \hat{q}^{\theta}_{\alpha/2}\right].$$

  • We now construct the 95% bootstrap pivotal CI for the wages dataset.

ci.bootpivot.wages = c(2*mhat-quantile(mhat.boot.wages, 1-alpha/2,names=FALSE) ,
                 2*mhat-quantile(mhat.boot.wages,alpha/2,names=FALSE))
print(ci.bootpivot.wages)
## [1] 100.0090 109.0787

9.3.2 Bias Estimation

The bias of an estimator $\hat{\theta}_n$ is $b := E_F(\hat{\theta}_n) - \theta$.

It is difficult to evaluate because the true parameter $\theta$ is unknown. However, it can be estimated by the bootstrap via $\hat{b}_n = E_{F_n}(\hat{\theta}_n^*) - \hat{\theta}_n$.

  • Using the bootstrap samples $(\hat{\theta}^{(1)},\dots,\hat{\theta}^{(B)})$ to approximate $E_{F_n}(\hat{\theta}_n^*)$, the bootstrap bias estimator is

$$\frac{1}{B}\sum_{j=1}^B \hat{\theta}_n^{(j)} - \hat{\theta}_n.$$
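
For instance, reusing the bootstrap replicates mhat.boot.wages computed above, a one-line sketch of the bias estimate for the sample median of the wages data:

# Bootstrap bias estimate for the sample median of the wages data
bias.boot.wages <- mean(mhat.boot.wages) - mhat
bias.boot.wages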

9.3.3 Bootstrap for regression

Let $(X_1,Y_1),\dots,(X_n,Y_n)$ be paired iid random variables assumed to be drawn from
$$Y_i = g(X_i, \theta) + \epsilon_i,$$
where $\epsilon_1,\dots,\epsilon_n$ are iid from some unknown cdf $F$ and $\theta \in \mathbb{R}^p$. Here $g$ is some known function, for example $g(X_i, \theta) = X_i^\top \theta$. Let $\hat{\theta}_n = t\left((X_1,Y_1),\dots,(X_n,Y_n)\right)$ be some estimator of $\theta$, for example the least squares estimator.

  • For $i = 1,\dots,n$, let $\hat{\epsilon}_i = Y_i - g(X_i, \hat{\theta}_n)$ be the fitted residuals. Two bootstrap strategies are possible:

    1. Resample the data $(X_1,Y_1),\dots,(X_n,Y_n)$.
    2. Resample the fitted residuals $(\hat{\epsilon}_1,\dots,\hat{\epsilon}_n)$.

The first is the standard nonparametric bootstrap approach. The second is a semiparametric bootstrap: it exploits the fitted regression function $g(X_i, \hat{\theta}_n)$ as part of the generating mechanism. Both strategies are illustrated in the sketch after the two algorithms below.

Nonparametric Paired Bootstrap

  • For $j = 1,\dots,B$:
    • Simulate $\left((X_1^{(j)},Y_1^{(j)}),\dots,(X_n^{(j)},Y_n^{(j)})\right)$ by sampling with replacement from $\{(X_1,Y_1),\dots,(X_n,Y_n)\}$
    • Evaluate $\hat{\theta}_n^{(j)} = t\left((X_1^{(j)},Y_1^{(j)}),\dots,(X_n^{(j)},Y_n^{(j)})\right)$
  • Return the bootstrap estimate. For example, for $\theta \in \mathbb{R}$, the variance estimate is

$$\hat{\sigma}^2_{n,B} = \frac{1}{B}\sum_{j=1}^B \left(\hat{\theta}_n^{(j)} - \frac{1}{B}\sum_{k=1}^B \hat{\theta}_n^{(k)}\right)^2$$

Semiparametric Residual Bootstrap

  • For $j = 1,\dots,B$:
    • Simulate $(\hat{\epsilon}_1^{(j)},\dots,\hat{\epsilon}_n^{(j)})$ by sampling with replacement from $(\hat{\epsilon}_1,\dots,\hat{\epsilon}_n)$
    • For $i = 1,\dots,n$, set $Y_i^{(j)} = g(X_i, \hat{\theta}_n) + \hat{\epsilon}_i^{(j)}$
    • Evaluate $\hat{\theta}_n^{(j)} = t\left((X_1,Y_1^{(j)}),\dots,(X_n,Y_n^{(j)})\right)$
  • Return the bootstrap estimate. For example, for $\theta \in \mathbb{R}$, the variance estimate is

$$\hat{\sigma}^2_{n,B} = \frac{1}{B}\sum_{j=1}^B \left(\hat{\theta}_n^{(j)} - \frac{1}{B}\sum_{k=1}^B \hat{\theta}_n^{(k)}\right)^2$$
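
Here is a minimal sketch contrasting the two strategies for a simple linear model fitted with lm(); the simulated data and variable names are illustrative assumptions, not part of the lecture's examples.

set.seed(42)
n <- 200
x <- runif(n)
y <- 1 + 2*x + rt(n, df = 5) # linear model with non-Gaussian errors
fit <- lm(y ~ x)
res <- residuals(fit) # fitted residuals
yhat <- fitted(fit)   # fitted values g(x_i, theta_hat)

B <- 2000
slope.paired <- numeric(B)
slope.resid <- numeric(B)
for (j in 1:B) {
  # Strategy 1: paired bootstrap - resample the (x_i, y_i) pairs
  idx <- sample(n, n, replace = TRUE)
  slope.paired[j] <- coef(lm(y[idx] ~ x[idx]))[2]
  # Strategy 2: residual bootstrap - resample the fitted residuals
  y.star <- yhat + sample(res, n, replace = TRUE)
  slope.resid[j] <- coef(lm(y.star ~ x))[2]
}
var(slope.paired) # paired bootstrap variance estimate of the slope
var(slope.resid)  # residual bootstrap variance estimate of the slope

If the model is correctly specified, the two estimates should be close; note that the residual bootstrap keeps the design points $X_i$ fixed.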

9.3.4 Parametric Bootstrap

The methods seen so far are nonparametric, in the sense that no assumption has been made on the cumulative distribution function (cdf) F.

In the parametric setting, we assume that $X_1,\dots,X_n$ are iid random variables with cdf $F_\theta$, where $\theta$ is an unknown parameter of the model we wish to estimate.

Let $\hat{\theta}_n = t(X_1,\dots,X_n)$ be some estimator of $\theta$, for example a maximum likelihood estimator (MLE).

The parametric bootstrap proceeds similarly to the nonparametric bootstrap described above, except that it uses the fitted cdf $F_{\hat{\theta}_n}$ instead of the empirical cdf to obtain the bootstrap samples.

If the parametric model is correctly specified, this approach can lead to superior performance compared to the nonparametric bootstrap.

9.3.4.1 Normal Model

In the parametric bootstrap, we assume the data follow a known distribution family with unknown parameters. Here we assume

$$X_1,\dots,X_n \overset{iid}{\sim} N(\mu, \sigma^2)$$

We want to estimate the variance of the sample mean $\bar{X}$ using the parametric bootstrap.

Step-by-Step Procedure

  1. Estimate $\hat{\mu}$ and $\hat{\sigma}$ from the observed data.
  2. Generate bootstrap samples from $N(\hat{\mu}, \hat{\sigma}^2)$.
  3. Compute the sample mean for each bootstrap sample.
  4. Estimate the variance of the bootstrap means.
set.seed(123)

# Step 1: Simulate observed data
n <- 100
true_mu <- 5
true_sd <- 2
x <- rnorm(n, mean = true_mu, sd = true_sd)

# Step 2: Estimate parameters from observed data
mu_hat <- mean(x)
sd_hat <- sd(x)

# Step 3: Parametric bootstrap
B <- 1000
boot_means <- numeric(B)

for (b in 1:B) {
  # Generate bootstrap sample from N(mu_hat, sd_hat^2)
  x_star <- rnorm(n, mean = mu_hat, sd = sd_hat)
  
  # Compute the bootstrap statistic (sample mean)
  boot_means[b] <- mean(x_star)
}

# Step 4: Estimate variance and confidence interval
boot_var <- var(boot_means)
boot_ci <- quantile(boot_means, c(0.025, 0.975))

boot_var
## [1] 0.03011228
boot_ci
##     2.5%    97.5% 
## 4.836943 5.508777

9.4 DIY

DIY: Explore the concept of generalized correlation coefficients. Select one of the approaches and incorporate the bootstrap procedure for inference. Then evaluate it using synthetic data. A starter sketch follows the setup code below.

library(tidyverse)
library(arm)
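
As a starting point, here is a minimal sketch, assuming Spearman's rank correlation is the chosen generalized coefficient, of a bootstrap pivotal interval on synthetic data with a monotone but nonlinear dependence:

set.seed(7)
n <- 300
x <- rnorm(n)
y <- exp(x) + rnorm(n) # monotone, nonlinear dependence
rho.hat <- cor(x, y, method = "spearman")

B <- 5000
rho.boot <- numeric(B)
for (j in 1:B) {
  idx <- sample(n, n, replace = TRUE)
  rho.boot[j] <- cor(x[idx], y[idx], method = "spearman")
}
# 95% bootstrap pivotal confidence interval, as in Section 9.3.1.4
c(2*rho.hat - quantile(rho.boot, 0.975, names = FALSE),
  2*rho.hat - quantile(rho.boot, 0.025, names = FALSE))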