Chapter 9 The Bootstrap
This lecture is based on:
- L. Wasserman. All of Statistics. Springer, 2010.
- G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning with Applications in R. Chapter 5. Springer, 2013.
- B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 2010.
- A.W. van der Vaart. Asymptotic Statistics. Chapter 23. Cambridge University Press, 1998.
- A. DasGupta. Asymptotic Theory of Statistics and Probability. Chapter 29. Springer, 2008.
The Bootstrap belongs to the wider class of resampling methods.
It was introduced in 1979 by Bradley Efron and offers a principled, simulation-based approach to assessing the accuracy of statistical estimates: estimating biases and variances, and constructing appropriate confidence intervals.
It applies to both parametric (when the distribution of the data is known and parameterized by a finite-dimensional parameter) and nonparametric settings.
9.1 Motivating Examples
9.1.1 Confidence interval for the median of a population
Let $X_1, X_2, \ldots, X_n$ be independent identically distributed (iid) random variables from a cumulative distribution function (cdf) $F$, and suppose we are interested in the median $m = F^{-1}(0.5)$ of the distribution. We consider the following estimator
$$\hat{M}_n = \begin{cases} X_{((n+1)/2)} & \text{if } n \text{ is odd}, \\ \tfrac{1}{2}\left(X_{(n/2)} + X_{(n/2+1)}\right) & \text{if } n \text{ is even}, \end{cases}$$
where $X_{(r)}$ is the $r$th order statistic of the random sample $(X_1, \ldots, X_n)$.
Example 1 (Wages). We are interested in the median wage in the mid-Atlantic states of the USA in 2005. We consider a dataset from the R package ISLR, which consists of the wages of $n = 447$ workers in that region in 2005.
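The data-loading code is not shown in the original; a minimal sketch, assuming the Wage data frame from ISLR restricted to the year 2005, is:
library(ISLR)                              # provides the Wage dataset
wage2005 <- Wage$wage[Wage$year == 2005]   # wages of the workers surveyed in 2005
mhat <- median(wage2005)                   # point estimate of the median wage
length(wage2005)                           # sample size n
mhat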
## [1] 447
## [1] 104.9215
The median estimate is just a point estimate of the population median $m$. How do we get a confidence interval for $m$? If $F$ is differentiable with probability density function (pdf) $f$ and $f(m) > 0$, then the asymptotic distribution of $\hat{M}_n$ is given by
$$\sqrt{n}\,(\hat{M}_n - m) \xrightarrow{d} N\!\left(0, \frac{1}{4 f^2(m)}\right) \quad \text{as } n \to \infty.$$
- We cannot evaluate the variance term because the density $f$ is unknown.
- If we could evaluate $\sigma^2 = V(\hat{M}_n)$, then we could use the normal approximation to obtain an approximate $1-\alpha$ confidence interval. The bootstrap allows us to obtain an estimate of the variance $\sigma^2$ of the estimator.
- Alternatively, if we knew the quantiles $q_{\alpha/2}$ and $q_{1-\alpha/2}$ of the distribution of $\hat{M}_n - m$, where $P(\hat{M}_n - m \le q_x) = x$ for $x \in [0,1]$, then we could use these quantiles to build a $1-\alpha$ confidence interval $[\hat{M}_n - q_{1-\alpha/2},\ \hat{M}_n - q_{\alpha/2}]$ for $m$, since $P(q_{\alpha/2} \le \hat{M}_n - m \le q_{1-\alpha/2}) = 1-\alpha$. These quantiles are also in general impossible to obtain analytically. The bootstrap provides estimates of the quantiles $q_{\alpha/2}$ and $q_{1-\alpha/2}$, and thus approximate confidence intervals.
9.1.2 Bias and Variance of an estimator
Let $X_1, \ldots, X_n$ be iid random variables from a pdf $f(x; \theta)$ parametrized by $\theta$. Let $\hat\theta_n = t(X_1, \ldots, X_n)$ be some estimator of $\theta$.
In order to evaluate the accuracy of the estimator, we may be interested in evaluating
- Bias: $E[\hat\theta_n] - \theta$
- Variance: $\mathrm{Var}(\hat\theta_n)$.
The bootstrap is a general strategy for obtaining estimates of these quantities, as well as for constructing confidence intervals.
9.2 Background
9.2.1 Statistical functional
A statistical functional $T(F)$ is any function of the cdf $F$. Examples include
- the mean $E(X) = \int x \, dF(x)$
- the variance $\mathrm{Var}(X) = \int x^2 \, dF(x) - \left(\int x \, dF(x)\right)^2$
- the median $m = F^{-1}(0.5)$.
9.2.2 Empirical distribution function
Let $X_1, \ldots, X_n$ be iid real-valued random variables with cdf $F$. The empirical cumulative distribution function is defined as
$$F_n(x) = \frac{1}{n} \sum_{i=1}^n I(X_i \le x) = \frac{\#\{i : X_i \le x\}}{n}.$$
The empirical cdf has the following attractive theoretical properties:
(i) Unbiasedness: $E[F_n(x)] = F(x)$ for all $x \in \mathbb{R}$.
(ii) Strong consistency: $F_n(x) \xrightarrow{a.s.} F(x)$ as $n \to \infty$.
(iii) Asymptotic normality: $\sqrt{n}\,(F_n(x) - F(x)) \xrightarrow{d} N\big(0, F(x)\{1 - F(x)\}\big)$.
Proof. For any $x \in \mathbb{R}$, the binary random variables $I(X_i \le x)$, $i = 1, \ldots, n$, are independent and identically Bernoulli distributed with success probability $F(x)$. Hence the discrete random variable $n F_n(x) \sim \mathrm{Binomial}(n, F(x))$ and (i) follows. (ii) follows from the strong law of large numbers and (iii) from the central limit theorem (CLT).
ecdf.wages = ecdf(wage2005)
plot(ecdf.wages, xlab = 'x', ylab = 'Fn(x)', main = 'Empirical cdf', col='lightblue2')
9.2.3 Monte Carlo Integration
The Monte Carlo method is a way to approximate potentially high-dimensional integrals via simulation.
Let $Y \in \mathbb{R}^p$ be a continuous or discrete random variable with cdf $G$. Consider
$$\eta = E(\phi(Y)) = \int_{\mathbb{R}^p} \phi(y) \, dG(y),$$
where $\phi : \mathbb{R}^p \to \mathbb{R}$. Let $Y^{(1)}, \ldots, Y^{(B)}$ be iid random variables with cdf $G$. Then
$$\hat\eta_B = \frac{1}{B} \sum_{j=1}^B \phi(Y^{(j)})$$
is called the Monte Carlo estimator of the expectation $\eta$.
The Monte Carlo algorithm, which only requires the ability to simulate from $G$, is as follows:
Algorithm 1: Monte Carlo Algorithm
Simulate independent $Y^{(1)}, \ldots, Y^{(B)}$ with cdf $G$
Return $\hat\eta_B = \frac{1}{B} \sum_{j=1}^B \phi(Y^{(j)})$.
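As a quick illustration (not part of the original notes), here is a Monte Carlo estimate of $\eta = E(Y^2)$ for $Y \sim N(0,1)$, whose true value is 1:
B <- 100000
Y <- rnorm(B)         # simulate Y^(1), ..., Y^(B) from G = N(0, 1)
eta.hat <- mean(Y^2)  # Monte Carlo estimator of E(Y^2)
eta.hat               # close to 1 for large B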
Here are the properties of the Monte Carlo estimator, which follow from the law of large numbers and the central limit theorem.
Proposition 5. The Monte Carlo estimator is
(i) Unbiased: $E(\hat\eta_B) = \eta$ for any $B \ge 1$.
(ii) (Strongly) consistent: $\hat\eta_B \xrightarrow{a.s.} \eta$ as $B \to \infty$.
(iii) Asymptotically normal: if $\sigma^2 = V(\phi(Y))$ exists, then $\sqrt{B}\,(\hat\eta_B - \eta) \xrightarrow{d} N(0, \sigma^2)$ as $B \to \infty$.
For example, the Monte Carlo estimators of the mean and variance of $Y$ are:
$$\hat\mu_B = \frac{1}{B} \sum_{j=1}^B Y^{(j)} \xrightarrow{a.s.} E(Y),$$
$$\hat\sigma^2_B = \frac{1}{B} \sum_{j=1}^B \left(Y^{(j)} - \hat\mu_B\right)^2 = \frac{1}{B} \sum_{j=1}^B \left(Y^{(j)}\right)^2 - \left(\frac{1}{B} \sum_{j=1}^B Y^{(j)}\right)^2 \xrightarrow{a.s.} V(Y).$$
9.3 Bootstrap
Let $X_1, \ldots, X_n$ be iid random variables with cdf $F$. The distribution $F$ of the data is unknown, and we cannot even simulate from it. Consider the estimator $\hat\theta_n = t(X_1, \ldots, X_n)$ for estimating $\theta = T(F)$, a statistical functional of $F$. This setup is represented below:
Real World: $F \Rightarrow X_1, \ldots, X_n \Rightarrow \hat\theta_n = t(X_1, \ldots, X_n)$
9.3.1 Variance Estimation
We are first interested in estimating the variance $\sigma^2 = V_F(\hat\theta_n)$ of the estimator $\hat\theta_n$. We write $V_F$ to emphasize the fact that the variance depends on the unknown cdf $F$.
9.3.1.1 An idealized variance estimator
Suppose first that we could simulate from the cdf $F$, that is, we could reproduce the Real World and generate new datasets. Then we could use the Monte Carlo method to obtain a Monte Carlo estimator for $\sigma^2$ as follows.
- For $j = 1, \ldots, B$:
- Let $X_1^{(j)}, \ldots, X_n^{(j)}$ be iid from $F$ and $\hat\theta_n^{(j)} = t(X_1^{(j)}, \ldots, X_n^{(j)})$
- The Monte Carlo variance estimator is
$$\hat\sigma^2_{n,B} = \frac{1}{B} \sum_{j=1}^B \left(\hat\theta_n^{(j)} - \frac{1}{B} \sum_{k=1}^B \hat\theta_n^{(k)}\right)^2.$$
However, the cdf F is unknown and we cannot simulate from it, so we cannot implement this estimator.
9.3.1.2 The bootstrap for variance estimation
The idea of the bootstrap is to:
- Replace the unknown cdf F by its (known) empirical cdf Fn
- Use the Monte Carlo method.
More formally, conditionally on $X_1, \ldots, X_n$, let $X_1^*, \ldots, X_n^*$ be iid random variables with cdf $F_n$ and $\hat\theta_n^* = t(X_1^*, \ldots, X_n^*)$. The bootstrap mimics the Real World by using the empirical cdf $F_n$:
- Real World: $F \Rightarrow X_1, \ldots, X_n \Rightarrow \hat\theta_n = t(X_1, \ldots, X_n)$
- Bootstrap World: $F_n \Rightarrow X_1^*, \ldots, X_n^* \Rightarrow \hat\theta_n^* = t(X_1^*, \ldots, X_n^*)$
The bootstrap relies on two approximations:
- Approximate $V_F(\hat\theta_n)$ by $V_{F_n}(\hat\theta_n^*)$ using $F_n$
- Approximate $V_{F_n}(\hat\theta_n^*)$ by Monte Carlo to obtain the bootstrap estimate $\hat\sigma^2_{n,B}$
In a nutshell:
$$V_F(\hat\theta_n) \overset{\text{Empirical cdf}}{\simeq} V_{F_n}(\hat\theta_n^*) \overset{\text{Monte Carlo}}{\simeq} \hat\sigma^2_{n,B}$$
Simulating $X_1^*, \ldots, X_n^*$ from $F_n$ is easy: $F_n$ puts mass $1/n$ at each of the values $X_1, \ldots, X_n$.
The bootstrap samples $X_1^*, \ldots, X_n^*$ are thus discrete random variables, and we can simulate from $F_n$ by sampling with replacement from the original dataset $\{X_1, \ldots, X_n\}$.
Algorithm 2: Bootstrap algorithm for estimating the variance
Let $X_1, \ldots, X_n$ be some data and $\hat\theta_n = t(X_1, \ldots, X_n)$.
- For $j = 1, \ldots, B$:
- Simulate $X_1^{*(j)}, \ldots, X_n^{*(j)} \overset{iid}{\sim} F_n$ by sampling with replacement from $\{X_1, \ldots, X_n\}$
- Evaluate $\hat\theta_n^{*(j)} = t(X_1^{*(j)}, \ldots, X_n^{*(j)})$
- Return the bootstrap variance estimate
$$\hat\sigma^2_{n,B} = \frac{1}{B} \sum_{j=1}^B \left(\hat\theta_n^{*(j)} - \frac{1}{B} \sum_{k=1}^B \hat\theta_n^{*(k)}\right)^2$$
9.3.1.3 Example
We now use the bootstrap to estimate the variance $V_F(\hat{M}_n)$ of the median estimator $\hat{M}_n$.
bootstrap_variance_median <- function(X, B) {
  # Bootstrap variance estimate for the median estimator
  # X: data vector
  # B: number of bootstrap (Monte Carlo) samples
  n <- length(X)
  mhat.boot <- numeric(B)
  for (j in 1:B) {
    X.boot <- sample(X, n, replace = TRUE)  # Sample bootstrap data from Fn
    mhat.boot[j] <- median(X.boot)          # Median of the bootstrap sample
  }
  # var() divides by B - 1 rather than B; the difference is negligible for large B
  var.boot <- var(mhat.boot)                # Bootstrap variance estimate
  return(list(var.boot = var.boot, mhat.boot = mhat.boot))
}
We start with a synthetic dataset of size $n = 500$, where $X_1, \ldots, X_n$ are normally distributed with mean 0.4 and variance 1.
n <- 500
X <- rnorm(n, mean = 0.4)  # Synthetic data from N(0.4, 1)
# Median estimate
mhat.gauss <- median(X)
We now estimate the variance of the median estimator using the bootstrap.
B <- 10000
results.gauss = bootstrap_variance_median(X,B)
mhat.boot.gauss = results.gauss$mhat.boot
hist(mhat.boot.gauss, xlab='mhat.boot', main='',col='orange', breaks=20)
points(mhat.gauss,0,pch = 22,col='red',bg='red')
This is the histogram of the $B = 10000$ bootstrap samples $\hat{M}_n^{*(1)}, \ldots, \hat{M}_n^{*(B)}$ of the median estimator; the red square is $\hat{M}_n$ for the Gaussian synthetic dataset.
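The bootstrap variance estimate is then extracted (this step is not shown in the original; var.boot.gauss is our own name):
var.boot.gauss <- results.gauss$var.boot
var.boot.gauss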
## [1] 0.002706442
DIY: Compare it with the theoretical asymptotic variance $\frac{1}{4 n f^2(m)}$ from Section 9.1.1. Because the data are synthetic, we know the true distribution $F$, so we can evaluate how accurate the bootstrap variance estimate is.
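A minimal sketch of this comparison, using the fact that here $m = 0.4$ and $f$ is the $N(0.4, 1)$ density:
f.at.m <- dnorm(0.4, mean = 0.4, sd = 1)  # true density evaluated at the median
var.theory <- 1 / (4 * n * f.at.m^2)      # asymptotic variance of the median estimator
var.theory                                # roughly 0.00314, close to the bootstrap estimate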
We now consider the wages dataset and use the bootstrap estimator to estimate the variance of the sample median.
B <- 10000
results.wages <- bootstrap_variance_median(wage2005,B)
mhat.boot.wages <- results.wages$mhat.boot
hist(mhat.boot.wages,xlab='mhat.boot', main='',col='orange', breaks=20)
points(mhat,0,pch = 22,col='red',bg='red')
This is the histogram of the $B = 10000$ bootstrap samples $\hat{M}_n^{*(1)}, \ldots, \hat{M}_n^{*(B)}$ of the median estimator; the red square is $\hat{M}_n$ for the wages dataset.
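We then extract the bootstrap variance estimate (this step is implicit in the original; the name var.boot.wages is reused by the confidence-interval code below):
var.boot.wages <- results.wages$var.boot
var.boot.wages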
## [1] 5.543813
9.3.1.4 Confidence Intervals
Normal Approximation
If the central limit theorem holds for the estimator $\hat\theta_n$ (iid data, large $n$), the following is satisfied:
$$\frac{\hat\theta_n - \theta}{\sqrt{V_F(\hat\theta_n)}} \xrightarrow{d} N(0, 1).$$
We can now approximate $V_F(\hat\theta_n)$ by $\hat\sigma^2_{n,B}$. The $(1-\alpha) \times 100\%$ bootstrap confidence interval is obtained from
$$P\!\left(z_{\alpha/2} < \frac{\hat\theta_n - \theta}{\hat\sigma_{n,B}} < z_{1-\alpha/2}\right) = 1 - \alpha.$$
Thus, the confidence interval can be expressed as
$$\left(\hat\theta_n - z_{1-\alpha/2}\,\hat\sigma_{n,B},\ \hat\theta_n + z_{1-\alpha/2}\,\hat\sigma_{n,B}\right).$$
Now, we can construct the 95% confidence interval for the wages dataset.
alpha = 0.05
cm = qnorm(1-alpha/2) # Confidence multiplier
se = sqrt(var.boot.wages) # standard error of the statistic
me = cm * se # margin of error
ci.bootnormal.wages = c(mhat - me,mhat + me)
ci.bootnormal.wages
## [1] 100.3067 109.5363
Bootstrap Pivotal Confidence Interval
- A pivotal quantity is a statistic whose distribution does not depend (or depends only minimally) on unknown parameters.
- Consider the random variable (pivot) $R_n = \hat\theta_n - \theta$.
- Let $H_n(r) = P(R_n \le r)$ be the (unknown) cdf of $R_n$.
- $H_n^{-1}$ is the associated quantile function: $q_\alpha := H_n^{-1}(\alpha) = \inf\{x : H_n(x) \ge \alpha\}$.
- To construct a $1-\alpha$ confidence interval, we can consider
$$1 - \alpha = P(q_{\alpha/2} \le \hat\theta_n - \theta \le q_{1-\alpha/2}) = P(\hat\theta_n - q_{1-\alpha/2} \le \theta \le \hat\theta_n - q_{\alpha/2}).$$
- However, we do not know these quantiles.
- Here, we can use the bootstrap to obtain an approximation of these quantiles.
- Bootstrap approximation: $R_n = \hat\theta_n - \theta \simeq R_n^* = \hat\theta_n^* - \hat\theta_n$.
- Consider the distribution $H_n^*(r) = P_{F_n}(R_n^* \le r)$.
- This can be approximated by Monte Carlo methods to obtain the bootstrap estimate
$$\hat{H}^*_{n,B}(r) = \frac{1}{B} \sum_{j=1}^B I\left(R_n^{*(j)} \le r\right), \quad \text{where } R_n^{*(j)} = \hat\theta_n^{*(j)} - \hat\theta_n.$$
- In a nutshell:
$$H_n \overset{\text{Empirical cdf}}{\simeq} H_n^* \overset{\text{Monte Carlo}}{\simeq} \hat{H}^*_{n,B}$$
We can obtain the quantiles based on these approximations, $\hat{q}^*_\alpha = \hat{H}^{*\,-1}_{n,B}(\alpha)$, and build the bootstrap pivotal $1-\alpha$ confidence interval
$$C_n^* = \left[\hat\theta_n - \hat{q}^*_{1-\alpha/2},\ \hat\theta_n - \hat{q}^*_{\alpha/2}\right].$$
Since $R_n^{*(j)} = \hat\theta_n^{*(j)} - \hat\theta_n$, we have $\hat{q}^*_\alpha = \hat{q}^{\theta^*}_\alpha - \hat\theta_n$, where $\hat{q}^{\theta^*}_\alpha$ is the $\alpha$ quantile of the bootstrap samples $(\hat\theta_n^{*(1)}, \ldots, \hat\theta_n^{*(B)})$.
Thus, in terms of $(\hat\theta_n^{*(1)}, \ldots, \hat\theta_n^{*(B)})$, the $1-\alpha$ bootstrap pivotal confidence interval is given by
$$C_n^* = \left[2\hat\theta_n - \hat{q}^{\theta^*}_{1-\alpha/2},\ 2\hat\theta_n - \hat{q}^{\theta^*}_{\alpha/2}\right].$$
We now construct the 95% bootstrap pivotal confidence interval for the wages dataset.
ci.bootpivot.wages = c(2*mhat-quantile(mhat.boot.wages, 1-alpha/2,names=FALSE) ,
2*mhat-quantile(mhat.boot.wages,alpha/2,names=FALSE))
print(ci.bootpivot.wages)
## [1] 100.0090 109.0787
9.3.2 Bias Estimation
The bias of an estimator $\hat\theta_n$ is $b := E_F(\hat\theta_n) - \theta$.
It is difficult to evaluate because it involves the unknown true parameter $\theta$. However, it can be estimated by the bootstrap using
$$\hat{b}_n = E_{F_n}(\hat\theta_n^*) - \hat\theta_n.$$
- Using the bootstrap samples $(\hat\theta_n^{*(1)}, \ldots, \hat\theta_n^{*(B)})$ to approximate $E_{F_n}(\hat\theta_n^*)$, the bootstrap bias estimator is
$$\frac{1}{B} \sum_{j=1}^B \hat\theta_n^{*(j)} - \hat\theta_n.$$
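For instance, the bias of the median estimator for the wages dataset can be estimated by reusing the bootstrap samples computed above (a minimal sketch; bias.boot is our own name):
bias.boot <- mean(mhat.boot.wages) - mhat  # bootstrap bias estimate of the median
bias.boot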
9.3.3 Bootstrap for regression
Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be paired iid random variables assumed to be drawn from
$$Y_i = g(X_i, \theta) + \epsilon_i,$$
where $\epsilon_1, \ldots, \epsilon_n$ are iid from some unknown cdf $F$ and $\theta \in \mathbb{R}^p$. Here $g$ is some known function, for example $g(X_i, \theta) = X_i \theta$. Let $\hat\theta_n = t((X_1, Y_1), \ldots, (X_n, Y_n))$ be some estimator of $\theta$, for example the least squares estimator.
For $i = 1, \ldots, n$, let $\hat\epsilon_i = Y_i - g(X_i, \hat\theta_n)$ be the fitted residuals. Two bootstrap strategies are possible:
- Resample the data (X1,Y1),…,(Xn,Yn)
- Resample the fitted residuals (ˆϵ1,…,ˆϵn)
The first is the standard nonparametric bootstrap approach. The second is a semiparametric bootstrap, which exploits the fitted mean function $g(X_i, \hat\theta_n)$ of the generating mechanism.
Nonparametric Paired Bootstrap
- For $j = 1, \ldots, B$:
- Simulate $\left((X_1^{*(j)}, Y_1^{*(j)}), \ldots, (X_n^{*(j)}, Y_n^{*(j)})\right)$ by sampling with replacement from $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$
- Evaluate $\hat\theta_n^{*(j)} = t\left((X_1^{*(j)}, Y_1^{*(j)}), \ldots, (X_n^{*(j)}, Y_n^{*(j)})\right)$
- Return the bootstrap estimate (see the sketch after this list). For example, for $\theta \in \mathbb{R}$, the variance estimate is
$$\hat\sigma^2_{n,B} = \frac{1}{B} \sum_{j=1}^B \left(\hat\theta_n^{*(j)} - \frac{1}{B} \sum_{k=1}^B \hat\theta_n^{*(k)}\right)^2$$
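As an illustration (a minimal sketch, not from the original notes; all names and the synthetic data are our own), the paired bootstrap for the slope of a simple linear model $g(X_i, \theta) = X_i \theta$ fitted by least squares:
set.seed(1)
n <- 100
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.5)  # synthetic data with true slope 2
B <- 1000
theta.boot <- numeric(B)
for (j in 1:B) {
  idx <- sample(1:n, n, replace = TRUE)           # resample the pairs (X_i, Y_i)
  theta.boot[j] <- coef(lm(y[idx] ~ x[idx] - 1))  # re-estimate the slope
}
var(theta.boot)  # paired-bootstrap variance estimate of the slope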
Semiparametric Residual Bootstrap
- For $j = 1, \ldots, B$:
- Simulate $(\hat\epsilon_1^{*(j)}, \ldots, \hat\epsilon_n^{*(j)})$ by sampling with replacement from $(\hat\epsilon_1, \ldots, \hat\epsilon_n)$
- For $i = 1, \ldots, n$, set $Y_i^{*(j)} = g(X_i, \hat\theta_n) + \hat\epsilon_i^{*(j)}$
- Evaluate $\hat\theta_n^{*(j)} = t\left((X_1, Y_1^{*(j)}), \ldots, (X_n, Y_n^{*(j)})\right)$
- Return the bootstrap estimate (a sketch follows this list). For example, for $\theta \in \mathbb{R}$, the variance estimate is
$$\hat\sigma^2_{n,B} = \frac{1}{B} \sum_{j=1}^B \left(\hat\theta_n^{*(j)} - \frac{1}{B} \sum_{k=1}^B \hat\theta_n^{*(k)}\right)^2$$
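A corresponding residual-bootstrap sketch for the same synthetic linear model as above (again, all names are our own assumptions; it reuses x, y, n, and B from the previous sketch):
fit <- lm(y ~ x - 1)       # least squares fit of g(x, theta) = x * theta
theta.hat <- coef(fit)
res <- residuals(fit)      # fitted residuals
theta.boot.res <- numeric(B)
for (j in 1:B) {
  res.star <- sample(res, n, replace = TRUE)     # resample the residuals
  y.star <- theta.hat * x + res.star             # regenerate the responses
  theta.boot.res[j] <- coef(lm(y.star ~ x - 1))  # re-estimate the slope
}
var(theta.boot.res)        # residual-bootstrap variance estimate of the slope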
9.3.4 Parametric Bootstrap
The methods seen so far are nonparametric, in the sense that no assumption has been made on the cumulative distribution function (cdf) F.
In the parametric setting, we assume that $X_1, \ldots, X_n$ are iid random variables with cdf $F_\theta$, where $\theta$ is an unknown parameter of the model we wish to estimate.
Let $\hat\theta_n = t(X_1, \ldots, X_n)$ be some estimator of $\theta$, for example, a maximum likelihood estimator (MLE).
The parametric bootstrap proceeds similarly to the nonparametric bootstrap described above, except that it uses the fitted cdf $F_{\hat\theta_n}$ instead of the empirical cdf $F_n$ to obtain the bootstrap samples.
If the parametric model is correctly specified, this approach can lead to superior performance compared to the nonparametric bootstrap.
9.3.4.1 Normal Model
In the parametric bootstrap, we assume the data follow a known distribution family with unknown parameters. Here we assume
$$X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$$
We want to estimate the variance of the sample mean $\bar{X}$ using the parametric bootstrap.
Step-by-Step Procedure
- Estimate $\hat\mu$ and $\hat\sigma$ from the observed data.
- Generate bootstrap samples from $N(\hat\mu, \hat\sigma^2)$.
- Compute the sample mean for each bootstrap sample.
- Estimate the variance of the bootstrap means.
set.seed(123)
# Step 1: Simulate observed data
n <- 100
true_mu <- 5
true_sd <- 2
x <- rnorm(n, mean = true_mu, sd = true_sd)
# Step 2: Estimate parameters from observed data
mu_hat <- mean(x)
sd_hat <- sd(x)
# Step 3: Parametric bootstrap
B <- 1000
boot_means <- numeric(B)
for (b in 1:B) {
# Generate bootstrap sample from N(mu_hat, sd_hat^2)
x_star <- rnorm(n, mean = mu_hat, sd = sd_hat)
# Compute the bootstrap statistic (sample mean)
boot_means[b] <- mean(x_star)
}
# Step 4: Estimate variance and confidence interval
boot_var <- var(boot_means)
boot_ci <- quantile(boot_means, c(0.025, 0.975))
boot_var
## [1] 0.03011228
boot_ci
## 2.5% 97.5%
## 4.836943 5.508777