Chapter 3 Interval Estimation via Simulation
3.1 Performing the Bootstrap
Bootstrapping is a computationally-intense, simulation approach, designed to quantify variability associated with a statistic when generalizing from a sample to a large population.
To write code for a bootstrap, follow these steps
- Create vectors in which to store the bootstrap results. Create a vector for each statistic of interest.
- Write a for-loop in which you will perform a procedure many times (10000 simulations is usually appropriate in this class) within the loop perform steps 3 and 4.
- Take a sample the same size as the original data, using replacement.
- Calculate whatever statistics you are interested in for the bootstrap sample. This might involve calculating means, medians, standard deviations, or other statistics, or fitting a model to obtain regression coefficients.
- Combine results for each of the qualities you bootstrapped into a dataframe.
3.1.1 Setting a Seed
Because bootstrapping involves taking a random sample, results will come out slightly differentely each time you run your code. Doing a large number of simulations ensures that the average results from each simulation will be similar, but they will not be identical.
Thus, each time you knit your document, you’ll get slightly different results, which can be a problem when summarizing your results. To avoid this, you can set a random seed, which tells R to start its random procedure in the same place each time. This is done using the set.seed()
function. I typically use the date as my seed, but you can use any number you want.
set.seed(09262020)
3.1.2 Template for Bootstrap
<- rep(NA, 10000) # create vector to store bootstrap results for first statistic
statistic1 <- rep(NA, 10000) # create vector to store bootstrap results for second statistic
statistic2 <- rep(NA, 10000) # create vector to store bootstrap results for third statistic
statistic3 # continue if there are more than 3 quantities of interest
# for loop
for (i in 1:10000){
<- sample_n(DatasetName, samplesize, replace="TRUE")
BootstrapSample <- #calculate first statistic on bootstrap sample
statistic1[i] <- #calculate second statistic on bootstrap sample
statistic2[i] <- #calculate third statistic on bootstrap sample
statistic3[i]
}
<- data.frame(statistic1, statistic2, statistic3) Bootstrap_Results
3.1.3 Bootstrap for Difference in Proportion of M&M
Bootstrap distribution for proportion of blue M&M’s based on sample of 14.
<- c(rep("Blue", 6), rep("Orange", 3), rep("Green", 2), rep("Yellow", 2),
Color rep("Red", 1))
<- data.frame(Color)
Sample
<- rep(NA, 10000)
Proportion_Blue for (i in 1:10000){
<- sample_n(Sample, 14, replace=TRUE)
BootstrapSample <- sum(BootstrapSample$Color=="Blue")/14
Proportion_Blue[i]
}<- data.frame(Proportion_Blue) Candy_Bootstrap_Results
3.1.4 Bootstrap for Mean, Median, St.Dev., Prop>1
Bootstrap mean, median, standard deviation, mercury levels in Florida lakes, and percentage of lakes with mercury level exceeding 1 ppm.
<- rep(NA, 10000)
MeanHg <- rep(NA, 10000)
StDevHg <- rep(NA, 10000)
PropOver1 <- rep(NA, 10000)
MedianHg
for (i in 1:10000){
<- sample_n(FloridaLakes, 53, replace=TRUE)
BootstrapSample <- mean(BootstrapSample$AvgMercury)
MeanHg[i] <- sd(BootstrapSample$AvgMercury)
StDevHg[i] <- mean(BootstrapSample$AvgMercury>1)
PropOver1[i] <- median(BootstrapSample$AvgMercury)
MedianHg[i]
}<- data.frame(MeanHg, MedianHg, PropOver1, StDevHg) Lakes_Bootstrap_Results
3.1.5 Bootstrap for Difference Between Groups
Consider the difference between average mercury levels in northern and southern Florida lakes. First, we’ll add the location variable to the dataset.
library(Lock5Data)
data(FloridaLakes)
#Location relative to rt. 50
$Location <- as.factor(c("S","S","N","S","S","N","N","N","N","N","N",
FloridaLakes"S","N","S","N","N","N","N","S","S","N","S",
"N","S","N","S","N","S","N","N","N","N","N",
"N","S","N","N","S","S","N","N","N","N","S",
"N","S","S","S","S","N","N","N","N"))
head(FloridaLakes %>% select(Lake, AvgMercury, Location))
## # A tibble: 6 x 3
## Lake AvgMercury Location
## <chr> <dbl> <fct>
## 1 Alligator 1.23 S
## 2 Annie 1.33 S
## 3 Apopka 0.04 N
## 4 Blue Cypress 0.44 S
## 5 Brick 1.2 S
## 6 Bryant 0.27 N
When performing the bootstrap, we sample the same number of observations from each category as were present in the original data. In this example, there were 33 lakes in Northern Florida, and 20 in Southern Florida. We’ll fit a model with the categorical explanatory variable, location, and recall that \(b_1\) represents the average difference between northern and southern lakes. We’ll bootstrap for the distribution of \(b_1\).
<- rep(NA, 10000) #vector to store b1 values
b1 for (i in 1:10000){
# sample 33 northern lakes
<- sample_n(FloridaLakes %>% filter(Location=="N"), 33, replace=TRUE)
NLakes # sample 20 southern lakes
<- sample_n(FloridaLakes %>% filter(Location=="S"), 20, replace=TRUE)
SLakes <- rbind(NLakes, SLakes) # combine Northern and Southern Lakes
BootstrapSample <- lm(data=BootstrapSample, AvgMercury ~ Location) # fit linear model
M # record b1 (note indexing in coefficents starts at 1, so b0=coef(M)[1], and b1=coef(M)[2], etc.)
<- coef(M)[2]
b1[i]
}<- data.frame(b1) #save results as dataframe NS_Lakes_Bootstrap_Results
3.1.6 Bootstrap for Regression Coefficients
We’ll bootstrap for coefficients in the bears interaction model.
<- rep(NA, 10000)
b0 <- rep(NA, 10000)
b1 <- rep(NA, 10000)
b2 <- rep(NA, 10000)
b3 for (i in 1:10000){
<- sample_n(Bears_Subset, 97, replace=TRUE) #take bootstrap sample
BootstrapSample <- lm(data=BootstrapSample, Weight ~ Age*Sex) # fit linear model
M #(note indexing in coefficents starts at 1, so b0=coef(M)[1], and b1=coef(M)[2], etc.)
<- coef(M)[1] # record b0
b0[i] <- coef(M)[2] # record b1
b1[i] <- coef(M)[3] # record b2
b2[i] <- coef(M)[4] # record b3
b3[i]
}<- data.frame(b0, b1, b2, b3) Bears_Bootstrap_Results
3.2 Bootstrap Confidence Intervals
3.2.1 Visualizing the Bootstrap Distribution
Template for creating a histogram of the bootstrap distribution:
<- ggplot(data=Bootstrap_Results, aes(x=statistic)) +
Boot_Dist_Plot geom_histogram(color="colorchoice", fill="colorchoice") +
xlab("Statistic") + ylab("Frequency") +
ggtitle("Bootstrap Distribution for ...")
Boot_Dist_Plot
3.2.2 Percentile Confidence Intervals
The quantile commands can be used to find the middle 95% of the bootstrap distribution. (or whatever percentage is desired).
.025 <- quantile(Bootstrap_Results$statistic, 0.025)
q.975 <- quantile(Bootstrap_Results$statistic, 0.975)
qc(q.025, q.975)
We can add the cutoff points of the percentile confidence interval to the plot, as shown.
+ geom_vline(xintercept=c(Estimate - 2*SE, Estimate + 2*SE), color="red") Boot_Dist_Plot
3.2.3 Standard Error Confidence Intervals
When the bootstrap distribution is symmetric and bell-shaped, the 2-standard error method also provides a reasonable confidence interval.
Calculate standard error:
<- sd(Bootstrap_Results$statistic) SE
Calculate estimate from original sample
<- #calculate statistic of interest from original dataset Estimate
Add and subtract 2 standard errors
c(Estimate - 2*SE, Estimate + 2*SE)
We add the cutoff points for the standard error confidence interval
+ geom_vline(xintercept = c(Estimate - 2*SE, Estimate + 2*SE), color="red") Boot_Dist_Plot
3.2.4 M&M Proportions
<- ggplot(data=Candy_Bootstrap_Results, aes(x=Proportion_Blue)) + geom_histogram(fill="blue", color="white")
Boot_Dist_Plot Boot_Dist_Plot
We calculate the 0.025 and 0.975 quantiles of the bootstrap distribution.
.025 <- quantile(Candy_Bootstrap_Results$Proportion_Blue, 0.025)
q.975 <- quantile(Candy_Bootstrap_Results$Proportion_Blue, 0.975)
qc(q.025, q.975)
## 2.5% 97.5%
## 0.2142857 0.7142857
Plot of bootstrap distribution with percentile confidence interval endpoints.
+ geom_vline(xintercept=c(q.025, q.975), color="red") Boot_Dist_Plot
We’ll calculate the standard deviation of the bootstrap distribution, also known as the bootstrap standard error.
<- sd(Candy_Bootstrap_Results$Proportion_Blue)
SE SE
## [1] 0.1318719
A 95% standard error confidence interval is given by:
<- 6/14 Estimate
c(Estimate - 2*SE, Estimate + 2*SE)
## [1] -0.03453962 0.11453962
+ geom_vline(xintercept=c(Estimate - 2*SE, Estimate + 2*SE), color="red") Boot_Dist_Plot
3.2.5 Bootstrap Distribution for Mean Mercury Level in Florida Lakes
Graph the bootstrap distribution:
<- ggplot(data=Lakes_Bootstrap_Results, aes(x=MeanHg)) +
Boot_Dist_Plot geom_histogram(color="white", fill="blue") +
xlab("Mean Mercury Level in Bootstrap Samples") + ylab("Frequency") +
ggtitle("Bootstrap Distribution for Mean Mercury Level")
Boot_Dist_Plot
Calculate 0.025 and 0.975 quantiles:
.025 <- quantile(Lakes_Bootstrap_Results$MeanHg, 0.025)
q.975 <- quantile(Lakes_Bootstrap_Results$MeanHg, 0.975)
qc(q.025, q.975)
## 2.5% 97.5%
## 0.4379198 0.6213255
Add cutoff points of interval to the plot:
+ geom_vline(xintercept=c(q.025, q.975), color="red") Boot_Dist_Plot
Calculate standard error:
<- sd(Lakes_Bootstrap_Results$MeanHg)
SE SE
## [1] 0.0467542
Calculate estimate from original sample
<- mean(FloridaLakes$AvgMercury) Estimate
Add and subtract 2 standard errors
c(Estimate - 2*SE, Estimate + 2*SE)
## [1] 0.4336614 0.6206782
Plot with endpoints of SE Confidence interval
+ geom_vline(xintercept = c(Estimate - 2*SE, Estimate + 2*SE),
Boot_Dist_Plot color="red")
3.2.6 Bootstrap Distribution for Difference in Mean Mercury Level in Florida Lakes between N and S
Graph the bootstrap distribution:
<- ggplot(data=NS_Lakes_Bootstrap_Results, aes(x=b1)) +
Boot_Dist_Plot geom_histogram(color="white", fill="blue") +
xlab("Mean Difference in Mercury Level in Bootstrap Samples") + ylab("Frequency") +
ggtitle("Bootstrap Distribution for Difference in Mean Mercury Level for N vs S")
Boot_Dist_Plot
Calculate 0.025 and 0.975 quantiles:
.025 <- quantile(NS_Lakes_Bootstrap_Results$b1, 0.025)
q.975 <- quantile(NS_Lakes_Bootstrap_Results$b1, 0.975)
qc(q.025, q.975)
## 2.5% 97.5%
## 0.08751061 0.45354697
Add cutoff points of interval to the plot:
+ geom_vline(xintercept=c(q.025, q.975), color="red") Boot_Dist_Plot
Calculate standard error:
<- sd(NS_Lakes_Bootstrap_Results$b1)
SE SE
## [1] 0.09378214
Calculate estimate from original sample:
<- lm(data=FloridaLakes, AvgMercury~Location)
M <- M$coef[2] #record b1 from actual sample Estimate
Add and subtract 2 standard errors:
c(Estimate - 2*SE, Estimate + 2*SE)
## LocationS LocationS
## 0.08439026 0.45951883
+ geom_vline(xintercept=c(Estimate - 2*SE, Estimate + 2*SE), color="red") Boot_Dist_Plot
3.2.7 Bootstrap Distribution for Coefficients in Bears Multiple Regression Model
We’ll look at the bootstrap distribution for \((b_0 + b_2) + 24(b_1+b_3)\) which represents the expected weight of a 24 month old female bear.
<- ggplot(data=Bears_Bootstrap_Results,
Boot_Dist_Plot aes(x=b0+b2+24*(b1+b3))) +
geom_histogram(color="white", fill="blue") +
xlab("Mean Expected Weight in Bootstrap Samples") +
ylab("Frequency") +
ggtitle("Bootstrap Distribution for Expected Weight of 24 Month old Female Bear")
Boot_Dist_Plot
Calculate 0.025 and 0.975 quantiles:
.025 <- quantile(Bears_Bootstrap_Results$b0 + Bears_Bootstrap_Results$b2 +
q24*(Bears_Bootstrap_Results$b1 + Bears_Bootstrap_Results$b3), 0.025)
.975 <- quantile(Bears_Bootstrap_Results$b0 + Bears_Bootstrap_Results$b2 +
q24*(Bears_Bootstrap_Results$b1 + Bears_Bootstrap_Results$b3), 0.975)
c(q.025, q.975)
## 2.5% 97.5%
## 75.42443 113.49563
Add cutoff points of interval to the plot:
+ geom_vline(xintercept=c(q.025, q.975), color="red") Boot_Dist_Plot
Calculate standard error:
<- sd(Bears_Bootstrap_Results$b0 + Bears_Bootstrap_Results$b2 +
SE 24*(Bears_Bootstrap_Results$b1 + Bears_Bootstrap_Results$b3))
SE
## [1] 9.7006
Calculate estimate from original sample:
<- lm(data=Bears_Subset, Weight~Age*Sex)
M <- M$coef[1] + M$coef[3] + 24*(M$coef[2]+M$coef[4]) Estimate
Add and subtract 2 standard errors:
c(Estimate - 2*SE, Estimate + 2*SE)
## (Intercept) (Intercept)
## 71.94873 110.75113
+ geom_vline(xintercept=c(Estimate - 2*SE, Estimate + 2*SE),
Boot_Dist_Plot color="red")
Note that the standard error method is not appropriate here, due to the right-skeweness in the bootstrap distribution.
3.3 Checking for Gaps and Clusters
If there are gaps and/or clusters in the bootstrap distribution, it is inappropriate to create any kind of bootstrap confidence intervals. This is sometimes hard to tell from a histogram, as the choice of binwidth might lead to gaps that aren’t really there in the bootstrap distribution. We can use jitter plots to look for gaps or clusters in the bootstrap distribution.
To do this, create a jitterplot, with the statistic from the bootstrap distribution on the y-axis. Simply say x=1
, so there is an x-variable. The geom_jitter()
command will spread the points out so they are not all on top of one another, and we can see them.
3.3.1 Example: When the Bootstrap is Inappropriate
ggplot(data=Lakes_Bootstrap_Results, aes(y=MedianHg, x=1)) + geom_jitter()