6.10 Bayesian bootstrap regression
We implement the Bayesian bootstrap (Donnald B. Rubin 1981) for linear regression models. In particular, the Bayesian bootstrap simulates the posterior distributions by assuming that the sample cumulative distribution function (CDF) is the population CDF (this assumption is also implicit in the frequentist bootstrap (B. Efron 1979)).
Given yii.i.d.∼F, where F does not specify a particular parametric family of distributions, but instead sets \mathbb{E}(y_i \mid \boldsymbol{x}_i) = \boldsymbol{x}_i^{\top} \boldsymbol{\beta}, with \boldsymbol{x}_i being a K-dimensional vector of regressors and \boldsymbol{\beta} a K-dimensional vector of parameters, the Bayesian bootstrap generates posterior probabilities for each y_i, where the values of \boldsymbol{y} that are not observed have zero posterior probability.
The algorithm to implement the Bayesian bootstrap is the following:
Draw g ∼ Dir(α1, α2, ..., αN) such that αi=1 for all i g=(g1, g2, ..., gN) is the vector of probabilities to attach to (y1,x1), (y2,x2), ..., (yN,xN) for each Bayesian bootstrap replication. Sample (yi,xi) N times with replacement and probabilities gi, i=1,2, ...,N. Estimate β using ordinary least squares in the model E(y|X)=Xβ, y being a S1 dimensional vector, and X a S1 X K matrix from the previous stage. Repeat this process S times. The distribution of β(s) is the Bayesian distribution of β.
Example: Simulation exercise
Let’s perform a simulation exercise to evaluate the performance of the previous Algorithm for inference using the Bayesian bootstrap. The data-generating process is defined by two regressors, each distributed as standard normal. The location vector is \boldsymbol{\beta} = \left[1 \ 1 \ 1\right]^{\top}, with a variance of \sigma^2 = 1, and the sample size is 1,000.
The following Algorithm illustrates how to use our GUI to run the Bayesian bootstrap. Our GUI is based on the bayesboot command from the bayesboot package in R. Exercise 11 asks about using this package to perform inference in this simulation and compares the results with those obtained using our GUI with S = 10000.
Algorithm: Bayesian Bootstrap in Linear Regression
Select Univariate Models on the top panel
Select Bootstrap model using the left radio button
Upload the dataset by first selecting whether there is a header in the file, and the kind of separator in the csv file of the dataset (comma, semicolon, or tab). Then, use the Browse button under the Choose File legend. You should see a preview of the dataset
Select number of bootstrap replications using the Range sliders
Select dependent and independent variables using the Formula builder table
Click the Build formula button to generate the formula in R syntax. You can modify the formula in the Main equation box using valid arguments of the formula command structure in R
Click the Go! button
Analyze results
Download posterior chains and diagnostic plots using the Download Posterior Chains and Download Posterior Graphs buttons
The following R code shows how to program the Bayesian bootstrap from scratch. We observe from the results that all 95% credible intervals encompass the population parameters, and the posterior means are close to the population parameters.
rm(list = ls()); set.seed(010101)
N <- 1000; x1 <- runif(N); x2 <- rnorm(N)
X <- cbind(x1, x2); k <- dim(X)[2]
B <- rep(1, k+1); sig2 <- 1
u <- rnorm(N, 0, sig2); y <- cbind(1, X)%*%B + u
data <- as.data.frame(cbind(y, X))
names(data) <- c("y", "x1", "x2")
Reg <- function(d){
Reg <- lm(y ~ x1 + x2, data = d)
Bhat <- Reg$coef
return(Bhat)
}
S <- 10000; alpha <- 1
BB <- function(S, df, alpha){
Betas <- matrix(NA, S, dim(df)[2])
N <- dim(df)[1]
pb <- winProgressBar(title = "progress bar", min = 0, max = S, width = 300)
for(s in 1:S){
g <- LaplacesDemon::rdirichlet(N, alpha)
ids <- sample(1:N, size = N, replace = TRUE, prob = g)
datas <- df[ids,]
names(datas) <- names(df)
Bs <- Reg(d = datas)
Betas[s, ] <- Bs
setWinProgressBar(pb, s, title=paste( round(s/S*100, 0), "% done"))
}
close(pb)
return(Betas)
}
BBs <- BB(S = S, df = data, alpha = alpha)
summary(coda::mcmc(BBs))
##
## Iterations = 1:10000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## [1,] 0.9182 0.06475 0.0006475 0.0006475
## [2,] 1.1735 0.10948 0.0010948 0.0010948
## [3,] 1.0133 0.03359 0.0003359 0.0003359
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## var1 0.7882 0.8755 0.9178 0.9612 1.045
## var2 0.9560 1.1009 1.1738 1.2467 1.391
## var3 0.9467 0.9902 1.0134 1.0362 1.079