Chapter 21 Bayesian Models

This chapter draws mainly on Derek S. Young (Young 2017) and Michael H. Kutner et al. (Kutner et al. 2005). For a fairly comprehensive overview of R packages and frameworks — rstan being among the most effective (accurate results) and most efficient to work with (fast sampling, high development productivity) — see Paul-Christian Bürkner's workshop report covering LM, GLM, MLM and NLM: https://github.com/InsuranceDataScience/StanWorkshop2018

JAGS has an R interface in the R2jags package, developed by Yu-Sung Su (Su and Yajima 2020).

The linear model material falls into four parts: linear regression models, analysis of variance models, analysis of covariance models, and linear mixed-effects models. See also David Pollard's course on linear models.

Load rstan. If your machine has a multi-core processor and sufficient memory, consider estimating the model parameters in parallel:

library(rstan)
# Write compiled models to disk to avoid recompiling. On heterogeneous
# GitHub Actions runners, recompile every time, otherwise the error
# "caught illegal operation" is easily triggered
# rstan_options(auto_write = TRUE)
# With enough CPU cores and memory, use as many cores as Markov chains
# options(mc.cores = max(c(floor(parallel::detectCores() / 2), 1L)))
options(mc.cores = 1L)

custom_colors <- c(
  "#4285f4", # GoogleBlue
  "#34A853", # GoogleGreen
  "#FBBC05", # GoogleYellow
  "#EA4335" # GoogleRed
)
rstan_ggtheme_options(
  panel.background = element_rect(fill = "white"),
  legend.position = "top"
)
rstan_gg_options(
  fill = "#4285f4", color = "white",
  pt_color = "#EA4335", chain_colors = custom_colors
)

21.1 Gaussian Processes

The following example of simulating a Gaussian process comes from the Stan reference manual (Stan Development Team 2019):

data {
  int<lower=1> N;
  real x[N];
}
transformed data {
  matrix[N, N] K;
  vector[N] mu = rep_vector(0, N);
  for (i in 1:(N - 1)) {
    K[i, i] = 1 + 0.1;
    for (j in (i + 1):N) {
      K[i, j] = exp(-0.5 * square(x[i] - x[j]));
      K[j, i] = K[i, j];
    }
  }
  K[N, N] = 1 + 0.1;
}
parameters {
  vector[N] y;
}
model {
  y ~ multi_normal(mu, K);
}
library(cmdstanr)
mod <- cmdstan_model(file = "code/normal_gp.stan")

The Stan math library has a built-in implementation of the exponentiated quadratic kernel, so the covariance matrix can be computed directly with the cov_exp_quad function:

data {
  int<lower=1> N;
  real x[N];
}
transformed data {
  matrix[N, N] K = cov_exp_quad(x, 1.0, 1.0);
  vector[N] mu = rep_vector(0, N);
  for (n in 1:N)
    K[n, n] = K[n, n] + 0.1;
}
parameters {
  vector[N] y;
}
model {
  y ~ multi_normal(mu, K);
}
library(cmdstanr)
mod <- cmdstan_model(file = "code/compat_gp.stan")

The topo dataset from the MASS package illustrates the complexity of Gaussian process regression problems.
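As a first look at that dataset, the sketch below loads topo (52 points with coordinates x, y and elevation z) and plots the measurement locations; the point size is scaled by elevation purely for illustration:

```r
# topo: 52 topographic measurements with coordinates x, y and elevation z
data(topo, package = "MASS")
str(topo)
# show the measurement locations, sized by elevation
plot(y ~ x, data = topo, cex = 2 * z / max(z),
     main = "MASS::topo elevation data")
```

Interpolating the elevation surface between these irregularly spaced points is exactly the kind of problem a Gaussian process regression addresses.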

21.2 Normal Distribution

We illustrate Bayesian estimation by estimating the parameters of a normal distribution.

\[Y \sim \mathcal{N}(\mu,\sigma^2)\] Given a sample \(y_1,y_2,\ldots,y_n\) from the normal population \(\mathcal{N}(\mu,\sigma^2)\), we want to estimate the model parameters \(\mu\) and \(\sigma^2\).

We first sketch maximum likelihood estimation, with a brief derivation and computing code, and then walk through the computation in Stan.
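For the normal model the maximum likelihood estimators have closed forms, \(\hat{\mu} = \bar{y}\) and \(\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2\). A quick check in R, using simulated data for illustration:

```r
set.seed(20190427)
y <- rnorm(500, mean = 10, sd = 2)
mu_hat <- mean(y)                       # MLE of mu: the sample mean
sigma_hat <- sqrt(mean((y - mu_hat)^2)) # MLE of sigma: divides by n, not n - 1
c(mu_hat = mu_hat, sigma_hat = sigma_hat)
```

Both estimates should land close to the true values 10 and 2, which gives a frequentist baseline to compare the Bayesian posterior against.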

// The input data is a vector 'y' of length 'N'.
data {
  int<lower=0> N;
  vector[N] y;
}
// The parameters accepted by the model. Our model
// accepts two parameters 'mu' and 'sigma'.
parameters {
  real mu;
  real<lower=0> sigma;
}
// The model to be estimated. We model the output
// 'y' to be normally distributed with mean 'mu'
// and standard deviation 'sigma'.
model {
  y ~ normal(mu, sigma);
}
library(cmdstanr)
mod <- cmdstan_model(file = "code/normal_dist.stan")

Pack the observed data, initialize the parameters to be estimated, and set the number of chains. The dataList must match the data block declared in the Stan code (variable names and lengths). Each chain uses different initial values; well-chosen initial values can speed up convergence considerably.

# Prepare the data
set.seed(20190427)
# True parameter values
mu <- 10
sd <- 2
# Sample size
nobs <- 500
nchains <- 4
# Generate the sample
y <- rnorm(n = nobs, mean = mu, sd = sd)
# Different initial values for each chain
inits_data <- lapply(1:nchains, function(i) {
  list(
    mu = runif(1, min(y), max(y)),
    sigma = runif(1, 1, 10)
  )
})
# Fit the model
normal_fit <- sampling(normal_dist,
  data = list(
    N = nobs,
    y = y
  ),
  init = inits_data,
  warmup = 1000, # warmup iterations per chain
  iter = 2000, # total iterations per chain
  chains = nchains, # number of Markov chains
  cores = 1, # number of CPU cores; one per chain if available
  verbose = FALSE, # suppress intermediate output
  refresh = 0, # suppress the sampling progress
  seed = 20190425 # random seed; do not use set.seed()
)
normal_fit <- mod$sample(
  data = list(
    N = nobs,
    y = y
  ),
  init = inits_data,
  iter_warmup = 1000, # warmup iterations per chain
  iter_sampling = 1000, # post-warmup iterations per chain
  chains = nchains, # number of Markov chains
  parallel_chains = 1, # number of chains to run in parallel
  refresh = 0, # suppress the sampling progress
  seed = 20190425 # random seed; do not use set.seed()
)

Check convergence: Rhat decides whether the chains have converged, and every parameter must have Rhat below 1.1. In addition, the ratio of the effective sample size n_eff to the total number of draws N should not fall below 0.001; otherwise convergence is questionable. Also inspect the trace plots of the MCMC draws.
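These two diagnostics can be checked programmatically. The helper below is a hypothetical sketch (not part of rstan) that scans the summary matrix returned by summary(fit)$summary:

```r
# sm: the matrix summary(fit)$summary; total: total post-warmup draws
# (hypothetical helper, written for this chapter)
check_convergence <- function(sm, total) {
  rhat_ok <- all(sm[, "Rhat"] < 1.1, na.rm = TRUE)
  neff_ok <- all(sm[, "n_eff"] / total > 0.001, na.rm = TRUE)
  rhat_ok && neff_ok
}
# usage: check_convergence(summary(normal_fit)$summary, total = 4000)
```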

stan_trace(normal_fit, nrow = 2) +
  theme_minimal()
Figure 21.1: Markov chain traceplots

The fitted results and their interpretation:

# Parameter estimates
print(normal_fit)
## Inference for Stan model: 4eee79da1367e8c54e2f1c4e8c84d3e1.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##          mean se_mean   sd    2.5%     25%     50%     75%   97.5%
## mu      10.02    0.00 0.09    9.84    9.95   10.02   10.08   10.19
## sigma    2.03    0.00 0.07    1.90    1.98    2.02    2.07    2.16
## lp__  -601.59    0.02 1.02 -604.35 -601.95 -601.29 -600.87 -600.60
##       n_eff Rhat
## mu     3555    1
## sigma  3469    1
## lp__   1813    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Jun 27 06:26:37 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Extract the mean and standard deviation and plot their posterior distributions:

est_mean <- rstan::extract(normal_fit, "mu")$mu
est_sd <- rstan::extract(normal_fit, "sigma")$sigma
# plot results
par(mfrow = c(1, 2))
hist(est_mean, breaks = 50, col = "#4285f4", border = "white", 
     xlab = expression(mu), main = "")
abline(v = mu, lwd = 2, col = "#EA4335")
hist(est_sd, breaks = 50, col = "#4285f4", border = "white",
     xlab = expression(sigma), main = "")
abline(v = sd, lwd = 2, col = "#EA4335")
Figure 21.2: Posterior distributions and Bayes estimates of the mean and standard deviation

21.3 Hierarchical Normal Models

Such models are known as multilevel models or hierarchical models.

21.3.1 The Eight Schools Educational Testing Dataset

// The model, written in Stan
// saved as 8schools.stan
data {
  int<lower=0> J;         // number of schools 
  real y[J];              // estimated treatment effects
  real<lower=0> sigma[J]; // standard error of effect estimates 
}
parameters {
  real mu;                // population treatment effect
  real<lower=0> tau;      // standard deviation in treatment effects
  vector[J] eta;          // unscaled deviation from mu by school
}
transformed parameters {
  vector[J] theta = mu + tau * eta;        // school treatment effects
}
model {
  target += normal_lpdf(eta | 0, 1);       // prior log-density
  target += normal_lpdf(y | theta, sigma); // log-likelihood
}
library(cmdstanr)
mod <- cmdstan_model(file = "code/eight_schools.stan")
# Fit the model
eight_schools_fit <- sampling(eight_schools,
  data = list( # observed data
    J = 8,
    y = c(28, 8, -3, 7, -1, 1, 18, 12),
    sigma = c(15, 10, 16, 11, 9, 11, 10, 18)
  ),
  warmup = 1000, # warmup iterations per chain
  iter = 2000, # total iterations per chain
  chains = 4, # number of Markov chains
  cores = 1, # number of CPU cores; one per chain if available
  verbose = FALSE, # suppress intermediate output
  refresh = 0, # suppress the sampling progress
  seed = 20190425 # random seed; do not use set.seed()
)
## Warning: There were 4 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems
eight_schools_fit <- mod$sample(
  data = list( # observed data
    J = 8,
    y = c(28, 8, -3, 7, -1, 1, 18, 12),
    sigma = c(15, 10, 16, 11, 9, 11, 10, 18)
  ),
  iter_warmup = 1000, # warmup iterations per chain
  iter_sampling = 1000, # post-warmup iterations per chain
  chains = 4, # number of Markov chains
  parallel_chains = 1, # number of chains to run in parallel
  refresh = 0, # suppress the sampling progress
  seed = 20190425 # random seed; do not use set.seed()
)

Alternatively,

eight_schools_fit <- stan(
  model_name = "eight_schools",
  # file = "code/stan/8schools.stan",
  model_code = "
  // saved as 8schools.stan
  data {
    int<lower=0> J;         // number of schools 
    real y[J];              // estimated treatment effects
    real<lower=0> sigma[J]; // standard error of effect estimates 
  }
  parameters {
    real mu;                // population treatment effect
    real<lower=0> tau;      // standard deviation in treatment effects
    vector[J] eta;          // unscaled deviation from mu by school
  }
  transformed parameters {
    vector[J] theta = mu + tau * eta;        // school treatment effects
  }
  model {
    target += normal_lpdf(eta | 0, 1);       // prior log-density
    target += normal_lpdf(y | theta, sigma); // log-likelihood
  }
  ",
  data = list( # observed data
    J = 8,
    y = c(28, 8, -3, 7, -1, 1, 18, 12),
    sigma = c(15, 10, 16, 11, 9, 11, 10, 18)
  ),
  warmup = 1000, # warmup iterations per chain
  iter = 2000, # total iterations per chain
  chains = 4, # number of Markov chains
  cores = 1, # number of CPU cores; one per chain if available
  verbose = FALSE, # suppress intermediate output
  refresh = 0, # suppress the sampling progress
  seed = 20190425 # random seed; do not use set.seed()
)
Warning messages:
1: There were 2 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: Examine the pairs() plot to diagnose sampling problems

The fitted model:

print(eight_schools_fit, digits = 1)
## Inference for Stan model: 39b7c714de7a00171b8f4ed5ac9f72ed.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##           mean se_mean  sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## mu         8.0     0.1 5.1  -1.8   4.6   8.1  11.4  18.4  1544    1
## tau        6.6     0.2 5.8   0.3   2.6   5.2   8.7  22.2  1402    1
## eta[1]     0.4     0.0 0.9  -1.4  -0.2   0.4   1.0   2.2  3679    1
## eta[2]     0.0     0.0 0.9  -1.8  -0.6   0.0   0.6   1.7  3372    1
## eta[3]    -0.2     0.0 0.9  -2.1  -0.8  -0.2   0.4   1.6  4024    1
## eta[4]     0.0     0.0 0.9  -1.8  -0.6  -0.1   0.5   1.7  3956    1
## eta[5]    -0.3     0.0 0.9  -2.0  -0.9  -0.4   0.2   1.4  3223    1
## eta[6]    -0.2     0.0 0.9  -2.0  -0.8  -0.3   0.4   1.6  3327    1
## eta[7]     0.4     0.0 0.9  -1.5  -0.2   0.4   0.9   2.1  3177    1
## eta[8]     0.0     0.0 1.0  -1.9  -0.6   0.1   0.7   1.9  3931    1
## theta[1]  11.4     0.2 8.3  -1.6   6.0  10.5  15.3  31.6  2512    1
## theta[2]   7.8     0.1 6.5  -5.1   3.8   7.9  11.8  20.2  4251    1
## theta[3]   6.3     0.1 7.7 -10.7   2.0   6.6  11.0  20.7  3796    1
## theta[4]   7.4     0.1 6.6  -6.2   3.4   7.6  11.6  20.2  4027    1
## theta[5]   5.2     0.1 6.4  -9.5   1.4   5.6   9.6  16.4  3998    1
## theta[6]   6.3     0.1 6.6  -7.9   2.3   6.5  10.5  18.9  3942    1
## theta[7]  10.8     0.1 6.8  -0.8   6.1  10.1  14.6  26.2  3051    1
## theta[8]   8.6     0.1 8.0  -6.8   3.7   8.2  13.0  26.1  2892    1
## lp__     -39.6     0.1 2.7 -45.4 -41.1 -39.4 -37.8 -34.9  1242    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Jun 27 06:26:38 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
# Extract the Markov chain draws
eight_schools_sim <- extract(eight_schools_fit, permuted = FALSE)
# Extract the four chains for the parameter mu
eight_schools_mu_sim <- eight_schools_sim[, , "mu"]

eight_schools_sim is a three-dimensional array; with permuted = TRUE the draws from the four chains are merged and the result becomes a list.

str(eight_schools_sim)
##  num [1:1000, 1:4, 1:19] 17.32 3.71 9.23 10.76 10.86 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ iterations: NULL
##   ..$ chains    : chr [1:4] "chain:1" "chain:2" "chain:3" "chain:4"
##   ..$ parameters: chr [1:19] "mu" "tau" "eta[1]" "eta[2]" ...
class(eight_schools_sim)
## [1] "array"

Model diagnostics: check the stationarity of the chains.

matplot(eight_schools_mu_sim,
  xlab = "Iteration", ylab = expression(mu),
  type = "l", lty = "solid", col = custom_colors
)
# abline(h = apply(eight_schools_mu_sim, 2, mean), col = custom_colors)
legend("bottomleft",
  legend = paste0("chain:", seq(4)), box.col = "white", inset = 0.01,
  lty = "solid", horiz = TRUE, col = custom_colors
)

Alternatively, use the traceplot or stan_trace function provided by rstan. rstan relies heavily on ggplot2 for plotting, so if you know ggplot2 you can easily build your own style: besides rstan's rstan_ggtheme_options and rstan_gg_options functions, all of ggplot2's options and themes are available, such as the theme_minimal theme, because plotting functions like stan_trace return a ggplot object.

# traceplot(eight_schools_fit, pars = "mu")
stan_trace(eight_schools_fit, pars = "mu") +
  theme_minimal() +
  labs(x = "Iteration", y = expression(mu))
Figure 21.3: Traceplots of the Markov chains

Autocorrelation plots of the chains:

acf(eight_schools_mu_sim)

Similarly, the stan_ac function draws the autocorrelation plots:

stan_ac(eight_schools_fit, pars = "mu", separate_chains = TRUE) +
  theme_minimal()
Figure 21.4: Autocorrelation plots of the Markov chains

The stan_hist function plots the posterior distribution of \(\mu\); it has no separate_chains argument, so the chains cannot be drawn separately:

stan_hist(eight_schools_fit, pars = "mu") + theme_minimal()

A scatter plot of the draws of \(\mu\) and \(\tau\):

stan_scat(eight_schools_fit, pars = c("mu","tau")) + theme_minimal()

The posterior density of \(\mu\):

stan_dens(eight_schools_fit, pars = "mu", separate_chains = TRUE) + 
  theme_minimal() +
  labs(x = expression(mu), y = "Density")

The 95% credible interval of the parameter \(\tau\):

print(eight_schools_fit, "tau", probs = c(0.025, 0.975))
## Inference for Stan model: 39b7c714de7a00171b8f4ed5ac9f72ed.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##     mean se_mean   sd 2.5% 97.5% n_eff Rhat
## tau 6.57    0.15 5.76 0.29 22.17  1402    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Jun 27 06:26:38 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The same results as print(fit) can be recovered from the simulated draws:

# Summarize with rstan's built-in summary method
summary(eight_schools_fit)$summary
##                  mean    se_mean        sd        2.5%         25%
## mu         8.03363750 0.12997455 5.1076135  -1.7905546   4.6243972
## tau        6.56630564 0.15368707 5.7551173   0.2903407   2.5664054
## eta[1]     0.39084544 0.01523904 0.9243495  -1.4484276  -0.2105463
## eta[2]    -0.01812982 0.01543025 0.8959719  -1.8162239  -0.5908863
## eta[3]    -0.18900664 0.01478608 0.9379389  -2.0692488  -0.8185928
## eta[4]    -0.04807925 0.01391410 0.8751997  -1.7845835  -0.6143242
## eta[5]    -0.34945379 0.01554657 0.8826264  -2.0480796  -0.9222439
## eta[6]    -0.22756422 0.01549464 0.8937398  -1.9501428  -0.8314175
## eta[7]     0.35213086 0.01591392 0.8970345  -1.4587431  -0.2169395
## eta[8]     0.04238602 0.01525136 0.9561752  -1.8664661  -0.6066864
## theta[1]  11.43380521 0.16474689 8.2577797  -1.5683606   5.9558640
## theta[2]   7.84507858 0.09945511 6.4847970  -5.1230115   3.8335129
## theta[3]   6.27040719 0.12473558 7.6852088 -10.6539064   2.0347747
## theta[4]   7.43583185 0.10416959 6.6106399  -6.2301460   3.4026316
## theta[5]   5.15555881 0.10188047 6.4420825  -9.4607360   1.4040800
## theta[6]   6.27246302 0.10490295 6.5865382  -7.8555822   2.3312351
## theta[7]  10.75556096 0.12368961 6.8320866  -0.8375143   6.1380401
## theta[8]   8.56153284 0.14802194 7.9599953  -6.8323827   3.7252409
## lp__     -39.60232704 0.07530593 2.6540844 -45.4335624 -41.1121259
##                    50%         75%      97.5%    n_eff      Rhat
## mu        8.065200e+00  11.3520719  18.424622 1544.256 0.9995936
## tau       5.179738e+00   8.7277772  22.169354 1402.277 1.0005617
## eta[1]    4.110242e-01   1.0093911   2.172188 3679.234 0.9992893
## eta[2]    1.548274e-04   0.5544142   1.748869 3371.652 0.9998038
## eta[3]   -1.968897e-01   0.4358008   1.647165 4023.859 1.0002767
## eta[4]   -6.734432e-02   0.5165025   1.695018 3956.436 0.9999675
## eta[5]   -3.552985e-01   0.2210551   1.442920 3223.181 0.9994154
## eta[6]   -2.514102e-01   0.3700746   1.571887 3327.049 0.9990629
## eta[7]    3.554341e-01   0.9353854   2.071842 3177.342 1.0010123
## eta[8]    7.444842e-02   0.6745345   1.906864 3930.589 1.0003682
## theta[1]  1.045143e+01  15.3309635  31.596545 2512.419 1.0003236
## theta[2]  7.870729e+00  11.8037202  20.248392 4251.465 0.9995417
## theta[3]  6.645797e+00  10.9728182  20.724913 3796.039 0.9995315
## theta[4]  7.590394e+00  11.6322488  20.227692 4027.217 0.9995794
## theta[5]  5.607503e+00   9.5840912  16.439581 3998.257 0.9995214
## theta[6]  6.460515e+00  10.5423516  18.890980 3942.203 0.9996620
## theta[7]  1.012190e+01  14.5814212  26.166090 3050.986 0.9997217
## theta[8]  8.238568e+00  13.0317340  26.123857 2891.835 0.9997389
## lp__     -3.937146e+01 -37.8154992 -34.937648 1242.141 1.0007668
# Merge the draws from the four chains
eight_schools_sim <- extract(eight_schools_fit, permuted = TRUE)
str(eight_schools_sim)
## List of 5
##  $ mu   : num [1:4000(1d)] 5.8 3.64 16.56 14.29 10.79 ...
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ iterations: NULL
##  $ tau  : num [1:4000(1d)] 8.427 0.618 11.302 0.795 10.884 ...
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ iterations: NULL
##  $ eta  : num [1:4000, 1:8] -0.16 -0.332 0.659 -1.552 1.82 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ iterations: NULL
##   .. ..$           : NULL
##  $ theta: num [1:4000, 1:8] 4.44 3.44 24.01 13.05 30.6 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ iterations: NULL
##   .. ..$           : NULL
##  $ lp__ : num [1:4000(1d)] -38.1 -41.2 -42.5 -43.1 -43.7 ...
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ iterations: NULL
class(eight_schools_sim)
## [1] "list"
apply(eight_schools_sim$eta, 2, mean)
## [1]  0.39084544 -0.01812982 -0.18900664 -0.04807925 -0.34945379
## [6] -0.22756422  0.35213086  0.04238602
apply(eight_schools_sim$theta, 2, mean)
## [1] 11.433805  7.845079  6.270407  7.435832  5.155559  6.272463
## [7] 10.755561  8.561533
lapply(eight_schools_sim["mu"], mean)
## $mu
## [1] 8.033638
lapply(eight_schools_sim["tau"], mean)
## $tau
## [1] 6.566306
lapply(eight_schools_sim["lp__"], mean)
## $lp__
## [1] -39.60233
t(apply(eight_schools_sim$eta, 2, quantile, probs = c(2.5, 25, 50, 75, 97.5) / 100))
##       
##             2.5%        25%           50%       75%    97.5%
##   [1,] -1.448428 -0.2105463  0.4110242141 1.0093911 2.172188
##   [2,] -1.816224 -0.5908863  0.0001548274 0.5544142 1.748869
##   [3,] -2.069249 -0.8185928 -0.1968897135 0.4358008 1.647165
##   [4,] -1.784583 -0.6143242 -0.0673443188 0.5165025 1.695018
##   [5,] -2.048080 -0.9222439 -0.3552984797 0.2210551 1.442920
##   [6,] -1.950143 -0.8314175 -0.2514102143 0.3700746 1.571887
##   [7,] -1.458743 -0.2169395  0.3554340687 0.9353854 2.071842
##   [8,] -1.866466 -0.6066864  0.0744484175 0.6745345 1.906864
t(apply(eight_schools_sim$theta, 2, quantile, probs = c(2.5, 25, 50, 75, 97.5) / 100))
##       
##               2.5%      25%       50%       75%    97.5%
##   [1,]  -1.5683606 5.955864 10.451430 15.330964 31.59654
##   [2,]  -5.1230115 3.833513  7.870729 11.803720 20.24839
##   [3,] -10.6539064 2.034775  6.645797 10.972818 20.72491
##   [4,]  -6.2301460 3.402632  7.590394 11.632249 20.22769
##   [5,]  -9.4607360 1.404080  5.607503  9.584091 16.43958
##   [6,]  -7.8555822 2.331235  6.460515 10.542352 18.89098
##   [7,]  -0.8375143 6.138040 10.121904 14.581421 26.16609
##   [8,]  -6.8323827 3.725241  8.238568 13.031734 26.12386
lapply(eight_schools_sim["mu"], quantile, probs = c(2.5, 25, 50, 75, 97.5) / 100)
## $mu
##      2.5%       25%       50%       75%     97.5% 
## -1.790555  4.624397  8.065200 11.352072 18.424622
lapply(eight_schools_sim["tau"], quantile, probs = c(2.5, 25, 50, 75, 97.5) / 100)
## $tau
##       2.5%        25%        50%        75%      97.5% 
##  0.2903407  2.5664054  5.1797384  8.7277772 22.1693542
lapply(eight_schools_sim["lp__"], quantile, probs = c(2.5, 25, 50, 75, 97.5) / 100)
## $lp__
##      2.5%       25%       50%       75%     97.5% 
## -45.43356 -41.11213 -39.37146 -37.81550 -34.93765

Posterior distributions of the parameters \(\mu\), \(\tau\), and lp__1:

pairs(eight_schools_fit, pars = c("mu", "tau", "lp__"))
Figure 21.5: Posterior distributions of the parameters

rstan can also read model code and data from disk:

fit <- stan(file = 'code/stan/8schools.stan', ...)

schools_dat <- read_rdump('data/8schools.rdump')
source('data/8schools.rdump')
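An rdump file like the one read above can itself be produced from R objects with rstan's stan_rdump function; the file path below is just for illustration:

```r
library(rstan)
# the eight schools data as plain R objects
J <- 8
y <- c(28, 8, -3, 7, -1, 1, 18, 12)
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)
# dump them in a format that read_rdump()/source() can read back
stan_rdump(c("J", "y", "sigma"), file = "8schools.rdump")
```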

A growth-curve model as an example of a hierarchical linear model (Gelfand et al. 1990).

21.3.2 Bayesian Hierarchical Modeling of the Rats Data

# Prepare the data
# modified code from https://github.com/stan-dev/example-models/tree/master/bugs_examples/vol1/rats
N <- 30
T <- 5
y <- structure(c(
  151, 145, 147, 155, 135, 159, 141, 159, 177, 134,
  160, 143, 154, 171, 163, 160, 142, 156, 157, 152, 154, 139, 146,
  157, 132, 160, 169, 157, 137, 153, 199, 199, 214, 200, 188, 210,
  189, 201, 236, 182, 208, 188, 200, 221, 216, 207, 187, 203, 212,
  203, 205, 190, 191, 211, 185, 207, 216, 205, 180, 200, 246, 249,
  263, 237, 230, 252, 231, 248, 285, 220, 261, 220, 244, 270, 242,
  248, 234, 243, 259, 246, 253, 225, 229, 250, 237, 257, 261, 248,
  219, 244, 283, 293, 312, 272, 280, 298, 275, 297, 350, 260, 313,
  273, 289, 326, 281, 288, 280, 283, 307, 286, 298, 267, 272, 285,
  286, 303, 295, 289, 258, 286, 320, 354, 328, 297, 323, 331, 305,
  338, 376, 296, 352, 314, 325, 358, 312, 324, 316, 317, 336, 321,
  334, 302, 302, 323, 331, 345, 333, 316, 291, 324
), .Dim = c(30, 5))
x <- c(8.0, 15.0, 22.0, 29.0, 36.0)
xbar <- 22.0
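Before modeling, it helps to look at the raw growth curves: each row of y is one rat weighed at the five ages in x. A quick base-R plot (for illustration):

```r
# one line per rat: weight against age in days
matplot(x, t(y), type = "b", pch = 20, lty = "solid", col = "gray",
        xlab = "Age (days)", ylab = "Weight (g)")
```

The roughly parallel, roughly linear trajectories motivate the model below, with a rat-specific intercept alpha[n] and slope beta[n] drawn from common population distributions.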


# Sampler settings
chains <- 4
iter <- 1000

init <- rep(list(list(
  alpha = rep(250, 30), beta = rep(6, 30),
  alpha_c = 150, beta_c = 10,
  tausq_c = 1, tausq_alpha = 1,
  tausq_beta = 1
)), chains)
rats_fit <- stan(
  model_name = "rats",
  model_code = "
  data {
    int<lower=0> N;
    int<lower=0> T;
    real x[T];
    real y[N,T];
    real xbar;
  }
  parameters {
    real alpha[N];
    real beta[N];
  
    real alpha_c;
    real beta_c;          // beta.c in original bugs model
  
    real<lower=0> tausq_c;
    real<lower=0> tausq_alpha;
    real<lower=0> tausq_beta;
  }
  transformed parameters {
    real<lower=0> tau_c;       // sigma in original bugs model
    real<lower=0> tau_alpha;
    real<lower=0> tau_beta;
  
    tau_c = sqrt(tausq_c);
    tau_alpha = sqrt(tausq_alpha);
    tau_beta = sqrt(tausq_beta);
  }
  model {
    alpha_c ~ normal(0, 100);
    beta_c ~ normal(0, 100);
    tausq_c ~ inv_gamma(0.001, 0.001);
    tausq_alpha ~ inv_gamma(0.001, 0.001);
    tausq_beta ~ inv_gamma(0.001, 0.001);
    alpha ~ normal(alpha_c, tau_alpha); // vectorized
    beta ~ normal(beta_c, tau_beta);  // vectorized
    for (n in 1:N)
      for (t in 1:T) 
        y[n,t] ~ normal(alpha[n] + beta[n] * (x[t] - xbar), tau_c);
  }
  generated quantities {
    real alpha0;
    alpha0 = alpha_c - xbar * beta_c;
  }
  ",
  data = list(N = N, T = T, y = y, x = x, xbar = xbar),
  chains = chains, init = init, iter = iter
)
## 
## SAMPLING FOR MODEL 'rats' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 3.9e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.39 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.274649 seconds (Warm-up)
## Chain 1:                0.076499 seconds (Sampling)
## Chain 1:                0.351148 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'rats' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 1.7e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 2: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 2: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 2: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 2: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 2: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 2: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 2: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 2: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 2: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 2: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 2: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.318543 seconds (Warm-up)
## Chain 2:                0.088173 seconds (Sampling)
## Chain 2:                0.406716 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'rats' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 1.7e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 3: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 3: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 3: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 3: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 3: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 3: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 3: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 3: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 3: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 3: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 3: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.476594 seconds (Warm-up)
## Chain 3:                0.085621 seconds (Sampling)
## Chain 3:                0.562215 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'rats' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 1.4e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.14 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 4: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 4: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 4: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 4: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 4: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 4: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 4: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 4: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 4: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 4: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 4: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.292816 seconds (Warm-up)
## Chain 4:                0.085722 seconds (Sampling)
## Chain 4:                0.378538 seconds (Total)
## Chain 4:
library(cmdstanr)
mod <- cmdstan_model(file = "code/rats.stan")

rats_fit <- mod$sample(
  data = list(N = N, T = T, y = y, x = x, xbar = xbar),
  init = init,
  iter_warmup = 1000, # warmup iterations per chain
  iter_sampling = 1000, # post-warmup iterations per chain
  chains = chains, # number of Markov chains
  parallel_chains = 1, # number of chains to run in parallel
  refresh = 0, # suppress the sampling progress
  seed = 20190425 # random seed; do not use set.seed()
)
plot(rats_fit, pars = "alpha")

plot(rats_fit, pars = "beta")

21.4 Bayesian Generalized Linear Models

The Capital of Statistics (COS) has translated the mcmc package's help documentation: https://cosx.org/2012/07/mcmc-case-study/

We use a generalized linear model to illustrate the workflow of a Bayesian data analysis. The simulated dataset logit comes from the R package mcmc; it contains five variables: a response y and four predictors x1, x2, x3, x4. The frequentist analysis takes just a few lines of R:

library(mcmc)
data(logit)
fit <- glm(y ~ x1 + x2 + x3 + x4, data = logit, 
           family = binomial(), x = TRUE)
summary(fit)
## 
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4, family = binomial(), data = logit, 
##     x = TRUE)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7461  -0.6907   0.1540   0.7041   2.1943  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.6328     0.3007   2.104  0.03536 * 
## x1            0.7390     0.3616   2.043  0.04100 * 
## x2            1.1137     0.3627   3.071  0.00213 **
## x3            0.4781     0.3538   1.351  0.17663   
## x4            0.6944     0.3989   1.741  0.08172 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 137.628  on 99  degrees of freedom
## Residual deviance:  87.668  on 95  degrees of freedom
## AIC: 97.668
## 
## Number of Fisher Scoring iterations: 6

Now we analyze the same data the Bayesian way. Assume the five parameters (the regression coefficients) have independent normal priors with mean 0 and standard deviation 2.

The log posterior density of this GLM (log likelihood plus log prior) is computed by the following R function:

x <- fit$x
y <- fit$y
lupost <- function(beta, x, y) {
  eta <- as.numeric(x %*% beta)
  logp <- ifelse(eta < 0, eta - log1p(exp(eta)), -log1p(exp(-eta)))
  logq <- ifelse(eta < 0, -log1p(exp(eta)), -eta - log1p(exp(-eta)))
  logl <- sum(logp[y == 1]) + sum(logq[y == 0])
  return(logl - sum(beta^2) / 8)
}

To guard against overflow and catastrophic cancellation, \(\log(p)\) and \(\log(q)\) are computed with the following trick:

\[\begin{align*} p &= \frac{\exp(\eta)}{1 + \exp(\eta)} = \frac{1}{1 + \exp(- \eta)} \\ q &= \frac{1}{1 + \exp(\eta)} = \frac{\exp(- \eta)}{1 + \exp(- \eta)} \end{align*}\]

Taking logarithms of these expressions gives

\[\begin{align*} \log(p) &= \eta - \log(1 + \exp(\eta)) = - \log(1 + \exp(- \eta)) \\ \log(q) &= - \log(1 + \exp(\eta)) = - \eta - \log(1 + \exp(-\eta)) \end{align*}\]

To prevent overflow we always pass a negative argument to exp, which also avoids catastrophic cancellation when \(|\eta|\) is large. For instance, when \(\eta\) is a large positive number,

\[\begin{align*} p & \approx 1 \\ q & \approx 0 \\ \log(p) & \approx - \exp(-\eta) \\ \log(q) & \approx - \eta - \exp(-\eta) \end{align*}\]

\(\eta\) 为很小的数时,使用 R 内置的函数 log1p 计算,当 \(\eta\) 为大的负数时,情况类似22

With this in place, we can run the random-walk Metropolis algorithm to simulate the posterior distribution:

set.seed(2018)
beta.init <- as.numeric(coefficients(fit))
fit.bayes <- metrop(obj = lupost, initial = beta.init, 
                    nbatch = 1e3, blen = 1, nspac = 1, x = x, y = y)
names(fit.bayes)
##  [1] "accept"       "batch"        "initial"      "final"       
##  [5] "accept.batch" "initial.seed" "final.seed"   "time"        
##  [9] "lud"          "nbatch"       "blen"         "nspac"       
## [13] "scale"        "debug"
fit.bayes$accept
## [1] 0.008

The arguments of the metrop function used here are:

  • the user-written R function lupost, which evaluates the log density of the unnormalized stationary distribution (the posterior) of the Markov chain;
  • beta.init, the initial state of the Markov chain;
  • nbatch, the number of batches of the Markov chain;
  • x and y, extra arguments passed through to the objective function lupost.
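The acceptance rate above (0.008) is far too low for efficient sampling. Following the mcmc package vignette, the metrop output can be fed back in with a smaller proposal scale to tune the acceptance rate toward roughly 20%; the scale value below is just a starting guess:

```r
# continue the chain from where it stopped, with a smaller proposal step
fit.bayes2 <- metrop(fit.bayes, scale = 0.1, x = x, y = y)
fit.bayes2$accept  # should rise well above 0.008
```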

References

Gelfand, Alan E., Susan E. Hills, Amy Racine-Poon, and Adrian F. M. Smith. 1990. “Illustration of Bayesian Inference in Normal Data Models Using Gibbs Sampling.” Journal of the American Statistical Association 85 (412): 972–85. https://doi.org/10.1080/01621459.1990.10474968.

Kutner, Michael H., Christopher J. Nachtsheim, John Neter, and William Li. 2005. Applied Linear Statistical Models. Fifth. New York, NY: McGraw-Hill/Irwin.

Stan Development Team. 2019. Bayesian Statistics Using Stan. https://github.com/stan-dev/stan-book.

Su, Yu-Sung, and Masanao Yajima. 2020. R2jags: Using R to Run JAGS. https://CRAN.R-project.org/package=R2jags.

Young, Derek S. 2017. Handbook of Regression Methods. Boca Raton, FL: Chapman; Hall/CRC.


  1. The log of the posterior density; adding the argument log = TRUE to the pairs call shows the nonnegative parameters on the log scale↩︎

  2. For a more accurate computation of \(\log(1-\exp(-|a|)), |a| \ll 1\), see the Rmpfr package: https://r-forge.r-project.org/projects/rmpfr/↩︎