Chapter 5 Hierarchical Model
Consider the following model: y_i \stackrel{ind}{\sim} p(y\mid \theta_i), \quad \theta_i \stackrel{ind}{\sim} p(\theta\mid \phi), \quad \phi \sim p(\phi). This is a hierarchical or multilevel model.
The joint posterior distribution can be decomposed via p(\theta, \phi\mid y) = p(\theta\mid \phi, y)p(\phi\mid y) where
- p(\theta\mid \phi, y) \propto p(y\mid \theta)p(\theta\mid \phi) = \prod_{i = 1}^n p(y_i\mid \theta_i)p(\theta_i\mid \phi) = \prod_{i = 1}^n p(\theta_i\mid \phi, y_i)
- p(\phi\mid y) \propto p(y\mid \phi)p(\phi)
- p(y\mid \phi) = \int p(y\mid \theta)p(\theta\mid \phi)\,d\theta = \prod_{i = 1}^n \int p(y_i\mid \theta_i)p(\theta_i\mid \phi)\,d\theta_i = \prod_{i = 1}^n p(y_i\mid \phi)
Example: Beta-Binomial
Consider Y_i \stackrel{ind}{\sim} Bin(n_i, \theta_i), \quad \theta_i \stackrel{ind}{\sim} Be(\alpha, \beta), \quad \alpha, \beta \sim p(\alpha, \beta). In this example, \phi = (\alpha, \beta) and p(\theta\mid \alpha, \beta, y) = \prod_{i = 1}^n Be(\theta_i\mid \alpha + y_i, \beta + n_i - y_i). The marginal posterior for \alpha, \beta is p(\alpha, \beta\mid y) \propto p(y\mid \alpha, \beta)p(\alpha, \beta) with \begin{aligned} p(y\mid \alpha, \beta) &= \prod_{i = 1}^n \int p(y_i\mid \theta_i) p(\theta_i\mid \alpha, \beta) d\theta_i \\ & = \prod_{i = 1}^n \int Bin(y_i\mid n_i, \theta_i)Be(\theta_i \mid \alpha, \beta)d\theta_i \\ & = \prod_{i = 1}^n {n_i\choose y_i}\frac{B(\alpha + y_i , \beta + n_i - y_i)}{B(\alpha, \beta)} \end{aligned} Thus, y_i \mid \alpha, \beta \stackrel{ind}{\sim} \text{Beta-binomial}(n_i, \alpha, \beta).
Interpretation: \alpha counts prior successes and \beta counts prior failures.
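As a quick numerical check (a Python sketch for illustration; the counts y, n and hyperparameters are arbitrary), the closed-form beta-binomial mass function above can be compared against direct numerical integration of the binomial-beta mixture:

```python
from math import comb, exp
from scipy.special import betaln
from scipy.integrate import quad
from scipy.stats import binom, beta as beta_dist

def beta_binomial_pmf(y, n, a, b):
    """Closed form: C(n, y) * B(a + y, b + n - y) / B(a, b), computed on the log scale."""
    return comb(n, y) * exp(betaln(a + y, b + n - y) - betaln(a, b))

def integrated_pmf(y, n, a, b):
    """Numerically integrate Bin(y | n, theta) * Be(theta | a, b) over theta in (0, 1)."""
    integrand = lambda t: binom.pmf(y, n, t) * beta_dist.pdf(t, a, b)
    val, _ = quad(integrand, 0.0, 1.0)
    return val

y, n, a, b = 7, 10, 2.0, 3.0
print(beta_binomial_pmf(y, n, a, b))  # agrees with the numerical integral
print(integrated_pmf(y, n, a, b))
```

The agreement of the two computations is exactly the marginalization p(y_i\mid \alpha, \beta) = \int p(y_i\mid \theta_i)p(\theta_i\mid \alpha, \beta)\,d\theta_i from above.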
A more natural parameterization is
- prior expectation: \mu = \frac{\alpha}{\alpha + \beta}
- prior sample size \eta = \alpha + \beta
We can assume the mean (\mu) and the size (\eta) are independent a priori. For example, suppose \mu \sim Be(20, 30) and \eta \sim LN(0, 3^2) where LN is short for log-normal distribution.
= "
model_informative_prior data {
int<lower=0> N; // data
int<lower=0> n[N] ;
int<lower=0> y[N];
real<lower=0> a; // prior
real<lower=0> b;
real<lower=0> C;
real m;
}
parameters {
real<lower=0, upper=1> mu;
real<lower=0> eta;
real<lower=0, upper=1> theta[N];
}
transformed parameters {
real<lower=0> alpha;
real<lower=0> beta;
alpha = eta* mu;
beta = eta*(1-mu) ;
}
model {
mu ~ beta(a,b);
eta ~ lognormal(m,C);
# implicit joint distributions
theta ~ beta(alpha,beta);
y ~ binomial(n, theta);
}
"
a = 20; b = 30; m = 0; C = 3
n = 1e4
prior_draws = mutate(data.frame(mu = rbeta(n, a, b), eta = rlnorm(n, m, C)),
                     alpha = eta * mu, beta = eta * (1 - mu))
dat = list(y = d$made, n = d$attempts, N = nrow(d), a = a, b = b, m = m, C = C)
m = stan_model(model_code = model_informative_prior)
r = sampling(m, dat, c("mu", "eta", "alpha", "beta", "theta"), iter = 10000)
plot(r, pars = c('eta', 'alpha', 'beta'))
plot(r, pars = c('mu', 'theta'))
In Bayesian Data Analysis (page 110), several priors are discussed:
- (\log(\alpha/\beta), \log(\alpha + \beta)) \propto 1 leads to an improper posterior
- (\log(\alpha/\beta), \log(\alpha + \beta)) \sim Unif([-10^{10}, 10^{10}]\times [-10^{10}, 10^{10}]), while proper and seemingly vague, is a very informative prior
- p(\log(\alpha/\beta), \log(\alpha+\beta)) \propto \alpha\beta(\alpha+\beta)^{-5/2} leads to a proper posterior and is equivalent to p(\alpha, \beta) \propto (\alpha + \beta)^{-5/2}
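The equivalence in the last bullet can be checked by a change of variables. With u = \log(\alpha/\beta) and v = \log(\alpha + \beta), the inverse map and Jacobian are:

```latex
% Inverse map: \alpha = e^{u+v}/(1+e^u), \quad \beta = e^{v}/(1+e^u).
\left|\det \frac{\partial(\alpha, \beta)}{\partial(u, v)}\right|
  = \left|\det \begin{pmatrix}
      \frac{\alpha\beta}{\alpha+\beta} & \alpha \\[2pt]
      -\frac{\alpha\beta}{\alpha+\beta} & \beta
    \end{pmatrix}\right|
  = \frac{\alpha\beta}{\alpha+\beta}\,(\beta + \alpha) = \alpha\beta.
% Hence p(u, v) = p(\alpha, \beta)\,\alpha\beta, so
% p(u, v) \propto \alpha\beta(\alpha+\beta)^{-5/2}
% corresponds to p(\alpha, \beta) \propto (\alpha+\beta)^{-5/2}.
```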
# Stan code for default prior
= "
model_default_prior data {
int<lower=0> N;
int<lower=0> n [N];
int<lower=0> y [N];
}
parameters {
real<lower=0> alpha;
real<lower=0> beta;
real<lower=0, upper=1> theta[N];
}
model {
# default prior
target += -5 * log(alpha+beta)/2;
# implicit joint distributions
theta ~ beta(alpha, beta) ;
y ~ binomial(n, theta);
}
"
An alternative to jointly sampling \theta, \alpha, \beta is to
- sample \alpha, \beta \sim p(\alpha, \beta\mid y), and then
- sample \theta \stackrel{ind}{\sim} p(\theta_i \mid \alpha, \beta, y_i) \stackrel{d}{=} Be(\alpha + y_i, \beta + n_i - y_i).
The marginal posterior for \alpha, \beta is p(\alpha, \beta\mid y) \propto p(y\mid \alpha, \beta)p(\alpha, \beta) = \left[\prod_{i = 1}^n \text{Beta-Binomial}(y_i \mid n_i, \alpha, \beta) \right]p(\alpha, \beta)
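Because \theta has been integrated out, this marginal posterior is cheap to evaluate directly. A minimal sketch in Python (the counts y, n and grid ranges are made up for illustration) evaluates \log p(\alpha, \beta\mid y) under the default prior on a grid in (\log(\alpha/\beta), \log(\alpha+\beta)), including the Jacobian \alpha\beta:

```python
import numpy as np
from scipy.special import betaln

# Hypothetical data: successes y out of n attempts for 5 groups
y = np.array([ 5,  8, 10,  4,  7])
n = np.array([10, 12, 15,  9, 11])

def log_post(alpha, beta_):
    """log p(alpha, beta | y) up to a constant, with prior (alpha+beta)^(-5/2).

    The binomial coefficients are dropped since they do not involve alpha, beta."""
    loglik = np.sum(betaln(alpha + y, beta_ + n - y) - betaln(alpha, beta_))
    return loglik - 2.5 * np.log(alpha + beta_)

# Grid in u = log(alpha/beta), v = log(alpha+beta)
u = np.linspace(-2, 2, 100)
v = np.linspace(-1, 5, 100)
U, V = np.meshgrid(u, v)
A = np.exp(U + V) / (1 + np.exp(U))   # alpha on the grid
B = np.exp(V) / (1 + np.exp(U))       # beta on the grid
# Add log(alpha*beta), the Jacobian for working in (u, v) coordinates
logp = np.vectorize(log_post)(A, B) + np.log(A) + np.log(B)
post = np.exp(logp - logp.max())      # normalize on the grid
post /= post.sum()
```

Draws of (\alpha, \beta) can then be taken from the normalized grid, mirroring the grid approach in Bayesian Data Analysis.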
# Marginalized (integrated) theta out of the model
= "
model_marginalized data {
int<lower=0> N;
int<lower=0> n[N];
int<lower=0> y[N];
}
parameters {
real<lower=0> alpha;
real<lower=0> beta;
}
model {
target += -5 * log(alpha+beta)/2;
y ~ beta_binomial(n, alpha, beta);
}
"
The fourth approach marginalizes \theta out of the likelihood but recovers posterior draws of \theta in the generated quantities block:
# Stan Beta-Binomial
= "
model_seasons data {
int<lower=0> N; int<lower=0> n[N]; int<lower=0> y[N] ;
real<lower=0> a; real<lower=0> b; real<lower=0> C; real m;
}
parameters {
real<lower=0, upper=1> mu;
real<lower=0> eta;
}
transformed parameters {
real<lower=0> alpha;
real<lower=0> beta;
alpha = eta * mu;
beta = eta * (1-mu) ;
}
model {
mu ~ beta(a,b);
eta ~ lognormal(m,C);
y ~ beta_binomial(n, alpha, beta);
}
generated quantities {
real<lower=0, upper=1> theta[N] ;
for (i in 1:N)
theta [i] = beta_rng(alpha+y[i], beta + n[i]-y[i]);
}
"
Exchangeability: The set Y_1, Y_2, \ldots, Y_n is exchangeable if the joint probability p(y_1, \ldots, y_n) is invariant to permutation of the indices. That is, for any permutation \pi, p(y_1, \ldots, y_n) = p(y_{\pi_1}, \ldots, y_{\pi_n}).
Theorem 1: All independent and identically distributed random variables are exchangeable.
Theorem 2 (de Finetti’s Theorem): A sequence of random variables (y_1, y_2, \ldots) is infinitely exchangeable iff, for all n, p(y_1, \ldots, y_n) = \int \prod_{i = 1}^n p(y_i\mid \theta)P(d\theta) for some measure P on \theta.
Although hierarchical models are typically written using the conditional independence notation above, the assumptions underlying the model are exchangeability and functional forms for the priors. y_i \stackrel{ind}{\sim} p(y\mid \theta_i) \quad \theta_i \stackrel{ind}{\sim} p(\theta\mid \phi) \quad \phi \sim p(\phi)
Example 2: Normal hierarchical model
Suppose we have the following model: \begin{aligned} y_{ij} & \sim N(\theta_i, \sigma^2)\\ \theta_i & \sim N(\mu, \tau^2) \end{aligned} with j = 1,\ldots, n_i, i = 1, \ldots, I and n = \sum_{i = 1}^I n_i. We assume \sigma^2 = s^2 is known and p(\mu, \tau^2) \propto p(\tau), i.e. assume an improper uniform prior on \mu.
(Section 5.4 in Bayesian Data Analysis) Let \bar y_{i.} = \frac{1}{n_i} \sum_{j = 1}^{n_i} y_{ij} and s^2_i = s^2/n_{i}; then \left\{\begin{aligned} p(\tau \mid y) & \propto p(\tau) V_{\mu}^{1 / 2} \prod_{i=1}^{I}\left(s_{i}^{2}+\tau^{2}\right)^{-1 / 2} \exp \left(-\frac{\left(\bar{y}_{i .}-\hat{\mu}\right)^{2}}{2\left(s_{i}^{2}+\tau^{2}\right)}\right) \\ \mu \mid \tau, y & \sim N\left(\hat{\mu}, V_{\mu}\right) \\ \theta_{i} \mid \mu, \tau, y & \sim N\left(\hat{\theta}_{i}, V_{i}\right) \end{aligned}\right. with V_{\mu}^{-1} =\sum_{i=1}^{I} \frac{1}{s_{i}^{2}+\tau^{2}}, \quad \hat{\mu}=V_{\mu}\left(\sum_{i=1}^{I} \frac{\bar{y}_{i .}}{s_{i}^{2}+\tau^{2}}\right), \quad V_{i}^{-1} =\frac{1}{s_{i}^{2}}+\frac{1}{\tau^{2}}, \quad \hat{\theta}_{i} =V_{i}\left(\frac{\bar{y}_{i .}}{s_{i}^{2}}+\frac{\mu}{\tau^{2}}\right). One choice for p(\tau) is the half-Cauchy Ca^+(0, 1).
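This conditional decomposition gives a direct (non-MCMC) composition sampler: draw \tau from p(\tau\mid y), then \mu\mid \tau, y, then each \theta_i. A Python sketch on simulated summary data (the group sizes, means, and the Ca^+(0,1) prior grid are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated summary data for I groups with known sampling sd s
I, s = 8, 2.0
n_i = rng.integers(5, 20, size=I)
ybar = rng.normal(0.0, 3.0, size=I)   # observed group means (made up)
s2 = s**2 / n_i                       # s_i^2 = s^2 / n_i

def log_p_tau(tau):
    """log p(tau | y) up to a constant, with a Ca+(0, 1) prior on tau."""
    V_mu = 1.0 / np.sum(1.0 / (s2 + tau**2))
    mu_hat = V_mu * np.sum(ybar / (s2 + tau**2))
    return (-np.log1p(tau**2)                        # half-Cauchy(0,1) kernel
            + 0.5 * np.log(V_mu)
            - 0.5 * np.sum(np.log(s2 + tau**2))
            - np.sum((ybar - mu_hat)**2 / (2.0 * (s2 + tau**2))))

# 1) tau ~ p(tau | y), via a discrete grid approximation
grid = np.linspace(1e-3, 15.0, 2000)
logw = np.array([log_p_tau(t) for t in grid])
w = np.exp(logw - logw.max())
w /= w.sum()
tau = rng.choice(grid, p=w)

# 2) mu | tau, y ~ N(mu_hat, V_mu)
V_mu = 1.0 / np.sum(1.0 / (s2 + tau**2))
mu = rng.normal(V_mu * np.sum(ybar / (s2 + tau**2)), np.sqrt(V_mu))

# 3) theta_i | mu, tau, y ~ N(theta_hat_i, V_i), independently across i
V = 1.0 / (1.0 / s2 + 1.0 / tau**2)
theta_hat = V * (ybar / s2 + mu / tau**2)
theta = rng.normal(theta_hat, np.sqrt(V))
```

Repeating the three steps yields independent joint draws of (\tau, \mu, \theta); note \hat\theta_i always lies between \bar y_{i.} and \mu, which is the shrinkage discussed below.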
There are some alternative distributions for \theta_i:
- Heavy-tailed: \theta_i \sim t_{\nu}(\mu, \tau^2)
- Peak at zero: \theta_i \sim Laplace(\mu, \tau^2)
- Point mass at zero: \theta_i \sim \pi\delta_0 + (1 - \pi)N(\mu, \tau^2)
Hierarchical models
- allow the data to inform us about similarities across groups
- provide data-driven shrinkage toward a grand mean
- lots of shrinkage when means are similar
- little shrinkage when means are different
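The degree of shrinkage is governed by \tau relative to s_i^2. A small numerical illustration (Python, made-up numbers) using the conditional posterior mean \hat\theta_i = V_i(\bar y_{i.}/s_i^2 + \mu/\tau^2) from the normal hierarchical model:

```python
import numpy as np

# Made-up group means with known sampling variances s_i^2
ybar = np.array([10.0, 12.0, 30.0])
s2 = np.full(3, 4.0)
mu = ybar.mean()   # stand-in for the grand mean

def post_mean(tau):
    """theta_hat_i: precision-weighted average of the group mean and the grand mean."""
    V = 1.0 / (1.0 / s2 + 1.0 / tau**2)
    return V * (ybar / s2 + mu / tau**2)

print(post_mean(0.5))    # small tau: estimates pulled strongly toward mu
print(post_mean(50.0))   # large tau: estimates stay close to the group means
```

As \tau \to 0 all \hat\theta_i collapse to \mu (complete pooling); as \tau \to \infty they revert to the raw group means \bar y_{i.} (no pooling).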