7.3 Gibbs Sampler
The attraction of an algorithm like single component Metropolis-Hastings is that it converts a \(p\)-dimensional problem into \(p\) separate 1-dimensional problems, each if which is likely simple to solve. This advantage is not unlike that seen with coordinate descent algorithms discussed previously. SCMH can however be very exploratory and may not be efficient in exploring the parameter space.
Gibbs sampling is a variant of SCMH that uses full conditional distributions for the proposal distribution for each component. Given a target density \(\pi(x)=\pi(x_1,\dots,x_p)\), we cycle through sampling from \(\pi(x_i\mid x_{-i})\) to update the \(i\)th component. Rather than go though an accept/reject step as with Metropolis-Hastings, we always accept, for reasons that will be detailed below.
For example, with a three-component density \(\pi(x, y, z)\), the full conditional distributions associated with this density are
\[ \pi(x\mid y, z), \] \[ \pi(y\mid x, z), \] and \[ \pi(z\mid x, y). \]
If our current state is \((x_n,y_n,z_n)\) at the \(n\)th iteration, then we update our parameter values with the following steps.
- Sample \(x_{n+1}\sim\pi(x\mid y_n, z_n)\)
- Sample \(y_{n+1}\sim\pi(y\mid x_{n+1}, z_n)\)
- Sample \(z_{n+1}\sim\pi(z\mid x_{n+1}, y_{n+1})\)
At each step of the sampling we use the most recent values of all the other components in the full conditional distribution. In a landmark paper, Geman and Geman showed that if \(p(x_n,y_n,z_n)\) is the density of our parameters at the \(n\)th iteration, then as \(n\rightarrow\infty\),
\[\begin{eqnarray*} p(x_n, y_n, z_n) & \rightarrow & \pi(x, y, z)\\ p(x_n) & \rightarrow & \pi(x)\\ p(y_n) & \rightarrow & \pi(y)\\ p(z_n) & \rightarrow & \pi(z) \end{eqnarray*}\]
7.3.1 Example: Bivariate Normal Distribution
Suppose we want to simulate from a bivariate Normal distribution with mean \(\mu = (\mu_1, \mu_2)\) and covariance matrix
\[ \Sigma = \left( \begin{array}{cc} \sigma_1^2 & \rho\\ \rho & \sigma_2^2 \end{array} \right) \]
Although there are exact ways to do this, we can make use of Gibbs sampling too simulate a Markov chain that will converge to a bivariate Normal. At the \(n\)th iteration with state vector \((x_n, y_n)\), we can update our state with the following steps.
Sample \(x_{n+1}\sim\mathcal{N}(\mu_1 + \rho\frac{\sigma_1}{\sigma_2}(y_n - \mu_2),\,\sigma_1^2(1-\rho))\)
Sample \(y_{n+1}\sim\mathcal{N}(\mu_2+\rho\frac{\sigma_2}{\sigma_1}(x_{n+1}-\mu_1),\,\sigma_2^2(1-\rho))\)
If \(\rho\) is close to \(1\), then this algorithm will likely take a while to converge. But it is simple to simulate and demonstrates the mechanics of Gibbs sampling.
7.3.2 Example: Normal Likelihood
Suppose we observe \(y_1,\dots,y_n\stackrel{\text{iid}}{\sim}\mathcal{N}(\mu, \tau^{-1})\), where \(\tau^{-1}\) is the precision of the Normal distribution. We can use independent priors for the two parameters as
\[\begin{eqnarray*} \mu & \sim & \mathcal{N}(0, w^{-1})\\ \tau & \sim & \text{Gamma}(\alpha, \beta) \end{eqnarray*}\]
Because the priors are independent here, we will have to use Gibbs sampling to sample from the posterior distribution of \(\mu\) and \(\tau\).
Recall that with Bayes’ Theorem, the joint posterior of \(\mu\) and \(\tau\) is
\[\begin{eqnarray*} \pi(\mu,\tau\mid \mathbf{y}) & \propto & \pi(\mu,\tau, \mathbf{y})\\ & = & f(\mathbf{y}\mid\mu,\tau)p(\mu)p(\tau) \end{eqnarray*}\]
where \(f(\mathbf{y}\mid\mu,\tau)\) is the joint Normal density of the observed data.
In the Gibbs sampling procedure there will be two full conditional distributions that we will want to sample from. They are
\[\begin{eqnarray*} p(\mu\mid\tau,\mathbf{y}) & \propto & f(\mathbf{y}\mid\mu,\tau)p(\mu)p(\tau)\\ & \propto & f(\mathbf{y}\mid\mu,\tau)p(\mu)\\ & \propto & \exp\left(-\frac{\tau}{2}\sum(y_i-\mu)^2\right)\exp\left(-\frac{w}{2}\mu^2\right)\\ & = & \exp\left(-\frac{(n\tau+w)}{2}\mu^2-\left(\sum y_i\right)\tau\mu\right)\\ & = & \mathcal{N}\left(\frac{\tau}{n\tau+w}\sum y_i,\, \frac{1}{n\tau+w}\right) \end{eqnarray*}\]
and
\[\begin{eqnarray*} p(\tau\mid\mu,\mathcal{y}) & \propto & f(\mathbf{y}\mid\mu,\tau)p(\mu)p(\tau)\\ & \propto & f(\mathbf{y}\mid\mu,\tau)p(\tau)\\ & \propto & \tau^{\frac{n}{2}}\exp\left(-\frac{\tau}{2}\sum(y_i-\mu)^2\right)\tau^{\alpha-1}\exp(-\tau\beta)\\ & = & \tau^{\alpha-1+\frac{n}{2}} \exp\left(-\tau\left[\beta+\frac{1}{2}\sum(y_i-\mu)^2\right]\right)\\ & = & \text{Gamma}\left(\alpha+\frac{n}{2},\,\beta+\frac{1}{2}\sum(y_i-\mu)^2\right) \end{eqnarray*}\]
The Gibbs sampler therefore alternates between sampling from a Normal distribution and a Gamma distribution. In this case, the priors were chosen so that the full conditional distributions could be sampled in closed form.
7.3.3 Gibbs Sampling and Metropolis Hastings
Gibbs sampling has a fairly straightforward connection to the single component Metropolis-Hastings algorithm described earlier. We can reframe the Gibbs sampling algorithm for component \(i\) at iteration \(n\) as
Sample \(y_i\sim q_i(y\mid x^{(n)}_i,x^{(n)}_{-i}) = \pi(y\mid x_{-i}^{(n)})\). Here, the proposal distribution simply the full conditional distribution of \(x_i\) given \(x_{-i}\).
The acceptance probability is then \[ \alpha(y_i\mid x_i, x_{-i}) = \min\left( \frac{ \pi(y_i\mid x_{-i})\pi(x_i\mid x_{-i}) }{ \pi(x_i\mid x_{-i})\pi(y_i\mid x_{-i}) }, 1 \right) = 1 \] The Gibbs sampling algorithm is like the SCMH algorithm but it always accepts the proposal.
7.3.4 Hybrid Gibbs Sampler
Given the relationship between Gibbs sampling and SCMH, we can use this to extend the basic Gibbs algorithm. In some cases, we will not be able to sample directly from the full conditional distribution of a component. In those cases, we can substitute a standard Metropolis-Hastings step with a proposal/acceptance procedure.
For components that require a Metropolis-Hastings step, we will need to come up with a proposal density for those components. In this case, the target density is simply the full conditional of component \(i\) given the other components. For component \(i\) then, the algorithm procedes as
Sample \(y_i\sim q(y\mid x)\)
Compute the acceptance probability \[ \alpha(y_i\mid x) = \min\left(\frac{ \pi(y_i\mid x_{-i})q(x_i\mid x) }{ \pi(x_i\mid x_{-i})q(y_i\mid x) }, 1 \right) \] We then accept the proposal with probability \(\alpha(y_i\mid x)\).
Regardless of whether the proposal is accepted or not, we can move on to the next component of the parameter vector.
7.3.5 Reparametrization
In some Metropolis-Hastings or hybrid Gibbs sampling problems we may have parameters where it is easier to sample from a full conditional of a transformed version of the parameter. For example, we may need to sample from the full conditional \(p(\lambda\mid\cdot)\) of a parameter that only takes values between \(0\) and \(1\). Doing something like a random walk proposal can be tricky given the restricted range.
One solution is the transform the parameter to a space that has a infinte range. For a parameter \(\lambda\in(0,1)\) we can use the logit transformation to map it to \((-\infty,\infty)\), i.e.
\[ z = \text{logit}(\lambda) = \log\frac{\lambda}{1-\lambda}. \]
Then we can do the proposal and acceptance steps on the transformed parameter \(z\), which has an infinite domain. However, we need to be careful that the acceptance ratio is calculated properly to account for the nonlinear transformation of \(\lambda\). In this case we need to compute the determinant of the Jacobian of the transformation mapping \(z\) back to \(\lambda\).
The algorithm at iteration \(n\) in the transformed space would work as follows.
Sample \(z^\star\sim g(z\mid z_n)\), where \(z_n=\text{logit}(\lambda_n)\) and \(g\) is the proposal density.
Compute \[ \alpha(z^\star\mid z_n) = \min\left(\frac{ p(\text{logit}^{-1}(z^\star)\mid\cdot) }{ p(\text{logit}^{-1}(z_n)\mid\cdot) }\frac{ g(z_n\mid z^\star) }{ g(z^\star\mid z_n) }\frac{ |J(z^\star)| }{ |J(z_n)| }, 1 \right) \] where \(|J(z)|\) is the determinant of the Jacobian of the transformation that maps from \(z\mapsto \lambda\), i.e. the function \[ \text{logit}^{-1}(z) = \frac{\exp(z)}{1+\exp(z)}. \]
In R, we can easily compute this Jacobian (and that for any other transformation) with the deriv()
function.
<- local({
Jacobian <- deriv(~ exp(x) / (1 + exp(x)), "x",
J function.arg = TRUE)
function(x) {
<- J(x)
val drop(attr(val, "gradient"))
}
})Jacobian(0)
## x
## 0.25
Becuase the transformation here is for a single parameter, the determinant is straightforward. However, for a multi-dimensional mapping, we would need an extra step to compute the determinant. The det()
function can be used for this purpose, but bear in mind that it almost always makes sense to use the logarithm = TRUE
argument to det()
.
7.3.6 Example: Bernoulli Random Effects Model
Suppose we observe longitudinal binary data that follow
\[\begin{eqnarray*} y_{ij} & \sim & \text{Bernoulli}(\lambda_{ij})\\ \text{logit}(\lambda_{ij}) & = & \alpha_i\\ \alpha_i & \sim & \mathcal{N}(\mu, \sigma^2)\\ \mu & \sim & \mathcal{N}(0, D)\\ \sigma & \sim & \text{InverseGamma}(a, b) \end{eqnarray*}\]
where \(i=1,\dots,n\) and \(j=1,\dots,G\).
For this example, assume that the hyperparameters \(a\), \(b\), and \(D\) are known. The goal is to sample from the posterior of \(\mu\) and \(\sigma\). The joint posterior of the parameters given the data is
\[\begin{eqnarray*} \pi(\mu, \sigma, \mathbf{\alpha}\mid\mathbf{y}) & \propto & \prod_{i=1}^n \left[ \prod_{j=1}^G p(y_{ij}\mid\text{logit}^{-1}(\alpha_i) \right]\\ & & \times \varphi(\alpha_i\mid\mu,\sigma^2) \varphi(\mu\mid 0, D) g(\sigma) \end{eqnarray*}\]
where \(p()\) is the Bernoulli density, \(\varphi\) is the Normal density, and \(g()\) is the inverse gamma density.
To implement the Gibbs sampler, we need to cycle through three classes of full conditional distributions. First is the full conditional for \(\sigma\), which can be written in closed form given the prior. In order to compute the full conditional, we simply select everything from the full posterior that has a \(\sigma\) in it.
\[\begin{eqnarray*} p(\sigma\mid\cdot) & \propto & \left[\prod_{i=1}^n \varphi(\alpha_i\mid\mu,\sigma^2)\right] g(\sigma)\\ & = & \text{InverseGamma}\left(a + \frac{n}{2}, b+\frac{1}{2}\sum(\alpha_i-\mu)^2\right) \end{eqnarray*}\]
Similarly, we can pick out all the components of the full posterior that have a \(\mu\) and the full conditional distribution for \(\mu\) is
\[\begin{eqnarray*} p(\mu\mid\cdot) & \propto & \left[\prod_{i=1}^n \varphi(\alpha_i\mid\mu,\sigma^2)\right] \varphi(\mu\mid,0,D)\\ & = & \mathcal{N}\left( \frac{D}{D+\sigma^2/n}\,\bar{\alpha},\,\frac{\sigma^2/n}{\sigma^2/n + D}D \right) \end{eqnarray*}\]
where \(\bar{\alpha} = \frac{1}{n}\sum\alpha_i\).
Finally, the full conditionals for \(\alpha_i\) can be compute independently because the \(\alpha_i\)s are assumed to be independent.
\[ p(\alpha_i\mid\cdot) \propto \left[\prod_{j=1}^G p(y_{ij}\mid\text{logit}^{-1}(\alpha_i)) \right] \varphi(\alpha_i\mid\mu,\sigma^2). \]
Unfortunately, there is no way to simplify this further with the combination of the Bernoulli likelihood and the Normal random effects distribution. Therefore, some proposal/acceptance step will have to be run to sample from the \(\alpha_i\)s.