# 3 Markov chain Monte Carlo

The focus of this chapter is on one class of simulation-based algorithms for approximating complex distributions without having to sample directly from those distributions. These methods have been tremendously successful in modern computational statistics.

Discrete time Markov processes on general state spaces, or Markov chains as we shall call such processes here, are described in some detail in the Applied Stochastic Processes module. Here, we investigate one particular use of these processes, as a mechanism for obtaining samples suitable for approximating complex distributions of interest.

Some definitions, background and useful results on Markov chains are provided in Appendix A.

## 3.1 The Basis of Markov chain Monte Carlo (MCMC)

In Chapter 2 we saw various methods for obtaining samples from distributions as well as some uses for such samples. The range of situations in which we like to make use of samples from distributions is, unfortunately, somewhat wider than the range of situations in which we can obtain such samples (easily, efficiently, or at all in some cases).

As the earlier Applied Stochastic Processes course will have provided a sound introduction to Markov chains for anyone able to attend it and not already familiar with them, we don’t repeat that introduction here. Appendix A may provide a useful reference if these things are new to you (or, indeed, if it’s some time since you’ve thought about them) and here we confine ourselves to a few essential definitions.

There are various conventions in the literature, but we will use the term Markov chain to refer to any discrete time Markov process, whatever may be its state space. (Sometimes the state space is also assumed to be discrete.) We’ll assume here that the target distribution $$f$$ is a continuous distribution over $$E \subseteq {\mathbb{R}}^d$$ for definiteness and for compact notation, but it should be realised that these techniques can be used in much greater generality.

For definiteness, we’ll let $$K$$ denote the density of the transition kernel of a Markov chain and we’ll look for $$f$$-invariant Markov kernels, i.e. those for which: $\int_{x \in E} \int_{x^\prime \in A} f({\boldsymbol{x}}) K({\boldsymbol{x}},{\boldsymbol{x}}^\prime) d{\boldsymbol{x}} d{\boldsymbol{x}}^\prime = \int_{x\in A} f({\boldsymbol{x}}) d{\boldsymbol{x}}$ for every measurable set $$A$$. We have assumed here that $$f$$ admits a density and $$K(x,\cdot)$$ admits a density for any $${\boldsymbol{x}}$$, and we can of course simplify this slightly and write the invariance condition as $$\int_{E}f({\boldsymbol{x}})K({\boldsymbol{x}},{\boldsymbol{x}}^\prime)d{\boldsymbol{x}}= f({\boldsymbol{x}}^\prime)$$. Intuitively, an $$f$$-invariant Markov kernel is one which preserves the distribution $$f$$ in the sense that if one samples $${\boldsymbol{X}}\sim f$$ and then conditional upon $${\boldsymbol{X}}$$ taking the value $${\boldsymbol{x}}$$, sample $${\boldsymbol{X}}^\prime \sim K({\boldsymbol{x}},\cdot)$$ then, marginally, $${\boldsymbol{X}}^\prime \sim f$$.

If $${\boldsymbol{X}}_0,{\boldsymbol{X}}_1,\dots$$ is a Markov chain with some initial distribution $$\mu_0$$ and $$f$$-invariant transition $$K$$ then its clear that if at any time $$s$$ that $${\boldsymbol{X}}_s \sim f$$, then for every $$t > s$$ we have $${\boldsymbol{X}}_t \sim f$$. That is, if the marginal distribution of the state of the Markov chain is $$f$$ at any time then the marginal distribution of the state of the Markov chain at any later time is also $$f$$. This encourages us to consider using as an estimator of $$I_h = \int h({\boldsymbol{x}}) f({\boldsymbol{x}}) d{\boldsymbol{x}}$$ the sample path average of the function of interest: \begin{aligned} \hat{I}_h^{MCMC} = \frac{1}{t} \sum_{i=1}^t h({\boldsymbol{X}}_i)\end{aligned} which is of exactly the same form as the simple Monte Carlo estimator—except that it makes use of the trajectory of a Markov chain with $$f$$ its invariant distribution instead of a collection of iid realisations from $$f$$ itself.

However, there are two other issues which we need to consider before we could expect this estimator to have good properties:

• Is any $${\boldsymbol{X}}_t \sim f$$? We know that if this is ever true it remains true for all subsequent times but we don’t know that this situation is ever achieved.

• How does the dependence between consecutive states influence the estimator? Consider $$X_1 \sim f$$ and $$X_i = X_{i-1}$$ for all $$i > 1$$. This is a Markov chain whose states are all marginally distributed according to $$f$$, but we wouldn’t expect $$\hat{I}_h^{MCMC}$$ to behave well if we used such a chain.

Next we’ll briefly consider some problems which could arise and what behaviour we might need in order to have some confidence in an MCMC estimator before seeing some results which formally justify the approach.

### 3.1.1 Selected Properties and Potential Failure Modes

Not all $$f$$-invariant Markov kernels are suitable for use in Monte Carlo simulation. In addition to preserving the correct distribution, we need some notion of mixing or forgetting: we need the chain to move around the space in such a way that serial dependencies decay over time. The identity transition which sets $$X_t = X_{t-1}$$ with probability one is $$f$$-invariant for every $$f$$, but is of little use for MCMC purposes.

There are certain properties of Markov chains that are important because we can use them to ensure that various pathological things don’t happen in the course of our simulations. We give a very brief summary here; see Appendix A for a slightly more formal presentation and some references.

##### Periodicity

A Markov chain is periodic if the state space can be partitioned by a collection of more than one disjoint sets in such a way that the chain moves cyclically between elements of this partition. If such a partition exists then the number of elements in it is known as the period of the Markov chain; otherwise, the chain is termed aperiodic. In Markov chain Monte Carlo algorithms we generally require that the simulated chains are aperiodic (otherwise it’s clear that the chain cannot ever forget in which element of the partition it started and, if initialised at some value, $${\boldsymbol{x}}_0$$, will have disjoint support at time $$t$$ and $$t+1$$ for all $$t$$ and hence can never reach distribution $$f$$).

##### Reducibility

A discrete space Markov chain is reducible if a chain cannot evolve from (almost) any point in the state space to any other; otherwise it is irreducible. In the case of chains defined on continuous spaces it is necessary to introduce a reference distribution, say $$\phi$$, and to term the chain $$\phi$$-irreducible if any set of positive probability under $$\phi$$ can be reached with positive probability from any starting point. In MCMC applications we require that the chains we use are $$f$$-irreducible; otherwise, the parts of the space that would be explored by the evolution of the chain would depend strongly on the starting value—and this would remain true even if the chain were run for an infinitely long time.

##### Transience

Another significant type of undesirable behaviour is transience. Loosely speaking, a Markov chain is transient if it is expected to visit sets of positive probability under its invariant distribution only finitely often, even if permitted to run for infinite time. This means that in some sense the chain tends to drift off to infinity. In order for results like the law of large numbers to be adapted to the Markov chain setting, we require that sets of positive probability would in principle be visited arbitrarily often if the chain were to run for long enough. A $$\phi$$-irreducible Markov chain is recurrent if the expected number of returns to any set of positive $$\phi$$-probability is infinite. We’ll focus on chains which have a stronger property. A $$\phi$$-irreducible Markov chain is Harris recurrent if the probability that any set which has positive probability under $$\phi$$ is visited infinitely often by the chain (over an infinite time period) is one for all starting values.

### 3.1.2 A Few Theoretical Results

Formally, we justify Markov chain Monte Carlo by considering the asymptotic properties of the Markov chain (actually, in some situations it’s possible to deal with finite sample properties but these are rather specialised settings). The following two results are two (amongst many similar theorems with subtly different conditions) which are to MCMC as the strong law of large numbers and the central limit theorem are to simple Monte Carlo.

Theorem 3.1 (An Ergodic Theorem). If $$\left({\boldsymbol{X}}_{i}\right)_{i \in {\mathbb{N}}}$$ is an $$f$$-invariant, Harris recurrent Markov chain, then the following strong law of large numbers holds (convergence is with probability 1) for any integrable function $$h:E\rightarrow{\mathbb{R}}$$: $\lim_{t\rightarrow \infty} \frac1t \sum\limits_{i=1}^t h({\boldsymbol{X}}_i) \overset{a.s.}{=}\int h(x) f(x) dx.$

Theorem 3.2 (A Markov Chain Central Limit Theorem). Under technical regularity conditions (see Jones (2004) for a summary of various combinations of conditions) it is possible to obtain a central limit theorem for the ergodic averages of a Harris recurrent, $$f$$-invariant Markov chain, and a function $$h:E\rightarrow {\mathbb{R}}$$ which has at least two finite moments (depending upon the combination of regularity conditions assumed, it may be necessary to have a finite moment of order $$2+\delta$$ for some $$\delta > 0$$): \begin{aligned} & \lim_{t\rightarrow\infty} \sqrt{t} \left[\frac{1}{t}\sum\limits_{i=1}^t h({\boldsymbol{X}}_i) - \int h(x) f(x) dx \right] \overset{\mathcal{D}}{=}{\textsf{N}\left( 0,\sigma^2(h) \right)}, \\ & \sigma^2(h) = {\mathbb{E}_{}\left[(h({\boldsymbol{X}}_1) - \bar{h})^2\right]} + 2\sum_{k=2}^\infty {\mathbb{E}_{}\left[(h({\boldsymbol{X}}_1) - \bar{h})(h({\boldsymbol{X}}_k) - \bar{h})\right]},\end{aligned} where $$\bar{h} = \int h(x) \mu(x) dx$$.

Although the variance is not a straightforward thing to calculate in practice (it’s rarely possible) this expression is informative. It quantifies what we would expect intuitively, that the stronger the (positive3) relationship between successive elements of the chain the higher the variance of the resulting estimates. If $$K(x,\cdot) =f(\cdot)$$ so that we obtain a sequence of iid samples form the target distribution then we recover the variance of the simple Monte Carlo estimator.

## 3.2 Constructing MCMC Algorithms

So, having established that we can in principle employ Markov chains with $$f$$-invariant kernels to approximate expectations with respect to $$f$$, a natural question is how can we construct an $$f$$-invariant Markov kernel? Fortunately, there are some general methods which are very widely applicable.

### 3.2.1 Gibbs Samplers

We begin with a motivating example which shows that for realistic problems it may be possible to characterise and sample from the full conditional distributions associated with each variable separately; that is, the conditional distribution of any one variable given any particular value for all of the other variables, even when it is not possible to sample directly form their joint distribution. We’ll use this to motivate a strategy of sampling iteratively from these full conditional distributions in order to obtain a realisation of a Markov chain, before going on to demonstrate that this approach falls within the MCMC framework described above.

Example 3.1 (Poisson change point model). Assume the following Poisson model of two regimes for $$n$$ random variables $$Y_1,\dots,Y_n$$. \begin{aligned} Y_i&\sim \textsf{Poi}(\lambda_1)&\textrm{ for } i&=1,\dots,M,\\ Y_i&\sim \textsf{Poi}(\lambda_2)&\textrm{ for } i&=M+1,\dots,n.\end{aligned}

A conjugate prior distribution for $$\lambda_j$$ is the $${\textsf{Gamma}\left( \alpha_j,\beta_j \right)}$$ distribution with density $f(\lambda_j)=\frac{1}{\Gamma(\alpha_j)}\lambda_j^{\alpha_j-1}\beta_j^{\alpha_j}\exp(-\beta_j\lambda_j).$ The joint distribution of $$Y_1,\dots,Y_n$$, $$\lambda_1$$, $$\lambda_2$$, and $$M$$ is $\begin{gathered} f(y_1,\dots,y_n,\lambda_1,\lambda_2,M)= \left(\prod_{i=1}^M \frac{\exp(-\lambda_1)\lambda_1^{y_i}}{y_i!}\right)\cdot \left(\prod_{i=M+1}^n \frac{\exp(-\lambda_2)\lambda_2^{y_i}}{y_i!}\right)\\ \cdot\frac{1}{\Gamma(\alpha_1)}\lambda_1^{\alpha_1-1}\beta_1^{\alpha_1}\exp(-\beta_1\lambda_1)\cdot\frac{1}{\Gamma(\alpha_2)}\lambda_2^{\alpha_2-1}\beta_2^{\alpha_2}\exp(-\beta_2\lambda_2).\end{gathered}$ If $$M$$ is known, the posterior distribution of $$\lambda_1$$ has the density $f(\lambda_1\mid Y_1,\dots,Y_n,M)\propto \lambda_1^{\alpha_1-1+\sum_{i=1}^M y_i}\exp(-(\beta_1+M)\lambda_1),$ so \begin{align} \lambda_1\mid Y_1,\dots Y_n,M &\sim {\textsf{Gamma}\left( \alpha_1 + \sum_{i=1}^M y_i,\beta_1 + M \right)},\tag{3.1}\\ \lambda_2\mid Y_1,\dots Y_n,M &\sim {\textsf{Gamma}\left( \alpha_2 + \sum_{i=M+1}^n y_i,\beta_2 + n-M \right)}.\tag{3.2} \end{align} Now assume that we do not know the change point $$M$$ and that we assume a uniform prior on the set $$\{1,\dots,M-1\}.$$ It is easy to compute the distribution of $$M$$ given the observations $$Y_1,\dots Y_n$$, and $$\lambda_1$$ and $$\lambda_2$$. It is a discrete distribution with probability density function proportional to $$$p(M)\propto \lambda_1^{\sum_{i=1}^M y_i} \cdot \lambda_2^{\sum_{i=M+1}^n y_i}\cdot \exp((\lambda_2-\lambda_1)\cdot M).\tag{3.3}$$$ The conditional distributions in (3.1) to (3.3) are all easy to sample from. It is however rather difficult to sample from the joint posterior of $$(\lambda_1,\lambda_2,M)$$.

The example above suggests the strategy of alternately sampling from the (full) conditional distributions ((3.1) to (3.3) in the example). This tentative strategy however raises some questions.

• Is the joint distribution uniquely specified by the conditional distributions? We know that it is not determined by the collection of marginal distributions and so this is an important question (an algorithm which makes use of only these distributions could only be expected to provide information about the joint distribution if these conditionals do characterise that distribution).

• Sampling alternately from the conditional distributions yields a Markov chain: the newly proposed values only depend on the present values, not the past values. Will this approach yield a Markov chain with the correct invariant distribution? Will the Markov chain converge to the invariant distribution?

#### 3.2.1.1 The Hammersley–Clifford Theorem

We begin by addressing the first of these questions via a rather elegant result known as the Hammersley–Clifford Theorem, although Hammersley and Clifford never actually published the result.

Definition 3.1 (Positivity condition). A distribution with density $$f(x_1,\dots,x_p)$$ and marginal densities $$f_{X_i}(x_i)$$ is said to satisfy the positivity condition if $$f(x_1,\dots,x_p)>0$$ for all $$x_1,\dots,x_p$$ with $$f_{X_i}(x_i)>0$$.

The positivity condition thus implies that the support of the joint density $$f$$ is the Cartesian product of the support of the marginals $$f_{X_i}$$.

Theorem 3.3 (Hammersley–Clifford). Let $$(X_1,\dots,X_p)$$ satisfy the positivity condition and have joint density $$f(x_1,\dots,x_p)$$. Then for all $$(\xi_1,\dots,\xi_p)\in \textrm{supp}(f)$$ $f(x_1,\dots,x_p)\propto\prod_{j=1}^p \frac{ f_{X_{j}\mid X_{-j}}(x_{j}\mid x_{1},\dots,x_{j-1},\xi_{j+1},\dots,\xi_{p})}{ f_{X_{j}\mid X_{-j}}(\xi_{j}\mid x_{1},\dots,x_{j-1},\xi_{j+1},\dots,\xi_{p})}.$

Proof. We have $$$f(x_1,\dots,x_{p-1},x_p)=f_{X_p\mid X_{-p}}(x_p\mid x_1,\dots,x_{p-1}) f(x_1,\dots,x_{p-1})\tag{3.4}$$$ and by exactly the same argument $$$f(x_1,\dots,x_{p-1},\xi_p)=f_{X_p\mid X_{-p}}(\xi_p\mid x_1,\dots,x_{p-1}) f(x_1,\dots,x_{p-1}),\tag{3.5}$$$ thus, using (3.4) and (3.5) in turn, \begin{align} f(x_1,\dots,x_p)&=\underbrace{f(x_1,\dots,x_{p-1})}_{= f(x_1,\dots,,x_{p-1},\xi_p)/f_{X_p\mid X_{-p}}(\xi_p\mid x_1,\dots,x_{p-1})}f_{X_p\mid X_{-p}}(x_p\mid x_1,\dots,x_{p-1}) \\ &=f(x_1,\dots,x_{p-1},\xi_p)\frac{f_{X_p\mid X_{-p}}(x_p\mid x_1,\dots,x_{p-1})}{f_{X_p\mid X_{-p}}(\xi_p\mid x_1,\dots,x_{p-1})}\\ &=\dots\\ &=f(\xi_1,\dots,\xi_p) \frac{f_{X_1\mid X_{-1}}(x_1\mid \xi_2,\dots,\xi_{p})}{f_{X_1\mid X_{-1}}(\xi_1\mid \xi_2,\dots,\xi_{p})} \cdots \frac{f_{X_p\mid X_{-p}}(x_p\mid x_1,\dots,x_{p-1})}{f_{X_p\mid X_{-p}}(\xi_p\mid x_1,\dots,x_{p-1})}. \end{align} The positivity condition guarantees that the conditional densities are non-zero.

Note that the Hammersley–Clifford theorem does not guarantee the existence of a joint probability distribution for every choice of conditionals, as the following example shows. In Bayesian modelling such problems arise most often when using improper prior distributions.

Example 3.2 Consider the following “model” \begin{aligned} X_1\mid X_2 &\sim {\textsf{Exp}\left( \lambda X_2 \right)},\\ X_2\mid X_1 &\sim {\textsf{Exp}\left( \lambda X_1 \right)},\end{aligned} for which it would be easy to design a Gibbs sampler. Trying to apply the Hammersley–Clifford theorem, we obtain $f(x_1,x_2)\propto\frac{f_{X_1\mid X_2}(x_1\mid \xi_2) \cdot f_{X_2\mid X_1}(x_2\mid x_1)}{ f_{X_1\mid X_2}(\xi_1\mid \xi_2) \cdot f_{X_2\mid X_1}(\xi_2\mid x_1)}= \frac{\lambda \xi_2\exp(-\lambda x_1\xi_2) \cdot \lambda x_1 \exp(-\lambda x_1x_2)}{\lambda \xi_2\exp(-\lambda \xi_1\xi_2) \cdot \lambda x_1 \exp(-\lambda x_1\xi_2)} \propto \exp(-\lambda x_1x_2)$ The integral $$\iint \exp(-\lambda x_1x_2) \;dx_1 \; dx_2$$, however, is not finite: there is no two-dimensional probability distribution with $$f(x_1,x_2)$$ as its density.

#### 3.2.1.2 Gibbs Sampling Algorithm

The generic Gibbs sampler is widely accepted as being first proposed by Geman and Geman (1984) and popularised within the general statistical community by Gelfand and Smith (1990). Denote $$x_{-i}:=(x_1,\dots,x_{i-1},x_{i+1},\dots,x_p)$$.

Algorithm 3.1 ((Systematic sweep) Gibbs sampler). Starting with $$(X_1^{(0)},\dots,X_p^{(0)})$$ iterate for $$t=1,2,\dots$$

• 1. Draw $$X_1^{(t)}\sim f_{X_1\mid X_{-1}}(\cdot\mid X^{(t-1)}_2,\dots,X^{(t-1)}_p)$$.

• $$\vdots$$

• j. Draw $$X_j^{(t)}\sim f_{X_j\mid X_{-j}}(\cdot\mid X^{(t)}_1,\dots,X^{(t)}_{j-1},X^{(t-1)}_{j+1},\dots,X^{(t-1)}_p)$$.

• $$\vdots$$

• p. Draw $$X_p^{(t)}\sim f_{X_p\mid X_{-p}}(\cdot\mid X^{(t)}_1,\dots,X^{(t)}_{p-1})$$.

Figure 3.1 illustrates the Gibbs sampler. The conditional distributions used in the Gibbs sampler are often referred to as full conditionals (being conditional upon everything except the variable being sampled at each step). Note that the Gibbs sampler is not reversible. Liu, Wong, and Kong (1995) proposed the following algorithm that yields a reversible chain.

Algorithm 3.2 (Random sweep Gibbs sampler). Starting with $$(X_1^{(0)},\dots,X_p^{(0)})$$ iterate or $$t=1,2,\dots$$

1. Draw an index $$j$$ from a distribution on $$\{1,\dots,p\}$$ (e.g. uniform).

2. Draw $$X_j^{(t)}\sim f_{X_j\mid X_{-j}}(\cdot\mid X^{(t-1)}_1,\dots,X^{(t-1)}_{j-1},X^{(t-1)}_{j+1},\dots,X^{(t-1)}_p)$$, and set $$X^{(t)}_{\iota}:=X^{(t-1)}_{\iota}$$ for all $$\iota\neq j$$.

#### 3.2.1.3 Convergence of Gibbs Samplers

First we must establish whether that joint distribution $$f(x_1,\dots,x_p)$$ is indeed the stationary distribution of the Markov chain generated by the Gibbs sampler. All the results in this section will be derived for the systematic scan Gibbs sampler (Algorithm 3.1). Very similar results hold for the random scan Gibbs sampler (Algorithm 3.2).

To proceed with such an analysis, we first have to determine the transition kernel corresponding to the Gibbs sampler.

Lemma 3.1 The transition kernel of the Gibbs sampler is \begin{aligned} K({\boldsymbol{x}}^{(t-1)},{\boldsymbol{x}}^{(t)}) &= f_{X_1\mid X_{-1}} (x_1^{(t)}\mid x_2^{(t-1)},\dots,x_p^{(t-1)})\cdot f_{X_2\mid X_{-2}} (x_2^{(t)}\mid x_1^{(t)},x_3^{(t-1)},\dots,x_p^{(t-1)}) \cdot \dots \cdot f_{X_p\mid X_{-p}} (x_p^{(t)}\mid x_1^{(t)},\dots,x_{p-1}^{(t)})\end{aligned}

Proof. We have, for any (measurable $$\mathcal{X}$$): \begin{aligned} {\mathbb{P}}({\boldsymbol{x}}^{(t)}\in \mathcal{X}\mid {\boldsymbol{x}}^{(t-1)}={\boldsymbol{x}}^{(t-1)}) &= \int_{\mathcal{X}} f_{({\boldsymbol{x}}^{t}\mid {\boldsymbol{x}}^{(t-1)})}({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)})\;d{\boldsymbol{x}}^{(t)}\\ &=\int_{\mathcal{X}} \underbrace{f_{X_1\mid X_{-1}}(x_1^{(t)}\mid x_2^{(t-1)},\dots,x_p^{(t-1)})}_{\textrm{corresponds to step 1. of the algorithm}}\cdot \underbrace{f_{X_2\mid X_{-2}} (x_2^{(t)}\mid x_1^{(t)},x_3^{(t-1)},\dots,x_p^{(t-1)})}_{\textrm{corresponds to step 2. of the algorithm}} \cdot \dots \\ & \qquad \cdot \underbrace{f_{X_p\mid X_{-p}} (x_p^{(t)}\mid x_1^{(t)},\dots,x_{p-1}^{(t)})}_{\textrm{corresponds to step p. of the algorithm}}\;d{\boldsymbol{x}}^{(t)}.\end{aligned}

Proposition 3.1 The joint distribution $$f(x_1,\dots,x_p)$$ is indeed the invariant distribution of the Markov chain $$({\boldsymbol{x}}^{(0)},{\boldsymbol{x}}^{(1)},\dots)$$ generated by the Gibbs sampler.

Proof. Assume that $${\boldsymbol{x}}^{(t-1)} \sim f$$, then \begin{aligned} {\mathbb{P}}({\boldsymbol{x}}^{(t)}\in\mathcal{X}) &=\int_{\mathcal{X}} \int f({\boldsymbol{x}}^{(t-1)}) K({\boldsymbol{x}}^{(t-1)},{\boldsymbol{x}}^{(t)}) \; d{\boldsymbol{x}}^{(t-1)}\;d{\boldsymbol{x}}^{(t)}\\ &=\int_{\mathcal{X}} \int \cdots \underbrace{\underbrace{\int f(x_1^{(t-1)},\dots,x_p^{(t-1)})\;dx_1^{(t-1)} }_{=f(x_2^{(t-1)},\dots,x_p^{(t-1)})} f_{X_1\mid X_{-1}} (x_1^{(t)}\mid x_2^{(t-1)},\dots,x_p^{(t-1)})}_{=f(x_1^{(t)},x_2^{(t-1)},\dots,x_p^{(t-1)})} \cdots \\ & \qquad \qquad \cdot f_{X_p\mid X_{-p}} (x_p^{(t)}\mid x_1^{(t)},\dots,x_{p-1}^{(t)}) dx_2^{(t-1)} \dots dx_p^{(t-1)} d{\boldsymbol{x}}^{(t)}\\ &= \int_{\mathcal{X}} \int \cdots \underbrace{\underbrace{\int f(x_1^{(t)},x_2^{(t-1)},\dots,x_p^{(t-1)})\;dx_2^{(t-1)} }_{=f(x_1^{(t)},x_3^{(t-1)},\dots,x_p^{(t-1)})} f_{X_2\mid X_{-2}} (x_2^{(t)}\mid x_1^{(t)},x_3^{(t-1)},\dots,x_p^{(t-1)})}_{=f(x_1^{(t)},x_2^{(t)},x_3^{(t-1)},\dots,x_p^{(t-1)})} \cdots \\ & \qquad \qquad \cdot f_{X_p\mid X_{-p}} (x_p^{(t)}\mid x_1^{(t)},\dots,x_{p-1}^{(t)}) dx_3^{(t-1)} \dots dx_p^{(t-1)} d{\boldsymbol{x}}^{(t)}\\ &=\dots\\ &=\int_{\mathcal{X}} \underbrace{\underbrace{\int f(x_1^{(t)},\dots,x_{p-1}^{(t)},x_p^{(t-1)}) \;dx_p^{(t-1)}}_{=f(x_1^{(t)},\dots,x_{p-1}^{(t)})} f_{X_p\mid X_{-p}}(x_p^{(t)}\mid x_1^{(t)},\dots,x_{p-1}^{(t)})}_{=f(x_1^{(t)},\dots,x_p^{(t)})} d{\boldsymbol{x}}^{(t)}\\ &=\int_{\mathcal{X}}f(x_1^{(t)},\dots,x_p^{(t)})\;d{\boldsymbol{x}}^{(t)}\end{aligned} Thus $$f$$ is the density of $${\boldsymbol{x}}^{(t)}$$ (if $${\boldsymbol{x}}^{(t-1)}\sim f$$).

So far we have established that $$f$$ is indeed the invariant distribution of the Gibbs sampler. Next, we have to analyse under which conditions the Markov chain generated by the Gibbs sampler will converge to $$f$$.

First of all we have to study under which conditions the resulting Markov chain is irreducible (really, we mean $$f$$-irreducible, of course, here and in the following we mean “irreducibility” with respect to the target distribution $$f$$). The following example shows that such irreducibility does not hold for every possible target distribution.

Example 3.3 (Reducible Gibbs sampler). Consider Gibbs sampling from the uniform distribution on $$C_1\cup C_2$$ with $$C_1:=\{(x_1,x_2):\|(x_1,x_2)-(1,1)\|\leq 1\}$$ and $$C_2:=\{(x_1,x_2):\|(x_1,x_2)-(-1,-1)\|\leq 1\}$$, i.e. $f(x_1,x_2)=\frac{1}{2\pi} \mathbb{I}_{C_1\cup C_2}(x_1,x_2).$ Figure 3.2 shows the density as well the first few samples obtained by starting a Gibbs sampler with $$X_1^{(0)}<0$$ and $$X_2^{(0)}<0$$.

It is easy to that when the Gibbs sampler is started in $$C_1$$ it will stay there and never reach $$C_2$$. The reason for this is that the conditional distribution $$X_2\mid X_1$$ ($$X_1\mid X_2$$) is for $$X_1<0$$ ($$X_2<0$$) entirely concentrated on $$C_1$$.

The following proposition gives a sufficient condition for irreducibility (and thus the recurrence) of the Markov chain generated by the Gibbs sampler. There are less strict conditions for the irreducibility and aperiodicity of the Markov chain generated by the Gibbs sampler (see e.g. Robert and Casella 2004 Lemma 10.11).

Proposition 3.2 If the joint distribution $$f(x_1,\dots,x_p)$$ satisfies the positivity condition, the Gibbs sampler yields an $$f$$-irreducible, recurrent Markov chain.

Proof. Let $$\mathcal{X}\subseteq \textrm{supp}(f)$$ be a set with $$\int_{\mathcal{X}} f(x_1^{(t)},\dots,x_p^{(t)}) d(x_1^{(t)},\dots,x_p^{(t)})>0$$. Then $\int_{\mathcal{X}} K({\boldsymbol{x}}^{(t-1)},{\boldsymbol{x}}^{(t)}) d{\boldsymbol{x}}^{(t)} = \int_{\mathcal{X}} \underbrace{f_{X_1\mid X_{-1}} (x_1^{(t)}\mid x_2^{(t-1)},\dots,x_p^{(t-1)})}_{>0\textrm{ (on a set of non-zero measure)}}\cdots \underbrace{f_{X_p\mid X_{-p}} (x_p^{(t)}\mid x_1^{(t)},\dots,x_{p-1}^{(t)})}_{>0\textrm{ (on a set of non-zero measure)}} d{\boldsymbol{x}}^{(t)}>0.$ Thus the Markov Chain $$({\boldsymbol{x}}^{(t)})_t$$ is strongly $$f$$-irreducible. As $$f$$ is the invariant distribution of the Markov chain, it is recurrent.

If the transition kernel is absolutely continuous with respect to the dominating measure, then recurrence even implies Harris recurrence (see e.g. Robert and Casella 2004 Lemma 10.9).

Now we have established all the necessary ingredients to state an ergodic theorem for the Gibbs sampler, which is a direct consequence of Theorems A.1 and A.2.

Theorem 3.4 If the Markov chain generated by the Gibbs sampler is irreducible and recurrent (which is e.g. the case when the positivity condition holds), then for any integrable function $$h:E\rightarrow{\mathbb{R}}$$: $\lim_{n\rightarrow \infty} \frac1n \sum\limits_{t=1}^n h({\boldsymbol{x}}^{(t)}) \rightarrow {\mathbb{E}_{f}\left[ h({\boldsymbol{X}})\right]}$ for almost every starting value $${\boldsymbol{x}}^{(0)}$$. If the chain is Harris recurrent, then the above result holds for every starting value $${\boldsymbol{x}}^{(0)}$$.

Theorem 3.4 guarantees that we can approximate expectations $${\mathbb{E}_{f}\left[ h({\boldsymbol{x}})\right]}$$ by their empirical counterparts using a single Markov chain.

Example 3.4 Assume that we want to use a Gibbs sampler to estimate $${\mathbb{P}}(X_1\geq 0,X_2\geq 0)$$ for a $${\textsf{N}\left( \left(\begin{array}{c}\mu_1\\\mu_2\end{array}\right),\left(\begin{array}{cc}\sigma^2_{1}&\sigma_{12}\\\sigma_{12}&\sigma^2_{2}\end{array}\right) \right)}$$ distribution. The marginal distributions are $X_1\sim {\textsf{N}\left( \mu_1,\sigma_1^2 \right)} \qquad \textrm{and}\qquad X_2\sim {\textsf{N}\left( \mu_2,\sigma_2^2 \right)}.$ In order to construct a Gibbs sampler, we need the conditional distributions $$X_1\mid X_2=x_2$$ and $$X_2\mid X_1=x_1$$. We have4 \begin{aligned} f(x_1,x_2)&\propto\exp\left(-\frac{1}{2}\left(\left(\begin{array}{c}x_1\\x_2\end{array}\right)-\left(\begin{array}{c}\mu_1\\\mu_2\end{array}\right)\right)' \left(\begin{array}{cc}\sigma^2_{1}&\sigma_{12}\\\sigma_{12}&\sigma^2_{2}\end{array}\right)^{-1}\left(\left(\begin{array}{c}x_1\\x_2\end{array}\right)-\left(\begin{array}{c}\mu_1\\\mu_2\end{array}\right)\right)\right)\\ &\propto \exp\left(-\frac{(x_1-(\mu_1+\sigma_{12}/\sigma^2_{22}(x_2-\mu_2)))^2}{2 (\sigma^2_1 - (\sigma_{12})^2/\sigma_2^2)} \right),\end{aligned} i.e. $(X_1\mid X_2=x_2)\sim {\textsf{N}\left( \mu_1+\sigma_{12}/\sigma^2_{2}(x_2-\mu_2),\sigma^2_{1}-(\sigma_{12})^2/\sigma^2_{2} \right)}.$

Thus the Gibbs sampler for this problem consists of iterating for $$t=1,2,\dots$$

1. Draw $$\displaystyle X_1^{(t)} \sim {\textsf{N}\left( \mu_1+\sigma_{12}/\sigma^2_{2}(X^{(t-1)}_2-\mu_2),\sigma^2_{1}-(\sigma_{12})^2/\sigma^2_{2} \right)}$$.

2. Draw $$\displaystyle X_2^{(t)} \sim {\textsf{N}\left( \mu_2+\sigma_{12}/\sigma^2_{1}(X^{(t)}_1-\mu_1),\sigma^2_{2}-(\sigma_{12})^2/\sigma^2_{1} \right)}$$.

Now consider the special case $$\mu_1=\mu_2=0$$, $$\sigma_1^2=\sigma_2^2=1$$ and $$\sigma_{12}=0.3$$. Figures 3.43.6 show the sample paths of this Gibbs sampler.
Using Theorem 3.4 we can estimate $${\mathbb{P}}(X_1\geq 0,X_2\geq 0)$$ by the proportion of samples $$(X_1^{(t)},X_2^{(t)})$$ with $$X_1^{(t)}\geq 0$$ and $$X_2^{(t)}\geq 0$$. Figure 3.3 shows this estimate.

A Gibbs sampler is of course not the optimal way to sample from a $${\textsf{N}\left( {\boldsymbol{\mu}},{\boldsymbol{\Sigma}} \right)}$$ distribution. A more efficient way is: draw $$Z_1,\dots,Z_p \overset{\text{iid}}{\sim}{\textsf{N}\left( 0,1 \right)}$$ and set $$(X_1,\dots,X_p)'={\boldsymbol{\Sigma}}^{1/2} (Z_1,\dots,Z_p)' + {\boldsymbol{\mu}}$$. As we shall see, in some instances the loss of efficiency arising from Gibbs sampling can be very severe.

Note that the realisations $$({\boldsymbol{x}}^{(0)},{\boldsymbol{x}}^{(1)},\dots)$$ form a Markov chain, and are thus not independent, but typically positively correlated. The correlation between the $${\boldsymbol{x}}^{(t)}$$ is larger if the Markov chain moves only slowly (the chain is then said to be slowly mixing). For the Gibbs sampler this is typically the case if the variables $$X_j$$ are strongly (positively or negatively) correlated, as the following example shows.

Example 3.5 (Sampling from a highly correlated bivariate Gaussian).

Figures 3.73.9 show the results obtained when sampling from a bivariate Normal distribution as in Example 3.4, however with $$\sigma_{12}=0.99$$. This yields a correlation of $$\rho(X_1,X_2)=0.99$$. This Gibbs sampler is a lot slower mixing than the one considered in Example 3.4 (and displayed in Figure 3.43.6): due to the strong correlation the Gibbs sampler can only perform very small movements. This makes subsequent samples $$X_j^{(t-1)}$$ and $$X_j^{(t)}$$ highly correlated and this leads to slower convergence, as the plot of the estimated densities show (compare Figures 3.53.6 with 3.83.9).

### 3.2.2 Metropolis and Beyond

Although the Gibbs sampler is appealing and appears generally applicable, there are some difficulties with it. In particular, it requires that the full conditional distributions are known and can be sampled from (and in order to be efficient these need to be the full conditional distributions of groups of highly-dependent subsets of the random variables which can further complicate the problem). We turn our attention now to a still more broadly-applicable class of MCMC algorithms based around an accept/reject mechanism.

The Metropolis–Hastings algorithm dates back to Metropolis et al. (1953) and Hastings (1970). Like rejection sampling (Algorithm 2.1), the Metropolis–Hastings algorithm is based on proposing values sampled from a proposal distribution, which are then accepted with a certain probability that reflects how likely it is that they are from the target distribution $$f$$.

The main drawback of the rejection sampling algorithm is that it is often very difficult to come up with a suitable proposal distribution that leads to an efficient algorithm. One way around this problem is to allow for “local updates”, i.e. let the proposed value depend on the last accepted value. This makes it easier to come up with a suitable (conditional) proposal, however at the price of yielding a Markov chain instead of a sequence of independent realisations.

Algorithm 3.3 (Metropolis–Hastings). Starting with $${\boldsymbol{X}}^{(0)}:=(X_1^{(0)},\dots,X_p^{(0)})$$ iterate for $$t=1,2,\dots$$

1. Draw $${\boldsymbol{X}}\sim q(\cdot\mid {\boldsymbol{X}}^{(t-1)})$$.

2. Compute $$$\alpha({\boldsymbol{X}}\mid {\boldsymbol{X}}^{(t-1)})=\min\left\{1,\frac{\displaystyle f({\boldsymbol{X}})\cdot q({\boldsymbol{X}}^{(t-1)}\mid {\boldsymbol{X}})}{\displaystyle f({\boldsymbol{X}}^{(t-1)})\cdot q({\boldsymbol{X}}\mid {\boldsymbol{X}}^{(t-1)})} \right\}. \tag{3.6}$$$

3. With probability $$\alpha({\boldsymbol{X}}\mid {\boldsymbol{X}}^{(t-1)})$$ set $${\boldsymbol{X}}^{(t)}={\boldsymbol{X}}$$, otherwise set $${\boldsymbol{X}}^{(t)}={\boldsymbol{X}}^{(t-1)}$$.

Figure 3.10 illustrates the Metropolis–Hastings algorithm. Note that if the algorithm rejects the newly proposed value (open disks joined by dotted lines in Figure 3.10) it accepts its current value $${\boldsymbol{X}}^{(t-1)}$$ instead. (The Metropolis–Hastings algorithm never rejects.) The probability that the Metropolis–Hastings algorithm accepts the newly proposed state $${\boldsymbol{X}}$$ given that it currently is in state $${\boldsymbol{X}}^{(t-1)}$$ is $a({\boldsymbol{x}}^{(t-1)})=\int \alpha({\boldsymbol{x}}\mid {\boldsymbol{x}}^{(t-1)}) q({\boldsymbol{x}}\mid {\boldsymbol{x}}^{(t-1)}) \; d{\boldsymbol{x}}.$ Just like the Gibbs sampler, the Metropolis–Hastings algorithm generates a Markov chain, whose properties will be discussed in the next section.

Remark. The probability of acceptance (3.6) does not depend on the normalisation constant: if $$f({\boldsymbol{x}})=C\cdot \pi({\boldsymbol{x}})$$, then $\frac{ f({\boldsymbol{x}})\cdot q({\boldsymbol{x}}^{(t-1)}\mid {\boldsymbol{x}})}{ f({\boldsymbol{x}}^{(t-1)})\cdot q({\boldsymbol{x}}\mid {\boldsymbol{x}}^{(t-1)})} = \frac{ C\pi({\boldsymbol{x}})\cdot q({\boldsymbol{x}}^{(t-1)}\mid {\boldsymbol{x}})}{ C\pi({\boldsymbol{x}}^{(t-1)})\cdot q({\boldsymbol{x}}\mid {\boldsymbol{x}}^{(t-1)})} =\frac{ \pi({\boldsymbol{x}})\cdot q({\boldsymbol{x}}^{(t-1)}\mid {\boldsymbol{x}})}{ \pi({\boldsymbol{x}}^{(t-1)})\cdot q({\boldsymbol{x}}\mid {\boldsymbol{x}}^{(t-1)})}.$ Thus $$f$$ only needs to be known up to normalisation constant. Similarly, it is enough to know $$q({\boldsymbol{x}}^{(t-1)}\mid {\boldsymbol{x}})$$ up to a multiplicative constant independent of $${\boldsymbol{x}}^{(t-1)}$$ and $${\boldsymbol{x}}$$.

### 3.2.3 Convergence of Metropolis–Hastings

Lemma 3.2 The transition kernel5 of the Metropolis–Hastings algorithm is $$$K({\boldsymbol{x}}^{(t-1)},{\boldsymbol{x}}^{(t)})=\alpha({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)}) q({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)}) + (1-a({\boldsymbol{x}}^{(t-1)}))\delta_{{\boldsymbol{x}}^{(t-1)}}({\boldsymbol{x}}^{(t)}),\tag{3.7}$$$ where $$\delta_{{\boldsymbol{x}}^{(t-1)}}(\cdot)$$ denotes a probability distribution which places a mass of one at $${\boldsymbol{x}}^{(t-1)}$$.

Proof. We have \begin{align*} {\mathbb{P}}({\boldsymbol{x}}^{(t)}\in \mathcal{X}\mid {\boldsymbol{x}}^{(t-1)}={\boldsymbol{x}}^{(t-1)})= & {\mathbb{P}}({\boldsymbol{x}}^{(t)}\in \mathcal{X}, \textrm{new value accepted}\mid {\boldsymbol{x}}^{(t-1)}={\boldsymbol{x}}^{(t-1)})\\ & {}+{\mathbb{P}}({\boldsymbol{x}}^{(t)}\in \mathcal{X}, \textrm{new value rejected}\mid {\boldsymbol{x}}^{(t-1)}={\boldsymbol{x}}^{(t-1)})\\ =&\int_{\mathcal{X}} \alpha({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)}) q({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)})\;d{\boldsymbol{x}}^{(t)}\\ & {}+ \underbrace{\underbrace{\mathbb{I}_{\mathcal{X}}({\boldsymbol{x}}^{(t-1)})}_{=\int_{\mathcal{X}} \delta_{{\boldsymbol{x}}^{(t-1)}}(d{\boldsymbol{x}}^{(t)})} \underbrace{{\mathbb{P}}(\textrm{new value rejected}\mid {\boldsymbol{x}}^{(t-1)}={\boldsymbol{x}}^{(t-1)})}_{=1-a({\boldsymbol{x}}^{(t-1)})}}_{=\int_{\mathcal{X}} (1-a({\boldsymbol{x}}^{(t-1)}))\delta_{{\boldsymbol{x}}^{(t-1)}}(d{\boldsymbol{x}}^{(t)})}\\ =&\int_{\mathcal{X}}\alpha({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)}) q({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)})\;d{\boldsymbol{x}}^{(t)}\\ & {}+ \int_{\mathcal{X}}(1-a({\boldsymbol{x}}^{(t-1)}))\delta_{{\boldsymbol{x}}^{(t-1)}}(d{\boldsymbol{x}}^{(t)}). \end{align*}

Proposition 3.3 The Metropolis–Hastings kernel (3.7) satisfies the detailed balance condition $K({\boldsymbol{x}}^{(t-1)},{\boldsymbol{x}}^{(t)})f({\boldsymbol{x}}^{(t-1)}) = K({\boldsymbol{x}}^{(t)},{\boldsymbol{x}}^{(t-1)})f({\boldsymbol{x}}^{(t)})$ and thus $$f({\boldsymbol{x}})$$ is the invariant distribution of the Markov chain $$({\boldsymbol{x}}^{(0)},{\boldsymbol{x}}^{(1)},\dots)$$ generated by the Metropolis–Hastings sampler. Furthermore the Markov chain is reversible.

Proof. We have that \begin{aligned} \alpha({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)})q({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)})f({\boldsymbol{x}}^{(t-1)}) &= \min\left\{1,\frac{f({\boldsymbol{x}}^{(t)}) q({\boldsymbol{x}}^{(t-1)}\mid {\boldsymbol{x}}^{(t)})}{f({\boldsymbol{x}}^{(t-1)}) q({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)})} \right\}q({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)})f({\boldsymbol{x}}^{(t-1)})\\ &=\min\left\{f({\boldsymbol{x}}^{(t-1)})q({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)}),f({\boldsymbol{x}}^{(t)}) q({\boldsymbol{x}}^{(t-1)}\mid {\boldsymbol{x}}^{(t)})\right\} \\ &=\min\left\{\frac{f({\boldsymbol{x}}^{(t-1)})q({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)})}{f({\boldsymbol{x}}^{(t)}) q({\boldsymbol{x}}^{(t-1)}\mid {\boldsymbol{x}}^{(t)})},1\right\}q({\boldsymbol{x}}^{(t-1)}\mid {\boldsymbol{x}}^{(t)})f({\boldsymbol{x}}^{(t)})\\ &=\alpha({\boldsymbol{x}}^{(t-1)}\mid {\boldsymbol{x}}^{(t)})q({\boldsymbol{x}}^{(t-1)}\mid {\boldsymbol{x}}^{(t)})f({\boldsymbol{x}}^{(t)}),\end{aligned} and thus \begin{aligned} K({\boldsymbol{x}}^{(t-1)},{\boldsymbol{x}}^{(t)})f({\boldsymbol{x}}^{(t-1)})= {}& \underbrace{\alpha({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)}) q({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)})f({\boldsymbol{x}}^{(t-1)})}_{=\alpha({\boldsymbol{x}}^{(t-1)}\mid {\boldsymbol{x}}^{(t)})q({\boldsymbol{x}}^{(t-1)}\mid {\boldsymbol{x}}^{(t)})f({\boldsymbol{x}}^{(t)})}\\ & {}+ \underbrace{(1-a({\boldsymbol{x}}^{(t-1)}))\underbrace{\delta_{{\boldsymbol{x}}^{(t-1)}}({\boldsymbol{x}}^{(t)})}_{=0 \textrm{ if {\boldsymbol{x}}^{(t)}\neq {\boldsymbol{x}}^{(t-1)}}}f({\boldsymbol{x}}^{(t-1)})}_{(1-a({\boldsymbol{x}}^{(t)}))\delta_{{\boldsymbol{x}}^{(t)}}({\boldsymbol{x}}^{(t-1)})f({\boldsymbol{x}}^{(t)})} & = K({\boldsymbol{x}}^{(t)},{\boldsymbol{x}}^{(t-1)}) f({\boldsymbol{x}}^{(t)}).\end{aligned} The other conclusions follow by Proposition A.2, suitably adapted to the continuous case (i.e. replacing the sums by integrals).

Next we need to examine whether the Metropolis–Hastings algorithm yields an irreducible chain. As with the Gibbs sampler, this is not necessarily the case, as the following example shows.

Example 3.6 (Reducible Metropolis–Hastings). Consider using a Metropolis–Hastings algorithm for sampling from a uniform distribution on $$[0,1]\cup [2,3]$$ and a $${\textsf{U}{\left[x^{(t-1)}-\delta,x^{(t-1)}+\delta\right]}}$$ distribution as proposal distribution $$q(\cdot\mid x^{(t-1)})$$. Figure 3.11 illustrates this example. It is easy to see that the resulting Markov chain is not irreducible if $$\delta\leq 1$$: in this case the chain either stays in $$[0,1]$$ or $$[2,3]$$.

Under mild assumptions on the proposal $$q(\cdot\mid {\boldsymbol{x}}^{(t-1)})$$ one can however establish the irreducibility of the resulting Markov chain:

• If $$q({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)})$$ is positive for all $${\boldsymbol{x}}^{(t-1)},{\boldsymbol{x}}^{(t)}\in \textrm{supp}(f)$$, then it is easy to see that we can reach any set of non-zero probability under $$f$$ within a single step. The resulting Markov chain is thus strongly irreducible. Even though this condition seems rather restrictive, many popular choices of $$q(\cdot\mid {\boldsymbol{x}}^{(t-1)})$$ like multivariate Gaussian or t-distributions fulfil this condition.

• G. Roberts and Tweedie (1996, Theorem 2.2) gives a more general condition for the irreducibility of the resulting Markov chain: they only require that $\exists \;\epsilon > 0,\, \delta > 0:\; q({\boldsymbol{x}}^{(t)}\mid {\boldsymbol{x}}^{(t-1)})>\epsilon \textrm{ if } \|{\boldsymbol{x}}^{(t-1)}-{\boldsymbol{x}}^{(t)}\|<\delta$ together with the boundedness (away from both zero and infinity) of $$f$$ on any compact set.

The Markov chain $$({\boldsymbol{x}}^{(0)},{\boldsymbol{x}}^{(1)},\dots)$$ is further aperiodic if there is positive probability that the chain remains in the current state, i.e. $${\mathbb{P}}({\boldsymbol{x}}^{(t)}={\boldsymbol{x}}^{(t-1)})>0$$, which is the case if ${\mathbb{P}}\left(f({\boldsymbol{x}}^{(t-1)})q({\boldsymbol{x}}\mid {\boldsymbol{x}}^{(t-1)})> f({\boldsymbol{x}})q({\boldsymbol{x}}^{(t-1)}\mid {\boldsymbol{x}})\right)>0.$ Note that this condition is not met if we use a “perfect” proposal which has $$f$$ as invariant distribution: in this case we accept every proposed value with probability $$1$$ (see e.g. remark after Example 3.10).

Proposition 3.4 The Markov chain generated by the Metropolis–Hastings algorithm is Harris-recurrent if it is irreducible.

Proof. Recurrence follows from the irreducibility and the fact that $$f$$ is the invariant distribution. For a proof of Harris recurrence see Tierney (1994, Corollary 2).

As we have now established (Harris-)recurrence, we are now ready to state an ergodic theorem (using Theorems A.1 and A.2.

Theorem 3.5 If the Markov chain generated by the Metropolis–Hastings algorithm is irreducible, then for any integrable function $$h:E\rightarrow{\mathbb{R}}$$: $\lim_{n\rightarrow \infty} \frac1n \sum\limits_{t=1}^n h({\boldsymbol{x}}^{(t)}) \rightarrow {\mathbb{E}_{f}\left[ h({\boldsymbol{X}})\right]}$ for every starting value $${\boldsymbol{x}}^{(0)}$$.

As with the Gibbs sampler, the above ergodic theorem allows for inference using a single Markov chain.

### 3.2.4 The random walk Metropolis algorithm

In this section we will focus on an important special case of the Metropolis–Hastings algorithm: the random walk Metropolis–Hastings algorithm. Assume that we generate the newly proposed state $${\boldsymbol{X}}$$ not using the fairly general $$${\boldsymbol{X}}\sim q(\cdot\mid {\boldsymbol{X}}^{(t-1)}),\tag{3.8}$$$ from Algorithm 3.3, but rather $$${\boldsymbol{X}}={\boldsymbol{X}}^{(t-1)}+{\boldsymbol{\epsilon}}, \qquad {\boldsymbol{\epsilon}}\sim g,\tag{3.9}$$$ with $$g$$ being a symmetric distribution. It is easy to see that (3.9) is a special case of (3.8) using $$q({\boldsymbol{x}}\mid {\boldsymbol{x}}^{(t-1)})=g({\boldsymbol{x}}-{\boldsymbol{x}}^{(t-1)})$$. When using (3.9) the probability of acceptance simplifies to $\min\left\{1,\frac{\displaystyle f({\boldsymbol{X}})\cdot q({\boldsymbol{X}}^{(t-1)}\mid {\boldsymbol{X}})}{\displaystyle f({\boldsymbol{X}}^{(t-1)})\cdot q({\boldsymbol{X}}\mid {\boldsymbol{X}}^{(t-1)})}\right\} = \min\left\{1,\frac{\displaystyle f({\boldsymbol{X}})}{\displaystyle f({\boldsymbol{X}}^{(t-1)})}\right\},$ as $$q({\boldsymbol{X}}\mid {\boldsymbol{X}}^{(t-1)})= g({\boldsymbol{X}}-{\boldsymbol{X}}^{(t-1)})= g({\boldsymbol{X}}^{(t-1)}-{\boldsymbol{X}})=q({\boldsymbol{X}}^{(t-1)}\mid {\boldsymbol{X}})$$ using the symmetry of $$g$$. This yields the following algorithm, a special case of Algorithm 3.3, which is actually the original algorithm proposed by Metropolis et al. (1953).

Algorithm 3.4 (Random walk Metropolis). Starting with $${\boldsymbol{X}}^{(0)}:=(X_1^{(0)},\dots,X_p^{(0)})$$ and using a symmetric distribution $$g$$, iterate for $$t=1,2,\dots$$

1. Draw $${\boldsymbol{\epsilon}}\sim g$$ and set $${\boldsymbol{X}}={\boldsymbol{X}}^{(t-1)}+{\boldsymbol{\epsilon}}$$.

2. Compute $\alpha({\boldsymbol{X}}\mid {\boldsymbol{X}}^{(t-1)})=\min\left\{1,\frac{\displaystyle f({\boldsymbol{X}})}{\displaystyle f({\boldsymbol{X}}^{(t-1)})} \right\}.$

3. With probability $$\alpha({\boldsymbol{X}}\mid {\boldsymbol{X}}^{(t-1)})$$ set $${\boldsymbol{X}}^{(t)}={\boldsymbol{X}}$$, otherwise set $${\boldsymbol{X}}^{(t)}={\boldsymbol{X}}^{(t-1)}$$.

Example 3.7 (Bayesian probit model). In a medical study on infections resulting from birth by Cæsarean section (taken from Fahrmeir and Tutz 2001) three influence factors have been studied: an indicator whether the Cæsarian was planned or not ($$z_{i1}$$), an indicator of whether additional risk factors were present at the time of birth ($$z_{i2}$$), and an indicator of whether antibiotics were given as a prophylaxis ($$z_{i3}$$). The response $$Y_i$$ is the number of infections that were observed amongst $$n_i$$ patients having the same influence factors (covariates). The data is given in Table 3.1.

 Number of births planned risk factors antibiotics with infection total $$y_i$$ $$n_i$$ $$z_{i1}$$ $$z_{i2}$$ $$z_{i3}$$ 11 98 1 1 1 1 18 0 1 1 0 2 0 0 1 23 26 1 1 0 28 58 0 1 0 0 9 1 0 0 8 40 0 0 0

The data can be modelled by assuming that $Y_i \sim \textsf{Bin}(n_i,\pi_i),\qquad \pi=\Phi({\boldsymbol{z}}_i'{\boldsymbol{\beta}}),$ where $${\boldsymbol{z}}_i=(1,z_{i1},z_{i2},z_{i3})$$ and $$\Phi(\cdot)$$ being the CDF of the $$\textsf{N}(0,1)$$ distribution. Note that $$\Phi(t)\in[0,1]$$ for all $$t\in\mathbb{R}$$.

A suitable prior distribution for the parameter of interest $${\boldsymbol{\beta}}$$ is $${\boldsymbol{\beta}}\sim N({\boldsymbol{0}},\mathbb{I}/\lambda)$$. The posterior density of $${\boldsymbol{\beta}}$$ is $f({\boldsymbol{\beta}}\mid y_1,\dots,y_n)\propto \left(\prod_{i=1}^N \Phi({\boldsymbol{z}}_i'{\boldsymbol{\beta}})^{y_i} \cdot (1-\Phi({\boldsymbol{z}}_i'{\boldsymbol{\beta}}))^{n_i-y_i} \right) \cdot \exp\left(-\frac{\lambda}{2}\sum_{j=0}^3 \beta_j^2\right).$ We can sample from the above posterior distribution using the following random walk Metropolis algorithm. Starting with any $${\boldsymbol{\beta}}^{(0)}$$ iterate for $$t=1,2,\dots$$:

1. Draw $${\boldsymbol{\epsilon}}\sim N({\boldsymbol{0}},{\boldsymbol{\Sigma}})$$ and set $${\boldsymbol{\beta}}={\boldsymbol{\beta}}^{(t-1)}+{\boldsymbol{\epsilon}}$$.

2. Compute $\alpha({\boldsymbol{\beta}}\mid {\boldsymbol{\beta}}^{(t-1)})=\min\left\{1,\frac{\displaystyle f({\boldsymbol{\beta}}\mid Y_1,\dots,Y_n)}{\displaystyle f({\boldsymbol{\beta}}^{(t-1)}\mid Y_1,\dots,Y_n)} \right\}.$

3. With probability $$\alpha({\boldsymbol{\beta}}\mid {\boldsymbol{\beta}}^{(t-1)})$$ set $${\boldsymbol{\beta}}^{(t)}={\boldsymbol{\beta}}$$, otherwise set $${\boldsymbol{\beta}}^{(t)}={\boldsymbol{\beta}}^{(t-1)}$$.

*To keep things simple, we choose the covariance $${\boldsymbol{\Sigma}}$$ of the proposal to be $$0.08 \cdot \mathbb{I}$$.*

Table 3.2: Parameter estimates obtained for the Bayesian probit model from Example 3.7.
Posterior mean 95% credible interval
intercept $$\beta_0$$ -1.0952 -1.4646 -0.7333
planned $$\beta_1$$ 0.6201 0.2029 1.0413
risk factors $$\beta_2$$ 1.2000 0.7783 1.6296
antibiotics $$\beta_3$$ -1.8993 -2.3636 -1.471

Figure 3.123.14 and Table 3.2 show the results obtained using 50,000 samples (you might want to consider a longer chain in practice). Note that the convergence of the $$\beta_j^{(t)}$$ is to a distribution, whereas the cumulative averages $$\sum_{\tau=1}^t\beta_j^{(\tau)}/t$$ converge, as the ergodic theorem implies, to a value. For Figure 3.123.14 and Table 3.2 the first 10,000 samples have been discarded (“burn-in”).

#### 3.2.4.1 Choosing the proposal distribution

The efficiency of a Metropolis–Hastings sampler depends on the choice of the proposal distribution $$q(\cdot\mid {\boldsymbol{x}}^{(t-1)})$$. An ideal choice of proposal would lead to a small correlation of subsequent realisations $${\boldsymbol{X}}^{(t-1)}$$ and $${\boldsymbol{X}}^{(t)}$$. This correlation has two sources:

• the correlation between the current state $${\boldsymbol{X}}^{(t-1)}$$ and the newly proposed value $${\boldsymbol{X}}\sim q(\cdot\mid {\boldsymbol{X}}^{(t-1)})$$, and

• the correlation introduced by retaining a value $${\boldsymbol{X}}^{(t)}={\boldsymbol{X}}^{(t-1)}$$ because the newly generated value $${\boldsymbol{X}}$$ has been rejected.

Thus we would ideally want a proposal distribution that both allows for fast changes in the $${\boldsymbol{X}}^{(t)}$$ and yields a high probability of acceptance. Unfortunately these are two competing goals. If we choose a proposal distribution with a small variance, the probability of acceptance will be high, however the resulting Markov chain will be highly correlated, as the $${\boldsymbol{X}}^{(t)}$$ change only very slowly. If, on the other hand, we choose a proposal distribution with a large variance, the $${\boldsymbol{X}}^{(t)}$$ can potentially move very fast, however the probability of acceptance will be rather low.

Example 3.8 Assume we want to sample from a $$N(0,1)$$ distribution using a random walk Metropolis–Hastings algorithm with $$\varepsilon \sim N(0,\sigma^2)$$. At first sight, we might think that setting $$\sigma^2=1$$ is the optimal choice, this is however not the case. In this example we examine the choices: $$\sigma^2=0.1$$, $$\sigma^2=1$$, $$\sigma^2=2.38^2$$, and $$\sigma^2=10^2$$. Figure 3.15 shows the sample paths of a single run of the corresponding random walk Metropolis–Hastings algorithm. Rejected values are drawn as grey open circles. Table 3.3 shows the average correlation $$\rho(X^{(t-1)},X^{(t)})$$ as well as the average probability of acceptance $$\alpha(X\mid X^{(t-1)})$$ averaged over 100 runs of the algorithm. Choosing $$\sigma^2$$ too small yields a very high probability of acceptance, however at the price of a chain that is hardly moving. Choosing $$\sigma^2$$ too large allows the chain to make large jumps; however, most of the proposed values are rejected, so the chain remains for a long time at each accepted value. The results suggest that $$\sigma^2=2.38^2$$ is the optimal choice. This corresponds to the theoretical results of Gelman, Roberts, and Gilks (1995) and the many papers which have extended the original result.

 Autocorrelation Probability of acceptance $$\rho(X^{(t-1)},X^{(t)})$$ $$\alpha(X,X^{(t-1)})$$ Mean 95% CI Mean 95% CI $$\sigma^2=0.1^2$$ 0.9901 (0.9891,0.9910) 0.9694 (0.9677,0.9710) $$\sigma^2=1$$ 0.7733 (0.7676,0.7791) 0.7038 (0.7014,0.7061) $$\sigma^2=2.38^2$$ 0.6225 (0.6162,0.6289) 0.4426 (0.4401,0.4452) $$\sigma^2=10^2$$ 0.8360 (0.8303,0.8418) 0.1255 (0.1237,0.1274)

Finding the ideal proposal distribution $$q(\cdot \mid {\boldsymbol{x}}^{(t-1)})$$ is an art. The optimal proposal would be sampling directly from the target distribution. The very reason for using a Metropolis–Hastings algorithm is, however, that we cannot sample directly from the target! This difficulty is the price we have to pay for the generality of the Metropolis–Hastings algorithm. Popular choices for random walk proposals are multivariate Gaussian or t-distributions. The latter have heavier tails, making them a safer choice. The covariance structure of the proposal distribution should ideally reflect the covariance of the target distribution. G. O. Roberts, Gelman, and Gilks (1997) propose to adjust the proposal such that the acceptance rate is around $$1/2$$ for one- or two dimensional target distributions, and around $$1/4$$ for larger dimensions, which is in line with the results we obtained in the above simple example and the guidelines which motivate them. Remember, however, that these are just rough guidelines and there is little to be gained from fine-tuning acceptance rates to several decimal places.

Example 3.9 (Bayesian probit model (continued)). In the Bayesian probit model we studied in Example 3.7 we drew ${\boldsymbol{\epsilon}}\sim {\textsf{N}\left( {\boldsymbol{0}},{\boldsymbol{\Sigma}} \right)}$ with $${\boldsymbol{\Sigma}}=0.08\cdot \mathbf{I}$$, i.e. we modelled the components of $${\boldsymbol{\epsilon}}$$ to be independent. The proportion of accepted values we obtained was $$13.9\%$$. Table 3.4 shows the corresponding autocorrelation. The resulting Markov chain can be made faster mixing by using a proposal distribution that represents the covariance structure of the posterior distribution of $${\boldsymbol{\beta}}$$. This can be done by resorting to the frequentist theory of generalised linear models (GLM): it suggests that the asymptotic covariance of the maximum likelihood estimate $$\hat{\boldsymbol{\beta}}$$ is $$({\boldsymbol{z}}'{\boldsymbol{D}}{\boldsymbol{z}})^{-1}$$, where $${\boldsymbol{z}}$$ is the matrix of the covariates, and $${\boldsymbol{D}}$$ is a suitable diagonal matrix. When using $${\boldsymbol{\Sigma}}=2\cdot ({\boldsymbol{z}}'{\boldsymbol{D}}{\boldsymbol{z}})^{-1}$$ in the algorithm presented in Example 3.7 we can obtain better mixing performance: the autocorrelation is reduced (see Table 3.4), and the proportion of accepted values obtained increases to 20.0%. Note that the determinant of both choices of $${\boldsymbol{\Sigma}}$$ was chosen to be the same, so the improvement of the mixing behaviour is due to a difference in the structure of the covariance.

Table 3.4: Autocorrelation $$\rho(\beta_j^{(t-1)},\beta^{(t)}_j)$$ between subsequent samples for two choices of the covariance $${\boldsymbol{\Sigma}}$$.
$$\beta_0$$ $$\beta_1$$ $$\beta_2$$ $$\beta_3$$
Autocorrelation for $${\boldsymbol{\Sigma}}=0.08\cdot\mathbf{I}$$ 0.9496 0.9503 0.9562 0.9532
Autocorrelation for $${\boldsymbol{\Sigma}}=2\cdot ({\boldsymbol{z}}'{\boldsymbol{D}}{\boldsymbol{z}})^{-1}$$ 0.8726 0.8765 0.8741 0.8792

### 3.2.5 The (Metropolised) Independence Sampler

The random walk proposals considered thus far are appealing because we can in principle employ them without detailed knowledge of the structure of the target distribution nor dedicating considerable effort to their design. However, if we do have information about the target distribution we may wish to use it to design global rather than local proposals and hence, we might hope, to reduce the autocorrelation of the chain.

This is indeed possible and if we can construct proposal distributions which have a similar form to the target distribution we can obtain good performance within an MCMC algorithm.

Choosing proposals of the form $$q(x^{(t)}\mid x^{(t-1)}) = q(x^{(t)})$$ (i.e. which are independent of the current state) leads to what is known as the Metropolised Independence Sampler or, sometimes, just the Independence Sampler. This name is potentially a little misleading, as this algorithm does not yield independent samples; it simply employs proposals which are themselves independent of the current state of the chain.

Algorithm 3.5 (Metropolised Independence Sampler). Starting with $${\boldsymbol{X}}^{(0)}:=(X_1^{(0)},\dots,X_p^{(0)})$$ iterate for $$t=1,2,\dots$$

1. Draw $${\boldsymbol{X}}\sim q(\cdot)$$.

2. Compute \begin{align} \alpha({\boldsymbol{X}}\mid {\boldsymbol{X}}^{(t-1)})&=\min\left\{1,\frac{\displaystyle f({\boldsymbol{X}})\cdot q({\boldsymbol{X}}^{(t-1)})}{\displaystyle f({\boldsymbol{X}}^{(t-1)})\cdot q({\boldsymbol{X}})} \right\}\notag\\ &=\min\left\{1,\frac{\displaystyle f({\boldsymbol{X}}) / q({\boldsymbol{X}})}{\displaystyle f({\boldsymbol{X}}^{(t-1)}) / q({\boldsymbol{X}}^{(t-1)})} \right\}. \tag{3.10} \end{align}

3. With probability $$\alpha({\boldsymbol{X}}\mid {\boldsymbol{X}}^{(t-1)})$$ set $${\boldsymbol{X}}^{(t)}={\boldsymbol{X}}$$, otherwise set $${\boldsymbol{X}}^{(t)}={\boldsymbol{X}}^{(t-1)}$$.

The form of the acceptance probability given in Equation (3.10) is highly suggestive: the ratio within the minimum is exactly a ratio of importance weights. If we sampled independently from $$q$$ and used those samples to approximate expectations with respect to $$f$$ by importance sampling, we’d be using exactly the numerator of this ratio as the importance weight. If we used the same strategy within a rejection sampling setting, assuming this ratio to be bounded, then we’d need an acceptance probability proportional to this ratio.

Later we will see that under those conditions in which the independence sampler proposal would be a good rejection sampling proposal it will also be a good proposal within a MCMC setting. First, however, it’s interesting to consider the relationship between the independence sampler and its rejection-sampling counterpart.

Proposition 3.5 (Acceptance Rates). If $$f(x) / q(x) \leq M < \infty$$ the acceptance rate of the independence sampler is at least as high as that of the corresponding rejection sampler.

Proof. Simply expanding the acceptance probability at any point $$x$$ we establish that: $\begin{gathered} a(x) = \int q(y) \alpha(x,y) dy = \int q(y) \min\left(1,\frac{f(y) / q(y)}{f(x) / q(x)} \right) dy = \int \min\left(q(y) ,\frac{f(y)}{f(x) / q(x)} \right) dy \\ \geq \int \min\left(f(y)/M, f(y)/M \right) dy \geq 1/M,\end{gathered}$ and as this holds for any $$M$$ which bounds $$f/q$$, the acceptance rate of the independence sampler is lower bounded by the best possible acceptance rate for any rejection sampler.

However, this comes at a cost: with rejection sampling one obtains independent samples from the target; with the independence sampler this is not the case.

#### 3.2.5.1 Ergodicity and the Independence Sampler

One method of assessing the convergence of Markov chains is to look at how far away from the invariant distribution it is possible for the marginal distribution of the chain to remain after a certain number of iterations. In order to make such an assessment it is necessary to define a distance on the space of probability measures.

Definition 3.2 (Total Variation). The total variation distance between two probability distributions, $$f$$ and $$g$$, may be defined as: \begin{aligned} {\left\vert\left\vert f-g\right\vert\right\vert_{TV}} :=& 2 \sup_{A} \left|\int_A f(x) - g(x) dx \right|.\end{aligned}

Actually, in the case of probability densities, this is exactly the $$L_1$$ distance between those densities and you may find this formulation easier to interpret:

Proposition 3.6 For any pair of probability densities defined on a common space $$E$$: ${\left\vert\left\vert f-g\right\vert\right\vert_{TV}} = \int |f(x) - g(x)| dx.$

Proof. Let $$A^\star = \{ x : f(x) > g(x)\}$$. It is clear that for all $$A$$: $\left|\int_A (f(x) - g(x)) dx \right| \leq \left|\int_{A^\star} f(x) - g(x) dx \right|.$ Noting further that $$\int_{A \cup A^c} (f(x) - g(x))dx = 0$$, we can establish that for any (measurable) A: $\int_A (f(x) - g(x)) dx = -\int_{A^c} (f(x) - g(x)) dx$ and so $2 \left|\int_A f(x) - g(x) dx \right| = \left|\int_A f(x) - g(x) dx \right| +\left|\int_{A^c} f(x) - g(x) dx \right|.$ On $$A^\star$$, $$f(x) > g(x)$$ while on $$(A^\star)^c$$ the reverse is true, so: $\left|\int_{A^\star} f(x) - g(x) dx \right| = \int_{A^\star} f(x) - g(x) dx = \int_{A^\star} |f(x) - g(x)| dx$ and $\left|\int_{(A^\star)^c} f(x) - g(x) dx \right| = - \int_{(A^\star)^c} f(x) - g(x) dx = \int_{(A^\star)^c} |f(x) - g(x)| dx.$

Combining everything, we establish that: \begin{aligned} {\left\vert\left\vert f-g\right\vert\right\vert_{TV}} := 2 \sup_{A} \left|\int_A f(x) - g(x) dx \right| &= 2 \left|\int_{A^\star} f(x) - g(x) dx \right|\\ &= \left|\int_{A^\star} f(x) - g(x) dx \right| +\left|\int_{(A^\star)^c} f(x) - g(x) dx \right|\\ &= \int_{A^\star} |f(x) - g(x)| dx + \int_{(A^\star)^c} |f(x) - g(x)| dx \\ &= \int |f(x) - g(x)| dx.\end{aligned}

Having defined total variation, we can define three forms of ergodicity.

Definition 3.3 (Forms of Ergodicity). An $$f$$-invariant Markov kernel, $$K$$, is said to be ergodic if $\lim_{n\rightarrow\infty} {\left\vert\left\vert K^n(x,\cdot) - f(\cdot)\right\vert\right\vert_{TV}} = 0$ where $${\left\vert\left\vert K^n(x,\cdot) - f(\cdot)\right\vert\right\vert_{TV}} = \int |K^n(x,y) - f(y)|dy$$.

If this statement can be strengthened to: ${\left\vert\left\vert K^n(x,\cdot) - f(\cdot)\right\vert\right\vert_{TV}} \leq M(x) \rho^{n}$ for some $$M(x) < \infty$$ and $$\rho < 1$$ then the kernel is said to be geometrically ergodic and if it can be further strengthened to: ${\left\vert\left\vert K^n(x,\cdot) - f(\cdot)\right\vert\right\vert_{TV}} \leq M \rho^{n}$ for some $$M < \infty$$ which does not depend upon $$x$$ then it is said to be uniformly geometrically ergodic or simply uniformly ergodic.

These are useful because they tell us something about the qualitative rate of convergence of the Markov chain to stationarity. However, we should bear in mind that if we don’t know the constants $$M$$ and $$\rho$$ then even a uniformly ergodic chain can in practice converge rather slowly (see G. O. Roberts and Rosenthal (2011) for some examples).

We’re now in a position to state and prove one celebrated result about independence samplers.

Proposition 3.7 If an independence sampler uses proposal $$q$$ and target $$f$$ and $$f(y) / q(y) \leq M < \infty$$ then the associated Markov kernel is uniformly ergodic.

Proof. We follow the argument of Robert and Casella (2004, Exercise 7.11). First we show that $$f(y) / q(y) \leq M \Rightarrow K(x,y) \geq f(y)/M$$: \begin{align*} K(x,y) &= q(y) \alpha(x,y) + (1-a(x)) \delta_x(y) \geq q(y) \alpha(x,y)\\ &\geq q(y) \min \left(\frac{f(y) / q(y)}{f(x)/q(x)},1 \right) = \min \left(\frac{f(y)}{f(x)/q(x)}, q(y) \right). \end{align*} Under the assumptions of the proposition we have that $$f(x)/q(x) \leq M$$ and $$q(y) \geq f(y)/M$$ and so: $$$K(x,y) \geq \min \left(\frac{f(y) }{M}, f(y) / M \right) = f(y) / M.\tag{3.11}$$$ Now we establish a preliminary result, defining $$A^\star(x) = \{ y : f(y) > K(x,y) \}$$: \begin{aligned} \sup_A \left|\int_A K(x,y) - f(y) dy \right|&= \left|\int_{A^\star(x)} K(x,y) - f(y) dy\right|\\ &= \int_{A^\star(x)} f(y) - K(x,y) dy\\ &\leq \int_{A^\star(x)} f(y) - (1/M) f(y) dy = 1-\frac{1}{M},\end{aligned} using Equation (3.11) to bound the negative term from below. We use this as a base case for induction. We have (by substituting this bound into the definition of the total variation norm) for $$n=1$$: $${\left\vert\left\vert K^n(x,\cdot) -f(\cdot)\right\vert\right\vert_{TV}} \leq 2 ( 1- 1/M)^n$$.

We now turn to the induction step, and assume that the hypothesis $${\left\vert\left\vert K^n(x,\cdot) -f(\cdot)\right\vert\right\vert_{TV}} \leq 2 ( 1- 1/M)^n$$ holds for some $$n$$ and write, for any (measurable) $$A$$: \begin{aligned} \int_A (K^{n+1}(x,y) - f(y)) dy = \int_A \int_A (K^n(u,y) - f(y)) dy (K(x,u) - f(u)) du\end{aligned} (you can check this by remembering that $$K$$ is $$f$$-invariant and expanding the right hand side explicitly).

The induction hypothesis tells us that the integral over $$y$$ is bounded by $$(1-1/M)^n$$ (it’s easy to establish that the integral over any set of the difference between any pair of probability densities is at most half of the total variation distance between those densities), and a similar argument to that used to prove the base case establishes that the integral over $$u$$ can then be bounded by $$(1-1/M)$$: \begin{aligned} \int_A (K^{n+1}(x,y) - f(y)) dy &= \int_A \int_A (K^n(u,y) - f(y)) dy (K(x,u) - f(u)) du\\ &\leq \int (1-1/M)^n (K(x,u) - f(u)) du \leq (1-1/M)^{n+1}\end{aligned}

Writing the total variation as twice the supremum over $$A$$ of quantities of this form completes the argument.

## 3.3 Composing kernels: Mixtures and Cycles

It can be advantageous, especially in the case of more complex distributions, to combine different Metropolis–Hastings updates into a single algorithm. Each of the different Metropolis–Hastings updates corresponds to a transition kernel $$K^{(j)}$$. As with the substeps of Gibbs sampler there are two simple and valid ways of combining the transition kernels $$K^{(1)},\dots,K^{(r)}$$:

• As in the systematic scan Gibbs sampler, we can cycle through the kernels in a deterministic order, i.e. first carry out the Metropolis–Hastings update corresponding the the kernel $$K^{(1)}$$, then carry out the one corresponding to $$K^{(2)}$$, etc. until we start again with $$K^{(1)}$$. The transition kernel of this composite chain is $K^{\circ}({\boldsymbol{x}}^{(t-1)},{\boldsymbol{x}}^{(t)})=\int\!\cdots\int K^{(1)} ({\boldsymbol{x}}^{(t-1)},{\boldsymbol{\xi}}^{(1)})K^{(2)} ({\boldsymbol{\xi}}^{(1)},{\boldsymbol{\xi}}^{(2)})\cdots K^{(r)} ({\boldsymbol{\xi}}^{(r-1)},{\boldsymbol{x}}^{(t)}) \;d{\boldsymbol{\xi}}^{(r-1)}\cdots d{\boldsymbol{\xi}}^{(1)}.$ If each of the transition kernels $$K^{(j)}$$ has the invariant distribution $$f$$; i.e. $\int f({\boldsymbol{x}}^{(t-1)})K({\boldsymbol{x}}^{(t-1)},{\boldsymbol{x}}^{(t)})\;d{\boldsymbol{x}}^{(t-1)}=f({\boldsymbol{x}}^{(t)}),$ then $$K^{\circ}$$ has $$f$$ as invariant distribution, too, as \begin{aligned} & {} \int f({\boldsymbol{x}}^{(t-1)}) K^{\circ}({\boldsymbol{x}}^{(t-1)},{\boldsymbol{x}}^{(t)})\;d{\boldsymbol{x}}^{(t-1)}\\ &= \int \underbrace{\cdots \underbrace{\int \underbrace{\int K^{(1)} ({\boldsymbol{x}}^{(t-1)},{\boldsymbol{\xi}}^{(1)}) f({\boldsymbol{x}}^{(t-1)})d{\boldsymbol{x}}^{(t-1)}}_{=f({\boldsymbol{\xi}}^{(1)})} K^{(2)} ({\boldsymbol{\xi}}^{(1)},{\boldsymbol{\xi}}^{(2)})d{\boldsymbol{\xi}}^{(1)}}_{f({\boldsymbol{\xi}}^{(2)})} \cdots d{\boldsymbol{\xi}}^{(r-2)}}_{=f({\boldsymbol{\xi}}^{(r-1)})} K^{(r)} ({\boldsymbol{\xi}}^{(r-1)},{\boldsymbol{x}}^{(t)})d{\boldsymbol{\xi}}^{(r-1)}\\ &=f({\boldsymbol{x}}^{(t)}).\end{aligned}

• Alternatively, we can, as in the random scan Gibbs sampler, choose each time at random which of the kernels should be used, i.e. use the kernel $$K^{(j)}$$ with probability $$w_j>0$$ ($$\sum_{\iota=1}^rw_{\iota}=1$$). The corresponding kernel of the composite chain is the mixture $K^{+}({\boldsymbol{x}}^{(t-1)},{\boldsymbol{x}}^{(t)})=\sum_{\iota=1}^r w_{\iota} K^{(\iota)}({\boldsymbol{x}}^{(t-1)},{\boldsymbol{x}}^{(t)}).$ Once again, if each of the transition kernels $$K^{(j)}$$ has the invariant distribution $$f$$, then $$K^{+}$$ has $$f$$ as invariant distribution: $\int f({\boldsymbol{x}}^{(t-1)}) K^{+}({\boldsymbol{x}}^{(t-1)},{\boldsymbol{x}}^{(t)})\;d{\boldsymbol{x}}^{(t-1)} = \sum_{\iota=1}^r w_{\iota} \underbrace{\int f({\boldsymbol{x}}^{(t-1)})K^{(\iota)}({\boldsymbol{x}}^{(t-1)},{\boldsymbol{x}}^{(t)})\;d{\boldsymbol{x}}^{(t-1)}}_{=f({\boldsymbol{x}}^{(t)})}=f({\boldsymbol{x}}^{(t)}).$

Example 3.10 (One-at-a-time Metropolis–Hastings). One example of a method using composite kernels is the so-called one-at-a-time Metropolis–Hastings algorithm. Consider the case of a $$p$$-dimensional random variable $${\boldsymbol{X}}=(X_1,\dots,X_p)$$. The Metropolis–Hastings Algorithms 3.3 and 3.4 update all components at a time. It can, however, be difficult to come up with a suitable proposal distribution $$q(\cdot\mid {\boldsymbol{x}}^{(t-1)})$$ (or $$g$$) for all variables. Alternatively, we could, as in the Gibbs sampler, update each component separately. For this we need $$p$$ proposal distributions $$q_1, \dots,q_p$$ for updating each of the $$X_j$$. The $$j$$-th proposal $$q_j$$ (and thus the $$j$$-th kernel $$K^{(j)}$$) corresponds to updating the $$X_j$$.
As mentioned above we can cycle deterministically through the kernels (corresponding to the kernel $$K^{\circ}$$), yielding the following algorithm. Starting with $${\boldsymbol{X}}^{(0)}=(X_1^{(0)},\dots,X_p^{(0)})$$ iterate

• 1.

1. Draw $$X_1\sim q_1(\cdot\mid X^{(t-1)}_2,\dots,X^{(t-1)}_p)$$.

2. Compute $$\alpha_1=\min\left\{1,\frac{ f(X_1,X_2^{(t-1)},\dots,X_p^{(t-1)})\cdot q_1(X_1^{(t-1)}\mid X_1,X^{(t-1)}_2,\dots,X^{(t-1)}_p)}{ f(X_1^{(t-1)},X_2^{(t-1)},\dots,X_p^{(t-1)})\cdot q_1(X_1\mid X_1^{(t-1)},X^{(t-1)}_2,\dots,X^{(t-1)}_p)} \right\}$$.

3. With probability $$\alpha_1$$ set $$X_1^{(t)}=X_1$$, otherwise set $$X_1^{(t)}=X_1^{(t-1)}$$.

• $$\vdots$$

• j.

1. Draw $$X_j\sim q_j(\cdot\mid X^{(t)}_1,\dots,X^{(t)}_{j-1},X^{(t-1)}_{j},\dots,X^{(t-1)}_p)$$.

2. Compute $$\alpha_j=\min\left\{1,\frac{ f(X_1^{(t)},\dots,X_{j-1}^{(t)},X_j,X_{j+1}^{(t-1)},\dots,X_p^{(t-1)})\cdot q_j(X_j^{(t-1)}\mid X_1^{(t)},\dots,X^{(t)}_{j-1},X_{j},X^{(t-1)}_{j+1},\dots,X^{(t-1)}_p)}{ f(X_1^{(t)},\dots,X_{j-1}^{(t)},X_j^{(t-1)},X_{j+1}^{(t-1)},\dots,X_p^{(t-1)})\cdot q_j(X_j\mid X_1^{(t)},\dots,X^{(t)}_{j-1},X_{j}^{(t-1)},X^{(t-1)}_{j+1},\dots,X^{(t-1)}_p)} \right\}$$.

3. With probability $$\alpha_j$$ set $$X_j^{(t)}=X_j$$, otherwise set $$X_j^{(t)}=X_j^{(t-1)}$$.

• $$\vdots$$

• p.

1. Draw $$X_p\sim q_p(\cdot\mid X^{(t)}_1,\dots,X^{(t)}_{p-1},X^{(t-1)}_{p})$$.

2. Compute $$\alpha_p=\min\left\{1,\frac{ f(X_1^{(t)},\dots,X_{p-1}^{(t)},X_p)\cdot q_p(X_p^{(t-1)}\mid X_1^{(t)},\dots,X^{(t)}_{p-1},X_{p})}{ f(X_1^{(t)},\dots,X_{p-1}^{(t)},X_p^{(t-1)})\cdot q_p(X_p\mid X_1^{(t)},\dots,X^{(t)}_{p-1},X_{p}^{(t-1)})} \right\}$$.

3. With probability $$\alpha_p$$ set $$X_p^{(t)}=X_p$$, otherwise set $$X_p^{(t)}=X_p^{(t-1)}$$.

The corresponding random sweep algorithm (corresponding to $$K^+$$) is: Starting with $${\boldsymbol{X}}^{(0)}=(X_1^{(0)},\dots,X_p^{(0)})$$ iterate

1. Draw an index $$j$$ from a distribution on $$\{1,\dots,p\}$$ (e.g. uniform)

2. Draw $$X_j\sim q_j(\cdot\mid X^{(t-1)}_1,\dots,X^{(t-1)}_p)$$.

3. Compute $$\alpha_j=\min\left\{1,\frac{ f(X_1^{(t-1)},\dots,X_{j-1}^{(t-1)},X_j,X_{j+1}^{(t-1)},\dots,X_p^{(t-1)})\cdot q_j(X_j^{(t-1)}\mid X_1^{(t-1)},\dots,X^{(t-1)}_{j-1},X_{j},X^{(t-1)}_{j+1},\dots,X^{(t-1)}_p)}{ f(X_1^{(t-1)},\dots,X_{j-1}^{(t-1)},X_j^{(t-1)},X_{j+1}^{(t-1)},\dots,X_p^{(t-1)})\cdot q_j(X_j\mid X_1^{(t-1)},\dots,X^{(t-1)}_{j-1},X_{j}^{(t-1)},X^{(t-1)}_{j+1},\dots,X^{(t-1)}_p)} \right\}$$.

4. With probability $$\alpha_j$$ set $$X_j^{(t)}=X_j$$, otherwise set $$X_j^{(t)}=X_j^{(t-1)}$$.

5. Set $$X^{(t)}_{\iota}:=X^{(t-1)}_{\iota}$$ for all $$\iota\neq j$$.

Note the similarity to the Gibbs sampler. Indeed, the Gibbs sampler is a special case of a one-at-a-time Metropolis–Hastings algorithm as the following remark shows.

Remark. The Gibbs sampler for a $$p$$-dimensional distribution is a special case of a one-at-a-time Metropolis–Hastings algorithm: the (systematic scan) Gibbs sampler (Algorithm 3.1 is a cycle of $$p$$ kernels, whereas the random scan Gibbs sampler (Algorithm 3.2 is a mixture of these kernels. The proposal $$q_j$$ corresponding to the $$j$$-th kernel consists of drawing $$X^{(t)}_j\sim f_{X_j\mid X_{-j}}$$. The corresponding probability of acceptance is uniformly equal to $$1$$.

Proof. The update of the $$j$$-th component of the Gibbs sampler consists of sampling from $$X_j\mid X_{-j}$$, i.e. it has the proposal $q_j(x_j\mid {\boldsymbol{x}}^{(t-1)})= f_{X_j\mid X_{-j}}(x_j\mid x^{(t)}_1,\dots,x^{(t)}_{j-1},x^{(t-1)}_{j+1},\dots,x^{(t-1)}_p).$ We obtain for the $$j$$-th kernel that \begin{aligned} &\frac{f(x^{(t)}_1,\dots,x^{(t)}_{j-1},x_j,x^{(t-1)}_{j+1},\dots,x^{(t-1)}_p)q_j(x_j^{(t-1)}\mid x^{(t)}_1,\dots,x^{(t)}_{j-1},x_j,x^{(t-1)}_{j+1},\dots,x^{(t-1)}_p)} { f(x^{(t)}_1,\dots,x^{(t)}_{j-1},x_j^{(t-1)},x^{(t-1)}_{j+1},\dots,x^{(t-1)}_p)q_j(x_j\mid x^{(t)}_1,\dots,x^{(t)}_{j-1},x_j^{(t-1)},x^{(t-1)}_{j+1},\dots,x^{(t-1)}_p) } \\ &= \frac{f(x^{(t)}_1,\dots,x^{(t)}_{j-1},x_j,x^{(t-1)}_{j+1},\dots,x^{(t-1)}_p)f_{X_j\mid X_{-j}}(x_j^{(t-1)}\mid x^{(t)}_1,\dots,x^{(t)}_{j-1},x^{(t-1)}_{j+1},\dots,x^{(t-1)}_p)} { f(x^{(t)}_1,\dots,x^{(t)}_{j-1},x_j^{(t-1)},x^{(t-1)}_{j+1},\dots,x^{(t-1)}_p)f_{X_j\mid X_{-j}}(x_j\mid x^{(t)}_1,\dots,x^{(t)}_{j-1},x^{(t-1)}_{j+1},\dots,x^{(t-1)}_p) } \\ &= \frac{f(x^{(t)}_1,\dots,x^{(t)}_{j-1},x_j,x^{(t-1)}_{j+1},\dots,x^{(t-1)}_p)\frac{f(x^{(t)}_1,\dots,x^{(t)}_{j-1},x_j^{(t-1)},x^{(t-1)}_{j+1},\dots,x^{(t-1)}_p)}{ f(x^{(t)}_1,\dots,x^{(t)}_{j-1},x^{(t-1)}_{j+1},\dots,x^{(t-1)}_p)}} { f(x^{(t)}_1,\dots,x^{(t)}_{j-1},x_j^{(t-1)},x^{(t-1)}_{j+1},\dots,x^{(t-1)}_p)\frac{f(x^{(t)}_1,\dots,x^{(t)}_{j-1},x_j,x^{(t-1)}_{j+1},\dots,x^{(t-1)}_p)}{f(x^{(t)}_1,\dots,x^{(t)}_{j-1},^{(t-1)}_{j+1},\dots,x^{(t-1)}_p)} } \\ &=1,\end{aligned} thus $$\alpha_j\equiv 1$$.

As explained above, the composite kernels $$K^+$$ and $$K^{\circ}$$ have the invariant distribution $$f$$, if all kernels $$K^{(j)}$$ have $$f$$ as invariant distribution. Similarly, it is sufficient for the irreducibility of the kernels $$K^+$$ and $$K^{\circ}$$ that all kernels $$K^{(j)}$$ are irreducible. This is however not a very useful condition, nor is it a necessary condition. Often, some of the kernels $$K^{(j)}$$ focus on certain subspaces, and thus cannot be irreducible for the entire space. The kernels $$K^{(j)}$$ corresponding to the Gibbs sampler are not irreducible themselves: the $$j$$-th Gibbs kernel $$K^{(j)}$$ only updates $$X_j$$, not the other $$X_{\iota}$$ ($$\iota\neq j$$).

## 3.4 Diagnosing Convergence

### 3.4.1 Practical considerations

The theory of Markov chains guarantees that a Markov chain that is irreducible and has invariant distribution $$f$$ converges to the invariant distribution. The ergodic theorems allow for approximating expectations $${\mathbb{E}_{f}\left[h({\boldsymbol{X}})\right]}$$ by their corresponding empirical means $\frac{1}{T}\sum_{t=1}^T h({\boldsymbol{X}}^{(t)}) \longrightarrow {\mathbb{E}_{f}\left[h({\boldsymbol{X}})\right]}$ using the entire chain. In practice, however, often only a subset of the chain $$({\boldsymbol{X}}^{(t)})_t$$ is used:

Burn-in

Depending on how $${\boldsymbol{X}}^{(0)}$$ is chosen, the distribution of $$({\boldsymbol{X}}^{(t)})_t$$ for small $$t$$ might still be far from the stationary distribution $$f$$. Thus it might be beneficial to discard the first iterations $${\boldsymbol{X}}^{(t)}$$, $$t=1,\dots,T_0$$. This early stage of the sampling process is often referred to as burn-in period. How large $$T_0$$ has to be chosen depends on how fast mixing the Markov chain $$({\boldsymbol{X}}^{(t)})_t$$ is. Figure 3.16 illustrates the idea of a burn-in period.

Thinning

Markov chain Monte Carlo methods typically yield a Markov chain with positive autocorrelation, i.e. $$\rho(X_k^{(t)},X_k^{(t+\tau)})$$ is positive for small $$\tau$$. This suggests building a subchain by only keeping every $$m$$-th value ($$m>1$$), i.e. we consider a Markov chain $$({\boldsymbol{Y}}^{(t)})_t$$ with $${\boldsymbol{Y}}^{(t)}={\boldsymbol{X}}^{(m\cdot t)}$$ instead of $$({\boldsymbol{X}}^{(t)})_t$$. If the correlation $$\rho({\boldsymbol{X}}^{(t)},{\boldsymbol{X}}^{(t+\tau)})$$ decreases monotonically in $$\tau$$, then $\rho(Y_k^{(t)},Y_k^{(t+\tau)})=\rho(X_k^{(t)},X_k^{(t+m\cdot\tau)})<\rho(X_k^{(t)},X_k^{(t+\tau)}),$ i.e. the thinned chain $$({\boldsymbol{Y}}^{(t)})_t$$ exhibits less autocorrelation than the original chain $$({\boldsymbol{X}}^{(t)})_t$$. Thus thinning can be seen as a technique for reducing the autocorrelation, however at the price of yielding a chain $$({\boldsymbol{Y}}^{(t)})_{t=1,\dots \lfloor T/m\rfloor}$$, whose length is reduced to $$(1/m)$$-th of the length of the original chain $$({\boldsymbol{X}}^{(t)})_ {t=1,\dots,T}$$. Even though thinning is very popular, it cannot be justified when the objective is estimating $${\mathbb{E}_{f}\left[h({\boldsymbol{X}})\right]}$$ [excepting special cases in which one can exploit specific features to save computations when working with the thinned chain; Owen (2017)] as the following lemma shows.

Lemma 3.3 Let $$({\boldsymbol{X}}^{(t)})_{t=1,\dots,T}$$ be a the random variables obtained from a Markov chain at stationarity, with $${\boldsymbol{X}}^{(t)} \sim f$$ and $$({\boldsymbol{Y}}^{(t)})_{t=1,\dots,\lfloor T/m\rfloor}$$ a second sequence defined by $${\boldsymbol{Y}}^{(t)}:={\boldsymbol{X}}^{(m\cdot t)}$$. If $${\mathbb{V}\textrm{ar}_{f}\left[h({\boldsymbol{X}}^{(t)})\right]}<\infty$$, then ${\mathbb{V}\textrm{ar}_{}\left[\frac{1}{T}\sum_{t=1}^T h({\boldsymbol{X}}^{(t)})\right]}\leq {\mathbb{V}\textrm{ar}_{}\left[\frac{1}{\lfloor T/m \rfloor}\sum_{t=1}^{\lfloor T/m \rfloor} h({\boldsymbol{Y}}^{(t)})\right]}.$

Proof. To simplify the proof we assume that $$T$$ is divisible by $$m$$, i.e. $$T/m \in \mathbb{N}$$. Using $\sum_{t=1}^T h({\boldsymbol{X}}^{(t)})=\sum_{\tau=0}^{m-1} \sum_{t=1}^{T/m} h({\boldsymbol{X}}^{(t\cdot m+\tau)})$ and ${\mathbb{V}\textrm{ar}_{}\left[ \sum_{t=1}^{T/m} h({\boldsymbol{X}}^{(t\cdot m+\tau_1)})\right]} = {\mathbb{V}\textrm{ar}_{}\left[ \sum_{t=1}^{T/m} h({\boldsymbol{X}}^{(t\cdot m+\tau_2)})\right]}$ for $$\tau_1,\tau_2\in\{0,\dots,m-1\}$$, we obtain that \begin{aligned} {\mathbb{V}\textrm{ar}_{}\left[\sum_{t=1}^T h({\boldsymbol{X}}^{(t)})\right]}&= {\mathbb{V}\textrm{ar}_{}\left[\sum_{\tau=0}^{m-1} \sum_{t=1}^{T/m} h({\boldsymbol{X}}^{(t\cdot m+\tau)})\right]}\\ &= m\cdot {\mathbb{V}\textrm{ar}_{}\left[\sum_{t=1}^{T/m} h({\boldsymbol{X}}^{(t\cdot m)})\right]} + \sum_{\eta \neq \tau=0}^{m-1} \underbrace{ \mathbb{C}ov\left[\sum_{t=1}^{T/m} h({\boldsymbol{X}}^{(t\cdot m+\eta)}),\sum_{t=1}^{T/m} h({\boldsymbol{X}}^{(t\cdot m+\tau)})\right]}_{\leq {\mathbb{V}\textrm{ar}_{}\left[\sum_{t=1}^{T/m} h({\boldsymbol{X}}^{(t\cdot m)})\right]}} \\ &\leq m^2 \cdot {\mathbb{V}\textrm{ar}_{}\left[\sum_{t=1}^{T/m} h({\boldsymbol{X}}^{(t\cdot m)})\right]} = m^2\cdot {\mathbb{V}\textrm{ar}_{}\left[\sum_{t=1}^{T/m} h({\boldsymbol{Y}}^{(t)})\right]}.\end{aligned} Thus ${\mathbb{V}\textrm{ar}_{}\left[\frac{1}{T}\sum_{t=1}^T h({\boldsymbol{X}}^{(t)})\right]}=\frac{1}{T^2}{\mathbb{V}\textrm{ar}_{}\left[\sum_{t=1}^T h({\boldsymbol{X}}^{(t)})\right]}\leq \frac{m^2}{T^2}{\mathbb{V}\textrm{ar}_{}\left[\sum_{t=1}^{T/m} h({\boldsymbol{Y}}^{(t)})\right]}={\mathbb{V}\textrm{ar}_{}\left[\frac{1}{T/m}\sum_{t=1}^{T/m} h({\boldsymbol{Y}}^{(t)})\right]}.$

The concept of thinning can be useful for other reasons. If storage is limited it may not be possible to store all of an arbitrarily long chain; in this context, it can be much better to store the thinned skeleton of a long chain than to consider the entire sample path of a shorter chain. Furthermore, it can be easier to assess the convergence of the thinned chain $$({\boldsymbol{Y}}^{(t)})_t$$ as opposed to entire chain $$({\boldsymbol{X}}^{(t)})_t$$.

### 3.4.2 Tools for monitoring convergence

Although the theory presented in the preceding chapters guarantees the convergence of the Markov chains to the required distributions, this does not imply that a finite sample from such a chain yields a good approximation to the target distribution. As with all approximating methods this must be confirmed in practice.

This section tries to give a brief overview over various approaches to diagnosing convergence. A more detailed review with many practical examples can be found in or . There are numerous R packages (CODA is well known; mcmcse provides useful tools for assessing some aspects of performance) that provides a vast selection of tools for diagnosing convergence.

Diagnosing convergence is an art. The techniques presented in the following are no more than exploratory tools that help you judge whether the chain has reached its stationary regime. This section contains several cautionary examples where the different tools for diagnosing convergence fail.

Broadly speaking, convergence assessment can be split into the following three categories, each of which considers the assessment of a different aspect of convergence:

Convergence to the target distribution.

The first, and most important, question is whether $$({\boldsymbol{X}}^{(t)})_t$$ yields a sample from the target distribution? In order to answer this question we need to assess …

• whether $$({\boldsymbol{X}}^{(t)})_t$$ has reached a stationary regime, and

• whether $$({\boldsymbol{X}}^{(t)})_t$$ covers the entire support of the target distribution.

Convergence of the averages.

Does $$\sum_{t=1}^T h({\boldsymbol{X}}^{(t)}) / T$$ provide a good approximation to the expectation $${\mathbb{E}_{f}\left[h({\boldsymbol{X}})\right]}$$ under the target distribution?

Comparison to i.i.d. sampling.

How much information is contained in the sample from the Markov chain compared to i.i.d. sampling?

### 3.4.3 Basic plots

The most basic approach to diagnosing the output of a Markov Chain Monte Carlo algorithm is to plot the sample path $$({\boldsymbol{X}}^{(t)})_t$$. Note that the convergence of $$({\boldsymbol{X}}^{(t)})_t$$ is in distribution, i.e. the sample path is not supposed to converge to a single value. Ideally, the plot should be oscillating very fast and show very little structure or trend. In general terms, the smoother such a plot seems, the slower the mixing of the associated chain.

Note however that this plot suffers from the “you’ve only seen where you’ve been” problem. It is impossible to see from a plot of the sample path whether the chain has explored the entire support of the distribution (without additional information).

Example 3.11 (A simple mixture of two Gaussians). Consider sampling from a mixture of two well-separated Gaussians $f(x)=0.4 \cdot \phi_{(-1,0.2^2)}(x) + 0.6 \cdot \phi_{(2,0.3^2)}(x)$ (see Figure 3.17 for a plot of the density) using a random walk Metropolis algorithm with an $${\textsf{N}\left( 0,{\mathbb{V}\textrm{ar}_{}\left[\varepsilon\right]} \right)}$$ increment distribution. If we choose the proposal variance $${\mathbb{V}\textrm{ar}_{}\left[\varepsilon\right]}$$ too small, we only sample from one component of the mixture, not from the mixture itself. Figures 3.173.19 show the sample paths for two choices of $${\mathbb{V}\textrm{ar}_{}\left[\varepsilon\right]}$$: $${\mathbb{V}\textrm{ar}_{}\left[\varepsilon\right]}=0.4^2$$ and $${\mathbb{V}\textrm{ar}_{}\left[\varepsilon\right]}=1.2^2$$. The first choice of $${\mathbb{V}\textrm{ar}_{}\left[\varepsilon\right]}$$ is too small: the chain is very likely to remain in one of the two modes of the distribution. Note that it is impossible to tell from Figure 3.18 alone that the chain has not explored the entire support of the target.

In order to diagnose the convergence of sample averages, one can look at a plot of the cumulative averages $$(\sum_{\tau=1}^t h(X^{(\tau)})/t)_t$$. Note that the convergence of the cumulative averages is—as the ergodic theorems suggest—to a value ($${\mathbb{E}_{f}\left[h({\boldsymbol{X}})\right]}$$). Figure 3.13 shows plots of the cumulative averages. An alternative to plotting the cumulative means is using the so-called CUSUMs $\sum_{\tau=1}^t\left[h(X_j^{(\tau)}) - \frac{1}{T}\sum_{i=1}^T h(X_j^{(i)})\right],$ which is none other than the difference between the cumulative sums and the corresponding (pro-rated) estimate of the limit $${\mathbb{E}_{f}\left[h({\boldsymbol{X}})\right]}$$. For a fast mixing chain (for which increments might be approximately i.i.d. Normal, say), a CUSUM plot should roughly resemble a Brownian bridge from 0 to 0, i.e. being highly irregular and centred around 0. Slow mixing chains exhibit long excursions away from 0.

Example 3.12 (A pathological generator for the Beta distribution). The following MCMC algorithm (for details, see Robert and Casella 2004 Problem 7.5) yields a sample from the $$\textsf{Beta}(\alpha,1)$$ distribution. Starting with any $$X^{(0)}$$ iterate for $$t=1,2,\dots$$

1. With probability $$1-X^{(t-1)}$$, set $$X^{(t)}=X^{(t-1)}$$.

2. Otherwise draw $$X^{(t)} \sim \textsf{Beta}(\alpha+1,1)$$.

This algorithm yields a very slowly converging Markov chain, to which no central limit theorem applies. This slow convergence can be seen in a plot of the cumulative means (Figure 3.203.21); $$\alpha = 1$$ so $${\mathbb{E}_{}\left[X\right]} = 1/2$$).

Note that it is impossible to tell from a plot of the cumulative means whether the Markov chain has explored the entire support of the target distribution.

### 3.4.4 Non-parametric tests of stationarity

A variety of nonparametric tests can be employed to establish whether the samples from a Markov chain behave in particular ways. This section presents an illustration of the (informal, approximate) use of the Kolmogorov–Smirnov test to assess whether there is evidence that a Markov chain has not yet reached stationarity.

In its simplest version, it is based on splitting the chain into three parts: $$({\boldsymbol{X}}^{(t)})_{t=1,\dots,\lfloor T/3\rfloor}$$,$$({\boldsymbol{X}}^{(t)})_{t=\lfloor T/3\rfloor+1,\dots,2\lfloor T/3\rfloor}$$, and $$({\boldsymbol{X}}^{(t)})_{t=2\lfloor T/3\rfloor+1,\dots,T}$$. The first block is considered to be the burn-in period. If the Markov chain has reached its stationary regime after $$\lfloor T/3\rfloor$$ iterations, the second and third block should be from the same distribution. Thus we should be able to tell whether the chain has converged by comparing the distribution of $$({\boldsymbol{X}}^{(t)})_{t=\lfloor T/3\rfloor+1,\dots,2\lfloor T/3\rfloor}$$ to the distribution of $$({\boldsymbol{X}}^{(t)})_{t=2\lfloor T/3\rfloor+1,\dots,T}$$ using suitable nonparametric two-sample tests. One such test is the Kolmogorov–Smirnov test.

Definition 3.4 (Kolmogorov–Smirnov Statistic). The two-sample Kolmogorov–Smirnov test for comparing two i.i.d. samples $$Z_{1,1},\dots,Z_{1,n}$$ and $$Z_{2,1},\dots,Z_{2,n}$$ is based on comparing their empirical CDFs $\hat F_k(z)=\frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{(-\infty,z]}(Z_{k,i}).$ The Kolmogorov–Smirnov test statistic is the maximum difference between the two empirical CDFs: $K=\sup_{z \in \mathbb{R}} |\hat F_1(z) - \hat F_2(z)|.$ For $$n\rightarrow \infty$$ the CDF of $$\sqrt{n}\cdot K$$ converges to the CDF $R(k)=1-\sum_{i=1}^{\infty} (-1)^{i-1} \exp(-2i^2k^2).$

As the Kolmogorov–Smirnov test is designed for i.i.d. samples, we do not apply it to the $$({\boldsymbol{X}}^{(t)})_t$$ directly, but to a thinned chain $$({\boldsymbol{Y}}^{(t)})_t$$ with $${\boldsymbol{Y}}^{(t)}={\boldsymbol{X}}^{(m\cdot t)}$$: the thinned chain is less correlated and thus closer to being an i.i.d. sample. This, of course, still formally violates the conditions under which the Kolmogorov–Smirnov test is exact but one can still hope to obtain useful information when the conditions are close to being satisfied.

We can now use the Kolmogorov–Smirnov statistics to compare the distribution of the second block, $$({\boldsymbol{Y}}^{(t)})_{t=\lfloor T/(3m)\rfloor+1,\dots,2\lfloor T/(3m)\rfloor}$$, with that of the third, $$({\boldsymbol{Y}}^{(t)})_{t=2\lfloor T/(3m)\rfloor+1,\dots,\lfloor T/m\rfloor}$$:

$K=\sup_{x \in \mathbb{R}} \left|\widehat F_{({\boldsymbol{Y}}^{(t)})_{t=\lfloor T/(3m)\rfloor+1,\dots,2\lfloor T/(3m)\rfloor}}(x) - \widehat F_{({\boldsymbol{Y}}^{(t)})_{t=2\lfloor T/(3m)\rfloor+1,\dots,\lfloor T/m\rfloor}}(x)\right|.$ As the thinned chain is not an i.i.d. sample, we cannot use the Kolmogorov–Smirnov test as a formal statistical test (besides, we would run into problems of multiple testing). However, we can use it as an informal tool by monitoring the standardised statistic $$\sqrt{t}K_t$$ as a function of $$t$$., where $$K_t$$ denotes the Kolmogorov–Smirnov statistic obtained from the sample consisting of the first $$t$$ observations only. If a significant proportion of the values of this standardised statistic are above the corresponding quantile of the asymptotic distribution, it is safe to assume that the chain has not yet reached its stationary regime.

Example 3.13 (Gibbs sampling from a bivariate Gaussian (continued)). In this example we consider sampling from a bivariate Gaussian distribution, once with $$\rho(X_1,X_2)=0.3$$ and once with $$\rho(X_1,X_2)=0.99$$. The former leads to a fast mixing chain, the latter to a very slowly mixing chain. Figure 3.223.23 shows the plots of the standardised Kolmogorov–Smirnov statistic. It suggests that the sample size of 10,000 is large enough for the low-correlation setting, but not large enough for the high-correlation setting.

Note that this use of the Kolmogorov–Smirnov test suffers from the “you’ve only seen where you’ve been” problem, as it is based on comparing $$({\boldsymbol{Y}}^{(t)})_{t=\lfloor T/(3m)\rfloor+1,\dots,2\lfloor T/(3m)\rfloor}$$ and $$({\boldsymbol{Y}}^{(t)})_{t=2\lfloor T/(3m)\rfloor+1,\dots,\lfloor T/m\rfloor}$$. A plot of the Kolmogorov–Smirnov statistic for the chain with $${\mathbb{V}\textrm{ar}_{}\left[\varepsilon\right]}=0.4$$ from Example 3.11 would not reveal anything unusual.

### 3.4.5 Riemann sums and control variates

A simple tool for diagnosing convergence of a one-dimensional Markov chain can be based on the fact that $\int_E f(x) \; dx =1.$ We can estimate this integral using the Riemann sum $$$\sum_{t=2}^T (X^{[t]}-X^{[t-1]}) f(X^{[t]}),\tag{3.12}$$$ where $$X^{[1]}\leq \dots \leq X^{[T]}$$ is the ordered sample from the Markov chain. If the Markov chain has explored all the support of $$f$$, then (3.12) should be around $$1$$ (as it is an estimate of the integral of a probability density). Note that this method, often referred to as Riemann sums , requires that the density $$f$$ is known inclusive of normalisation constants (and thus to apply this technique to a univariate marginal of a multivariate problem would require that at least one univariate marginal of the target distributions is known exactly).

Example 3.14 (A simple mixture of two Gaussians (continued)). In Example 3.11 we considered two random-walk Metropolis algorithms: one ($${\mathbb{V}\textrm{ar}_{}\left[\varepsilon\right]}=0.4^2$$) failed to explore the entire support of the target distribution, whereas the other one ($${\mathbb{V}\textrm{ar}_{}\left[\varepsilon\right]}=1.2^2$$) managed to. The corresponding Riemann sums are $$0.598$$ and $$1.001$$, clearly indicating that the first algorithm does not explore the entire support.

Riemann sums can be seen as a special case of a technique called control variates. The idea of control variates is essentially to compare several ways of estimating the same quantity using the same collection of samples. If the different estimates disagree, the chain has not yet converged. Note that the technique of control variates is only useful if the different estimators converge about as fast as the quantity of interest—otherwise we would obtain an overly optimistic, or an overly conservative estimate of whether the chain has converged. In the special case of the Riemann sum we compare two quantities: the constant $$1$$ and the Riemann sum (3.12).

### 3.4.6 Comparing multiple chains

A family of convergence diagnostics (see e.g. Gelman and Rubin 1992; Brooks and Gelman 1998) is based on running $$L>1$$ chains—which we will denote by $$({\boldsymbol{X}}^{(1,t)})_t,\dots,({\boldsymbol{X}}^{(L,t)})_t$$—with overdispersed starting values $${\boldsymbol{X}}^{(1,0)},\dots,{\boldsymbol{X}}^{(L,0)}$$ (in the sense that the variance of the starting values should be larger than the variance of the target distribution). These starting values should in principle be chosen to give reasonable coverage of the support of the target distribution.

All $$L$$ chains should converge to the same distribution, so comparing the plots described in Section @ref(#seccdplots) for the $$L$$ different chains should not reveal any difference. A more formal approach to diagnosing whether the $$L$$ chains are all from the same distribution can be based on comparing the inter-quantile distances.

We can estimate the inter-quantile distances in two ways. The first consists of estimating the inter-quantile distance for each of the $$L$$ chains and averaging over these results, i.e. our estimate is $$\sum_{l=1}^L \delta^{(l)}_{\gamma}/L,$$ where $$\delta^{(l)}_{\gamma}$$ is the distance between the $$\gamma$$ and $$(1-\gamma)$$ quantile of the $$l$$-th chain $$(X_k^{(l,t)})_t$$. Alternatively, we can pool the data first, and then compute the distance, $$\hat{\delta}_\gamma$$, between the $$\gamma$$ and $$(1-\gamma)$$ quantile of the pooled data. If all chains are a sample from the same distribution, both estimates should be roughly the same, so their ratio $\hat S^{\textrm{interval}}_{\gamma}=\frac{\sum_{l=1}^L \delta_{\gamma}^{(l)}/L}{\hat{\delta}_{\gamma}}$ can be used as a tool to diagnose whether all chains sampled from the same distribution, in which case the ratio should be around 1.

Alternatively, one could compare the variances within the $$L$$ chains to the pooled estimate of the variance (see Brooks and Gelman 1998 for more details).

Example 3.15 (A simple mixture of two Gaussians (continued)). In the example of the mixture of two Gaussians we will consider $$L=8$$ chains initialised with iid samples from a $${\textsf{N}\left( 0,10^2 \right)}$$ distribution. Figure 3.243.25 shows the sample paths of the $$8$$ chains for both choices of $${\mathbb{V}\textrm{ar}_{}\left[\varepsilon\right]}$$. The corresponding values of $$\hat S^{\textrm{interval}}_{0.05}$$ are: \begin{aligned} {\mathbb{V}\textrm{ar}_{}\left[\varepsilon\right]}=0.4^2 &: \hat S^{\textrm{interval}}_{0.05}=\frac{0.9789992}{3.630008}=0.2696962\\ {\mathbb{V}\textrm{ar}_{}\left[\varepsilon\right]}=1.2^2 &: \hat S^{\textrm{interval}}_{0.05}=\frac{3.634382}{3.646463}=0.996687.\end{aligned}

Note that this method depends crucially on the choice of initial values $${\boldsymbol{X}}^{(1,0)},\dots,{\boldsymbol{X}}^{(L,0)}$$, and thus can easily fail, as the following example shows.

Example 3.16 (Witch’s hat distribution). Consider a distribution with the following density: $f(x_1,x_2) \propto \left\{ \begin{array}{ll} (1-\delta)\phi_{({\boldsymbol{\mu}},\sigma^2\cdot \mathbb{I})}(x_1,x_2) + \delta & \textrm{ if x_1,x_2\in (0,1),}\\ 0 & \textrm{ otherwise}, \end{array}\right.$ which is a mixture of a Gaussian and a uniform distribution, both truncated to $$(0,1)\times(0,1)$$. Figure 3.26 illustrates the density. For very small $$\sigma^2$$, the Gaussian component is concentrated in a very small area around $${\boldsymbol{\mu}}$$.

The conditional distribution of $$X_1\mid X_2$$ is $f(x_1\mid x_2)=\left\{ \begin{array}{ll} (1-\delta_{x_2})\phi_{({\boldsymbol{\mu}},\sigma^2\cdot \mathbb{I})}(x_1,x_2) + \delta_{x_2} & \textrm{ for x_1\in (0,1)}\\ 0 & \textrm{ otherwise} \end{array}\right.$ with $$\displaystyle \delta_{x_2}=\frac{\delta}{\delta+(1-\delta)\phi_{(\mu_2,\sigma^2)}(x_2)}$$.

Assume we want to estimate $${{\mathbb{P}}_{}\left(0.49 < X_1,X_2\leq 0.51\right)}$$ for $$\delta=10^{-3}$$, $${\boldsymbol{\mu}}=(0.5,0.5)'$$, and $$\sigma=10^{-5}$$ using a Gibbs sampler. Note that 99.9% of the mass of the distribution is concentrated in a very small area around $$(0.5,0.5)$$, i.e. $${{\mathbb{P}}_{}\left(0.49 < X_1,X_2\leq 0.51\right)}\approx 0.999$$.

Nonetheless, it is very unlikely that the Gibbs sampler visits this part of the distribution. This is due to the fact that unless $$x_2$$ (or $$x_1$$) is very close to $$\mu_2$$ (or $$\mu_1$$), $$\delta_{x_2}$$ (or $$\delta_{x_1}$$) is almost 1, i.e. the Gibbs sampler only samples from the uniform component of the distribution. Figure 3.27 shows the samples obtained from 15 runs of the Gibbs sampler (first 100 iterations only) all using different initialisations. On average only 0.04% of the sampled values lie in $$(0.49,0.51)\times(0.49,0.51)$$ yielding an estimate of $$\hat{{\mathbb{P}}}\left(0.49 < X_1,X_2\leq 0.51\right)=0.0004$$ (as opposed to $${{\mathbb{P}}_{}\left(0.49 < X_1,X_2\leq 0.51\right)}=0.999$$).

It is however close to impossible to detect this problem with any technique based on multiple initialisations. The Gibbs sampler shows this behaviour for practically all starting values. In Figure 3.27 all 15 starting values yield a Gibbs sampler that is stuck in the “brim” of the witch’s hat and thus misses 99.9% of the probability mass of the target distribution.

### 3.4.7 Comparison to i.i.d. sampling and the effective sample size

MCMC algorithms typically yield a positively correlated sample $$({\boldsymbol{X}}^{(t)})_{t=1,\dots,T}$$, which contains less information than an i.i.d. sample of size $$T$$. If the $$({\boldsymbol{X}}^{(t)})_{t=1,\dots,T}$$ are positively correlated, then the variance of the average $$$\mathbb{V}\textrm{ar}\left[\frac{1}{T}\sum_{t=1}^T h({\boldsymbol{X}}^{(t)})\right] \tag{3.13}$$$ is larger than the variance we would obtain from an i.i.d. sample, which is $${\mathbb{V}\textrm{ar}_{}\left[h({\boldsymbol{X}}^{(t)})\right]}/T$$.

The effective sample size (ESS) attempts to quantify the loss of information caused by this positive correlation. The effective sample size is the size an i.i.d. sample would have to have in order to obtain the same variance (3.13) as the estimate from the Markov chain $$({\boldsymbol{X}}^{(t)})_{t=1,\dots,T}$$.

As the exact computation of this quantity is generally impossible, a number of simplifying approximations are often made in order to obtain a computationally tractable proxy for this quantity. Slightly confusingly, the approximate equivalent independent sample size arrived at following this chain of approximations is also referred to as the ESS. More sophisticated approximations are used to obtain better approximations of the ESS in some settings, but we focus here on one simple and widely-applicable option.

In order to compute the variance (3.13) we make the simplifying assumption that $$(h({\boldsymbol{X}}^{(t)}))_{t=1,\dots,T}$$ is from a second-order stationary time series, i.e. $${\mathbb{V}\textrm{ar}_{}\left[h({\boldsymbol{X}}^{(t)})\right]}=\sigma^2$$, and $$\rho(h({\boldsymbol{X}}^{(t)}),h({\boldsymbol{X}}^{(t+\tau)}))=\rho(\tau)$$. Then \begin{aligned} {\mathbb{V}\textrm{ar}_{}\left[\frac{1}{T}\sum_{t=1}^T h({\boldsymbol{X}}^{(t)})\right]}&=\frac{1}{T^2}\left( \sum_{t=1}^T \underbrace{{\mathbb{V}\textrm{ar}_{}\left[h({\boldsymbol{X}}^{(t)})\right]}}_{=\sigma^2} + 2 \sum_{1\leq s<t\leq T}\underbrace{\mathbb{C}ov\left[h({\boldsymbol{X}}^{(s)}),h({\boldsymbol{X}}^{(t)})\right]}_{=\sigma^2\cdot \rho(t-s)}\right)\\ &=\frac{\sigma^2}{T^2}\left(T + 2 \sum_{\tau=1}^{T-1} (T-\tau) \rho(\tau)\right)= \frac{\sigma^2}{T}\left(1 + 2 \sum_{\tau=1}^{T-1} \left(1-\frac{\tau}{T}\right) \rho(\tau)\right).\end{aligned} If $$\sum_{\tau=1}^{\infty} |\rho(\tau)|<\infty$$, then we can obtain from the dominated convergence theorem (see e.g. Brockwell and Davis 1991 Theorem 7.1.1 for details) that $T\cdot {\mathbb{V}\textrm{ar}_{}\left[\frac{1}{T}\sum_{t=1}^T h({\boldsymbol{X}}^{(t)})\right]} \longrightarrow \sigma^2 \left(1 + 2 \sum_{\tau=1}^{\infty} \rho(\tau)\right)$ as $$T\rightarrow \infty$$. Note that the variance of the simple Monte Carlo estimate of $${\mathbb{E}_{f}\left[h(X)\right]}$$ would be $$\sigma^2/T_{\textrm{ESS}}$$ if we were to use an i.i.d. sample of size $$T_{\textrm{ESS}}$$. We can now obtain the effective sample size $$T_{\textrm{ESS}}$$ by equating these two variances and solving for $$T_{\textrm{ESS}}$$, yielding $T_{\textrm{ESS}}=\frac{1}{1 + 2 \sum_{\tau=1}^{\infty} \rho(\tau)}\cdot T.$ If we assume that $$(h({\boldsymbol{X}}^{(t)}))_{t=1,\dots,T}$$ is a first-order autoregressive time series ($$\textrm{AR}(1)$$), i.e. $$\rho(\tau)=\rho(h({\boldsymbol{X}}^{(t)}),h({\boldsymbol{X}}^{(t+\tau)}))=\rho^{|\tau|}$$, then we obtain using $$1 + 2 \sum_{\tau=1}^{\infty} \rho^{\tau}=(1+\rho)/(1-\rho)$$ that $T_{\textrm{ESS}}=\frac{1-\rho}{1 + \rho} \cdot T.$

Example 3.17 (Gibbs sampling from a bivariate Gaussian (continued)). In examples 3.4 and 3.5 we obtained for the low-correlation setting that $$\rho(X_1^{(t-1)},X_1^{(t)})=0.078$$, thus the effective sample size is $T_{\textrm{ESS}}=\frac{1-0.078}{1+0.078}\cdot 10000 = 8547.$ For the high-correlation setting we obtained $$\rho(X_1^{(t-1)},X_1^{(t)})=0.979$$, thus the effective sample size is considerably smaller: $T_{\textrm{ESS}}=\frac{1-0.979}{1+0.979}\cdot 10000 = 105.$

Note that there are other more sophisticated ways of estimating ESS which do not require this crude autoregressive approximation. The mcmcse package provides an implementation of the method of Gong and Flegal (2016) which has somewhat better properties than the method employed by coda.

## 3.5 Optimisation with MCMC

So far we have studied various methods that allow for approximating expectations $${\mathbb{E}_{}\left[h({\boldsymbol{X}})\right]}$$ by ergodic averages $$\frac{1}{T}\sum_{t=1}^T h({\boldsymbol{X}}_i^{(t)})$$. This section presents an algorithm for finding the (global) mode(s) of a distribution. For definiteness, in this chapter we define the mode(s) of a distribution to be the set of global maxima of the density, i.e. $$\{{\boldsymbol{\xi}}:\; f({\boldsymbol{\xi}})\geq f({\boldsymbol{x}})\;\forall {\boldsymbol{x}}\}$$. In Section 3.5.1 we will extend this idea to finding global extrema of arbitrary functions.

We could estimate the mode of a distribution by the $${\boldsymbol{X}}^{(t)}$$ with maximal density $$f({\boldsymbol{X}}^{(t)})$$; this is however a not very efficient strategy. A sample from a Markov chain with invariant distribution $$f(\cdot)$$ samples from the whole distribution and not only from the mode(s).

This suggests modifying the distribution such that it is more concentrated around the mode(s). One way of achieving this is to consider $f_{(\beta)}(x)\propto \left(f(x)\right)^{\beta}$ for very large values of $$\beta$$.

Example 3.18 (Normal distribution). Consider the $${\textsf{N}\left( \mu,\sigma^2 \right)}$$ distribution with density $f_{(\mu,\sigma^2)}(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\propto \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$ It is easy to see that the mode of the $${\textsf{N}\left( \mu,\sigma^2 \right)}$$ distribution is $$\mu$$. We have that $\left(f_{(\mu,\sigma^2)}(x)\right)^{\beta}\propto\left(\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\right)^{\beta} =\exp\left(-\frac{(x-\mu)^2}{2\sigma^2/\beta}\right)\propto f_{(\mu,\sigma^2/\beta)}(x).$ In other words, the larger $$\beta$$ is chosen, the more concentrated the distribution will be around the mode $$\mu$$. Figure 3.28 illustrates this idea.

The result we have obtained for the Gaussian distribution in the above example actually holds in general. For $$\beta\rightarrow\infty$$ the distribution defined by the density $$f_{(\beta)}(x)$$ converges to a distribution that has all mass on the mode(s) of $$f$$ (see Figure 3.29 for an example). It is instructive to see informally why this is the case when considering a discrete random variable with probability density function $$p(\cdot)$$ and finite support $$E$$. Denote with $$E^*$$ the set of modes of $$p$$, i.e. $$p(\xi)\geq p(x)$$ for all $$\xi\in E^*$$ and $$x\in E$$, and with $$m:=p(\xi)$$ with $$\xi\in E^*$$. Then \begin{align*} p_{(\beta)}(x) &=\frac{(p(x))^{\beta}}{\sum_{y \in E^*} (p(y))^{\beta} + \sum_{y \in E\backslash E^*} (p(y))^{\beta} }\\ &=\frac{(p(x)/m)^{\beta}}{\sum_{y \in E^*} 1 + \sum_{y \in E\backslash E^*} (p(y)/m)^{\beta} }\overset{\beta\rightarrow\infty}{\longrightarrow} \left \{ \begin{array}{cc} 1/|E^*| & \textrm{if x\in E^*} \\ 0 & \textrm{if x\not\in E^*} \end{array} \right. \end{align*}

In general in the continuous case the distribution is not uniform on the modes and depends on the curvature around each mode (see Hwang 1980 for details). An outline of the argument is (Figure 3.30):

• Take $$x^\star \in \arg\max_x f(x)$$ and let:

•  $$E_1 = \{ x : f(x) \geq f(x^\star) - \epsilon\}.$$

•  $$E_2 = \{ x : f(x) \geq f(x^\star) - 2\epsilon\} \setminus E_1.$$

•  $$D = \mathbb{R}^d \setminus (E_1 \cup E_2).$$

• Consider: $$\lim\limits_{\beta\rightarrow \infty} P_{(\beta)}(E_1 \cup E_2) / P_{(\beta)}(D)$$.

• Use: $$P_{(\beta)}(D) / P_{(\beta)}(E_1) \rightarrow 0$$.

Under slightly stronger regularity conditions one can employ an elementary argument along the same lines as the one described in the discrete case above.

We can use a random-walk Metropolis algorithm to sample from $$f_{(\beta)}(\cdot)$$. The probability of accepting a move from $${\boldsymbol{X}}^{(t-1)}$$ to $${\boldsymbol{X}}$$ would be $\min\left\{1,\frac{f_{(\beta)}({\boldsymbol{X}})}{f_{(\beta)}({\boldsymbol{X}}^{(t-1)})}\right\}=\min\left\{1,\left(\frac{f({\boldsymbol{X}})}{f({\boldsymbol{X}}^{(t-1)})}\right)^{\beta}\right\}.$ Note that this probability does not depend on the (generally unknown) normalisation constant of $$f_{(\beta)}(\cdot)$$. It is however difficult to directly sample from $$f_{(\beta)}$$ for large values of $$\beta$$: for $$\beta\rightarrow \infty$$ the probability of accepting a newly proposed $$X$$ becomes $$1$$ if $$f(X)>f(X^{(t-1)})$$ and $$0$$ otherwise. Thus $$X^{(t)}$$ converges to a local extrema of the density $$f$$, however not necessarily a mode of $$f$$ (i.e. a global extremum of the density). Whether $$X^{(t)}$$ gets caught in a local extremum or not, depends on whether we can reach the mode from the local extrema of the density within one step. The following example illustrates this problem.

Example 3.19 Consider the following simple optimisation problem of finding the mode of the distribution defined on $$\{1,2,\dots,5\}$$ by $p(x)=\left\{\begin{array}{ll} 0.4 & \textrm{ for x=2}\\ 0.3 & \textrm{ for x=4}\\ 0.1 & \textrm{ for x=1,3,5.} \end{array}\right.$ Figure 3.31 illustrates this distribution. Clearly, the (global) mode of $$p(x)$$ is at $$x=2$$. Assume we want to sample from $$p_{(\beta)}(x)\propto p(x)^{\beta}$$ using a random walk Metropolis algorithm with proposed value $$X=X^{(t-1)}+\varepsilon$$ with $${{\mathbb{P}}_{}\left(\varepsilon=\pm 1\right)}=0.5$$ for $$X^{(t-1)}\in\{2,3,4\}$$, $${{\mathbb{P}}_{}\left(\varepsilon=+1\right)}=1$$ for $$X^{(t-1)}=1$$, and $${{\mathbb{P}}_{}\left(\varepsilon=-1\right)}=1$$ for $$X^{(t-1)}=5$$. In other words, we can either move one to the left, stay in the current value (when the proposed value is rejected), or move one to the right. Note that for $$\beta\rightarrow \infty$$ the probability for accepting a move from $$4$$ to $$3$$ converges to $$0$$, as $$p(4)>p(3)$$. As the Markov of chain can only move from $$4$$ to $$2$$ only via $$3$$, it cannot escape the local extremum at $$4$$ for $$\beta\rightarrow + \infty$$.*

For large $$\beta$$ the distribution $$f_{(\beta)}(\cdot)$$ is concentrated around the modes, however at the price of being difficult to sample from: the resulting Markov chain has very poor mixing properties: for large $$\beta$$ the algorithm can hardly move away from a local extremum surrounded by areas of low probability (the density of such a distribution would have many local extrema separated by areas where the density is effectively $$0$$).

The key idea of simulated annealing6 is to sample from a target distribution that changes over time: $$f_{(\beta_t)}(\cdot)$$ with $$\beta_t\rightarrow \infty$$. Before we consider different strategies for choosing the sequence $$(\beta_t)$$, we generalise the framework developed so far to finding the global extrema of arbitrary functions.

### 3.5.1 Minimising an arbitrary function

Consider that we want to find the global minimum of a function $$h:E\rightarrow \mathbb{R}$$. Finding the global minimum of $$H(x)$$ is equivalent to finding the mode of a distribution $f(x)\propto \exp(-H(x)) \textrm{ for x\in E},$ if such a distribution exists. In this framework, finding the mode of a density $$f$$ corresponds to finding the minimum of $$-\log(f(x))$$. As in the previous section we can raise $$f$$ to large powers to obtain a distribution $f_{(\beta_t)}(x)=\left(f(x)\right)^{\beta_t}\propto \exp(-\beta_t\cdot H(x)) \textrm{ for x\in E}.$ We hope to find the (global) minimum of $$H(x)$$, which is the (global) mode of the distribution defined by $$f_{\beta_t}(x)$$, by sampling from a Metropolis–Hastings algorithm. As suggested above we let $$\beta_t\rightarrow \infty$$. This yields the following algorithm:

Algorithm 3.6 (Simulated Annealing). Starting with $${\boldsymbol{X}}^{(0)}:=(X_1^{(0)},\dots,X_p^{(0)})$$ and $$\beta^{(0)}>0$$ iterate for $$t=1,2,\dots$$

1. Increase $$\beta^{(t-1)}$$ to $$\beta^{(t)}$$ (see below for different annealing schedules).

2. Draw $${\boldsymbol{X}}\sim q(\cdot\mid {\boldsymbol{X}}^{(t-1)})$$.

3. Compute $\alpha({\boldsymbol{X}}\mid {\boldsymbol{X}}^{(t-1)})=\min\left\{1, \exp\left(-\beta_t(H({\boldsymbol{X}})-H({\boldsymbol{X}}^{(t-1)}))\right)\cdot\frac{\displaystyle q({\boldsymbol{X}}^{(t-1)}\mid {\boldsymbol{X}})}{\displaystyle q({\boldsymbol{X}}\mid {\boldsymbol{X}}^{(t-1)})} \right\}.$

4. With probability $$\alpha({\boldsymbol{X}}\mid {\boldsymbol{X}}^{(t-1)})$$ set $${\boldsymbol{X}}^{(t)}={\boldsymbol{X}}$$, otherwise set $${\boldsymbol{X}}^{(t)}={\boldsymbol{X}}^{(t-1)}$$.

If a random walk Metropolis update is used (i.e. $${\boldsymbol{X}}={\boldsymbol{X}}^{(t-1)}+{\boldsymbol{\epsilon}}$$ with $${\boldsymbol{\epsilon}}\sim g(\cdot)$$ for a symmetric $$g$$), then the probability of acceptance becomes $\alpha({\boldsymbol{X}}\mid {\boldsymbol{X}}^{(t-1)})=\min\left\{1, \exp\left(-\beta_t(H({\boldsymbol{X}})-H({\boldsymbol{X}}^{(t-1)}))\right)\right\}.$

Using the same arguments as in the previous section, it is easy to see that the simulated annealing algorithm converges to a local minimum of $$H(\cdot)$$. Whether it will be able to find the global minimum depends on how slowly we let the inverse temperature $$\beta$$ go to infinity.

Logarithmic tempering

When choosing $$\beta_t=\frac{\log(1+t)}{\beta_0}$$, the inverse temperature increases slow enough that global convergence results can be established for certain special cases. Hajek (1988) established global convergence when $$H(\cdot)$$ is optimised over a finite set using a proposal which is uniform over $$E$$ and logarithmic tempering with a suitably large $$\beta_0$$. Andrieu, Breyer, and Doucet (2001) use Foster–Lyapunov type arguments to establish convergence on more general spaces (under appropriate conditions).

Assume we choose $$\beta_0=\Delta H$$ with $$\Delta H:= \max_{x,x'\in E} |H(x)-H(x')|$$. Then the probability of reaching state $$x$$ in the $$t$$-th step is ${{\mathbb{P}}_{}\left(X^{(t)}=x\right)}=\sum_{\xi}\underbrace{{{\mathbb{P}}_{}\left(X^{(t)}=x\mid X^{(t-1)}=\xi\right)}}_{\geq \exp(-\beta_{t}\Delta H)/|E|}{{\mathbb{P}}_{}\left(X^{(t-1)}=\xi\right)} \geq \exp(-\beta_{t}\Delta H)/|E|$ Using the logarithmic tempering schedule we obtain $${{\mathbb{P}}_{}\left(X^{(t)}=x\right)}\geq 1 / \left((1 + t) |E|\right)$$ and thus the expected number of visits to state $$x$$ is $\sum_{t=0}^{\infty} {{\mathbb{P}}_{}\left(X^{(t)}=x\right)}\geq \sum_{t=0}^{\infty} [(1+t)|E|]^{-1}=\infty.$ Thus every state is recurrent. As $$\beta$$ increases we however spend an ever increasing amount of time in the global minima of $$x$$.

On the one hand visiting very state $$x$$ infinitely often implies that we can escape from local minima. On the other hand, this implies as well that we visit every state $$x$$ (regardless of how large $$H(x)$$ is) infinitely often. In other words, the reason why simulated annealing with logarithmic tempering works, is that it still behaves very much like an exhaustive search. However the only reason why we consider simulated annealing is that exhaustive search would be too slow! For this reason, logarithmic tempering has little practical relevance.

Geometric tempering

A popular choice is $$\beta_t=\alpha^t\cdot \beta_0$$ for some $$\alpha>1$$.

Example 3.20 Assume we want to find the maximum of the function \begin{align*} H(x)&=\left((x-1)^2-1\right)^2+3\cdot s(11.56 \cdot x^2),\,\textrm{ with }\\ s(x)&=\left\{ \begin{array}{ll} |x| \mod 2 & \textrm{ for 2k\leq |x|\leq 2k+1}\\ 2-|x|\mod 2 & \textrm{ for 2k+1\leq |x|\leq 2(k+1)} \end{array} \right. \end{align*} for $$k\in\mathbb{N}_0$$. Figure 3.32 shows $$H(x)$$ for $$x \in [-1,3]$$. The global minimum of $$H(x)$$ is at $$x=0$$. We simulated annealing with a geometric tempering with $$\beta_0=1$$ and $$\beta_{t}=1.001 \beta_{t-1}$$ and a random walk Metropolis algorithm with $$\varepsilon \sim {\textsf{Cauchy}\left( 0,\sqrt{0.1} \right)}$$. Figure 3.33 shows the first 1,000 iterations of the Markov chain yielded by the simulated annealing algorithm. Note that when using a Gaussian distribution with small enough a variance the simulated annealing algorithm is very likely to remain in the local minimum at $$x\approx 1.8$$.

Note that there is no guarantee that the simulated annealing algorithm converges to the global minimum of $$H(x)$$ in finite time. In practice, it would be unrealistic to expect simulated annealing to converge to a global minimum, however in most cases it will find a “good” local minimum.

### References

Andrieu, C., L. Breyer, and A. Doucet. 2001. “Convergence of Simulated Annealing Using Foster-Lyapunov Criteria.” Journal of Applied Probability 38 (4): 975–94.
Brockwell, P. J., and R. A. Davis. 1991. Time Series: Theory and Methods. 2nd ed. New York: Springer.
Brooks, S., and A. Gelman. 1998. “General Methods for Monitoring Convergence of Iterative Simulations.” Journal of Computational and Graphical Statistics 7 (4): 434–55.
Fahrmeir, L., and G. Tutz. 2001. Multivariate Statistical Modelling Based on Generalised Linear Models. 2nd ed. New York: Springer.
Gelfand, A. E., and A. F. M. Smith. 1990. “Sampling Based Approaches to Calculating Marginal Densities.” Journal of the American Statistical Association 85: 398–409.
Gelman, A., G. O. Roberts, and W. R. Gilks. 1995. “Efficient Metropolis Jumping Rules.” In Bayesian Statistics, edited by J. M. Bernado, J. Berger, A. Dawid, and A. Smith, 5:599–607. Oxford: Oxford University Press.
Gelman, A., and B. D. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7: 457–72.
Geman, S., and D. Geman. 1984. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (6): 721–41.
Gong, L., and J. M. Flegal. 2016. “A Practical Sequential Stopping Rule for High- Dimensional Markov Chain Monte Carlo.” Journal of Computational and Graphical Statistics 25 (3): 684–700.
Guihennec-Jouyaux, C., K. L. Mengersen, and C. P. Robert. 1998. MCMC Convergence Diagnostics: A ‘Reviewww’.” 9816. Institut National de la Statistique et des Etudes Economiques.
Hajek, B. 1988. “Cooling Schedules for Optimal Annealing.” Mathematics of Operations Research 13 (2): 311–29.
Hastings, W. K. 1970. Monte Carlo Sampling Methods Using Markov Chains and Their Applications.” Biometrika 52: 97–109.
Hwang, C.-R. 1980. Laplace’s Method Revisited: Weak Convergence of Probability Measures.” Annals of Probability 8 (6): 1177–82.
Jones, G. L. 2004. “On the Markov Chain Central Limit Theorem.” Probability Surveys 1: 299–320.
Kirkpatrick, S., C. D. Gelatt Jr., and M. P. Vecchi. 1983. “Optimization by Simulated Annealing.” Science 270 (4598): 671–80.
Liu, J. S., W. H. Wong, and A. Kong. 1995. “Covariance Structure and Convergence Rate of the Gibbs Sampler with Various Scans.” Journal of the Royal Statistical Society B 57 (1): 157–69.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, and A. H. Teller. 1953. “Equation of State Calculations by Fast Computing Machines.” Journal of Chemical Physics 21: 1087–92.
Owen, A. O. 2017. “Statistically Efficient Thinning of a Markov chain Sampler.” Journal of Computational and Graphical Statistics 26 (3): 738–44.
Philippe, A., and C. P. Robert. 2001. “Riemann Sums for MCMC Estimation and Convergence Monitoring.” Statistics and Computing 11 (2): 103–15.
Robert, C. P., and G. Casella. 2004. Monte Carlo Statistical Methods. Second. New York: Springer Verlag.
Roberts, G. O., A. Gelman, and W. Gilks. 1997. “Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms.” Annals of Applied Probability 7 (1): 110–20.
———. 2011. “Quantitative Non-Geometric Convergence Bounds for Independence Samplers.” Methodology and Computing in Applied Probability 13: 391–403.
Roberts, G., and R. Tweedie. 1996. “Geometric Convergence and Central Limit Theorems for Multivariate Hastings and Metropolis Algorithms.” Biometrika 83: 95–110.
Tierney, L. 1994. Markov Chains for Exploring Posterior Distributions.” Annals of Statistics 22: 1701–62.

1. In principle, if we could arrange for negative correlation we could improve upon the independent case but in practice this isn’t feasible.↩︎

2. We make use of \begin{aligned} &\left(\left(\begin{array}{c}x_1\\x_2\end{array}\right)-\left(\begin{array}{c}\mu_1\\\mu_2\end{array}\right)\right)' \left(\begin{array}{cc}\sigma^2_{1}&\sigma_{12}\\\sigma_{12}&\sigma^2_{2}\end{array}\right)^{-1}\left(\left(\begin{array}{c}x_1\\x_2\end{array}\right)-\left(\begin{array}{c}\mu_1\\\mu_2\end{array}\right)\right)\\ &=\frac{1}{\sigma_1^2\sigma_2^2 - (\sigma_{12})^2} \left(\left(\begin{array}{c}x_1\\x_2\end{array}\right)-\left(\begin{array}{c}\mu_1\\\mu_2\end{array}\right)\right)' \left(\begin{array}{cc}\sigma^2_{2}&-\sigma_{12}\\-\sigma_{12}&\sigma^2_{1}\end{array}\right)\left(\left(\begin{array}{c}x_1\\x_2\end{array}\right)-\left(\begin{array}{c}\mu_1\\\mu_2\end{array}\right)\right)\\ &=\frac{1}{\sigma_1^2\sigma_2^2 - (\sigma_{12})^2} \left( \sigma_2^2 (x_1-\mu_1)^2 - 2 \sigma_{12} (x_1-\mu_1)(x_2-\mu_2) \right) + \textrm{ const}\\ &=\frac{1}{\sigma_1^2\sigma_2^2 - (\sigma_{12})^2} \left( \sigma_2^2 x_1^2 - 2 \sigma_2^2 x_1\mu_1 - 2\sigma_{12} x_1(x_2-\mu_2) \right) + \textrm{ const}\\ &=\frac{1}{\sigma_1^2 - (\sigma_{12})^2/\sigma_2^2} \left( x_1^2 - 2 x_1 (\mu_1 + \sigma_{12}/\sigma_2^2(x_2-\mu_2)) \right) + \textrm{ const}\\ &=\frac{1}{\sigma_1^2 - (\sigma_{12})^2/\sigma_2^2} \left( x_1 - (\mu_1 + \sigma_{12}/\sigma_2^2(x_2-\mu_2) \right)^2 + \textrm{ const}\\\end{aligned}↩︎

3. Note that, despite the slight abuse of notation, the transition kernel (3.7) is not absolutely continuous with respect to the Lebesgue measure (i.e. it doesn’t have a simple density).↩︎

4. The term annealing comes from metallurgy and refers to the technique of melting a metal before allowing that metal to cool down slowly in order to reach a lower energy state and consequently produce a tougher metal. Following this analogy, $$1/\beta$$ is typically referred to as temperature, $$\beta$$ as inverse temperature.↩︎