7.4 Monitoring Convergence

Once you’ve got your simulated Markov chain running, the question arises regarding when it should be stopped. On an infinite timeline, we know from theory that the samples drawn from the chain will come from the target density. On every other timeline, we must design some rule for stopping the chain in order to get things like parameter estimates or posterior distributions.

Determining how and when to stop a chain depends a bit on what exactly you are trying to do. In many cases, regardless of your inferential philosophy, you are trying to obtain an estimate of a parameter in a model. Typically, we estimate these parameters by using a summary statistic like the mean of the posterior distribution. We calculate this summary statistic by drawing samples from the posterior and computing the arithmetic mean.

Hence, we are estimating a parameter and in doing so we must worry about the usual things we worry about when estimating a parameter. In particular, we must worry about the uncertainty introduced by our procedure, which in this case is the Markov chain Monte Carlo sampling procedure. If our chain ran to infinity, we would have no uncertainty due to MCMC sampling (we would still have other kinds of uncertainty, such as from our finite dataset). But because our chain will not run to infinity, stopping the chain can be thought of as answering the question of how much Monte Carlo uncertainty is acceptable.

7.4.1 Monte Carlo Standard Errors

One way to measure the amount of uncertainty introduced through MCMC sampling is with Monte Carlo standard errors. The thought experiment here is, if we were to repeatedly run our MCMC sampler (with different random number generator seeds!) for a fixed \(N\) number of iterations, how much variability would we expect to see in our estimate of the parameter? Monte Carlo standard errors should give us a sense of this variability. If the standard errors are two big, we can run the sampler for longer. Exactly how long we will need to run the sampler to achieve a given standard error will depend on the efficiency and mixing of the sampler.

One way to compute Monte Carlo standard errors is with the method of batch means. The idea here is we divide our long MCMC sampler chain into equal size segments and compute our summary statistic on each segment. If the segments are of the proper size, we can think of them as “independent replicates”, even though individual samples of the MCMC sampler will not be independent of each other in general. From these replicates, we compute an estimate of variability from the given chain.

More specifically, suppose \(x_1, x_2, x_3, \dots\) are samples from an MCMC sampler that we have run for \(N\) iterations and we want to estimate \(\mathbb{E}[h(X)]\) where the expectation is take with respect to some posterior distribution. We first decide on a batch size \(K\) and let’s assume that \(N / K = M\) where \(M\) is an integer. We then divide the chain into segments \(x_1,\dots,x_K\), \(x_{K+1},\dots,x_{2K}\), etc. From here, we can compute our segment summary statistics,

\[\begin{eqnarray*} b_1 & = & \frac{1}{K}\sum_{i=1}^K h(x_i)\\ b_2 & = & \frac{1}{K}\sum_{i=K+1}^{2K} h(x_i)\\ \vdots & & \\ b_M & = & \frac{1}{K}\sum_{i=(M-1)K+1}^{MK} h(x_i) \end{eqnarray*}\]

Here, we can confirm that

\[ \bar{b} = \frac{1}{M}\sum_{i=1}^M b_i = \frac{1}{KM}\sum_{i=1}^{KM} h(x_i) \longrightarrow \mathbb{E}[h(X)] \] as \(N\rightarrow\infty\).

Once we have the batch means, we can compute a standard error based on the assumed ergodicity of the chain, which gives us,

\[ \sqrt{M}\left(\frac{\bar{b}-\mathbb{E}[h(X)]}{s}\right) \longrightarrow \mathcal{N}(0, 1). \] The batch means standard error is the square root of

\[ s^2 = \frac{K}{M}\sum_{i=1}^M(b_i-\bar{b})^2. \]

For a specific implementation of the batch means procedure, one can use the method of Jones, Haran, Caffo, and Neath. The source code for this procedure is available at Murali Haran’s web site. Jones, et al recommend a batch size of \(N^{1/2}\) (or the smallest integer closest to that) and that is the default setting for their software. If the chain is mixing relatively well, a batch size of \(N^{1/3}\) can be used and may be more efficient.

It’s worth noting that the Monte Carlo standard error is a quantity with units attached to it. Therefore, determining when the standard error is “small enough” will require a certain understanding of the context in which the problem is being addressed. No universial recommendation can be made on how small is small enough. Some view this property as a downside, preferring something that is “unit free”, but I see it as an important reminder that these parameters that we are estimating come from the real world and are addressing real problems.

7.4.2 Gelman-Rubin Statistic

Another approach to monitoring the convergence of a MCMC sampler is to think about what we might expect when a chain has “converged”. If we were to start multiple parallel chains in many different starting values, the theory claims that they should all eventually converge to the stationary distribution. So after some amount of time, it should be impossible to distinguish between the multiple chains. They should all “look” like the stationary distribution. One way to assess this is to compare the variation between chains to the variation within the chains. If all the chains are “the same”, then the between chain variation should be close to zero.

Let \(x_1^{(j)}, x_2^{(j)},\dots\) be samples from the \(j\)th Markov chain and suppose there are \(J\) chains run in parallel with different starting values.

For each chain, first discard \(D\) values as “burn-in” and keep the remaining \(L\) values, \(x_D^{(j)}, x_{D+1}^{(j)},\dots,x_{D+L-1}^{(j)}\). For example, you might set \(D = L\).
Calculate \[\begin{eqnarray*} \bar{x}_j & = & \frac{1}{L}\sum_{t=1}^L x_t^{(j)}\hspace{2em}\text{(chain mean)}\\ \bar{x}_\cdot & = & \frac{1}{J}\sum_{j=1}^J \bar{x}_j\hspace{2em}\text{(grand mean)}\\ B & = & \frac{L}{J-1} \sum_{j=1}^J (\bar{x}_j-\bar{x}_\cdot)^2\hspace{2em}\text{(between chain variance)}\\ s^2_j & = & \frac{1}{L-1} \sum_{t=1}^L (x_t^{(j)}-\bar{x}_j)^2\hspace{2em}\text{(within chain variance)}\\ W & = & \frac{1}{J}\sum_{j=1}^J s_j^2 \end{eqnarray*}\]
The Gelman-Rubin statistic is then \[ R = \frac{ \frac{L-1}{L}W + \frac{1}{L}B }{W} \]

We can see that as \(L\rightarrow\infty\) and as \(B\downarrow 0\), \(R\) approaches the value of \(1\). One can then reason that we should run our chains until the value of \(R\) is close to \(1\), say in the neighborhood of \(1.1\) or \(1.2\).

The Gelman-Rubin statistic is a ratio, and hence unit free, making it a simple summary for any MCMC sampler. In addition, it can be implemented without first specifying a parameter that is to be estimated, unlike Monte Carlo standard errors. Therefore, it can be a useful tool for monitoring a chain before any specific decisions about what kinds of inferences will be made from the model.

While the Gelman-Rubin statistic is intuitively appealing and is practically useful, it can demonstrate some undesirable properties. In particular, Flegal et al. report that the Gelman-Rubin statistic can be unstable, leading to both very long runs and very short runs. The result is that the uncertainty of any parameter being estimated with a MCMC sampler will be greater than that estimated using MC standard errors.

Therefore, for the purposes of parameter estimation, it would seem for now that Monte Carlo standard errors are a better way to go. The other benefit of the Monte Carlo standard error approach is that you don’t have to worry about burn-in (i.e. no samples are discarded). However, for more general Markov chain monitoring, the Gelman-Rubin statistic may be a reasonable tool to use.