4 Augmentation: Extending the Space
A very general technique in the field of simulation based inference is to augment the space on which simulation is done with auxiliary variables whose presence makes the problem easier. It may seem counterintuitive that making the space on which one must sample larger can make the sampling easier, but as we shall see in this chapter their are many techniques in the literature which can be seen as particular cases of this general strategy.
4.1 Composition Sampling
Consider the problem of drawing samples from a mixture distribution, i.e. one with a density of the form \[f_{{\boldsymbol{X}}}({\boldsymbol{x}}) = \sum\limits_{i=1}^k w_i f_i({\boldsymbol{x}})\] where \({\boldsymbol{w}}=(w_1,\dots,w_k)\) is a vector of nonnegative real numbers which sum to one. These correspond to component weights and \(\{f_i\}_{i=1}^k\) corresponds to a family of \(k\) different probability densities.
In principle one can readily develop techniques for sampling from such densities using the ideas described in Section 2.1. However, it’s convenient to have a simple generic method which can be used whenever we have techniques for sampling from the \(f_i\) individually.
Consider introducing an auxiliary variable \(Z\) which has a discrete distribution over \(\{1,\dots,k\}\) with associated vector of probability masses \({\boldsymbol{w}}\). The joint distribution \[\begin{aligned} f_{{\boldsymbol{X}},Z}({\boldsymbol{x}},z) = \sum_{i=1}^k w_i \delta_{i,z} f_z({\boldsymbol{x}}),\end{aligned}\] admits the marginal over \({\boldsymbol{x}}\): \[\begin{aligned} \sum_{z =1}^k f_{{\boldsymbol{X}},Z}({\boldsymbol{x}},z) &= \sum_{z=1}^k \sum_{i=1}^k w_i \delta_{i,z} f_z({\boldsymbol{x}}) = \sum_{i=1}^k w_i \sum_{z=1}^k \delta_{i,z} f_z({\boldsymbol{x}}) = \sum\limits_{i=1}^k w_i f_i({\boldsymbol{x}}) = f_{{\boldsymbol{X}}}({\boldsymbol{x}}),\end{aligned}\] as required.
The basis of composition sampling is that \(f_{{\boldsymbol{X}},Z}\) also admits a straightforward marginal distribution over \(z\), by construction, \(f_{Z}(z) = \sum_{i=1}^k w_i \delta_{i,z}\) and the conditional distribution of \({\boldsymbol{X}}\) given \(Z=z\) is simply \(f_{{\boldsymbol{X}}Z}({\boldsymbol{x}}z) = f_z({\boldsymbol{x}})\).
Combining these ideas, we conclude that we can sample from \(f_{{\boldsymbol{X}},Z}\) by sampling \(Z \sim {\textsf{Cat}\left( {\boldsymbol{w}} \right)}\), sampling \({\boldsymbol{X}}\) from its conditional distribution given the realized value of \(Z\): \({\boldsymbol{X}}\{Z=z\} \sim f_z\) and then discarding the auxiliary variable \(z\).
4.2 Rejection Revisited
We can also look at rejection sampling through the lens of spatial extension. Consider the following scenario. We are interested in obtaining samples from \(f_{{\boldsymbol{X}}}\), know that \(\sup_{{\boldsymbol{x}}} f_{{\boldsymbol{X}}}({\boldsymbol{x}}) / g_{{\boldsymbol{X}}}({\boldsymbol{x}}) \leq M < \infty\) for some density \(g_{{\boldsymbol{X}}}\) from which we can sample and some real constant \(M\). We extend the space by introducing an additional \(U\) which takes its values in \({\mathbb{R}}_+\) and define the joint distributions: \[\begin{aligned} g_{{\boldsymbol{X}},U}({\boldsymbol{x}},u) &\propto \mathbb{I}_{\mathsf{G}_M}(({\boldsymbol{x}},u)), & f_{{\boldsymbol{X}},U}({\boldsymbol{x}},u) &\propto \mathbb{I}_{\mathsf{F}}(({\boldsymbol{x}},u)),\end{aligned}\] where \(\mathsf{G}_M:= \{({\boldsymbol{x}},u) \in \mathcal{X} \otimes {\mathbb{R}}_+ : u \leq M g({\boldsymbol{x}})\}\) and \(\mathsf{F} := \{({\boldsymbol{x}},u) \in \mathcal{X} \otimes {\mathbb{R}}: u \leq f({\boldsymbol{x}})\}\) are simply the sets of points beneath \(Mg\) and \(f\), respectively.
If \(g_{{\boldsymbol{X}}}\) is tractable then we can straightforwardly sample from \(g_{{\boldsymbol{X}},U}\) by sampling \({\boldsymbol{X}}\sim g_{{\boldsymbol{x}}}\) and \(U  {\boldsymbol{X}}= {\boldsymbol{x}}\sim {\textsf{U}{\left[0,M g({\boldsymbol{x}})\right]}}\). We could then imagine conducting importance sampling to approximate expectations under \(f_{{\boldsymbol{X}},U}\), which would lead us to an estimator for \(I = \int f_{{\boldsymbol{X}},U}({\boldsymbol{x}},u)\varphi({\boldsymbol{x}},u) d{\boldsymbol{x}} du\) of the form \[\begin{aligned} \hat{I}_\varphi^n = \frac{\sum_{i=1}^n \frac{\mathbb{I}_{\mathsf{F}}(({\boldsymbol{X}}_i,U_i))}{\mathbb{I}_{\mathsf{G}_M}(({\boldsymbol{X}}_i,U_i))} \varphi({\boldsymbol{X}}_i)}{\sum_{i=1}^n \frac{\mathbb{I}_{\mathsf{F}}(({\boldsymbol{X}}_i,U_i))}{\mathbb{I}_{\mathsf{G}_M}(({\boldsymbol{X}}_i,U_i))}}\end{aligned}\] noting that \(\mathsf{F} \subseteq \mathsf{G}_M\) and that the probability (under the sampling mechanism by which we have just described for simulating these random variables which amounts to sampling from the uniform distribution over \(\mathsf{G}_M\)) that \({\boldsymbol{X}}_i \not\in \mathsf{G}_M\) is 0 allowing us to adopt the convention that 0/0 = 0 here, we obtain: \[\begin{aligned} \hat{I}_\varphi^n &= \frac{\sum_{i=1}^n \mathbb{I}_{\mathsf{F}}(({\boldsymbol{X}}_i,U_i)) \varphi({\boldsymbol{X}}_i,U)} {\sum_{i=1}^n \mathbb{I}_{\mathsf{F}}(({\boldsymbol{X}}_i,U_i))} = \frac{\sum_{\{i:({\boldsymbol{X}}_i,U_i) \in \mathsf{F}\}} \varphi({\boldsymbol{X}}_i,U)}{\sum_{\{i:({\boldsymbol{X}}_i,U_i) \in \mathsf{F}\}} 1}.\end{aligned}\] Note that this is simply the sample average of the function \(\varphi\) over those points which fell within \(\mathsf{F}\). If we restrict our attention to \(\varphi({\boldsymbol{x}},u) = \varphi({\boldsymbol{x}})\) (i.e. we consider functions which depend only upon \({\boldsymbol{x}}\)) it is clear that we’ve recast the simple Monte Carlo estimate of the expectation of a function under \(f_{{\boldsymbol{X}}}\) using a sample obtained using \(n\) proposals from \(g\) within a rejection sampler as an importance sampling estimate on an extended space.
The relationship between rejection and importance sampling is well known and has been studied by many authors (Chen 2005; Perron 1999).
4.3 Data Augmentation
Perhaps the most widely known spatial extension technique is that known as data augmentation, introduced by Tanner and Wong (1987).
Consider a latent variable model: a statistical model in which one has unknown parameters about which one wishes to performance inference, \({\boldsymbol{\theta}}\), observations which are known, \({\boldsymbol{y}}\), and a collection of hidden (latent) variables, \({\boldsymbol{z}}\). Typically, the joint distribution of all these quantities, say, \(f_{{\boldsymbol{Y}},{\boldsymbol{Z}},{\boldsymbol{\theta}}}\) is known but integrating out the latent variables is not feasible. Without access to \(f_{{\boldsymbol{Y}},{\boldsymbol{\theta}}}\) it’s not possible to implement, directly, an MCMC algorithm with the associated marginal posterior distribution \(f_{{\boldsymbol{\theta}}{\boldsymbol{Y}}}\) as its target.
The basis of data augmentation is to augment the vector of parameters \({\boldsymbol{\theta}}\) with these latent variables, \({\boldsymbol{z}}\) and to run an MCMC algorithm (or other Monte Carlo algorithm of your choice) which instead targets the joint posterior distribution \(f_{{\boldsymbol{\theta}},{\boldsymbol{Z}}{\boldsymbol{Y}}}\) noting that this distribution admits as its marginal in \({\boldsymbol{\theta}}\) exactly the marginal posterior distribution which was the original object of inference. A mixture model is the canonical example of a model which can be susceptible to this approach.
4.4 Multiple Augmentation for Optimisation
A closely related idea used in optimisation is based around “multiple augmentation”, introducing several replicates of unobserved quantities in order to allow the maximisation of a marginal quantity which it may not be possible to evaluate. We focus here on the State Augmentation for Maximisation of Expectations algorithm of Doucet, Godsill, and Robert (2002); similar methods are also described by others including Gaetan and Yao (2003), Jacquier, Johannes, and Polson (2007). These schemes all employ MCMC; the alternative of employing a populationbased sampling method known as Sequential Monte Carlo was explored by Johansen, Doucet, and Davy (2008).
Two common optimisation problems arise in the evaluation of statistical estimators: Maximum Likelihood Estimation: Given \(L(\theta;{\boldsymbol{x}}) = f_{{\boldsymbol{x}}}({\boldsymbol{x}};\theta)\), compute \(\hat\theta_{\textrm{ML}} = \arg\max_{\theta \in \Theta} L(\theta;{\boldsymbol{x}})\) and Maximum a Posteriori Estimation: Given \(L(\theta;{\boldsymbol{x}}) = f_{{\boldsymbol{x}}}({\boldsymbol{x}};\theta)\) and prior \(f^{\textrm{prior}}(\theta)\), compute \(\hat\theta_{\textrm{MAP}} = \arg\max_{\theta \in \Theta} f^{\textrm{prior}}(\theta) L(\theta;{\boldsymbol{x}}).\)
Both can fit our simple optimisation framework, and we can see a further illustration of the workings of the annealing method by considering the sequence of distributions obtained for simple problems.
Example 4.1 (Gaussian MAP Estimation).
If \(L(\mu;{\boldsymbol{x}}) = \prod_{i=1}^n \phi_{\mu,\sigma^2}(x_i)\) with \(\sigma^2\) known,
and \(\pi(\mu) = \phi_{\mu_0,\sigma_0^2}(\mu)\), then
the posterior is \[f^{\textrm{post}}(\mu) = {\textsf{N}\left( \mu,\frac{\sigma^2 \mu + n \sigma_0^2 \bar{x}}{\sigma^2 + n \sigma_0^2} , \frac{\sigma^2 \sigma_0^2}{\sigma^2 + n \sigma_0^2} \right)},\]
and we could aim to sample from \[f_{(\beta)}^{\text{MAP}}(\mu) \propto (f^{\textrm{post}}(\mu))^\beta \propto {\textsf{N}\left( \mu, \frac{\sigma^2 \mu + n \sigma_0^2 \bar{x}}{\sigma^2 + n \sigma_0^2} , \frac{\sigma^2 \sigma_0^2}{\beta(\sigma^2 + n \sigma_0^2)} \right)}.\]
Example 4.2 (Example: Normal ML Estimation).
If \(L(\mu;{\boldsymbol{x}}) = \prod_{i=1}^n \phi_{\mu,\sigma^2}(x_i)\) with \(\sigma^2\) known,
we could view the likelihood as being proportional to a distribution over \(\mu\): \[f(\mu) = \textsf{N} \left(\mu ; \bar{x},\sigma^2/n \right),\]
and we could aim to sample from \[f_{(\beta)}^{\text{MLE}}(\mu) \propto (f(\mu))^\beta \propto \textsf{N} \left(\mu ; \bar{x},\sigma^2/ \beta n \right).\]
In both of these cases, the sequence of distributions concentrates on the maximiser of the original objective function, and so any algorithm able to sample from these distributions (for large enough \(\beta\)) will provide good approximations of the optimiser of the objective function. These methods involve target distributions that resemble the posterior distribution (either a real posterior, or one obtained using an instrumental prior for the purposes of approximating the MLE) which would have been obtained if there were many copies of the data, so the approach is sometimes referred to as “data cloning”.
Two closely related problems often arise when dealing with complicated statistical models. Marginal Maximum Likelihood Estimation: Given \(L(\theta;{\boldsymbol{x}}) = \int f_{{\boldsymbol{x}},{\boldsymbol{z}}}({\boldsymbol{x}},{\boldsymbol{z}};\theta) d{\boldsymbol{z}}\), compute \(\hat\theta_{\textrm{ML}} = \arg\max_{\theta \in \Theta} L(\theta;{\boldsymbol{x}})\); and Marginal Maximum a Posteriori Estimation: Given \(L(\theta;{\boldsymbol{x}}) = \int f_{{\boldsymbol{x}},{\boldsymbol{z}}}({\boldsymbol{x}},{\boldsymbol{z}};\theta) d{\boldsymbol{z}}\) and prior \(f^{\textrm{prior}}(\theta)\), compute \(\hat\theta_{\textrm{MMAP}} = \arg\max_{\theta \in \Theta} f^{\textrm{prior}}(\theta) L(\theta;{\boldsymbol{x}})\). Such problems often arise when one can write down a complete generative model for the process by which the data arose in terms of the parameters, but one only observes a subset of the random quantities generated within that model. For example, consider a mixture model in which we don’t observe the association of observations with mixture components or a genetic model in which we observe the DNA sequences of only the current generation of individuals: we don’t observe the sequences of their ancestors or their genealogical trees (Stephens 2007). If it is possible to integrate out the unobserved random quantities then we can proceed as usual, but unfortunately we can’t typically evaluate the marginal likelihoods.
Recall the demarginalisation technique for sampling from \(f_{{\boldsymbol{x}}}({\boldsymbol{x}})\) by defining a convenient joint distribution \(f_{{\boldsymbol{x}},{\boldsymbol{z}}}({\boldsymbol{x}},{\boldsymbol{z}})\) which admits the distribution of interest as a marginal. In order to do this, we saw that we could introduce a set of auxiliary random variables \(Z_1,\dots,Z_r\) such that \(f_{{\boldsymbol{x}}}\) is the marginal density of \((X_1,\dots,X_p)\) under the joint distribution of \((X_1,\dots,X_p,Z_1,\dots,Z_r)\), i.e. \[f(x_1,\dots,x_p)=\int f(x_1,\dots,x_n,z_1,\dots,z_r) \;d(z_1,\dots,z_r).\] The idea of introducing some auxiliary random variables in such a way that \(f_{(\beta)}({\boldsymbol{x}})\) is the marginal distribution seems a natural extension of this idea.
In order to do this, we consider \[{L}({\boldsymbol{x}},{\boldsymbol{z}}\theta) = f_{X,Z}({\boldsymbol{x}},{\boldsymbol{z}}\theta) = f_Z({\boldsymbol{z}}\theta) f_X({\boldsymbol{x}}{\boldsymbol{z}},\theta),\] and introduce a whole collection of vectors of auxiliary variables: \[f_{\beta}^{MMAP}(\theta,{\boldsymbol{z}}_1,\dots,{\boldsymbol{z}}_\beta{\boldsymbol{x}}) \propto \prod\limits_{i=1}^\beta \left[\pi(\theta) f_Z({\boldsymbol{z}}_i) f_X({\boldsymbol{x}}{\boldsymbol{z}}_i,\theta)\right].\] We can easily establish that, by exploiting the conditional independence structure of our augmented likelihood: \[\begin{aligned} f_{\beta}^{MMAP}(\theta{\boldsymbol{x}}) &\propto \int f_{\beta}^{MMAP}(\theta,{\boldsymbol{z}}_1,\dots,{\boldsymbol{z}}_\beta{\boldsymbol{x}}) d{\boldsymbol{z}}_1,\dots d{\boldsymbol{z}}_\beta\\ &\propto \pi(\theta)^\beta f_X({\boldsymbol{x}}\theta)^\beta = f^{\textrm{post}}(\theta{\boldsymbol{x}})^\beta.\end{aligned}\] This idea is the basis of the State Augmentation for Maximisation of Expectations (SAME) algorithm (Doucet, Godsill, and Robert 2002).
In the case of maximising the likelihood rather than the posterior we need to be slightly more careful. The likelihood is a probability density over the data, but need not even be integrable if viewed as a function of the parameters. We can address this problem by introducing an instrumental prior distribution (one used exclusively for computational reasons which is not intended to have any influence on the resulting inference.
Considering \[{L}(\theta;{\boldsymbol{x}},{\boldsymbol{z}}) = f_{X,Z}({\boldsymbol{x}},{\boldsymbol{z}}\theta) = f_Z({\boldsymbol{z}}\theta) f_X({\boldsymbol{x}}{\boldsymbol{z}},\theta),\] we can again consider multiple augmentation—this time for MMLE estimation—by setting \[f_{\beta}^{MMLE}(\theta,{\boldsymbol{z}}_1,\dots,{\boldsymbol{z}}_\beta{\boldsymbol{x}}) \propto \pi(\theta) \prod\limits_{i=1}^\beta \left[ f_Z({\boldsymbol{z}}_i) f_X({\boldsymbol{x}}{\boldsymbol{z}}_i,\theta)\right],\] which ensures that \[\begin{aligned} f_{\beta}^{MMLE}(\theta{\boldsymbol{x}}) &\propto \int f_{\beta}^{MMLE}(\theta,{\boldsymbol{z}}_1,\dots,{\boldsymbol{z}}_\beta{\boldsymbol{x}}) d{\boldsymbol{z}}_1,\dots d{\boldsymbol{z}}_\beta\\ &\propto \left[\pi(\theta)^{(1/\beta)} f_X({\boldsymbol{x}}\theta)\right]^\beta \approx L(\theta;{\boldsymbol{x}})^\beta\end{aligned}\] for large enough \(\beta\) under support and regularity conditions on \(\pi(\cdot)\), the instrumental prior.
Both of these augmentation strategies can give rise to a sequence of target distributions if we replace \(\beta\) with \(\beta_t\), a nondecreasing sequence of numbers of replicates of the augmenting variables. (In the SAME case it can be sensible to keep \(\beta_t\) fixed at a particular value for several iterations to give the chain time to reach equilibrium before further increasing it.) And given such a sequence of target distributions, we can apply MCMC kernels for which each is invariant in essentially the same manner as we did when considering simulated annealing. In the particular case in which we can sample from all of the relevant full conditional distributions, this gives rise to Algorithm 4.1; more general cases can be dealt with via obvious extensions.
Algorithm 4.1 (The SAME Gibbs Sampler). Starting with \({\boldsymbol{\theta}}^{(0)}\) iterate for \(t=1,2,\dots\)
Increase \(\beta^{(t1)}\) to \(\beta^{(t)}\) (if necessary).
For \(k=1,\dots,\beta_t\), sample: \[{\boldsymbol{z}}_{k}^{(t)} \sim f_{Z}({\boldsymbol{z}}_{k}^{(t)}  x, {\boldsymbol{\theta}}^{(t1)}).\]
Sample: \[{\boldsymbol{\theta}}^{(t)} \sim f_{(\beta_t)}^{\dots}({\boldsymbol{\theta}}{\boldsymbol{x}}, {\boldsymbol{z}}_{1}^{(t)},\dots,{\boldsymbol{z}}_{\beta_t}^{(t)}).\]
The following toy example shows the SAME Gibbs sampler in action.
Example 4.3 Consider finding the parameters which maximise the likelihood in a setting in which the likelihood is a student \(t\)distribution of unknown location parameter \(\theta\) with \(0.05\) degrees of freedom. Four observations are available, \({\boldsymbol{x}}=(20,1,2,3)\).
In this case, the marginal likelihood is known (and we can use this knowledge to verify that the algorithm works as expected): \[\log p({\boldsymbol{x}}\theta) = 0.525 \sum\limits_{i=1}^{4} \log\left( 0.05 + (x_{i}  \theta)^{2}\right).\] This marginal likelihood is illustrated in Figure 4.1–4.2.
However, it is also possible to write down an augmented complete likelihood admitting this as a marginal distribution, by exploiting the fact that the student \(t\)distribution may be written as a scale mixture of normal densities: \[\begin{aligned} \log p({\boldsymbol{x}},{\boldsymbol{z}}\theta) &= \sum\limits_{i=1}^{4} \left[ 0.475 \log z_{i} + 0.025 z_{i} + 0.5 z_{i}(x_{i}  \theta)^{2} \right],\\ p_{(\beta_t)}({\boldsymbol{z}}_{1:\beta_{t}}\theta,{\boldsymbol{x}}) &= \prod\limits_{i=1}^{\beta_{t}}\prod\limits_{j=1}^4 \textsf{Gamma} \left(z_{i,j} ; 0.525,0.025 + \frac{(x_{j}  \theta)^{2}}{2} \right), \\ p_{(\beta_t)}(\theta {\boldsymbol{z}}_{1:\beta_{t}}) &\propto \textsf{N} \left(\theta ; \mu^{(\theta)}_{t},\Sigma^{(\theta)}_{t} \right),\end{aligned}\] where the parameters are \[\begin{aligned} \Sigma^{(\theta)}_{t} &= \left[ \sum\limits_{i=1}^{\beta_t} \sum\limits_{j=1}^{4} z_{i,j} \right] ^{1}, & \mu^{(\theta)}_{t} &= \Sigma^{(\theta)}_{t} \sum\limits_{i=1}^{\beta_t} y^{T} z_{i}.\end{aligned}\] We can straightforwardly implement the SAME Gibbs sampler for this problem..
It is perhaps more interesting to return to the familiar mixture model for which we have already considered several forms of inference. Lets apply the data augmentation approach to the problem of maximising the posterior density. (Note that one cannot use maximum likelihood estimation, at least directly, in this setting as the likelihood is not bounded above: if a cluster mean coincides exactly with an observation then making the variance of that component arbitrarily small leads to an arbitrarily high likelihood!)
Example 4.4 (MAP Estimation for a Gaussian Mixture Model). Consider again the Gaussian mixture model in which we assume that the density of \(y_i\) is a mixture of Gaussians \[f(y_i\pi_1,\dots,\pi_k,\mu_1,\dots,\mu_k,\tau_1,\dots,\tau_k)=\sum_{\kappa=1}^k \pi_{\kappa} \phi_{(\mu_{\kappa},1/\tau_{\kappa})}(y_i).\] Suitable prior distributions are a Dirichlet distribution for \((\pi_1,\dots,\pi_k)\), a Gaussian for \(\mu_{\kappa}\), and a Gamma distribution for \(\tau_{\kappa}\). In order to ensure identifiability we assume the \(\mu_{\kappa}\) are ordered, i.e. \(\mu_1<\dots<\mu_k\) and make the corresponding change to the posterior density (in order to compensate for setting the density to zero for all configurations which fail to satisfy the ordering constraint, the density of all configurations compatible with the constraint must be increased by a factor of \(k!\)). Here we assume that \(k\) is known, and have:
\(n\) iid observations, \(x_1,\dots,x_n\).
Likelihood \(f_{X,Z}(x_i,z_i\omega,\mu,\sigma) = \omega_{z_i} \textsf{N} \left(x_i ; \mu_{z_i},\sigma_{z_i}^2 \right)\).
Marginal likelihood \(f_X(x_i\omega, \mu, \sigma) = \sum\limits_{j=1}^K \omega_j \textsf{N} \left(x_i ; \mu_j,\sigma_j^2 \right)\).
Diffuse conjugate priors: \[\begin{aligned} \omega &\sim {\textsf{Dirichlet}\left( \chi,\dots,\chi \right)},\\ \sigma_i^2 &\sim {\textsf{IG}\left( \frac{\lambda_i+3}{2},{\frac{b_i}{2}} \right)},\\ \mu_i \vert \sigma_i^2 &\sim {\textsf{N}\left( a_i,\sigma_i^2 / \lambda_i \right)}.\end{aligned}\]
All full conditional distributions of interest are available, which allows us to use our Gibbs sampling strategy. This gives rise to an iterative algorithm in which step \(t\) comprises the following steps:
Sample: \[\begin{aligned} \omega &\leftarrow {\textsf{Dirichlet}\left( \beta_t(\chi  1) + 1 + n_{1}(\beta_t), \dots, \beta_t(\chi  1) + 1 + n_{K}(\beta_t) \right)}, \\ \sigma_i^2 &\leftarrow {\textsf{IG}\left( A_i,B_i \right)},\\ \mu_i\sigma_i^2 &\leftarrow {\textsf{N}\left( \frac{\beta_t \lambda_i a_i + \bar{{\boldsymbol{x}}}^{{\beta_t}}_{i}}{\beta_t \lambda_i + n_i^{\beta_t}} , \frac{\sigma_i^2}{\beta_t \lambda_i + n_i^{\beta_t}} \right)},\end{aligned}\] where \[\begin{aligned} n_i^{\beta_t} &= \sum_{l=1}^{\beta_t} \sum_{p=1}^n \mathbb{I}_{i} (Z_{l,p}^{(t1)}), & \bar{{\boldsymbol{x}}}^{\beta_t}_{i} &= \sum_{l=1}^{\beta_t} \sum_{p=1}^n \mathbb{I}_{i} (Z_{l,p}^{(t1)}) x_j, &% \\ \overline{{\boldsymbol{x}}^2}^{\beta_t}_{i} &= \sum_{l=1}^{\beta_t} \sum_{p=1}^n \mathbb{I}_{i} (Z_{l,p}) x^2_j,\end{aligned}\]
and \[\begin{aligned} A_i &= \frac{\beta_t(\lambda_i+1) + n_i^{\beta_t}}{2} + 1,\\ B_i &= \frac{1}{2} \left( \beta_t(b_i + \lambda_i a_i^2) + \bar{{\boldsymbol{x}}^2}^{\beta_t}_{i}  \sum\limits_{g=1}^{{\beta_t}} \frac{(\bar{{\boldsymbol{x}}}^g_i  \bar{{\boldsymbol{x}}}^{g1}_i + \lambda_i a_i )^2} {\lambda_i + n^g_i  n^{g1}_i} \right).\end{aligned}\]
Sample, for \(j=1,\dots,\beta_t\): \[{\boldsymbol{z}}_j^{(t)} \sim f^{\text{posterior}}({\boldsymbol{z}}{\boldsymbol{x}},\pi^{(t)},\sigma^{(t)},\mu^{(t)}).\]
The marginal posterior can be calculated (which means that we don’t need such a complicated algorithm to deal with this problem, although it does perform well; the advantage of using such an example is that it allows us to assess the performance of the algorithm).
First we compare the performance of 50 runs of the algorithm with 50 (differently initialised) runs of a deterministic algorithm (expectation maximisation; EM) which is widely used to deal with problems of this type. Cost gives a rough indication of the computational cost of running each algorithm once.
Algorithm  \(T\)  Cost  Mean  Std. Dev.  Min  Max 

EM  500  500  158.06  3.23  166.39  153.85 
EM  5000  5000  157.73  3.83  165.81  153.83 
SAME(6)  4250  8755  155.32  0.87  157.35  154.03 
SAME(50)  4250  112522  155.05  0.82  156.11  153.98 
Two different sequences of the annealing parameter were considered:
 SAME(6):

Set \(\beta_t=1\) for the first half of the iterations and then increasing linearly to a final maximum value of 6.
 SAME(50):

Set \(\beta_t=1\) for the first 250 iterations, and then increasing linearly to 50.
The log posterior density of the generating parameters was 155.87. These parameters were: \[\begin{aligned} \pi &= \left[0.2, 0.3, 0.5 \right], & \mu &= \left[0, 2, 3 \right],& \text{ and } \sigma &= \left[1, \frac14, \frac{1}{16} \right].\end{aligned}\]
Although the EM algorithm occasionally produces good results, for this clean simulated data, some runs of the algorithm totally fail to find anything close to the global mode. The SAME algorithm is computationally more costly, but does behave more robustly. In real marginal optimisation problems, one typically cannot evaluate the objective function and so robust methods that can be relied upon to produce good solutions are required.
Next we turn to the much celebrated Galaxy data set of Roeder (1990). This data set consists of the velocities of 82 galaxies, and it has been suggested that it consists of a mixture of between 3 and 7 distinct components—for example, see Roeder and Wasserman (1997) and Escobar and West (1995). For our purposes we have estimated the parameters of a 3 component Gaussian mixture model from which we assume the data was drawn. The following table summarises the marginal posterior of the solutions found by 50 runs of each algorithm, comparing the same algorithm with the EM algorithm. Cost gives a rough indication of the computational cost of running each algorithm once.
Algorithm  \(T\)  Cost  Mean  Std. Dev.  Min  Max 

EM  500  500  46.54  2.92  54.12  44.32 
EM  5000  5000  46.91  3.00  56.68  44.34 
SAME(6)  4250  8755  45.18  0.54  46.61  44.17 
SAME(50)  4250  112522  44.93  0.21  45.52  44.47 
Again, two different sequences of annealing schedule were considered:
 SAME(6)

set \(\beta_t=1\) for the first half of the iterations and then increasing linearly to a final maximum value of 6,
 SAME(50)

set \(\beta_t=1\) for the first 250 iterations, and then increasing linearly to 50,
and again, good robustness is demonstrated by the same algorithm. A slightly more sophisticated algorithm (Johansen, Doucet, and Davy 2008) finds solutions broadly consistent with the \(43.96\pm0.03\) found here.
4.5 Approximate Bayesian Computation
The Approximate Bayesian Computation (ABC) approach to inference has become extremely popular for performing inference for models whose likelihood is not tractable (either in the sense that we can’t evaluate it pointwise or that such evaluation is prohibitively expensive). Such models abound in some areas, such as ecology and phylogenetics, and these methods have consequently received a great deal of attention in recent years.
It was Pritchard et al. (1999) who introduced the method, although there are some connections to earlier work such as Diggle and Gratton (1984) and Tavaré et al. (1997). It is not always viewed as a spatial extension technique, but it can be quite helpful to think about it in these terms.
Before moving on to consider ABC itself, think about a simple case in which one has a target distribution \(f_{{\boldsymbol{X}}{\boldsymbol{Y}}}({\boldsymbol{x}}{\boldsymbol{y}})\) which will typically be a Bayesian posterior distribution (and \({\boldsymbol{y}}\) the observed data). This distribution is written, via Bayes rule as: \[f_{{\boldsymbol{X}}{\boldsymbol{Y}}}({\boldsymbol{x}}{\boldsymbol{y}}) = \frac{f_{{\boldsymbol{Y}}{\boldsymbol{X}}}({\boldsymbol{y}}{\boldsymbol{x}}) f_{{\boldsymbol{X}}}({\boldsymbol{x}})}{f_{{\boldsymbol{Y}}}({\boldsymbol{y}})}.\] If both \(f_{{\boldsymbol{X}}}({\boldsymbol{x}})\) and \(f_{{\boldsymbol{Y}}{\boldsymbol{X}}}({\boldsymbol{y}}{\boldsymbol{x}})\) can be evaluated pointwise then we can use standard simulation techniques to obtain samples which we can use to approximate our target distribution, and to approximate expectations with respect to it.
If we cannot evaluate \(f_{{\boldsymbol{Y}}{\boldsymbol{X}}}\) even pointwise, then we can’t directly use the techniques which we’ve described previously. To address this, we can invoke a clever data augmentation trick which requires only that we can sample from \(f_{{\boldsymbol{Y}}{\boldsymbol{X}}}\). First let’s consider the case in which \({\boldsymbol{Y}}\) is a discrete random variable. We can define the extended distribution, with \({\boldsymbol{Z}}\) taking it values in the space space as \({\boldsymbol{Y}}\): \[f_{{\boldsymbol{X}},{\boldsymbol{Z}}{\boldsymbol{Y}}}({\boldsymbol{x}},{\boldsymbol{z}}{\boldsymbol{y}}) \propto f_{{\boldsymbol{Y}}{\boldsymbol{X}}}({\boldsymbol{z}}{\boldsymbol{x}}) f_{{\boldsymbol{X}}}({\boldsymbol{x}}) \delta_{{\boldsymbol{y}},{\boldsymbol{z}}}\] and note that it has as a marginal distribution, our target: \[\begin{aligned} \sum_{{\boldsymbol{z}}} f_{{\boldsymbol{X}},{\boldsymbol{Z}}{\boldsymbol{Y}}}({\boldsymbol{x}},{\boldsymbol{z}}{\boldsymbol{y}}) \propto \sum_{{\boldsymbol{z}}} f_{{\boldsymbol{Y}}{\boldsymbol{X}}}({\boldsymbol{z}}{\boldsymbol{x}}) f_{{\boldsymbol{X}}}({\boldsymbol{x}}) \delta_{{\boldsymbol{y}},{\boldsymbol{z}}} = f_{{\boldsymbol{Y}}{\boldsymbol{X}}}({\boldsymbol{y}}{\boldsymbol{x}}) f_{{\boldsymbol{X}}}({\boldsymbol{x}}).\end{aligned}\]
In the simplest case, we can sample \(({\boldsymbol{X}},{\boldsymbol{Z}}) \sim f_{{\boldsymbol{Y}}{\boldsymbol{X}}}({\boldsymbol{z}}{\boldsymbol{x}})f_{{\boldsymbol{X}}}({\boldsymbol{x}})\) using this as a rejection sampling proposal for our target distribution, keeping samples with probability proportional to \[f_{{\boldsymbol{X}},{\boldsymbol{Z}}{\boldsymbol{Y}}}({\boldsymbol{x}},{\boldsymbol{z}}{\boldsymbol{y}}) / f_{{\boldsymbol{Y}}{\boldsymbol{X}}}({\boldsymbol{z}}{\boldsymbol{x}}) f_{{\boldsymbol{X}}}({\boldsymbol{x}}).\] This probability can easily be seen to be proportional to \(\delta_{{\boldsymbol{y}},{\boldsymbol{z}}}\). So this rejection sampling algorithm amounts to sampling \({\boldsymbol{X}}\) from its prior distribution; sampling an artificial set of data from the model \(f_{{\boldsymbol{Y}}{\boldsymbol{X}}}\) and keeping the sample as a sample from the posterior only if the artificial data set exactly matches the observed one.
Thus far, this clever algorithm has made no approximations. However, the probability of a sample being accepted is exactly the probability that a data set drawn by sampling a parameter value from the prior and a data set from the datagenerating model with that parameter value exactly matches the observed data. In the case of very small discrete data sets this might be acceptable, but typically it will be vanishingly small. That’s why approximation becomes necessary.
The approximate part of ABC arises first of all by relaxing the requirement that the simulated data exactly matches the observed data and keeping any sample for which the simulated data falls within some tolerance, \(\epsilon\), of the observed data. This leads to a different target distribution: \[\begin{aligned} f^{\textrm{ABC}}_{{\boldsymbol{X}},{\boldsymbol{Z}}{\boldsymbol{Y}}} \propto f_{{\boldsymbol{Y}}{\boldsymbol{X}}}({\boldsymbol{z}}{\boldsymbol{x}}) f_{{\boldsymbol{X}}}({\boldsymbol{x}}) \mathbb{I}_{B({\boldsymbol{y}},\epsilon)}({\boldsymbol{z}}),\end{aligned}\] where \(B({\boldsymbol{y}},\epsilon) := \{{\boldsymbol{x}}: {\boldsymbol{x}} {\boldsymbol{y}} \leq \epsilon\}\), for which the marginal is no longer correct but may be approximately so under regularity conditions: \[\begin{aligned} f^{\textrm{ABC}}_{{\boldsymbol{x}}{\boldsymbol{Y}}} &\propto \int f_{{\boldsymbol{Y}}{\boldsymbol{X}}}({\boldsymbol{z}}{\boldsymbol{x}}) f_{{\boldsymbol{X}}}({\boldsymbol{x}}) \mathbb{I}_{B({\boldsymbol{y}},\epsilon)}({\boldsymbol{z}}) d{\boldsymbol{z}}\\ &\propto f_{{\boldsymbol{X}}}({\boldsymbol{x}}) \int f_{{\boldsymbol{Y}}{\boldsymbol{X}}}({\boldsymbol{z}}{\boldsymbol{x}}) \mathbb{I}_{B({\boldsymbol{y}},\epsilon)}({\boldsymbol{z}}) d{\boldsymbol{z}} \propto f_{{\boldsymbol{X}}}({\boldsymbol{x}})\int_{{\boldsymbol{z}}\in B({\boldsymbol{y}},\epsilon)} f_{{\boldsymbol{Y}}{\boldsymbol{X}}}({\boldsymbol{z}}{\boldsymbol{x}}) d{\boldsymbol{z}}.\end{aligned}\] This approximation amounts to a smoothing of the likelihood function.
Often a further approximation is introduced by considering not the data itself but some low dimensional summary. If that low dimensional summary does not constitute a sufficient statistic for the inferential task at hand, this induces an additional approximation error which doesn’t vanish even if the tolerance parameter, \(\epsilon\), is reduced to zero. We won’t consider this further here, but it is important to be aware of the impact of such approximation if you employ this type of technique in real inferential situations.
Using simple rejection sampling with such a target distribution leads to the ABC algorithm of Pritchard et al. (1999), while using this target distribution within a standard MCMC algorithm was proposed by Marjoram et al. (2003) with various approaches based around other Monte Carlo schemes, especially Sequential Monte Carlo also being proposed by various authors including Sisson, Fan, and Tanaka (2007), Del Moral, Doucet, and Jasra (2012), and Peters, Fan, and Sisson (2012).
It is important to be aware that ABC makes use of both finite tolerances and summary statistics. If the latter lack the sufficiency property this also introduces approximation error which does not go away with increased simulation effort and which can be extremely difficult to quantify or understand. Although the method is appealing in its simplicity and broad applicability, as statisticians we should be careful to understand any approximations involved in our computations.
There is work on the use of ABC within a model selection context. Early algorithms include those of Del Moral, Doucet, and Jasra (2012). Characterisation of sufficient statistics for model choice by ABC can be found in Grelaud et al. (2009), Didelot et al. (2011), and Robert et al. (2011), while Marin et al. (2014) characterises the properties required in order for insufficient summary statistics to provide asymptotically consistent Bayes factor (and hence model selection).