2 Simulation-Based Inference
2.1 Simulation
Much of what we will consider in this module involves sampling from distributions; simulating some generative process to obtain realisations of random variables. A natural question to ask is: how can we actually do this? What does it mean to sample from a distribution, and how can we actually implement such a procedure using a computer?
2.1.1 Pseudorandom Number Generation
Actually, strictly speaking, we can’t generally obtain realisations of random variables of a specified distribution using standard hardware. We settle for sequences of numbers which have the same relevant statistical properties as random numbers and, more particularly, we will see that given a sequence of standard uniform (i.e. \({\textsf{U}{\left[0,1\right]}}\)) random variables, we can use some simple techniques to transform them to obtain random variables with other distributions of interest1.
A pseudorandom number generator is a deterministic procedure which, when applied to some internal state, produces a value that can be used as a proxy for a realisation of a \({\textsf{U}{\left[0,1\right]}}\) random variable and a new internal state. Such a procedure is initialised by the supply of some seed value and then applied iteratively to produce a sequence of realisations. Of course, these numbers are not in any meaningful sense random, indeed, to quote von Neumann:
Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin. …there is no such thing as a random number—there are only methods of producing random numbers, and a strict arithmetic procedure is of course not such a method.
There are some very bad pseudorandom number generators in which there are very obvious patterns in the output and their use could seriously bias the conclusions of any statistical method based around simulation. So-called linear congruential generators were very popular for a time—but thankfully that time has largely passed, and unless you’re involved with legacy code or hardware you’re unlikely to encounter such things.
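To see just how bad things can get, here is a minimal R sketch of the infamous RANDU linear congruential generator (the helper name `lcg` is ours, purely for illustration); successive triples of its output are confined to just 15 parallel planes in the unit cube, a catastrophic flaw for any three-dimensional simulation.

```r
## A toy linear congruential generator: x[k+1] = (a * x[k]) mod m, scaled to (0,1).
## Parameters are those of the infamous RANDU generator (a = 65539, m = 2^31).
lcg <- function(n, seed, a = 65539, m = 2^31) {
  out <- numeric(n)
  state <- seed
  for (i in seq_len(n)) {
    state <- (a * state) %% m   # exact in double precision since a * state < 2^53
    out[i] <- state / m
  }
  out
}
u <- lcg(3000, seed = 1)
## Plotting successive triples (u[t], u[t+1], u[t+2]) and rotating the point
## cloud reveals that all points lie on just 15 planes in the unit cube.
```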
We don’t have time to go into the details and intricacies of PRNGs in this module and, as long as we’re confident that we’re using a PRNG which is good enough for our purposes, we needn’t worry too much about its precise inner workings. Thankfully, a great deal of time and energy has gone into developing and testing PRNGs, including the Mersenne Twister MT19937 (Matsumoto and Nishimura 1998) used by default in the current implementation of R (R Core Team 2013).
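In R, for example, we can check which generator is in use and, by seeding it, make a simulation reproducible:

```r
RNGkind()              # first element is "Mersenne-Twister" in a default R session
set.seed(42); a <- runif(3)
set.seed(42); b <- runif(3)
identical(a, b)        # TRUE: the same seed reproduces the same stream
```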
Parallel Computing and PRNGs
Parallel implementation of Monte Carlo algorithms requires access to parallel sources of random numbers. Of course, we really need to avoid simulating streams of random numbers in parallel with unintended relationships between the variables in the different streams. This is not at all trivial. Thankfully, some good solutions to the problem do exist—see Salmon et al. (2011) for an example.
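By way of illustration, base R's parallel package provides one such solution: the L'Ecuyer-CMRG generator, whose state can be advanced to give provably non-overlapping streams. A minimal sketch (using only base R and the bundled parallel package):

```r
## Independent streams for parallel workers via the L'Ecuyer-CMRG generator.
RNGkind("L'Ecuyer-CMRG")
set.seed(10)
s1 <- .Random.seed                   # state for worker 1
s2 <- parallel::nextRNGStream(s1)    # non-overlapping state for worker 2
.Random.seed <- s1; x1 <- runif(3)   # draws from stream 1
.Random.seed <- s2; x2 <- runif(3)   # draws from stream 2
```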
Quasi-Random Numbers
Quasi-random numbers, like pseudo-random numbers, are deterministic sequences of numbers, but that is essentially the limit of the similarity between the two: rather than mimicking the statistical properties of iid uniform draws, quasi-random number sequences (QRNS) are constructed so that their (maximum) discrepancy is small, i.e. so that they cover the space more evenly than random points typically would. See Morokoff and Caflisch (1995) for an introduction to the Quasi Monte Carlo technique based around such numbers; or Niederreiter (1992) for a book-length introduction.
Real Random Numbers
Although standard computers don’t have direct access to any mechanism for generating truly random numbers, dedicated hardware devices that provide such a generator do exist. There exist sequences of numbers obtained by transformations of physical noise sources; see https://www.random.org/ for example. Surprisingly, the benefits of using such numbers—rather than those obtained from a good PRNG—do not necessarily outweigh the disadvantages (greater difficulty in replicating the results; difficulties associated with characterising the distribution of the input noise and hence the output random variables…). We won’t discuss these sources any further in these notes.
Finite Precision
Although the focus of this section has been noting that the “random” numbers employed in computational algorithms are usually not truly random, it is also worth noting that they are also, typically, not really real numbers either. Most computations are performed using finite precision arithmetic. There is some discussion of simulation from the perspective of finite precision arithmetic as far back as Devroye (1986, chap. 15). Again, we won’t generally concern ourselves with this detail here; some of this was covered in the Statistical Computing module.
2.1.2 Transformation Methods
Having established that sources (of numbers which have similar properties to those of) random numbers uniformly distributed over the unit interval are available, we now consider how to transform such uniform random variables into random variables with other distributions. In principle, applying such transformations to realisations of \({\textsf{U}{\left[0,1\right]}}\) random variables will provide us with realisations of random variables of interest.
One of the simplest methods of generating random samples from a distribution with some cumulative distribution function (CDF) \(F(x)={\mathbb{P}}(X\leq x)\) is based on the inverse of that CDF. Although the CDF is, by definition, an increasing function, it is not necessarily strictly increasing or continuous and so may not be invertible. To address this we define the generalised inverse \(F^-(u):= \inf \{ x : \; F(x)\geq u\}\). (In some fields this is also called the quantile function.) Figure 2.1 illustrates its definition. If \(F\) is continuous and strictly increasing, then \(F^{-}(u)=F^{-1}(u)\). That is, the generalised inverse of a distribution function is a genuine generalisation of the inverse in that when \(F\) is invertible, \(F^{-}\) coincides with its inverse and when \(F\) is not invertible \(F^{-}\) is well defined nonetheless.
Theorem 2.1 (Inversion Method). Let \(U\sim {\textsf{U}{\left[0,1\right]}}\) and \(F\) be a CDF. Then \(F^-(U)\) has the CDF \(F\).
Proof. It is easy to see (e.g. in Figure 2.1) that \(F^{-}(u)\leq x\) is equivalent to \(u \leq F(x)\). Thus for \(U\sim {\textsf{U}{\left[0,1\right]}}\), \[{\mathbb{P}}(F^-(U)\leq x)={\mathbb{P}}(U\leq F(x))=F(x),\] thus \(F\) is the CDF of \(X=F^-(U)\).
Example 2.1 (Exponential Distribution). The exponential distribution with rate \(\lambda>0\) has the CDF \(F_{\lambda}(x)=1-\exp(-\lambda x)\) for \(x\geq 0\). Thus \(F^{-}_{\lambda}(u)=F^{-1}_{\lambda}(u)=-\log(1-u)/\lambda\), and we can generate random samples from \({\textsf{Exp}\left( \lambda \right)}\) by applying the transformation \(-\log(1-U)/\lambda\) to a uniform \({\textsf{U}{\left[0,1\right]}}\) random variable \(U\).
As \(U\) and \(1-U\), of course, have the same distribution we can instead use \(-\log(U)/\lambda\) to save a subtraction operation.
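In R, the inversion method for the exponential distribution amounts to a single line; a quick sketch, with a comparison against R's built-in sampler:

```r
## Inversion method for Exp(lambda): X = -log(U) / lambda with U ~ U[0,1].
set.seed(1)
lambda <- 2
u <- runif(10000)
x <- -log(u) / lambda     # same distribution as -log(1 - u) / lambda
## sanity check against R's built-in generator:
## qqplot(x, rexp(10000, rate = lambda)); abline(0, 1)
```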
When the generalised inverse of the CDF of a distribution of interest is available in closed form, the Inversion Method can be a very efficient tool for generating random numbers. However, very few distributions possess a CDF whose (generalised) inverse can be evaluated efficiently. Take, for example, the Normal distribution, whose CDF is not even available in closed form.
The generalised inverse of the CDF is just one possible transformation. Might there be other transformations that yield samples from the desired distribution? An example of such a method is the Box–Muller method for generating Normal random variables. Such specialised methods can be very efficient, but typically come at the cost of considerable case-specific implementation effort (aside from the difficulties associated with devising such methods in the first place).
Example 2.2 (Box–Muller Method for Normal Simulation). Using the transformation-of-density formula, one can show that \(X_1,X_2\overset{\text{iid}}{\sim}{\textsf{N}\left( 0,1 \right)}\) iff their polar coordinates \((R,\theta)\) with \[X_1=R \cdot \cos(\theta), \qquad X_2=R \cdot \sin(\theta),\] are independent, \(\theta\sim{\textsf{U}{\left[0,2\pi\right]}}\), and \(R^2\sim{\textsf{Exp}\left( 1/2 \right)}\) (Box and Muller 1958). Using \(U_1,U_2\overset{\text{iid}}{\sim}{\textsf{U}{\left[0,1\right]}}\) and Example 2.1 we can generate \(R\) and \(\theta\) by \[R=\sqrt{-2\log(U_1)},\qquad \theta=2\pi U_2,\] and thus \[X_1=\sqrt{-2\log(U_1)} \cdot \cos(2\pi U_2), \qquad X_2=\sqrt{-2\log(U_1)} \cdot \sin(2\pi U_2)\] are two independent realisations from a \({\textsf{N}\left( 0,1 \right)}\) distribution.
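A minimal R implementation of the Box–Muller method (an unoptimised sketch; the function name `box_muller` is ours):

```r
## Box-Muller: two iid N(0,1) samples from two iid U[0,1] samples.
box_muller <- function(n) {
  u1 <- runif(n)
  u2 <- runif(n)
  r <- sqrt(-2 * log(u1))     # R, with R^2 ~ Exp(1/2)
  theta <- 2 * pi * u2        # theta ~ U[0, 2*pi]
  cbind(x1 = r * cos(theta), x2 = r * sin(theta))
}
z <- box_muller(5000)         # each column is an iid N(0,1) sample
```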
The idea of transformation methods like the Inversion Method is to generate random samples from a distribution other than the target distribution and to transform them such that they come from the desired target distribution. Transformation methods such as those described here are typically extremely efficient, but it can be difficult to find simple transformations to produce samples from complicated distributions, especially in multivariate settings.
Many ingenious transformation schemes have been devised for specific classes of distributions (see Devroye (1986) for a good summary of these), but there are many interesting distributions for which no such transformation scheme has been devised. In these cases we have to proceed differently. One option is to sample from a distribution other than that of interest, in which case we have to find other ways of correcting for the fact that we sample from the “wrong” distribution. One method for doing exactly this is described in the next section; at the end of this chapter we see an alternative way of using samples from ‘proposal’ distributions to approximate integrals with respect to another distribution (Section 2.4.2).
2.1.3 Rejection Sampling
The basic idea of rejection sampling is to sample from a proposal distribution (sometimes referred to as an instrumental distribution) and to reject samples that are “unlikely” under the target distribution in a principled way. Assume that we want to sample from a target distribution whose density \(f\) is known to us. The simple idea underlying rejection sampling (and several other Monte Carlo algorithms) is the following rather trivial identity: \[f(x)=\int_0^{f(x)} 1\; du=\int_0^\infty \underbrace{\mathbb{I}_{[0,f(x)]}(u)}_{=:f(x,u)} du.\] Thus \(f(x) = \int_0^\infty f(x,u)\; du\) can be interpreted as the marginal density of a uniform distribution on the area under the density \(f(x)\), \(\{(x,u):\;0\leq u\leq f(x)\}.\) This equivalence is very important in simulation and has been referred to as the fundamental theorem of simulation. Figure 2.2 illustrates this idea.
This suggests that we can generate a sample from \(f\) by sampling from the area under the curve—but it doesn’t tell us how to sample uniformly from this area, which may be quite complicated (especially if we try to extend the idea to sampling from the distribution of a multivariate random variable).
Example 2.3 (Sampling from a Beta distribution). The \({\textsf{Beta}\left( a,b \right)}\) distribution (\(a,b > 0\)) has the density \[f(x)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1}(1-x)^{b-1},\qquad \textrm{for $0<x<1$},\] where \(\Gamma(a)=\int_0^{\infty}t^{a-1}\exp(-t)\;dt\) is the Gamma function. For \(a,b>1\) the \({\textsf{Beta}\left( a,b \right)}\) density is unimodal with mode \((a-1)/(a+b-2)\). Figure 2.2 shows the density of a \({\textsf{Beta}\left( 3,5 \right)}\) distribution. It attains its maximum of \(1680/729\approx 2.305\) at \(x=1/3\).
Using the above identity we can draw from \({\textsf{Beta}\left( 3,5 \right)}\) by drawing from a uniform distribution on the area under the density \(\{(x,u):\;0<u<f(x)\}\) (the area shaded in dark grey in Figure 2.2). In order to sample from the area under the density, we will use a similar trick to that used in the estimation of \(\pi\) in the preliminary material: we will sample from the light grey rectangle and keep only the samples that fall in the area under the curve. Figure 2.2 illustrates this idea.
Mathematically speaking, we sample independently \(X\sim {\textsf{U}{\left[0,1\right]}}\) and \(U\sim {\textsf{U}{\left[0,2.4\right]}}\). We keep the pair \((X,U)\) if \(U<f(X)\), otherwise we reject it. The conditional probability that a pair \((X,U)\) is kept if \(X=x\) is \[{\mathbb{P}}(U<f(X)|X=x)={\mathbb{P}}(U<f(x))=\frac{f(x)}{2.4}.\] As \(X\) and \(U\) were drawn independently, we can rewrite our algorithm as: Draw \(X\) from \({\textsf{U}{\left[0,1\right]}}\) and accept \(X\) with probability \(f(X)/2.4\), otherwise reject \(X\).
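In R this accept/reject scheme takes only a few lines (a sketch, using the bound 2.4 from above):

```r
## Rejection sampling from Beta(3,5) using the bounding box [0,1] x [0,2.4].
set.seed(1)
n <- 10000
x <- runif(n)                        # horizontal coordinate, X ~ U[0,1]
u <- runif(n, min = 0, max = 2.4)    # vertical coordinate, U ~ U[0,2.4]
accepted <- x[u < dbeta(x, 3, 5)]    # keep X whenever (X,U) lies under the density
length(accepted) / n                 # acceptance rate, about 1/2.4
```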
The method proposed in Example 2.3 is based on bounding the density of the Beta distribution by a box. Whilst this is a powerful idea, it cannot be applied directly to many other distributions, as many probability densities are either unbounded or have unbounded support (the whole real line, for example). However, we might be able to bound the density \(f(x)\) by \(M\cdot g(x)\), where \(g(x)\) is a density from which we can easily sample and \(M\) is a finite constant.
Algorithm 2.1 (Rejection sampling). Given two densities \(f\), \(g\), with \(f(x)\leq M\cdot g(x)\) for all \(x\), we can generate a sample from \(f\) as follows.
1. Draw \(X\sim g\).
2. Accept \(X\) as a sample from \(f\) with probability \[\frac{f(X)}{M\cdot g(X)},\] otherwise go back to step 1.
Proof. Denote by \(E\) the set of all possible values \(X\) can take (for our purposes it can be assumed to be some subset of \(\mathbb{R}^d\) but can, in principle, be a much more general space and \(f\) and \(g\) can be densities with respect to essentially any common reference measure). We have, for any (measurable) \(\mathcal{X} \subseteq E\), \[\begin{equation} {\mathbb{P}}(\textrm{$X \in \mathcal{X}$ and is accepted})=\int_{\mathcal{X}} g(x) \underbrace{\frac{f(x)}{M\cdot g(x)}}_{={\mathbb{P}}(\textrm{$X$ is accepted}|X=x)}\;dx=\frac{\int_{\mathcal{X}}f(x)\;dx}{M},\tag{2.1} \end{equation}\] and thus \[\begin{equation} {\mathbb{P}}(\textrm{$X$ is accepted})={\mathbb{P}}(\textrm{$X \in E$ and is accepted})= \frac{1}{M},\tag{2.2} \end{equation}\] yielding \[\begin{equation} \tag{2.3} {\mathbb{P}}(x \in \mathcal{X}|\textrm{$X$ is accepted})=\frac{{\mathbb{P}}(\textrm{$X \in \mathcal{X}$ and is accepted})}{{\mathbb{P}}(\textrm{$X$ is accepted})}=\frac{\int_{\mathcal{X}}f(x)\;dx / M}{1/M}=\int_{\mathcal{X}}f(x)\;dx. \end{equation}\] Thus the density of the values accepted by the algorithm is \(f\).
Remark. If we know \(f\) only up to a multiplicative constant, i.e. if we only know \(\bar{f}(x)\), where \(f(x)=C\cdot \bar{f}(x)\), we can carry out rejection sampling using \[\frac{\bar{f}(X)}{M\cdot g(X)}\] as the probability of accepting \(X\), provided \(\bar{f}(x)\leq M\cdot g(x)\) for all \(x\). Then by essentially the same argument as was used in (2.1)–(2.3) we have \[{\mathbb{P}}(\textrm{$X \in \mathcal{X}$ and is accepted})=\int_{\mathcal{X}} g(x) \frac{\bar{f}(x)}{M\cdot g(x)}\;dx= \frac{\int_{\mathcal{X}}\bar{f}(x)\;dx}{M}=\frac{\int_{\mathcal{X}}f(x)\;dx}{C\cdot M},\] \[{\mathbb{P}}(\textrm{$X$ is accepted})=1/(C\cdot M),\] and thus \[{\mathbb{P}}(x \in \mathcal{X}|\textrm{$X$ is accepted})=\frac{\int_{\mathcal{X}}f(x)\;dx / (C\cdot M)}{1/(C\cdot M)}=\int_{\mathcal{X}}f(x)\;dx.\]
Example 2.4 (Rejection sampling from the \({\textsf{N}\left( 0,1 \right)}\) distribution using a Cauchy proposal). Assume we want to sample from the \({\textsf{N}\left( 0,1 \right)}\) distribution with density \[f(x)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)\] using a Cauchy distribution with density \[g(x)=\frac{1}{\pi(1+x^2)}\] as proposal distribution. Of course, there is not much point in using this method in practice: the Box–Muller method is more efficient. The smallest \(M\) we can choose such that \(f(x)\leq M g(x)\) for all \(x\) is \(M=\sqrt{2\pi}\cdot\exp(-1/2)\).
Figure 2.3 illustrates the results. As before, filled circles correspond to accepted values whereas open circles correspond to rejected values.
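A sketch of Algorithm 2.1 in R for this example (the function name `rnorm_reject` is ours):

```r
## Rejection sampling: target N(0,1), proposal Cauchy, M = sqrt(2*pi) * exp(-1/2).
M <- sqrt(2 * pi) * exp(-1/2)
rnorm_reject <- function(n) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- rcauchy(n)                                  # step 1: draw from g
    keep <- runif(n) < dnorm(x) / (M * dcauchy(x))   # step 2: accept w.p. f/(M*g)
    out <- c(out, x[keep])
  }
  out[1:n]
}
set.seed(1)
z <- rnorm_reject(1000)   # overall acceptance probability is 1/M, about 0.66
```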
Note that it is impossible to do rejection sampling the other way round: sampling from a Cauchy distribution using a \({\textsf{N}\left( 0,1 \right)}\) distribution as proposal distribution. There is no \(M\in \mathbb{R}\) such that \[\frac{1}{\pi(1+x^2)} \leq M \cdot \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right);\] the Cauchy distribution has heavier tails than the Normal distribution. This illustrates the general principle that rejection sampling requires a proposal with heavier tails than the target distribution, and the still more general principle that tail behaviour is often critical to the performance (or even correctness) of Monte Carlo algorithms.
2.2 Monte Carlo Testing
One of the simplest forms of simulation-based inference goes under the name of Monte Carlo Testing or, sometimes, randomized testing. The idea is appealingly simple and rather widely applicable.
Recall the basic idea of testing (or null hypothesis significance testing to be a little more precise). Given a null hypothesis about the data, compute some test statistic (i.e. a real valued summary function of the observed data) whose distribution is known under the null hypothesis and would be expected to deviate systematically from this under the alternative hypothesis. If a value of the test statistic shows a deviation which would be expected no more than \(\alpha\%\) of the time when the null hypothesis is true (but which is expected to be more common under the alternative hypothesis), then one concludes that there is evidence which justifies rejecting the null hypothesis at the \(\alpha\%\) level.
In principle this is reasonably straightforward, but there are often practical difficulties with following such a procedure. In particular, what if we do not know the distribution of the test statistic under the null hypothesis? The classical solution is to appeal to asymptotic theory for large samples to characterise the distribution of this statistic. This has two drawbacks: it is only approximately correct for finite samples, and it can be extremely difficult to do.
One simple solution which seems to have been first suggested formally by Barnard (1963) is to use simulation. This approach has taken a little while to gain popularity, despite some far-sighted early work (Besag and Diggle 1977 for example), perhaps in part because of limited computational resources and in part because of perceived difficulties with replication.
If \(T\) denotes the test statistic obtained from the actual data and \(T_1,T_2,\dots\) denote those obtained from repeated sampling under the null hypothesis, then, if the null hypothesis is true, \((T_1,\dots,T_k,T)\) comprises a collection of \(k+1\) iid replicates of the test statistic. Assuming that the distribution of the test statistic is continuous (so that ties occur with probability zero), the probability that \(T\) is the largest of \((T_1,\dots,T_k,T)\) is exactly \(1/(k+1)\) (by symmetry) and the probability that \(T\) is among the largest \(l\) of \((T_1,\dots,T_k,T)\) is, similarly, \(l / (k+1)\).
By this reasoning, we can construct a hypothesis test at the 5% significance level by drawing \(k=19\) realisations of the test statistic and rejecting the null hypothesis if and only if \(T\) is greater than all of those synthetic replicates. This test is exact: the probability of rejection when the null hypothesis is true is exactly as specified. However, there is a loss of power as a result of the randomization, and there is no guarantee that two people presented with the same data will reach the same conclusion (if they simulate different artificial replicates then one may reject and the other may not). For a “large enough” value of \(k\), however, these departures from the exact idealised test which this Monte Carlo procedure mimics are very small.
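As a concrete sketch, suppose (purely for illustration) that we wish to test whether data are \({\textsf{N}\left( 0,1 \right)}\) distributed using the sample kurtosis, whose null distribution we may not care to derive analytically:

```r
## Monte Carlo test at the 5% level with k = 19 null replicates.
kurt <- function(x) mean((x - mean(x))^4) / sd(x)^4   # test statistic
set.seed(1)
x <- rnorm(50)                 # the "observed" data (simulated here)
T_obs <- kurt(x)
k <- 19
T_null <- replicate(k, kurt(rnorm(length(x))))   # replicates under the null
T_obs > max(T_null)            # TRUE => reject H0 (one-sided, exact 5% level)
```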
Although this idea might seem slightly arcane and removed from the other ideas we’ve been discussing in this section, it really is motivated by the same ideas. The empirical distribution of the artificial sample of test statistics converges to the true sampling distribution as the sample size becomes large, and we’re then just using the empirical quantiles as a proxy for the quantiles of the true distribution. With a little bit of care, as seen here, this can be done in such a way that the false positive (“type I”, if you insist) error probability is exactly that specified by the level of the test.
2.3 The Bootstrap
The bootstrap is based around a similar idea: if we want to characterise the distribution of an estimator then one option would be to simulate many replicates of it, and to use the resulting empirical distribution function as a proxy for the actual distribution function of the estimator. However, we don’t typically know the distribution of the estimator2.
If we knew that our dataset comprised realisations from some specific, known, distribution then it would be straightforward to simulate the distribution of a statistic by generating a large number of synthetic data sets and computing the statistic associated with each synthetic data set. In practice, however, we generally don’t even know that distribution (after all, if we did there wouldn’t be much statistics left to do…).
The idea behind the bootstrap is that the empirical distribution of a large (simple random) sample from some distribution is typically very close to the distribution itself (in various senses which we won’t make precise here). In order to exploit this, we draw many replicates of the entire data set by sampling with replacement from that data set (i.e. by sampling from the associated empirical distribution) to obtain so-called bootstrap replicates. The statistic is then calculated for each of these replicates, and the resulting empirical distribution of the resulting statistic values, which we will term the bootstrap distribution, is used as a proxy for the true sampling distribution of that statistic.
If we’re interested in some particular property of the sampling distribution of the statistic (such as its variance, which might be useful in the construction of an approximate confidence interval), then we can simply estimate that property under the true sampling distribution by the corresponding property under the bootstrap distribution, and use simulation to obtain the latter.
A little more precisely, let \(T = h(X_1,\dots,X_n)\) denote a quantity calculated as a function of the original simple random sample of size \(n\), \(X_1,\dots,X_n\) (i.e. \(T\) is a statistic calculated as a function \(h\) of some actually observed data). In order to approximate the sampling distribution of \(T\) we do the following:
1. Obtain bootstrap samples: for \(i=1,\dots,B\), draw \(X_{i,1}^\star,\dots,X_{i,n}^\star\) by sampling \(n\) times with replacement from \(X_1,\dots,X_n\).
2. Compute summaries: for \(i=1,\dots,B\), set \(T_i^\star = h(X_{i,1}^\star,\dots,X_{i,n}^\star)\).
3. Compute the empirical distribution, \(f_T^\star\), of \(T_1^\star,\dots,T_B^\star\).
4. Compute the approximations of interest: e.g. the sampling variance of \(T\) is \({\mathbb{V}\textrm{ar}_{f_T}\left[T\right]}\), which we approximate with \[{\mathbb{V}\textrm{ar}_{f_T^\star}\left[T\right]} = \frac{1}{B} \sum_{i=1}^B \left(T_i^\star\right)^2 - \left[ \frac{1}{B} \sum_{i=1}^B T_i^\star \right]^2,\] which is none other than the sample variance of the statistic over the bootstrap sample.
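A minimal R sketch of this scheme, taking (as an arbitrary illustration) the sample median of some exponential data as the statistic \(T\):

```r
## Bootstrap approximation of the sampling variance of the median.
set.seed(1)
x <- rexp(100)             # the observed simple random sample
B <- 1000
T_star <- replicate(B, median(sample(x, replace = TRUE)))  # bootstrap replicates
mean(T_star^2) - mean(T_star)^2   # bootstrap variance, as in the display above
```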
2.3.1 Bootstrap Confidence Intervals
One major use of the bootstrap is in the construction of (approximate) confidence intervals for statistics for which it might be difficult to construct exact confidence intervals.
2.3.1.1 Asymptotic Approach
We saw in the previous section that we can obtain approximations of the variance of an estimator using bootstrap techniques. The simplest method for constructing an approximate confidence interval using the bootstrap is to use such a variance estimate together with an assumption of approximate (or asymptotic) normality in order to arrive at an approximate (or asymptotic) confidence interval.
Taking this approach, we would arrive at an interval with endpoints of \[T_n(X_1,\dots,X_n) \pm z_{\alpha/2} \sqrt{{\mathbb{V}\textrm{ar}_{{f_{T}}^\star}\left[T_n\right]}},\] where \(z_\alpha\) denotes the level \(\alpha\) critical point of the standard normal distribution.
Although this approach may seem appealing in its simplicity, it can be expected to perform well only when the sampling distribution of the summary statistic is approximately normal. Imposing this additional assumption rather defeats the object of using bootstrap methods rather than employing simpler approximations directly.
2.3.1.2 Bootstrap Percentile Intervals
The next level of sophistication is to use the empirical distribution of the bootstrap realisations as a proxy for the sampling distribution of the statistic of interest. We arrive directly at an approximate confidence interval of the form \([t^\star_{1-\alpha/2},t^\star_{\alpha/2}]\) where \(t_\alpha^\star\) denotes the level \(\alpha\) critical value of the bootstrap distribution.
Again this is a nice simple approach, but it does depend rather strongly on the quality of the approximation of the sampling distribution of \(T\) by the bootstrap distribution, and this is determined by the original sample size, amongst other factors.
2.3.1.3 Approximate Pivotal Quantity Approach
Another common approach to the problem has better asymptotic properties than that of the previous section and should generally be preferred in practice.
Assume that \(T\) is an estimator of some real population parameter, \(\theta\), and that the quantity \(R = T - \theta\) is pivotal. (Recall that if \(X\) is drawn from a distribution parametrised by \(\theta\) then \(R = R(X,\theta)\) is pivotal for \(\theta\) if its distribution is independent of \(\theta\).) As before, assume that we are able to obtain a large number of bootstrap replicates of \(T\), denoted \(T_1^\star,\dots,T_B^\star\).
Let \(F_R\) denote the distribution function of \(R\), so that: \(F_R(r) := {\mathbb{P}}(R \leq r)\). It’s a matter of straightforward algebra to establish that: \[\begin{aligned} {\mathbb{P}}(L \leq \theta \leq U) = {\mathbb{P}}(L - T \leq \theta - T \leq U - T) = {\mathbb{P}}(T - U \leq R \leq T-L)\\ = F_R(T-L) - F_R(T-U).\end{aligned}\] If we seek a confidence interval at level \(\alpha\) it would be natural, therefore, to insist that \(T-L = F_R^{-1}(1-\alpha/2)\) and that \(T-U = F_R^{-1}(\alpha/2)\) (assuming that \(F_R\) is invertible, of course).
Defining \(L\) and \(U\) in this way, we arrive at coverage of \(1 - \alpha\) for the interval \([L,U]\) with \(L = T - F_R^{-1}(1-\alpha/2)\) and \(U = T - F_R^{-1}(\alpha /2)\). Unfortunately, we can’t use this interval directly because we don’t know \(F_R\) and we certainly don’t know \(F_R^{-1}\).
This is where we invoke the usual bootstrap argument. If we are able to assume that the bootstrap replicates are to \(T\) as \(T\) is to \(\theta\) then we can obtain a collection of bootstrap replicates of the pivotal quantity which we may define as: \(R^\star_i = T^\star_i - T\). We can then go on to define the associated empirical distribution function and, more importantly, we can obtain the quantiles of this distribution. Letting \(r^\star_\alpha\) denote the level \(\alpha\) quantile of our bootstrap distribution, we obtain a bootstrap pivotal confidence interval of the form \([L^\star,U^\star]\) with: \[\begin{aligned} L^\star &= T - r^\star_{1-\alpha/2}, & U^\star &= T - r^\star_{\alpha/2}.\end{aligned}\]
Such confidence intervals can be shown to be asymptotically correct under fairly weak regularity conditions.
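The following R sketch, which assumes for illustration that \(T\) is the sample mean, places the three interval constructions of this section side by side:

```r
## Bootstrap confidence intervals at level alpha: normal, percentile, pivotal.
set.seed(1)
x <- rexp(100); alpha <- 0.05
T_hat <- mean(x)
T_star <- replicate(2000, mean(sample(x, replace = TRUE)))
## asymptotic-normality interval (Section 2.3.1.1)
ci_normal <- T_hat + c(-1, 1) * qnorm(1 - alpha/2) * sd(T_star)
## percentile interval (Section 2.3.1.2)
ci_percentile <- unname(quantile(T_star, c(alpha/2, 1 - alpha/2)))
## pivotal interval (this section), using R* = T* - T
r_star <- unname(quantile(T_star - T_hat, c(alpha/2, 1 - alpha/2)))
ci_pivotal <- c(T_hat - r_star[2], T_hat - r_star[1])
```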
Remark. Although this approach may seem somewhat more complicated and less transparent than the methods discussed previously, it can be seen that the rate of convergence of bootstrap approximations of pivotal quantities can be \(\mathcal{O}(1/n)\), in contrast to the \(\mathcal{O}(1/\sqrt{n})\) obtained by appeals to asymptotic normality or the use of bootstrap approximations of non-pivotal quantities. See Young (1994) for a concise argument based around Edgeworth expansions for an illustration of statistical asymptotics in practice.
A more extensive theoretical consideration of these, and several other, approaches to the construction of bootstrap confidence intervals is provided by Hall (1986). A readable survey of developments in bootstrap methodology was provided by Davison, Hinkley, and Young (2003).
2.4 Monte Carlo Integration
Perhaps the most common application of simulation-based inference is the approximation of (intractable) integrals. We’ll consider the approximation of expectations of the form \(I_h = {\mathbb{E}_{f}\left[h\right]} = \int h(x) f(x)dx\), noting that more general integrals can always be written in this form by decomposing the integrand as the product of a probability density and whatever remains.
2.4.1 Simple / Perfect / Naïve Monte Carlo
The canonical approach to Monte Carlo estimation of \(I_h\) is to draw \(X_1,\dots,X_n \overset{\text{iid}}{\sim}f\) and to employ the estimator: \[\hat{I}_{h}^n = \frac{1}{n} \sum_{i=1}^n h(X_i).\] The strong law of large numbers tells us (provided that \(I_h\) exists) that \(\lim_{n\rightarrow\infty} \hat{I}_{h}^n = I_h\) almost surely. Furthermore, provided that \({\mathbb{V}\textrm{ar}_{f}\left[h\right]} = \sigma^2 < \infty\), the Central Limit Theorem can be invoked to tell us that: \[\sqrt{n} [\hat{I}_{h}^n - I_h] \overset{\mathcal{D}}{\rightarrow}{\textsf{N}\left( 0,\sigma^2 \right)},\] as \(n\to\infty\), providing a rate of convergence.
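For example, a few lines of R implement perfect Monte Carlo for \(h(x)=x^2\) with \(X\sim{\textsf{N}\left( 0,1 \right)}\) (so that \(I_h = 1\)), together with the CLT-based assessment of the error:

```r
## Perfect Monte Carlo estimate of E[X^2], X ~ N(0,1), with a 95% interval.
set.seed(1)
n <- 1e4
hx <- rnorm(n)^2
I_hat <- mean(hx)
se <- sd(hx) / sqrt(n)                    # plug-in estimate of sigma / sqrt(n)
I_hat + c(-1, 1) * qnorm(0.975) * se      # CLT-based 95% confidence interval
```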
If this were all there was to say about the topic then this could be a very short module. In fact, there are two reasons that we must go beyond this perfect Monte Carlo approach:
1. Often, if we wish to evaluate expectations with respect to some distribution \(\pi\) then we can’t just simulate directly from \(\pi\). Indeed, this is likely to be the case if we require sophisticated simulation-based methods to approximate the integral.
2. Even if we can sample from \(\pi\), in some situations we obtain better estimates of \(I_h\) if we instead sample from another carefully selected distribution and correct for the discrepancy.
In Section 2.1 we saw some algorithms for sampling from some distributions; in Chapter 3 we will see another technique which will allow us to work with more challenging distributions. But first we turn our attention to a technique which can be used both to allow us to employ samples from distributions simpler than \(\pi\) to approximate \(I_h\), and to provide better estimators than the perfect Monte Carlo approach if we are able to sample from a distribution tuned to both \(f\) and \(h\).
2.4.2 Importance Sampling
In rejection sampling we compensated for the fact that we sampled from the proposal distribution \(g(x)\) instead of \(f(x)\) by rejecting some of the proposed values. Importance sampling is based on the idea of instead using weights to correct for the fact that we sample from the proposal distribution \(g(x)\) instead of the target distribution \(f(x)\).
Indeed, importance sampling is based on the elementary identity \[\begin{equation} {\mathbb{P}}(X\in\mathcal{X})=\int_{\mathcal{X}} f(x)\; dx = \int_{\mathcal{X}} g(x)\underbrace{\frac{f(x)}{g(x)}}_{=:w(x)} \;dx=\int_{\mathcal{X}}g(x)w(x)\;dx\tag{2.4} \end{equation}\] for all measurable \(\mathcal{X} \subseteq E\) and \(g(\cdot)\), such that \(g(x)>0\) for (almost) all \(x\) with \(f(x)>0\). We can generalise this identity by considering the expectation \({\mathbb{E}_{f}\left[h(X)\right]}\) of a measurable function \(h\): \[\begin{align} {\mathbb{E}_{f}\left[h(X)\right]}&=\int f(x) h(x)\; dx \notag\\ &= \int g(x) \underbrace{\frac{f(x)}{g(x)}}_{=:w(x)} h(x) \;dx = \int g(x) w(x) h(x) \;dx={\mathbb{E}_{g}\left[w(X)\cdot h(X)\right]}, \tag{2.5} \end{align}\] if \(g(x)>0\) for (almost) all \(x\) with \(f(x)\cdot h(x)\neq 0\).
Assume we have a sample \(X_1,\dots,X_n\sim g\). Then, provided \({\mathbb{E}_{g}\left[|w(X)\cdot h(X)|\right]}\) exists, \[\frac{1}{n}\sum_{i=1}^n w(X_i) h(X_i) \stackrel{a.s.}{\overset{n\rightarrow \infty}{\longrightarrow}} {\mathbb{E}_{g}\left[w(X)\cdot h(X)\right]},\] and thus by (2.5) \[\frac{1}{n}\sum_{i=1}^n w(X_i) h(X_i) \stackrel{a.s.}{\overset{n\rightarrow \infty}{\longrightarrow}} {\mathbb{E}_{f}\left[h(X)\right]}.\] In other words, we can estimate \(\mu:={\mathbb{E}_{f}\left[h(X)\right]}\) by using \[\tilde\mu:=\frac{1}{n}\sum_{i=1}^n w(X_i) h(X_i).\]
Note that whilst \({\mathbb{E}_{g}\left[w(X)\right]}=\int_{E} \frac{f(x)}{g(x)} g(x)\;dx=\int_{E} f(x)\;dx=1\), the weights \(w(X_1),\dots,w(X_n)\) do not necessarily sum to \(n\), so one might want to consider the self-normalised version \[\hat\mu:=\frac{1}{\sum_{i=1}^nw(X_i)}\sum_{i=1}^n w(X_i) h(X_i).\]
This gives rise to the following algorithm:
Algorithm 2.2 (Importance Sampling). Choose \(g\) such that \(\textrm{supp}(g)\supseteq \textrm{supp}(f\cdot h)\).
For \(i=1,\dots,n\):
Generate \(X_i\sim g\).
Set \(w(X_i)=\frac{f(X_i)}{g(X_i)}\).
Return either \[\hat \mu=\frac{\sum_{i=1}^n w(X_i) h(X_i)}{\sum_{i=1}^n w(X_i)}\] or \[\tilde \mu=\frac{\sum_{i=1}^n w(X_i) h(X_i)}{n}.\]
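A direct R sketch of Algorithm 2.2 (the helper `importance_sample` and the particular target/proposal pair are ours, purely for illustration):

```r
## Importance sampling: draw from g, weight by f/g, return both estimators.
importance_sample <- function(n, rg, f, g, h) {
  x <- rg(n)                # step 1: X_i ~ g
  w <- f(x) / g(x)          # step 2: importance weights
  c(mu_tilde = mean(w * h(x)),            # unnormalised estimator
    mu_hat   = sum(w * h(x)) / sum(w))    # self-normalised estimator
}
## e.g. E[X^2] = 1 for X ~ N(0,1), using a heavier-tailed t_3 proposal:
set.seed(1)
importance_sample(1e5, rg = function(n) rt(n, df = 3),
                  f = dnorm, g = function(x) dt(x, df = 3),
                  h = function(x) x^2)
```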
The following theorem gives the bias and the variance of importance sampling.
Theorem 2.2 (Bias and Variance of Importance Sampling).
(a) \(\displaystyle {\mathbb{E}_{g}\left[\tilde\mu\right]} = \mu\),
(b) \(\displaystyle {\mathbb{V}\textrm{ar}_{g}\left[\tilde \mu\right]} = \frac{{\mathbb{V}\textrm{ar}_{g}\left[w(X)\cdot h(X)\right]}}{n}\),
(c) \(\displaystyle {\mathbb{E}_{g}\left[\hat \mu\right]} = \mu + \frac{\mu{\mathbb{V}\textrm{ar}_{g}\left[w(X)\right]}-{\mathbb{C}\textrm{ov}_{g}\left[w(X), w(X)\cdot h(X)\right]}}{n} + \mathcal{O}(n^{-2})\),
(d) \(\displaystyle {\mathbb{V}\textrm{ar}_{g}\left[\hat \mu\right]} = \frac{{\mathbb{V}\textrm{ar}_{g}\left[w(X)\cdot h(X)\right]} -2 \mu {\mathbb{C}\textrm{ov}_{g}\left[w(X), w(X)\cdot h(X)\right]} + \mu^2{\mathbb{V}\textrm{ar}_{g}\left[w(X)\right]}}{n} + \mathcal{O}(n^{-2})\).
Proof.
(a) \(\displaystyle {\mathbb{E}_{g}\left[\frac{1}{n}\sum_{i=1}^n w(X_i) h(X_i)\right]}=\frac{1}{n}\sum_{i=1}^n {\mathbb{E}_{g}\left[w(X_i) h(X_i)\right]} = {\mathbb{E}_{f}\left[h(X)\right]}\), using (2.5).
(b) \(\displaystyle {\mathbb{V}\textrm{ar}_{g}\left[\frac{1}{n}\sum_{i=1}^n w(X_i) h(X_i)\right]}=\frac{1}{n^2}\sum_{i=1}^n {\mathbb{V}\textrm{ar}_{g}\left[w(X_i) h(X_i)\right]} = \frac{{\mathbb{V}\textrm{ar}_{g}\left[w(X)h(X)\right]}}{n}\), using the independence of the \(X_i\).
For (c) and (d) see Liu (2001, p35).
Note that the theorem implies that, in contrast to \(\tilde \mu\), the self-normalised estimator \(\hat \mu\) is biased. The self-normalised estimator \(\hat \mu\), however, might have a lower variance. In addition, it has another advantage: we only need to know the density up to a multiplicative constant, as is often the case in Bayesian modelling, for example. Assume \(f(x)=C\cdot \bar{f}(x)\); then \[\hat \mu=\frac{\sum_{i=1}^n w(X_i) h(X_i)}{\sum_{i=1}^n w(X_i)}=\frac{\sum_{i=1}^n \frac{f(X_i)}{g(X_i)} h(X_i)}{\sum_{i=1}^n \frac{f(X_i)}{g(X_i)}} =\frac{\sum_{i=1}^n \frac{C\cdot \bar{f}(X_i)}{g(X_i)} h(X_i)}{\sum_{i=1}^n \frac{C\cdot \bar{f}(X_i)}{g(X_i)}} = \frac{\sum_{i=1}^n \frac{\bar{f}(X_i)}{g(X_i)} h(X_i)}{\sum_{i=1}^n \frac{\bar{f}(X_i)}{g(X_i)}},\] i.e. the self-normalised estimator \(\hat \mu\) does not depend on the normalisation constant \(C\). By a closely analogous argument, one can show that it is also enough to know \(g\) only up to a multiplicative constant. On the other hand, as demonstrated by the proof of Theorem 2.2, it is a lot harder to analyse the theoretical properties of the self-normalised estimator \(\hat \mu\).
Although the above equations (2.4) and (2.5) hold for every \(g\) with \(\textrm{supp}(g)\supseteq \textrm{supp}(f\cdot h)\) and the importance sampling algorithm converges for a large choice of such \(g\), one typically only considers choices of \(g\) that lead to finite variance estimators. The following two conditions are each sufficient (albeit rather restrictive; see Geweke (1989) for some other possibilities) to ensure that \(\tilde \mu\) has finite variance:
\(f(x) \leq M \cdot g(x)\) and \({\mathbb{V}\textrm{ar}_{f}\left[h(X)\right]}<\infty\).
\(E\) is compact, \(f\) is bounded above on \(E\), and \(g\) is bounded below on \(E\).
So far we have only studied whether a given \(g\) is an appropriate proposal distribution, i.e. whether the variance of the estimator \(\tilde \mu\) (or \(\hat \mu\)) is finite. This leads to the question of which proposal distribution is optimal, i.e. for which choice of \(g\) the variance \({\mathbb{V}\textrm{ar}_{g}\left[\tilde\mu\right]}\) is minimal. The following theorem, variants of which date back at least to Goertzel (1949), answers this question:
Theorem 2.3 (Optimal proposal). The proposal distribution \(g\) that minimises the variance of \(\tilde \mu\) is \[g^*(x)=\frac{|h(x)|f(x)}{\int_{E}|h(t)|f(t)\;dt}.\]
Proof. We have from Theorem 2.2 (b) that \[\begin{align*} n\cdot {\mathbb{V}\textrm{ar}_{g}\left[\tilde\mu\right]}&={\mathbb{V}\textrm{ar}_{g}\left[w(X)\cdot h(X)\right]}\\ &={\mathbb{V}\textrm{ar}_{g}\left[\frac{h(X)\cdot f(X)}{g(X)}\right]}={\mathbb{E}_{g}\left[\left(\frac{h(X)\cdot f(X)}{g(X)}\right)^2\right]}-\Bigg(\underbrace{{\mathbb{E}_{g}\left[\frac{h(X)\cdot f(X)}{g(X)}\right]}}_{={\mathbb{E}_{g}\left[\tilde \mu\right]}=\mu}\Bigg)^2. \end{align*}\] The second term is independent of the choice of proposal distribution, thus we need minimise only \({\mathbb{E}_{g}\left[\left(\frac{h(X)\cdot f(X)}{g(X)}\right)^2\right]}\). Substituting \(g^{\star}\) into this expression we obtain: \[\begin{aligned} {\mathbb{E}_{g^{\star}}\left[\left(\frac{h(X)\cdot f(X)}{g^{\star}(X)}\right)^2\right]}=&\int_{E} \frac{h(x)^2\cdot f(x)^2}{g^{\star}(x)}\;dx = \left(\int_{E} \frac{h(x)^2\cdot f(x)^2}{|h(x)|f(x)}\;dx\right) \cdot \left(\int_{E}|h(t)|f(t)\;dt\right) \\=&\left(\int_{E}|h(x)|f(x)\;dx\right)^2\end{aligned}\] On the other hand, we can apply Jensen’s inequality to \({\mathbb{E}_{g}\left[\left(\frac{h(X)\cdot f(X)}{g(X)}\right)^2\right]}\), yielding \[{\mathbb{E}_{g}\left[\left(\frac{h(X)\cdot f(X)}{g(X)}\right)^2\right]}\geq \left({\mathbb{E}_{g}\left[\frac{|h(X)|\cdot f(X)}{g(X)}\right]}\right)^2 = \left(\int_{E}|h(x)|f(x)\;dx\right)^2\] i.e. the estimator obtained by using an importance sampler employing proposal distribution \(g^\star\) attains the minimal possible variance amongst the class of importance sampling estimators.
An important corollary of Theorem 2.3 is that importance sampling can be super-efficient: when using the optimal \(g^{\star}\) from Theorem 2.3 the variance of \(\tilde \mu\) is less than the variance obtained when sampling directly from \(f\): \[\begin{aligned} n\cdot {\mathbb{V}\textrm{ar}_{f}\left[\frac{h(X_1)+\dots+h(X_n)}{n}\right]} &={\mathbb{E}_{f}\left[h(X)^2\right]} - \mu^2 \\ &\geq \left({\mathbb{E}_{f}\left[|h(X)|\right]}\right)^2 - \mu^2\\ &= \left(\int_{E}|h(x)|f(x)\;dx\right)^2 -\mu^2=n\cdot {\mathbb{V}\textrm{ar}_{g^{\star}}\left[\tilde \mu\right]}\end{aligned}\] where the inequality follows from Jensen’s inequality. Unless \(h\) is (almost surely) constant, the inequality is strict. There is an intuitive explanation for the super-efficiency of importance sampling: using \(g^{\star}\) instead of \(f\) concentrates samples in regions where both the density \(f\) and the magnitude \(|h|\) are large, which are precisely the regions contributing most to the integral \({\mathbb{E}_{f}\left[h(X)\right]}\).
Theorem 2.3 is, however, a rather formal optimality result. When using \(\tilde \mu\) we need to know the normalisation constant of \(g^{\star}\), which, if \(h\) is everywhere positive, is exactly the integral we are attempting to approximate—and it is likely to be equally difficult to evaluate even when that is not the case! Furthermore, we need to be able to draw samples from \(g^{\star}\) efficiently. The practically important implication of Theorem 2.3 is that we should choose an instrumental distribution \(g\) whose shape is close to that of \(f\cdot |h|\).
Example 2.5 (Computing \({\mathbb{E}_{f}\left[|X|\right]}\) for \(X\sim {\textsf{t}_{3}}\)). Assume we want to compute \({\mathbb{E}_{f}\left[|X|\right]}\) for \(X\) from a \({\textsf{t}}\)-distribution with 3 degrees of freedom (\({\textsf{t}_{3}}\)) using a Monte Carlo method. Consider three different schemes (a small simulation sketch follows the list below).
Sampling \(X_1,\dots,X_n\) directly from \({\textsf{t}_{3}}\) and estimating \({\mathbb{E}_{f}\left[|X|\right]}\) by \[\frac{1}{n}\sum_{i=1}^{n}|X_i|.\]
Alternatively we could use importance sampling using a \({\textsf{t}_{1}}\) (which is nothing other than a Cauchy distribution) as proposal distribution. The idea behind this choice is that the density \(g_{{\textsf{t}_{1}}}(x)\) of a \({\textsf{t}_{1}}\) distribution is closer to \(f(x)|x|\), where \(f(x)\) is the density of a \({\textsf{t}_{3}}\) distribution, as Figure 2.4 shows.
Third, we will consider importance sampling using a \({\textsf{N}\left( 0,1 \right)}\) distribution as proposal distribution.
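A minimal R sketch of the three schemes (the true value is \({\mathbb{E}_{f}\left[|X|\right]}=2\sqrt{3}/\pi\approx 1.103\)):

```r
## Estimating E|X| for X ~ t_3 by direct sampling and by importance sampling.
set.seed(1)
n <- 1500
x1 <- rt(n, df = 3)                                   # scheme 1: direct
mu_direct <- mean(abs(x1))
x2 <- rcauchy(n)                                      # scheme 2: t_1 proposal
mu_t1     <- mean(dt(x2, 3) / dcauchy(x2) * abs(x2))
x3 <- rnorm(n)                                        # scheme 3: N(0,1) proposal
mu_norm   <- mean(dt(x3, 3) / dnorm(x3) * abs(x3))    # infinite-variance weights
c(mu_direct, mu_t1, mu_norm)
```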
Note that the third choice yields weights of infinite variance, as the proposal distribution (\({\textsf{N}\left( 0,1 \right)}\)) has lighter tails than the distribution we want to sample from (\({\textsf{t}_{3}}\)). The right-hand panel of Figure 2.5 illustrates that this choice yields a very poor estimate of the integral \(\int |x|f(x)\;dx\). Sampling directly from the \({\textsf{t}_{3}}\) distribution can be seen as importance sampling with all weights \(w_i\equiv 1\); this choice clearly minimises the variance of the weights. However, minimising the variance of the weights does not imply that the resulting estimate of the integral \(\int |x|f(x)\;dx\) has minimal variance. Indeed, after 1500 iterations the empirical standard deviation (over 100 realisations) of the direct estimate is \(0.0345\), which is larger than the empirical standard deviation of \(\tilde\mu\) when using a \({\textsf{t}_{1}}\) distribution as proposal distribution, which is \(0.0182\). This suggests that using a \({\textsf{t}_{1}}\) distribution as proposal distribution is super-efficient (see Figure 2.5), although we should always be careful when assuming that empirical standard deviations are a good approximation of the true standard deviation.
Figure 2.6 goes some way towards explaining why the \({\textsf{t}_{1}}\) distribution is a far better choice than the \({\textsf{N}\left( 0,1 \right)}\) distribution. As the \({\textsf{N}\left( 0,1 \right)}\) distribution does not have heavy enough tails, the weight tends to infinity as \(|x| \rightarrow +\infty\). Thus large values of \(|x|\) can receive very large weights, causing the jumps in the estimate \(\tilde\mu\) shown in Figure 2.5. The \({\textsf{t}_{1}}\) distribution has heavy enough tails to ensure that the weights are small for large values of \(|x|\), explaining the small variance of the estimate \(\tilde \mu\) when using a \({\textsf{t}_{1}}\) distribution as proposal distribution.
References
This material is included for completeness, but isn’t actually covered in this module; a good summary was provided in the Statistical Computing module.↩︎
Actually, an algorithm known as the parametric bootstrap does consider exactly this case, essentially resulting in an importance sampling estimate.↩︎