1 Introduction

These notes were intended to supplement the Computer Intensive Statistics lectures and laboratory sessions rather than to replace or directly accompany them. As such, material is presented here in an order which is logical for reference purposes after the week and not precisely the order in which it was to be discussed during the week. There is much more information in these notes concerning some topics than would have been covered during the week itself. One of their main functions is to provide pointers to the relevant literature for anyone wanting to learn more about these topics.


These notes are based on a set developed by Adam Johansen, which can be traced back to a lecture course given by him and Ludger Evers in Bristol in 2007–08. Many of the better figures were originally prepared by Ludger Evers, and more recent contributions were made by the previous APTS module leader, Paul Jenkins. Any errors are my own and reports of them should be directed to .

1.1 Three Views of Sample Approximation

Many of the techniques described in these notes are simulation-based or otherwise make use of sample approximations of quantities. In the preliminary notes there is some discussion of the approximation of \(\pi\), for example, by representing it in terms of an expectation that can be approximated using a sample average.

In general there are three increasingly abstract ways of viewing the justification of this type of approach. Thinking in these terms can be very helpful when trying to understand what these techniques are aiming to do and why we might expect them to work, and so it’s worth thinking about this even before getting into the details of particular algorithms.

1.1.1 Direct Approximation

This is a definition of Monte Carlo methods due to Halton (1970):

Representing the solution of a problem as a parameter of a hypothetical population, and using a random sequence of numbers to construct a sample of the population, from which statistical estimates of the parameter can be obtained.

Recalling the approximation of \(\pi\) in the preliminary material, we constructed a simple random sample from a population described by a Bernoulli distribution with parameter \(\pi/4\) and then used a simple estimate of the population parameter as an estimate of \(\pi/4\).

Although in one sense this is the simplest view of a simulation-based approach to inference, it requires a specific construction for each problem we want to address.

1.1.2 Approximation of Integrals

The next level of indirection is to view Monte Carlo methods as algorithms for approximation of integrals. The quantity we wish to estimate is written as an expectation with respect to a probability distribution and a large sample from that population is then used to approximate that expectation; we can easily justify this via the (strong) law of large numbers and the central limit theorem.

That is, given \(I = \int \varphi(x) f(x) \, dx\), we sample a collection, \(X_1,\dots,X_n\), of \(n\) independent random variables with distribution \(f\) and use the sample mean of \(\varphi(X_i)\) as an approximation of \(I\): \[\begin{aligned} \hat{I}_n &= \frac{1}{n}\sum_{i=1}^n \varphi(X_i).\end{aligned}\] The strong law of large numbers tells us that \(\hat{I}_n \overset{a.s.}{\rightarrow}I\), and the central limit theorem (CLT) tells us that, provided that \(\varphi(X)\) has finite variance, \(\sqrt{n / {\mathbb{V}\textrm{ar}_{}\left[\varphi(X)\right]}}\, [\hat{I}_n - I] \overset{\mathcal{D}}{\rightarrow}Z\), where \(X \sim f\) and \(Z\) is a standard normal random variable. The CLT tells us something about the rate at which the estimate converges. Notice that this rate is independent of the space in which the \(X_i\) live: this is the basis of the (slightly misleading) claim that the Monte Carlo method beats the curse of dimensionality. Although the rate is independent of dimension (the error behaves like \(Z\) scaled by a quantity which decreases at a rate of \(1/\sqrt{n}\), independent of dimension), the associated constants typically do depend on the dimension of the sampling space (in this simple case, this is because \({\mathbb{V}\textrm{ar}_{}\left[\varphi(X)\right]}\) is typically larger when \(X\) takes values in a larger space).
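As a concrete illustration of this recipe, the following minimal Python sketch (the function name and the choice of test integrand are illustrative, not from the notes) averages \(\varphi(X_i)\) over independent draws and reports the CLT-based standard error \(\sqrt{\widehat{\mathbb{V}\textrm{ar}}[\varphi(X)]/n}\):

```python
import math
import random

def mc_integrate(phi, sampler, n):
    """Approximate I = E[phi(X)] by the sample mean over n independent
    draws, returning the estimate and a CLT-based standard error."""
    ys = [phi(sampler()) for _ in range(n)]
    mean = sum(ys) / n
    # Unbiased sample variance of phi(X); the standard error of the
    # sample mean is then sqrt(var / n), shrinking at rate 1/sqrt(n).
    var = sum((y - mean) ** 2 for y in ys) / (n - 1)
    return mean, math.sqrt(var / n)

random.seed(1)
# Example: I = integral of x^2 over [0,1] = 1/3, with f the U[0,1] density.
est, se = mc_integrate(lambda x: x * x, random.random, 100_000)
```

Note that the same function works unchanged whatever space the samples live in; only the sampler and \(\varphi\) change.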

In the case of the estimation of \(\pi\), we can let \(X_i = (X_i^x,X_i^y)\) with \(X_i^x \overset{\text{iid}}{\sim}{\textsf{U}{\left[-1,+1\right]}}\), \(X_i^y \overset{\text{iid}}{\sim}{\textsf{U}{\left[-1,+1\right]}}\), and \(X_i^x,X_i^y\) independent of one another. So we have \(f(x,y) = \frac14\mathbb{I}_{[-1,+1]}(x) \mathbb{I}_{[-1,+1]}(y)\), where \(\mathbb{I}_A(x)\) denotes the indicator function on a set \(A\) evaluated at the point \(x\), i.e. it takes the value \(1\) if \(x\in A\) and \(0\) otherwise.

We consider the points which land within a disc of unit radius centred at the origin, \(S_1 = \{ (x,y) : x^2 + y^2 \leq 1\}\), and the proportion of points drawn from \(f\) which lie within \(S_1\). The population value of this quantity (i.e. the probability that a particular point lies within the disc) is clearly the expectation of a function taking the value \(1\) within \(S_1\) and \(0\) outside it: \(\pi/4 = \int \mathbb{I}_{S_1}(x,y)f(x,y) \, dx \, dy\).

Note that this is just a particular case of the useful fact that the probability of any event \(A\) is equal to the expectation of an indicator function on that set.
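The disc example can be sketched in a few lines of Python (an illustrative implementation; the function name and sample size are our own choices):

```python
import random

def estimate_pi(n, seed=0):
    """Estimate pi as 4 times the proportion of uniform points on
    [-1,1]^2 falling inside the unit disc, i.e. 4 times the sample
    mean of the indicator function of S_1."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n)
        if rng.uniform(-1, 1) ** 2 + rng.uniform(-1, 1) ** 2 <= 1
    )
    return 4 * inside / n

est = estimate_pi(100_000)
```

Since the summand is a Bernoulli(\(\pi/4\)) indicator, this is exactly the direct-approximation view of Section 1.1.1 recast as the integral \(\int \mathbb{I}_{S_1}(x,y)f(x,y)\,dx\,dy\).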

1.1.3 Approximation of Distributions

The most abstract view of the Monte Carlo method, and indeed other approaches to simulation-based inference, is through the lens of distributional approximation. Rather than constructing an approximation of the quantity of interest directly, or of an integral representation of that quantity of interest, we could consider the method as providing an approximation of the distribution of interest itself.

If we are interested in some property of a probability distribution (a probability, an expectation, a quantile, and so on), then a natural approach would be to obtain an approximation of that distribution and to use the corresponding property of that approximation as an approximation of the quantity of interest. The natural simulation-based approach is to use the empirical distribution associated with a (large) sample from the distribution of interest. That is, we consider a discrete distribution which places mass \(1/n\) on each of \(n\) points obtained by sampling from \(f\), and then use this distribution as an approximation to \(f\) itself.

The (measure-theoretic) way of writing such an approximation is: \[\hat{f}^n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i},\] where \(x_1,\dots,x_n\) are realisations of \(X_1,\dots,X_n \overset{\text{iid}}{\sim}f\) and \(\delta_x\) denotes the probability distribution (i.e. measure) which places mass \(1\) at \(x\). In the case in which \(f\) is a distribution over the real numbers we can think in terms of its distribution function and write \[\hat{F}^n(x) = \sum_{i=1}^n \frac{1}{n} \mathbb{I}_{(-\infty,x]}(x_i),\] which just tells us that we approximate \(\mathbb{P}(X \leq x)\) with the proportion of the sampled values which lie below \(x\).
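The empirical distribution function can be sketched directly in Python (illustrative only; the helper name `empirical_cdf` and the standard normal example are our own):

```python
import bisect
import random

def empirical_cdf(xs):
    """Return F_hat, where F_hat(x) is the proportion of sample
    points lying in (-inf, x] -- the empirical distribution function."""
    xs = sorted(xs)
    n = len(xs)
    # bisect_right counts how many sorted sample values are <= x.
    return lambda x: bisect.bisect_right(xs, x) / n

# Illustrative use: approximate a standard normal by 50,000 draws.
rng = random.Random(42)
F_hat = empirical_cdf([rng.gauss(0, 1) for _ in range(50_000)])
```

Evaluating `F_hat(0.0)` should then be close to \(\Phi(0) = 1/2\), reflecting the Glivenko–Cantelli-style convergence of \(\hat{F}^n\) to \(F\).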

In the case of the estimation of \(\pi\), we saw in the previous section that we can represent \(\pi\) as an expectation with respect to \(f(x,y) = \frac14\mathbb{I}_{[-1,+1]}(x) \mathbb{I}_{[-1,+1]}(y)\). We immediately recover our approximation of \(\pi\) by taking the expectation under \(\hat{f}^n\) rather than \(f\).

This may seem like an unnecessarily complicated or abstract view of the approach, but it is very general and encompasses many ostensibly different approaches to Monte Carlo estimation.

1.2 The Usefulness of Sample Approximation

It may seem surprising at first, but it is often possible to obtain samples from distributions with respect to which it is not possible to compute expectations explicitly. Sampling is a different problem from integrating, and one may be soluble while the other is not. We can use the approximation provided by artificial samples in these cases to approximate quantities of interest that we might not be able to approximate adequately by other means.

In some other situations we will see, we may have to deal with settings in which we have access to a sample whose distribution we don’t know; in such settings we can still use the approximation of the distribution provided by the sample itself.

1.3 Further Reading

A great many books have been written on the material discussed here, but it might be useful to identify some examples with particular strengths:

  • An elementary self-contained introduction written from a similar perspective to these notes is provided by Voss (2013).

  • A more in-depth study of Monte Carlo methods, particularly Markov chain Monte Carlo, is provided by Robert and Casella (2004).

  • A slightly more recent collection of MCMC topics, including chapters on Hamiltonian Monte Carlo, is given by Brooks et al. (2011).

  • A book with many examples from the natural sciences, which might be more approachable to those with a background in those sciences, is given by Liu (2001).

Where appropriate, references to both primary literature and good tutorials are provided throughout these notes.


Brooks, S., A. Gelman, G. L. Jones, and X.-L. Meng, eds. 2011. Handbook of Markov Chain Monte Carlo. CRC Press.
Halton, J. H. 1970. “A Retrospective and Prospective Survey of the Monte Carlo Method.” SIAM Review 12 (1): 1–63.
Liu, J. S. 2001. Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. New York: Springer Verlag.
Robert, C. P., and G. Casella. 2004. Monte Carlo Statistical Methods. Second. New York: Springer Verlag.
Voss, J. 2013. An Introduction to Statistical Computing: A Simulation-Based Approach. Wiley.