A Some Markov Chain Concepts
This appendix is provided largely to make these notes self-contained and to provide a little context and some details for those who want them. The notions of stochastic processes in general and Markov chains in particular are, of course, explored in more depth during the Applied Stochastic Processes module. No significant amount of lecture time will be dedicated to this material, and if this is all unfamiliar to you then you’ll still be able to engage with the lectures and the module without becoming intimately acquainted with the fine details of this material.
I’ve attempted to balance the need for technical rigour with accessibility and have avoided making much explicit reference to the theory of measure. If you aren’t familiar with measure theory then you should be able to read this appendix by simply ignoring any reference to measurability, but be aware that, should you go on to use these concepts in the wild, we do need to be careful about such things.
A.1 Stochastic Processes
For our purposes we can define an \(E\)-valued process as a function \(\xi:{\mathcal{I}}\rightarrow E\) which maps values in some index set \({\mathcal{I}}\) to some other space \(E\). The evolution of the process is described by considering the variation of \(\xi(i)\) with \(i\). An \(E\)-valued stochastic process (or random process) can be viewed as a process in which, for each \(i \in {\mathcal{I}}\), \(\xi(i)\) is a random variable taking values in \(E\).
Although a rich literature on more general situations exists, we will consider only the case of discrete time stochastic processes in which the index set \({\mathcal{I}}\) is \({\mathbb{N}}\) (of course, any index set isomorphic to \({\mathbb{N}}\) can be used in the same framework by simple relabeling). We will use the notation \(\xi_i\) to indicate the value of the process at time \(i\) (note that there need be no connection between the index set and real time, but this terminology is both convenient and standard).
We will begin with an extremely brief description of a general stochastic process, before moving on to discuss the particular classes of process in which we will be interested. In order to characterise a stochastic process of the sort in which we are interested, it is sufficient to know all of its finite-dimensional distributions, the joint distributions of the process at any collection of finitely many times. For any collection of times \(i_1, i_2, \dots, i_t\) and any collection of measurable subsets of \(E\), \(A_{i_1}, A_{i_2}, \dots, A_{i_t}\), we are interested in the probability: \[{{\mathbb{P}}_{}\left(\xi_{i_1} \in A_{i_1}, \xi_{i_2} \in A_{i_2}, \dots, \xi_{i_t} \in A_{i_t}\right)}.\] For such a collection of probabilities to define a stochastic process, we require that they meet a certain consistency criterion. We require the marginal distribution of the values taken by the process at any collection of times to be the same under any finite-dimensional distribution which includes the process at those time points, so, defining any second collection of times \(j_1, \dots, j_s\) with the property that \(j_k \not= i_l\) for any \(k \leq s, l \leq t\), we must have that: \[\begin{aligned} {{\mathbb{P}}_{}\left(\xi_{i_1} \in A_{i_1}, \xi_{i_2} \in A_{i_2}, \dots, \xi_{i_t} \in A_{i_t}\right)} = {{\mathbb{P}}_{}\left(\xi_{i_1} \in A_{i_1}, \xi_{i_2} \in A_{i_2}, \dots, \xi_{i_t} \in A_{i_t}, \xi_{j_1} \in E, \dots, \xi_{j_s} \in E\right)}.\end{aligned}\] This is just an expression of the intuitive concept that any finite-dimensional distribution which describes the process at the times of interest should provide the same description if we neglect any information it provides about the process at other times. Or, to put it another way, they must all be marginal distributions of the same distribution.
In the case of real-valued stochastic processes, in which \(E={\mathbb{R}}\), we may express this concept in terms of the joint distribution functions (the multivariate analogue of the distribution function). Defining the joint distribution functions according to: \[\begin{aligned} F_{i_1,\dots,i_t}(x_1,x_2,\dots,x_t) = {{\mathbb{P}}_{}\left(\xi_{i_1} \leq x_1,\xi_{i_2} \leq x_2,\dots,\xi_{i_t} \leq x_t\right)},\end{aligned}\] our consistency requirement may now be expressed as: \[\begin{aligned} &F_{i_1,\dots,i_t,j_1,\dots,j_s}(x_1,x_2,\dots,x_t, \infty,\dots,\infty) = F_{i_1,\dots,i_t}(x_1,x_2,\dots,x_t).\end{aligned}\]
Having established that we can specify a stochastic process if we are able to specify its finite-dimensional distributions, we might wonder how to specify these distributions. In the next two sections, we proceed to describe a class of stochastic processes which can be described constructively and whose finite-dimensional distributions may be easily established. The Markov processes which we are about to introduce represent the most widely used class of stochastic processes, and the ones which will be of most interest in the context of Monte Carlo methods.
A.2 Discrete State Space Markov Chains
A.2.1 Basic Notions
We begin by turning our attention to the discrete state space case which is somewhat easier to deal with than the general case which will be of interest later. In the case of discrete state spaces, in which \(|E|\) is either finite, or countably infinite, we can work with the actual probability of the process having a particular value at any time (you’ll recall that in the case of continuous random variables more subtlety is generally required as the probability of any continuous random variable defined by a density (with respect to Lebesgue measure, in particular) taking any particular value is zero). This simplifies things considerably, and we can consider defining the distribution of the process of interest over the first \(t\) time points by employing the following decomposition: \[\begin{align*} {{\mathbb{P}}_{}\left(\xi_{1} = x_1, \xi_{2} = x_2, \dots, \xi_{t} = x_{t}\right)} = &{} {{\mathbb{P}}_{}\left(\xi_{1} = x_1, \xi_{2} = x_2, \dots, \xi_{t-1} = x_{t-1}\right)}\\ & {}\times {{\mathbb{P}}_{}\left(\xi_t = x_t | \xi_1 = x_1,\dots,\xi_{t-1}=x_{t-1}\right)}. \end{align*}\] Looking at this decomposition, it’s clear that we could construct all of the distributions of interest from an initial distribution from which \(\xi_1\) is assumed to be drawn and then a sequence of conditional distributions for each \(t\), leading us to the specification: \[\begin{equation} {{\mathbb{P}}_{}\left(\xi_{1} = x_1, \xi_{2} = x_2, \dots, \xi_{t} = x_{t}\right)} = {{\mathbb{P}}_{}\left(\xi_1 = x_1\right)} \prod\limits_{i=2}^t {{\mathbb{P}}_{}\left(\xi_i = x_i | \xi_{1} = x_1, \dots, \xi_{i-1} = x_{i-1}\right)}.\tag{A.1} \end{equation}\] From this specification we can trivially construct all of the finite-dimensional distributions using no more than the sum and product rules of probability.
So, we have a method for constructing finite-dimensional distributions for a discrete state space stochastic process, but it remains a little formal as the conditional distributions seem likely to become increasingly complex as the time index increases. The conditioning present in decomposition (A.1) is needed to capture any relationship between the distribution at time \(t\) and any previous time. In many situations of interest, we might expect interactions to exist on only a much shorter time-scale. Indeed, one could envisage a memoryless process in which the distribution of the state at time \(t+1\) depends only upon its state at time \(t\), \(\xi_t\), regardless of the path by which it reached \(\xi_t\). Formally, we could define such a process as: \[\begin{align} {{\mathbb{P}}_{}\left(\xi_{1} = x_1, \xi_{2} = x_2, \dots, \xi_{t} = x_{t}\right)} &= {{\mathbb{P}}_{}\left(\xi_1 = x_1\right)} \prod\limits_{i=2}^t {{\mathbb{P}}_{}\left(\xi_i = x_i | \xi_{i-1} = x_{i-1}\right)}. \tag{A.2} \end{align}\]
It is clear that (A.2) is a particular case of (A.1) in which this lack of memory property is captured explicitly, as: \[{{\mathbb{P}}_{}\left(\xi_t = x_t | \xi_{1} = x_1, \dots, \xi_{t-1} = x_{t-1}\right)} = {{\mathbb{P}}_{}\left(\xi_t = x_t | \xi_{t-1} = x_{t-1}\right)}.\] We will take this as the defining property of a collection of processes which we will refer to as discrete time Markov processes or, as they are more commonly termed in the Monte Carlo literature, Markov chains. There is some debate in the literature as to whether the term “Markov chain” should be reserved for those Markov processes which take place on a discrete state space, those which have a discrete index set (the only case we will consider here) or both. As is common in the field of Monte Carlo simulation, we will use the terms Markov chain and Markov process interchangeably.
When dealing with discrete state spaces, it is convenient to associate a row vector (formally, much of the time this will be an infinite-dimensional vector, but this need not concern us here) with any probability distribution. We assume, without loss of generality, that the state space, \(E\), is \({\mathbb{N}}\). Now, given a random variable \(X\) on \(E\), we say that \(X\) has distribution \(\mu\), often written as \(X \sim \mu\), for some vector \(\mu\) with the property that: \[\forall x \in E:\: {{\mathbb{P}}_{}\left(X = x\right)} = \mu_x .\]
A.2.1.1 Homogeneous Markov Chains
The term homogeneous Markov Chain is used to describe a Markov process of the sort just described with the additional caveat that the conditional probabilities do not depend explicitly on the time index, so: \[\begin{aligned} \forall m \in {\mathbb{N}}:\: {{\mathbb{P}}_{}\left(\xi_t = y | \xi_{t-1} = x\right)} \equiv {{\mathbb{P}}_{}\left(\xi_{t+m} = y | \xi_{t+m-1} = x\right)}.\end{aligned}\]
In this setting, it is particularly convenient to define a function corresponding to the transition probability (the probability distribution at time \(t+1\) conditional upon the state of the process at time \(t\)), or kernel as it is often known, which may be written as a two-argument function or, in the discrete case, as a matrix, \(K(i,j) = K_{ij} = {{\mathbb{P}}_{}\left(\xi_t = j | \xi_{t-1} = i\right)}\).
Having so expressed things, we are able to describe the dynamic structure of a discrete state space, discrete time Markov chain in a particularly simple form. If we allow \(\mu_t\) to describe the distribution of the chain at time \(t\), so that \(\mu_{t,i} = {{\mathbb{P}}_{}\left(\xi_t = i\right)}\), then we have, by applying the sum and product rules of probability, that: \[\begin{aligned} \mu_{t+1,j} = \sum\limits_i \mu_{t,i} K_{ij}.\end{aligned}\] We may recognise this as standard vector-matrix multiplication and write simply that \(\mu_{t+1} = \mu_t K\) and, proceeding inductively, it’s straightforward to verify that \(\mu_{t+m} = \mu_t K^m\) where \(K^m\) denotes the usual \(m^\textrm{th}\) matrix power of \(K\). We will make some use of this object, as it characterises the \(m\)-step ahead conditional distribution: \[K^m_{ij} := (K^m)_{ij} = {{\mathbb{P}}_{}\left(\xi_{t+m} = j | \xi_t = i\right)}.\] In fact, the initial distribution \(\mu_1\), together with \(K\), tells us the full distribution of the chain over any finite time horizon: \[{{\mathbb{P}}_{}\left(\xi_1 = x_1,\dots, \xi_t = x_t\right)} = \mu_{1,x_1} \prod\limits_{i=2}^t K_{x_{i-1} x_i}.\]
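To make these identities concrete, here is a minimal sketch (in Python with NumPy; the three-state kernel, the initial distribution and the chosen path are invented purely for illustration) which propagates a marginal distribution forward via \(\mu_{t+1} = \mu_t K\), forms the \(m\)-step kernel as a matrix power, and evaluates the probability of one particular path.

```python
import numpy as np

# A hypothetical three-state transition matrix K (each row sums to one)
K = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])

mu1 = np.array([1.0, 0.0, 0.0])    # initial distribution: start in state 0

# Marginal at later times: mu_{t+1} = mu_t K, iterated
mu = mu1
for _ in range(10):
    mu = mu @ K
print("marginal after 10 steps:", mu)

# m-step transition kernel K^m_{ij} = P(xi_{t+m} = j | xi_t = i)
K10 = np.linalg.matrix_power(K, 10)
print("10-step kernel, row 0:", K10[0])   # agrees with the marginal above

# Probability of a full path: mu_{1,x_1} * prod_i K_{x_{i-1} x_i}
path = [0, 1, 1, 2]
p = mu1[path[0]] * np.prod([K[path[i - 1], path[i]] for i in range(1, len(path))])
print("P(path 0,1,1,2) =", p)
```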
A general stochastic process is said to possess the weak Markov property if, for any deterministic time \(t\) and any finite integer \(p\), we may write that for any integrable function \(\varphi:E \rightarrow {\mathbb{R}}\): \[{\mathbb{E}_{}\left[\varphi(\xi_{t+p})|\xi_1 = x_1, \dots \xi_t = x_t\right]} = {\mathbb{E}_{}\left[\varphi(\xi_{t+p})|\xi_t = x_t\right]}.\]
A.2.1.2 Inhomogeneous Markov Chains
Note that it is perfectly possible to define Markov Chains whose behaviour does depend explicitly upon the time index. Although such processes are more complex to analyse than their homogeneous counterparts, they do play a rôle in Monte Carlo methodology—in both established algorithms such as simulated annealing (see Section 3.5) and in more recent developments such as adaptive Markov Chain Monte Carlo and the State Augmentation for Maximising Expectations (SAME) algorithm of Doucet, Godsill, and Robert (2002). In the interests of simplicity, what follows is presented for homogeneous Markov Chains.
A.2.1.3 Examples
Before moving on to introduce some theoretical properties of discrete state space Markov chains we will present a few simple examples. Whilst there are innumerable examples of homogeneous discrete state space Markov chains, we confine ourselves here to some particularly simple cases which will be used to illustrate some properties below, and which will probably be familiar to you.
We begin with an example which is apparently simple, and rather well known, but which exhibits some interesting properties.
Example A.1 (Simple random walk over the integers). Given a process \(\xi_t\) whose value at time \(t+1\) is \(\xi_t + 1\) with probability \(p_+\) and \(\xi_t - 1\) with probability \(p_- = 1 - p_+\), we obtain the familiar random walk. We may write this as a Markov chain by setting \(E = {\mathbb{Z}}\) and noting that the transition kernel may be written as: \[K_{ij} = \left\{ \begin{array}{cl} p_- & \text{ if } j = i - 1 \\ p_+ & \text{ if } j = i + 1 \\ 0 & \text{otherwise.} \end{array} \right.\]
Example A.2 It will be interesting to look at a slight extension of this random walk, in which there is some probability \(p_0\) of remaining in the present state at the next time step, so \(p_+ + p_- < 1\) and \(p_0 = 1 - (p_+ + p_-)\). In this case we may write the transition kernel as: \[K_{ij} = \left\{ \begin{array}{cl} p_- & \text{ if } j = i - 1 \\ p_0 & \text{ if } j = i \\ p_+ & \text{ if } j = i + 1 \\ 0 & \text{otherwise.} \end{array} \right.\]
Example A.3 (Random Walk on a Triangle). A third example which we will consider below could be termed a “random walk on a triangle”. In this case, we set \(E = \{1,2,3\}\) and define a transition kernel of the form: \[K = \left[ \begin{array}{ccc} 0 & p_+ & p_- \\ p_-& 0 & p_+ \\ p_+& p_- & 0 \end{array} \right].\]
Example A.4 (One-sided Random Walk). Finally, we consider the rather one-sided random walk on the positive integers, illustrated in Figure A.4, and defined by transition kernel: \[K_{ij} = \left\{ \begin{array}{cl} p_0 & \text{ if } j = i \\ p_+= 1 - p_0 & \text{ if } j = i + 1 \\ 0 & \text{otherwise.} \end{array} \right.\]
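To illustrate how such kernels generate sample paths, the following sketch (Python with NumPy; the particular values of \(p_+\) and \(p_-\) and the helper name `simulate` are arbitrary choices, not part of the notes) simulates the random walk on a triangle of Example A.3 by repeatedly sampling the next state from the row of \(K\) indexed by the current state.

```python
import numpy as np

rng = np.random.default_rng(0)
p_plus, p_minus = 0.6, 0.4          # arbitrary choice with p_+ + p_- = 1

# Random walk on a triangle (Example A.3); states are labelled {0, 1, 2} here
K = np.array([[0.0,     p_plus,  p_minus],
              [p_minus, 0.0,     p_plus ],
              [p_plus,  p_minus, 0.0    ]])

def simulate(K, x0, n_steps):
    """Draw one trajectory of a discrete state space chain with kernel K."""
    path = [x0]
    for _ in range(n_steps):
        path.append(rng.choice(len(K), p=K[path[-1]]))
    return path

print(simulate(K, x0=0, n_steps=20))
```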
A.2.2 Important Properties
In this section we introduce some important properties in the context of discrete state space Markov chains and attempt to illustrate their importance within the field of Monte Carlo simulation. As is the usual practice when dealing with this material, we will restrict our study to the homogeneous case. As you will notice, it is the transition kernel which is most important in characterising a Markov chain.
We begin by considering how the various states of a Markov chain may be reached from one another. In particular, the notion of states which communicate is at the heart of the study of Markov chains.
Definition A.1 (Accessibility). A state \(y\) is accessible from a state \(x\), sometimes written as \(x \rightarrow y\) if, for a discrete state space Markov chain, \[\inf \left\{t : {{\mathbb{P}}_{}\left(\xi_t = y | \xi_1 = x\right)} > 0 \right\} < \infty.\] We can alternatively write this condition in terms of the transition matrix as \(\inf \left\{ t : K^t_{xy} > 0\right\} < \infty\).
This concept tells us which states one can reach at some finite time in the future, if one starts from a particular state and then moves, at each time, according to the transition kernel, \(K\). That is, if \(x\rightarrow y\), then there is a positive probability of reaching \(y\) at some finite time in the future, if we start from a state \(x\) and then “move” according to the Markov kernel \(K\). It is now useful to consider cases in which one can traverse the entire space, or some subset of it, starting from any point.
Definition A.2 (Communication). Two states \(x, y \in E\) are said to communicate (written, by some authors as \(x \leftrightarrow y\)) if each is accessible from the other, that is: \[x \leftrightarrow y \Leftrightarrow x \rightarrow y \textrm{ and } y \rightarrow x.\]
We’re now in a position to describe the relationship, under the action of a Markov kernel, between two states. This allows us to characterise something known as the communication structure of the associated Markov chain to some degree, noting which points it is possible to travel both to and back from. We now go on to introduce a concept which will allow us to describe the properties of the full state space, or significant parts of it, rather than individual states.
Definition A.3 (Irreducibility). A Markov Chain is said to be irreducible if all states communicate, so \(\forall x,y \in E:\: x \rightarrow y\). Given a distribution \(\phi\) on \(E\), the term \(\phi\)-irreducible is used to describe a Markov chain for which every state with positive probability under \(\phi\) communicates with every other such state: \[\forall x,y \in \textrm{supp}(\phi) :\: x \rightarrow y\] where the support of the discrete distribution \(\phi\) is defined as \(\textrm{supp}(\phi) = \{ x \in E: \phi(x) > 0\}\). It is said to be strongly irreducible if any state can be reached from any point in the space in a single step and strongly \(\phi\)-irreducible if all states (except for a collection with probability 0 under \(\phi\)) may be reached in a single step.
This will prove to be important for the study of Monte Carlo methods based upon Markov chains as a chain with this property can somehow explore the entire space rather than being confined to some portion of it, perhaps one which depends upon the initial state.
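For a finite state space, accessibility is just reachability in the directed graph which has an edge \(i \rightarrow j\) whenever \(K_{ij} > 0\), so irreducibility can be checked mechanically. A minimal sketch of this idea follows (Python with NumPy; the helper names and the example kernels are hypothetical illustrations).

```python
import numpy as np

def accessible(K):
    """A[i, j] = True iff j can be reached from i in one or more steps,
    found by searching the graph of strictly positive transitions."""
    n = len(K)
    A = np.zeros((n, n), dtype=bool)
    for start in range(n):
        frontier = list(np.flatnonzero(K[start] > 0))
        reach = set(frontier)
        while frontier:
            i = frontier.pop()
            for j in np.flatnonzero(K[i] > 0):
                if j not in reach:
                    reach.add(j)
                    frontier.append(j)
        A[start, list(reach)] = True
    return A

def is_irreducible(K):
    """All pairs of states communicate (x -> y and y -> x for all x, y)."""
    A = accessible(K)
    return bool(np.all(A & A.T))

# Random walk on a triangle: irreducible
K_tri = np.array([[0.0, 0.7, 0.3], [0.3, 0.0, 0.7], [0.7, 0.3, 0.0]])
print(is_irreducible(K_tri))    # True

# A one-sided walk truncated to three states: not irreducible (no way back)
K_one = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]])
print(is_irreducible(K_one))    # False
```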
It is also important to consider the type of routes which it is possible to take between a state, \(x\), and itself as this will tell us something about the presence of long-range correlation between the states of the chain.
Definition A.4 (Period). A state \(x\) in a discrete state space Markov chain has period \(d(x)\) defined as: \[d(x) = \gcd \left\{s \geq 1 : K^s_{xx} > 0 \right\},\] where \(\gcd\) denotes the greatest common divisor. A chain possessing such a state is said to have a cycle of length \(d\).
Proposition A.1 All states which communicate have the same period and hence, in an irreducible Markov chain, all states have the same period.
Proof. Assume that \(x \leftrightarrow y\). Let there exist paths of lengths \(r\), \(s\) and \(t\) from \(x \rightarrow y\), \(y \rightarrow x\) and \(y \rightarrow y\), respectively.
There are paths of length \(r+s\) and \(r+s+t\) from \(x\) to \(x\), hence \(d(x)\) must be a divisor of \(r+s\) and \(r+s+t\) and consequently of their difference, \(t\). This holds for any \(t\) corresponding to a path from \(y \rightarrow y\) and so \(d(x)\) is a divisor of the length of any path from \(y\rightarrow y\): as \(d(y)\) is the greatest common divisor of all such paths, we have that \(d(x) \leq d(y)\).
By symmetry, we also have that \(d(y) \leq d(x)\), and this completes the proof.
In the context of irreducible Markov chains, the term periodic is used to describe those chains whose states have some common period greater than 1, whilst those chains whose period is 1 are termed aperiodic.
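For a small finite-state example, the period of a state can be examined directly from the definition by scanning successive matrix powers; a rough sketch follows (Python with NumPy; truncating the scan at `max_s` powers is an approximation that is only adequate for small examples like these).

```python
import numpy as np
from math import gcd
from functools import reduce

def period(K, x, max_s=50):
    """Approximate d(x) = gcd{ s >= 1 : K^s_{xx} > 0 } by scanning powers up to max_s."""
    return_times = []
    Ks = np.eye(len(K))
    for s in range(1, max_s + 1):
        Ks = Ks @ K                  # Ks now holds K^s
        if Ks[x, x] > 0:
            return_times.append(s)
    return reduce(gcd, return_times) if return_times else 0

# Random walk on a triangle with p_0 = 0: returns after 2 or 3 steps, so gcd = 1
K_tri = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
print(period(K_tri, 0))     # 1 (aperiodic)

# Deterministic cycle of length 3: returns only at multiples of 3
K_cycle = np.array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])
print(period(K_cycle, 0))   # 3
```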
One further quantity needs to be characterised in order to study the Markov chains which will arise later. We still require some way of describing how many times a state is visited if a Markov chain is allowed to run for infinite time. In order to do this it is useful to define an additional random quantity, the number of times that a state is visited: \[\eta_x := \sum\limits_{k=1}^\infty \mathbb{I}_{x}(\xi_k).\] We will also adopt the convention, common in the Markov chain literature, that, given any function of the path of a Markov chain, \(\varphi\), \({\mathbb{E}_{x}\left[\varphi\right]}\) is the expectation of that function under the law of the Markov chain initialised with \(\xi_1 = x\). Similarly, if \(\mu\) is some distribution over \(E\), then \({\mathbb{E}_{\mu}\left[\varphi\right]}\) should be interpreted as the expectation of \(\varphi\) under the law of the process initialised with \(\xi_1 \sim \mu\).
Definition A.5 (Transience and Recurrence). In the context of discrete state space Markov chains, we describe a state, \(x\), as transient if: \[{\mathbb{E}_{x}\left[\eta_x\right]} < \infty,\] whilst, if we have that, \[{\mathbb{E}_{x}\left[\eta_x\right]} = \infty,\] then that state will be termed recurrent.
In the case of irreducible Markov chains, transience and recurrence are properties of the chain itself, rather than its individual states: if any state is transient (or recurrent) then all states have that property. Indeed, for an irreducible Markov chain either all states are recurrent or all are transient.
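The distinction can be illustrated, though certainly not proved, by simulation: a long trajectory of the simple random walk of Example A.1 returns to its starting point many times when \(p_+ = p_-\) (the recurrent case) but typically only a handful of times when the walk has drift (the transient case). A sketch follows, assuming NumPy and with arbitrary parameter choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def count_returns(p_plus, n_steps=100_000):
    """Count visits to the starting point 0 of the simple random walk on Z."""
    steps = rng.choice([1, -1], size=n_steps, p=[p_plus, 1.0 - p_plus])
    position = np.cumsum(steps)
    return int(np.sum(position == 0))

print(count_returns(0.5))   # symmetric walk: recurrent, many returns
print(count_returns(0.7))   # drifting walk: transient, typically very few returns
```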
We will be particularly concerned in this course with Markov kernels which admit an invariant distribution.
Definition A.6 (Invariant Distribution). A distribution, \(\mu\) is said to be invariant or stationary for a Markov kernel, \(K\), if \(\mu K = \mu\).
If a Markov chain has any single time marginal distribution which corresponds to its stationary distribution, \(\xi_t \sim \mu\), then all of its future time marginals are the same, as \(\xi_{t+s} \sim \mu K^s = \mu\). A Markov chain is said to be in its stationary regime once this has occurred. Note that this tells us nothing about the correlation between the states or their joint distribution. One can also think of the invariant distribution \(\mu\) of a Markov kernel, \(K\), as a left eigenvector of \(K\) with unit eigenvalue.
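Numerically, this eigenvector characterisation gives a direct way of finding the invariant distribution of a small finite-state kernel: \(\mu K = \mu\) is equivalent to \(K^\mathsf{T} \mu^\mathsf{T} = \mu^\mathsf{T}\), so one can take the eigenvector of \(K^\mathsf{T}\) associated with the eigenvalue 1 and normalise it. A sketch (Python with NumPy; the kernel is a made-up example) follows.

```python
import numpy as np

# A hypothetical kernel; each row sums to one
K = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])

# mu K = mu  <=>  K^T mu^T = mu^T: take the left eigenvector with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(K.T)
i = np.argmin(np.abs(eigvals - 1.0))
mu = np.real(eigvecs[:, i])
mu = mu / mu.sum()                    # normalise to a probability vector
print(mu)
print(np.allclose(mu @ K, mu))        # invariance check: True
```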
Definition A.7 (Reversibility). A stationary stochastic process is said to be reversible if the statistics of the time-reversed version of the process match those of the process run forwards in time, so that reversing time makes no discernible difference to the sequence of distributions which are obtained; that is, the distribution of any collection of future states given any past history must match the conditional distribution of the past conditional upon the future being the reversal of that history.
Reversibility is a condition which, if met, simplifies the analysis of Markov chains. It is normally verified by checking the detailed balance condition, (A.3). If this condition holds for a distribution, then it also tells us that this distribution is the stationary distribution of the chain, another property which we will be interested in.
Proposition A.2 If a Markov kernel satisfies the detailed balance condition for some distribution \(\mu\), \[\begin{equation} \forall x,y \in E:\: \mu_x K_{xy} = \mu_{y} K_{yx}\tag{A.3} \end{equation}\] then:
\(\mu\) is the invariant distribution of the chain.
The chain is reversible with respect to \(\mu\).
Proof. To demonstrate that \(K\) is \(\mu\)-invariant, consider summing both sides of the detailed balance equation over \(x\): \[\begin{aligned} \sum\limits_{x \in E} \mu_x K_{xy} &= \sum\limits_{x\in E} \mu_{y} K_{yx}, \\ (\mu K)_y &= \mu_y,\end{aligned}\] and as this holds for all \(y\), we have \(\mu K = \mu\).
In order to verify that the chain is reversible we proceed directly: \[\begin{aligned} {{\mathbb{P}}_{}\left(\xi_t = x| \xi_{t+1} = y\right)} &= \frac{{{\mathbb{P}}_{}\left(\xi_t = x, \xi_{t+1} = y\right)}}{{{\mathbb{P}}_{}\left(\xi_{t+1} = y\right)}} = \frac{{{\mathbb{P}}_{}\left(\xi_t = x\right)} K_{xy}}{{{\mathbb{P}}_{}\left(\xi_{t+1} = y\right)}} = \frac{\mu_{x} K_{xy}}{\mu_y} = \frac{\mu_y K_{yx}}{\mu_y} = K_{yx}\\ &= {{\mathbb{P}}_{}\left(\xi_t = x | \xi_{t-1} =y\right)}.\end{aligned}\] In the case of a Markov chain, it is clear that if the transitions are time-reversible then the process must be time-reversible.
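The detailed balance condition (A.3) is simple to check numerically for a finite-state kernel: form the "flow" matrix with entries \(\mu_x K_{xy}\) and test whether it is symmetric. The sketch below (Python with NumPy; both kernels and the candidate distributions are invented for illustration) contrasts a reversible birth-death chain with the random walk on a triangle, which leaves the uniform distribution invariant but circulates, and so fails detailed balance when \(p_+ \neq p_-\).

```python
import numpy as np

def satisfies_detailed_balance(mu, K, tol=1e-12):
    """Check mu_x K_{xy} = mu_y K_{yx} for all pairs of states."""
    F = mu[:, None] * K               # F[x, y] = mu_x K_{xy}, the probability flow
    return bool(np.allclose(F, F.T, atol=tol))

# A birth-death chain on {0, 1, 2}: reversible w.r.t. mu = (0.25, 0.5, 0.25)
K_bd = np.array([[0.50, 0.50, 0.00],
                 [0.25, 0.50, 0.25],
                 [0.00, 0.50, 0.50]])
mu_bd = np.array([0.25, 0.5, 0.25])
print(satisfies_detailed_balance(mu_bd, K_bd))    # True, so mu_bd is also invariant

# Triangle walk with p_+ != p_-: uniform distribution is invariant,
# but the chain circulates and detailed balance fails.
K_tri = np.array([[0.0, 0.7, 0.3], [0.3, 0.0, 0.7], [0.7, 0.3, 0.0]])
mu_tri = np.ones(3) / 3
print(np.allclose(mu_tri @ K_tri, mu_tri))        # True: invariant
print(satisfies_detailed_balance(mu_tri, K_tri))  # False: not reversible
```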
A.3 General State Space Markov Chains
A.3.1 Basic Concepts
The study of general state space Markov chains is a complex and intricate business. To do so entirely rigorously requires a degree of technical sophistication which lies somewhat outside the scope of this course. Here, we will content ourselves with explaining how the concepts introduced in the context of discrete state spaces in the previous section might be extended to continuous domains via the use of probability densities. We will not consider more complex cases—such as mixed continuous and discrete spaces, or distributions over uncountable spaces which may not be described by a density. Nor will we provide proofs of results for this case, but will provide suitable references for the interested reader.
Although the guiding principles are the same, the study of Markov chains with continuous state spaces requires considerably more subtlety as it is necessary to introduce concepts which correspond to those which we introduced in the discrete case, describe the same properties and are motivated by the same intuition but which remain meaningful when we are dealing with densities rather than probabilities. As always, the principal complication is that the probability of any random variable distributed according to a non-degenerate density on a continuous state space taking any particular value is formally zero.
We will begin by considering how to emulate the decomposition we used to define a Markov chain on a discrete state space, Equation (A.2), when \(E\) is a continuous state space. In this case, what we essentially require is that the probability of any range of possible values, given the entire history of the process, depends only upon its most recent value in the sense that, for any measurable \(A_t \subseteq E\): \[{{\mathbb{P}}_{}\left(\xi_t \in A_t | \xi_{1} = x_1, \dots, \xi_{t-1} = x_{t-1}\right)} = {{\mathbb{P}}_{}\left(\xi_t \in A_t | \xi_{t-1} = x_{t-1}\right)}.\]
In the case which we are considering, it is convenient to describe the distribution of a random variable over \(E\) in terms of some probability density, \(\mu : E \rightarrow {\mathbb{R}}\), which has the property that, if integrated over any measurable set, it tells us the probability that the random variable in question lies within that set, i.e. if \(X \sim \mu\), we have, for any measurable set \(A\), that: \[{{\mathbb{P}}_{}\left(X \in A\right)} = \int\limits_A \mu(x) dx.\]
We will consider only the homogeneous case here, although the generalisation to inhomogeneous Markov chains follows in the continuous setting in precisely the same manner as the discrete one. In this context, we may describe the conditional probabilities of interest as a function \(K: E \times E \rightarrow {\mathbb{R}}\) which has the property that for all measurable sets \(A \subseteq E\) and all points \(x \in E\): \[{{\mathbb{P}}_{}\left(\xi_{t} \in A | \xi_{t-1} = x\right)} = \int\limits_A K(x,y) dy.\]
We note that, as in the discrete case, the law of a Markov chain evaluated at any finite number of points may be completely specified by the initial distribution, call it \(\mu\), and a transition kernel, \(K\). We have, for any suitable collection of sets \(A_1, \dots, A_t\), that the following holds: \[{{\mathbb{P}}_{}\left(\xi_1 \in A_1, \dots, \xi_t \in A_t\right)} = \int\limits_{A_1 \times \dots \times A_t} \mu(x_1) \prod\limits_{k=2}^t K(x_{k-1},x_k) dx_1 \dots dx_t.\] And, again, it is useful to be able to consider the \(s\)-step ahead conditional distributions, \[{{\mathbb{P}}_{}\left(\xi_{t+s} \in A | \xi_t = x_t\right)} = \int\limits_{E^{s-1} \times A} \prod\limits_{k=t+1}^{k=t+s} K(x_{k-1},x_k) dx_{t+1} \dots dx_{t+s},\] and it is useful to define an \(s\)-step ahead transition kernel in the same manner as in the discrete case; here matrix multiplication is replaced by a convolution operation but the intuition remains the same. Defining \[K^s(x_{t},x_{t+s}) := \int\limits_{E^{s-1}} \prod\limits_{k=t+1}^{k=t+s} K(x_{k-1},x_k) dx_{t+1} \dots dx_{t+s-1},\] we are able to write \[{{\mathbb{P}}_{}\left(\xi_{t+s} \in A | \xi_t = x_t\right)} = \int\limits_{A} K^s(x_{t},x_{t+s}) dx_{t+s}.\]
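As a concrete check of this convolution structure, consider the Gaussian random-walk kernel \(K(x,y) = \varphi(y - x)\), with \(\varphi\) the standard normal density, for which the \(s\)-step kernel is available in closed form: \(K^s(x,\cdot)\) is the \(\textsf{N}(x,s)\) density. The sketch below (Python, assuming NumPy and SciPy are available; the set \(A\) and the parameter values are arbitrary) estimates an \(s\)-step transition probability by simulation and compares it with the exact value.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Gaussian random-walk kernel: xi_{t+1} | xi_t = x  ~  N(x, 1)
x0, s = 0.0, 3
n = 200_000
# n independent length-s trajectories started from x0
paths = x0 + np.cumsum(rng.normal(size=(n, s)), axis=1)

# P(xi_{t+s} in A | xi_t = x0) for A = [1, 2], estimated by Monte Carlo ...
A_lo, A_hi = 1.0, 2.0
estimate = np.mean((paths[:, -1] >= A_lo) & (paths[:, -1] <= A_hi))

# ... and exactly: the s-fold convolution of N(0,1) increments is N(x0, s)
exact = norm.cdf(A_hi, loc=x0, scale=np.sqrt(s)) - norm.cdf(A_lo, loc=x0, scale=np.sqrt(s))
print(estimate, exact)
```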
A.3.2 Important Properties
In this section we will introduce properties which fulfill the same rôle in the context of continuous state spaces as those introduced in Section A.2.2 do in the discrete setting.
Whilst it is possible to define concepts similar to communication and accessibility in a continuous state space context, this isn’t especially productive. We are more interested in the property of irreducibility: we want some way of determining what class of states are reachable from one another and hence what part of \(E\) might be explored, with positive probability, starting from a point within such a class. We will proceed directly to a continuous state space definition of this concept.
Definition A.8 (Irreducibility). Given a distribution, \(\mu\), over \(E\), a Markov chain is said to be \(\mu\)-irreducible if for all points \(x \in E\) and all measurable sets \(A\) such that \(\mu(A) > 0\) there exists some \(t\) such that: \[\int\limits_A K^t(x,y) dy > 0.\] If this condition holds with \(t = 1\), then the chain is said to be strongly \(\mu\)-irreducible.
This definition has the same character as that employed in the discrete case, previously, but is well defined for more general state spaces. It still tells us whether a chain is likely to be satisfactory if we are interested in approximation of some property of a measure \(\mu\) by using a sample of the evolution of that chain: if it is not \(\mu\)-irreducible then there are some points in the space from which we cannot reach all of the support of \(\mu\), and this is likely to be a problem. In the sequel we will be concerned more or less exclusively with Markov chains which are irreducible with respect to some measure of interest.
We need a little more subtlety in extending some of the concepts introduced in the case of discrete Markov chains to the present context. In order to do this, it will be necessary to introduce the concept of the small set; these function as a replacement for the individual states of a discrete space Markov chain as we will see shortly.
A first attempt might be to consider the following sets, which have the property that the distribution taken by the Markov chain at time \(t+1\) is the same if it starts at any point in the set—so the conditional distribution function is constant over this set.
Definition A.9 (Atoms). A Markov chain with transition kernel \(K\) is said to have an atom, \(\alpha \subseteq E\), if there is some probability distribution, \(\nu\), such that: \[\forall x \in \alpha, A \subseteq E: \quad \int\limits_A K(x,y) dy = \int\limits_A \nu(y) dy.\] If the Markov chain in question is \(\nu\)-irreducible, then \(\alpha\) is termed an accessible atom.
The concept of an atom starts to allow us to introduce some sort of structure similar to that seen in discrete chains: it provides us with a set of positive probability such that, if the chain ever enters it, we know the distribution of the subsequent state. Note that this is much stronger than knowledge of the transition kernel, \(K\), as in general all points in the space have zero probability. Most interesting continuous state spaces do not possess atoms. The condition that the distribution of the next state is precisely the same, wherever the current state lies, is rather strong. Another approach would be to require only that the conditional distributions have a common component, and that is the intuition behind a much more useful concept which underlies much of the analysis of general state space Markov chains.
Definition A.10 (Small Sets). A set, \(C \subseteq E\), is termed small for a given Markov chain (or, when one is being precise, \((\nu,s,\epsilon)\)-small) if there exists some positive integer \(s\), some \(\epsilon > 0\) and some non-trivial probability distribution, \(\nu\), such that: \[\forall x \in C, A \subseteq E:\: \quad \int\limits_A K^s(x,y) dy \geq \epsilon \int\limits_A \nu(y) dy.\]
This tells us that the distribution \(s\)-steps after the chain enters the small set has a component of size at least \(\epsilon\) of the distribution \(\nu\), wherever it was within that set. In this sense, small sets are not “too big”: there is potentially some commonality of all paths emerging from them. Although we have not proved that such sets exist for any particular class of Markov chains it is, in fact, the case that they do for many interesting Markov chain classes and their existence allows for a number of sophisticated analytic techniques to be applied.
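As an illustrative example (standard, though not taken from these notes), consider the Gaussian random-walk chain on \(E = {\mathbb{R}}\) with kernel \(K(x,y) = \varphi(y-x)\), where \(\varphi\) denotes the standard normal density, and take \(C = [-a,a]\) for some \(a > 0\). For every \(x \in C\) we have \(K(x,y) \geq h(y) := \inf_{x' \in C} \varphi(y - x') > 0\), so setting \[\epsilon = \int h(y) dy \in (0,1), \qquad \nu(y) = h(y)/\epsilon,\] we obtain, for every \(x \in C\) and every measurable \(A\), \[\int\limits_A K(x,y) dy \geq \epsilon \int\limits_A \nu(y) dy,\] and hence \(C\) is \((\nu,1,\epsilon)\)-small. Every bounded interval is small for this chain, although the whole space \(E\) is not.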
In order to define cycles (and hence the notion of periodicity) in the general case, we require the existence of a small set. We need some group of “sufficiently similar” points in the state space which have a finite probability of being reached. We then treat this collection of points in the same manner as an individual state in the discrete case, leading to the following definitions.
Definition A.11 (Cycles). A \(\mu\)-irreducible Markov chain has a cycle of length \(d\) if there exists a \((\nu,M,\epsilon)\)-small set C, such that: \[d = \gcd \left\{s \geq 1 : C \text{ is $(\nu,s,\delta_s \epsilon)$-small for some } \delta_s > 0 \right\}.\]
This provides a reasonable concept of periodicity within a general state space Markov chain as it gives us a way of characterising the existence of regions of the space with the property that, wherever you start within that region, you have positive probability of returning to that set after any multiple of \(d\) steps, and this does not hold for any number of steps which is not a multiple of \(d\). We are able to define periodicity and aperiodicity in the same manner as for discrete chains, but using this definition of a cycle. As in the discrete case, all states within the support of \(\mu\) in a \(\mu\)-irreducible chain must have the same period (see Proposition A.1) although we will not prove this here.
Considering periodicity from a different viewpoint, we are able to characterise it in a manner which is rather easier to interpret but somewhat difficult to verify in practice. The following definition of period is equivalent to that given above (Nummelin 1984): a Markov chain has a period \(d\) if there exists some partition of the state space, \(E_1, \dots, E_d\) with the properties that:
\(\forall i \neq j:\: E_i \cap E_j = \emptyset\),
\(\bigcup\limits_{i=1}^d E_i = E\),
\(\forall i, j, t, s:\: {{\mathbb{P}}_{}\left(\xi_{t+s} \in E_j | \xi_t \in E_i\right)} = \left\{ \begin{array}{cl} 1 & j \equiv i + s \textrm{ mod } d \\ 0 & \textrm{otherwise}. \end{array} \right.\)
What this actually tells us is that a Markov chain with a period of \(d\) has associated with it a disjoint partition of the state space, \(E_1, \dots, E_d\) and that we know that the chain moves with probability 1 from set \(E_1\) to \(E_2\), \(E_2\) to \(E_3\), \(E_{d-1}\) to \(E_d\) and \(E_d\) to \(E_1\) (assuming that \(d \geq 3\), of course). Hence the chain will visit a particular element of the partition with a period of \(d\).
We also require some way of characterising how often a continuous state space Markov chain visits any particular region of the state space in order to obtain concepts analogous to those of transience and recurrence in the discrete setting. In order to do this we define a collection of random variables \(\eta_A\) for any subset \(A\) of \(E\), which correspond to the number of times the set \(A\) is visited, i.e. \(\eta_A := \sum_{k=1}^\infty \mathbb{I}_A(\xi_k)\), and, once again, we use \({\mathbb{E}_{x}\left[\cdot\right]}\) to denote the expectation under the law of the Markov chain with initial state \(x\). We note that if a chain is not \(\mu\)-irreducible for some distribution \(\mu\), then there is no guarantee that it is either transient or recurrent; however, the following definitions do hold:
Definition A.12 (Transience and Recurrence). We begin by defining uniform transience and recurrence for sets \(A \subseteq E\) for \(\mu\)-irreducible general state space Markov chains. Such a set is recurrent if: \[\forall x \in A:\: {\mathbb{E}_{x}\left[\eta_A\right]} = \infty.\] A set is uniformly transient if there exists some \(M < \infty\) such that: \[\forall x \in A:\: {\mathbb{E}_{x}\left[\eta_A\right]} \leq M.\] The weaker concept of transience of a set may then be introduced. A set, \(A \subseteq E\), is transient if it may be expressed as a countable union of uniformly transient sets, i.e.: \[\begin{aligned} \exists \left\{ B_i \subseteq E\right\}_{i=1}^\infty :& \quad A \subseteq \bigcup\limits_{i=1}^\infty B_i \\ \forall i \in {\mathbb{N}},\, \forall x \in B_i:&\quad {\mathbb{E}_{x}\left[\eta_{B_i}\right]} \leq M_i < \infty.\end{aligned}\] A general state space Markov chain is recurrent if the following two conditions are satisfied:
The chain is \(\mu\)-irreducible for some distribution \(\mu\),
For every measurable set \(A \subseteq E\) such that \(\int_A \mu(y) dy > 0\), \({\mathbb{E}_{x}\left[\eta_A\right]} = \infty\) for every \(x \in A\);
whilst it is transient if it is \(\mu\)-irreducible for some distribution \(\mu\) and the entire space is transient.
As in the discrete setting, in the case of irreducible chains, transience and recurrence are properties of the chain rather than individual states: all states within the support of the irreducibility distribution are either transient or recurrent. It is useful to note that any \(\mu\)-irreducible Markov chain which has stationary distribution \(\mu\) is positive recurrent (Tierney 1994).
A slightly stronger form of recurrence is widely employed in the proof of many theoretical results which underlie many applications of Markov chains to statistical problems; this form of recurrence is known as Harris recurrence and may be defined as follows:
Definition A.13 (Harris Recurrence). A set \(A \subseteq E\) is Harris recurrent if \({{\mathbb{P}}_{x}\left(\eta_A = \infty\right)} = 1\) for every \(x \in A\).
A Markov chain is Harris recurrent if there exists some distribution \(\mu\) with respect to which it is irreducible and every set \(A\) such that \(\int_A \mu(x) dx > 0\) is Harris recurrent.
The concepts of invariant distribution, reversibility and detailed balance are essentially unchanged from the discrete setting. It’s necessary to consider integrals with respect to densities rather than sums over probability distributions, but no fundamental differences arise here.
A.4 Selected Theoretical Results
The probabilistic study of Markov chains dates back more than fifty years and comprises an enormous literature, much of it rather technically sophisticated. We don’t intend to summarise that literature here, nor to provide proofs of the results which we present. This section serves only to motivate the material presented in the subsequent chapters.
The ergodic theorems and central limit theorem below fill the rôle which the law of large numbers and the central limit theorem for independent, identically distributed random variables fill in the case of simple Monte Carlo methods. They tell us, roughly speaking, that if we take the sample averages of a function at the points of a Markov chain which satisfies suitable regularity conditions and possesses the correct invariant distribution, then we have convergence of those averages to the integral of the function of interest under the invariant distribution and, furthermore, under stronger regularity conditions we can obtain a rate of convergence.
There are two levels of strength of law of large numbers which it is useful to be aware of. The first tells us that for most starting points of the chain a law of large numbers will hold. Under slightly stronger conditions (which it may be difficult to verify in practice) it is possible to show the same result holds for all starting points.
Theorem A.1 (A Simple Ergodic Theorem). If \(\left(\xi_{i}\right)_{i \in {\mathbb{N}}}\) is a \(\mu\)-irreducible, recurrent \({\mathbb{R}}^d\)-valued Markov chain which admits \(\mu\) as a stationary distribution, then the following strong law of large numbers holds (convergence is with probability 1) for any integrable function \(\varphi:{\mathbb{R}}^d \rightarrow{\mathbb{R}}\): \[\lim_{t\rightarrow \infty} \frac1t \sum\limits_{i=1}^t \varphi(\xi_i) \overset{a.s.}{=}\int \varphi(x) \mu(x) dx\] for almost every starting value \(x\) (i.e. for any \(x\) except perhaps for a set of bad starting values, \(\mathcal{N}\), which has the property that \(\int_\mathcal{N} \mu(x) dx = 0\)).
An outline of the proof of this theorem is provided by G. O. Roberts and Rosenthal (2004, Fact 5).
Theorem A.2 (A Stronger Ergodic Theorem). If \(\left(\xi_{i}\right)_{i \in {\mathbb{N}}}\) is a \(\mu\)-invariant, Harris recurrent Markov chain, then the following strong law of large numbers holds (convergence is with probability 1) for any integrable function \(\varphi:E\rightarrow{\mathbb{R}}\): \[\lim_{t\rightarrow \infty} \frac1t \sum\limits_{i=1}^t \varphi(\xi_i) \overset{a.s.}{=}\int \varphi(x) \mu(x) dx.\]
A proof of this result is beyond the scope of the course. This is a particular case of Robert and Casella (2004, p241, Theorem 6.63), and a proof of the general theorem is given there. The same theorem is also presented with proof in Meyn and Tweedie (1993, p433, Theorem 17.3.2).
Theorem A.3 (A Central Limit Theorem). Under technical regularity conditions (see Jones (2004) for a summary of various combinations of conditions) it is possible to obtain a central limit theorem for the ergodic averages of a Harris recurrent, \(\mu\)-invariant Markov chain, and a function \(\varphi:E\rightarrow {\mathbb{R}}\) which has at least two finite moments (depending upon the combination of regularity conditions assumed, it may be necessary to have a finite moment of order \(2+\delta\)): \[\begin{aligned} & \lim_{t\rightarrow\infty} \sqrt{t} \left[\frac{1}{t}\sum\limits_{i=1}^t \varphi(\xi_i) - \int \varphi(x) \mu(x) dx \right] \overset{\mathcal{D}}{=}{\textsf{N}\left( 0,\sigma^2(\varphi) \right)}, \\ & \sigma^2(\varphi) = {\mathbb{E}_{}\left[(\varphi(\xi_1) - \bar{\varphi})^2\right]} + 2\sum_{k=2}^\infty {\mathbb{E}_{}\left[(\varphi(\xi_1) - \bar{\varphi})(\varphi(\xi_k) - \bar{\varphi})\right]},\end{aligned}\] where \(\bar{\varphi} = \int \varphi(x) \mu(x) dx\).
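A hedged numerical illustration of these results (Python with NumPy; the AR(1) chain, its parameters and the batch-means construction are illustrative choices, not part of the notes): for \(|\rho| < 1\), the autoregression \(\xi_{t+1} = \rho \xi_t + W_t\) with standard normal innovations is a \(\mu\)-irreducible, Harris recurrent chain whose invariant distribution is \(\textsf{N}(0, 1/(1-\rho^2))\), so the ergodic average of \(\varphi(x) = x^2\) should converge to \(1/(1-\rho^2)\); a batch-means calculation gives a crude estimate of the asymptotic variance \(\sigma^2(\varphi)\) appearing in the central limit theorem.

```python
import numpy as np

rng = np.random.default_rng(3)

# AR(1) chain: xi_{t+1} = rho * xi_t + w_t,  w_t ~ N(0, 1)
rho, t_max = 0.9, 500_000
xi = np.empty(t_max)
xi[0] = 0.0
for t in range(1, t_max):
    xi[t] = rho * xi[t - 1] + rng.normal()

phi = xi ** 2
# Ergodic average vs. the exact integral under the invariant distribution
print(phi.mean(), 1.0 / (1.0 - rho ** 2))

# A crude batch-means estimate of sigma^2(phi), the CLT variance
n_batches = 100
batch_len = t_max // n_batches
batch_means = phi.reshape(n_batches, batch_len).mean(axis=1)
print(batch_len * batch_means.var(ddof=1))
```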
A.5 Further Reading
We conclude this chapter by noting that innumerable tutorials on the subject of Markov chains have been written, particularly with reference to their use in the field of Monte Carlo simulation. Some which might be of interest include the following:
G. Roberts (1996) provides an elementary introduction to some Markov chain concepts required to understand their use in Monte Carlo algorithms.
In the same volume, Tierney (1996) provides a more technical look at the same concepts; a more in-depth, but similar approach is taken by the earlier paper Tierney (1994).
An alternative, elementary formulation of some of the material presented here together with some additional background material, aimed at an engineering audience, can be found in Johansen (2009).
Robert and Casella (2004, chap. 6). This is a reasonably theoretical treatment intended for those interested in Markov chain Monte Carlo; it is reasonably technical in content, without dwelling on proofs. Those familiar with measure theoretic probability might find this a reasonably convenient place to start.
Those of you interested in technical details might like to consult Meyn and Tweedie (1993). This is the definitive reference work on stability, convergence and theoretical analysis of Markov chains and it is now possible to download it, free of charge from the website of one of the authors.
A less detailed, but more general and equally rigorous, look at Markov chains is provided by the seminal work of Nummelin (1984). This covers some material outside of the field of probability, but remains a concise work and presents only a few of the simpler results. It is perhaps a less intimidating starting point than Meyn and Tweedie (1993), although opinions on this vary.
A survey of theoretical results relevant to Monte Carlo is provided by G. O. Roberts and Rosenthal (2004). Again, this is necessarily somewhat technical.