Chapter 14 MCMC for Continuous Distributions, Gaussian Process (Lecture on 02/18/2021)

We studied why the Metropolis-Hastings chain converges to the full posterior when the parameter space is discrete. Now we study the case where the parameter space is continuous. When the Markov chain is defined on a continuous state space, we cannot use a transition probability matrix. We are dealing with a Markov chain of the form \(\{X_n:n\geq 1\}\), where each \(X_n\in\mathcal{S}\) and \(\mathcal{S}\) is not discrete.

Definition 14.1 (Transition kernel) The transition kernel \(T(x,x^{\prime})\) for making a transition from \(X_n=x\) to \(X_{n+1}=x^{\prime}\) is a function of \(x\) and \(x^{\prime}\) such that

  1. \(T(x,\cdot)\) is a density function;
  2. \(T(\cdot,x^{\prime})\) is a measurable function.

You can think of \(T(x,\cdot)\) as the conditional distribution of \(X_{n+1}|X_n=x\).

Let \(g(x^{\prime}|x)\) be the proposal density for proposing \(x^{\prime}\) from \(x\), and let \(a(x,x^{\prime})\) be the probability of accepting \(x^{\prime}\) when the current state is \(x\). Then the transition kernel for Metropolis-Hastings can be described as \[\begin{equation} T(x,x^{\prime})=\left\{\begin{aligned} & g(x^{\prime}|x)a(x,x^{\prime}) & x\neq x^{\prime}\\ &1-\int g(x^{\prime}|x)a(x,x^{\prime})dx^{\prime} & x=x^{\prime} \end{aligned} \right. \tag{14.1} \end{equation}\]

The first expression in (14.1) is the probability that you propose \(x^{\prime}\), which is given by \(g(x^{\prime}|x)\), and the proposal then gets accepted (the probability of that is \(a(x,x^{\prime})\)). For the second expression in (14.1), \(\int g(x^{\prime}|x)a(x,x^{\prime})dx^{\prime}\) is the probability of proposing some \(x^{\prime}\) and accepting it. Thus, \(1-\int g(x^{\prime}|x)a(x,x^{\prime})dx^{\prime}\) is the probability that the proposed \(x^{\prime}\) is rejected, in which case the chain stays where it is, i.e., \(x^{\prime}=x\).
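A schematic of this kernel as code may help. The following is a minimal sketch (not from the lecture) in which the proposal sampler and the acceptance probability \(a(x,x^{\prime})\) are passed in as generic functions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_kernel_step(x, propose, accept_prob):
    """One draw from the transition kernel (14.1).

    propose(x)         -- samples x' from the proposal density g(.|x)
    accept_prob(x, x') -- the acceptance probability a(x, x')
    """
    x_prop = propose(x)
    if rng.uniform() < accept_prob(x, x_prop):
        return x_prop          # move: the g(x'|x) a(x, x') branch of (14.1)
    return x                   # rejection: the point mass at x' = x
```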

Definition 14.2 (Stationary distribution in continuous case) If a transition kernel satisfies \[\begin{equation} \int T(x,x^{\prime})\pi(x)dx=\pi(x^{\prime}),\quad \forall x^{\prime} \tag{14.2} \end{equation}\] then \(\pi(\cdot)\) is known as the stationary distribution for the transition kernel.

Notice the similarity with the definition of the stationary distribution in the discrete case (Definition 9.1), where the stationary distribution satisfies \(\pi_j=\sum_i\pi_ip_{ij},\forall j\). Here we simply replace the sum with an integral.
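As a quick numerical illustration of (14.2) (not from the lecture), take the Gaussian autoregressive kernel \(T(x,x^{\prime})=N(x^{\prime};\rho x,1-\rho^2)\). Its stationary distribution is the standard normal, which a quadrature check confirms:

```python
import numpy as np
from scipy.stats import norm

rho = 0.7

def T(x, x_prime):
    # Gaussian AR(1) kernel: x' | x ~ N(rho * x, 1 - rho^2)
    return norm.pdf(x_prime, loc=rho * x, scale=np.sqrt(1.0 - rho**2))

pi = norm.pdf                      # candidate stationary density: standard normal

xs = np.linspace(-10, 10, 4001)    # quadrature grid for x
for x_prime in [-1.5, 0.0, 2.0]:
    lhs = np.trapz(T(xs, x_prime) * pi(xs), xs)     # int T(x, x') pi(x) dx
    print(x_prime, lhs, pi(x_prime))                # lhs matches pi(x') up to quadrature error
```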

Proposition 14.1 (Detailed balance condition for continuous case) Similar to the discrete case, if \(\exists\pi(\cdot)\) such that \[\begin{equation} T(x,x^{\prime})\pi(x)=T(x^{\prime},x)\pi(x^{\prime}),\quad \forall x,x^{\prime} \tag{14.3} \end{equation}\] then \(\pi(\cdot)\) is the stationary distribution, and we say that the detailed balance condition is satisfied.

Let \(p(x)\) be the full posterior density. The Metropolis-Hastings acceptance probability is \[\begin{equation} a(x,x^{\prime})=\min\left(1,\frac{p(x^{\prime})g(x|x^{\prime})}{p(x)g(x^{\prime}|x)}\right) \tag{14.4} \end{equation}\] We now verify that this choice satisfies the detailed balance condition (14.3) with \(\pi=p\).

If \(x\neq x^{\prime}\), we then get \[\begin{equation} \begin{split} T(x,x^{\prime})p(x)&=g(x^{\prime}|x)a(x,x^{\prime})p(x)\\ &=g(x^{\prime}|x)p(x)\min\left(1,\frac{p(x^{\prime})g(x|x^{\prime})}{p(x)g(x^{\prime}|x)}\right) \end{split} \tag{14.5} \end{equation}\]

We consider two cases. In the first case, assume \(p(x^{\prime})g(x|x^{\prime})<p(x)g(x^{\prime}|x)\), then (14.5) becomes \[\begin{equation} T(x,x^{\prime})p(x)=g(x^{\prime}|x)p(x)\frac{p(x^{\prime})g(x|x^{\prime})}{p(x)g(x^{\prime}|x)}=p(x^{\prime})g(x|x^{\prime}) \tag{14.6} \end{equation}\]

In addition, since \(p(x^{\prime})g(x|x^{\prime})<p(x)g(x^{\prime}|x)\), the minimum in \(a(x^{\prime},x)\) equals 1, so \[\begin{equation} \begin{split} T(x^{\prime},x)p(x^{\prime})&=g(x|x^{\prime})a(x^{\prime},x)p(x^{\prime})\\ &=g(x|x^{\prime})p(x^{\prime})\min\left(1,\frac{p(x)g(x^{\prime}|x)}{p(x^{\prime})g(x|x^{\prime})}\right)\\ &=g(x|x^{\prime})p(x^{\prime}) \end{split} \tag{14.7} \end{equation}\]

Therefore, by (14.6) and (14.7), we get \[\begin{equation} T(x,x^{\prime})p(x)=T(x^{\prime},x)p(x^{\prime}),\quad \forall x\neq x^{\prime} \tag{14.8} \end{equation}\] and it is trivially satisfied for \(x=x^{\prime}\).

For the second case, assume \(p(x^{\prime})g(x|x^{\prime})\geq p(x)g(x^{\prime}|x)\); the proof follows in a similar way. Thus, \[\begin{equation} T(x,x^{\prime})p(x)=T(x^{\prime},x)p(x^{\prime}),\quad \forall x, x^{\prime} \tag{14.9} \end{equation}\]

Therefore, \(p(x)\) is the stationary distribution. Assuming the regularity conditions hold, \(p(x)\) is also the limiting distribution. This is the reason why, when doing MCMC with Metropolis-Hastings, you ultimately sample from the full posterior distribution.
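Putting (14.1) and (14.4) together, here is a hedged sketch of a complete random-walk Metropolis-Hastings run targeting a density known only up to a normalizing constant (a toy stand-in for a posterior; the specific target below is illustrative). With a symmetric Gaussian proposal, \(g(x^{\prime}|x)=g(x|x^{\prime})\), so the ratio in (14.4) reduces to \(p(x^{\prime})/p(x)\):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_p(x):
    # Illustrative unnormalized log target, standing in for a log posterior
    return -x**4 / 4.0 - x**2 / 2.0

def random_walk_mh(n_iter=50_000, proposal_sd=1.0, x0=0.0):
    x = x0
    samples = np.empty(n_iter)
    for i in range(n_iter):
        x_prop = x + proposal_sd * rng.standard_normal()   # draw from g(.|x)
        # Accept with probability a(x, x') from (14.4), computed on the log scale
        if np.log(rng.uniform()) < log_p(x_prop) - log_p(x):
            x = x_prop
        samples[i] = x
    return samples

draws = random_walk_mh()[10_000:]          # discard burn-in
print(draws.mean(), draws.std())           # Monte Carlo summaries under the target
```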

Now consider the Gibbs sampler. Let \(p(\theta_1,\theta_2)\) be the full posterior distribution for \((\theta_1,\theta_2)\), and let \(p(\theta_1|\theta_2)\) and \(p(\theta_2|\theta_1)\) be the full conditional distributions of \(\theta_1|\theta_2\) and \(\theta_2|\theta_1\). The transition kernel is given by \[\begin{equation} T((\theta_1,\theta_2),(\theta_1^{\prime},\theta_2^{\prime}))=p(\theta_1^{\prime}|\theta_2)p(\theta_2^{\prime}|\theta_1^{\prime}) \tag{14.10} \end{equation}\]

For the Gibbs sampler, the detailed balance condition does not hold in general, so we verify stationarity using the definition directly.

We want to prove \[\begin{equation} \int\int T((\theta_1,\theta_2),(\theta_1^{\prime},\theta_2^{\prime}))p(\theta_1,\theta_2)d\theta_1d\theta_2=p(\theta_1^{\prime},\theta_2^{\prime}),\quad\forall \theta_1^{\prime},\theta_2^{\prime} \tag{14.11} \end{equation}\]

The L.H.S. can be written as \[\begin{equation} \begin{split} L.H.S. &=\int\int p(\theta_1^{\prime}|\theta_2)p(\theta_2^{\prime}|\theta_1^{\prime})p(\theta_1,\theta_2)d\theta_1d\theta_2\\ &=p(\theta_2^{\prime}|\theta_1^{\prime})\int_{\theta_2} p(\theta_1^{\prime}|\theta_2)(\int_{\theta_1}p(\theta_1,\theta_2)d\theta_1)d\theta_2\\ &=p(\theta_2^{\prime}|\theta_1^{\prime})\int_{\theta_2} p(\theta_1^{\prime}|\theta_2)p(\theta_2)d\theta_2\\ &=p(\theta_2^{\prime}|\theta_1^{\prime})p(\theta_1^{\prime})\\ &=p(\theta_1^{\prime},\theta_2^{\prime})=R.H.S. \end{split} \tag{14.12} \end{equation}\]

This implies that \(p(\theta_1,\theta_2)\) is the stationary distribution for the Gibbs sampling kernel. Under ergodicity, the stationary distribution is then the limiting distribution. Thus, the Gibbs sampler is also valid.

This validation is for the Gibbs sampler with two parameters. A similar argument generalizes to Gibbs samplers with more parameters.
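As a concrete two-parameter instance (illustrative, not from the lecture), consider a bivariate normal target with zero means, unit variances, and correlation \(\rho\). Both full conditionals are normal, \(\theta_1|\theta_2\sim N(\rho\theta_2,1-\rho^2)\) and \(\theta_2|\theta_1\sim N(\rho\theta_1,1-\rho^2)\), so the kernel (14.10) can be sampled directly:

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8                       # correlation of the bivariate normal target

def gibbs(n_iter=20_000):
    theta1, theta2 = 0.0, 0.0
    samples = np.empty((n_iter, 2))
    sd = np.sqrt(1.0 - rho**2)
    for i in range(n_iter):
        # Draw theta1' ~ p(theta1 | theta2), then theta2' ~ p(theta2 | theta1'),
        # exactly the kernel in (14.10)
        theta1 = rng.normal(rho * theta2, sd)
        theta2 = rng.normal(rho * theta1, sd)
        samples[i] = theta1, theta2
    return samples

draws = gibbs()[2_000:]                    # discard burn-in
print(np.corrcoef(draws.T)[0, 1])          # close to rho, as it should be under the target
```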

Gaussian Process

Definition 14.3 (Gaussian Process) A Gaussian process is a stochastic process \(\{X_t:t\geq 0\}\) whose finite-dimensional distributions are multivariate normal.

Why is the Gaussian process so important? Consider the example where the data \((y_i,x_i)\) for \(i=1,\cdots,n\) look like Figure 14.1.

FIGURE 14.1: An example where we may need to use a Gaussian process to fit the data.

We may first consider fitting the data using linear regression, that is, \[\begin{equation} y_i=\mu+x_i\beta+\epsilon_i \tag{14.13} \end{equation}\] which is shown as the blue line in Figure 14.1. It is not a good fit to the data. There is obviously an inherent nonlinear relationship between \(x\) and \(y\) that is not captured by the linear regression model. Therefore, we consider fitting a nonlinear regression model \[\begin{equation} y_i=f(x_i)+\epsilon_i \tag{14.14} \end{equation}\] where \(f\) is not a linear function.

There are many different ways to make \(f\) nonlinear. For example, \(f\) can be a piecewise polynomial or, more generally, \(f\) can be constructed using spline functions. The idea behind this kind of method is to fit locally linear or polynomial functions to the data, then add constraints so that the function is smooth at the boundaries. However, there are some issues with this technique. First of all, you need to know how many spline basis functions to use. Every spline function is determined by a set of knots, and you have to decide where to place the knots; there is no automatic way to do this. These drawbacks lead to the use of a Gaussian process to estimate the nonlinear function \(f\). The Gaussian process is much more automatic than spline functions and is also flexible.

\(f(\cdot)\) is an unknown function and our job is to put a prior distribution on \(f(\cdot)\). Here \(f(\cdot)\) is an infinite-dimensional quantity, and a stochastic process can act as a prior on an infinite-dimensional function. This leads to the idea of using a Gaussian process prior on \(f(\cdot)\).

Formally, we say \(f(\cdot)\sim GP(\mu,C_{\nu}(d,\theta))\). Here we inherently assume that the fitted Gaussian process is stationary, which means \(Cov(f(x),f(x^{\prime}))\) is just a function of \(x-x^{\prime}\). \(f(\cdot)\sim GP(\mu,C_{\nu}(d,\theta))\) implies that \(E(f(x_i))=\mu\) for all \(x_i\) and \(Cov(f(x_i),f(x_j))=C_{\nu}(d,\theta)\), where \(d=|x_i-x_j|\) and \(\theta,\nu\) are parameters. \(C_{\nu}(\cdot,\cdot)\) is called the covariance kernel of the Gaussian process.

Question: How do we choose the covariance kernel?

Definition 14.4 (Matern covariance kernel) The Matern family of covariance kernels is defined as \[\begin{equation} C_{\nu}(d,\phi,\sigma^2)=\sigma^2\frac{2^{1-\nu}}{\Gamma(\nu)}(\sqrt{2\nu}\phi d)^{\nu}K_{\nu}(\sqrt{2\nu}\phi d) \tag{14.15} \end{equation}\] where \(K_{\nu}(\cdot)\) is the modified Bessel function of the second kind.

For more about the Matern family of covariance functions, one can refer to Stein (1999).

Here are some special cases of the Matern covariance function (a short code sketch of these kernels follows the list):

  • \(\nu=\frac{1}{2}\), \(C_{\nu}(d,\phi,\sigma^2)=\sigma^2\exp(-d\phi)\), which is also known as the exponential covariance kernel;

  • \(\nu=\frac{3}{2}\), \(C_{\nu}(d,\phi,\sigma^2)=\sigma^2(1+\sqrt{3}d\phi)\exp(-\sqrt{3}d\phi)\);

  • \(\nu=\frac{5}{2}\), \(C_{\nu}(d,\phi,\sigma^2)=\sigma^2(1+\sqrt{5}d\phi+\frac{5}{3}d^2\phi^2)\exp(-\sqrt{5}d\phi)\);

  • \(\nu\to\infty\), \(C_{\nu}(d,\phi,\sigma^2)=\sigma^2\exp(-\frac{\phi^2d^2}{2})\), which is also known as the Gaussian covariance kernel.
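The sketch below implements (14.15) and the closed forms listed above; the parameterization follows the lecture, with \(\phi\) acting as an inverse length-scale, and the helper names are illustrative:

```python
import numpy as np
from scipy.special import gamma, kv   # kv: modified Bessel function of the second kind

def matern(d, nu, phi, sigma2=1.0):
    """Matern covariance (14.15), evaluated at an array of distances d."""
    d = np.atleast_1d(np.asarray(d, dtype=float))
    c = np.full_like(d, sigma2)                       # C_nu(0) = sigma^2
    r = np.sqrt(2.0 * nu) * phi * d[d > 0]
    c[d > 0] = sigma2 * 2.0 ** (1.0 - nu) / gamma(nu) * r ** nu * kv(nu, r)
    return c

# Closed forms for the special cases listed above
def matern_half(d, phi, s2=1.0):
    return s2 * np.exp(-phi * d)

def matern_three_halves(d, phi, s2=1.0):
    return s2 * (1.0 + np.sqrt(3.0) * phi * d) * np.exp(-np.sqrt(3.0) * phi * d)

def matern_five_halves(d, phi, s2=1.0):
    return s2 * (1.0 + np.sqrt(5.0) * phi * d + 5.0 / 3.0 * (phi * d) ** 2) * np.exp(-np.sqrt(5.0) * phi * d)

d = np.linspace(0.0, 3.0, 7)
print(np.allclose(matern(d, nu=1.5, phi=2.0), matern_three_halves(d, 2.0)))   # True
```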

If you draw a function from a Gaussian process with a Matern covariance kernel, the function you draw will be \(\lfloor\nu\rfloor\) times differentiable. By choosing \(\nu\) appropriately, we can control the smoothness of the functions we draw from the Gaussian process prior. For example,

  • If \(\nu=\frac{1}{2}\), we will draw functions which are nowhere differentiable;

  • If \(\nu=\frac{3}{2}\), we will draw functions which are once differentiable;

  • If \(\nu\to\infty\), we will draw functions which are infinitely differentiable (smooth functions);

This result implies that if we use a GP prior with a Matern covariance kernel and \(\nu=\frac{1}{2}\), then the posterior will place probability 1 on functions that are nowhere differentiable.
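To visualize this effect (an illustrative sketch, not part of the lecture), one can draw functions from a zero-mean GP prior on a grid by building the Matern covariance matrix and sampling from the corresponding multivariate normal; draws with \(\nu=\frac{1}{2}\) look rough, while draws with larger \(\nu\) look smoother. The sketch below assumes the matern function from the earlier code block.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 200)                 # grid where we evaluate f
D = np.abs(x[:, None] - x[None, :])            # pairwise distances |x_i - x_j|

def gp_prior_draws(nu, phi=10.0, n_draws=3, jitter=1e-8):
    # Finite-dimensional distribution of the GP: multivariate normal with
    # covariance matrix C[i, j] = C_nu(|x_i - x_j|, phi)
    C = matern(D.ravel(), nu=nu, phi=phi).reshape(D.shape)
    L = np.linalg.cholesky(C + jitter * np.eye(len(x)))   # jitter for numerical stability
    return L @ rng.standard_normal((len(x), n_draws))     # each column is one draw of f on the grid

rough = gp_prior_draws(nu=0.5)     # paths that look nowhere differentiable
smooth = gp_prior_draws(nu=2.5)    # visibly smoother paths
```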

Sometimes people also use powered exponential covariance kernels.

Definition 14.5 (Powered exponential covariance kernel) The powered exponential covariance kernel is defined as \[\begin{equation} C(d,\phi,\sigma^2,\alpha)=\sigma^2\exp(-\phi d^{\alpha}) \tag{14.16} \end{equation}\] where \(\alpha\) is called the power of the powered exponential kernel. If \(\alpha=1\), the covariance function is called the exponential kernel, while if \(\alpha=2\), it is called the Gaussian kernel.
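A one-line sketch of (14.16) for completeness; \(\alpha=1\) and \(\alpha=2\) recover the exponential and Gaussian kernels above:

```python
import numpy as np

def powered_exponential(d, phi, sigma2=1.0, alpha=1.0):
    # Powered exponential covariance (14.16); alpha = 1 gives the exponential
    # kernel, alpha = 2 the Gaussian kernel
    return sigma2 * np.exp(-phi * np.asarray(d, dtype=float) ** alpha)
```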

Suppose we want to fit \(y=f(x)+\epsilon\), and we have data \((y_i,x_i)\) for \(i=1,\cdots,n\). How do we fit the model? How do we find \(p(f|y_1,x_1,\cdots,y_n,x_n)\)? We will discuss these questions later.

References

Stein, Michael. 1999. Interpolation of Spatial Data: Some Theory of Kriging. New York, NY: Springer.