Chapter 15 Bayesian Inference for Gaussian Process(Lecture on 02/23/2021)

When the data $$(y,\mathbf{x})$$ has a highly nonlinear relationship, the way to perform a regression model is to assume $$y=f(\mathbf{x})+\epsilon$$, where $$f$$ is an unknown function. We use GP as the prior for $$f$$. Assume $$f\sim GP(\mu,C_{\nu}(\cdot,\cdot,\boldsymbol{\theta}))$$, it means that $$$\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix} \sim N(\mu\mathbf{1},H) \tag{15.1}$$$ where $$H=(H_{ij})_{i,j=1}^n$$ and $$H_{ij}=C_{\nu}(\mathbf{x}_i,\mathbf{x}_j;\boldsymbol{\theta})$$. A commonly used covariance kernel is the Matern covariance kernel, defined as $$$C_{\nu}(d,\phi,\sigma^2)=\sigma^2\frac{2^{1-\nu}}{\Gamma(\nu)}(\sqrt{2\nu}\phi d)^{\nu}K_{\nu}(\sqrt{2\nu}\phi d) \tag{15.2}$$$ where $$d=||\mathbf{x}_i-\mathbf{x}_j||$$ is the Euclidean distance between $$\mathbf{x}_i$$ and $$\mathbf{x}_j$$.

Now, the regression model can be written as $$$\begin{split} &y=f(\mathbf{x})+\epsilon\\ &\epsilon\stackrel{i.i.d.}{\sim}N(0,\tau^2),\quad f\sim GP(\mu,C_{\nu}) \end{split} \tag{15.3}$$$

As a bayesian statistician, we want to find the posterior distribution of $$f$$, that is $$f|y_1,\cdots,y_n$$.

Let us have the data $$(y_1,\mathbf{x}_1),\cdots,(y_n,\mathbf{x}_n)$$, then $$$\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix}^T \sim MVN(\mu\mathbf{1},\Sigma) \tag{15.4}$$$ where $$\Sigma=\sigma^2\mathbf{H}(\phi)$$. For simplicity, assume we work with the exponential covariance function $$$C_{\nu}(\mathbf{x}_i,\mathbf{x}_j,\phi,\sigma^2)=\sigma^2\exp(-\phi||\mathbf{x}_i-\mathbf{x}_j||) \tag{15.5}$$$ then $$\mathbf{H}(\phi)$$ is an $$n\times n$$ matrix with the $$(i,j)$$th entry of $$H(\phi)$$ is given by $$exp(-\phi||\mathbf{x}_i-\mathbf{x}_j||)$$.

Let $$\mu\sim P(\mu)=N(\mu|a_{\mu},b_{\mu})$$, $$\sigma^2\sim P_{\sigma}(\sigma^2)=IG(\sigma^2|a_{\sigma},b_{\sigma})$$, $$\tau^2\sim P_{\tau^2}(\tau^2)=IG(\tau^2|a_{\tau},b_{\tau})$$ and $$\phi\sim P_{\phi}(\phi)=Unif(a_{\phi},b_{\phi})$$ be the prior distribution for the model parameters. Utlimately it turns out that the model can be written hierarchically as: $$$\begin{split} &\mathbf{y}\sim N(\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix}^T,\tau^2I)\\ &\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix}^T \sim MVN(\mu\mathbf{1},\Sigma)\\ &\mu\sim N(\mu|a_{\mu},b_{\mu}),\quad \sigma^2\sim IG(\sigma^2|a_{\sigma},b_{\sigma})\\ &\tau^2\sim IG(\tau^2|a_{\tau},b_{\tau}),\quad \phi\sim Unif(a_{\phi},b_{\phi}) \end{split} \tag{15.6}$$$

Just for notational simplicity, denote $$\boldsymbol{\theta}=\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix}^T$$, our job is to estimate the posterior distribution of $$\mu,\tau^2,\sigma^2,\phi$$ and $$\boldsymbol{\theta}$$.

1. There is some issues with directly build a sampler for $$\mu,\tau^2,\sigma^2,\phi$$ and $$\boldsymbol{\theta}$$ together. It is hard for the chain to converge. Therefore, in practice, we usually marginalize out $$\boldsymbol{\theta}$$ at first.

2. $$\sigma^2$$ and $$\phi$$ can not be jointly estimated consistently. One solution is to estimate $$\sigma^2$$ without any constrains, while put a uniform prior on a small range for $$\phi$$. Typically the range is set as $$[\frac{3}{\max(d)},\frac{3}{\min(d)}]$$, where $$d=\{|x_i-x_j|:|x_i-x_j|\neq 0\}$$.

Since $$\mathbf{y}\sim N(\boldsymbol{\theta},\tau^2\mathbf{I})$$ and $$\boldsymbol{\theta}\sim N(\mu\mathbf{I},\Sigma)$$, marginally $$\mathbf{y}\sim N(\mu\mathbf{1},\Sigma+\tau^2\mathbf{I})$$. Mixing will be improved when $$\boldsymbol{\theta}$$ is integrated out. The full posterior for $$\mu,\tau^2,\sigma^2,\phi$$ is therefore $$$\begin{split} p(\mu,\tau^2,\sigma^2,\phi|\mathbf{y})&\propto N(\mu\mathbf{1},\tau^2I+\sigma^2H(\phi))IG(\sigma^2|a_{\sigma},b_{\sigma})IG(\sigma^2|a_{\tau},b_{\tau})\\ &\times Unif(a_{\phi},b_{\phi})N(\mu|a_{\mu},b_{\mu}) \end{split} \tag{15.7}$$$

Now we consider the full conditionals for each parameter. If it has closed form, then it can be sampled using Gibbs sampler, otherwise it will be sampled using Metropolis-Hasting sampler.

Firstly, $$$\begin{split} p(\mu|\text{rest},\mathbf{y})&\propto N(\mathbf{y}|\mu\mathbf{1},\tau^2I+\sigma^2H(\phi)) \times N(\mu|a_{\mu},b_{\mu})\\ &\propto \exp\{-\frac{(\mathbf{y}-\mu\mathbf{1})^T(\tau^2\mathbf{I}+\sigma^2H(\phi))^{-1}(\mathbf{y}-\mu\mathbf{1})}{2}\}\\ &\times \exp\{-\frac{(\mu-a_{\mu})^2}{2b_{\mu}}\}\\ &\propto \exp\{-\frac{1}{2}[\mu^2(\mathbf{1}^T(\sigma^2H(\phi)+\tau^2\mathbf{I})^{-1}\mathbf{1}+\frac{1}{b_{\mu}})\\ &-2\mu(\mathbf{1}^T(\sigma^2H(\phi)+\tau^2\mathbf{I})^{-1}\mathbf{y}+\frac{a_{\mu}}{b_{\mu}})]\} \end{split} \tag{15.8}$$$ Therefore, by complete the squares, we have $$\mu|\text{rest},\mathbf{y}\sim N(a_{\mu|\cdot},b_{\mu|\cdot})$$, where $$$\begin{split} &b_{\mu|\cdot}=\frac{1}{\mathbf{1}^T(\sigma^2H(\phi)+\tau^2\mathbf{I})^{-1}\mathbf{1}+\frac{1}{b_{\mu}}}\\ &a_{\mu|\cdot}=b_{\mu|\cdot}[\mathbf{1}^T(\sigma^2H(\phi)+\tau^2\mathbf{I})^{-1}\mathbf{y}+\frac{a_{\mu}}{b_{\mu}}] \end{split} \tag{15.9}$$$ Therefore, we can sample $$\mu$$ from its full conditional distribution.

In addition, $$$p(\phi,\tau^2,\sigma^2|\mathbf{y},\mu)\propto N(\mathbf{y}|\mu\mathbf{1},\tau^2\mathbf{I}+\sigma^2H(\phi))IG(\sigma^2|a_{\sigma},b_{\sigma})IG(\sigma^2|a_{\tau},b_{\tau})Unif(a_{\phi},b_{\phi}) \tag{15.10}$$$ $$\phi,\tau^2,\sigma^2$$ do not have any closed form full conditional. They need to be updated using Metropolis-Hastings.

Should we update $$\phi,\tau^2,\sigma^2$$ all together, or should we update them one at a time? The answer is case by case. Updating $$\sigma^2,\tau^2$$ together and updating $$\phi$$ separately using M-H tends to give better mixing.

Once posterior samples of $$\mu,\phi,\tau^2,\sigma^2$$ are obtained, the next step is to get the posterior samples of $$\boldsymbol{\theta}$$. Note that $$$\begin{split} p(\boldsymbol{\theta}|\text{rest},\mathbf{y})&\propto N(\mathbf{y}|\boldsymbol{\theta},\tau^2\mathbf{I})N(\boldsymbol{\theta}|\mu\mathbf{1},\sigma^2H(\phi))\\ &\propto \exp\{-\frac{(\mathbf{y}-\boldsymbol{\theta})^T(\mathbf{y}-\boldsymbol{\theta})}{2\tau^2}\}\\ &\times \exp\{-\frac{(\boldsymbol{\theta}-\mu\mathbf{1})^T(\sigma^2H(\phi))^{-1}(\boldsymbol{\theta}-\mu\mathbf{1})}{2}\}\\ &\propto \exp\{-\frac{1}{2}[\boldsymbol{\theta}^T(\frac{\mathbf{I}}{\tau^2}+\frac{H(\phi)^{-1}}{\sigma^2})\boldsymbol{\theta}-2\boldsymbol{\theta}^T(\frac{\mathbf{y}}{\tau^2}+\frac{H(\phi)^{-1}\mathbf{1}\mu}{\sigma^2})]\} \end{split} \tag{15.11}$$$ After completing the squares, we get $$\boldsymbol{\theta}|\text{rest},\mathbf{y}\sim N(\mu_{\boldsymbol{\theta}|\cdot}(\phi,\mu,\sigma^2,\tau^2),\Sigma_{\boldsymbol{\theta}|\cdot}(\phi,\mu,\sigma^2,\tau^2))$$ where $$$\begin{split} &\Sigma_{\boldsymbol{\theta}|\cdot}(\phi,\mu,\sigma^2,\tau^2)=(\frac{\mathbf{I}}{\tau^2}+\frac{H(\phi)^{-1}}{\sigma^2})^{-1}\\ &\mu_{\boldsymbol{\theta}|\cdot}(\phi,\mu,\sigma^2,\tau^2)=\Sigma_{\boldsymbol{\theta}|\cdot}(\frac{\mathbf{y}}{\tau^2}+\frac{H(\phi)^{-1}\mathbf{1}\mu}{\sigma^2}) \end{split} \tag{15.12}$$$ For the $$l$$th post burn-in samples of parameters, denoted as $$(\phi_l,\mu_l,\sigma_l^2,\tau_l^2)$$, we can obtain a sample of $$\boldsymbol{\theta}$$ by sample from $$N(\mu_{\boldsymbol{\theta}|\cdot}(\phi_l,\mu_l,\sigma_l^2,\tau_l^2),\Sigma_{\boldsymbol{\theta}|\cdot}(\phi_l,\mu_l,\sigma_l^2,\tau_l^2))$$