# Chapter 15 Bayesian Inference for Gaussian Processes (Lecture on 02/23/2021)

When the response $$y$$ and the covariates $$\mathbf{x}$$ have a highly nonlinear relationship, one way to build a regression model is to assume $$y=f(\mathbf{x})+\epsilon$$, where $$f$$ is an unknown function. We use a GP as the prior for $$f$$. Assuming $$f\sim GP(\mu,C_{\nu}(\cdot,\cdot;\boldsymbol{\theta}))$$ means that $$$\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix}^T \sim N(\mu\mathbf{1},H) \tag{15.1}$$$ where $$H=(H_{ij})_{i,j=1}^n$$ and $$H_{ij}=C_{\nu}(\mathbf{x}_i,\mathbf{x}_j;\boldsymbol{\theta})$$. A commonly used covariance kernel is the Matern covariance kernel, defined as $$$C_{\nu}(d;\phi,\sigma^2)=\sigma^2\frac{2^{1-\nu}}{\Gamma(\nu)}(\sqrt{2\nu}\phi d)^{\nu}K_{\nu}(\sqrt{2\nu}\phi d) \tag{15.2}$$$ where $$d=||\mathbf{x}_i-\mathbf{x}_j||$$ is the Euclidean distance between $$\mathbf{x}_i$$ and $$\mathbf{x}_j$$, and $$K_{\nu}$$ is the modified Bessel function of the second kind.
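As an illustration, the Matern kernel (15.2) can be evaluated with SciPy's modified Bessel function. This is a minimal sketch; the helper name `matern_cov` is ours, not from the lecture, and the parameterization follows (15.2) with $$\phi$$ as an inverse range:

```python
import numpy as np
from scipy.special import gamma, kv  # kv is the modified Bessel function K_nu

def matern_cov(d, nu=1.5, phi=1.0, sigma2=1.0):
    """Matern covariance C_nu(d; phi, sigma2) as in (15.2).

    d : scalar or array of Euclidean distances (nonnegative).
    """
    d = np.asarray(d, dtype=float)
    scaled = np.sqrt(2.0 * nu) * phi * d
    c = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * scaled**nu * kv(nu, scaled)
    # K_nu diverges at d = 0, but the limit of the product is sigma^2.
    return np.where(d == 0.0, sigma2, c)
```

For $$\nu=1/2$$ this reduces to the exponential kernel $$\sigma^2 e^{-\phi d}$$ used later, which gives a quick sanity check.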

Now, the regression model can be written as $$$\begin{split} &y=f(\mathbf{x})+\epsilon\\ &\epsilon\stackrel{i.i.d.}{\sim}N(0,\tau^2),\quad f\sim GP(\mu,C_{\nu}) \end{split} \tag{15.3}$$$

As Bayesian statisticians, we want to find the posterior distribution of $$f$$, that is, the distribution of $$f|y_1,\cdots,y_n$$.

Given data $$(y_1,\mathbf{x}_1),\cdots,(y_n,\mathbf{x}_n)$$, we have $$$\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix}^T \sim MVN(\mu\mathbf{1},\Sigma) \tag{15.4}$$$ where $$\Sigma=\sigma^2\mathbf{H}(\phi)$$. For simplicity, assume we work with the exponential covariance function $$$C(\mathbf{x}_i,\mathbf{x}_j;\phi,\sigma^2)=\sigma^2\exp(-\phi||\mathbf{x}_i-\mathbf{x}_j||) \tag{15.5}$$$ so that $$\mathbf{H}(\phi)$$ is the $$n\times n$$ matrix whose $$(i,j)$$th entry is $$\exp(-\phi||\mathbf{x}_i-\mathbf{x}_j||)$$.
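The correlation matrix $$\mathbf{H}(\phi)$$ in (15.5) is straightforward to build from the design points; a small sketch (the function name `exp_cov_matrix` is our own):

```python
import numpy as np
from scipy.spatial.distance import cdist

def exp_cov_matrix(X, phi):
    """n x n matrix H(phi) with (i,j)th entry exp(-phi * ||x_i - x_j||)."""
    D = cdist(X, X)          # pairwise Euclidean distances
    return np.exp(-phi * D)

rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 2))         # 5 arbitrary points in R^2
H = exp_cov_matrix(X, phi=2.0)
```

By construction $$\mathbf{H}(\phi)$$ is symmetric with unit diagonal, so $$\Sigma=\sigma^2\mathbf{H}(\phi)$$ has marginal variances $$\sigma^2$$.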

Let $$\mu\sim P(\mu)=N(\mu|a_{\mu},b_{\mu})$$, $$\sigma^2\sim P_{\sigma}(\sigma^2)=IG(\sigma^2|a_{\sigma},b_{\sigma})$$, $$\tau^2\sim P_{\tau^2}(\tau^2)=IG(\tau^2|a_{\tau},b_{\tau})$$ and $$\phi\sim P_{\phi}(\phi)=Unif(a_{\phi},b_{\phi})$$ be the prior distributions for the model parameters. Ultimately, the model can be written hierarchically as: $$$\begin{split} &\mathbf{y}\sim N(\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix}^T,\tau^2I)\\ &\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix}^T \sim MVN(\mu\mathbf{1},\Sigma)\\ &\mu\sim N(\mu|a_{\mu},b_{\mu}),\quad \sigma^2\sim IG(\sigma^2|a_{\sigma},b_{\sigma})\\ &\tau^2\sim IG(\tau^2|a_{\tau},b_{\tau}),\quad \phi\sim Unif(a_{\phi},b_{\phi}) \end{split} \tag{15.6}$$$
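To make the hierarchy (15.6) concrete, we can forward-simulate a synthetic data set from it with fixed hyperparameter values. The specific values below are arbitrary choices for illustration, not from the lecture:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(42)
n = 50
X = rng.uniform(0.0, 1.0, size=(n, 1))      # design points on [0, 1]

# Fixed "true" parameter values, chosen arbitrarily for this demo.
mu, sigma2, tau2, phi = 1.0, 2.0, 0.1, 3.0

H = np.exp(-phi * cdist(X, X))              # correlation matrix H(phi)
Sigma = sigma2 * H                          # covariance of the GP draw
theta = rng.multivariate_normal(mu * np.ones(n), Sigma)  # (f(x_1),...,f(x_n))
y = theta + rng.normal(0.0, np.sqrt(tau2), size=n)       # add i.i.d. noise
```

Simulated data like this is useful for checking that a sampler recovers the generating values before applying it to real data.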

For notational simplicity, denote $$\boldsymbol{\theta}=\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix}^T$$; our job is to estimate the posterior distribution of $$\mu,\tau^2,\sigma^2,\phi$$ and $$\boldsymbol{\theta}$$.

1. There are issues with directly building a sampler for $$\mu,\tau^2,\sigma^2,\phi$$ and $$\boldsymbol{\theta}$$ jointly: it is hard for the chain to converge. Therefore, in practice, we usually marginalize out $$\boldsymbol{\theta}$$ first.

2. $$\sigma^2$$ and $$\phi$$ cannot be jointly estimated consistently. One solution is to estimate $$\sigma^2$$ without any constraints, while putting a uniform prior on a small range for $$\phi$$. Typically the range is set as $$[\frac{3}{\max(d)},\frac{3}{\min(d)}]$$, where $$d=\{||\mathbf{x}_i-\mathbf{x}_j||:||\mathbf{x}_i-\mathbf{x}_j||\neq 0\}$$ is the set of nonzero pairwise distances.
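The prior range for $$\phi$$ in point 2 is easy to compute from the design; a small sketch (the helper name `phi_prior_range` is ours):

```python
import numpy as np
from scipy.spatial.distance import pdist

def phi_prior_range(X):
    """Return (3/max(d), 3/min(d)) over nonzero pairwise distances."""
    d = pdist(X)            # all pairwise Euclidean distances
    d = d[d > 0]            # drop exact duplicates
    return 3.0 / d.max(), 3.0 / d.min()

X = np.array([[0.0], [0.5], [2.0]])
a_phi, b_phi = phi_prior_range(X)   # here (3/2, 3/0.5) = (1.5, 6.0)
```

With the exponential kernel, $$e^{-\phi d}\approx 0.05$$ when $$\phi d=3$$, so this range corresponds to effective correlation ranges between the smallest and largest observed distances.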

Since $$\mathbf{y}\sim N(\boldsymbol{\theta},\tau^2\mathbf{I})$$ and $$\boldsymbol{\theta}\sim N(\mu\mathbf{1},\Sigma)$$, marginally $$\mathbf{y}\sim N(\mu\mathbf{1},\Sigma+\tau^2\mathbf{I})$$. Mixing improves when $$\boldsymbol{\theta}$$ is integrated out. The full posterior for $$\mu,\tau^2,\sigma^2,\phi$$ is therefore $$$\begin{split} p(\mu,\tau^2,\sigma^2,\phi|\mathbf{y})&\propto N(\mathbf{y}|\mu\mathbf{1},\tau^2\mathbf{I}+\sigma^2H(\phi))IG(\sigma^2|a_{\sigma},b_{\sigma})IG(\tau^2|a_{\tau},b_{\tau})\\ &\times Unif(\phi|a_{\phi},b_{\phi})N(\mu|a_{\mu},b_{\mu}) \end{split} \tag{15.7}$$$
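The collapsed posterior (15.7) is what a sampler needs to evaluate, up to a constant, on the log scale. A minimal sketch, where the function name `log_post` and the hyperparameter dictionary keys are our own conventions:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import multivariate_normal, invgamma, norm

def log_post(mu, tau2, sigma2, phi, y, D, hyp):
    """Unnormalized log posterior (15.7) with theta integrated out.

    D   : pairwise distance matrix of the design points.
    hyp : dict of prior hyperparameters (keys are our convention).
    """
    n = len(y)
    if not (hyp["a_phi"] < phi < hyp["b_phi"]) or tau2 <= 0 or sigma2 <= 0:
        return -np.inf                      # outside the prior support
    K = sigma2 * np.exp(-phi * D) + tau2 * np.eye(n)
    ll = multivariate_normal.logpdf(y, mean=mu * np.ones(n), cov=K)
    lp = (invgamma.logpdf(sigma2, hyp["a_s"], scale=hyp["b_s"])
          + invgamma.logpdf(tau2, hyp["a_t"], scale=hyp["b_t"])
          + norm.logpdf(mu, hyp["a_mu"], np.sqrt(hyp["b_mu"])))
    return ll + lp                          # flat Unif(phi) adds a constant

# Tiny demo with arbitrary hyperparameter values.
rng = np.random.default_rng(1)
X = rng.uniform(size=(4, 1))
D = cdist(X, X)
y = rng.normal(size=4)
hyp = dict(a_phi=0.5, b_phi=10.0, a_s=2.0, b_s=1.0,
           a_t=2.0, b_t=1.0, a_mu=0.0, b_mu=10.0)
val = log_post(0.0, 0.5, 1.0, 2.0, y, D, hyp)
```

Working on the log scale avoids underflow in the Metropolis-Hastings acceptance ratio later.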

Now we consider the full conditional for each parameter. If it has a closed form, the parameter can be sampled with a Gibbs step; otherwise it is sampled using a Metropolis-Hastings step.

First, $$$\begin{split} p(\mu|\text{rest},\mathbf{y})&\propto N(\mathbf{y}|\mu\mathbf{1},\tau^2\mathbf{I}+\sigma^2H(\phi)) \times N(\mu|a_{\mu},b_{\mu})\\ &\propto \exp\{-\frac{(\mathbf{y}-\mu\mathbf{1})^T(\tau^2\mathbf{I}+\sigma^2H(\phi))^{-1}(\mathbf{y}-\mu\mathbf{1})}{2}\}\\ &\times \exp\{-\frac{(\mu-a_{\mu})^2}{2b_{\mu}}\}\\ &\propto \exp\{-\frac{1}{2}[\mu^2(\mathbf{1}^T(\sigma^2H(\phi)+\tau^2\mathbf{I})^{-1}\mathbf{1}+\frac{1}{b_{\mu}})\\ &-2\mu(\mathbf{1}^T(\sigma^2H(\phi)+\tau^2\mathbf{I})^{-1}\mathbf{y}+\frac{a_{\mu}}{b_{\mu}})]\} \end{split} \tag{15.8}$$$ Therefore, by completing the square, we have $$\mu|\text{rest},\mathbf{y}\sim N(a_{\mu|\cdot},b_{\mu|\cdot})$$, where $$$\begin{split} &b_{\mu|\cdot}=\frac{1}{\mathbf{1}^T(\sigma^2H(\phi)+\tau^2\mathbf{I})^{-1}\mathbf{1}+\frac{1}{b_{\mu}}}\\ &a_{\mu|\cdot}=b_{\mu|\cdot}[\mathbf{1}^T(\sigma^2H(\phi)+\tau^2\mathbf{I})^{-1}\mathbf{y}+\frac{a_{\mu}}{b_{\mu}}] \end{split} \tag{15.9}$$$ so $$\mu$$ can be sampled directly from its full conditional distribution.
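The Gibbs update (15.9) translates directly into code. A minimal sketch (the helper name `sample_mu` is ours); using a linear solve instead of an explicit matrix inverse is the standard numerically stable choice:

```python
import numpy as np

def sample_mu(y, K, a_mu, b_mu, rng):
    """Draw mu from its full conditional N(a_{mu|.}, b_{mu|.}) in (15.9).

    K = sigma^2 H(phi) + tau^2 I is the marginal covariance of y;
    a_mu, b_mu are the prior mean and variance of mu.
    """
    one = np.ones(len(y))
    Kinv_one = np.linalg.solve(K, one)                  # K^{-1} 1
    b_post = 1.0 / (one @ Kinv_one + 1.0 / b_mu)
    a_post = b_post * (Kinv_one @ y + a_mu / b_mu)      # 1^T K^{-1} y term
    return rng.normal(a_post, np.sqrt(b_post))

# Sanity check: with K = I and a nearly flat prior, the full
# conditional is approximately N(ybar, 1/n).
rng = np.random.default_rng(0)
y = np.array([1.0, 2.0, 3.0])
m = sample_mu(y, np.eye(3), 0.0, 1e12, rng)
```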

In addition, $$$p(\phi,\tau^2,\sigma^2|\mathbf{y},\mu)\propto N(\mathbf{y}|\mu\mathbf{1},\tau^2\mathbf{I}+\sigma^2H(\phi))IG(\sigma^2|a_{\sigma},b_{\sigma})IG(\tau^2|a_{\tau},b_{\tau})Unif(\phi|a_{\phi},b_{\phi}) \tag{15.10}$$$ None of $$\phi,\tau^2,\sigma^2$$ has a closed-form full conditional; they must be updated using Metropolis-Hastings.

Should we update $$\phi,\tau^2,\sigma^2$$ all together, or one at a time? The answer depends on the problem. In practice, updating $$\sigma^2,\tau^2$$ together and updating $$\phi$$ separately with M-H tends to give better mixing.
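A single random-walk Metropolis step for $$\phi$$ can be sketched as follows; `mh_step_phi` and the generic `log_target` callable (the log full conditional up to a constant) are our own abstractions, and rejecting proposals outside $$(a_{\phi},b_{\phi})$$ enforces the uniform prior support:

```python
import numpy as np

def mh_step_phi(phi, log_target, a_phi, b_phi, step, rng):
    """One random-walk Metropolis step for phi with a Unif(a_phi, b_phi) prior.

    log_target(phi) returns the log full conditional up to a constant.
    Returns (new phi, whether the proposal was accepted).
    """
    prop = phi + step * rng.normal()        # symmetric Gaussian proposal
    if not (a_phi < prop < b_phi):          # outside the prior support
        return phi, False
    log_alpha = log_target(prop) - log_target(phi)
    if np.log(rng.uniform()) < log_alpha:   # accept with prob min(1, alpha)
        return prop, True
    return phi, False
```

The same pattern works for a joint update of $$(\sigma^2,\tau^2)$$, typically with a random walk on $$(\log\sigma^2,\log\tau^2)$$ plus the Jacobian term; the step size is tuned so the acceptance rate lands in a reasonable range.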

Once posterior samples of $$\mu,\phi,\tau^2,\sigma^2$$ are obtained, the next step is to get the posterior samples of $$\boldsymbol{\theta}$$. Note that $$$\begin{split} p(\boldsymbol{\theta}|\text{rest},\mathbf{y})&\propto N(\mathbf{y}|\boldsymbol{\theta},\tau^2\mathbf{I})N(\boldsymbol{\theta}|\mu\mathbf{1},\sigma^2H(\phi))\\ &\propto \exp\{-\frac{(\mathbf{y}-\boldsymbol{\theta})^T(\mathbf{y}-\boldsymbol{\theta})}{2\tau^2}\}\\ &\times \exp\{-\frac{(\boldsymbol{\theta}-\mu\mathbf{1})^T(\sigma^2H(\phi))^{-1}(\boldsymbol{\theta}-\mu\mathbf{1})}{2}\}\\ &\propto \exp\{-\frac{1}{2}[\boldsymbol{\theta}^T(\frac{\mathbf{I}}{\tau^2}+\frac{H(\phi)^{-1}}{\sigma^2})\boldsymbol{\theta}-2\boldsymbol{\theta}^T(\frac{\mathbf{y}}{\tau^2}+\frac{H(\phi)^{-1}\mathbf{1}\mu}{\sigma^2})]\} \end{split} \tag{15.11}$$$ After completing the square, we get $$\boldsymbol{\theta}|\text{rest},\mathbf{y}\sim N(\mu_{\boldsymbol{\theta}|\cdot}(\phi,\mu,\sigma^2,\tau^2),\Sigma_{\boldsymbol{\theta}|\cdot}(\phi,\mu,\sigma^2,\tau^2))$$ where $$$\begin{split} &\Sigma_{\boldsymbol{\theta}|\cdot}(\phi,\mu,\sigma^2,\tau^2)=(\frac{\mathbf{I}}{\tau^2}+\frac{H(\phi)^{-1}}{\sigma^2})^{-1}\\ &\mu_{\boldsymbol{\theta}|\cdot}(\phi,\mu,\sigma^2,\tau^2)=\Sigma_{\boldsymbol{\theta}|\cdot}(\frac{\mathbf{y}}{\tau^2}+\frac{H(\phi)^{-1}\mathbf{1}\mu}{\sigma^2}) \end{split} \tag{15.12}$$$ For the $$l$$th post-burn-in sample of the parameters, denoted $$(\phi_l,\mu_l,\sigma_l^2,\tau_l^2)$$, we obtain a sample of $$\boldsymbol{\theta}$$ by sampling from $$N(\mu_{\boldsymbol{\theta}|\cdot}(\phi_l,\mu_l,\sigma_l^2,\tau_l^2),\Sigma_{\boldsymbol{\theta}|\cdot}(\phi_l,\mu_l,\sigma_l^2,\tau_l^2))$$.
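The composition step (15.12) can be sketched as follows; the helper name `sample_theta` is ours, and for large $$n$$ one would use Cholesky factorizations rather than the explicit inverses shown here:

```python
import numpy as np

def sample_theta(y, mu, sigma2, tau2, H, rng):
    """Draw theta from the Gaussian full conditional (15.11)-(15.12)."""
    n = len(y)
    Hinv = np.linalg.inv(H)
    prec = np.eye(n) / tau2 + Hinv / sigma2             # posterior precision
    Sigma_post = np.linalg.inv(prec)                    # Sigma_{theta|.}
    mean_post = Sigma_post @ (y / tau2 + Hinv @ (mu * np.ones(n)) / sigma2)
    return rng.multivariate_normal(mean_post, Sigma_post)

# Sanity check: with a tiny noise variance, theta should track y closely.
rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0])
H = np.exp(-np.abs(np.subtract.outer(x, x)))   # exponential correlation, phi = 1
y = np.array([0.5, -1.0, 2.0])
theta = sample_theta(y, 0.0, 1.0, 1e-8, H, rng)
```

Calling this once per retained draw $$(\phi_l,\mu_l,\sigma_l^2,\tau_l^2)$$ yields the posterior samples of $$\boldsymbol{\theta}$$.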