# Chapter 15 Bayesian Inference for Gaussian Processes (Lecture on 02/23/2021)

When the response $$y$$ and the covariates $$\mathbf{x}$$ have a highly nonlinear relationship, one way to build a regression model is to assume $$y=f(\mathbf{x})+\epsilon$$, where $$f$$ is an unknown function. We use a GP as the prior for $$f$$. Assuming $$f\sim GP(\mu,C_{\nu}(\cdot,\cdot;\boldsymbol{\theta}))$$ means that $$$\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix}^T \sim N(\mu\mathbf{1},H) \tag{15.1}$$$ where $$H=(H_{ij})_{i,j=1}^n$$ and $$H_{ij}=C_{\nu}(\mathbf{x}_i,\mathbf{x}_j;\boldsymbol{\theta})$$. A commonly used covariance kernel is the Matern covariance kernel, defined as $$$C_{\nu}(d;\phi,\sigma^2)=\sigma^2\frac{2^{1-\nu}}{\Gamma(\nu)}(\sqrt{2\nu}\phi d)^{\nu}K_{\nu}(\sqrt{2\nu}\phi d) \tag{15.2}$$$ where $$d=||\mathbf{x}_i-\mathbf{x}_j||$$ is the Euclidean distance between $$\mathbf{x}_i$$ and $$\mathbf{x}_j$$, and $$K_{\nu}$$ is the modified Bessel function of the second kind.
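As an illustration, the Matern kernel (15.2) can be evaluated with SciPy's modified Bessel function. This is a minimal sketch; the helper name `matern_cov` is ours, not from the lecture, and the parameterization follows (15.2) with $$\phi$$ as an inverse range:

```python
import numpy as np
from scipy.special import gamma, kv  # kv is the modified Bessel function K_nu

def matern_cov(d, nu=1.5, phi=1.0, sigma2=1.0):
    """Matern covariance C_nu(d; phi, sigma2) as in (15.2).

    d : scalar or array of Euclidean distances (nonnegative).
    """
    d = np.asarray(d, dtype=float)
    scaled = np.sqrt(2.0 * nu) * phi * d
    c = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * scaled**nu * kv(nu, scaled)
    # K_nu diverges at d = 0, but the limit of the product is sigma^2.
    return np.where(d == 0.0, sigma2, c)
```

For $$\nu=1/2$$ this reduces to the exponential kernel $$\sigma^2 e^{-\phi d}$$ used later, which gives a quick sanity check.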

Now, the regression model can be written as $$$\begin{split} &y=f(\mathbf{x})+\epsilon\\ &\epsilon\stackrel{i.i.d.}{\sim}N(0,\tau^2),\quad f\sim GP(\mu,C_{\nu}) \end{split} \tag{15.3}$$$

As Bayesian statisticians, we want to find the posterior distribution of $$f$$, that is, the distribution of $$f|y_1,\cdots,y_n$$.

Given data $$(y_1,\mathbf{x}_1),\cdots,(y_n,\mathbf{x}_n)$$, we have $$$\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix}^T \sim MVN(\mu\mathbf{1},\Sigma) \tag{15.4}$$$ where $$\Sigma=\sigma^2\mathbf{H}(\phi)$$. For simplicity, assume we work with the exponential covariance function $$$C(\mathbf{x}_i,\mathbf{x}_j;\phi,\sigma^2)=\sigma^2\exp(-\phi||\mathbf{x}_i-\mathbf{x}_j||) \tag{15.5}$$$ so that $$\mathbf{H}(\phi)$$ is the $$n\times n$$ matrix whose $$(i,j)$$th entry is $$\exp(-\phi||\mathbf{x}_i-\mathbf{x}_j||)$$.
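The correlation matrix $$\mathbf{H}(\phi)$$ in (15.5) is straightforward to build from the design points; a small sketch (the function name `exp_cov_matrix` is our own):

```python
import numpy as np
from scipy.spatial.distance import cdist

def exp_cov_matrix(X, phi):
    """n x n matrix H(phi) with (i,j)th entry exp(-phi * ||x_i - x_j||)."""
    D = cdist(X, X)          # pairwise Euclidean distances
    return np.exp(-phi * D)

rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 2))         # 5 arbitrary points in R^2
H = exp_cov_matrix(X, phi=2.0)
```

By construction $$\mathbf{H}(\phi)$$ is symmetric with unit diagonal, so $$\Sigma=\sigma^2\mathbf{H}(\phi)$$ has marginal variances $$\sigma^2$$.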

Let $$\mu\sim P(\mu)=N(\mu|a_{\mu},b_{\mu})$$, $$\sigma^2\sim P_{\sigma}(\sigma^2)=IG(\sigma^2|a_{\sigma},b_{\sigma})$$, $$\tau^2\sim P_{\tau^2}(\tau^2)=IG(\tau^2|a_{\tau},b_{\tau})$$ and $$\phi\sim P_{\phi}(\phi)=Unif(a_{\phi},b_{\phi})$$ be the prior distributions for the model parameters. Ultimately, the model can be written hierarchically as: $$$\begin{split} &\mathbf{y}\sim N(\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix}^T,\tau^2I)\\ &\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix}^T \sim MVN(\mu\mathbf{1},\Sigma)\\ &\mu\sim N(\mu|a_{\mu},b_{\mu}),\quad \sigma^2\sim IG(\sigma^2|a_{\sigma},b_{\sigma})\\ &\tau^2\sim IG(\tau^2|a_{\tau},b_{\tau}),\quad \phi\sim Unif(a_{\phi},b_{\phi}) \end{split} \tag{15.6}$$$
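To make the hierarchy (15.6) concrete, we can forward-simulate a synthetic data set from it with fixed hyperparameter values. The specific values below are arbitrary choices for illustration, not from the lecture:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(42)
n = 50
X = rng.uniform(0.0, 1.0, size=(n, 1))      # design points on [0, 1]

# Fixed "true" parameter values, chosen arbitrarily for this demo.
mu, sigma2, tau2, phi = 1.0, 2.0, 0.1, 3.0

H = np.exp(-phi * cdist(X, X))              # correlation matrix H(phi)
Sigma = sigma2 * H                          # covariance of the GP draw
theta = rng.multivariate_normal(mu * np.ones(n), Sigma)  # (f(x_1),...,f(x_n))
y = theta + rng.normal(0.0, np.sqrt(tau2), size=n)       # add i.i.d. noise
```

Simulated data like this is useful for checking that a sampler recovers the generating values before applying it to real data.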

For notational simplicity, denote $$\boldsymbol{\theta}=\begin{pmatrix} f(\mathbf{x}_1) & \cdots & f(\mathbf{x}_n)\end{pmatrix}^T$$; our job is to estimate the posterior distribution of $$\mu,\tau^2,\sigma^2,\phi$$ and $$\boldsymbol{\theta}$$.

1. There are issues with directly building a sampler for $$\mu,\tau^2,\sigma^2,\phi$$ and $$\boldsymbol{\theta}$$ jointly: it is hard for the chain to converge. Therefore, in practice, we usually marginalize out $$\boldsymbol{\theta}$$ first.

2. $$\sigma^2$$ and $$\phi$$ cannot be jointly estimated consistently. One solution is to estimate $$\sigma^2$$ without any constraints, while putting a uniform prior on a small range for $$\phi$$. Typically the range is set as $$[\frac{3}{\max(d)},\frac{3}{\min(d)}]$$, where $$d=\{||\mathbf{x}_i-\mathbf{x}_j||:||\mathbf{x}_i-\mathbf{x}_j||\neq 0\}$$ is the set of nonzero pairwise distances.
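The prior range for $$\phi$$ in point 2 is easy to compute from the design; a small sketch (the helper name `phi_prior_range` is ours):

```python
import numpy as np
from scipy.spatial.distance import pdist

def phi_prior_range(X):
    """Return (3/max(d), 3/min(d)) over nonzero pairwise distances."""
    d = pdist(X)            # all pairwise Euclidean distances
    d = d[d > 0]            # drop exact duplicates
    return 3.0 / d.max(), 3.0 / d.min()

X = np.array([[0.0], [0.5], [2.0]])
a_phi, b_phi = phi_prior_range(X)   # here (3/2, 3/0.5) = (1.5, 6.0)
```

With the exponential kernel, $$e^{-\phi d}\approx 0.05$$ when $$\phi d=3$$, so this range corresponds to effective correlation ranges between the smallest and largest observed distances.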

Since $$\mathbf{y}\sim N(\boldsymbol{\theta},\tau^2\mathbf{I})$$ and $$\boldsymbol{\theta}\sim N(\mu\mathbf{1},\Sigma)$$, marginally $$\mathbf{y}\sim N(\mu\mathbf{1},\Sigma+\tau^2\mathbf{I})$$. Mixing improves when $$\boldsymbol{\theta}$$ is integrated out. The full posterior for $$\mu,\tau^2,\sigma^2,\phi$$ is therefore $$$\begin{split} p(\mu,\tau^2,\sigma^2,\phi|\mathbf{y})&\propto N(\mathbf{y}|\mu\mathbf{1},\tau^2\mathbf{I}+\sigma^2H(\phi))IG(\sigma^2|a_{\sigma},b_{\sigma})IG(\tau^2|a_{\tau},b_{\tau})\\ &\times Unif(\phi|a_{\phi},b_{\phi})N(\mu|a_{\mu},b_{\mu}) \end{split} \tag{15.7}$$$
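The collapsed posterior (15.7) is what a sampler needs to evaluate, up to a constant, on the log scale. A minimal sketch, where the function name `log_post` and the hyperparameter dictionary keys are our own conventions:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import multivariate_normal, invgamma, norm

def log_post(mu, tau2, sigma2, phi, y, D, hyp):
    """Unnormalized log posterior (15.7) with theta integrated out.

    D   : pairwise distance matrix of the design points.
    hyp : dict of prior hyperparameters (keys are our convention).
    """
    n = len(y)
    if not (hyp["a_phi"] < phi < hyp["b_phi"]) or tau2 <= 0 or sigma2 <= 0:
        return -np.inf                      # outside the prior support
    K = sigma2 * np.exp(-phi * D) + tau2 * np.eye(n)
    ll = multivariate_normal.logpdf(y, mean=mu * np.ones(n), cov=K)
    lp = (invgamma.logpdf(sigma2, hyp["a_s"], scale=hyp["b_s"])
          + invgamma.logpdf(tau2, hyp["a_t"], scale=hyp["b_t"])
          + norm.logpdf(mu, hyp["a_mu"], np.sqrt(hyp["b_mu"])))
    return ll + lp                          # flat Unif(phi) adds a constant

# Tiny demo with arbitrary hyperparameter values.
rng = np.random.default_rng(1)
X = rng.uniform(size=(4, 1))
D = cdist(X, X)
y = rng.normal(size=4)
hyp = dict(a_phi=0.5, b_phi=10.0, a_s=2.0, b_s=1.0,
           a_t=2.0, b_t=1.0, a_mu=0.0, b_mu=10.0)
val = log_post(0.0, 0.5, 1.0, 2.0, y, D, hyp)
```

Working on the log scale avoids underflow in the Metropolis-Hastings acceptance ratio later.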

Now we consider the full conditional for each parameter. If it has a closed form, the parameter can be sampled with a Gibbs step; otherwise it is sampled using a Metropolis-Hastings step.

First, $$$\begin{split} p(\mu|\text{rest},\mathbf{y})&\propto N(\mathbf{y}|\mu\mathbf{1},\tau^2\mathbf{I}+\sigma^2H(\phi)) \times N(\mu|a_{\mu},b_{\mu})\\ &\propto \exp\{-\frac{(\mathbf{y}-\mu\mathbf{1})^T(\tau^2\mathbf{I}+\sigma^2H(\phi))^{-1}(\mathbf{y}-\mu\mathbf{1})}{2}\}\\ &\times \exp\{-\frac{(\mu-a_{\mu})^2}{2b_{\mu}}\}\\ &\propto \exp\{-\frac{1}{2}[\mu^2(\mathbf{1}^T(\sigma^2H(\phi)+\tau^2\mathbf{I})^{-1}\mathbf{1}+\frac{1}{b_{\mu}})\\ &-2\mu(\mathbf{1}^T(\sigma^2H(\phi)+\tau^2\mathbf{I})^{-1}\mathbf{y}+\frac{a_{\mu}}{b_{\mu}})]\} \end{split} \tag{15.8}$$$ Therefore, by completing the square, we have $$\mu|\text{rest},\mathbf{y}\sim N(a_{\mu|\cdot},b_{\mu|\cdot})$$, where $$$\begin{split} &b_{\mu|\cdot}=\frac{1}{\mathbf{1}^T(\sigma^2H(\phi)+\tau^2\mathbf{I})^{-1}\mathbf{1}+\frac{1}{b_{\mu}}}\\ &a_{\mu|\cdot}=b_{\mu|\cdot}[\mathbf{1}^T(\sigma^2H(\phi)+\tau^2\mathbf{I})^{-1}\mathbf{y}+\frac{a_{\mu}}{b_{\mu}}] \end{split} \tag{15.9}$$$ so $$\mu$$ can be sampled directly from its full conditional distribution.
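The Gibbs update (15.9) translates directly into code. A minimal sketch (the helper name `sample_mu` is ours); using a linear solve instead of an explicit matrix inverse is the standard numerically stable choice:

```python
import numpy as np

def sample_mu(y, K, a_mu, b_mu, rng):
    """Draw mu from its full conditional N(a_{mu|.}, b_{mu|.}) in (15.9).

    K = sigma^2 H(phi) + tau^2 I is the marginal covariance of y;
    a_mu, b_mu are the prior mean and variance of mu.
    """
    one = np.ones(len(y))
    Kinv_one = np.linalg.solve(K, one)                  # K^{-1} 1
    b_post = 1.0 / (one @ Kinv_one + 1.0 / b_mu)
    a_post = b_post * (Kinv_one @ y + a_mu / b_mu)      # 1^T K^{-1} y term
    return rng.normal(a_post, np.sqrt(b_post))

# Sanity check: with K = I and a nearly flat prior, the full
# conditional is approximately N(ybar, 1/n).
rng = np.random.default_rng(0)
y = np.array([1.0, 2.0, 3.0])
m = sample_mu(y, np.eye(3), 0.0, 1e12, rng)
```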

In addition, $$$p(\phi,\tau^2,\sigma^2|\mathbf{y},\mu)\propto N(\mathbf{y}|\mu\mathbf{1},\tau^2\mathbf{I}+\sigma^2H(\phi))IG(\sigma^2|a_{\sigma},b_{\sigma})IG(\tau^2|a_{\tau},b_{\tau})Unif(\phi|a_{\phi},b_{\phi}) \tag{15.10}$$$ None of $$\phi,\tau^2,\sigma^2$$ has a closed-form full conditional; they must be updated using Metropolis-Hastings.

Should we update $$\phi,\tau^2,\sigma^2$$ all together, or one at a time? The answer depends on the problem. In practice, updating $$\sigma^2,\tau^2$$ together and updating $$\phi$$ separately with M-H tends to give better mixing.
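A single random-walk Metropolis step for $$\phi$$ can be sketched as follows; `mh_step_phi` and the generic `log_target` callable (the log full conditional up to a constant) are our own abstractions, and rejecting proposals outside $$(a_{\phi},b_{\phi})$$ enforces the uniform prior support:

```python
import numpy as np

def mh_step_phi(phi, log_target, a_phi, b_phi, step, rng):
    """One random-walk Metropolis step for phi with a Unif(a_phi, b_phi) prior.

    log_target(phi) returns the log full conditional up to a constant.
    Returns (new phi, whether the proposal was accepted).
    """
    prop = phi + step * rng.normal()        # symmetric Gaussian proposal
    if not (a_phi < prop < b_phi):          # outside the prior support
        return phi, False
    log_alpha = log_target(prop) - log_target(phi)
    if np.log(rng.uniform()) < log_alpha:   # accept with prob min(1, alpha)
        return prop, True
    return phi, False
```

The same pattern works for a joint update of $$(\sigma^2,\tau^2)$$, typically with a random walk on $$(\log\sigma^2,\log\tau^2)$$ plus the Jacobian term; the step size is tuned so the acceptance rate lands in a reasonable range.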

Once posterior samples of $$\mu,\phi,\tau^2,\sigma^2$$ are obtained, the next step is to get the posterior samples of $$\boldsymbol{\theta}$$. Note that $$$\begin{split} p(\boldsymbol{\theta}|\text{rest},\mathbf{y})&\propto N(\mathbf{y}|\boldsymbol{\theta},\tau^2\mathbf{I})N(\boldsymbol{\theta}|\mu\mathbf{1},\sigma^2H(\phi))\\ &\propto \exp\{-\frac{(\mathbf{y}-\boldsymbol{\theta})^T(\mathbf{y}-\boldsymbol{\theta})}{2\tau^2}\}\\ &\times \exp\{-\frac{(\boldsymbol{\theta}-\mu\mathbf{1})^T(\sigma^2H(\phi))^{-1}(\boldsymbol{\theta}-\mu\mathbf{1})}{2}\}\\ &\propto \exp\{-\frac{1}{2}[\boldsymbol{\theta}^T(\frac{\mathbf{I}}{\tau^2}+\frac{H(\phi)^{-1}}{\sigma^2})\boldsymbol{\theta}-2\boldsymbol{\theta}^T(\frac{\mathbf{y}}{\tau^2}+\frac{H(\phi)^{-1}\mathbf{1}\mu}{\sigma^2})]\} \end{split} \tag{15.11}$$$ After completing the square, we get $$\boldsymbol{\theta}|\text{rest},\mathbf{y}\sim N(\mu_{\boldsymbol{\theta}|\cdot}(\phi,\mu,\sigma^2,\tau^2),\Sigma_{\boldsymbol{\theta}|\cdot}(\phi,\mu,\sigma^2,\tau^2))$$ where $$$\begin{split} &\Sigma_{\boldsymbol{\theta}|\cdot}(\phi,\mu,\sigma^2,\tau^2)=(\frac{\mathbf{I}}{\tau^2}+\frac{H(\phi)^{-1}}{\sigma^2})^{-1}\\ &\mu_{\boldsymbol{\theta}|\cdot}(\phi,\mu,\sigma^2,\tau^2)=\Sigma_{\boldsymbol{\theta}|\cdot}(\frac{\mathbf{y}}{\tau^2}+\frac{H(\phi)^{-1}\mathbf{1}\mu}{\sigma^2}) \end{split} \tag{15.12}$$$ For the $$l$$th post-burn-in sample of the parameters, denoted $$(\phi_l,\mu_l,\sigma_l^2,\tau_l^2)$$, we obtain a sample of $$\boldsymbol{\theta}$$ by sampling from $$N(\mu_{\boldsymbol{\theta}|\cdot}(\phi_l,\mu_l,\sigma_l^2,\tau_l^2),\Sigma_{\boldsymbol{\theta}|\cdot}(\phi_l,\mu_l,\sigma_l^2,\tau_l^2))$$.
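The composition step (15.12) can be sketched as follows; the helper name `sample_theta` is ours, and for large $$n$$ one would use Cholesky factorizations rather than the explicit inverses shown here:

```python
import numpy as np

def sample_theta(y, mu, sigma2, tau2, H, rng):
    """Draw theta from the Gaussian full conditional (15.11)-(15.12)."""
    n = len(y)
    Hinv = np.linalg.inv(H)
    prec = np.eye(n) / tau2 + Hinv / sigma2             # posterior precision
    Sigma_post = np.linalg.inv(prec)                    # Sigma_{theta|.}
    mean_post = Sigma_post @ (y / tau2 + Hinv @ (mu * np.ones(n)) / sigma2)
    return rng.multivariate_normal(mean_post, Sigma_post)

# Sanity check: with a tiny noise variance, theta should track y closely.
rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0])
H = np.exp(-np.abs(np.subtract.outer(x, x)))   # exponential correlation, phi = 1
y = np.array([0.5, -1.0, 2.0])
theta = sample_theta(y, 0.0, 1.0, 1e-8, H, rng)
```

Calling this once per retained draw $$(\phi_l,\mu_l,\sigma_l^2,\tau_l^2)$$ yields the posterior samples of $$\boldsymbol{\theta}$$.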