Chapter 16 Low-rank Methods to Speed Up Gaussian Process Inference (Lecture on 02/25/2021)

We are trying to fit \(y=f(x)+\epsilon\) with \(\epsilon\sim N(0,\tau^2)\), placing a Gaussian process prior on \(f\). We have shown how to draw post burn-in samples \((\mu^{(1)},(\sigma^2)^{(1)},(\tau^2)^{(1)},\phi^{(1)}),\cdots,(\mu^{(L)},(\sigma^2)^{(L)},(\tau^2)^{(L)},\phi^{(L)})\). Using these post burn-in samples, we can then obtain \(\boldsymbol{\theta}^{(1)},\cdots,\boldsymbol{\theta}^{(L)}\).

With post burn-in samples \((\theta^{(1)},\mu^{(1)},(\tau^2)^{(1)},(\sigma^2)^{(1)}),\cdots,(\theta^{(L)},\mu^{(L)},(\tau^2)^{(L)},(\sigma^2)^{(L)})\), we can simulate \(f(x)\) at new locations \(\tilde{x}_1,\cdots,\tilde{x}_m\) by noticing that \[\begin{equation} \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_n) \\ f(\tilde{x}_1)\\ \vdots \\ f(\tilde{x}_m)\end{pmatrix}|\mu,\tau^2,\sigma^2,\phi\sim N(\mu\mathbf{1},\begin{pmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}) \tag{16.1} \end{equation}\]

Denote \(\boldsymbol{\theta}=(f(x_1),\cdots,f(x_n))\) and \(\boldsymbol{\tilde{\theta}}=(f(\tilde{x}_1),\cdots,f(\tilde{x}_m))\); then \[\begin{equation} \boldsymbol{\tilde{\theta}}|\boldsymbol{\theta},\mu,\tau^2,\sigma^2,\phi\sim N(\mu\mathbf{1}+\Sigma_{21}\Sigma_{11}^{-1}(\boldsymbol{\theta}-\mu\mathbf{1}),\Sigma_{22}-\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}) \tag{16.2} \end{equation}\] Therefore, the posterior sample \(\tilde{\boldsymbol{\theta}}^{(l)}\) can be simulated from (16.2) using the posterior sample \((\theta^{(l)},\mu^{(l)},(\tau^2)^{(l)},(\sigma^2)^{(l)},\phi^{(l)})\). Finally, to sample the corresponding \(y\), since \(y=f(\tilde{x}_k)+\epsilon_k\), we can use the posterior sample \(f(\tilde{x}_k)^{(l)}\) and draw \(y^{(l)}\) from \(N(f(\tilde{x}_k)^{(l)},(\tau^2)^{(l)})\).
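
To make the prediction step concrete, here is a minimal sketch (not the lecture's code) of simulating \(\tilde{\boldsymbol{\theta}}\) and \(y\) at new locations from (16.2) for a single posterior draw. The exponential covariance \(\sigma^2\exp(-\phi d)\) used later in the lecture is assumed, and the jitter term is added purely for numerical stability.

```python
# A minimal sketch (not the lecture's code) of simulating theta_tilde and y at
# new locations from (16.2) for a single posterior draw, assuming the
# exponential covariance sigma^2 * exp(-phi * d).
import numpy as np

def exp_cov(x1, x2, sigma2, phi):
    """Exponential covariance matrix between two sets of 1-d locations."""
    d = np.abs(x1[:, None] - x2[None, :])
    return sigma2 * np.exp(-phi * d)

def predict_one_draw(x, x_new, theta, mu, sigma2, tau2, phi, rng):
    """Draw (theta_tilde, y_tilde) from (16.2) for one posterior sample."""
    S11 = exp_cov(x, x, sigma2, phi) + 1e-10 * np.eye(len(x))  # jitter for stability
    S21 = exp_cov(x_new, x, sigma2, phi)
    S22 = exp_cov(x_new, x_new, sigma2, phi)
    A = np.linalg.solve(S11, theta - mu)          # Sigma_11^{-1} (theta - mu 1)
    B = np.linalg.solve(S11, S21.T)               # Sigma_11^{-1} Sigma_12
    cond_mean = mu + S21 @ A
    cond_cov = S22 - S21 @ B
    theta_tilde = rng.multivariate_normal(cond_mean, cond_cov)
    y_tilde = theta_tilde + rng.normal(0.0, np.sqrt(tau2), size=len(x_new))
    return theta_tilde, y_tilde

# Usage: loop over the post burn-in samples l = 1, ..., L, calling
# predict_one_draw with (theta^(l), mu^(l), (sigma^2)^(l), (tau^2)^(l), phi^(l)).
```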

In the Bayesian inference of a Gaussian process, since \(\mathbf{y}\sim N(\mu\mathbf{1},\sigma^2H(\phi)+\tau^2\mathbf{I})\), at each MCMC iteration we have to evaluate the likelihood \[\begin{equation} \frac{1}{\det(\sigma^2H(\phi)+\tau^2\mathbf{I})^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{y}-\mu\mathbf{1})^T(\sigma^2H(\phi)+\tau^2\mathbf{I})^{-1}(\mathbf{y}-\mu\mathbf{1})\right\} \tag{16.3} \end{equation}\] (up to a normalizing constant). Therefore, at each iteration we need the inverse and determinant of \(\sigma^2H(\phi)+\tau^2\mathbf{I}\), which is typically done by computing its Cholesky decomposition. The Cholesky decomposition has computational complexity \(O(n^3)\); when \(n\) is large, even for \(n=20000\), fitting the GP may take months. This is known as the big-\(n\) problem in Gaussian processes.
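
For reference, a minimal sketch of this \(O(n^3)\) bottleneck: the log of (16.3) evaluated through one dense Cholesky factorization of the full \(n\times n\) covariance. The exponential correlation \(H(\phi)_{ij}=\exp(-\phi|x_i-x_j|)\) is only an illustrative choice of \(H(\phi)\).

```python
# A minimal sketch of the O(n^3) bottleneck: evaluating the log of (16.3) with
# one dense Cholesky factorization of the full n x n covariance. H(phi) is
# taken to be the exponential correlation exp(-phi |x_i - x_j|) purely for
# illustration.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_loglik_dense(y, x, mu, sigma2, tau2, phi):
    n = len(y)
    H = np.exp(-phi * np.abs(x[:, None] - x[None, :]))
    V = sigma2 * H + tau2 * np.eye(n)           # n x n covariance matrix
    c, low = cho_factor(V, lower=True)          # O(n^3) Cholesky factorization
    r = y - mu
    quad = r @ cho_solve((c, low), r)           # (y - mu 1)^T V^{-1} (y - mu 1)
    logdet = 2.0 * np.sum(np.log(np.diag(c)))   # log det(V) from the Cholesky factor
    return -0.5 * (quad + logdet + n * np.log(2 * np.pi))
```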

Various computationally less expensive versions of the GP have been proposed. We will discuss two such methods:

  1. Low-rank method;

  2. Sparse method.

Low-rank method

Rather than using a GP specification directly on \(f(\mathbf{x})\) in \(y=f(\mathbf{x})+\epsilon\), low-rank methods express \(f(\mathbf{x})\) through a basis representation. For example, we may write \[\begin{equation} f(\mathbf{x})=\mu+\sum_{j=1}^SK(x,\mathbf{x}_j^*,\xi)\lambda_j \tag{16.4} \end{equation}\] where \(K(\cdot,\cdot,\xi)\) is a pre-specified basis function, and \(\mathbf{x}_1^*,\cdots,\mathbf{x}_S^*\) are called the knot points. We also require \(S\ll n\).

There are several possible choices for the basis functions, such as:

  1. Spline basis

  2. Bezier kernel

Different choices of basis functions give different types of models; the choice of basis functions together with the choice of knots completely determines the low-rank method.

There is a trade-off between computational efficiency and inferential accuracy when choosing \(S\). If we make \(S\) larger, computational efficiency suffers but accuracy improves; if we make \(S\) smaller, computational efficiency improves but statistical accuracy decreases.

Once we fix the basis functions and the knots, the model becomes \[\begin{equation} y=f(\mathbf{x})+\epsilon=\mu+\sum_{j=1}^SK(x,\mathbf{x}_j^*,\xi)\lambda_j+\epsilon \tag{16.5} \end{equation}\] Note that \(\lambda_1,\cdots,\lambda_S\), \(\xi\) and \(\mu\) are parameters, and we need to estimate them.

Since we have \[\begin{equation} \begin{split} &y_1=\mu+\sum_{j=1}^SK(x_1,\mathbf{x}_j^*,\xi)\lambda_j+\epsilon_1\\ &\cdots\cdots\cdots\\ &y_n=\mu+\sum_{j=1}^SK(x_n,\mathbf{x}_j^*,\xi)\lambda_j+\epsilon_n\\ \end{split} \tag{16.6} \end{equation}\] and \(\epsilon_i\stackrel{i.i.d.}{\sim} N(0,\tau^2)\).

Written in the matrix form, we have \[\begin{equation} \mathbf{y}=\mu\mathbf{1}+K\boldsymbol{\lambda}+\boldsymbol{\epsilon} \tag{16.7} \end{equation}\] with \(\boldsymbol{\epsilon}\sim N(\mathbf{0},\tau^2\mathbf{I})\), \(\boldsymbol{\lambda}=(\lambda_1,\cdots,\lambda_S)^T\) and the \(n\times S\) matrix \(K\) is given by \[\begin{equation} K=\begin{pmatrix} K(x_1,\mathbf{x}_1^*,\xi) & \cdots & K(x_1,\mathbf{x}_S^*,\xi)\\ \vdots & &\vdots\\ K(x_n,\mathbf{x}_1^*,\xi) & \cdots & K(x_n,\mathbf{x}_S^*,\xi)\end{pmatrix} \tag{16.8} \end{equation}\]
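
As an illustration, a short sketch of building the \(n\times S\) matrix \(K\) in (16.8). The Gaussian kernel \(K(x,\mathbf{x}^*,\xi)=\exp(-(x-\mathbf{x}^*)^2/\xi)\) is an assumed choice for this example only; the lecture leaves the basis function unspecified.

```python
# A small sketch of building the n x S basis matrix K in (16.8). The Gaussian
# kernel K(x, x*, xi) = exp(-(x - x*)^2 / xi) is an assumed choice for this
# example only; the lecture leaves the basis function unspecified.
import numpy as np

def basis_matrix(x, knots, xi):
    """x: (n,) data locations; knots: (S,) knot locations; xi: bandwidth."""
    d2 = (x[:, None] - knots[None, :]) ** 2
    return np.exp(-d2 / xi)                     # n x S matrix K

# Example: n = 1000 observations and S = 30 knots spread over the data range.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
knots = np.linspace(0, 10, 30)
K = basis_matrix(x, knots, xi=1.0)              # shape (1000, 30)
```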

We want to estimate \(\mu,\boldsymbol{\lambda},\tau^2,\xi|y_1,\cdots,y_n\). We assign priors to the parameters as \[\begin{equation} \begin{split} &\mu\sim N(m_{\mu},\sigma^2_{\mu}),\quad \boldsymbol{\lambda}\sim N(\mathbf{0},\Sigma_1)\\ &\tau^2\sim IG(a_{\tau},b_{\tau}),\quad \xi\sim p(\xi) \end{split} \tag{16.9} \end{equation}\] Since \(\mathbf{y}\sim N(\mu\mathbf{1}+K\boldsymbol{\lambda},\tau^2\mathbf{I})\) and \(\boldsymbol{\lambda}\sim N(\mathbf{0},\Sigma_1)\), we can integrate out \(\boldsymbol{\lambda}\), which gives us \[\begin{equation} \mathbf{y}\sim N(\mu\mathbf{1},K\Sigma_1K^T+\tau^2\mathbf{I}) \tag{16.10} \end{equation}\]

As before, we have to evaluate the likelihood \(N(\mathbf{y}|\mu\mathbf{1},K\Sigma_1K^T+\tau^2\mathbf{I})\) at each MCMC iteration to draw samples. The log-likelihood, up to an additive constant, is \[\begin{equation} l=-\frac{(\mathbf{y}-\mu\mathbf{1})^T(K\Sigma_1K^T+\tau^2\mathbf{I})^{-1}(\mathbf{y}-\mu\mathbf{1})}{2}-\frac{1}{2}\log(\det(K\Sigma_1K^T+\tau^2\mathbf{I})) \tag{16.11} \end{equation}\]

Then use the Woodbury matrix identity, \[\begin{equation} (A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1} \tag{16.12} \end{equation}\] where \(A,U,C,V\) are \(n\times n\), \(n\times S\), \(S\times S\) and \(S\times n\) matrices, respectively. The crucial thing to notice is that \(C^{-1}+VA^{-1}U\) is an \(S\times S\) matrix. For our case, take \(U=K\), \(V=K^T\), \(C=\Sigma_1\) and \(A=\tau^2\mathbf{I}\), so that \(A^{-1}=\frac{1}{\tau^2}\mathbf{I}\). The main computational cost comes from evaluating \((C^{-1}+VA^{-1}U)^{-1}\), the inverse of an \(S\times S\) matrix, which requires \(O(S^3)\) operations. Since \(S\ll n\), this achieves the desired computational efficiency.
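
A numerical sketch of this substitution: the Woodbury identity lets us apply \((\tau^2\mathbf{I}+K\Sigma_1K^T)^{-1}\) to a vector using only an \(S\times S\) solve. The random test problem below is purely illustrative.

```python
# A numerical sketch of the Woodbury identity (16.12) with A = tau^2 I, U = K,
# C = Sigma_1, V = K^T: the n x n solve is replaced by an S x S solve. The
# random test problem is purely illustrative.
import numpy as np

def woodbury_solve(K, Sigma1, tau2, r):
    """Return (tau^2 I + K Sigma_1 K^T)^{-1} r without forming the n x n matrix."""
    inner = np.linalg.inv(Sigma1) + (K.T @ K) / tau2   # S x S matrix C^{-1} + V A^{-1} U
    tmp = np.linalg.solve(inner, K.T @ r / tau2)
    return r / tau2 - (K @ tmp) / tau2

# Check against the direct dense solve on a small problem.
rng = np.random.default_rng(1)
n, S = 200, 10
K = rng.normal(size=(n, S))
Sigma1 = np.eye(S)
tau2 = 0.5
r = rng.normal(size=n)
direct = np.linalg.solve(tau2 * np.eye(n) + K @ Sigma1 @ K.T, r)
assert np.allclose(woodbury_solve(K, Sigma1, tau2, r), direct)
```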

We will also need to evaluate the determinant of \(K\Sigma_1K^T+\tau^2\mathbf{I}\), which can be simplified using the following result.

Lemma 16.1 (Matrix-determinant lemma) For \(n\times n\), \(n\times S\), \(S\times S\) and \(S\times n\) matrices \(A,U,C\) and \(V\), we have \[\begin{equation} \det(A+UCV)=\det(C^{-1}+VA^{-1}U)\det(A)\det(C) \tag{16.13} \end{equation}\]

Therefore, for our case, computing the determinant of the \(n\times n\) matrix \(K\Sigma_1K^T+\tau^2\mathbf{I}\) reduces to computing the determinants of the \(S\times S\) matrices \(\Sigma_1^{-1}+\frac{1}{\tau^2}K^TK\) and \(\Sigma_1\). Since \(A=\tau^2\mathbf{I}\), both \(\det(A)=(\tau^2)^n\) and \(A^{-1}=\frac{1}{\tau^2}\mathbf{I}\) are easy to obtain.
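
A quick numerical check of this reduction on a small random problem (illustrative values only), comparing the direct log-determinant with the one obtained from the matrix determinant lemma:

```python
# A quick numerical check of the matrix determinant lemma (16.13) for our case
# A = tau^2 I, U = K, C = Sigma_1, V = K^T (illustrative values only).
import numpy as np

rng = np.random.default_rng(2)
n, S = 200, 10
K = rng.normal(size=(n, S))
Sigma1 = np.eye(S)
tau2 = 0.5

_, logdet_direct = np.linalg.slogdet(K @ Sigma1 @ K.T + tau2 * np.eye(n))
inner = np.linalg.inv(Sigma1) + (K.T @ K) / tau2     # S x S matrix
_, logdet_inner = np.linalg.slogdet(inner)
_, logdet_Sigma1 = np.linalg.slogdet(Sigma1)
logdet_lemma = logdet_inner + n * np.log(tau2) + logdet_Sigma1  # det(A) = (tau^2)^n
assert np.allclose(logdet_direct, logdet_lemma)
```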

The inverse and determinant of a matrix can be obtained together by computing its Cholesky decomposition. Suppose the Cholesky decomposition of a matrix \(A\) is \(A=LL^{T}\); then \(A^{-1}=(L^{-1})^{T}(L^{-1})\) and \(\det(A)=(\det(L))^2\). Since \(L\) is a lower triangular matrix, the determinant of \(L\) is the product of its diagonal entries.
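
Putting the pieces together, here is a minimal sketch of evaluating the low-rank log-likelihood (16.11) in \(O(nS^2+S^3)\): one Cholesky factorization of the \(S\times S\) matrix \(\Sigma_1^{-1}+\frac{1}{\tau^2}K^TK\) supplies both the Woodbury solve and the determinant-lemma term. Function names are illustrative.

```python
# A minimal sketch of evaluating the low-rank log-likelihood (16.11) in
# O(n S^2 + S^3): one Cholesky factorization of the S x S matrix
# Sigma_1^{-1} + K^T K / tau^2 supplies both the Woodbury solve and the
# determinant-lemma term. Function names are illustrative.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lowrank_loglik(y, K, Sigma1, mu, tau2):
    n, S = K.shape
    r = y - mu
    inner = np.linalg.inv(Sigma1) + (K.T @ K) / tau2   # S x S matrix
    c, low = cho_factor(inner, lower=True)             # O(S^3) Cholesky factorization
    Ktr = K.T @ r
    # Woodbury: r^T (tau^2 I + K Sigma_1 K^T)^{-1} r
    quad = (r @ r) / tau2 - (Ktr @ cho_solve((c, low), Ktr)) / tau2**2
    # Matrix determinant lemma: log det(tau^2 I + K Sigma_1 K^T)
    _, logdet_S1 = np.linalg.slogdet(Sigma1)
    logdet = 2.0 * np.sum(np.log(np.diag(c))) + n * np.log(tau2) + logdet_S1
    return -0.5 * (quad + logdet + n * np.log(2 * np.pi))
```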

For low-rank methods, both the choice of the kernel function \(K(\cdot,\cdot,\xi)\) and the choice of the knot points \(\mathbf{x}_1^*,\cdots,\mathbf{x}_S^*\) affect the results.

A natural question is whether the choice of the kernel functions can be motivated by the Gaussian process itself. This leads to the idea of the predictive process model, in which one only chooses the set of knots; the kernel functions are then determined automatically.

Suppose the original nonlinear regression model is \(y=f(\mathbf{x})+\epsilon\) and suppose \(f(\mathbf{x})\sim GP(\mu,\sigma^2\exp(-\phi d))\). Let us define a set of knots \(\mathbf{x}_1^*,\cdots,\mathbf{x}_S^*\), and define \(\mathbf{f}^*=(f(\mathbf{x}_1^*),\cdots,f(\mathbf{x}_S^*))^T\). We have \[\begin{equation} Cov(f(\mathbf{x}_i),\mathbf{f}^*)=\begin{pmatrix} \sigma^2\exp(-\phi|\mathbf{x}_i-\mathbf{x}_1^*|)\\ \vdots\\ \sigma^2\exp(-\phi|\mathbf{x}_i-\mathbf{x}_S^*|) \end{pmatrix} \tag{16.14} \end{equation}\] and \(Var(\mathbf{f}^*)\) is an \(S\times S\) matrix whose \((k,l)\)th entry is given by \(\sigma^2\exp(-\phi|\mathbf{x}_k^*-\mathbf{x}_l^*|)=Cov(f(\mathbf{x}_k^*),f(\mathbf{x}_l^*))\).

Predictive process model

Let \(y_i=\mu+Cov(f(\mathbf{x}_i),\mathbf{f}^*)^TVar(\mathbf{f}^*)^{-1}\mathbf{f}^*+\epsilon_i\). Denoting \(c(\mathbf{x}_i)=Cov(f(\mathbf{x}_i),\mathbf{f}^*)^T\) (a \(1\times S\) row vector) and \(C^*=Var(\mathbf{f}^*)\), we then have \[\begin{equation} \begin{split} &y_1=\mu+c(\mathbf{x}_1)(C^*)^{-1}\mathbf{f}^*+\epsilon_1\\ &\cdots\cdots\cdots\\ &y_n=\mu+c(\mathbf{x}_n)(C^*)^{-1}\mathbf{f}^*+\epsilon_n\\ \end{split} \tag{16.15} \end{equation}\] Written in matrix form, we have \[\begin{equation} \mathbf{y}=\mu\mathbf{1}+\begin{pmatrix} c(\mathbf{x}_1) \\ \vdots \\ c(\mathbf{x}_n) \end{pmatrix}(C^*)^{-1}\mathbf{f}^*+\boldsymbol{\epsilon} \tag{16.16} \end{equation}\] where \(\begin{pmatrix} c(\mathbf{x}_1) \\ \vdots \\ c(\mathbf{x}_n) \end{pmatrix}\) is an \(n\times S\) matrix, \((C^*)^{-1}\) is an \(S\times S\) matrix and \(\mathbf{f}^*\) is an \(S\times 1\) vector. Comparing with the low-rank method, set \[\begin{equation} K=\begin{pmatrix} c(\mathbf{x}_1) \\ \vdots \\ c(\mathbf{x}_n) \end{pmatrix}(C^*)^{-1} \tag{16.17} \end{equation}\] so that \[\begin{equation} \mathbf{y}=\mu\mathbf{1}+K\mathbf{f}^*+\boldsymbol{\epsilon} \tag{16.18} \end{equation}\] This looks like a low-rank method, but the choice of the basis functions is now implicitly determined by the GP specification on \(f\).
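
A short sketch of constructing the predictive process basis (16.17), assuming the exponential covariance \(\sigma^2\exp(-\phi d)\) from the lecture; the function name is illustrative.

```python
# A short sketch of the predictive process basis (16.17), assuming the
# exponential covariance sigma^2 * exp(-phi * d) from the lecture.
import numpy as np

def pp_basis(x, knots, sigma2, phi):
    """Return K = c(x)(C*)^{-1} (n x S) and C* (S x S) for 1-d locations."""
    c = sigma2 * np.exp(-phi * np.abs(x[:, None] - knots[None, :]))           # n x S
    C_star = sigma2 * np.exp(-phi * np.abs(knots[:, None] - knots[None, :]))  # S x S
    K = np.linalg.solve(C_star.T, c.T).T                                      # c (C*)^{-1}
    return K, C_star

# With K and C* in hand, the model is y = mu 1 + K f* + eps with f* ~ N(0, C*),
# so the low-rank machinery above applies with Sigma_1 = C*.
```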

To do posterior inference for the predictive process, since \(\mathbf{y}\sim N(\mu\mathbf{1}+K\mathbf{f}^*,\tau^2\mathbf{I})\) and \(\mathbf{f}^*\sim N(0,C^*)\), marginalizing out \(\mathbf{f}^*\) gives \(\mathbf{y}\sim N(\mu\mathbf{1},\tau^2\mathbf{I}+KC^*K^T)\). We then run Metropolis-Hastings within Gibbs (a Gibbs update for \(\mu\) and M-H updates for the other parameters) to draw samples of \(\mu,\tau^2,\phi,\sigma^2\).
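
A compact sketch of such a sampler, with a Gibbs step for \(\mu\) and a joint random-walk M-H step for \((\log\sigma^2,\log\tau^2,\log\phi)\). This is an illustrative implementation, not the lecture's code: the priors, proposal scale, and function names are assumptions, and the marginal likelihood is evaluated densely for brevity; in practice one would plug in the Woodbury/determinant-lemma evaluation above.

```python
# A compact sketch of the sampler: a Gibbs step for mu and a joint random-walk
# Metropolis-Hastings step for (log sigma^2, log tau^2, log phi). This is an
# illustrative implementation, not the lecture's code: the priors, proposal
# scale and function names are assumptions, and the marginal likelihood is
# evaluated densely for brevity (in practice, use the Woodbury version above).
import numpy as np
from scipy.stats import multivariate_normal, norm

def marginal_cov(x, knots, sigma2, tau2, phi):
    """tau^2 I + K C* K^T for the predictive process with exponential covariance."""
    c = sigma2 * np.exp(-phi * np.abs(x[:, None] - knots[None, :]))
    C_star = sigma2 * np.exp(-phi * np.abs(knots[:, None] - knots[None, :]))
    K = np.linalg.solve(C_star.T, c.T).T
    return tau2 * np.eye(len(x)) + K @ C_star @ K.T

def mh_within_gibbs(y, x, knots, n_iter=2000, m_mu=0.0, s2_mu=100.0, step=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    mu, logpar = 0.0, np.zeros(3)               # logpar = log(sigma^2, tau^2, phi)

    def log_post(lp, mu):
        s2, t2, p = np.exp(lp)
        ll = multivariate_normal.logpdf(y, mu * np.ones(n), marginal_cov(x, knots, s2, t2, p))
        return ll + norm.logpdf(lp, 0.0, 10.0).sum()   # vague illustrative priors on the log scale

    draws = []
    for _ in range(n_iter):
        s2, t2, p = np.exp(logpar)
        V = marginal_cov(x, knots, s2, t2, p)
        Vinv = np.linalg.inv(V)
        # Gibbs update for mu: the N(m_mu, s2_mu) prior is conjugate given V
        prec = 1.0 / s2_mu + np.ones(n) @ Vinv @ np.ones(n)
        mean = (m_mu / s2_mu + np.ones(n) @ Vinv @ y) / prec
        mu = rng.normal(mean, 1.0 / np.sqrt(prec))
        # M-H update for the covariance parameters on the log scale
        prop = logpar + rng.normal(0.0, step, size=3)
        if np.log(rng.uniform()) < log_post(prop, mu) - log_post(logpar, mu):
            logpar = prop
        draws.append(np.concatenate(([mu], np.exp(logpar))))
    return np.array(draws)                       # columns: mu, sigma^2, tau^2, phi
```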

If the full model is \(y_i=\mu+f(\mathbf{x}_i)+\epsilon_i\) where \(f\sim GP(0,C_{\nu})\), the predictive process model is \(y_i=\mu+Cov(f(\mathbf{x}_i),\mathbf{f}^*)^TVar(\mathbf{f}^*)^{-1}\mathbf{f}^*+\epsilon_i\). Noticing that \(Cov(f(\mathbf{x}_i),\mathbf{f}^*)^TVar(\mathbf{f}^*)^{-1}\mathbf{f}^*=E(f(\mathbf{x}_i)|\mathbf{f}^*)\), the predictive process model can also be written as \[\begin{equation} y_i=\mu+E(f(\mathbf{x}_i)|\mathbf{f}^*)+\epsilon_i \tag{16.19} \end{equation}\]

The variability of an observation under the full model is \(Var(f(\mathbf{x}_i))+Var(\epsilon_i)\), while under the predictive process model it is \(Var(E(f(\mathbf{x}_i)|\mathbf{f}^*))+Var(\tilde{\epsilon}_i)\). Since the variability of an observation should be the same under both models, we have \[\begin{equation} Var(f(\mathbf{x}_i))+Var(\epsilon_i)=Var(E(f(\mathbf{x}_i)|\mathbf{f}^*))+Var(\tilde{\epsilon}_i) \tag{16.20} \end{equation}\] By the law of total variance, \(Var(f(\mathbf{x}_i))=Var(E(f(\mathbf{x}_i)|\mathbf{f}^*))+E(Var(f(\mathbf{x}_i)|\mathbf{f}^*))\), so \[Var(E(f(\mathbf{x}_i)|\mathbf{f}^*))+E(Var(f(\mathbf{x}_i)|\mathbf{f}^*))+Var(\epsilon_i)=Var(E(f(\mathbf{x}_i)|\mathbf{f}^*))+Var(\tilde{\epsilon}_i)\] and we finally get \[\begin{equation} Var(\tilde{\epsilon}_i)=Var(\epsilon_i)+E(Var(f(\mathbf{x}_i)|\mathbf{f}^*))\geq Var(\epsilon_i) \tag{16.21} \end{equation}\] (16.21) tells us that if we simulate data from a full Gaussian process model and fit a predictive process model to those data, we will always observe an overestimation of the error variance. As \(S\) increases, the predictive process becomes a better approximation of the full GP model. In the extreme case where \(S=n\) and the data points themselves are taken as the knot points, \(E(f(\mathbf{x}_i)|\mathbf{f}^*)=f(\mathbf{x}_i)\).
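
A short numerical illustration of (16.21): under the exponential covariance, the conditional variance \(Var(f(\mathbf{x})|\mathbf{f}^*)=\sigma^2-c(\mathbf{x})(C^*)^{-1}c(\mathbf{x})^T\) is nonnegative and vanishes exactly at the knots, so the predictive process pushes that leftover variance into the error term everywhere else (the values below are illustrative).

```python
# A numerical illustration of (16.21): under the exponential covariance, the
# conditional variance Var(f(x) | f*) = sigma^2 - c(x) (C*)^{-1} c(x)^T is
# nonnegative and vanishes at the knots, so the predictive process pushes that
# leftover variance into the error term everywhere else. Values are illustrative.
import numpy as np

sigma2, phi = 1.0, 1.0
knots = np.linspace(0, 10, 5)
x = np.linspace(0, 10, 201)
c = sigma2 * np.exp(-phi * np.abs(x[:, None] - knots[None, :]))           # n x S
C_star = sigma2 * np.exp(-phi * np.abs(knots[:, None] - knots[None, :]))  # S x S
cond_var = sigma2 - np.einsum('ij,ij->i', c, np.linalg.solve(C_star, c.T).T)
print(cond_var.min(), cond_var.max())   # >= 0 (up to floating point), ~0 at the knots
```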