Chapter 17 Sparse Model to Fasten the Inference of Gaussian Process, Hidden Markov Model(Lecture on 03/02/2021)

When fitting a Gaussian model, we have $\mathbf{y}\sim N(\mu\mathbf{1},\tau^2\mathbf{I}+\sigma^2H(\phi))$ where $H(\phi)_{ij}=\exp(-\phi||x_i-x_j||)$ . The problem for computation is that $H(\phi)$ is a dense matrix, inverting it needs the whole cholesky decomposition.

What happens if $H(\phi)$ us sparse? There are efficient sparse matrix solvers which can invert a sparse matrix $J_{n\times n}$ with only $S$ nonzero entries with $O(s)$ complexity.

We cannot simply set some elements of $H(\phi)$ to 0, since the sparse matrix constructed in that way may not be positive definite.

Definition 17.1 (Compactly supported correlation function) $C_{\nu}(x_i,x_j)$ is a compactly supported correlation function if it is a nonnegative definite function and $C_{\nu}(x_i,x_j)=0$ when $||x_i-x_j||>\nu$ .

For example, $\begin{equation} C_{\nu}(x_i,x_j)=(1-\frac{d}{\nu})_{+}^4(1+\frac{d}{\nu}) \tag{17.1} \end{equation}$ is a compactly supported correlation function, here $d=||x_i-x_j||$ and $(1-\frac{d}{\nu})_{+}=\left\{\begin{aligned} & 1-\frac{d}{\nu} & d<\nu\\ &0 & o.w. \end{aligned}\right.$

Some other choices of compactly supported correlation functions can be observed in the work by Reinhard Furrer and Tilman Gneiting. For another example, $\begin{equation} C_{\nu}(x_i,x_j)=(1-\frac{d}{\nu})_{+}^2(1+2\frac{d}{\nu}) \tag{17.2} \end{equation}$

These two examples are the most commonly used compactly supported correlation function in practice.

Notice that in exponential correlation function, the correlation goes to $\infty$ as the distance $d\to\infty$ , but it is never exactly 0. For compactly supported correlation function, the correlation is exactly 0 when $d>\nu$ .

Rather than fitting $y=f(x)+\epsilon$ where $f\sim GP(\mu,\sigma^2\exp(-\phi||x_i-x_j||))$ , alternatively, we will fit the sparse model, given by $f=\tilde{f}(x)+\epsilon$ , where $\tilde{f}\sim GP(\mu,\tilde{C}_{\nu}(\cdot,\cdot,\phi,\sigma^2))$ and $Cov(\tilde{f}(x_i),\tilde{f}(x_j))=\sigma^2\exp(-\phi||x_i-x_j||)C_{\nu}(x_i,x_j)$ where $C_{\nu}(\cdot,\cdot)$ is a compactly supported correlation function.

Under the sparse model, we have $\begin{equation} \begin{pmatrix} \tilde{f}(x_1)\\ \vdots \\ \tilde{f}(x_n) \end{pmatrix}\sim N(\mu\mathbf{1},R(\phi,\sigma^2)) \tag{17.3} \end{equation}$ where $R(\phi,\sigma^2)_{ij}=\sigma^2\exp(-\phi||x_i-x_j||)C_{\nu}(x_i,x_j)$ . Let $M_{ij}=C_{\nu}(x_i,x_j)$ and $H(\phi)_{ij}=\sigma^2\exp(-\phi||x_i-x_j||)$ , then $R(\phi,\sigma^2)=H(\phi)\circ M$ . Here $\circ$ denotes the Hadamard product (elementwise product) between two matrices.

If $A$ and $B$ are both symmetric and postive definite, then $A\circ B$ is also symmetric and positive definite.

Therefore, $R(\phi,\sigma^2)$ is symmetric and positive definite as both $M$ and $H(\phi)$ are symmetric and positive definite.

Therefore, we have $\begin{equation} \begin{pmatrix} y_1\\ \vdots \\ y_n \end{pmatrix}=\begin{pmatrix} \tilde{f}(x_1)\\ \vdots \\ \tilde{f}(x_n) \end{pmatrix}+\begin{pmatrix} \epsilon_1\\ \vdots \\ \epsilon_n \end{pmatrix} \tag{17.4} \end{equation}$ or $\mathbf{y}\sim N(\mu\mathbf{1},\tau^2\mathbf{I}+R(\phi,\sigma^2))$ . Evaluating this likelihood will involve computing $|\tau^2\mathbf{I}+R(\phi,\sigma^2)|$ and $(\tau^2\mathbf{I}+R(\phi,\sigma^2))^{-1}$ . We can choose $\nu$ suitably to make $M$ sparse. This is because $M_{ij}=C_{\nu}(x_i,x_j)$ . It equals to 0 if $||x_i-x_j||>\nu$ . Thus, if we choose $\nu$ very small, $M$ will be heavily sparse and sparsity can be controled by choosing $\nu$ suitably. Since $M$ is sparse, $R(\phi,\sigma^2)=H(\phi)\circ M$ is also sparse. If a matrix is sparse, you can employ an sparse solver to efficiently inverse $R(\phi,\sigma^2)$ . We also need $|R(\phi,\sigma^2)|$ . There is no efficient algorithm to find determinant of a sparse matrix. Therefore, it is not fully useful if $n$ is beyond 10000. The computation of sparse model is infeasible.

Let $y=\mu+f_L(x)+(f(x)-f_L(x))h(x)+\epsilon$ , where $f_L(x)$ is a low rank GP (for example, predictive process). $f(x)$ is following the full GP and $h(x)$ is a function following a GP with a compactly supported correlation kernel. This model is much less sensitive to the choice of $\nu$ .

The idea behind this modeling approach is that: since $y=f_L(x)+[f(x)-f_L(x)]+\epsilon$ , we can first use the low-rank method to deal with the mean. Then for the residuals, to release the computation burdern, we use sparse method.

Hidden Markov Model (HMM)

HMM provides useful representation of dependent heterogeneous phenomena. They are very widely applied in economics, biology, genetics and finance. HMM has been successfully applied in finance since financial prices usually show nonlinear dynamics which are often due to the existence of two or more engines within which returns or volatilities disply different behavior.

Let $\mathbf{y}=(y_t)_{t=1}^T$ be the vector of observed variable indexd by time. HMM assumes distribution of $y_t$ is dependent on a hidden variable $\mathbf{S}=(S_t)_{t=1}^T$ , which characterize the “state” in which the underlying generating process is at any time. We assume $P(s_t=j|S_{t-1}=i)=\lambda_{ij}$ , for $i,j=1,\cdots,k$ . Let $\boldsymbol{\pi}$ satisfies $\boldsymbol{\pi}\Lambda=\boldsymbol{\pi}$ where $\Lambda$ is the matrix with the $(i,j)$ th entry $\lambda_{ij}$ .

As a generating process, we assume if $S_t=i$ , $y_t\sim N(\mu_i,\sigma_i^2), i=1,\cdots,k$ and we also assume that $P(S_t=i)=\pi_i$ , i.e., the chain is already in the steady state. Therefore, $\begin{equation} y_t|S_t,\boldsymbol{\mu},\boldsymbol{\sigma}^2\sim N(\mu_{S_t},\sigma^2_{S_t}) \tag{17.5} \end{equation}$ where $\boldsymbol{\mu}=(\mu_1,\cdots,\mu_k)^T,\boldsymbol{\sigma}^2=(\sigma_1^2,\cdots,\sigma^2_k)$ .

Integrating over $S_t$ , we have $\begin{equation} y_t|\boldsymbol{\mu},\boldsymbol{\sigma}^2\sim \sum_{i=1}^{k}\pi N(\mu_i,\sigma^2_i) \tag{17.6} \end{equation}$ because $P(S_t=i)=\pi$ . Therefore, HMM lead to mixture models. Here $k$ is finite and fixed before the analysis.

Prior Choice:

Our parameters are $\boldsymbol{\mu}=(\mu_1,\cdots,\mu_k)$ and $\boldsymbol{\sigma}^2=(\sigma_1^2,\cdots,\sigma_k^2)$ and $\Lambda_{k\times k}$ matrix. We use $\begin{equation} \begin{split} &\mu_i|\sigma_i^2\sim N(\xi,\kappa\sigma_i^2)\\ &\sigma_i^2\sim IG(\eta,\zeta)\\ &(\lambda_{i1},\cdots,\lambda_{ik})\sim Dir(\delta_1,\cdots,\delta_k),\quad i=1,\cdots,k \end{split} \tag{17.6} \end{equation}$

Using these prior, we have the posterior as $\begin{equation} p\propto p(\Lambda|\boldsymbol{\delta})p(\mathbf{S}|\Lambda)p(\boldsymbol{\mu}|\xi,\boldsymbol{\sigma}^2,\kappa)p(\boldsymbol{\sigma}^2|\eta,\zeta)p(\boldsymbol{y}|\boldsymbol{S},\boldsymbol{\mu},\boldsymbol{\sigma}^2) \tag{17.7} \end{equation}$ where $p(\mathbf{S}|\Lambda)=p(S_1|\Lambda)\prod_{t=2}^Tp(S_t|S_{t-1},\Lambda)$ , $p(S_1=i|\Lambda)=\pi$ , $P(S_t=j|S_{t-1}=i,\Lambda)=\lambda_{ij}$ and $p(\boldsymbol{y}|\boldsymbol{S},\boldsymbol{\mu},\boldsymbol{\sigma}^2)=\prod_{t=1}^TN(y_t|\mu_{S_t},\sigma_{S_t}^2)$ .

Firstly, consider update $\Lambda$ , let us show how to update $(\lambda_{i1},\cdots,\lambda_{ik})$ , we have $\begin{equation} \begin{split} p((\lambda_{i1},\cdots,\lambda_{ik})|\cdots)&\propto p(\Lambda|\boldsymbol{\delta})p(\mathbf{S}|\Lambda)\\ &\propto Dir((\lambda_{i1},\cdots,\lambda_{ik})|(\delta_1,\delta_k))(\mathbf{S}|\Lambda)\\ &\propto Dir((\lambda_{i1},\cdots,\lambda_{ik})|(\delta_1,\delta_k))\prod_{j=1}^k\lambda_{ij}^{n_{ij}} \end{split} \tag{17.8} \end{equation}$ where $n_{ij}=\sum_{t=1}^{T-1}I(S_t=i,S_{t+1}=j)$ . Thus, $\begin{equation} (\lambda_{i1},\cdots,\lambda_{ik})|\cdots\sim Dir(\delta_1+n_{i1},\cdots,\delta_k+n_{ik}) \tag{17.9} \end{equation}$

Next, consider update $\mathbf{S}$ , we have $\begin{equation} \begin{split} &P(S_t=i|\cdots)\propto N(y_t|\mu_i,\sigma_i^2)\lambda_{S_{t-1},i}\lambda_{i,S_{t+1}}\quad t=2,\cdots,T-1\\ &P(S_1=i|\cdots)\propto \pi_iN(y_t|\mu_i,\sigma_i^2)\lambda_{i,S_2}\\ &P(S_T=i|\cdots)\propto N(y_t|\mu_i,\sigma_i^2)\lambda_{S_{T-1},i} \end{split} \tag{17.10} \end{equation}$ Posterior full conditional of each $S_t$ is a discrete distribution on $\{1,\cdots,k\}$ with probabilities proportional to the expressions given in (17.10).

Next, to update $\boldsymbol{\sigma}^2$ , we have $\begin{equation} \begin{split} p(\sigma_i^2|\cdots)&\propto IG(\sigma_i^2|\eta,\zeta)N(\mu_i|\xi,\kappa\sigma_i^2)\prod_{S_t=i}N(y_t|\mu_i,\sigma_i^2)\\ &\propto \exp\{-\sum_{S_t=i}\frac{(y_t-\mu_i)^2}{2\sigma^2_i}-\frac{(\mu_i-\xi)^2}{2\kappa\sigma_i^2}-\frac{\zeta}{\sigma_i^2}\}\times\frac{1}{(\sigma_i^2)^{\eta+1+\frac{n_i+1}{2}}} \end{split} \tag{17.11} \end{equation}$ Thus, we have the full conditionals of $\sigma_i^2$ as $\begin{equation} \sigma_i^2|\cdots\sim IG(\eta+\frac{n_i+1}{2},\zeta+\frac{1}{2}\sum_{S_t=i}(y_t-\mu_i)^2+\frac{(\mu_i-\zeta)^2}{2\kappa}) \tag{17.12} \end{equation}$

Then for update $\boldsymbol{\mu}$ , note that the full model is invariant to the permutation of the labels. Thus, one restricts $\mu_1<\cdots,\mu_k$ . Without restriction, $\begin{equation} \mu_i|\cdots\sim N(\frac{\kappa\sum_{S_t=i}y_t+\xi}{1+\kappa n_i},\frac{\sigma_i^2\kappa}{1+\kappa n_i}) \tag{17.13} \end{equation}$ in the case when $k=2,3$ , one can use rejection sampling.

With out this restriction, one may have label switching problem. However, one can try with the unrestrict version first, if there is no label switching issue and the chain converges fine, then we do not need the restriction.