# A Numerical Linear Algebra for Fast GPs

This appendix is in two parts. §A.1 illustrates the value of linking against fast linear algebra libraries, yielding \(10 \times\) speedups and sometimes better, without any coding changes. §A.2 provides pointers to recent developments in MVN likelihood evaluation and prediction via stochastic approximations to log determinants and solves of linear systems.

## A.1 Intel MKL and OSX Accelerate

Throughout this text, Rmarkdown builds leveraged an ordinary R installation -- one downloaded from CRAN and run without modification. R ships with a rather vanilla, but highly portable, numerical linear algebra library (i.e., implementation of BLAS and LAPACK).^{13} This section is an exception. Here an R linked against Intel's Math Kernel Library (MKL) is used,^{14} following instructions here for Linux/Unix. Another good resource tied specifically to `.deb`

based systems such as Ubuntu can be found here. A set of instructions from Berkeley's Statistics Department, similar to Intel's for MKL, explains how to link against Apple OSX's Accelerate Framework instead. Microsoft R Open provides binary installs for Linux and Windows linked against MKL, and for OSX linked against the Accelerate framework. These are nice options for out-of-the-box speedups, as will be demonstrated momentarily. One reason MATLAB® often outperforms R on linear algebra-intensive benchmarks is that MATLAB ships with Intel MKL linear algebra. R does not. Microsoft R Open fills that gap, although it's not hard to do-it-yourself.

#### Illustration on borehole

Recall the borehole function introduced in §9.1.3 and revisited throughout Chapter 9. The borehole is a classic synthetic computer simulation example (Morris, Mitchell, and Ylvisaker 1993). It's well-fit by a GP, but obtaining accurate predictions requires a relatively large sampling. First pasting ...

```
borehole <- function(x)
{
rw <- x[1]*(0.15 - 0.05) + 0.05
r <- x[2]*(50000 - 100) + 100
Tu <- x[3]*(115600 - 63070) + 63070
Hu <- x[4]*(1110 - 990) + 990
Tl <- x[5]*(116 - 63.1) + 63.1
Hl <- x[6]*(820 - 700) + 700
L <- x[7]*(1680 - 1120) + 1120
Kw <- x[8]*(12045 - 9855) + 9855
m1 <- 2*pi*Tu*(Hu - Hl)
m2 <- log(r/rw)
m3 <- 1 + 2*L*Tu / (m2*rw^2*Kw) + Tu/Tl
return(m1/m2/m3)
}
```

... then code below generates random training and testing sets in a manner identical to §9.1.3 with two exceptions. Both sets are much bigger: ten thousand elements each from a combined Latin hypercube sample (LHS, §4.1); and these are observed with noise so that nuggets must be estimated alongside lengthscales.

```
library(lhs)
Npred <- N <- 10000
x <- randomLHS(N + Npred, 8)
y <- apply(x, 1, borehole)
y <- y + rnorm(length(y), sd=1)
X <- x[1:N,]
Y <- y[1:N]
XX <- x[-(1:N),]
YY <- y[-(1:N)]
```

Consider the full, non-approximate GP capability offered in `laGP`

(Gramacy and Sun 2018). Code below generates appropriate priors, insuring stable search for MAP estimates of hyperparameters \(\theta_1, \dots, \theta_8\) and \(g\).

```
library(laGP)
ga <- garg(list(mle=TRUE), Y)
da <- darg(list(mle=TRUE, max=100), X)
```

Such a setup, and code below solving for \((\hat{\theta}, \hat{g})\) through the concentrated log likelihood (5.8) and furnishing predictions (5.2), is identical to previous uses in this text. The `laGP`

library is identical too. The only difference is that R's BLAS and LAPACK are linked against Intel MKL, a testament to modularity in this presentation.

```
tic <- proc.time()[3]
## GP initialization and MAP calculation
gpsepi <- newGPsep(X, Y, da$start, g=ga$start, dK=TRUE)
that <- mleGPsep(gpsepi, param="both", tmin=c(da$min, ga$min),
tmax=c(da$max, ga$max), ab=c(da$ab, ga$ab), maxit=1000)
## predict out of sample
p <- predGPsep(gpsepi, XX, lite=TRUE)
deleteGPsep(gpsepi)
## timing end
toc <- proc.time()[3]
```

Here are the timing results. For the record, my workstation has an eight-core hyperthreaded Intel i7-6900K CPU running at 3.20GHz with 128GB of RAM. It was purchased in 2016 -- a mid-range desktop workstation from that era, optimized to be as quiet as possible, not for speed.

`toc - tic`

```
## elapsed
## 667
```

So it takes about 0.2 hours which, while not instantaneous, is still well within the realm of tolerable. Running the same example on a two-core laptop, such as my 2015 hyperthreaded Intel i7-5600U running at 2.60GHz with 8GB of RAM, takes about 25% longer. A problem of this size is all but out of reach without optimized linear algebra. I tried, letting it run overnight in ordinary R on the same workstation, and still it had not yet finished. Very conservatively, Intel MKL gives a \(10\times\) speedup on this example.

#### How does it work?

Intel MKL and OSX Accelerate offer a three-pronged approach to faster basic linear algebra (matrix--vector multiplications, etc.), matrix decompositions (determinants, LU, Cholesky), and solves of linear systems. Prong one leverages highly specialized machine instructions supported by modern processor hardware. Intel MKL has a huge advantage here. Experts designing processor pipelines, instructions, and compilers/options work closely together for the specific purpose of optimizing linear algebra and other benchmarks in order to show-up competitors and market to industrial clients.

Prong two involves tuning myriad alternatives to problem sizes and representative spans of use cases. Over the years, beginning in the 1960s, many numerical linear algebra codes have been developed, extended and refined. Some codes and combinations thereof are better than others depending on how big the problem is, and on other features describing common situations. For example, algorithms requiring time in \(\mathcal{O}(n^3)\) are sometimes faster than \(\mathcal{O}(n^{5/2})\) methods for small \(n\). Determining which \(n\) goes to which algorithm depends upon architecture details and can only be determined by extensive experimentation. This tuning enterprise is itself a computer experiment and Bayesian optimization (BO) of sorts. For more details and an open source scripting of such experimentation and search, see the ATLAS project.

Prong three is symmetric multiprocessor (SMP) parallelization, which is intimately linked to the first two prongs. Divide-and-conquer multi-threaded calculation features in many modern linear algebra libraries, and the optimal way to divvy up tasks for parallel evaluation depends upon problem size \(n\), numbers of processor cores, cache size, high-speed memory (RAM) capacity and performance, etc.

Swapping in fast linear algebra benefits more than just GP regression. Speedups can be expected in any linear algebra-intensive implementation of statistical methodology, or otherwise. Not every application will see substantial gains. For example, building the entirety of Chapter 9 takes 57 minutes with MKL and 98 minutes without. That may not seem impressive, but keep in mind that a large portion of the R code in that chapter doesn't involve linear algebra (e.g., plotting), uses non-BLAS/LAPACK solvers (e.g., sparse matrix CSK examples using `spam`

), or is already heavily parallelized (`tgp`

and `laGP`

examples). There are even examples where Intel MKL is slower than R's default BLAS and LAPACK. Those are few and far between in my experience. Typical speedups for linear algebra-intensive tasks are between \(2\times\) and \(10\times\), and the bigger the problem the bigger the time savings. I run both an MKL-linked and ordinary R on my machine. Development is more straightforward on the latter, because that gives me a good sense of what most users experience. Big problems get run with MKL.

Most university/lab-level supercomputing services offer an R linked against MKL. Virginia Tech's Advanced Research Computing (ARC) service goes one step further by compiling all of R with Intel's C and Fortran compilers, in addition to linking with MKL. Differences compared to ordinary (GCC) compiling are slight, as you can see on that page. The big winner is MKL. It's not hard to get similar capabilities on your own laptop or workstation. VT ARC doesn't offer an R linked against the default BLAS/LAPACK as provided by CRAN.

#### A caution on OpenBLAS

The Berkeley page explains how to link to OpenBLAS on Linux, and VT ARC also provides an OpenBLAS option. But I strongly caution against OpenBLAS when Intel MKL and OSX Accelerate alternatives are available. OpenBLAS is fast, but has thread safety issues which means it doesn't play well with some of the methods in this text, particularly those from Chapter 9 providing additional, bespoke divide-and-conquer parallelization. The `tgp`

package (Gramacy and Taddy 2016) uses pthreads to parallelize prediction across leaves; see Appendix C.2 of Gramacy (2007). The `laGP`

package uses OpenMP to parallelize over elements of the testing set. These are incompatible with OpenBLAS. Related messages on discussion boards can be found by googling "OpenBLAS is not thread safe" and adding "with pthreads" or "with OpenMP".

Thread safety means that two independent calculations can run in parallel without interfering with one another. Unfortunately, two GP predictive or MLE subroutines (e.g., via `aGP`

in §9.3) could not occur in parallel, without error, when the underlying linear algebra is off-loaded to OpenBLAS. When thread safety cannot be guaranteed, calculations which should transpire independently under the two processor threads carrying out their instructions will instead corrupt one another, without any indication that something is wrong. Results output are garbage.

There are ways to cripple OpenBLAS, making it slower but ensuring thread safety by limiting to a single thread. That defeats the purpose, especially on large problems where fast modern divide-and-conquer methods shine most. Intel MKL and OSX Accelerate are both thread safe and their compatibility spans more than 99% of architectures in modern use. There's almost no reason to entertain OpenBLAS, in my opinion, as long as thread safety remains an issue. That said, I've had little traction getting OpenBLAS removed as a default option at VT ARC. OpenBLAS is common in spite of thread safety issues, so beware.

## A.2 Stochastic approximation

Many exciting inroads are being made in the realm of stochastic approximation of the sorts of matrix solves and decompositions required for GP inference and prediction. Specifically, the important calculations are \(K_n^{-1} Y_n\) and \(\log |K_n|\). Both are prohibitive for larger \(n\). These passages are not intended to review the breadth of related methodology, but instead to showcase one modern approach as a representative. In contrast to Appendix A.1, which emphasizes a modular and generic fast linear algebra capability, pairing existing high-level GP code with fast low-level computations provided by libraries, the methods here are more tailored to GPs. They involve re-implementing calculations core to GP likelihood (and derivative) evaluation, and to predictive equations.

Gardner et al. (2018) proposed combining linear conjugate gradients for matrix solves (\(K_n^{-1} Y_n\)) with stochastic Lanczos quadrature (Ubaru, Chen, and Saad 2017) to approximate log determinant evaluations (\(\log |K_n|\)). They carefully describe many engineering details that make this work, including preconditioning, pivoted Cholesky decomposition, and considerations specific to GP inference and prediction. The methods are approximate and stochastic, which may seem like downsides, but they make the method highly parallelizable, enabling trivial SMP distribution. Another advantage is that they don't require storage of large \(K_n\), but rather only access kernel evaluations \(k(\cdot, \cdot)\). This dramatically reduces communication overheads in offloading data and calculations to customized hardware, such as to graphical processing units (GPUs). Other attempts to leverage GPU linear algebra in GP inference have, by contrast, yielded lukewarm results (Franey, Ranjan, and Chipman 2012).

Gardner et al. describe a Python implementation, GPyTorch based on the PyTorch toolkit for distributed computing. They report tractable inference on large-scale GP regression problems, including up to \(n = 10^6\) (Wang et al. 2019). I'm not aware of any R implementations at this time, but would anticipate similar developments coming online soon. Such techniques will be key to applying GP regression at scale, especially when assumptions of smoothness and stationarity are key to effective application, i.e., meaning that many of the Chapter 9 divide-and-conquer alternatives are less than ideal.

### References

Morris, MD, TJ Mitchell, and D Ylvisaker. 1993. “Bayesian Design and Analysis of Computer Experiments: Use of Derivatives in Surface Prediction.” *Technometrics* 35 (3). Taylor & Francis: 243–55.

Gramacy, RB, and F Sun. 2018. *laGP: Local Approximate Gaussian Process Regression*. http://bobby.gramacy.com/r_packages/laGP.

Gramacy, RB, and MA Taddy. 2016. *tgp: Bayesian Treed Gaussian Process Models*. https://CRAN.R-project.org/package=tgp.

Gramacy, RB. 2007. “tgp: An R Package for Bayesian Nonstationary, Semiparametric Nonlinear Regression and Design by Treed Gaussian Process Models.” *Journal of Statistical Software* 19 (9): 6.

Gardner, J, G Pleiss, KQ Weinberger, D Bindel, and AG Wilson. 2018. “GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration.” In *Advances in Neural Information Processing Systems*, 7576–86.

Ubaru, S, J Chen, and Y Saad. 2017. “Fast Estimation of \(\mathrm{tr}(f(A))\) via Stochastic Lanczos Quadrature.” *SIAM Journal on Matrix Analysis and Applications* 38 (4). SIAM: 1075–99.

Franey, M, P Ranjan, and H Chipman. 2012. “A Short Note on Gaussian Process Modeling for Large Datasets Using Graphics Processing Units.” *Preprint on ArXiv:1203.1269*.

Wang, KA, G Pleiss, JR Gardner, S Tyree, KQ Weinberger, and AG Wilson. 2019. “Exact Gaussian Processes on a Million Data Points.” *Preprint on ArXiv:1903.08114*.

Even when compiling R from source, or pulling binaries from standard repositories, R links against a vanilla reference BLAS and LAPCK by default.↩

Actually, results from an MKL linked R were performed off-line and read in after-the-fact because RStudio doesn't make it easy for two separate Rs to build a single document.↩