Chapter 10 Heteroskedasticity

Historically, design and analysis of computer experiments focused on deterministic solvers from the physical sciences via Gaussian process (GP) interpolation (Sacks et al. 1989). But nowadays computer modeling is common in the social (Cioffi–Revilla 2014, Chapter 8), management (Law 2015) and biological (Johnson 2008) sciences, where stochastic simulations abound. Queueing systems and agent based models replace finite elements and simple Monte Carlo with geometric convergence (Picheny and Ginsbourger 2013). Data in geostatistics (Banerjee, Carlin, and Gelfand 2004) and machine learning (Rasmussen and Williams 2006) are not only noisy but frequently involve signal-to-noise ratios that are low or changing over the input space. Noisier simulations/observations demand bigger experiments/studies to isolate signal from noise, and more sophisticated GP models – not just adding nuggets to smooth the noise, but variance processes to track changes in noise throughout the input space in the face of that heteroskedasticity.

Modeling methodology for large simulation efforts with intrinsic stochastic dynamics has lagged until recently. Partitioning is one option (Chapter 9.2.2), but leaves something to be desired when underlying processes are smooth and signal-to-noise ratios are low. A theme in this chapter is that replication, a tried and true design strategy for separating signal from noise, can play a crucial role in computationally efficient heteroskedastic modeling, especially with GPs. In operations research, stochastic kriging (SK; Ankenman, Nelson, and Staum 2010) offers approximate methods that exploit large degrees of replication. Its independent, method of moments-based inferential framework can yield an efficient heteroskedastic modeling capability. However, the setup exploited by SK is highly specialized; moment-based estimation can strain coherency in a likelihood-dominated surrogate landscape. Here the focus is on a heteroskedastic GP technology (Binois, Gramacy, and Ludkovski 2018) that offers a modern blend of SK and ideas from machine learning (largely predating SK: Goldberg, Williams, and Bishop 1998; Quadrianto et al. 2009; Boukouvalas and Cornford 2009; Lazaro–Gredilla and Titsias 2011).

Besides leveraging replication in design, the idea is: what’s good for the mean is good for the variance. If it’s sensible to model a latent (mean) field via GPs (Chapter 5.3.2), why not extend that to the variance too? Unfortunately latent variance fields are not as easy to integrate out. (Variances must also be positive, whereas means are less constrained.) Numerical schemes are required for latent learning, and there are a myriad of strategies – hence the plethora of machine learning citations above. Binois, Gramacy, and Ludkovski (2018)’s scheme remains within our familiar class of library-based (e.g., BFGS) likelihood-optimizing methodology, and is paired with tractable strategies for sequential design (Binois et al. 2019) and Bayesian optimization (BO; Chapter 7). Other approaches achieving a degree of input-dependent variance include quantile kriging (QK; Plumlee and Tuo 2014); use of pseudoinputs (Snelson and Ghahramani 2006) or predictive process (Banerjee et al. 2008); and (non-GP-based) tree methods (Pratola et al. 2017). Unfortunately, none of those important methodological contributions, to my knowledge, pair with design, BO or open source implementation.

library(hetGP)

Illustrations here leverage hetGP (Binois and Gramacy 2019) on CRAN, supporting a wide class of alternatives in the space of coupled-GP heteroskedastic (and ordinary homoskedastic) modeling, design and optimization.

10.1 Replication and stochastic kriging

Replication can be a powerful device for separating signal from noise, offering a pure look at noise not obfuscated by signal. When modeling with GPs, replication in design can yield substantial computational savings as well. Continuing with \(N\)-notation from Chapter 9, consider training data pairs \((x_1, y_1), \dots, (x_N, y_N)\). These comprise the “full-\(N\)” data set \((X_N, Y_N)\). Now suppose that the number \(n\) of unique \(x_i\)-values in \(X_N\) is far smaller than \(N\), i.e., \(n \ll N\). Let \(\bar{Y}_n = (\bar{y}_1, \dots, \bar{y}_n)^\top\) collect averages of \(a_i\) replicates at unique locations \(\bar{x}_i\), and similarly let \(\hat{\sigma}^2_i\) collect normalized residual sums of squares for those replicate measurements:

\[ \bar{y}_i = \frac{1}{a_i} \sum_{j=1}^{a_i} y_i^{(j)} \quad \mbox{and} \quad \hat{\sigma}^2_i = \frac{1}{a_i - 1} \sum_{j=1}^{a_i} (y_i^{(j)} - \bar{y}_i)^2. \]
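For concreteness, here is a small sketch with hypothetical toy data (Xtoy, Ytoy) showing how these per-site summaries may be collected with find_reps from hetGP, whose output fields (X0, Z0, mult, Zlist) come up repeatedly later in the chapter.

Xtoy <- matrix(rep(c(0.1, 0.4, 0.9), times=c(2, 5, 3)), ncol=1)  # 3 unique sites
Ytoy <- sin(10*Xtoy[,1]) + rnorm(nrow(Xtoy), sd=0.1)             # toy noisy responses
ur <- find_reps(Xtoy, Ytoy)
# unique sites, replicate counts a_i, averages ybar_i, and hat{sigma}^2_i
cbind(xbar=ur$X0[,1], a=ur$mult, ybar=ur$Z0, s2=sapply(ur$Zlist, var))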

Unfortunately, \(\bar{Y}_n = (\bar{y}_1, \dots, \bar{y}_n)\) and \(\hat{\sigma}_{1:n}^2\) don’t comprise a set of sufficient statistics for GP prediction under the full-\(N\) training data. But they are nearly sufficient: only one quantity is missing, and it will be provided momentarily in Section 10.1.1. Nevertheless “unique-\(n\)” predictive equations based on \((\bar{X}_n, \bar{Y}_n)\) are a best linear unbiased predictor (BLUP):

\[\begin{align} \mu^{\mathrm{SK}}_n(x) &= \nu k_n^\top(x) (\nu C_n + S_n)^{-1} \bar{Y}_n \notag \\ \sigma^{\mathrm{SK}}_n(x)^2 &= \nu K_\theta(x,x) - \nu^2 k_n^\top(x) (\nu C_n + S_n)^{-1} k_n(x), \tag{10.1} \end{align}\]

where \(k_n(x) = (K_\theta(x, \bar{x}_1), \dots, K_\theta(x, \bar{x}_n))^\top\) as in Chapter 5,

\[ S_n = [\hat{\sigma}_{1:n}^2] A_n^{-1} = \mathrm{Diag}(\hat{\sigma}_1^2/a_1, \dots, \hat{\sigma}_n^2/a_n), \]

\(C_n = \{ K_\theta(\bar{x}_i, \bar{x}_j)\}_{1 \leq i, j \leq n}\), and \(a_i \gg 1\). This is the basis of stochastic kriging (SK; Ankenman, Nelson, and Staum 2010), implemented as an option in the DiceKriging (Roustant, Ginsbourger, and Deville 2018) and mlegp (Dancik 2018) packages on CRAN. SK’s simplicity is a virtue. Notice that when \(n = N\), all \(a_i = 1\) and \(\hat{\sigma}_i^2 = \nu g\), a standard GP predictor (Chapter 5) is recovered when mapping \(\nu = \tau^2\). Eq. (10.1) has intuitive appeal as an application of ordinary kriging equations (5.2) on (almost) sufficient statistics. An independent moments-based estimate of variance is used in lieu of the more traditional, likelihood-based (hyperparametric) alternative. This could be advantageous if variance is changing in the input space, as discussed further in Section 10.2. Computational expedience is readily apparent even in the usual homoskedastic setting: \(S_n = \mathrm{Diag}(\frac{1}{n} \sum \hat{\sigma}_i^2)\). Only \(\mathcal{O}(n^3)\) matrix decompositions are required, which could represent huge savings compared with \(\mathcal{O}(N^3)\) if the degree of replication is high.
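To make Eq. (10.1) concrete, the code below sketches an SK-style predictor as a hypothetical helper sk.pred, not part of hetGP. It assumes the unique-\(n\) summaries and hyperparameters \(\nu\) and \(\theta\) are given, and borrows hetGP’s cov_gen (introduced more formally below) for the Gaussian correlation kernel, so that \(K_\theta(x,x) = 1\).

sk.pred <- function(x, Xu, ybar, s2, a, nu, theta) {
  x <- matrix(x, nrow=1)
  Cn <- cov_gen(Xu, theta=theta)                   # C_n
  Sn <- diag(s2/a)                                 # S_n = Diag(sigma2_i/a_i)
  kn <- cov_gen(rbind(x, Xu), theta=theta)[1,-1]   # k_n(x)
  Ki <- solve(nu*Cn + Sn)
  mu <- nu*drop(kn %*% Ki %*% ybar)                # Eq. (10.1), mean
  s2p <- nu - nu^2*drop(kn %*% Ki %*% kn)          # Eq. (10.1), variance
  c(mean=mu, s2=s2p)
}

With the toy summaries from above, for example, sk.pred(0.5, ur$X0, ur$Z0, sapply(ur$Zlist, var), ur$mult, nu=1, theta=0.2) furnishes a mean and variance at \(x = 0.5\), bearing in mind SK’s preference for large \(a_i\).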

Yet independent calculations have their drawbacks. Thinking heteroskedastically, this setup lacks a mechanism for specifying a belief that variance evolves smoothly over the input space. Lack of smoothness thwarts a pooling of variance that is essential for predicting uncertainties out-of-sample at novel \(x' \in \mathcal{X}\). Basically we don’t know \(K_\theta(x, x)\) in Eq. (10.1), whereas homoskedastic settings use nugget \(g\) for all \(x\). Ankenman, Nelson, and Staum (2010) suggest smoothing \(\hat{\sigma}_i^2\) values with a second GP, but that two-stage approach feels ad hoc compared to what’s presented momentarily in Section 10.2. More fundamentally, numbers of replicates \(a_i\) must be relatively large in order for the \(\hat{\sigma}^2_i\) values to be reliable. Ankenman, et al., recommend \(a_i \geq 10\) for all \(i\), which can be prohibitive. Thinking homoskedastically, the problem with this setup is that it doesn’t emit a likelihood for inference of other hyperparameters, such as lengthscale(s) \(\theta\) and scale \(\nu\). Again, this is because \((\bar{Y}_n, S_n)\) don’t constitute a set of sufficient statistics for hyperparameter inference, although they are close to being so.

10.1.1 Woodbury trick

The “fix”, ultimately revealing the full set of sufficient statistics for prediction and likelihood-based inference, involves Woodbury linear algebra identities. These are not unfamiliar to the spatial modeling community (e.g., Opsomer et al. 1999; Banerjee et al. 2008; Ng and Yin 2012). However, their application toward efficient GP inference under replicated design is relatively recent (Binois, Gramacy, and Ludkovski 2018). Let \(D\) and \(B\) be invertible matrices of size \(N \times N\) and \(n \times n\) respectively, and let \(U\) and \(V^\top\) be matrices of size \(N\times n\). The Woodbury identities are

\[\begin{align} (D + UBV)^{-1} &= D^{-1} - D^{-1} U (B^{-1} + V D^{-1} U)^{-1} V D^{-1} \tag{10.2} \\ |D + U B V| &= |B^{-1} + V D^{-1}U | \times |B| \times |D|. \tag{10.3} \end{align}\]

Eq. (10.3) is also known as the matrix determinant lemma, although it’s clearly based on the same underlying principles as its inversion cousin (10.2). We refer to both as “Woodbury identities”, in part because GP likelihood and prediction application requires both inverse and determinant calculations. To build \(K_N = U C_n V + D\) for GPs under replication, take \(U = V^\top = \mathrm{Diag}(1_{a_1}, \dots, 1_{a_n})\), where \(1_k\) is a \(k\)-vector of ones, \(D = g \mathbb{I}_N\) with nugget \(g\), and \(B = C_n\). Later in Section 10.2, we’ll take \(D = \Lambda_N\) where \(\Lambda_N\) is a diagonal matrix of latent variances.

Before detailing how the “Woodbury trick” maps to prediction (kriging) and likelihood identities for \(\mathcal{O}(n^3)\) rather than \(\mathcal{O}(N^3)\) calculations in Section 10.1.2, consider how it helps operate on a full covariance structure \(K_N\) via its unique counterpart \(K_n = C_n + g A_n^{-1}\). The example below creates a matrix \(X_n\) with \(n=50\) unique rows and then builds \(X_N\) identical to \(X_n\) except that four of its rows have been replicated a number of times, leading to a much bigger \(N=150\)-sized matrix.

n <- 50
Xn <- matrix(runif(2*n), ncol=2)
Xn <- Xn[order(Xn[,1]),]
ai <- c(8, 27, 38, 45)
mult <- rep(1, n)
mult[ai] <- c(25, 30, 9, 40)
XN <- Xn[rep(1:n, times=mult),]
N <- sum(mult)

Code below calculates covariance matrices \(K_n\) and \(K_N\) corresponding to \(X_n\) and \(X_N\), respectively. Rather than use our usual distance or covar.sep from plgp (Gramacy 2014), covariance matrix generation subroutines from hetGP are introduced here as an alternative. An arbitrary lengthscale of \(\theta=0.2\) and (default) nugget of \(g=\epsilon\) is used with a (default) isotropic Gaussian kernel. Separable Gaussian, and isotropic and separable Matérn \(\nu \in \{3/2, 5/2\}\) kernels are also supported.

KN <- cov_gen(XN, theta=0.2)
Kn <- cov_gen(Xn, theta=0.2)

Next, build \(U\) for use in the Woodbury identities (10.2)–(10.3).

U <- c(1, rep(0, n-1))
for(i in 2:n){
  tmp <- rep(0, n)
  tmp[i] <- 1
  U <- rbind(U, matrix(rep(tmp, mult[i]), nrow=mult[i], byrow=TRUE))
}
U <- U[,n:1]

Figure 10.1 shows these three matrices side-by-side, illustrating the mapping \(K_n \rightarrow U \rightarrow K_N\) with choices of \(B\) and \(U=V^\top\) described after Eq. (10.3) above. Note that in this case \(C_n = K_n\) since nugget \(g\) is essentially zero.

cols <- heat.colors(128)
layout(matrix(c(1, 2, 3), 1, 3, byrow=TRUE), widths=c(2, 1, 2))
image(Kn, x=1:n, y=1:n, main="uniq-n: Kn", xlab="1:n", ylab="1:n", col=cols)
image(t(U), x=1:n, y=1:N, asp=1, main="U", xlab="1:n", ylab="1:N", col=cols)
image(KN, x=1:N, y=1:N, main="full-N: KN", xlab="1:N", ylab="1:N", col=cols)

FIGURE 10.1: Example mapping \(K_n \rightarrow U \rightarrow K_N\) through the Woodbury identities (10.2)–(10.3).

Storage of \(K_n\) and \(U\), which may be represented sparsely (or even implicitly), is not only more compact than \(K_N\); the Woodbury formulas also show how to calculate requisite inverses and determinants by acting on \(\mathcal{O}(n^2)\) quantities rather than \(\mathcal{O}(N^2)\) ones.
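As a quick numerical check of (10.2)–(10.3) on these objects, the sketch below rebuilds the \(N \times n\) mapping matrix directly (without the column reversal applied above before plotting), confirms that it reconstructs \(K_N\), and then verifies both identities using a hypothetical nugget \(g=0.1\) and a small jitter on \(C_n\), both larger than the defaults, chosen only so that every inverse involved is comfortably stable.

g <- 0.1                                      # hypothetical nugget for D = g*I_N
Umap <- diag(n)[rep(1:n, times=mult),]        # N x n replicate-mapping matrix, U = V^T
max(abs(Umap %*% Kn %*% t(Umap) - KN))        # reconstructs K_N from Figure 10.1
B <- Kn + diag(0.01, n)                       # B = C_n, jittered so solve(B) is stable
KN.full <- Umap %*% B %*% t(Umap) + diag(g, N)
Dinv <- diag(1/g, N)
KN.inv <- Dinv - Dinv %*% Umap %*% 
  solve(solve(B) + t(Umap) %*% Dinv %*% Umap) %*% t(Umap) %*% Dinv
range(KN.inv %*% KN.full - diag(N))           # inverse identity (10.2): near zero
ld <- determinant(KN.full)$modulus            # determinant lemma (10.3), log scale
ldw <- determinant(solve(B) + t(Umap) %*% Dinv %*% Umap)$modulus + 
  determinant(B)$modulus + N*log(g)
c(direct=ld, woodbury=ldw)                    # should agree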

10.1.2 Efficient inference and prediction under replication

Here the aim is to make SK simultaneously more general, exact, and prescriptive, facilitating full likelihood-based inference and conditionally (on hyperparameters) exact predictive equations with the Woodbury trick. Pushing the matrix inverse identity (10.2) through to the predictive equations, mapping \(\nu C_N + S_N = \nu (C_N + \Lambda_N) \equiv \nu (C_N + g \mathbb{I}_N)\) between SK and more conventional Chapter 5 notation, yields the following predictive identities:

\[\begin{align} \nu k_N^\top(x) (\nu C_N + S_N)^{-1} Y_N &= k_n^\top(x) (C_n + \Lambda_n A_n^{-1})^{-1} \bar{Y}_n \tag{10.4} \\ vk - \nu^2 k_N^\top(x) (\nu C_N + S_N)^{-1} k_N(x) &= vk - \nu k_n^\top(x) (C_n + \Lambda_n A_n^{-1})^{-1} k_n(x), \notag \end{align}\]

where \(vk \equiv \nu K_\theta(x, x)\) is used as a shorthand to save horizontal space.

In words, typical full-\(N\) predictive quantities may be calculated identically via unique-\(n\) counterparts, potentially yielding dramatic savings in computational time and space. The unique-\(n\) predictor (10.4), implicitly defining \(\mu_n(x)\) and \(\sigma_n^2(x)\) on the right-hand-side(s) and updating SK (10.1), is unbiased and minimizes MSPE by virtue of the fact that those properties hold for the full-\(N\) one on the left-hand-side(s). No asymptotic or frequentist arguments are required, and crucially no minimum amount of data or number of replicates (e.g., \(a_i \geq 10\) for SK) is needed, although replication can still be helpful from a statistical efficiency perspective. See Section 10.3.1.

The same trick can be played with the concentrated log likelihood (5.8). Recall that \(K_n = C_n + \Lambda_n A_n^{-1}\), where for now \(\Lambda_n = g \mathbb{I}_n\), encoding homoskedasticity. Later we shall generalize \(\Lambda_n\) for the heteroskedastic setting. Using these quantities and Eqs. (10.2)–(10.3) simultaneously,

\[\begin{align} \ell &= c - \frac{N}{2} \log \hat{\nu}_N - \frac{1}{2} \sum_{i=1}^n [(a_i - 1) \log \lambda_i + \log a_i] - \frac{1}{2} \log | K_n |, \tag{10.5} \\ \mbox{where} \;\; \hat{\nu}_N &= \frac{1}{N} (Y_N^\top \Lambda_N^{-1} Y_N - \bar{Y}_n^\top A_n \Lambda_n^{-1} \bar{Y}_n + \bar{Y}_n^\top K_n^{-1} \bar{Y}_n). \notag \end{align}\]

Notice the additional terms in \(\hat{\nu}_N\) compared with \(\hat{\nu}_n = n^{-1} \bar{Y}_n^\top K_n^{-1} \bar{Y}_n\). Thus \(N^{-1} Y_N^\top \Lambda_N^{-1} Y_N\) is our missing statistic for sufficiency. Since \(\Lambda_N\) is diagonal, evaluation of \(\ell\) requires just \(\mathcal{O}(n^3)\) operations beyond the \(\mathcal{O}(N)\) required for \(\bar{Y}_n\). Closed form derivative evaluations are available in \(\mathcal{O}(n^3)\) time too, facilitating library-based optimization, e.g., with BFGS.

\[\begin{align} \frac{\partial \ell}{\partial \cdot} &= -\frac{N}{2} \frac{\partial ( Y_N^\top \Lambda_N^{-1} Y_N - \bar{Y}_n^\top A_n \Lambda_n^{-1} \bar{Y}_n + n \hat{\nu}_n)}{\partial \cdot } \times (N \hat{\nu}_N)^{-1} \notag \\ &- \frac{1}{2} \sum_{i=1}^n \left[(a_i - 1) \frac{\partial \log \lambda_i}{\partial \cdot} \right] - \frac{1}{2} \mathrm{tr} \left(K_n^{-1} \frac{\partial K_n}{\partial \cdot} \right), \tag{10.6} \end{align}\]

Above, “\(\cdot\)” is a placeholder for hyperparameter(s) of interest. Fast likelihood and derivative evaluation when \(N \gg n\) can be computationally much more efficient than the schemes introduced in Chapter 5, yet still resides under an otherwise identical numerical umbrella.
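As a sanity check that nothing is lost in the unique-\(n\) representation, the sketch below (not a hetGP feature, just plain R) evaluates Eq. (10.5) from find_reps summaries and compares it to a brute-force full-\(N\) Gaussian log likelihood, reusing the replicated design \(X_N\) from Section 10.1.1 with toy responses and arbitrary fixed hyperparameters \(\theta\) and \(g\) (homoskedastic, so \(\lambda_i = g\)).

theta <- 0.2; g <- 0.1                          # arbitrary fixed hyperparameters
YN <- 5*XN[,1]*XN[,2] + rnorm(N, sd=sqrt(g))    # toy responses on replicated XN
ur <- find_reps(XN, YN)                         # unique-n summaries
a <- ur$mult; ybar <- ur$Z0
Kn.u <- cov_gen(ur$X0, theta=theta) + diag(g/a) # K_n = C_n + g*A_n^{-1}
nuN <- (sum(YN^2)/g - sum(a*ybar^2)/g + 
  drop(ybar %*% solve(Kn.u) %*% ybar))/N        # hat{nu}_N from Eq. (10.5)
ell.uniq <- -N*(log(2*pi) + log(nuN) + 1)/2 - 
  sum((a - 1)*log(g) + log(a))/2 - determinant(Kn.u)$modulus/2
KN.f <- cov_gen(XN, theta=theta) + diag(g, N)   # brute-force full-N version
nuf <- drop(YN %*% solve(KN.f) %*% YN)/N
ell.full <- -N*(log(2*pi) + log(nuf) + 1)/2 - determinant(KN.f)$modulus/2
c(uniq=ell.uniq, full=ell.full)                 # should agree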

Example: computational advantage of replication

To demonstrate the potential benefit of this Woodbury mapping, consider the following modestly-sized example based on our favorite 2d test function (Chapter 5.1.2). You may recall we leveraged replication in an extension of this example (Chapter 5.2.4) in order to control signal-to-noise behavior in repeated Rmarkdown builds. Here the response is observed at \(n=100\) unique input locations, each having a random number of replicates \(a_i \sim \mathrm{Unif}\{1,2,\dots,50\}\). Otherwise the setup is identical to previous (noisy) uses of this data-generating mechanism.

library(lhs)
Xbar <- randomLHS(100, 2)                  # n=100 unique sites
Xbar[,1] <- (Xbar[,1] - 0.5)*6 + 1         # scale LHS to [-2, 4]^2
Xbar[,2] <- (Xbar[,2] - 0.5)*6 + 1
ytrue <- Xbar[,1]*exp(-Xbar[,1]^2 - Xbar[,2]^2)
a <- sample(1:50, 100, replace = TRUE)     # random replicate counts a_i
N <- sum(a)
X <- matrix(NA, ncol=2, nrow=N)
y <- rep(NA, N)
nf <- 0
for(i in 1:100) {                          # build full-N design with noisy replicates
  X[(nf+1):(nf+a[i]),] <- matrix(rep(Xbar[i,], a[i]), ncol=2, byrow=TRUE)
  y[(nf+1):(nf+a[i])] <- ytrue[i] + rnorm(a[i], sd=0.01)
  nf <- nf + a[i]
}

The code below invokes mleHomGP from the hetGP package in two ways. One cripples automatic pre-processing subroutines that would otherwise calculate summary statistics for unique-\(n\) modeling, forcing a more cumbersome full-\(N\) calculation. The other does things the default, more thrifty unique-\(n\) way. In both cases, mleHomGP serves as a wrapper automating calls to optim with method="L-BFGS-B", furnishing implementations of log likelihood (10.5) and derivative (10.6).

Lwr <- rep(sqrt(.Machine$double.eps), 2) 
Upr <- rep(10, 2)
fN <- mleHomGP(list(X0=X, Z0=y, mult=rep(1, N)), y, Lwr, Upr)
un <- mleHomGP(X, y, Lwr, Upr)

Repeating search bounds Lwr and Upr to match the input dimension invokes a separable kernel, which is Gaussian by default. Execution times saved by the mleHomGP calls above show a dramatic difference.

c(fN=fN$time, un=un$time)
## fN.elapsed un.elapsed 
##     168.42       0.03

Calculations on unique-\(n\) quantities via Woodbury identities result in execution that is 5614 times faster than an otherwise equivalent full-\(N\) analog. The outcome of those calculations, exemplified below through a reporting of the estimated lengthscales, is nearly identical in both versions.

rbind(fN=fN$theta, un=un$theta)
##     [,1]  [,2]
## fN 1.092 1.941
## un 1.092 1.942

Excepting an SK feature offered by the mlegp and DiceKriging packages on CRAN, I’m not aware of any other software package, for R or otherwise, that automatically pre-processes data into a format leveraging Woodbury identities to speed up calculations in the face of heavy replication. Repeated sampling is a common design tactic in the face of noisy processes, and the example above demonstrates the potential for substantial computational benefit from careful handling when modeling with GPs. Replication is essential when signal-to-noise features may exhibit heterogeneity across the input space. Whenever two things are changing simultaneously, a means of pinning down one – in this case variance, if only locally – can be a game-changer.

10.2 Coupled mean and variance GPs

GP kernels based on Euclidean distance (e.g., Chapter 5 or above) emit stationary processes whose input–output behavior exhibits highly regular dynamics throughout the input space. Yet we saw in Chapters 9.2.2–9.3 that many data-generating mechanisms are at odds with stationarity. A process can diverge from stationarity in various ways, but few are well accommodated by computationally viable modeling methodology – exceptions in Chapter 9 notwithstanding. Input-dependent variance, or heteroskedasticity, is a particular form of nonstationarity that is increasingly encountered in stochastic simulation. A fine example is the motorcycle accident data introduced in Chapter 9.2.1. Our TGP fit to these data, shown in Figure 9.11, nicely captures noise regime changes from low, to high, to medium as inputs track from left to right. Accommodating those shifts required partitioning/hard breaks in the predictive surface which (we hoped) could be smoothed over by MCMC sampling from the Bayesian posterior on trees. Entertaining a smoother alternative might be worthwhile.

To lay the groundwork for a hetGP-based alternative, and remind the reader of the setting and challenges involved, consider the following ordinary homoskedastic GP fit to these data, followed by prediction on a testing grid.

library(MASS)
hom <- mleHomGP(mcycle$times, mcycle$accel)
Xgrid <- matrix(seq(0, 60, length=301), ncol=1)
p <- predict(x=Xgrid, object=hom)
df <- data.frame(table(hom$mult))
colnames(df) <- c("reps", "howmany")
rownames(df) <- NULL
df
##   reps howmany
## 1    1      66
## 2    2      22
## 3    3       3
## 4    4       2
## 5    6       1

Considering the importance of separating signal from noise in this simulation experiment, perhaps it’s not surprising that the design includes moderate replication. Figure 10.2 shows the resulting predictive surface overlaid on a scatter plot of the data. The solid red line is the predictive mean, and the dashed red lines trace out a 90% predictive interval. Output from predict.homGP separates variance in terms of mean (p$sd2) and residual (p$nugs) estimates, which are combined in the figure to show full predictive uncertainty.

plot(mcycle)
lines(Xgrid, p$mean)
lines(Xgrid, qnorm(0.05, p$mean, sqrt(p$sd2 + p$nugs)), col=2, lty=2)
lines(Xgrid, qnorm(0.95, p$mean, sqrt(p$sd2 + p$nugs)), col=2, lty=2)

FIGURE 10.2: Homoskedastic GP fit to the motorcycle data via mean (solid red) and 90 percent error bars (dashed red).

As with the (Bayesian) stationary GP fit shown in Figure 9.10, this fit is undesirable on several fronts. Not only is the variance off, but the mean is off too: it’s way too wiggly in the left-hand, pre-impact regime. Perhaps this is due to noise in the right-hand regime being incorrectly interpreted as signal, which then bleeds into the rest of the fit because the covariance structure is stationary. Learning how the variance is changing is key to learning about mean dynamics. Replication is key to learning about variance.

10.2.1 Latent variance process

Heteroskedastic GP modeling targets learning the diagonal matrix \(\Lambda_N\) (or its unique-\(n\) counterpart \(\Lambda_n\); see Eq. (10.4)). Allowing the \(\lambda_i\) values to exhibit heterogeneity, i.e., not all \(\lambda_i = g/\nu\) as in the homoskedastic case, is easier said than done. Care is required when performing inference for such a high dimensional parameter. SK (10.1) suggests taking \(\Lambda_n A_n^{-1} = S_n\), i.e., \(\lambda_i = \hat{\sigma}^2_i\), but that requires large numbers of replicates \(a_i\) and is anyway useless out of sample. By fitting each \(\hat{\sigma}_i^2\) separately there is no pooling effect for interpolation/smoothing. Ankenman, Nelson, and Staum (2010) suggest the quick fix of fitting a second GP to the variance observations with “data”:

\[ (\bar{x}_1, \hat{\sigma}_1^2), (\bar{x}_2, \hat{\sigma}_2^2), \dots, (\bar{x}_n, \hat{\sigma}_n^2), \]

to obtain a smoothed variance for use out of sample.
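As a quick sketch of that two-stage idea on the motorcycle data from Figure 10.2, the code below fits a second, homoskedastic GP to moment-based variances at the inputs having \(a_i > 1\). Nothing guarantees positive out-of-sample predictions from such a fit, which is one of several reasons to prefer the coupled approach described next.

ur <- find_reps(mcycle[,1], mcycle[,2])
X0u <- matrix(ur$X0, ncol=1)                # unique inputs as a 1-column matrix
idx <- ur$mult > 1                          # sites with replicates
s2hat <- sapply(ur$Zlist[idx], var)         # moment-based variance estimates
vfit <- mleHomGP(X0u[idx,,drop=FALSE], s2hat)
s2smooth <- predict(vfit, Xgrid)$mean       # smoothed variances out of sample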

A more satisfying approach that is similar in spirit, coming from the machine learning literature (Goldberg, Williams, and Bishop 1998) and actually predating SK by more than a decade, involves applying regularization in the form of a prior that encourages a smooth evolution of variance. Goldberg, et al., introduce latent (log) variance variables under a GP prior and develop an MCMC scheme performing joint inference of all unknowns, including hyperparameters for both mean and noise GPs and latent variances. The overall method, which requires on the order of \(\mathcal{O}(TN^4)\) computation for \(T\) MCMC samples, is impractical for all but small problems, where it works well. Several authors have economized on aspects of this framework (Kersting et al. 2007; Lazaro–Gredilla and Titsias 2011) with approximations and simplifications of various sorts, but none to my knowledge have resulted in public R software. The key ingredient in these works, of latent variance quantities smoothed by a GP, has merits and can be effective when handled gingerly. Binois, Gramacy, and Ludkovski (2018) introduced the setup described below.

Let \(\delta_1, \delta_2, \dots, \delta_n\) denote latent variance variables (or equivalently latent nuggets), each corresponding to one of the \(n\) unique design sites \(\bar{x}_i\) under study. It’s important to introduce latents only for the unique-\(n\) locations. A similar approach on the full-\(N\) setup, i.e., without exploiting the Woodbury identities, is fraught with identifiability and numerical stability challenges. Store these latents diagonally in a matrix \(\Delta_n\) and place them under a zero-mean GP prior,

\[ \Delta_n \sim \mathcal{N}_n(0, \nu_{(\delta)} (C_{(\delta)} + g_{(\delta)} A_n^{-1})). \]

Euclidean distance-based correlations in the \(n \times n\) matrix \(C_{(\delta)}\) are hyperparameterized by novel lengthscales \(\theta_{(\delta)}\) and nugget \(g_{(\delta)}\). Smoothed \(\lambda_i\) values can be calculated by plugging \(\Delta_n\) into GP mean predictive equations:

\[\begin{equation} \Lambda_n = C_{(\delta)} K_{(\delta)}^{-1} \Delta_n, \quad \mbox{ where } \quad K_{(\delta)} = (C_{(\delta)} + g_{(\delta)} A_n^{-1}). \tag{10.7} \end{equation}\]
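The smoothing in Eq. (10.7) is just a GP mean prediction applied to the latents. The toy sketch below, with made-up \(\delta_i\) values, replicate counts and hypothetical hyperparameters \(\theta_{(\delta)}\) and \(g_{(\delta)}\), spells that out.

Xd <- matrix(seq(0, 1, length=5), ncol=1)   # five unique sites
Delta <- c(0.10, 0.30, 0.20, 0.60, 0.50)    # toy latent nuggets delta_i
ad <- c(3, 1, 5, 2, 4)                      # replicate counts a_i
theta.d <- 0.4; g.d <- 0.05                 # hypothetical theta_(delta), g_(delta)
Cd <- cov_gen(Xd, theta=theta.d)
Kd <- Cd + diag(g.d/ad)                     # K_(delta) = C_(delta) + g_(delta)*A_n^{-1}
Lambda <- drop(Cd %*% solve(Kd) %*% Delta)  # smoothed lambda_i via Eq. (10.7)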

Smoothly varying \(\Lambda_n\) generalize both \(\Lambda_n = g \mathbb{I}_n\) from a homoskedastic setup and the moment-estimated \(S_n\) from SK. Hyperparameter \(g_{(\delta)}\) is a nugget of nuggets controlling the smoothness of the \(\lambda_i\)’s relative to the \(\delta_i\)’s; when \(g_{(\delta)} = 0\) the \(\lambda_i\)’s interpolate the \(\delta_i\)’s. Although the formulation above uses a canonical zero-mean GP setup, a positive nonzero mean \(\mu_{(\delta)}\) may be preferable for variances. The hetGP package automates a classical plugin estimator of the mean for a GP, that is, \(\hat{\mu}_{(\delta)} = 1_n^\top K_{(\delta)}^{-1} \Delta_n (1_n^\top K_{(\delta)}^{-1} 1_n)^{-1}\). See exercise #2 from Chapter 5.5 and take an intercept-only \(\beta\). This seemingly simple twist would unnecessarily complicate the remainder of the exposition, so we shall continue to assume a zero mean throughout.

Variances must be positive, and the equations above give nonzero probability to negative \(\delta_i\) and \(\lambda_i\) values. One solution is to threshold latent variances at zero. Another is to model \(\log \Delta_n\) in this way instead. The latter option, guaranteeing positive variance after exponentiating, is the default in hetGP but both are implemented. Differences in empirical performance and mathematical development are slight. Modeling \(\log \Delta\) makes derivative expressions slightly more complicated after applying the chain rule, so our exposition here shall continue with the simpler un-logged setup. Logged analogues are left to an exercise in Section 10.4.

So far the setup isn’t much different than what Goldberg, Williams, and Bishop (1998) described more than twenty years ago, except for an emphasis here on unique-\(n\) latents. Rather than cumbersome MCMC, Binois, et al., describe how to stay within a (Woodbury) MLE framework, by defining a joint log-likelihood over both mean and variance GPs:

\[\begin{align} \tilde{\ell} &= c - \frac{N}{2} \log \hat{\nu}_N - \frac{1}{2} \sum\limits_{i=1}^n \left[(a_i - 1)\log \lambda_i + \log a_i \right] - \frac{1}{2} \log |K_n| \tag{10.8} \\ & - \frac{n}{2} \log \hat{\nu}_{(\delta)} - \frac{1}{2} \log |K_{(\delta)}|. \notag \end{align}\]

Maximization is assisted by closed-form derivatives which may be evaluated with respect to all unknown quantities in \(\mathcal{O}(n^3)\) time. For example, the derivative with respect to latent \(\Delta_n\) may be derived as follows.

\[ \begin{aligned} \frac{\partial \tilde{\ell}}{\partial \Delta_n} &= \frac{\partial \Lambda_n}{\partial \Delta_n} \frac{\partial \tilde{\ell}}{\partial \Lambda_n} = C_{(\delta)} K_{(\delta)}^{-1} \frac{\partial \tilde{\ell}}{\partial \Lambda_n} - \frac{K_{(\delta)}^{-1} \Delta_n}{\hat{\nu}_{(\delta)}} \\ \mbox{ where } \quad \frac{\partial \tilde{\ell}}{\partial \lambda_i} &= \frac{N}{2} \times \frac{\frac{ a_i \hat{\sigma}_i^2} {\lambda_i^2} + \frac{(K_n^{-1} \bar{Y}_n)^2_i}{a_i}}{\hat{\nu}_N} - \frac{a_i - 1}{2\lambda_i} - \frac{1}{2a_i} (K_n)_{i, i}^{-1} \end{aligned} \]

Here \(\hat{\sigma}_i^2 = \frac{1}{a_i} \sum_{j=1}^{a_i} (y_i^{(j)} - \bar{y}_i)^2\), the (biased) MLE analog of the earlier moment estimate. So one interpretation of Eq. (10.8) is as an extension of the SK estimate \(\hat{\sigma}_i^2\) at \(\bar{x}_i\). In contrast with SK, GP smoothing provides the regularization needed in order to accommodate small numbers of replicates, even \(a_i = 1\) in spite of \(\hat{\sigma}_i^2 = 0\). In this single-replicate case, \(y_i\) still contributes to local variance estimates via the rest of Eq. (10.8). Note that Eq. (10.8) is not constant in \(\delta_i\); in fact it depends, through (10.7), on all of \(\Delta_n\). Smoothing may be entertained on other quantities, e.g., \(\Lambda_n \hat{\nu}_N = C_{(\delta)} K_{(\delta)}^{-1} S_n^2\) presuming \(a_i > 1\), resulting in smoothed moment-based variance estimates in the style of SK (Kamiński 2015; Wang and Chen 2016). There may similarly be scope for bypassing a latent GP noise process with a SiNK predictor (Lee and Owen 2015) by taking

\[ \Lambda_n = \rho(\bar{X}_n)^{-1} C_{(\delta)} K_{(\delta)}^{-1}\Delta_n \quad \mbox{ with } \quad \rho(x) = \sqrt{\hat{\nu}_{(\delta)} c_{(\delta)}(x)^\top K_{(\delta)}^{-1} c_{(\delta)}(x)}, \]

or alternatively via \(\log \Lambda_n\).

Binois, Gramacy, and Ludkovski (2018) show that \(\tilde{\ell}\) in Eq. (10.8) is maximized when \(\Delta_n = \Lambda_n\) and \(g_{(\delta)}=0\). In other words, smoothing latent nuggets (10.7) is unnecessary at the MLE. However, intermediate smoothing is useful as a device in three ways: (1) connecting SK to Goldberg’s latent representation; (2) annealing to avoid local minima; and (3) yielding a smooth solution when the numerical optimizer is stopped prematurely, which may be essential in big data (large unique-\(n\)) contexts.

It’s amazing that a simple optim call could be brought to bear in such a high dimensional setting. Thoughtful initialization helps. Residuals from an initial homoskedastic fit can prime \(\Delta_n\) values. For more implementation details and options, see Section 3.2 of Binois and Gramacy (2018). \(\mathcal{O}(n)\) latent variables can get big, especially as higher dimensional examples demand training data sets with more unique inputs. This is where replication really shines, inducing a likelihood that locks in high-replicate latents early on in the optimization. The Woodbury trick (Section 10.1.1) is essential here. Beyond simply being inefficient, having multiple latents \(\delta_i^{(j)}\) at a unique \(\bar{x}_i\) corrupts optimization by inserting numerical weaknesses into the log likelihood. Goldberg, Williams, and Bishop (1998)’s MCMC is more forgiving in this regard. Monte Carlo is a robust, if (very) slowly converging, numerical procedure. Besides that nuance favoring a Woodbury unique-\(n\) formulation, the log likelihood surface is exceedingly well behaved. Critics have suggested EM to integrate out latent variances. But this simply doesn’t work except in small problems; EM represents nearly as much work as MCMC. Kersting et al. (2007) and Quadrianto et al. (2009) keyed into this fact a decade ago, but without the connection to SK, replication, and the Woodbury trick.

10.2.2 Illustrations with hetGP

Passages below take the reader through a cascade of examples illustrating hetGP: starting simple by returning to the motorcycle accident data, and progressing up through “real”-simulation examples from epidemiology to Bayesian model selection and inventory management.

Motorcycle accident data

The code below shows how to fit a heteroskedastic, coupled mean and variance GP, via a smoothed initialization and Matérn covariance structure. This is but one of several potential ways to obtain a good fit to these data using the methods provided by hetGP.

het <- mleHetGP(mcycle$times, mcycle$accel, covtype="Matern5_2", 
  settings=list(initStrategy="smoothed"))
het$time
## elapsed 
##   0.335

Time required to perform the calculations is trivial, although admittedly this is not a big problem. Built-in predict.hetGP works the same as its .homGP cousin, illustrated earlier.

p2 <- predict(het, Xgrid)
ql <- qnorm(0.05, p2$mean, sqrt(p2$sd2 + p2$nugs))
qu <- qnorm(0.95, p2$mean, sqrt(p2$sd2 + p2$nugs))

Figure 10.3 shows the resulting predictive surface in two views. The first view, in the left panel, complements Figure 10.2. Observe how the surface is heteroskedastic, learning the low-variance region in the first third of inputs and higher variance for the latter two-thirds. As a consequence of being able to better track signal-to-noise over the input space, extraction of signal (particularly for the first third of times) is better than in the homoskedastic case.

par(mfrow=c(1,2))
plot(mcycle, ylim=c(-160, 90), ylab="acc", xlab="time", 
  main="predictive surface")
lines(Xgrid, p2$mean)
lines(Xgrid, ql, col=2, lty=2)
lines(Xgrid, qu, col=2, lty=2)
plot(Xgrid, p2$nugs, type="l", ylab="s2", xlab="times", 
  main="variance surface", ylim=c(0,2e3))
points(het$X0, sapply(find_reps(mcycle[,1], mcycle[,2])$Zlist, var), 
  col=3, pch=20)

FIGURE 10.3: Heteroskedastic GP fit to the motorcycle data. Left panel shows the predictive distribution via mean (solid red) and 90% error bars (dashed red). Compare to Figure 10.2. Right panel shows the estimated variance surface and moment-based estimates of variance.

The second view, in the right panel of the figure, provides more detail on latent variances. Predictive uncertainty is highest for the middle third of the times, which makes sense because this is where the whiplash effect is most prominent. Green dots in that panel indicate moment-based estimates of variance obtained from a limited number of replicates \(a_i > 1\) available for some inputs. (There are nowhere near enough for an SK-like approach.) Observe how the black curve, smoothing latent variance values \(\Delta_n\), extracts the essence of the pattern of those values. This variance fit uses the full data set, leveraging smoothness of the GP prior to incorporate responses at inputs with only one replicate, which in this case represents most of the data.
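One rough way to quantify the improvement over the homoskedastic fit in Figure 10.2 is to compare proper scores on the training data, using the same score that features in the ATO example later in this section (higher is better). Being in-sample, this is only suggestive rather than a substitute for genuine out-of-sample validation.

Xm <- matrix(mcycle$times, ncol=1)
p.hom <- predict(hom, Xm)
p.het <- predict(het, Xm)
s2.hom <- p.hom$sd2 + p.hom$nugs
s2.het <- p.het$sd2 + p.het$nugs
c(hom=mean(-(mcycle$accel - p.hom$mean)^2/s2.hom - log(s2.hom)),
  het=mean(-(mcycle$accel - p.het$mean)^2/s2.het - log(s2.het)))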

Susceptible, infected, recovered

Hu and Ludkovski (2017) describe a 2d simulation arising from the spread of an epidemic in a susceptible, infected, recovered (SIR) setting, a common stochastic compartmental model. Inputs are \((S_0, I_0)\), the numbers of initial susceptible and infected individuals. Output is the total number of newly infected individuals, which may be calculated by approximating

\[ f(x) := \mathbb{E}\left\{ S_0 - \lim_{T \to \infty} S_T \mid (S_0,I_0,R_0) = x \right\} = \gamma \mathbb{E} \left\{ \int_0^\infty I_t \,dt \mid x \right\} \]

under continuous time Markov dynamics with transitions \(S + I \rightarrow 2I\) and \(I \rightarrow R\). The resulting surface is heteroskedastic and has some very-high noise regions. Parts of the input space represent highly volatile regimes wherein random chance intimately affects whether or not epidemics spread quickly or die out.

A function generating the data for standardized inputs in the unit square (corresponding to coded \(S_0\) and \(I_0\)) is provided by sirEval in the hetGP package. Output is also standardized. Consider a space-filling design of \(n=200\) unique runs under random replication \(a_i \sim \mathrm{Unif}\{1,\dots, 100\}\). Coordinate \(x_1\) represents the initial number of infecteds \(I_0\), and \(x_2\) is the initial number of susceptibles \(S_0\).

Xbar <- randomLHS(200, 2)
a <- sample(1:100, nrow(Xbar), replace=TRUE)
X <- matrix(NA, ncol=2, nrow=sum(a))
nf <- 0
for(i in 1:nrow(Xbar)) {
  X[(nf+1):(nf+a[i]),] <- matrix(rep(Xbar[i,], a[i]), ncol=2, byrow=TRUE)
  nf <- nf + a[i]
}
nf
## [1] 10407

The result is a full data set with about ten thousand runs. Code below gathers responses – expected total number of infecteds at the end of the simulation – at each input location in the design, including replicates.

Y <- apply(X, 1, sirEval)

Code below fits our hetGP model. By default, lengthscales for the variance GP (\(\theta_{(\delta)}\)) are linked to those from the mean GP (\(\theta\)), requiring that the former be a scalar multiple \(k>1\) of the latter. That feature can be switched off, however, as illustrated below. Specifying scalar lower and upper signals an isotropic kernel.

fit <- mleHetGP(X, Y, settings=list(linkThetas="none"),
  covtype="Matern5_2", maxit=1e4)
fit$time
## elapsed 
##   3.263

Around 3.3 seconds are needed to train the model. To visualize the resulting predictive surface, code below creates a dense grid in 2d and calls predict.hetGP on the output object.

xx <- seq(0, 1, length=100)
XX <- as.matrix(expand.grid(xx, xx))
psir <- predict(fit, XX)
vsir <- psir$sd2 + psir$nugs

Figure 10.4 shows the resulting predictive surface. The left panel in the figure captures predictive means; the right panel shows predictive standard deviations. Text overlaid on the panels indicates the locations of the training data inputs and the numbers of replicates observed thereupon.

par(mfrow=c(1,2))
image(xx, xx, matrix(psir$mean, ncol=100), xlab="S0", ylab="I0", 
  col=cols, main="mean infected")
text(Xbar, labels=a, cex=0.5)
image(xx, xx, matrix(sqrt(vsir), ncol=100), xlab="S0", ylab="I0", 
  col=cols, main="sd infected")
text(Xbar, labels=a, cex=0.5)

FIGURE 10.4: Heteroskedastic GP fit to the SIR data showing the predictive mean surface (left) and estimated standard deviation (right). Text in both panels shows numbers of replicates.

Notice in the figure how the mean is changing more slowly than the variance. Un-linking their lengthscales was probably the right move. Interpreting the mean surface, the number of infecteds is maximized when the initial \(S_0\) and/or \(I_0\) is high. That makes sense. Predictive uncertainty in the total number of infecteds is greatest when there are a large number of initial susceptible individuals, but a moderate number of initial infecteds. That too makes sense. A moderate injection of disease into a large (vulnerable) population could swell to spell disaster.

Bayesian model selection

Model selection by Bayes factor (BF) is known to be sensitive to hyperparameter settings, which is further complicated and obscured by Monte Carlo evaluation injecting a substantial source of noise. To study BF surfaces in such settings, Franck and Gramacy (2020) propose treating expensive BF evaluations, with MCMC say, as a (stochastic) computer simulation experiment. BF calculations at a space-filling design in the (input) hyperparameter space can be used to map and thus better understand sensitivity of model selection to those settings.

As a simple warmup example, consider an experiment described by Section 3.3-3.4 of Gramacy and Pantaleo (2010) where BF calculations determine whether data is leptokurtic (Student-\(t\) errors) or not (simply Gaussian). Here we study BF approximation as a function of prior parameterization on Student-\(t\) degrees of freedom parameter \(\nu\), which Gramacy and Pantaleo took as \(\nu \sim \mathrm{Exp}(\theta = 0.1)\). Their intention was to be diffuse, but ultimately they lacked an appropriate framework for studying sensitivity to this choice. Franck, et al., designed a grid of \(\theta\) values, evenly spaced in \(\log_{10}\) space from \(10^{-3}\) to \(10^6\) spanning “solidly Student-\(t\)” (even Cauchy) to “essentially Gaussian” prior specifications for \(\mathbb{E}\{\nu\}\). Each \(\theta_i\) gets its own reversible jump MCMC simulation, as automated by blasso in the monomvn package (Gramacy 2018b) on CRAN, taking about 36 minutes on a 4.2 GHz Intel Core i7 processor. Converting those posterior draws into \(\mathrm{BF}_{\mathrm{St}\mathcal{N}}^{(i)}\) evaluations follows a post-processing scheme described by Jacquier, Polson, and Rossi (2004). To better understand Monte Carlo variability in those calculations, ten replicates of BFs under each hyperparameter setting \(\theta_i\) were collected. Collecting 200 BFs in this way takes about 120 hours. These data are saved in the hetGP package for later analysis.

data(bfs)
thetas <- matrix(bfs.exp$theta, ncol=1)
bfs <- as.matrix(t(bfs.exp[,-1]))

BF evaluation by MCMC is notoriously unstable. We shall see that even in log-log space, the \(\mathrm{BF}_{\mathrm{St}\mathcal{N}}\) process is not only heteroskedastic but also has heavy tails. Consequently Franck and Gramacy (2020) fit a heteroskedastic Student-\(t\) process. Details are provided in Section 3.3 of Binois and Gramacy (2018) and in Chung et al. (2019), who describe several other extensions including an additional layer of latent variables to handle missing data, and a scheme to enforce a monotonicity property. Both are motivated by a challenging class of inverse problems involved in characterizing the spread of influenza.

bfs1 <- mleHetTP(X=list(X0=log10(thetas), Z0=colMeans(log(bfs)), 
  mult=rep(nrow(bfs), ncol(bfs))), Z=log(as.numeric(bfs)), lower=1e-4, 
  upper=5, covtype="Matern5_2")

Predictive evaluations may be extracted on a grid in the input space …

dx <- seq(0, 1, length=100)
dx <- 10^(dx*4 - 3)
p <- predict(bfs1, matrix(log10(dx), ncol=1))

… and visualized in Figure 10.5. Each open circle is a \(\mathrm{BF}_{\mathrm{St}\mathcal{N}}\) evaluation, plotted in \(\log_{10}\)\(\log_e\) space.

matplot(log10(thetas), t(log(bfs)), col=1, pch=21, ylab="log(bf)")
lines(log10(dx), p$mean)
lines(log10(dx), p$mean + 2*sqrt(p$sd2 + p$nugs), col=2, lty=2)
lines(log10(dx), p$mean - 2*sqrt(p$sd2 + p$nugs), col=2, lty=2)
legend("topleft", c("hetTP mean", "hetTP interval"), col=1:2, lty=1:2)

FIGURE 10.5: Heteroskedastic TP fit to Bayes factor data under exponential hyperprior.

It bears repeating that the \(\mathrm{BF}_{\mathrm{St}\mathcal{N}}\) surface is heteroskedastic, even after log transform, and has heavy tails. A take-home message from these plots is that BF surfaces can be extremely sensitive to hyperparameterization. When \(\theta\) is small, the Student-\(t\) (\(\mathrm{BF}_{\mathrm{St}\mathcal{N}} < 1\)) is essentially a foregone conclusion, whereas if \(\theta\) is large, the Gaussian (\(\mathrm{BF}_{\mathrm{St}\mathcal{N}} > 1\)) is. This is an extremely negative result for model selection via BF in this setting: a seemingly innocuous hyperparameter setting is essentially determining the outcome of selection.

Although the computational burden involved in this experiment – 120 hours – is tolerable, extending the idea to higher dimensions is problematic. Suppose one wished to entertain \(\nu \sim \mathrm{Gamma}(\alpha, \beta)\), where the \(\alpha=1\) case reduces to \(\nu \sim \mathrm{Exp}(\beta \equiv \theta)\) above. Over a similarly dense hyperparameter grid, the runtime would balloon to more than one hundred days, which is clearly unreasonable. Instead it makes sense to build a surrogate model from a more limited space-filling design and use the resulting posterior predictive surface to understand variability in BFs in the hyperparameter space. Five replicate responses on a size \(n=80\) LHS in \(\alpha \times \beta\)-space were obtained for a total of \(N=400\) MCMC runs. These data are provided as bfs.gamma in the bfs data object in hetGP.

D <- as.matrix(bfs.gamma[,1:2])
bfs <- as.matrix(t(bfs.gamma[,-(1:2)]))

A similar hetTP fit may be obtained with the following command …

bfs2 <- mleHetTP(X=list(X0=log10(D), Z0=colMeans(log(bfs)), 
  mult=rep(nrow(bfs), ncol(bfs))), Z=log(as.numeric(bfs)), 
  lower=rep(1e-4, 2), upper=rep(5, 2), covtype="Matern5_2")

… followed by predictions on a dense grid in the 2d input space.

dx <- seq(0, 1, length=100)
dx <- 10^(dx*4 - 3)
DD <- as.matrix(expand.grid(dx, dx))
p <- predict(bfs2, log10(DD))

Figure 10.6 shows the outcome of that experiment: mean surface in the left panel and standard deviation in the right. Numbers overlaid indicate average \(\mathrm{BF}_{\mathrm{St}\mathcal{N}}\) obtained for the five replicates at each input location. Contours in the left panel correspond to levels on Jeffreys’ scale of interpretation.

par(mfrow=c(1,2))
mbfs <- colMeans(bfs)
image(log10(dx), log10(dx), t(matrix(p$mean, ncol=length(dx))), col=cols, 
  xlab="log10 alpha", ylab="log10 beta",  main="mean log BF")
text(log10(D[, 2]), log10(D[, 1]), signif(log(mbfs), 2), cex=0.5)
contour(log10(dx), log10(dx), t(matrix(p$mean, ncol=length(dx))),
  levels=c(-5,-3,-1,0,1,3,5), add=TRUE, col=4)
image(log10(dx), log10(dx), t(matrix(sqrt(p$sd2 + p$nugs), 
  ncol=length(dx))), col=cols, xlab="log10 alpha", 
  ylab="log10 beta", main="sd log BF")
text(log10(D[,2]), log10(D[,1]), signif(apply(log(bfs),2,sd), 2), cex=0.5)

FIGURE 10.6: Heteroskedastic TP fit to the Bayes factor data under Gamma hyperprior.

The story here is much the same as before in terms of \(\beta\), which maps to \(\theta\) in the earlier experiment, especially near \(\alpha=1\), or \(\log_{10} \alpha=0\), where the equivalence is exact. The left panel shows that along that slice one can get just about any model selection conclusion one wants. Smaller \(\alpha\) values tell a somewhat more nuanced story, however. A rather large range of smaller \(\alpha\) values leads to somewhat less sensitivity in the outcome, except when \(\beta\) is quite large. Apparently, having a small \(\alpha\) setting is essential if the data are going to have any influence on model selection via BF. The right panel shows that the variance surface is indeed changing over the input space, justifying the heteroskedastic surrogate.

Inventory management

The assemble-to-order (ATO) problem (Hong and Nelson 2006) involves a queueing simulation targeting inventory management scenarios. At its heart it’s an optimization (or reinforcement learning) problem; however, here we simply treat it as a response surface to be learned. Although the signal-to-noise ratio is relatively high, ATO simulations are known to be heteroskedastic, e.g., as illustrated by the documentation for the Matlab library we utilized for the simulations (Xie, Frazier, and Chick 2012).

The setup is as follows. A company manufactures \(m\) products. Products are built from base parts called items, some of which are “key” in that the product can’t be built without them. If a random request comes in for a product that’s missing a key item, a replenishment order is executed and filled after a random delay. Holding items in inventory is expensive, so a balance must be struck between inventory costs and revenue. Binois, Gramacy, and Ludkovski (2018) describe an experiment involving target stock vector inputs \(b \in \{0,1,\dots,20\}^8\) for eight items.

Code below replicates results from that experiment, which involved a uniform design of size \(n_{\mathrm{tot}}=2000\) in 8d with ten replicates for a total design size of \(N_{\mathrm{tot}}=20000\). A training–testing partition was constructed by randomly selecting \(n=1000\) unique locations and replicates \(a_i \sim \mathrm{Unif}\{1,\dots,10\}\) thereupon. The ato data object in hetGP contains one such random partition, which is subsequently coded into the unit cube \([0,1]^8\). Further detail is provided in the package documentation for the ato data object.

Actually the object provides two testing sets. One is a genuine out-of-sample set, where testing sites are exclusive of training locations. The other is replicate based, involving those \(10-a_i\) runs not selected for training. The training set is large, making MLE calculations slow, so the ato object additionally provides a fitted model for comparison. Examples in the ato documentation file provide code which may be used to reproduce that fit or to create a new one based on a novel training–testing partition.

data(ato)
c(n=nrow(Xtrain), N=length(unlist(Ztrain)), time=out$time)
##            n            N time.elapsed 
##         1000         5594         8584

Storing these objects facilitates fast illustration of prediction and out-of-sample comparison. It also provides a benchmark against which a more thoughtful sequential design scheme, introduced momentarily in Section 10.3.1, can be judged. Code below performs predictions at the held-out testing locations and then calculates a proper score (Eq. (27) from Gneiting and Raftery 2007) with ten replicates observed at each of those locations. Higher scores are better.
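In the notation of this chapter, the score computed in the code below for a predictive mean \(\mu(x)\) and variance \(\sigma^2(x)\) against a held-out \(y\) is

\[ \mathrm{score}(y, \mu(x), \sigma^2(x)) = -\frac{(y - \mu(x))^2}{\sigma^2(x)} - \log \sigma^2(x), \]

averaged over all held-out \((x, y)\) pairs.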

phet <- predict(out, Xtest)
phets2 <- phet$sd2 + phet$nugs
mhet <- as.numeric(t(matrix(rep(phet$mean, 10), ncol=10)))
s2het <- as.numeric(t(matrix(rep(phets2, 10), ncol=10)))
sehet <- (unlist(t(Ztest)) - mhet)^2
sc <- - sehet/s2het - log(s2het)
mean(sc)
## [1] 3.396

A similar calculation may be made for the held-out training replicates, shown below. These are technically out of sample, but accuracy is higher since training data was provided at these locations.

phet.out <- predict(out, Xtrain.out)
phets2.out <- phet.out$sd2 + phet.out$nugs
s2het.out <- mhet.out <- Ztrain.out
for(i in 1:length(mhet.out)) {
  mhet.out[[i]] <- rep(phet.out$mean[i], length(mhet.out[[i]]))
  s2het.out[[i]] <- rep(phets2.out[i], length(s2het.out[[i]]))
}
mhet.out <- unlist(t(mhet.out))
s2het.out <- unlist(t(s2het.out))
sehet.out <- (unlist(t(Ztrain.out)) - mhet.out)^2
sc.out <- - sehet.out/s2het.out - log(s2het.out)
mean(sc.out)
## [1] 5.085

These two testing sets may be combined to provide a single score calculated on the entire corpus of held-out data. As expected, the result is a compromise between the two score statistics calculated earlier.

mean(c(sc, sc.out))
## [1] 3.926

Binois, Gramacy, and Ludkovski (2018) repeat that training-testing partition one hundred times to provide score boxplots which may be compared to simpler (e.g., homoskedastic) GP-based approaches. I shall refer the curious reader to Figure 2 in that paper for more details. To make a long story short, fits accommodating heteroskedasticity in the proper way – via coupled GPs and fully likelihood-based inference – are superior to all other (computationally tractable) ways entertained.

10.3 Sequential design

A theme from Chapter 6 is that design for GPs has a chicken or egg problem. Model-based designs are hyperparameter dependent, and data are required to estimate parameters. Space-filling designs may be sensible but are often sub-optimal. It makes sense to proceed sequentially (Chapter 6.2). That was for ordinary, homoskedastic/stationary GPs of Chapter 5. Now noise is high and possibly varying regionally. Latent variables must be optimized. Replication helps, but how much and where?

The following caricatures the state of affairs as regards replication in experiment design for GP surrogates. In low/no-noise settings and under the assumption of stationarity (i.e., constant stochasticity), replication is of little value. Yet no technical result precludes replication except in deterministic settings. Conditions under which replicating is an optimal design choice were, until recently, unknown. When replicating, as motivated perhaps by an abundance of caution, one spot is sufficient because under stationarity the variance is the same everywhere.

Even less is known about heteroskedastic settings. Perhaps more replicates may be needed in high-variance regions? Ankenman, Nelson, and Staum (2010) show that once you have a design, under a degree of replication “large enough” to trust its moment-based estimates of spatial variance, new replicates can be allocated optimally. But what use are they if you already have “enough”, and what about the quagmire when you learn you need fewer than you already have? (More on that in Section 10.3.2.) On the other hand, if the mean is changing quickly – but otherwise exhibits stationary dynamics – might it help to concentrate design acquisitions where noise is lower so that signal-learning is relatively cheap in replication terms? One thing is for sure: we’d better proceed sequentially. Before data are collected, regions of high or low variance are unknown.

Details driving such trade-offs, requisite calculations and what can be learned from them, depend upon the design criterion. Choosing an appropriate criterion depends upon the goal of modeling and/or prediction. Sections 10.3.1–10.3.3 concentrate on predictive accuracy under heteroskedastic models, extending Chapter 6.2.2. Section 10.3.4 concludes with pointers to recent work in Bayesian optimization (e.g., extending Chapter 7.2), level set finding, and inverse problems/calibration (Chapter 8.1).

10.3.1 Integrated mean-square prediction error

Integrated mean-square prediction error (IMSPE) is a common model-based design criterion, whether for batch design (Chapter 6.1.2) or in sequential application as embodied by ALC (Chapter 6.2.2). Here our focus is sequential; however many of the calculations, and their implementation in hetGP, naturally extend to the batch setting as already illustrated in Chapter 6.1.2. IMSPE is predictive variance averaged over the input space, which must be minimized.

\[\begin{equation} I_{n+1}(x_{n+1}) \equiv \mathrm{IMSPE}(\bar{x}_1, \dots, \bar{x}_n, x_{n+1}) = \int_{x \in \mathcal{X}} \breve{\sigma}^2_{n+1}(x) \, dx. \tag{10.9} \end{equation}\]

Recall that \(\breve{\sigma}^2(x)\) is nugget-free predictive variance (7.5), capturing only epistemic uncertainty about the latent random field (Chapter 5.3.2). IMSPE above is expressed in terms of unique-\(n\) inputs for reasons that shall be clarified shortly. Following the right-hand side of Eq. (10.4), we have

\[\begin{equation} \breve{\sigma}_{n+1}^2(x) = \nu(1 - k_{n+1}^\top(x) (C_{n+1} + \Lambda_{n+1} A_{n+1}^{-1})^{-1} k_{n+1}(x)) \tag{10.10} \end{equation}\]

Re-expression for full-\(N\) is straightforward. The next design location, \(x_{N+1}\), may be a new unique location \((\bar{x}_{n+1})\) or a repeat of an existing one (i.e., one of \(\bar{x}_1, \dots, \bar{x}_n\)).

Generally speaking, integral (10.9) requires numerical evaluation (Seo et al. 2000; Gramacy and Lee 2009; Gauthier and Pronzato 2014; Gorodetsky and Marzouk 2016; M. Pratola et al. 2017). Chapter 6.2.2 offers several examples; or Ds2x=TRUE in the tgp (Gramacy and Taddy 2016) package – see Chapter 9.2.2. Both offer approximations based on sums over a reference set, with the latter providing additional averaging over MCMC samples from the posterior of hyperparameters (and trees). Conditional on GP hyperparameters, and when the study region \(\mathcal{X}\) is an easily integrable domain such as a hyperrectangle, the requisite integration may be calculated in closed form. Although examples exist elsewhere in the literature (e.g., Ankenman, Nelson, and Staum 2010; Anagnostopoulos and Gramacy 2013; Burnaev and Panov 2015; Leatherman, Santner, and Dean 2017) for particular cases (and approximations), hetGP offers the only implementation on CRAN that I am aware of. Binois et al. (2019) extends many of those calculations to the Woodbury trick, showing for the first time how choosing a replicate can be optimal in active learning and sequential design.

The closed form solution leverages the fact that IMSPE may be expressed as an expectation of covariance functions.

\[\begin{align} I_{n+1}(x_{n+1}) &= \mathbb{E} \{ \breve{\sigma}^2_{n+1}(X) \} = \mathbb{E} \{K_\theta(X, X) - k_{n+1}^\top(X) K_{n+1}^{-1} k_{n+1}(X) \} \tag{10.11} \\ &= \mathbb{E} \{K_\theta(X,X)\} - \mathrm{tr}(K_{n+1}^{-1} W_{n+1}), \notag \end{align}\]

where \(W_{ij} = \int_{x \in \mathcal{X}} k(x_i, x) k(x_j, x)\, dx\). Closed forms for \(W_{ij}\) exist when \(\mathcal{X}\) is a hyperrectangle. Notice that \(K_{n+1}\) depends on the number of replicates per unique design site, so this representation includes a tacit dependence on noise level and replication counts \(a_1, \ldots, a_n\). Binois et al. (2019) provide forms for several popular covariance functions, including the Matérn. In the case of the Gaussian:

\[ W_{ij}= \prod_{k=1}^d \dfrac{\sqrt{2\pi \theta_k} }{4} \exp\left\{-\dfrac{(x_{i,k}-x_{j,k})^2}{2 \theta_k}\right\} \left[\mathrm{erf}\left\{\dfrac{2-(x_{i,k}+x_{j,k})}{\sqrt{2 \theta_k}}\right\}+ \mathrm{erf}\left\{\dfrac{x_{i,k}+x_{j,k}}{\sqrt{ 2\theta_k}}\right\} \right]. \]
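As a quick sanity check of that expression in one dimension on \(\mathcal{X} = [0,1]\), the sketch below compares it to numerical integration of \(k(x_i, x)k(x_j, x)\), writing the Gaussian kernel as \(\exp\{-(x - x')^2/\theta\}\) to match the lengthscale convention above and building erf from pnorm; the inputs and lengthscale are arbitrary.

erf <- function(z) 2*pnorm(z*sqrt(2)) - 1   # error function via the normal CDF
theta <- 0.3; xi <- 0.2; xj <- 0.65         # arbitrary 1d inputs and lengthscale
W.cf <- sqrt(2*pi*theta)/4*exp(-(xi - xj)^2/(2*theta))*
  (erf((2 - (xi + xj))/sqrt(2*theta)) + erf((xi + xj)/sqrt(2*theta)))
W.num <- integrate(function(x) 
  exp(-(xi - x)^2/theta)*exp(-(xj - x)^2/theta), 0, 1)$value
c(closed=W.cf, numeric=W.num)               # should agree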

Gradients are also available in closed form, which is convenient for fast library-based optimization as a means of solving for acquisitions. Partitioned inverse equations (6.8) make evaluation of \(I_{n+1}(x_{n+1})\) and its derivative quadratic in \(n\). See Binois, et al., for details.

To investigate how replication can be favored by IMSPE, consider the following setup. Let \(r(x)\) denote a belief about extrinsic variance. Choice of \(r(x)\) is arbitrary. In practice we shall use estimated \(r(x) = \breve{\sigma}_n(x)\), as above. However the following illustration contrives two simpler \(r(x)\), primarily for pedagogical purposes, based on splines that agree at five knots. Knot locations could represent design sites \(x_i\) where potentially many replicate responses have been observed.

## variance values at five knots shared by both hypotheses
rn <- c(6, 4, 5, 6.5, 5) 
X0 <- matrix(seq(0.2, 0.8, length.out=length(rn)))
## first ("green") hypothesis: extra knots keep the variance relatively high
X1 <- matrix(c(X0, 0.3, 0.4, 0.9, 1))
Y1 <- c(rn, 4.7, 4.6, 6.3, 4.5)
r1 <- splinefun(x=X1, y=Y1, method="natural")
## second ("blue") hypothesis: extra knots force a pronounced dip near x=0.3
X2 <- matrix(c(X0, 0.0, 0.3))
Y2 <- c(rn, 7, 2)
r2 <- splinefun(x=X2, y=Y2, method="natural")

Figure 10.7 provides a visual of these two hypotheses evaluated on a testing grid. Below we shall refer to these surfaces as “green” and “blue”, respectively, referencing the colors from the figure. The “knots” are shown as red open circles.

xx <- matrix(seq(0, 1, by=0.005))
plot(X0, rn, xlab="x", ylab="r(x)", xlim=c(0,1), ylim=c(2,8), col=2)
lines(xx, r1(xx), col=3)
lines(xx, r2(xx), col=4)

FIGURE 10.7: Two example \(r(x)\) variance surfaces based on splines with knots as open red circles.

The code below implements Eq. (10.11) for generic variance function r, like one of our splines from above. It uses internal hetGP functions such as Wij and cov_gen. (We shall illustrate the intended hooks momentarily; these low-level functions are of value here in this toy example.)

IMSPE.r <- function(x, X0, theta, r) {
  x <- matrix(x, nrow = 1)
  ## integrated kernel products W and covariance over existing sites plus x
  Wijs <- Wij(mu1 = rbind(X0, x), theta=theta, type="Gaussian")
  K <- cov_gen(X1 = rbind(X0, x), theta=theta)
  K <- K + diag(apply(rbind(X0, x), 1, r))  ## add r(.) noise to the diagonal
  return(1 - sum(solve(K) * Wijs))          ## E{K(X,X)} - tr(K^{-1} W)
}

Next, apply this function on a grid for each of the two \(r(x)\), green and blue …

imspe1 <- apply(xx, 1, IMSPE.r, X0=X0, theta=0.25, r=r1)
imspe2 <- apply(xx, 1, IMSPE.r, X0=X0, theta=0.25, r=r2)
xstar1 <- which.min(imspe1)
xstar2 <- which.min(imspe2)

… and visualize in Figure 10.8. The \(x\)-locations of the knots – our hypothetical design sites \(x_1,\dots, x_n\) – are indicated as dashed red vertical bars.

plot(xx, imspe1, type="l", col=3, ylab="IMSPE", xlab="x", ylim=c(0.6, 0.7))
lines(xx, imspe2, col=4)
abline(v=X0, lty=3, col='red')
points(xx[xstar1], imspe1[xstar1], pch=23, bg=3)
points(xx[xstar2], imspe2[xstar2], pch=23, bg=4)

FIGURE 10.8: IMSPE surfaces from the two \(r(x)\) in Figure 10.7; knots indicated as red vertical dashed lines.

In the figure, the blue variance hypothesis is minimized at a novel \(x_{n+1}\) location, not coinciding with any of the previous design sites. But the green hypothesis is minimized at \(x_2\), counting vertical red-dashed lines from left to right. IMSPE calculated on the green \(r(x)\) prefers replication. That’s not a coincidence or fabrication. Binois et al. (2019) showed that the next point \(x_{N+1}\) will be a replicate, i.e., one of the existing unique locations \(x_1,\dots,x_n\) rather than a new \(x_{n+1},\) when

\[\begin{align} r(x_{N+1}) &\geq \frac{k_n^\top(x_{N+1}) K_n^{-1} W_n K_n^{-1} k_n(x_{N+1}) - 2 w_{n+1}^\top K_n^{-1} k_n(x_{N+1}) + w_{n+1,n+1}}{\mathrm{tr}(B_{k^*} W_n)} \notag \\ &\quad - \sigma_n^2(x_{N+1}), \tag{10.12} \end{align}\]

where \(k^* = \mathrm{argmin}_{k \in \{1, \dots, n\}} \mathrm{IMSPE}(x_k)\) and

\[ B_k = \frac{(K_n^{-1})_{.,k} (K_n^{-1})_{k,.} }{\frac{a_k (a_k + 1)}{\nu \lambda_k} - (K_n)^{-1}_{k,k}}. \]

To describe that result from thirty thousand feet: IMSPE will prefer replication when the variance is “everywhere large enough.” To illustrate, code below utilizes hetGP internals to enable evaluation of the right-hand side of inequality (10.12) above.

rx <- function(x, X0, rn, theta, Ki, kstar, Wijs) {  
  x <- matrix(x, nrow=1)
  kn1 <- cov_gen(x, X0, theta=theta)
  wn <- Wij(mu1=x, mu2=X0, theta=theta, type="Gaussian")
  ## numerator of the threshold in Eq. (10.12)
  a <- kn1 %*% Ki %*% Wijs %*% Ki %*% t(kn1) - 2*wn %*% Ki %*% t(kn1) 
  a <- a + Wij(mu1=x, theta=theta, type="Gaussian")
  ## denominator tr(B_{k*} W_n), here with a_k = 1 at each knot
  Bk <- tcrossprod(Ki[, kstar], Ki[kstar,])/(2/rn[kstar] - Ki[kstar, kstar])
  b <- sum(Bk*Wijs)
  sn <- 1 - kn1 %*% Ki %*% t(kn1)   ## nugget-free predictive variance
  return(a/b - sn)
}

Evaluating on the xx grid commences as follows …

bestk <- which.min(apply(X0, 1, IMSPE.r, X0=X0, theta=0.25, r=r1))
Wijs <- Wij(X0, theta=0.25, type="Gaussian")
Ki <- solve(cov_gen(X0, theta=0.25, type="Gaussian") + diag(rn))
rx.thresh <- apply(xx, 1, rx, X0=X0, rn=rn, theta=0.25, Ki=Ki,
  kstar=bestk, Wijs=Wijs)

… which may be used to augment Figure 10.7 with a gray-dashed line in Figure 10.9. Since the green hypothesis is everywhere above that threshold in this instance, replication is recommended by the criterion. Observe that the point of equality coincides with the selection minimizing IMSPE.

plot(X0, rn, xlab="x", ylab="r(x)", xlim=c(0,1), ylim=c(2,8), lty=2, col=2)
lines(xx, r1(xx), col=3)
lines(xx, r2(xx), col=4)
lines(xx, rx.thresh, lty=2, col="darkgrey")

FIGURE 10.9: Variance hypotheses from Figure 10.7 with replicating threshold added in gray.

That green hypothesis is, of course, just one instance of a variance function above the replication threshold. Although those hypotheses were not derived from GP predictive equations, the example illustrates the potential. So here is what we know: replication is (1) good for GP calculations (\(n^3 \ll N^3\)); (2) preferred by IMSPE under certain (changing) variance regimes; and (3) intuitively helpful for separating signal from noise. But how often is IMSPE going to “ask” for replications in practice? The short answer is, not often enough.

One challenge is numerical precision in optimization when mixing discrete and continuous search. Given a continuum of potential new locations, up to floating-point precision, a particular setting corresponding to a finite number of replicate sites is not likely to be preferred over all other settings, such as ones infinitesimally nearby. Another issue, which is more fundamental, is that IMSPE is myopic. Value realized by the current selection, be it a replicate or new unique location, is not assessed through its impact on a future decision landscape.

10.3.2 Lookahead over replication

Entertaining acquisition decision spaces that are forward looking is hard. Several examples may be found in the literature, primarily emphasizing BO applications (Chapter 7; see, e.g., Ginsbourger and Le Riche 2010; Gonzalez, Osborne, and Lawrence 2016; Lam, Willcox, and Wolpert 2016; Huan and Marzouk 2016). To my knowledge, closed-form solutions only exist for the one-step-ahead case. Many approaches, like IECI from Chapter 7.2.3, leverage numerics to a degree. More on optimization is provided, in brief, in Section 10.3.4. In the context of IMSPE, Binois et al. (2019) found that a “replication-biased” lookahead, correcting myopia in the context of separating signal from noise, is manageable.

Consider a horizon \(h\) into the future, over which the goal is to plan for up to one new, unique site \(\bar{x}_{n+1}\) and \(h\) or \(h+1\) replicates, allocated among the \(n\) or \(n+1\) existing unique locations. The first acquisition can be the new, unique site, with the remaining \(h\) as replicates. Or replicates can come first, with \(h\) ways to assign a new unique site later. Figure 10.10 provides a flow diagram corresponding to the \(h=3\) case. Each node in the graph represents a lookahead state. Red arrows denote potential timings for a new unique site, graphically as a transition between states. Replicates are indicated by black arrows.


FIGURE 10.10: Flow chart of lookahead over replication. A similar chart may be found in Binois et al. (2019).

Each new unique location involves a continuous, potentially derivative-based IMSPE search. Entertaining replicates is discrete. In total, looking ahead over a horizon of \(h\) replicates requires \(h+1\) continuous searches and \((h+1)(h+2)/2 -1\) discrete ones. Searching over discrete alternatives, involving replicates at one of the existing \(\bar{x}_k\) locations, \(k=1,\dots,n\), may utilize a simplified IMSPE (10.9) calculation:

\[\begin{align} I_{n+1}(\bar{x}_k) &= \nu( 1 - \mathrm{tr}(B_k' W_n)), \quad \mbox{ with } \tag{10.13} \\ B_k' &= \frac{\left(\left(C_n + A_n^{-1}\Lambda_n\right)^{-1}\right)_{.,k} \left(\left(C_n + A_n^{-1}\Lambda_n\right)^{-1}\right)_{k,.} }{a_k (a_k + 1)/\lambda_k - \left(C_n + A_n^{-1}\Lambda_n \right)^{-1}_{k,k}}. \notag \end{align}\]

Implementation in hetGP considers \(h \in \{0, 1, 2, \ldots \}\) with \(h=0\) representing ordinary (myopic) IMSPE search. Although larger values of \(h\) entertain future sequential design decisions, the goal (for any \(h\)) is to determine what to do now as \(N \rightarrow N+1\). Toward that end, \(h+1\) decision paths are entertained spanning alternatives between exploring sooner and replicating later, or vice versa. During each iteration along a given path through Figure 10.10, either Eq. (10.9) or Eq. (10.13), but not both, is taken up as the hypothetical action. In the first iteration, if a new \(\bar{x}_{n+1}\) is chosen by optimizing Eq. (10.9), that location is combined with the existing \((\bar{x}_1, \dots, \bar{x}_n)\) and considered among candidates for future replication over the remaining \(h\) lookahead iterations (when \(h \geq 1\)). If instead a replicate is chosen in the first iteration, the lookahead scheme recursively searches over the choice of which of the remaining \(h\) iterations will pick a new \(\bar{x}_{n+1}\), with the others optimizing over replicates. Recursion is resolved by moving to the second iteration and again splitting the decision path into a choice between replicate-explore-replicate-… and replicate-replicate-…, etc. After fully unrolling the recursion in this way, optimizing up to horizon \(h\) along the \(h+1\) paths, the ultimate IMSPE for the respective hypothetical design of size \(N + 1 +h\) is computed, and the decision path yielding the smallest IMSPE is stored. Selection \(x_{N+1}\) is a new location if the explore-first path was optimal, and is a replicate otherwise.
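
To fix ideas on that accounting, the sketch below (pure bookkeeping; hetGP handles this internally) enumerates the \(h+1\) decision paths for \(h=3\), marking where the single new unique site falls in each, and tallies the continuous and discrete searches quoted above.

h <- 3
## one path per possible position of the new unique site; replicates elsewhere
paths <- lapply(0:h, function(j) replace(rep("repl", h + 1), j + 1, "new"))
paths
## h+1 continuous searches and (h+1)(h+2)/2 - 1 discrete ones
c(continuous=h + 1, discrete=(h + 1)*(h + 2)/2 - 1)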

Horizon \(h\) is a tuning parameter. It determines the extent to which replicates are entertained in lookahead, and therefore larger \(h\) somewhat inflates chances for replication. As \(h\) grows there are more and more decision paths that delay exploration to a later iteration; if any of them yield a smaller IMSPE than the explore-first path, the immediate action is to replicate. Although larger \(h\) allows more replication before committing to a new, unique \(\bar{x}_{n+1}\), it also magnifies the value of an \(\bar{x}_{n+1}\) chosen in the first iteration, as it could potentially accrue its own replicates in subsequent lookahead or real design iterations. Therefore, although we do find in practice that larger \(h\) leads to more replication in the final design, this association is weak. Before considering criteria for how to set horizon, consider the following illustration with fixed \(h=5\). Take \(f(x) = (6x -2)^2 \sin(12x-4)\) from Forrester, Sobester, and Keane (2008), implemented as f1d in hetGP, and observe \(Y(x)\sim \mathcal{N}(f(x), r(x))\), where the noise variance function is \(r(x) = (1.1 + \sin(2 \pi x))^2\).

fr <- function(x) { (1.1 + sin(2*pi*x)) }                  ## noise standard deviation
fY <- function(x) { f1d(x) + rnorm(length(x), sd=fr(x)) }  ## noisy responses

This example was engineered to be similar to the motorcycle accident data (Section 10.2), but allows for bespoke evaluation. Begin with an initial uniform design of size \(N=n=10\), i.e., without replicates, and an associated hetGP fit based on a Gaussian kernel.

X <- seq(0, 1, length=10)
Y <- fY(X)
mod <- mleHetGP(X=X, Z=Y, lower=0.0001, upper=10)

Using that fit, calculate IMSPE with horizon \(h=5\) lookahead over replication. The IMSPE_optim call below leverages closed form IMSPE (10.11) and derivatives within an optim call with method="L-BFGS-B", mixed with discrete evaluations (10.13).

opt <- IMSPE_optim(mod, h=5) 
X <- c(X, opt$par)
X
##  [1] 0.00000 0.11111 0.22222 0.33333 0.44444 0.55556 0.66667 0.77778
##  [9] 0.88889 1.00000 0.06662

Whether or not the chosen location, in position eleven above (0.0666), is a replicate depends on the random Rmarkdown build, challenging precise narrative here. The hetGP package provides an efficient partition inverse-based updating method (6.8), utilizing \(\mathcal{O}(n^2)\) or \(\mathcal{O}(n)\) updating calculations for new data points depending on whether that point is unique or a replicate, respectively. Details are provided by Binois et al. (2019), extending those from Chapter 6.3 to the heteroskedastic case.

ynew <- fY(opt$par)
Y <- c(Y, ynew)
mod <- update(mod, Xnew=opt$par, Znew=ynew, ginit=mod$g*1.01)

That’s the basic idea. Let’s continue and gather a total of 500 samples in this way, in order to explore the aggregate nature of the sequential design. Periodically (every 25 iterations in the code below), it can be helpful to restart MLE calculations to “unstick” any solutions caught in local modes of the likelihood surface. Gathering 500 points is overkill for this simple 1d problem, but it helps create a nice visualization.

for(i in 1:489) {
  
  ## find the next point and update 
  opt <- IMSPE_optim(mod, h=5)
  X <- c(X, opt$par)
  ynew <- fY(opt$par)
  Y <- c(Y, ynew)
  mod <- update(mod, Xnew=opt$par, Znew=ynew, ginit=mod$g*1.01)
  
  ## periodically attempt a restart to try to escape local modes
  if(i %% 25 == 0){ 
    mod2 <- mleHetGP(X=list(X0=mod$X0, Z0=mod$Z0, mult=mod$mult), Z=mod$Z, 
      lower=0.0001, upper=1)
    if(mod2$ll > mod$ll) mod <- mod2
  } 
}
nrow(mod$X0)
## [1] 73

Of the \(N=500\) total acquisitions, the final design contains \(n= 73\) unique locations. To help visualize and assess the quality of the final surface with that design, code below gathers predictive quantities on a dense grid in the input space.

xgrid <- seq(0, 1, length=1000)
p <- predict(mod, matrix(xgrid, ncol=1)) 
pvar <- p$sd2 + p$nugs

Figure 10.11 shows the resulting predictive surface in red, with the true analog in black. Gray vertical bars help visualize relative degrees of replication at each unique input location.

plot(xgrid, f1d(xgrid), type="l", xlab="x", ylab="y",  ylim=c(-8,18))
lines(xgrid, qnorm(0.05, f1d(xgrid), fr(xgrid)), col=1, lty=2)
lines(xgrid, qnorm(0.95, f1d(xgrid), fr(xgrid)), col=1, lty=2)
points(X, Y)
segments(mod$X0, rep(0, nrow(mod$X0)) - 8, mod$X0, 
  (mod$mult - 8)*0.5, col="gray")
lines(xgrid, p$mean, col=2)
lines(xgrid, qnorm(0.05, p$mean, sqrt(pvar)), col=2, lty=2)
lines(xgrid, qnorm(0.95, p$mean, sqrt(pvar)), col=2, lty=2)
legend("top", c("truth", "estimate"), col=1:2, lty=1:2)

FIGURE 10.11: Sequential design with horizon \(h=5\): truth in black; predictive in red; vertical gray line segments indicate relative degrees of replication.

Both the degree of replication and the density of unique elements are higher in the high-noise region than where noise is lower. In a batch design setting, and in the unique situation where relative noise levels are somehow known, a rule of thumb of more samples or replicates in higher-noise regimes is sensible. Such knowledge is unrealistic, and even so the optimal density differentials and degrees of replication are not immediate. Proceeding sequentially allows learning to adapt to design and vice versa.

A one-dimensional input case is highly specialized. In higher dimension, where volumes are harder to fill, a more delicate balance may need to be struck between the value of information in high-versus-low noise regions. Given the choice between high- and low-noise locations, which are otherwise equivalent, a low noise acquisition is clearly preferred. Determining an appropriate lookahead horizon can be crucial to making such trade-offs.

Tuning the horizon

Horizon \(h=5\) is rather arbitrary. Although it’s difficult to speculate on details regarding the quality of the surface presented in Figure 10.11, due to the random nature of the Rmarkdown build, scope for improvement may be apparent. Chances are that uncertainty is overestimated in some regions and underestimated in others. In the version I’m looking at, the right-hand/low-noise region is over-sampled. A solution lies in tuning lookahead horizon \(h\) online. Binois et al. (2019) proposed two simple schemes. The first adjusts \(h\) in order to target a desired ratio \(\rho = n/N\) and thus manage the surrogate modeling cost through the Woodbury trick (Section 10.1.1):

\[ \mbox{Target:} \quad h_{N+1} \leftarrow \left\{ \begin{array}{lll} h_N + 1 & \mbox{if } n/N > \rho & \mbox{and a new $\bar{x}_{n+1}$ is chosen} \\ \max\{h_N-1,-1\} & \mbox{if } n/N < \rho & \mbox{and a replicate is chosen} \\ h_N & \mbox{otherwise}. \end{array} \right. \]

Horizon \(h\) is nudged upward when the empirical proportion of unique locations \(n/N\) exceeds the target \(\rho\), i.e., when there is less replication than desired, and nudged downward in the opposite case. The second scheme attempts to adapt \(h\) in order to minimize IMSPE regardless of computational cost:

\[ \mbox{Adapt:} \quad h_{N+1} \sim \mathrm{Unif} \{ a'_1, \dots, a'_n \} \quad \text{ with }\quad a'_i := \max(0, a_i^* - a_i). \]

Ideal replicate values

\[\begin{equation} a_i^* \propto \sqrt{ r(\bar{x}_i) (K_n^{-1} W_n K_n^{-1})_{i,i}} \tag{10.14} \end{equation}\]

come from a criterion in the SK literature (Ankenman, Nelson, and Staum 2010). See the allocate_mult function in hetGP and homework exercises in Section 10.4 for more details. Here the idea is to stochastically nudge empirical replication toward an optimal setting derived by fixing the unique-\(n\) locations \(\bar{x}_1,\dots,\bar{x}_n\) and allocating any remaining design budget entirely to replication. An alternative, deterministic analog could be \(h_{N + 1} = \max_i a_i'\).
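
To make those two update rules concrete, the minimal sketch below codes them as standalone helpers with hypothetical names horizon.target and horizon.adapt; in practice hetGP’s horizon function provides both behaviors, the first via its target and previous_ratio arguments and the second in concert with allocate_mult.

## "target": nudge h toward a desired unique-to-total ratio rho
horizon.target <- function(h, n, N, rho, new.site) {
  if(n/N > rho && new.site) return(h + 1)            ## too little replication
  if(n/N < rho && !new.site) return(max(h - 1, -1))  ## too much replication
  return(h)
}

## "adapt": draw h uniformly from shortfalls relative to ideal counts astar
horizon.adapt <- function(mult, astar) {
  ap <- pmax(0, astar - mult)
  ap[sample.int(length(ap), 1)]
}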

Code below duplicates the example above with the adapt heuristic. Alternatively, calls to horizon can be provided with target and previous_ratio arguments in order to implement the target heuristic instead. Begin by reinitializing the design.

X <- X[1:10]
Y <- Y[1:10]
mod.a <- mleHetGP(X=X, Z=Y, lower=0.0001, upper=10)
h <- rep(NA, 500)

Next, loop to obtain \(N=500\) observations under an adaptive horizon scheme.

for(i in 1:490) {
  
  ## adaptively adjust the lookahead horizon
  h[i] <- horizon(mod.a)
  
  ## find the next point and update 
  opt <- IMSPE_optim(mod.a, h=h[i])
  X <- c(X, opt$par)
  ynew <- fY(opt$par)
  Y <- c(Y, ynew)
  mod.a <- update(mod.a, Xnew=opt$par, Znew=ynew, ginit=mod.a$g*1.01)
  
  ## periodically attempt a restart to try to escape local modes
  if(i %% 25 == 0){ 
    mod2 <- mleHetGP(X=list(X0=mod.a$X0, Z0=mod.a$Z0, mult=mod.a$mult), 
      Z=mod.a$Z, lower=0.0001, upper=1)
    if(mod2$ll > mod.a$ll) mod.a <- mod2
  } 
}

Once that’s done, predict on the grid.

p.a <- predict(mod.a, matrix(xgrid, ncol=1))
pvar.a <- p.a$sd2 + p.a$nugs
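
The number of unique sites in the final adaptive design may be extracted just as in the fixed-horizon case (output not shown; the exact count varies with the random Rmarkdown build).

nrow(mod.a$X0)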

The left panel of Figure 10.12 shows adaptively selected horizon over iterations of sequential design. Observe that a horizon of \(h=5\) is not totally uncommon, but is also higher than typically preferred by the adaptive scheme. In total, \(n=\) 114 unique sites were chosen – more than with the fixed \(h=5\) case, but in the same ballpark relative to the full size of \(N=500\) acquisitions. The right panel of the figure shows final design and predictions versus the truth, matching closely that of Figure 10.11 corresponding to the fixed horizon (\(h=5\)) case.

par(mfrow = c(1,2))
plot(h, main="horizon", xlab="iteration")
plot(xgrid, f1d(xgrid), type="l", xlab="x", ylab="y", 
  main="adaptive horizon design", ylim=c(-8,18))
lines(xgrid, qnorm(0.05, f1d(xgrid), fr(xgrid)), col=1, lty=2)
lines(xgrid, qnorm(0.95, f1d(xgrid), fr(xgrid)), col=1, lty=2)
points(X, Y)
segments(mod.a$X0, rep(0, nrow(mod.a$X0)) - 8, mod.a$X0, 
  (mod.a$mult - 8)*0.5,  col="gray")
lines(xgrid, p.a$mean, col=2)
lines(xgrid, qnorm(0.05, p.a$mean, sqrt(pvar.a)), col=2, lty=2)
lines(xgrid, qnorm(0.95, p.a$mean, sqrt(pvar.a)), col=2, lty=2)

FIGURE 10.12: Horizons chosen per iteration (left); final design and predictions versus the truth (right) as in Figure 10.11.

The code below offers an out-of-sample comparison of accuracy via mean-squared error (lower is better).

ytrue <- f1d(xgrid)
rmse <- c(h5=mean((ytrue - p$mean)^2), ha=mean((ytrue - p.a$mean)^2))
rmse
##      h5      ha 
## 0.01202 0.03500

Although speculating on the precise outcome of this comparison is difficult because of the noisy nature of sampling and horizon updates in this Rmarkdown build, a typical outcome is that the adaptive scheme yields slightly lower error. Being able to adjust lookahead horizon enables the adaptive scheme to better balance exploration versus replication in order to efficiently learn the smoothly changing signal-to-noise ratio exhibited by the data-generating mechanism.

10.3.3 Examples

Here we revisit the SIR and ATO examples from Section 10.2.2 from a sequential design perspective.

Susceptible, infected, recovered

Begin with a limited initial design of just \(n=20\) unique locations and five replicates upon each, for \(N=100\) total runs.

library(lhs)
X <- randomLHS(20, 2)
X <- rbind(X, matrix(rep(t(X), 4), ncol=2, byrow=TRUE))  ## four more copies of each
Y <- apply(X, 1, sirEval)

A fit to that initial design uses options similar to the batch fit from Section 10.2.2, augmenting with a fixed mean and limits on \(\theta_{(\delta)}\). Neither is essential in this illustration, but they help limit sensitivity to a small initial design in this random Rmarkdown build.

fit <- mleHetGP(X, Y, covtype="Matern5_2", settings=list(linkThetas="none"), 
  known=list(beta0=0), noiseControl=list(upperTheta_g=c(1,1)))

To visualize predictive surfaces and progress over acquisitions out-of-sample, code below completes the 2d testing input grid XX from the batch example in Section 10.2.2 with output evaluations for benchmarking via RMSE and score.

YY <- apply(XX, 1, sirEval)

Completing the setup, code below allocates storage for RMSEs, scores and horizons, for a total of 900 acquisitions.

N <- 1000
score <- rmse <- h <- rep(NA, N)

The acquisition sequence here, under an adaptive horizon scheme, is very similar to our previous examples, modulo a few subtle changes. When reinitializing MLE calculations, for example, notice that fit$used_args is used below to maintain consistency in search parameters and modeling choices.

for(i in 101:N) {
  
  ## find the next point and update 
  h[i] <- horizon(fit)
  opt <- IMSPE_optim(fit, h=h[i])
  ynew <- sirEval(opt$par)
  fit <- update(fit, Xnew=opt$par, Znew=ynew, ginit=fit$g*1.01)
  
  ## periodically attempt a restart to try to escape local modes
  if(i %% 25 == 0){ 
    fit2 <- mleHetGP(X=list(X0=fit$X0, Z0=fit$Z0, mult=fit$mult), Z=fit$Z, 
      maxit=1e4, upper=fit$used_args$upper, lower=fit$used_args$lower, 
      covtype=fit$covtype, settings=fit$used_args$settings, 
      noiseControl=fit$used_args$noiseControl)
    if(fit2$ll > fit$ll) fit <- fit2
  }

  ## track progress
  p <- predict(fit, XX)
  var <- p$sd2 + p$nugs
  rmse[i] <- sqrt(mean((YY - p$mean)^2))
  score[i] <- mean(-(YY - p$mean)^2/var - log(var))
}

Figure 10.13 shows the resulting predictive surfaces and design selections. Compare to Figure 10.4 for a batch version. Similarity in the surfaces is high, especially for the mean (left panel), despite an order of magnitude smaller sequential design compared to the batch, which was based on more than ten thousand runs. Depending on the random Rmarkdown build, the standard deviation (right) may mis-locate the area of high noise in the \(I_0\) direction, over-sampling in the upper-right corner. This is eventually resolved with more active learning.

par(mfrow=c(1,2))
xx <- seq(0, 1, length=100)
image(xx, xx, matrix(p$mean, ncol=100), xlab="S0", ylab="I0", 
  col=cols, main="mean infected")
text(fit$X0, labels=fit$mult, cex=0.5)
image(xx, xx, matrix(sqrt(var), ncol=100), xlab="S0", ylab="I0", 
  col=cols, main="sd infected")
text(fit$X0, labels=fit$mult, cex=0.5)

FIGURE 10.13: Heteroskedastic GP fit to sequentially designed SIR data showing predictive mean surface (left) and estimated standard deviation (right). Compare to the batch analog in Figure 10.4.
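
As a quick check on the replication pattern (whose result depends on the random Rmarkdown build), code below extracts the \(S_0\) coordinate of any unique site with ten or more replicates; output is not shown.

## S0 coordinates of unique sites with double-digit replication
fit$X0[fit$mult >= 10, 1]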

Density of unique locations, indicated by the coordinates of the numbers plotted in both panels, appears uniform to the naked eye. However, the degree of replication is greater in the right half of the input space corresponding to \(S_0 > 0.5\). All double-digit multiplicities are in this right-hand half-plane. As a summary of progress over sequential design iterations, Figure 10.14 shows a progression of RMSEs (left), proper scores (middle) and horizon (right). The first two panels have the batch analog overlaid in dashed-red.

par(mfrow=c(1,3))
plot(rmse, type="l", xlab="n: sequential iterate", ylim=c(0.06, 0.07))
abline(h=sqrt(mean((YY - psir$mean)^2)), col=2, lty=2)
legend("topright", c("sequential", "batch"), col=1:2, lty=1:2)
plot(score, type="l", xlab="n: sequential iterate", ylim=c(4, 5))
abline(h=mean(-(YY - psir$mean)^2/vsir - log(vsir)), col=2, lty=2)
plot(h, xlab="n: sequential iterate", ylab="lookahead horizon")

FIGURE 10.14: Summary of sequential design progress via RMSE (left), proper score (middle) and selected lookahead horizon (right) under the adapt scheme.

In spite of the much larger batch training set, our sequential design performs very similarly out-of-sample. The precise outcome depends on the Rmarkdown build; note that both solid and dashed lines depend on random initializations/designs and responses through sirEval. Thus they have a distribution that could be averaged-over in a Monte Carlo fashion. That exercise is left to the curious reader. To wrap up, notice in the right-hand panel that horizon favors low numbers, emphasizing \(h=0\) and \(h=1\), but more than twenty percent of selections are larger than that.

mean(h > 1, na.rm=TRUE)
## [1] 0.2489

Limited lookahead horizon is deemed to be beneficial for separating signal from noise in this example. It is worth reiterating that the IMSPE criterion (10.9) and lookahead horizon (10.14) emphasize mean accuracy, not variance accuracy or minimization. Nevertheless they perform well on score, which combines measurements of both. This is because they – especially longer horizons – encourage replication, which offers a pure look at noise in order to pin down local variance.

Inventory management

ATO is a much larger example. Section 10.2.2 leveraged a fit stored in the ato data object. Fitting live in the Rmarkdown build would have represented a prohibitive undertaking. That same data object contains an "hetGP"-class model out.a that was trained under an adaptive horizon IMSPE-based sequential design scheme. The size of that design and the time it took to train are quoted by the output of the code below.

c(n=nrow(out.a$X0), N=length(out.a$Z), time=out.a$time)
##            n            N time.elapsed 
##         1194         2000        38738
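
The spread of replication across those unique sites can be gleaned from the mult field stored with any "hetGP"-class fit; output is omitted here.

## distribution of per-site replication counts in the sequential ATO design
summary(out.a$mult)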

Recall that the earlier experiment involved \(n=1000\) unique sites with an average of five replicates upon each, for a total of about \(N=5000\) training data points. The actively selected training set here is much smaller, having \(N=2000\) at \(n=\) 1194 unique locations. So the sequential design has more unique locations but still a nontrivial degree of replication, resulting in many fewer overall runs of the ATO simulator. Utilizing the same out-of-sample testing set from the previous score-based comparison, the code below calculates predictions and scores with this new sequential design.

phet.a <- predict(out.a, Xtest)
phets2.a <- phet.a$sd2 + phet.a$nugs
mhet.a <- as.numeric(t(matrix(rep(phet.a$mean, 10), ncol=10)))
s2het.a <- as.numeric(t(matrix(rep(phets2.a, 10), ncol=10)))
sehet.a <- (unlist(t(Ztest)) - mhet.a)^2
sc.a <- - sehet.a/s2het.a - log(s2het.a)
c(batch=mean(sc), adaptive=mean(sc.a))
##    batch adaptive 
##    3.396    3.615

Observe that sequential design leads to a more accurate predictor than the earlier batch alternative, despite the former having been trained on about 60% fewer runs. Illustrating those acquisitions requires “rebuilding” the out.a object. To keep hetGP compact for CRAN, \(\mathcal{O}(n^2)\) covariance matrices and inverses have been deleted via return.matrices=FALSE in the mleHetGP command used to build out.a.

out.a <- rebuild(out.a)

The calculation sequence for acquisitions involved in this sequential design begins by determining the horizon and then searching with that horizon via IMSPE. In code, that amounts to the following.

Wijs <- Wij(out.a$X0, theta=out.a$theta, type=out.a$covtype)
h <- horizon(out.a, Wijs=Wijs)
control <- list(tol_dist=1e-4, tol_diff=1e-4, multi.start=30)
opt <- IMSPE_optim(out.a, h, Wijs=Wijs, control=control)

Pre-calculating Wijs (\(W\)) economizes a little on execution time since these are needed by both horizon and IMSPE_optim. A control argument to IMSPE_optim allows users to fine-tune fidelity of search over the criteria. Above, the number of restarts and the tolerances on distance to existing sites (i.e., replicates) are adjusted upward from their default settings. For restarts, our adjustments to the default are intended to make search more global. Regarding tolerances, higher values make selection of replicates more likely, determining that optim outputs within 1e-4 in input (\(x\)) or output (\(I_{N+1}(x_{N+1})\)) should snap to existing sites.

opt$par
##        [,1]    [,2]   [,3]   [,4]   [,5]   [,6] [,7]   [,8]
## [1,] 0.1047 0.03462 0.6493 0.3915 0.1239 0.0874 0.03 0.4859

Acquisition is completed by feeding that 8d location into the ATO simulator. Subsequently the chosen input–output pair would be used to update the model fit. An indication of whether the new location is unique (i.e., actually new) or a replicate is provided by the path field in the IMSPE_optim output. The path list describes the best sequence of points, with elements par, value and new for the immediate acquisition, followed by \(h\) more entries computed as lookahead selections.

opt$path[[1]]$new
## [1] TRUE
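
The remaining entries of path describe the rest of the planned lookahead trajectory. For example, the following flags which steps would be new sites versus replicates; output depends on the stored fit and the stochastic nature of the search.

## new-site indicator for each step along the lookahead path
sapply(opt$path, function(pp) pp$new)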

ATO inputs are actually on a grid, so evaluation would require “snapping” this continuous solution to that grid (after undoing the coding of the inputs). Further replication could be induced as part of that discretization process.

10.3.4 Optimization, level sets, calibration and more

Throughout this chapter we have focused on reducing mean-squared prediction error, a global criterion. Targeting specific features of the response, such as global minima (BO; Chapter 7) or level sets, is also of interest. Bogunovic et al. (2016) provide a treatment unifying criteria for these settings. Consider BO first, having already treated the noisy case in Chapter 7.2.4. Picheny, Wagner, and Ginsbourger (2012) provide an important set of benchmarks specifically for this setting. Although there are many suitable criteria for BO with GPs, EI emerges as perhaps the most appealing because it organically balances exploitation and exploration without any tuning parameters, does not require numerical approximation, and has a closed-form derivative. In the face of noise, and particularly heteroskedasticity, simplicity and computational tractability are important features when adapting a method originally designed for noiseless/deterministic settings.

As mentioned in the lead-in to Section 10.3.2, lookahead versions of EI have been studied (see, e.g., Ginsbourger and Le Riche 2010; Gonzalez, Osborne, and Lawrence 2016; Lam, Willcox, and Wolpert 2016; Huan and Marzouk 2016), but closed-form expressions exist only for one-step-ahead versions. Unlike IMSPE, future values of the criterion depend on future function values. Inspired by Lam, Willcox, and Wolpert (2016) and the IMSPE-based lookahead presented in Section 10.3.2, Binois and Gramacy (2018) introduce a replication-biased lookahead for EI which circumvents unknown future function values through simple imputation: \(y_{N+1} \leftarrow \mu_N(x_{N+1})\), also known as a kriging “believer” approach (Ginsbourger, Le Riche, and Carraro 2010). Examples are left to Section 4.1 in that paper, which is also available as a vignette in the hetGP package.
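
A minimal sketch of that imputation idea follows, reusing only functions featured earlier in this chapter. Here xnext is a hypothetical candidate acquisition and mod could be, e.g., the 1d fit from Section 10.3.2; the package’s own EI lookahead machinery performs analogous steps internally.

## kriging "believer": impute the unknown response at xnext with the
## current predictive mean, then update the fit as if it were observed
xnext <- matrix(0.5, nrow=1)
ybelieve <- predict(mod, xnext)$mean
mod.look <- update(mod, Xnew=xnext, Znew=ybelieve, ginit=mod$g*1.01)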

Active learning for the related problem of contour finding, or level set estimation, proceeds similarly. The objective is to identify a region of inputs defined by a threshold on outputs

\[ \Gamma = \left\{x \in \mathbb{R}^m : Y(x) > T \right\}, \]

where \(T\) can be zero without loss of generality. As with EI, the canonical development is for deterministic settings. In the presence of noise, take \(Y(x) \equiv F(x)\), the latent random field (Chapter 5.3.2) with nugget-free epistemic uncertainty \(\breve{\sigma}(x)\) from Eq. (10.10). Criteria defined by Lyu, Binois, and Ludkovski (2018) are implemented for homoskedastic and heteroskedastic GPs and TPs in hetGP. Several other criteria can be found in the literature (Chevalier et al. 2013; Bogunovic et al. 2016; Azzimonti et al. 2016) with a selection of implementations provided by the KrigInv package (Chevalier et al. 2018) on CRAN (Chevalier, Picheny, and Ginsbourger 2014; Chevalier, Ginsbourger, and Emery 2014).

The simplest criterion for active learning in this context is maximum contour uncertainty (MCU), implemented as crit_MCU in hetGP. MCU is based on the local probability of misclassification, namely that \(f\) is incorrectly predicted to be below or above \(T\) (Bichon et al. 2008; Ranjan, Bingham, and Michailidis 2008). A second criterion, contour stepwise uncertainty reduction (cSUR), implemented by crit_cSUR in hetGP, amounts to calculating a one-step-ahead reduction of MCU. A more computationally intensive, but more global, alternative involves integrating cSUR over the domain in a manner similar to IMSPE (10.9) for variance reduction or IECI (Gramacy and Lee 2011) for BO (Chapter 7.2.3). In practice, the integral is approximated by a finite sum, which is the approach taken by crit_ICU in hetGP. Finally, targeted mean square error (tMSE; Picheny et al. 2010) is provided by crit_tMSE and is designed to reduce variance close to the target contour. Again, examples are provided in Section 4.2 of Binois and Gramacy (2018).
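
For intuition only (a sketch, not the crit_MCU implementation itself), the local probability of misclassification relative to threshold \(T\) can be computed directly from hetGP predictive output.

## probability that the latent field lies on the other side of T than the
## prediction, using nugget-free predictive variance (sd2)
pmisclass <- function(model, x, T=0) {
  p <- predict(model, matrix(x, nrow=1))
  pnorm(-abs(p$mean - T)/sqrt(p$sd2))
}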

As a final topic, consider computer model calibration (Chapter 8.1) and inverse problems with a noisy, possibly heteroskedastic simulator. I’m not aware of SK or hetGP-like methodology yet being deployed in this setting, but Fadikar et al. (2018) consider quantile kriging (QK; Plumlee and Tuo 2014) to good effect. Pairing a massive agent-based epidemiological simulation campaign based on social networks with data on an Ebola outbreak in Liberia, Fadikar et al. built a QK surrogate for functional output (counts of infecteds over time) via principal components (Higdon et al. 2008) in order to back out a distribution on simulator tuning parameters most likely to have generated an observed (real) sequence of infecteds. Entertaining alternatives to QK, like hetGP, in Rmarkdown on this specific example is complicated by access to a supercomputer-scale simulation capability and Matlab (rather than R) libraries for working with functional output (Gattiker et al. 2016).

In lieu of that, consider the following setup involving similar features without the complications of functional output, and with ready access to simulations. Herbei and Berliner (2014) describe a Feynman-Kac simulator, whose source may be downloaded from GitHub, that models the concentration of a tracer within a given spatial domain. One application is modeling the oxygen concentration in a thin water layer deep in the ocean (McKeague et al. 2005). The simulator is stochastic, approximating the solution of an advection-diffusion equation via Monte Carlo, and is highly heteroskedastic. There are four real-valued inputs: two spatial coordinates (longitude and latitude) and two diffusion coefficients. Interest is in learning, i.e., inverting, the diffusion coefficients. In the notation of Chapter 8, the spatial coordinates are \(x\) and diffusion coefficients are \(u\). So the goal is to use the simulator \(Y^M(x,u)\) to learn \(\hat{u}\) from measurements \(Y^F(x)\) of oxygen concentration in the field, e.g., at \(x\)’s in the Southern Ocean.

In some regions of the input space the simulator response is multimodal, i.e., non-Gaussian, but this can be resolved through minimal (and automatic) replication and averaging. While simulation code is available for C/CUDA, R and Matlab on the GitHub page, code linked from the book webpage provides the essentials, via fksim.R and configuration files fkset.RData read therein, with several enhancements. Coded inputs and sensible defaults for the diffusion coefficients (calibration parameters \(u\)) promote plug-and-playability; automatic degree-6 replication multiplicity with averaging eliminates multimodality.

With those codes, consider simulations along a longitudinal slice in the input space …

source("fksim.R")
x2 <- seq(0, 1, length=100)
y <- sapply(x2, function(x2) { fksim(c(0.8, x2)) })

… with visual following in Figure 10.15.

plot(x2, y, xlab="coded longitude", ylab="oxygen concentration")

FIGURE 10.15: Simulated oxygen concentration along a slice in longitude, with the other three parameters fixed.

Clear signal is evident in the figure, but also clear input-dependent variance. It seems that a much higher degree of replication would be required to effectively separate signal and noise, at least in some parts of the input space, in addition to more exploration to fill out the 4d space. Two homework exercises in Section 10.4 take the curious reader through surrogate modeling, design, and inversion with this simulator. For inversion, calibration methods from Chapter 8.1 can be applied. Alternatively the log posterior coded below, provided in order to keep our development here somewhat more self-contained, can be used instead.

library(mvtnorm)
lpost.invert <- function(theta, XF, yF, GP)
 {
  ## input processing and checking
  if(length(theta) != ncol(GP$X0) - ncol(XF) + 1) 
    stop("length(theta), ncol(XF), ncol(GP$X0) mismatch")
  u <- theta[-length(theta)]
  s2 <- theta[length(theta)]
  
  ## prior checking  
  if(any(u < 0 | u > 1)) return (-Inf)
  if(s2 < 0) return(-Inf)

  ## derive predictive distribution for XF paired with u
  XFU <- cbind(XF, matrix(rep(u, nrow(XF)), ncol=length(u), byrow=TRUE)) 
  p <- predict(GP, XFU, xprime=XFU)
  C <- s2*diag(nrow(p$cov)) + (p$cov + t(p$cov))/2

  ## gaussian log density evaluation for yF under that predictive
  return(dmvnorm(yF, p$mean, C, log=TRUE) - log(s2))
 }

Notice that lpost.invert, above, is coded to work with a (possibly homoskedastic) "hetGP"-class object GP. Parameter theta combines diffusion coefficients \(u\), in its first several coordinates, with iid noise variance parameter \(\sigma^2\) in its last coordinate. Implicit in lpost.invert is a uniform prior \(u \sim \mathrm{Unif}(0,1)^{m_u}\) and scale-invariant Jeffreys prior \(\sigma^2 \sim 1/\sigma^2\). Arguments XF and yF are field data pairs \((X_{n_F}, Y_{{n_F}})\), observed under the correct but unknown \(u^\star\). The resulting log posterior could be maximized, or sampled from in a fuller Bayesian analysis. It measures the surrogate predictive probability density of \(Y_{n_F}\) values when paired with \(X_{n_F}\) and candidate \(u\)-settings. This is more or less equivalent to Kennedy & O’Hagan’s likelihood (8.3) in the “nobias” case.
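
As a hypothetical usage sketch, assuming field data XF and yF and a fitted "hetGP"-class surrogate GP as constructed in exercise #5 of Section 10.4, a maximum a posteriori estimate could be obtained by handing lpost.invert to optim; MCMC sampling would proceed similarly, with lpost.invert supplying the log posterior.

## hypothetical MAP search over theta = (u1, u2, s2) by minimizing the
## negated log posterior; out-of-range proposals get a large finite penalty
nlpost <- function(theta, ...) {
  val <- lpost.invert(theta, ...)
  if(!is.finite(val)) return(1e10)
  -val
}
map <- optim(c(0.5, 0.5, var(yF)), nlpost, XF=XF, yF=yF, GP=GP)
map$par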

10.4 Homework exercises

Prompts below explore hetGP variations, sequential design and replication, and application in a calibration (Chapter 8.1) setting.

#1: Latent log variances

In Section 10.2.1, inferential expressions are developed for \(\Lambda_n\) as a diagonal storing smoothed variances for GP modeling with covariance structure \(\nu (C_N + \Lambda_N)\). Unfortunately, such \(\Lambda_n \mid \Delta_n\) given latent \(\Delta_n\) in Eq. (10.7) can be negative, which is bad for a variance. Describe how the scheme can be adapted to work with log variances instead. Hint: the answer to this question is relatively short, but getting it right requires a careful inspection of likelihood and derivatives.

#2: Compensating for mean nonstationarity

Chapter 5.3.3 discussed an equivalence between mean and kernel/covariance characterization for GPs. What happens when you fit hetGP, designed for variance nonstationarity, to data which is homoskedastic but exhibits mean nonstationarity? Consider two examples.

  1. Begin with our favorite 2d data: \(x_1 \exp(-x_1^2 - x_2^2) + \varepsilon\), where \(\varepsilon \sim \mathcal{N}(0, 0.1^2)\) and \(x \in [-2,4]^2\). In Chapter 9.2.2 we used trees to partition up the input space and focus on the interesting lower-left quadrant. We also used it above in Section 10.1.2. Train homoskedastic and heteroskedastic separable Gaussian kernel GPs to outputs collected on a design composed of a uniform \(21 \times 21\) grid in the input space, with three replicates on each site. Plot the predictive mean and standard deviation surfaces using a \(101 \times 101\) grid and comment. Compare pointwise proper scores [Gneiting and Raftery (2007), Eq. (27); also see Chapter 9.1.3] with testing data simulated on that uniform grid.

  2. Now investigate a 1d response with a regime shift: \(f(x) = \sin(\pi(10x + 0.5)) + 10 (-1)^{\mathbb{I}(x < 0.5)}\), for \(x \in [0,1]\) and observed under standard normal noise. Train homoskedastic and heteroskedastic separable Gaussian kernel GPs to training data observed on a uniform length \(n=30\) grid. Visualize the predictive surfaces on a uniform 100-length testing grid and comment. Compare RMSEs and pointwise proper scores using testing evaluations on that grid.

#3: Optimal replication?

Here the goal is to compare SK’s ideal replication numbers \(a_i^*\) at existing training inputs \(\bar{x}_i\), \(i=1,\dots,n\), with lookahead-based IMSPE analogues limited to those same sites.

  1. Create a uniform design in \([0,1]\) of length \(n=21\) and evaluate five replicates of each under fY from Section 10.3.2 and fit a hetGP surface using a Matérn \(\nu=5/2\) kernel. (Otherwise the fit <- mleHetGP(...) defaults are fine.) Obtain predictions on a fine grid in the input space and compare your predicted nuggets predict(...)$nugs to the truth fr.

  2. Suppose the plan was to gather a total of \(N = 201\) simulations from fY. You’ve gathered \(n=21\) already, so there are \(N_{\mathrm{rest}} = 180\) more to go. Use allocate_mult with Ntot \(=N\) to calculate the optimal \(a_i^*\), for \(i=1,\dots,n\).

  3. Similarly use opt <- IMSPE_optim(...) with argument h = \(N_{\mathrm{rest}}\) and candidates Xcand=fit$X0 restricted to existing unique sites. Record all opt$path[[j]]$par as selected \(x\) locations and extract from those a multiplicity count, i.e., number of replicates, corresponding to each unique setting.

  4. Plot your results from #a–c on common axes and comment.

#4: Ocean sequential design

Consider the ocean/oxygen simulator introduced in Section 10.3.4. Here we shall treat only the spatial inputs \(x\), leaving diffusion coefficients \(u\) at their default values.

  1. Build a design in 2d comprised of \(n=25\) unique LHS (Chapter 4.1) locations with twenty replicates on each. Obtain simulation responses at each of these \(N=500\) locations. Fit heteroskedastic and homoskedastic GPs to these data and provide a visual comparison between the two on a \(100 \times 100\) predictive grid, and on a slice of that grid fixing the latitude input near 80% in coded units. Overlay runs from Section 10.3.4 onto that slice and comment. You might find it helpful, but not essential, to use the following settings in your fit(s).
lower <- rep(0.01, 2)
upper <- rep(30, 2)
covtype <- "Matern5_2"
noiseControl <- list(g_min=1e-6, g_bounds=c(1e-6, 1), lowerDelta=log(1e-6))
settings <- list(linkThetas="none", initStrategy="smoothed", 
  return.hom=TRUE)
  2. Now consider sequential design under the heteroskedastic GP only. Begin with \(n_{\mathrm{init}} = 25\) locations, ideally the same ones from #a, but only four replicates on each so that \(N_{\mathrm{init}} = 100\). Acquire 400 more simulations under an adaptive lookahead IMSPE scheme. At the end, augment your visuals from #a, including horizon over the iterations, and comment. You might find it helpful, but not essential, to utilize the following in your IMSPE searches.
control <- list(tol_dist=1e-4, tol_diff=1e-4, multi.start=30)
  3. Finally, repeat #a–b at least ten times and save RMSEs and pointwise proper scores calculated on out-of-sample testing data observed on the predictive grid. (No need to repeat the visuals, just the designs and fits.) Provide boxplots of RMSEs and scores and comment.

#5: Ocean calibration

Data in ocean_field.txt contains \(n_F = N_F = 150\) observations of oxygen concentration at sites in the Southern Ocean matching baseline diffusion coefficients \(u^\star\) encoded in the default fksim formals. See Section 10.3.4 and #4 for details. Separately, build homoskedastic and heteroskedastic GP surrogates from fksim simulation campaign output provided by ocean_runs.txt that combines a 4d LHS of size 500, over latitude and longitude \(x\)-sites with the two diffusion coefficients \(u\), with a 2d maximin LHS of \(u\)-values paired with \(X_{n_F}\) from the field data. Ten replicate runs at each location are gathered for a grand total of \(N_M = 6000\) simulations. Use lpost.invert from Section 10.3.4 in order to sample from a Bayesian posterior distribution of \(u\) and \(\sigma^2\), separately for both surrogates. Compare your posterior distribution(s) for \(u\) with true \(u^\star\) values.

References

Sacks, J, WJ Welch, TJ Mitchell, and HP Wynn. 1989. “Design and Analysis of Computer Experiments.” Statistical Science 4 (4). Institute of Mathematical Statistics: 409–23.

Cioffi–Revilla, Claudio. 2014. Introduction to Computational Social Science. New York, NY: Springer.

Law, A. 2015. Simulation Modeling and Analysis. 5th ed. McGraw-Hill.

Johnson, LR. 2008. “Microcolony and Biofilm Formation as a Survival Strategy for Bacteria.” Journal of Theoretical Biology 251: 24–34.

Picheny, V, and D Ginsbourger. 2013. “A Nonstationary Space-Time Gaussian Process Model for Partially Converged Simulations.” SIAM/ASA Journal on Uncertainty Quantification 1: 57–78.

Banerjee, S, BP Carlin, and AE Gelfand. 2004. Hierarchical Modeling and Analysis for Spatial Data. Chapman; Hall/CRC.

Rasmussen, CE, and CKI Williams. 2006. Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.

Ankenman, B, BL Nelson, and J Staum. 2010. “Stochastic Kriging for Simulation Metamodeling.” Operations Research 58 (2). INFORMS: 371–82.

Binois, M, RB Gramacy, and M Ludkovski. 2018. “Practical Heteroscedastic Gaussian Process Modeling for Large Simulation Experiments.” Journal of Computational and Graphical Statistics 27 (4). Taylor & Francis: 808–21. doi:10.1080/10618600.2018.1458625.

Goldberg, PW, CKI Williams, and CMB Bishop. 1998. “Regression with Input-Dependent Noise: A Gaussian Process Treatment.” In Advances in Neural Information Processing Systems, 10:493–99. Cambridge, MA: MIT Press.

Quadrianto, N, K Kersting, M Reid, T Caetano, and W Buntine. 2009. “Kernel Conditional Quantile Estimation via Reduction Revisited.” In Proceedings of the 9th Ieee International Conference on Data Mining, 938–43.

Boukouvalas, A, and D Cornford. 2009. “Learning Heteroscedastic Gaussian Processes for Complex Datasets.” Aston University, Neural Computing Research Group.

Lazaro–Gredilla, M, and M Titsias. 2011. “Variational Heteroscedastic Gaussian Process Regression.” In Proceedings of the International Conference on Machine Learning, 841–48. New York, NY: ACM.

Binois, M, J Huang, RB Gramacy, and M Ludkovski. 2019. “Replication or Exploration? Sequential Design for Stochastic Simulation Experiments.” Technometrics 61 (1). Taylor & Francis: 7–23. doi:10.1080/00401706.2018.1469433.

Plumlee, M, and R Tuo. 2014. “Building Accurate Emulators for Stochastic Simulations via Quantile Kriging.” Technometrics 56 (4). Taylor & Francis: 466–73.

Snelson, E, and Z Ghahramani. 2006. “Sparse Gaussian Processes Using Pseudo-Inputs.” In Advances in Neural Information Processing Systems, 1257–64.

Banerjee, S, AE Gelfand, AO Finley, and H Sang. 2008. “Gaussian Predictive Process Models for Large Spatial Data Sets.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (4). Wiley Online Library: 825–48.

Pratola, M, H Chipman, EI George, and R McCulloch. 2017. “Heteroscedastic BART Using Multiplicative Regression Trees.” Preprint on arXiv:1709.07542.

Binois, M, and RB Gramacy. 2019. hetGP: Heteroskedastic Gaussian Process Modeling and Design Under Replication. https://CRAN.R-project.org/package=hetGP.

Roustant, O, D Ginsbourger, and Y Deville. 2018. DiceKriging: Kriging Methods for Computer Experiments. https://CRAN.R-project.org/package=DiceKriging.

Dancik, GM. 2018. mlegp: Maximum Likelihood Estimates of Gaussian Processes. https://CRAN.R-project.org/package=mlegp.

Opsomer, J.D., D. Ruppert, W.P. Wand, U. Holst, and O. Hossler. 1999. “Kriging with Nonparametric Variance Function Estimation.” Biometrics 55: 704–10.

Ng, SH, and J Yin. 2012. “Bayesian Kriging Analysis and Design for Stochastic Systems.” ACM Transations on Modeling and Computer Simulation (TOMACS) 22 (3): article no. 17.

Gramacy, RB. 2014. plgp: Particle Learning of Gaussian Processes. https://CRAN.R-project.org/package=plgp.

Kersting, K, C Plagemann, P Pfaff, and W Burgard. 2007. “Most Likely Heteroscedastic Gaussian Process Regression.” In Proceedings of the International Conference on Machine Learning, 393–400. New York, NY: ACM.

Kamiński, B. 2015. “A Method for the Updating of Stochastic Kriging Metamodels.” European Journal of Operational Research 247 (3). Elsevier: 859–66.

Wang, W, and X Chen. 2016. “The Effects of Estimation of Heteroscedasticity on Stochastic Kriging.” In Proceedings of the 2016 Winter Simulation Conference, 326–37. IEEE Press.

Lee, MR, and AB Owen. 2015. “Single Nugget Kriging.” Stanford University.

Binois, M, and RB Gramacy. 2018. “hetGP: Heteroskedastic Gaussian Process Modeling and Sequential Design in R.” Available as a Vignette in the R Package.

Hu, R, and M Ludkovski. 2017. “Sequential Design for Ranking Response Surfaces.” SIAM/ASA Journal on Uncertainty Quantification 5 (1). SIAM: 212–39.

Franck, CT, and RB Gramacy. 2020. “Assessing Bayes Factor Surfaces Using Interactive Visualization and Computer Surrogate Modeling.” To Appear in the American Statistician.

Gramacy, RB, and E Pantaleo. 2010. “Shrinkage Regression for Multivariate Inference with Missing Data, and an Application to Portfolio Balancing.” Bayesian Analysis 5 (2). International Society for Bayesian Analysis: 237–62. doi:10.1214/10-BA602.

Gramacy, RB. 2018b. monomvn: Estimation for Multivariate Normal and Student-\(t\) Data with Monotone Missingness. https://CRAN.R-project.org/package=monomvn.

Jacquier, E, NG Polson, and PE Rossi. 2004. “Bayesian Analysis of Stochastic Volatility Models with Fat-Tails and Correlated Errors.” Journal of Econometrics 122: 185–212.

Chung, M, M Binois, J Bardsley, RB Gramacy, DJ Moquin, AP Smith, and AM Smith. 2019. “Parameter and Uncertainty Estimation for Dynamical Systems Using Surrogate Stochastic Processes.” SIAM Journal on Scientific Computing 41 (4): A2212–A2238.

Hong, LJ, and BL Nelson. 2006. “Discrete Optimization via Simulation Using COMPASS.” Operations Research 54 (1): 115–29.

Xie, J, PI Frazier, and S Chick. 2012. “Assemble to Order Simulator.” http://simopt.org/wiki/index.php?title=Assemble_to_Order&oldid=447.

Gneiting, T, and AE Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association 102 (477). Taylor & Francis: 359–78.

Seo, S, M Wallat, T Graepel, and K Obermayer. 2000. “Gaussian Process Regression: Active Data Selection and Test Point Rejection.” In Mustererkennung 2000, 27–34. New York, NY: Springer–Verlag.

Gramacy, RB, and HKH Lee. 2009. “Adaptive Design and Analysis of Supercomputer Experiments.” Technometrics 51 (2). Taylor & Francis: 130–45.

Gauthier, B, and L Pronzato. 2014. “Spectral Approximation of the IMSE Criterion for Optimal Designs in Kernel-Based Interpolation Models.” SIAM/ASA Journal on Uncertainty Quantification 2 (1). SIAM: 805–25.

Gorodetsky, A, and YM Marzouk. 2016. “Mercer Kernels and Integrated Variance Experimental Design: Connections Between Gaussian Process Regression and Polynomial Approximation.” SIAM/ASA Journal on Uncertainty Quantification 4 (1). SIAM: 796–828.

Pratola, MT, O Harari, D Bingham, and GE Flowers. 2017. “Design and Analysis of Experiments on Nonconvex Regions.” Technometrics. Taylor & Francis, 1–12.

Gramacy, RB, and MA Taddy. 2016. tgp: Bayesian Treed Gaussian Process Models. https://CRAN.R-project.org/package=tgp.

Anagnostopoulos, C, and RB Gramacy. 2013. “Information-Theoretic Data Discarding for Dynamic Trees on Data Streams.” Entropy 15 (12). Multidisciplinary Digital Publishing Institute: 5510–35.

Burnaev, E, and M Panov. 2015. “Adaptive Design of Experiments Based on Gaussian Processes.” In Statistical Learning and Data Sciences, 116–25. Springer-Verlag.

Leatherman, ER, TJ Santner, and AM Dean. 2017. “Computer Experiment Designs for Accurate Prediction.” Statistics and Computing. Springer-Verlag, 1–13.

Ginsbourger, David, and Rodolphe Le Riche. 2010. “Towards Gaussian Process-Based Optimization with Finite Time Horizon.” In MODa 9–Advances in Model-Oriented Design and Analysis, 89–96. Springer-Verlag.

Gonzalez, J, M Osborne, and N Lawrence. 2016. “GLASSES: Relieving the Myopia of Bayesian Optimisation.” In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 790–99.

Lam, R, K Willcox, and DH Wolpert. 2016. “Bayesian Optimization with a Finite Budget: An Approximate Dynamic Programming Approach.” In Advances in Neural Information Processing Systems, 883–91.

Huan, X, and YM Marzouk. 2016. “Sequential Bayesian Optimal Experimental Design via Approximate Dynamic Programming.” Preprint on ArXiv:1604.08320.

Forrester, AIJ, A Sobester, and AJ Keane. 2008. Engineering Design via Surrogate Modelling: A Practical Guide. John Wiley & Sons.

Bogunovic, I, J Scarlett, A Krause, and V Cevher. 2016. “Truncated Variance Reduction: A Unified Approach to Bayesian Optimization and Level-Set Estimation.” In Advances in Neural Information Processing Systems, 1507–15.

Picheny, V, T Wagner, and D Ginsbourger. 2012. “A Benchmark of Kriging-Based Infill Criteria for Noisy Optimization.” Structural and Multidisciplinary Optimization. Springer-Verlag, 1–20.

Ginsbourger, D, R Le Riche, and L Carraro. 2010. “Kriging Is Well-Suited to Parallelize Optimization.” In Computational Intelligence in Expensive Optimization Problems, 131–62. Springer-Verlag.

Lyu, X, M Binois, and M Ludkovski. 2018. “Evaluating Gaussian Process Metamodels and Sequential Designs for Noisy Level Set Estimation.” Preprint on ArXiv:1807.06712.

Chevalier, C, D Ginsbourger, J Bect, and I Molchanov. 2013. “Estimating and Quantifying Uncertainties on Level Sets Using the Vorob’ev Expectation and Deviation with Gaussian Process Models.” In mODa 10 - Advances in Model-Oriented Design and Analysis, edited by Dariusz Ucinski, Anthony C. Atkinson, and Maciej Patan, 35–43. Contributions to Statistics. Springer-Verlag. doi:10.1007/978-3-319-00218-7_5.

Azzimonti, D, D Ginsbourger, C Chevalier, J Bect, and Y Richet. 2016. “Adaptive Design of Experiments for Conservative Estimation of Excursion Sets.” Preprint on ArXiv:1611.07256.

Chevalier, C, V Picheny, D Ginsbourger, and D Azzimonti. 2018. KrigInv: Kriging-Based Inversion for Deterministic and Noisy Computer Experiments. https://CRAN.R-project.org/package=KrigInv.

Chevalier, C, V Picheny, and D Ginsbourger. 2014. “KrigInv: An Efficient and User-Friendly Implementation of Batch-Sequential Inversion Strategies Based on Kriging.” Computational Statistics & Data Analysis 71. Elsevier: 1021–34.

Chevalier, C, D Ginsbourger, and X Emery. 2014. “Corrected Kriging Update Formulae for Batch-Sequential Data Assimilation.” In Mathematics of Planet Earth, 119–22. Springer-Verlag.

Bichon, BJ, MS Eldred, LP Swiler, S Mahadevan, and JM McFarland. 2008. “Efficient Global Reliability Analysis for Nonlinear Implicit Performance Functions.” AIAA Journal 46 (10): 2459–68.

Ranjan, P, D Bingham, and G Michailidis. 2008. “Sequential Experiment Design for Contour Estimation from Complex Computer Codes.” Technometrics 50 (4). Taylor & Francis: 527–41.

Gramacy, RB, and Herbert KH Lee. 2011. “Optimization Under Unknown Constraints.” In Bayesian Statistics. Vol. 9. Oxford University Press.

Picheny, V, D Ginsbourger, O Roustant, RT Haftka, and N Kim. 2010. “Adaptive Designs of Experiments for Accurate Approximation of a Target Region.” Journal of Mechanical Design 132 (7). American Society of Mechanical Engineers: 071008.

Fadikar, A, D Higdon, J Chen, B Lewis, S Venkatramanan, and M Marathe. 2018. “Calibrating a Stochastic, Agent-Based Model Using Quantile-Based Emulation.” SIAM/ASA Journal on Uncertainty Quantification 6 (4). SIAM: 1685–1706.

Higdon, D, J Gattiker, B Williams, and M Rightley. 2008. “Computer Model Calibration Using High-Dimensional Output.” Journal of the American Statistical Association 103 (482). Taylor & Francis: 570–83.

Gattiker, J, K Myers, B Williams, D Higdon, M Carzolio, and A Hoegh. 2016. “Gaussian Process-Based Sensitivity Analysis and Bayesian Model Calibration with Gpmsa.” Handbook of Uncertainty Quantification. Springer, 1–41.

Herbei, Radu, and L Mark Berliner. 2014. “Estimating Ocean Circulation: An MCMC Approach with Approximated Likelihoods via the Bernoulli Factory.” Journal of the American Statistical Association 109 (507). Taylor & Francis: 944–54.

McKeague, Ian W, Geoff Nicholls, Kevin Speer, and Radu Herbei. 2005. “Statistical Inversion of South Atlantic Circulation in an Abyssal Neutral Density Layer.” Journal of Marine Research 63 (4). Sears Foundation for Marine Research: 683–704.


  1. Letter \(\nu\) is preferred to the familiar \(\tau^2\) from Chapter 5 to avoid double squaring: \(\nu^2 \equiv (\tau^2)^2\).

  2. A partial implementation is available for Python in GPy.