Computer simulation experiments are essential to modern scientific discovery. Barriers to computing have come way down. Meanwhile all the low-hanging fruit has been picked from the mathematical tree of cute closed-form solutions serving as crude approximations to reality. Occam's Razor is nice philosophy, but the real world isn't always simple. On the Wikipedia page for Occam's Razor, Isaac Newton is quoted as saying the following about how simple the world must be:
We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances. Therefore, to the same natural effects we must, as far as possible, assign the same causes.
If Newton believes in parsimony, he's sure chosen a complicated way of saying so. Given otherwise equivalent competing explanations for puzzling phenomena, I agree simpler is better. But we live in a world exhibiting fine balance, disequilibrium and chaotic behavior all at once. Inherent complexity rules the day. Solving interesting problems at high fidelity requires intricate numerics. Fortunately, in our modern age, lots of highly modular public libraries are available. It's never been easier to patch together a simulation to entertain "what ifs?", discover emergent behavior in novel circumstances, challenge hypotheses with data, and stress-test scenarios; that is, assuming you can code and tolerate a bit of iteration or Monte Carlo. Computer simulations aren't just for physics and chemistry anymore. Biology, epidemiology, ecology, economics, business, finance, engineering, sociology, even politics are experiencing a renaissance of mathematical exploration through simulation.
Trouble is, while vastly greater than just a few decades ago, computing capacity isn't infinite. Simulation experiments must be carefully planned to make the most of a finite resource, input configurations chosen to span representative scenarios, and appropriate meta-models fit in order to effectively utilize simulations towards the advance of science. That's where surrogates come in: as meta-models of computer simulations used to solve mathematical systems that are too intricate to be worked by hand. Gaussian process (GP) regression has percolated up to the canonical position in this arena. It sounds hard, but it's actually quite straightforward and supremely flexible at the same time. One of the main purposes of this text is to expose the beauty and potential held by GPs in a variety of contexts. Our emphasis will be on GP surrogates for computer simulation experiments, but we'll draw upon and exemplify many successes with similar tools deployed in geostatistics and machine learning.
Emphasis is on methods, recipes and reproducibility. The latter two make this book unique by many measures, but especially in the subjects of surrogate modeling and GPs. Methodology-wise, this monograph is somewhat less broad than Santner, Williams, and Notz (2018), a standard text on design and analysis of computer experiments. But it offers more depth on core subjects, particularly on computing and implementation in R (R Core Team 2019), and is more modern in its connection to methodology from machine learning, industrial statistics and geostatistics. Every subject, with the exception of a few references to ancillary material, is paired with illustration in worked code. There's not a single table or figure (which is not a drawing) in the book that's not supported by code on the page. Everything is fully reproducible. What you see is what you get. This wouldn't be possible without modern extensions to R such as RStudio. Specifically, this book is authored in Rmarkdown via bookdown (Xie 2016, 2018a) on CRAN, combining knitr (Xie 2015, 2018b) and rmarkdown (Allaire et al. 2018) packages.
One downside to Rmarkdown is that sometimes, e.g., when illustrations are based on randomly generated data, it's hard to precisely narrate the outcome of a calculation. I hope that readers will appreciate the invitation this implies. You're encouraged to cut-and-paste the example into your own session and see what I mean when I say something like "It's hard to comment precisely about outcomes in this Rmarkdown build." In a small handful of places, random number generator (RNG) seeds are fixed to "freeze" the experiments and enhance specificity, even though that's technically cheating.[1] Uncertainty quantification (UQ) is a major theme in this book. A disappointingly vague narrative represents a perfect opportunity to catch a glimpse of how important, and difficult, it can be to appropriately quantify salient uncertainties.
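To make "fixing a seed" concrete, here is a minimal sketch in base R. The seed value and variable names are hypothetical, chosen only for illustration; they are not taken from the book's examples. Within a single session, resetting the seed reproduces the same pseudorandom draws, while continuing the stream does not.

```r
# Hypothetical seed value, for illustration only.
set.seed(42)
y1 <- rnorm(5)          # five standard normal draws

set.seed(42)            # reset the generator to the same state
y2 <- rnorm(5)          # replays the same five draws

y3 <- rnorm(5)          # no reset: continues the stream with new draws

same_after_reset <- identical(y1, y2)     # "frozen" experiment
same_without_reset <- identical(y1, y3)   # expected to differ

print(same_after_reset)
print(same_without_reset)
```

Even with a fixed seed, the actual numbers drawn may not match across R versions or architectures, so a frozen experiment is only frozen relative to a particular setup.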
Use of R, rather than say Python or MATLAB®, signals that this book is statistical in nature, as opposed to computer science/machine learning, engineering or applied math. That's true, to a point. I'm a professor of statistics, but I was trained as a mathematician, computer scientist, and engineer first. I moved to stats later primarily as a means of latching onto interesting applications from other areas of applied science. This book is the product of that journey. Its mathematical language is statistical because surrogate modeling involves random variables, estimators, uncertainty, conditioning and inference. But that's where it ends. This book has (almost) none of the things practitioners hate about statistics: \(p\)-values, sampling distributions, asymptotics, consistency, and so on. The writing is statistical in form, but the subjects are not about statistics. They're about prediction and synthesis of model and data under uncertainty, about visualization and analysis of information, about design and decision making, about computing, and about implementation. Crucially, it's about all of those things in the context of experimentation through simulation. The target audience is PhD students and post-doctoral scientists in the natural and engineering sciences, in which I include statistics and computer science. The social sciences are increasingly mathematical and computational and I think this book will appeal to folks there as well.
There's nothing special about R here, except that I know R and CRAN packages for surrogate modeling best. Many good tools exist in Python and MATLAB, and pointers are provided. R is the lingua franca in the statistical surrogate modeling world, with MATLAB on somewhat of a decline and Python picking up pace. Any coded examples in the book which don't leverage highly customized CRAN libraries would be trivial to port to any high-level language. Illustrations emphasize algorithmic execution first, using basic subroutines, and library-based automation second. An effort is made to strip the essence of numerical calculations into digestible component parts. I view code readability as at least as important as efficiency. I don't make use of Tidyverse, just ordinary R.[2] Anyone with experience coding, not only R experts, should have no trouble following the examples. Reproducibility and careful engineering of clean, well-documented code are important to me, and I intend this book as a showcase, benchmark and template for young coders.
The progression of subjects is as follows. Chapters 1--2 offer a gentle introduction consisting of historical perspective followed by an overview of four challenging real data/simulator applications. Links to data and simulation code are provided on the book web page: http://bobby.gramacy.com/surrogates. These motivating examples are revisited periodically in the remainder of the text, but mostly in later chapters. Chapter 3 covers classical response surface methodology (RSM), primarily developed before computer simulation modeling and GP surrogates became mainstream. Most of the exposition here is a fly-by of Chapters 5--6 from Myers, Montgomery, and Anderson-Cook (2016) with refreshed examples in R. I'm grateful to Christine Anderson-Cook for help on some of the details. Chapter 4 begins a transition to modern surrogate modeling by introducing appropriate experiment designs. Chapter 5 is on GP regression, starting simple and building up slowly, extolling virtues but not ignoring downsides, and offering several competing perspectives on almost magical properties. Material here served as the basis of a webinar I gave for the American Statistical Association's (ASA) Section on Physical and Engineering Sciences (SPES) in 2017. Chapter 6 revisits design aspects in Chapter 4 from a GP perspective, motivating sequential design as modus operandi and setting the stage for Bayesian optimization (BO) in Chapter 7. Data acquisition as a decision problem, for the purpose of learning and optimization under uncertainty, is one of the great success stories of GP surrogates, combining accurate predictions with autonomous action. Chapter 8 covers calibration and input sensitivity analysis, two important applications of GP surrogates leveraging their ability to synthesize sources of information and to sensibly quantify uncertainty. Smooshing these somewhat disparate themes into a single chapter may seem awkward. My intention is to feature them as two examples of things people do with GP surrogates where UQ is key. Other texts like Santner, Williams, and Notz (2018) present these in two separate chapters. Chapter 9 addresses many of the drawbacks alluded to in Chapter 5, tackling computational bottlenecks limiting training data sizes, scaling up modeling fidelity, hybridizing and dividing-and-conquering with trees, and approximating with highly parallelizable local GP surrogates. Chapter 10 discusses recent upgrades to address surrogate modeling and design for highly stochastic, low signal-to-noise simulations in the face of heteroskedasticity (input-dependent noise). Appendix A discusses linear algebra libraries that are all but essential when working with larger problems; Appendix B introduces a game that helps reinforce many of the ideas expounded upon in this text.
While intended for instruction at the PhD level, I hope you'll find this book to be a useful reference as well. Excepting Chapters 1--2, which target perspective, overview and motivation, the technical progression within and between chapters is highly linear. Methodological development and examples within a chapter build upon one another. Later chapters assume familiarity with concepts introduced earlier, with appropriate context and pointers provided. Each chapter's R examples execute in a novel, standalone R session.[3] Chapter 3 on RSM is relatively self-contained, and not essential for subsequent methodological development, except perhaps as a straw man. Instructors wishing to cut material in order to streamline content should consider Chapter 3 first, possibly encouraging students to skim these sections. Relying on simple linear models, basic calculus and linear algebra, material here is the most intuitive, least mathematically and computationally challenging. Nevertheless, RSM works astonishingly well and is used widely in industry. (These techniques are highly effective on the game in Appendix B.) Chapters 4--8 are the "meat", with Chapters 9--10 demarcating the surrogate modeling frontier. Homework exercises are provided at the end of each chapter. These have been vetted in the classroom over two semesters. Many are deliberately open-ended, framed as research vignettes where students are invited to fill in the gaps. For assignments, I try to strike a balance between mathematical and computational problems (e.g., do #1, #3 and two others of your choosing ...) in a way that allows students to play to their strengths while avoiding crutches. Fully reproducible solutions in Rmarkdown are available from me upon request.
There are many subjects that are not in this book, but very well could be. GP surrogates are king here, but they're by no means the only game in town. Polynomial chaos and deep neural networks are popular alternatives, but they're not covered in this text. My opinion is that both fall short from a UQ perspective, although they offer many other attractive features, especially in big data contexts. Even limiting to GPs, the presentation is at times deliberately narrow. Chosen methods and examples are unashamedly biased toward what I know well, to problems and methods I've worked on, and to R packages in wide use and available on CRAN. Many of those are my own contribution. If it looks like shameless self-promotion, it probably is. I like my work and want to share it with you. Although I've tried to provide pointers to related material when relevant, this book makes no attempt to serve as a systematic review of anything. Books like Santner, Williams, and Notz (2018) are much better in this regard. I hope that readers of my book will appreciate that its value lies in the recipes and intuition it provides, combining math and code in a (hopefully) seamless way, and as a demonstration that reproducibility in science is well within reach.
Before we get started, there are plenty of folks to thank. Let's start with family. Where would I be without Mama and those sweet kiddos? Thank you Leah, Natalia and Kaspar for letting me be proud of you and for helping me be proud of myself. This book is the outcome of confidence's virtuous cycle more than any other single thing. Thanks to my parents for encouraging me in school and for asking "who's paying for that?" every time they called to say hi (to the kids) only to find I'm out of town. Thanks to the Universities of California (Santa Cruz), Cambridge, Chicago and Virginia Tech, for supporting my research and for nurturing my career, and thanks to the US National Aeronautics and Space Administration (NASA), UK Engineering and Physical Sciences Research Council (EPSRC), US National Science Foundation (NSF) and the US Department of Energy (DOE) for funding over the years. Kudos to the Virginia Tech Department of Statistics for inviting me to teach a graduate course on the subject of my choosing, and thereby planting the seed for this book in my mind. Many thanks to students in my Fall 2016 and Spring 2019 classes on Response Surface Methods and Surrogate Modeling, for being my guinea pigs and for helping me refine presentation and fix typos along the way; shout outs to Sierra Merkes, Valeria Quevedo and Ryan Christianson in particular. I appreciated invitations to give short courses to the Statistics Department at Brigham Young University in 2017, a summer program at Lawrence Livermore National Laboratory in 2017, the 2017 Fall Technical Conference and a 2018 DataWorks meeting. Huge thanks to Max Morris (IA State) and Brian Williams (LANL) for going above and beyond with their reviews for CRC.
Robert B. Gramacy
Santner, TJ, BJ Williams, and W Notz. 2018. The Design and Analysis of Computer Experiments, Second Edition. New York, NY: Springer-Verlag.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Xie, Y. 2016. bookdown: Authoring Books and Technical Documents with Rmarkdown. Chapman and Hall/CRC.
Xie, Y. 2018a. bookdown: Authoring Books and Technical Documents with Rmarkdown. https://CRAN.R-project.org/package=bookdown.
Xie, Y. 2015. Dynamic Documents with R and knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. http://yihui.name/knitr/.
Xie, Y. 2018b. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.
Allaire, JJ, Y Xie, J McPherson, J Luraschi, K Ushey, A Atkins, H Wickham, J Cheng, and W Chang. 2018. rmarkdown: Dynamic Documents for R. https://CRAN.R-project.org/package=rmarkdown.
Myers, RH, DC Montgomery, and CM Anderson-Cook. 2016. Response Surface Methodology: Process and Product Optimization Using Designed Experiments. John Wiley & Sons.
[1] Seeds are not provided, in part because RNG sequences can vary across R versions. Conditional expressions involving floating point calculations can change across architectures and lead to different results in stochastic experimentation even with identical pseudorandom numbers. It's impossible to fully remove randomness from the experience of engaging with the book material, which inevitably thwarts precise verbiage at times.
[2] Tidyverse is a very important part of the R ecosystem, and its introduction has helped keep the R community on the cutting edge of analytics and data science. Its target audience is data wranglers. Mine is methodological developers and practitioners of applied science. I feel strongly that building from a simple base is essential to effective communication and portability of code.
[3] library(knitr) begins each chapter for pretty table printing, as with