Preface

Exploratory Data Analysis (EDA) is a philosophy on how to work with data, and for many applications, its workflow is better suited to scientists and engineers. As scientists, we are trained to formulate a hypothesis and design a series of experiments that allow us to test the hypothesis effectively. Most data, however, doesn’t come from carefully controlled trials, but from observations. Statisticians can readily jump into describing the differences in as much detail as they would like.

Most of us, however, need tools to characterize an instrument or a process. The philosophy of EDA provides the framework to do this work.

Most textbooks still focus on traditional statistical techniques, and although it is essential to understand the underlying assumptions and fundamentals, I would argue that the work we do as scientists and engineers is not well suited to rigorous statistical analysis. In many cases, the need to disseminate information to a broad audience is best served by the methods espoused by EDA. The NIST/SEMATECH e-Handbook of Statistical Methods is a welcome deviation from the norm.

In the spring of 2018, I adopted this text as the basis of a one-semester graduate course that focused on applied statistical techniques. The audience for this course was working scientists, and the course was a core course in a Professional Science Master’s (PSM) program.

The one drawback of the NIST Handbook is its use of Dataplot as the primary software package for analysis. The authors have provided examples using the R statistical language; however, most, if not all, of these scripts are written using base R. Modern R now incorporates many packages for streamlining the EDA process. This book attempts to capture my efforts to use these methods and share them with students in the course. The two packages that I primarily used were tidyverse and ggplot2.

Before going further, I should clarify one thing: I classify expertise into three levels: novice, hack, and expert.

Novice: Basic knowledge of how to use a tool with a desire to learn.

Hack: Basic to intermediate knowledge of how to use a tool accompanied by resources to produce a finished product.

Expert: Extensive knowledge of how to use a tool; can produce a finished product with few outside resources.

I’m sure other factors can be added to each category, but these capture the spirit of how I approach learning.

The resources available for learning R are numerous, and the first I would strongly recommend is R for Data Science. This text is an introduction to the tidyverse. The tidyverse is not just a collection of R packages, but a philosophy on how to work with data. It makes data analysis almost fun!

The other primary resource available for EDA is ggplot2. Like the tidyverse, ggplot2 is not just a package of tools, but a philosophy built around the Grammar of Graphics.
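To give a flavor of the two philosophies, here is a minimal sketch. It is not taken from the NIST Handbook or from the exercises in this book; the choice of R's built-in `mtcars` data set and the object names `mpg_by_cyl` and `p` are assumptions made purely for illustration.

```r
# Illustrative sketch only: mtcars ships with R and is not a NIST data set.
library(dplyr)
library(ggplot2)

# tidyverse style: a pipeline of small verbs, read top to bottom
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())

# Grammar of Graphics style: start with data and aesthetic mappings,
# then add a geometry layer and labels
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lb)", y = "Miles per gallon", colour = "Cylinders")
```

Printing `mpg_by_cyl` shows one summary row per cylinder count, and printing `p` draws the scatter plot. Each step is a separate, readable layer, which is the appeal of both packages for exploratory work.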

I encourage the reader to explore the references related to these two packages and their underlying design philosophies.

This book will show how I have worked through the exercises and case studies presented in the NIST handbook using methods found in the tidyverse and ggplot2. I have found this framework to be incredibly satisfying and one I was eager to share beyond my class.

If you find this material useful, please send me an email.

Structure of the book

Content was built around the e-book NIST/SEMATECH e-Handbook of Statistical Methods.

At the beginning of each exercise or case study, I’ve included a link back to the specific page of the e-Handbook. The e-Handbook can be downloaded in full from the NIST site. The compressed file is over 100 MB, not the 43 MB stated on the site.

Software information and conventions

The code in this book follows the “best practices” of the tidyverse.

The R session information for this book is shown below:

sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] bookdown_0.9     Rcpp_1.0.0       lattice_0.20-38  digest_0.6.18   
##  [5] MASS_7.3-51.1    grid_3.4.3       nlme_3.1-137     magrittr_1.5    
##  [9] evaluate_0.12    stringi_1.2.4    rstudioapi_0.9.0 minqa_1.2.4     
## [13] nloptr_1.2.1     Matrix_1.2-15    rmarkdown_1.11   splines_3.4.3   
## [17] lme4_1.1-19      tools_3.4.3      stringr_1.3.1    yaml_2.2.0      
## [21] xfun_0.4         compiler_3.4.3   htmltools_0.3.6  knitr_1.21

Acknowledgements

This book was created using the bookdown package (Xie 2018), which was built on top of R Markdown and knitr (Xie 2015).

Ray James Hoobler
Salt Lake City, Utah
May 2018

References

Xie, Yihui. 2018. Bookdown: Authoring Books and Technical Documents with R Markdown. https://CRAN.R-project.org/package=bookdown.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. http://yihui.name/knitr/.

An Incomplete Solutions Guide to the NIST/SEMATECH e-Handbook of Statistical Methods