An Incomplete Solutions Guide to the NIST/SEMATECH e-Handbook of Statistical Methods
examples and case studies using the tidyverse and ggplot2
Exploratory Data Analysis (EDA) is a philosophy on how to work with data, and for many applications its workflow is a better fit for most working scientists and engineers. As scientists, we are trained to formulate a hypothesis and design a series of experiments that will allow us to test the hypothesis effectively. Unfortunately, most data doesn’t come from carefully controlled trials, but from observations. Statisticians will readily describe the difference in as much detail as you would like.
For most of us, we need tools to characterize an instrument or a process. The philosophy of EDA provides the framework to do this work.
Unfortunately, most textbooks still focus on traditional statistical techniques. While it is essential to understand the underlying assumptions and fundamentals, I would argue that most of the work we do as scientists and engineers is not well suited for rigorous statistical analysis. In many cases, the need to disseminate information to a broad audience is best served by the methods espoused by EDA. The NIST/SEMATECH e-Handbook of Statistical Methods is a welcome deviation from the norm.
In the spring of 2018, I adopted this text as the basis of a one-semester graduate course that focused on applied statistical techniques. The audience for this course was working scientists, and the course was a core course in a Professional Science Master’s (PSM) program.
The one drawback of the NIST Handbook is its use of Dataplot as the primary software package for analysis. The authors have provided examples using the R statistical language; however, most, if not all, of these scripts are written in base R, which is unfortunate because modern R now offers many packages that streamline the EDA process. This book attempts to capture my efforts to use these methods and share them with students in the course. The two packages that I primarily used were tidyverse and ggplot2.
Before going further, I should clarify one thing: I’m a hack. I classify learning into three levels: novice, hack, and expert.
Novice: Basic knowledge of how to use a tool with a desire to learn.
Hack: Basic to intermediate knowledge of how to use a tool accompanied by resources to produce a finished product.
Expert: Extensive knowledge of how to use a tool; can produce a finished product with few outside resources.
I’m sure other factors can be added to each category, but these capture the spirit of how I approach learning.
Many resources are available for learning R, and the first I would strongly recommend is R for Data Science. This text is an introduction to the tidyverse. The tidyverse is not just a collection of R packages, but a philosophy on how to work with data. It makes data analysis almost fun!
The other primary resource available for EDA is ggplot2. Like the tidyverse, ggplot2 is not just a package of tools, but a philosophy built around the Grammar of Graphics.
I encourage the reader to explore the references related to these two packages and their underlying design philosophies.
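As a rough illustration of what these two philosophies look like in practice, here is a minimal sketch. It is not taken from the e-Handbook: it uses R's built-in mtcars data set purely as a stand-in, and the object names cars_summary and p are my own. The first step summarises the data with a dplyr pipeline; the second builds a plot layer by layer in the Grammar of Graphics style.

```r
# Assumes the dplyr and ggplot2 packages are installed
# (both are part of the tidyverse).
library(dplyr)
library(ggplot2)

# Tidyverse style: a pipeline of small, readable steps.
# Group the cars by cylinder count, then compute the mean
# fuel economy and the number of cars in each group.
cars_summary <- mtcars %>%
  mutate(cyl = factor(cyl)) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())

print(cars_summary)

# Grammar of Graphics style: map variables to aesthetics,
# then add layers (points, labels) one at a time.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders"
  )

print(p)
```

The same two-step pattern, tidy the data first and then map it onto a plot, carries over to any data frame; only the column names change.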
This book will show how I have worked through the exercises and case studies presented in the NIST handbook using methods found in the tidyverse and ggplot2. I have found this framework to be incredibly satisfying and one I was eager to share beyond my class.
If you find this material useful, please send me an email.
Structure of the book
Content was built around the e-book NIST/SEMATECH e-Handbook of Statistical Methods.
At the beginning of each exercise or case study, I’ve included a link back to the specific page of the e-Handbook. The e-Handbook can be downloaded in full from the NIST site. Note that the compressed file is over 100 MB, not the 43 MB stated.
Software information and conventions
The code in this book follows the “best practices” of the tidyverse.
The R session information for this book is shown below:
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
##  en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
##  stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
##  bookdown_0.7 Rcpp_0.12.16 lattice_0.20-35 digest_0.6.15
##  rprojroot_1.3-2 MASS_7.3-49 grid_3.4.3 nlme_3.1-137
##  backports_1.1.2 magrittr_1.5 evaluate_0.10.1 stringi_1.1.7
##  rstudioapi_0.7 minqa_1.2.4 nloptr_1.0.4 Matrix_1.2-14
##  rmarkdown_1.9 splines_3.4.3 lme4_1.1-17 tools_3.4.3
##  stringr_1.3.0 xfun_0.1 yaml_2.1.18 compiler_3.4.3
##  htmltools_0.3.6 knitr_1.20