I am writing this book to introduce R—a programming language and environment for statistical computing and graphics—to public health epidemiologists and health care analysts conducting population health analyses. Recent graduates come prepared with a solid foundation in epidemiological and statistical concepts and skills. However, what is sometimes lacking is the ability to implement new methods and approaches they did not learn in school. This is more apparent today with the emergence of data science and the new field of population health data science (PHDS)—the art and science of transforming data into actionable knowledge to improve health.

PHDS is a transdisciplinary, rapidly emerging field that integrates the expertise from public health and medicine, mathematics, statistics, computer science, decision sciences, health economics, behavioral economics and human-centered design. PHDS is the future of public health data analysis and synthesis, and knowledge integration. Knowledge integration is the management, synthesis, and translation of knowledge into decision support systems to improve policy, practice, and—ultimately—population health.

In contrast to custom-made tools or software packages, R is a suite of basic tools for statistical programming, analysis, and graphics. One will not find a “command” for a large number of analytic procedures one may want to execute. Instead, R is more like a set of high quality carpentry tools (hammer, saw, nails, and measuring tape) for tackling an infinite number of analytic problems, including those for which custom-made tools are not readily available or affordable. I like to think of R as a set of extensible tools to implement one’s analysis plan, regardless of simplicity or complexity. With practice, one not only learns to apply new methods, but one also develops a depth of understanding that sharpens one’s intuition and insight. With understanding comes clarity, focused problem-solving, creativity, innovation, and confidence.

This book is divided into two parts. First, I cover how to process, manipulate, and operate on data in R. Most books cover this material briefly or leave it for an appendix. I dedicate a significant amount of space to this topic on the assumption that the average health analyst is not familiar with R. A good grounding in the basics will make the later chapters more understandable and enable one to pick up any book on R and implement new methods quickly.

Second, I cover basic PHDS from an epidemiologic perspective. Data science is “the art and science of transforming data into actionable knowledge.” Here is where we can build on the strengths of epidemiology (descriptive and analytic studies). However, in public health practice we need much more than this:

We need to effectively and efficiently influence, guide, and advise decision makers in a relevant and timely way. The decision makers include patients, clients, policy makers, colleagues, and community stakeholders. When possible, this should happen in real time. Beyond “analysis” we need “synthesis” of data, information, and knowledge from diverse sources to promote better decision making in the setting of complex environments, limited information, multiple objectives, competing trade-offs, uncertainty, and time constraints.

PHDS can be summarized with four verbs (describe, predict, discover, and advise) and extends epidemiology into six analytic categories (Table .).

Table .: Population health data science analytic domains

| Action   | Analytic categories   | Purpose                                                              |
|----------|-----------------------|----------------------------------------------------------------------|
| Describe | Descriptive analysis  | 1. measuring the burden of risk factors and outcomes                 |
| Predict  | Predictive analysis   | 2. early targeting of prevention and response strategies             |
| Discover | Explanatory analysis  | 3. testing causal pathways for designing prevention strategies       |
|          |                       | 4. discovering and testing new causal pathways                       |
| Advise   | Prescriptive analysis | 5. optimizing decisions, priority-setting, and resource allocation   |
|          | Simulation analysis   | 6. modeling processes for epidemiologic and decision insights        |

The field of data science is exploding! And the field of epidemiology, a public health basic science, is learning how to work effectively on transdisciplinary teams with mathematicians, statisticians, computer scientists, informaticians, clinicians, and subject matter experts. No individual will have all the required technical expertise for data science. Data science is a team sport. What will we bring to the data science table? I hope this book will contribute to answering this question.

My goal is not to be comprehensive in each topic but to demonstrate how R can be used to implement a diversity of methods relevant to PHDS. My hope is that more and more epidemiologists will embrace R to become epidemiologic data scientists or, at least, include R in their epidemiologic toolbox.

Tomás J. Aragón
San Francisco, California