We are writing this book to introduce R, a programming language and environment for statistical computing and graphics, to public health epidemiologists, health care data analysts, data scientists, statisticians, and others conducting population health analyses. Recent graduates come prepared with a solid foundation in epidemiological and statistical concepts and skills. What is sometimes lacking, however, is the ability to implement new methods and approaches that they did not learn in school. This gap is more apparent today with the emergence of data science and the new field of population health data science (PHDS): the art and science of transforming data into actionable knowledge to improve health.¹

The key phrase is actionable knowledge. PHDS is a transdisciplinary field that integrates expertise from public health and medicine, probability and statistics, computer science, decision sciences, health and behavioral economics, and human-centered design. PHDS is the future of public health data analysis and synthesis, and of knowledge integration.

Why R? In contrast to custom-made tools or software packages, R is a suite of basic tools for statistical programming, analysis, and graphics. One will not find a “command” for many of the analytic procedures one may want to execute. Instead, R is more like a set of high-quality carpentry tools (hammer, saw, nails, and measuring tape) for tackling an infinite number of analytic problems, including those for which custom-made tools are not readily available or affordable. We like to think of R as a set of extensible tools for implementing one’s analysis plan, whether it is simple or complicated. With practice, one not only learns to apply new methods, but also develops a depth of understanding that sharpens one’s intuition and insight. With understanding comes clarity, focused problem-solving, creativity, innovation, and confidence.

This book is divided into two parts. First, we cover how to process, manipulate, and operate on data in R. Most books cover this material briefly or leave it for an appendix. We decided to dedicate an appropriate amount of space to this topic, on the assumption that the average health analyst is not familiar with R. A good grounding in the basics will make the later chapters more understandable and will enable one to pick up any book on R and implement new methods quickly.

Second, we cover basic PHDS from a public health epidemiologic perspective. We build on the strengths of epidemiology (descriptive and analytic studies). However, in public health practice we need much more:

To transform population health, we need to improve decision-making, problem solving, performance improvement, priority-setting, and resource allocation. The decision makers include patients, clients, policy makers, colleagues, and community stakeholders. Beyond “analysis” we need “synthesis” of data, information, and knowledge from diverse sources to promote better decision-making in the setting of complex environments, limited information, multiple objectives, competing trade-offs, uncertainty, and time constraints.

How do we do this? Traditionally, epidemiologic methods are described as either descriptive (describing needs or generating hypotheses) or analytic (testing causal or intervention effects). Building on this, PHDS has five domains of analysis (Table 0.1).

TABLE 0.1: Population health data science analytic levels

     Type of analysis   Description
  1  Description        Surveillance and early detection of events
                        Prevalence and incidence of risks and outcomes
  2  Prediction         Early prediction and targeting of interventions
  3  Causal inference   Discovery of new causal effects and pathways
                        Estimation of intervention effectiveness
  4  Simulation         Modeling for epidemiologic or decision insights
  5  Optimization       Informing or optimizing decisions or efficiencies²

Each one of these analytic domains can “drive” decision-making (often referred to as “data-driven” decision-making). For PHDS, we will emphasize decision quality (DQ) in all decisions [1]. DQ is at the core of PHDS!

The field of data science is exploding! And the field of epidemiology, a public health basic science, is learning how to work effectively on transdisciplinary teams with mathematicians, statisticians, computer scientists, informaticians, clinicians, and subject matter experts. No individual has all the required technical expertise for data science. Data science is a team sport. What will we bring to the data science table? We hope this book will contribute to answering this question.

Our goal is not to be comprehensive in each topic but to demonstrate how R can be used to implement a diversity of methods relevant to PHDS. We hope that more and more epidemiologists will embrace R and become population health data scientists, or at least, include R in their epidemiologic toolbox.

Finally, for population health leaders and data scientists, PHDS sharpens and supports population health thinking, which is continuous improvement in:

  1. probabilistic reasoning,
  2. causal inference, and
  3. decision quality.

From cognitive neuroscience we know that humans perform poorly at all three, especially in the face of complexity, uncertainty, competing trade-offs, confounding, mediation, or interaction [15]. In this book, we introduce graphical models (primarily Bayesian networks and variants) as a unifying framework that connects the fields of probability and statistics, epidemiology and medicine, and decision and computer sciences in a profoundly elegant way. Once you experience the visual simplicity, analytic power, and profound insights from graphical models, you will never look back. Population health thinking is the heart and soul of PHDS, making PHDS much more than the sum of its parts!

Tomás J. Aragón³ & Wayne T. Enanoria
School of Public Health, Epidemiology
University of California, Berkeley, California

Department of Epidemiology and Biostatistics
University of California, San Francisco, California


1. Spetzler C, Winter H, Meyer J. Decision quality: Value creation from better business decisions. 1st ed. Wiley; 2016.

15. Barrett L. How emotions are made: The secret life of the brain. Mariner Books; 2018.

  1. Population health is a systems framework for studying and improving the health of populations through collective action and learning.

  2. For example, cost-benefit or cost-effectiveness analysis.

  3. (blog) and (GitHub)