Population Health Data Science with R
Transforming data into actionable knowledge
Tomás J. Aragón
We are writing this book to introduce R, a programming language and environment for statistical computing and graphics, to public health epidemiologists, health care data analysts, data scientists, statisticians, and others conducting population health analyses. Recent graduates come prepared with a solid foundation in epidemiological and statistical concepts and skills. However, what is sometimes lacking is the ability to implement new methods and approaches they did not learn in school. This is more apparent today with the emergence of data science and the new field of population health data science (PHDS): the art and science of transforming data into actionable knowledge to improve health.[1]
The key phrase is actionable knowledge. Therefore, PHDS is a transdisciplinary field that integrates expertise from public health and medicine, probability and statistics, computer science, decision sciences, health and behavioral economics, and human-centered design. PHDS is the future of public health data analysis and synthesis, and knowledge integration.
Why R? In contrast to custom-made tools or software packages, R is a suite of basic tools for statistical programming, analysis, and graphics. One will not find a “command” for a large number of analytic procedures one may want to execute. Instead, R is more like a set of high-quality carpentry tools (hammer, saw, nails, and measuring tape) for tackling an infinite number of analytic problems, including those for which custom-made tools are not readily available or affordable. We like to think of R as a set of extensible tools to implement one’s analysis plan, regardless of simplicity or complexity. With practice, one not only learns to apply new methods, but also develops a depth of understanding that sharpens one’s intuition and insight. With understanding comes clarity, focused problem-solving, creativity, innovation, and confidence.
This book is divided into two parts. First, we cover how to process, manipulate, and operate on data in R. Most books cover this material briefly or leave it for an appendix. We decided to dedicate an appropriate amount of space to this topic, on the assumption that the average health analyst is not familiar with R. A good grounding in the basics will make the later chapters more understandable and enable one to pick up any book on R and implement new methods quickly.
Second, we cover basic PHDS from a public health epidemiologic perspective. We build on the strengths of epidemiology (descriptive and analytic studies). However, in public health practice we need much more:
To transform population health we need to improve decision making, resource allocation, and performance. The decision makers include patients, clients, policy makers, colleagues, and community stakeholders. Beyond “analysis” we need “synthesis” of data, information, and knowledge from diverse sources to promote better decision making in the setting of complex environments, limited information, multiple objectives, competing trade-offs, uncertainty, and time constraints.
How do we do this? Traditionally, epidemiologic methods are described as either descriptive (describing needs or generating hypotheses) or analytic (testing causal or intervention effects). Building upon this, PHDS has five domains of analysis, summarized in the table below.
|Analysis|Population health purpose|
|---|---|
|Descriptive|1. measuring risk or protective factors and outcomes|
|Predictive|2. early detecting and targeting of interventions|
|Causal|3. discovering and estimating causal or intervention effects|
|Simulation|4. modeling for epidemiologic or decision insights|
|Decision|5. informing, influencing or optimizing decision quality (DQ)|
Each one of these analytic domains can “drive” decision-making (often referred to as “data-driven” decision-making). For PHDS, we will emphasize decision quality (DQ) in all decisions (Spetzler, Winter, and Meyer 2016). DQ is at the core of PHDS!
The field of data science is exploding! And the field of epidemiology, a public health basic science, is learning how to work effectively on transdisciplinary teams with mathematicians, statisticians, computer scientists, informaticians, clinicians, and subject matter experts. No individual has all the required technical expertise for data science. Data science is a team sport. What will we bring to the data science table? We hope this book will contribute to answering this question.
Our goal is not to be comprehensive in each topic but to demonstrate how R can be used to implement a diversity of methods relevant to PHDS. Our hope is that more and more epidemiologists will embrace R and become population health data scientists or, at least, include R in their epidemiologic toolbox.
Finally, for population health leaders and data scientists, PHDS sharpens and supports population health thinking, which is continuous improvement in:
- decision quality,
- causal inference, and
- probabilistic reasoning.
From cognitive neuroscience we know that humans mostly perform poorly at all three, especially in the face of complexity, uncertainty, competing trade-offs, confounding, mediation, or interaction (Kahneman 2011; Spetzler, Winter, and Meyer 2016; Nisbett 2016; Pearl 2018; Barrett 2018). In this book, we introduce graphical models (primarily Bayesian networks and variants) as a unifying framework that connects the fields of probability and statistics, epidemiology and medicine, and decision and computer sciences in a profoundly elegant way. Once you experience the visual simplicity, analytic power, and profound insights from graphical models you will never look back. Population health thinking is the heart and soul of PHDS, making PHDS much more than the sum of its parts!
Tomás J. Aragón
University of California
School of Public Health
Spetzler, Carl, Hannah Winter, and Jennifer Meyer. 2016. Decision Quality: Value Creation from Better Business Decisions. 1st ed. Wiley.
Kahneman, Daniel. 2011. Thinking, Fast and Slow. New York: Farrar, Straus and Giroux.
Nisbett, Richard. 2016. Mindware: Tools for Smart Thinking. New York: Farrar, Straus and Giroux.
Pearl, Judea. 2018. The Book of Why: The New Science of Cause and Effect. Basic Books.
Barrett, Lisa Feldman. 2018. How Emotions Are Made: The Secret Life of the Brain. Mariner Books.
[1] Population health is a systems framework for studying and improving the health of populations through collective action and learning.