Chapter 1 Introduction

1.0.1 R packages required for this chapter

library(knitr)
library(tidyverse)

1.1 Preface

This is an intermediate epidemiology book that focuses on the methods most commonly encountered in problems arising in the domain of clinical epidemiology. The emphasis is on research design, quantitative analysis, critical thinking, and reproducible research methods. As it is my belief that epidemiologic principles are fundamental to high quality medical research and the learning and evaluation of these principles are best consolidated through direct application to problem solving, the text emphasizes the use of quantification, via open source software.

By the end of the book readers can expect to:

• Understand and critically evaluate basic study designs and statistical inferences in medical research
• Understand, identify, and quantitatively evaluate sources of bias in epidemiologic and clinical studies
• Master quantitative techniques for visualizing, cleaning, analyzing and interpreting epidemiologic and clinical data, including descriptive analyses, stratified analyses, regression analyses and meta-analyses.

1.2 Prerequisites

There are multiple excellent introductory (Gordis 2014) (Rothman 2012), intermediate (Szklo and Nieto 2019) and advanced (Rothman, Greenland, and Lash 2008) epidemiology textbooks. There are no obligatory prerequisites for this book, although its use will be easier if there has been some prior exposure to basic epidemiologic concepts.

1.3 My general philosophy for clinical epidemiology

The standard definition of epidemiology is the study of the distribution and determinants of health-related states or events in specified populations, and the application of this study to the control of health problems (Last 2001).

This definition may be expanded as follows:

  • The distribution of health-related states refers to analysis by time, place, and population.

  • The different varieties of epidemiologic study include;

    1. Descriptive: looking at patterns of disease within and between populations, trying to understand why disease rates go up and down, population variation
    2. Etiologic: understand the causes of disease, with implications for interventions to change disease distributions
    3. Predictive: predict outcomes for future patients, with implications for interventions to alter outcomes
  • The common determinants of of health-related states are biological, behavioral, social, cultural, economic, and political factors that influence health

  • The ultimate goal of epidemiology is the advancement of scientific knowledge to promote, protect, and restore health

Clinical epidemiology may therefore be seen (?defined) as the application of epidemiologic principles and reasoning to improve clinical decision making and ultimately “patient” outcomes. However, I think it is preferable to interpret the term “patient” in the widest possible context, and consequently clinical epidemiology might range from the study of preventive measures in asymptomatic individuals to the evaluation of quaternary hospital interventions in large cohorts. As patient cohorts may arise not only from hospitals but also from specific neighborhoods, cities, provinces and countries the distinction between classic epidemiology and clinical epidemiology is hardly distinct. Consequently, while this book will focus on clinical applications, the underlying methods of classic epidemiology will also be presented, admittedly in a focused or condensed manner.

Epidemiology is the quantitative heart of both public health & clinical medicine (i.e. the science of counting within and across populations). This book will emphasize a hands-on quantitative approach to clinical epidemiology. Realizations that “if you can’t measure something, you can’t improve it” and “intuition without quantification is often too imprecise to be useful” support the importance of a rigorous quantitative approach. Epidemiology concepts may be vague, or plain hard to understand, and practical data manipulation provides opportunities to reinforce our understanding of these methods making us both better consumers and producers of medical research.

While this book stresses a quantitative approach, one must be careful to avoid the McNamara or the quantifiable fallacy “if you can’t measure what is important, make what you can measure important,” first documented during the Vietnam war, but similar erroneous uses of quantification may arise in clinical medicine. For example, progression-free survival (PFS) as a primary endpoint in many cancer trials largely because it is an endpoint which is easily measurable, but it meaningfulness compared to other metrics such as overall quality of life or overall survival is highly questionable.(Kim and Prasad 2015) Extensive quantification of biased results are also of no value. Consequently this book will also concentrate on recognizing common biases and assuring the application of methodologic rigor at all levels of study design, analysis, interpretation, and dissemination.

1.4 Brief excursions into quantification

This is a brief excursions into quantification or why intuition is not good enough. Herein I offer several examples that hopefully highlight the need for careful quantification in order to reach reasonable conclusions.

1. Birthday problem

First, consider the birthday problem, a common parlor game, which shows how we may underestimate the play of chance. Imagine you are in a class of 60 people, what is the probability that at least 2 people share the same birthday?

Potential answers

## A. < 5% 
## B. 5 - 15% 
## C. 15 - 25% 
## D. >99%

Most people, who aren’t already aware of the “trick” or “paradox” assume the probability is small given that a birthday may in occur on any one of 365 possible days. Given our natural tendency to egocentricity they imagine there are 59 chances of some else having the same birthday as them, or a probability of 59/365 or 16.2%. However this ignores the other 1711 (5958) possible combinations between someone who is not you being compared to someone else who is also not you. Also as we will see, calculations involving exponentials are hardly intuitive to most people who tend to think in linearities. As an easy solution, begin by considering the reverse problem what is the probability that no one has the same birthday as you. For the 2nd person, first next to you, probability of not being born on the same day as you = 364/365. So the probability that 3rd person not on the same day as first 2 persons = 364/365 363/365 and The probability i th person not on the same day as 1st i-1 = 364/365 * 363/365 * 362/365 * (n-1)/365 or \[ \prod_{i=0}^{i-1} (1- \dfrac{i}{365}) \] Then the probability of being born on the same day = 1 - probability not being born on the same day. These calculations can be easily performed in R.

# create a birthday function to calculate probability of 2 people in a group  (with a default group of 23)
# default of 23 since that gives probbaility = 50%
birthday <- function(n=23){
  1-prod(1-(0:(n-1))/365)
  }
birthday(60)
## [1] 0.9941
# see why the default is 23 -> 50% probability
birthday()
## [1] 0.5073
p <- numeric(100)
for(i in 1:100){
     p[i] <- birthday(i)
 }

p <- data.frame(x=seq(1,100, 1), p=p)
ggplot(p, aes(x,p)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(y = "Probability") +
  ggtitle("Probability of same birthday as a function of sample size") +
  geom_segment(aes(x = 0, y = .50, xend = 23, yend = .50), color="red") +
  geom_segment(aes(x = 23, y = .50, xend = 23, yend = 0), color="red") +
  theme_bw()

So the probability of finding 2 people in a class of 60 with the same birthday is >99%. The graph also shows that provided there are more than 23 people in the class the probability will be > 50%. One can also appreciate the non-linearity of the relationship. To give an idea about the range of functions in R it should be noted that you don’t have to write your own birthday function as above but can simple call the built in pbirthday function. One can also verify these calculations by doing simulations as demonstrated in this blog.

# using the built in R function
cat("The probability of 2 people with the same bithday in a group of 23 is ", pbirthday(23), " and for a group of 60 is ",
pbirthday(60))
## The probability of 2 people with the same bithday in a group of 23 is  0.5073  and for a group of 60 is  0.9941

2. Voter suppression Now being able to solve party games is a good, but not the only, reason to develop your quantitative skills. Consider the very current issue of voter suppression. A news story reported that in Georgia, where 32% of the population is black, the state has recently removed 670,000 voters of whom 70% were black. Is this difference likely due to the play of chance? In other words, how likely are we to observe 469,000 (.7 * 670,000) blacks out of 670,000 if the selection probability is the same as the proportion of blacks in the general population (32%). This can be found with 1 line of code. Additional information about this function can be found with help(binom.test).

binom.test(469000, 670000, .32)
## 
##  Exact binomial test
## 
## data:  469000 and 670000
## number of successes = 469000, number of trials = 670000, p-value <2e-16
## alternative hypothesis: true probability of success is not equal to 0.32
## 95 percent confidence interval:
##  0.6989 0.7011
## sample estimates:
## probability of success 
##                    0.7

It appears highly unlikely to have excluded htis number of black voters if the null hypothesis is that voter suppression is independent of race.

3. Avoiding disasters

In 2018, medical researchers form John Hopkins reported a clinical trial in JAMA entitled “Effect of a Program Combining Transitional Care and Long-term Self-management Support on Outcomes of Hospitalized Patients With Chronic Obstructive Pulmonary Disease A Randomized Clinical Trial” that concluded the intervention lead to “significantly fewer COPD related hospitalizations.” The next year the authors published a retraction and a new article that concluded the intervention lead to “significantly greater COPD related hospitalizations.”

The problem was a simple miscoding of the intervention and control arms. Being able to read and follow code, can assure that extra eyes are available to catch these errors which can happen to all of us. While reviewing their code to correct this major error, the research team discovered at least two other areas of erroneous code (in the commands to impute missing values and to aggregate data into summary variables).

On a broader level, sharing analytic code is increasingly the norm across many fields and provides an unambiguous record of the analytical methods used, aiding reproducibility, helping avoid duplicated effort and possibly accelerating innovation. It would be nice to see this become the norm in clinical epidemiology.

1.5 Outline

The scientific discipline of epidemiology has greatly expanded from its origin narrow focus on communicable disease epidemics to broadly include all phenomena related to population health. The aim of this book is assist in developing the design and analytical skills to critically evaluate (and produce) medical research that avoids biases and quantitative errors. The topics to be discussed will hopefully assist healthcare personnel in reaching the goal of being better consumers and producers of an ever expanding medical literature.

For a variety of reasons, not the least of which is my inexperience and lack of knowledge about publishing a full html book, I have decided to release book chapters as they are completed. This will also allow me to release material more quickly and hopefully get feedback to correct the errors, incorporate suggestions, and generally improve the overall content.

Table of contents (volume 1)
Chapter Topic
Chapter 1 Introduction to clinical epidemiology
Chapter 2 Introduction to statistical software - R
Chapter 3 Exploratory Data Analysis - Data visualization
Chapter 4 Contingency tables, measures of association & R packages
Chapter 5 Statistical inference
Chapter 6 Non-experimental designs & incidence measures
Chapter 7 Causal inference & directed acyclic diagrams (DAGs)
Chapter 8 Confounding
Chapter 9 Selection bias
Chapter 10 Miclassification
Chapter 11 References

Future volumes are likely to include the some of the following topics

  • Effect Measure Modification & interaction
  • Poisson regression
  • Survival analysis
  • Propensity Scores / Instrumental variables
  • Randomized clinical trials
  • Meta-analysis
  • Pharmacoepidemiology
  • Evidence based medicine / Guidelines
  • Health economics / ethics

References

Gordis, Leon. 2014. Epidemiology. Book.
Kim, Chul, and Vinay Prasad. 2015. “Cancer Drugs Approved on the Basis of a Surrogate End Point and Subsequent Overall Survival: An Analysis of 5 Years of US Food and Drug Administration Approvals.” Journal Article. JAMA Internal Medicine 175 (12): 1992–94. https://doi.org/10.1001/jamainternmed.2015.5868.
Last, J. M. 2001. “Dictionary of Epidemiology. 4th Ed. New York: Oxford University Press; 2001. P. 61.” Journal Article.
Rothman, Kenneth J. 2012. Epidemiology : An Introduction. Book. New York, NY: Oxford University Press.
Rothman, Kenneth J., Sander Greenland, and Timothy L. Lash. 2008. Modern Epidemiology. Book.
Szklo, M., and F. Javier Nieto. 2019. Epidemiology : Beyond the Basics. Book.