Reproducible Medical Research with R
Welcome to Reproducible Medical Research with R (RMRWR). I hope that this book is helpful to you.
This is a book for anyone in the medical field interested in analyzing the data available to them to better understand health, disease, or the delivery of care.
This could include nurses, dieticians, psychologists, and PhDs in related fields, as well as medical students, residents, fellows, or doctors in practice.
I expect that most learners will be using this book in their spare time at night and on weekends, as the health training curricula are already packed full of information, and there is no room to add skills in reproducible research to the standard curriculum. This book is designed for self-teaching, and many hints and solutions will be provided to avoid roadblocks and frustration. Many learners find themselves wanting to develop reproducible research skills after they have finished their training, and after they have become comfortable with their clinical role. This is the time when they identify and want to address problems faced by patients in their practice with the data they have before them. This book is for you.
Thank you for giving this e-book a try.
This is designed for physicians and others analyzing health data who are interested in pursuing this field using the R computer language.
We will assume that:
- You have access to a computer
- You have access to the internet
- You can download and install software from the internet to your computer
How to download and install R and RStudio will be addressed, step by step, in Chapter 2.
This book is structured on the concept of a “spiral of success”, with readers learning about topics like data visualization, data wrangling, data modeling, reproducible research, and communication of results in repeated passes. These will initially be at a superficial level, and at each pass of the spiral, will provide increasing depth and complexity. This means that the chapters on data wrangling will not all be together, nor the chapters on data visualization. Our goal is to build skills gradually, and return to (and remind students of) their previously built skills in one area and to add to them. The eventual goal is for learners to be able to produce, document, and communicate reproducible research to their community.
Most medical providers who learn R to do their own data analysis do it on their own time. They rarely have time for a semester-long course, as their clinical schedules usually will not allow it. Fortunately, a lot of people learn R on their own, and there is a strong and supportive R community to help new learners. A 2019 Twitter survey conducted by @RLadies found that more than half of respondents were largely self-taught, from books and online resources.
There are a lot of good resources for learning R, so why one more? In part, because the needs of a medical audience are often different. There are distinct needs for protecting health information, generating a descriptive Table One, using secure data tools like REDCap, and creating standard medical journal and meeting output in Word, PowerPoint, and poster formats.
Further, while learning from a textbook can be helpful, this e-book has the ability to include interactive features that are important for learning to write your own analysis code. Informative flipbook demonstrations will show you what steps in R code do, and learnr exercises will give you a chance to do your own coding to solve problems, right within this e-book.
More and more, all science is becoming data science. We are able to track patients, their test results, and even the individual voxels of their CT scans electronically, and use those data points to develop new knowledge. While one could argue that health care workers should collect data and bring it to trained statisticians, this does not work nearly as well as you might expect. Most academic statisticians are incentivized to develop new statistical methods, and are neither very interested in, nor incentivized to do, the hand-holding required to wrangle messy clinical data into a manuscript.
There also are simply not enough statisticians to meet the needs of medical science. Having clinicians on the front lines with some data science training makes a big difference, whether in 1854 in London (John Snow, tracking a cholera outbreak to a water pump) or in 2014 in Flint, Michigan (Mona Hanna-Attisha, showing a rise in blood lead levels in children after a change in the water supply). Having more clinicians with some data science training will impact medical care, as they will identify local problems that would otherwise never have reached a statistician, and probably would never have been addressed with data.
Beginning as far back as 1989, with the David Baltimore case, in which NIH-funded cancer research was only documented on a pile of paper towels, there has been a rising tide of realization that a lot of taxpayer-funded science is done sloppily. Increasingly and publicly through the 2010s, there has been a realization that manuscripts don’t document the scientific process well, and that our standards as scientists need to be higher. The line between carelessly-done science and outright fraud is a thin one, and the case can be made that doing science in a sloppy fashion defrauds the funders, as it leads to results that cannot be reproduced by the authors nor replicated by others. Particularly in medicine, where incorrect findings can cause great harm, we should take special care to do scientific research which is well-documented, reproducible, and replicable.
Several cases on the border between sloppy science and outright fraud involved misuse of spreadsheets or careless coding. Anil Potti at Duke made several errors in spreadsheets that led to claims that tumor multi-omics could guide chemotherapy, claims which were clearly wrong and likely harmed patients in clinical trials. Aboumatar and colleagues at Johns Hopkins published a study in JAMA in 2018 of an intervention for COPD that they claimed improved outcomes, but had to retract this paper in 2019 when they discovered that they had mis-coded the intervention, and that the intervention was actually harmful. The topic of coding correctly, and rooting out errors, as a motivating force for doing careful medical research will be expanded upon in Chapter 1.
There are several icons at the top left of the main column, to the right of the sidebar, that can be helpful:
1. The Table of Contents Sidebar - Click on the ‘hamburger’ menu icon (three horizontal lines) or press the s key to toggle the sidebar (table of contents, on the left) on and off. Within the sidebar, you can click on whichever chapter or subsection you want and go directly to it.
2. This book is Searchable - Click on the magnifying glass or press the f key to toggle the Find box and search for whatever you need to find.
3. You can change the font size, font, and background by clicking on the A icon.
4. You can download the chapter with the download icon (downward arrow into a file tray) in PDF or EPUB formats.
At the top right of the main column, there are several icons for sharing links to the current chapter through social media.
- You can scroll up and down within a chapter with your mouse, or use the up and down arrow keys.
- You can page through chapters with the left and right arrow keys.
This is not an introduction to statistics.
I am assuming that you have learned some statistics somewhere in secondary school, undergraduate studies, graduate school, or even medical school.
There are lots of statisticians with Ph.D.s who can certainly teach statistics much more effectively than I can.
While I have a master’s degree in Clinical Research Design and Statistical Analysis (isn’t that a mouthful!) from the University of Michigan, I will leave formal teaching of statistics to the pros.
If you need to brush up on your statistics, no worries. There are several excellent (and free!) e-books on that very topic, using R. Some good examples include (go ahead and click through the colored links to explore):
We will cover much of the same material as these books, but with a less theoretical and more applied approach. I will focus on specific medical examples, and emphasize issues (like Protected Health Information) that are particularly important for medical data. I am assuming that you are here because you want to analyze your own data in your (probably) very limited free time.
This book is also far from comprehensive in teaching what is available in the R ecosystem. This book should be considered a launch pad. Many of the later chapters will give you a taste of what is available in certain areas, and guide you to resources (and links) that you can explore to learn more and do more beyond the scope of this book. The R computer language has expanded far beyond statistics, and allows you to do many powerful things to improve your workflow, make amazing graphics, and share results with others.
Keep an eye out for helpful Guideposts, which look like this:
Warnings - These are common “gotchas” to watch out for
This indicates a common syntax error, especially for beginners. Watch out for this.
Tips - These are helpful items to remember
This is a helpful tip for debugging.
Try It Out - This is an example to apply what you have learned
Take what you have learned and try it yourself on your own computer.
Challenge - These are usually at the end of a chapter, to try a more challenging example and consolidate what you have learned
Take the next step, and try this more complicated example.
Explore More - These are resources for learning more about a particular topic.
If you want to learn more about Shiny apps, go to https://mastering-shiny.org to see an entire book on the topic.
Throughout this book you will find flipbook code demonstrations and learnr code interactive exercises in which you can practice writing R code right in the book. Let’s explain how to use these demonstration flipbooks and learnr exercises.
Flipbooks are windows in this book in which you can watch R code being built into pipelines, and see the results at each step. Each flipbook demonstrates some important code concepts, and often new functions in R. You can click on the window to activate it, and the fullscreen (4 arrows) icon to expand it to the full screen. Then use the left and right arrow keys to go forward and back in the code, one step at a time. You will want to go through these slowly, and make sure that you understand what is happening in each step. You may even want to take notes, particularly on the function syntax, as you will likely do coding exercises with these functions shortly after the flipbook demonstration.
Take a look at the example of a flipbook below.
Activate it by clicking on it, and use the expand icon (4 arrows at the lower right) to make it full screen. You can step forward and backward through the pipeline of code with the right and left arrow keys. Watch the results of each step.
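To give a sense of what stepping through a pipeline looks like in plain code, here is a minimal sketch that builds one pipeline a step at a time, the way a flipbook would show it. The `patients` data frame and its values are invented for illustration, and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
patients <- tibble(
  id = 1:5,
  age = c(54, 67, 71, 49, 62),
  t_stage = c(1, 2, 1, 2, 1)
)

# Step 1: start with the data
patients

# Step 2: keep only patients aged 60 or older
patients %>%
  filter(age >= 60)

# Step 3: then keep only the columns of interest
patients %>%
  filter(age >= 60) %>%
  select(id, age)

# Step 4: then sort from oldest to youngest, and save the result
older_patients <- patients %>%
  filter(age >= 60) %>%
  select(id, age) %>%
  arrange(desc(age))

older_patients
```

Running each step on its own and checking the output before adding the next line is a reasonable habit to carry over from the flipbooks into your own coding.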
Learnr coding exercises are windows in this book in which you can write your own R code to solve a problem.
Each learnr exercise tests whether you have mastered important code concepts, and often new functions in R.
If needed, you can reset to a fresh code window with the Start Over button.
You can type lines of code into the window, then click on the Run Code button at the top right to run the code and get your results.
Your code may not produce the right result the first time, and you will have to interpret the error message to figure out how to fix it.
Rely on the text and your notes and the demonstrations to help you.
If you are stuck, you can click on the Hint button to see an example of correct code, and compare it to your own.
If you would like, you can even copy this code to the clipboard with the Copy button and paste it into the code window.
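As a small illustration of reading an error message, the sketch below makes a deliberate typo in a variable name and captures the resulting message. The variable names are hypothetical, and `tryCatch()` is used here only so that the message can be stored and inspected rather than stopping the script.

```r
# Hypothetical example: the data are stored in `ages`
ages <- c(54, 67, 71)

# Deliberate typo: `agess` does not exist, so R raises an error.
# tryCatch() captures the error and returns its message as text.
msg <- tryCatch(
  mean(agess),
  error = function(e) conditionMessage(e)
)

# The message names the object R could not find,
# which points directly at the misspelled word
msg

# With the spelling corrected, the code runs
mean(ages)
```

Most beginner errors are of this kind: the message usually names the function, object, or argument that R could not make sense of, which is the first place to look.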
Take a look at the example of a learnr exercise below.
There is a dataset (prostate) piped into a function (‘select’), with three blanks (3 arguments to this function).
Fill in the blanks with the 3 listed variables (age, fam_hx, and t_stage are the variable names).
Then run your code with the Run Code button to get a result.
Practice using the Start Over button, the Hint button (there may be more than one - usually the last one is the solution), and the Copy To Clipboard button, then paste the solution in if you are stuck.
Then try out Exercise 2 to select columns related to Gleason Score.
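For readers who want to try the same idea in their own R session, here is a minimal sketch of the first exercise. It assumes the dplyr package is installed, and it uses a small invented data frame, `prostate_toy`, as a stand-in for the prostate dataset. Only the three variable names from the exercise come from the text; the values and the extra `p_vol` column are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the prostate dataset (values invented)
prostate_toy <- tibble(
  age = c(72, 61, 65),
  fam_hx = c(0, 1, 0),
  t_stage = c(1, 2, 1),
  p_vol = c(54, 43.2, 102.7)
)

# Pipe the data into select() and name the three columns to keep
selected <- prostate_toy %>%
  select(age, fam_hx, t_stage)

selected
```

The result keeps all of the rows but only the named columns, in the order they were listed.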
When you get a table of data as a result from a code pipeline, it may have more columns (variables) than can be displayed easily.
When this is the case, there will be a black arrow pointing rightward at the top right of the table of results.
Click on this to scroll right and see more columns.
A table of data as a result from a code pipeline may also have more rows (observations) than can be displayed easily.
When this is the case, the table will be paginated, with 10 rows per page.
At the bottom right of the table, there will be a clickable listing of pages, along with buttons for moving to the previous or next page.
Click on these buttons (or the page number buttons) to see more pages of data to inspect and check your results.
An important note on writing your own code: you should always have an internet search window open when you are writing code. No one can remember every function, nor the correct arguments and syntax of each function. A critical skill in writing code is searching for how to do something correctly. This is not a sign of weakness. Professional programmers google “how do I do x?” many times each day. This is how programming is done. You will often search for things like “how do I do x in R?” or “how to x in tidyverse”. This is completely normal, and to be expected. You do not have time to memorize hundreds of functions, and you may have days or even weeks between coding sessions (because of your day job), making it hard to remember all the details from your last coding session. This is not a problem. There are lots of websites that can help you solve specific problems, as you will find in the How to Find Help chapter.