Chapter 50 Using the {REDCapTidieR} package to Extract Data from a REDCap database
This chapter is part of the Clinical Applications pathway.
Packages needed for this chapter include {REDCapTidieR} and {tidyverse}.
We are going to set up an automated website that will, every morning around 4 AM, pull data from a REDCap database and publish a summary of the data to a public website. This will allow us (and our collaborators at other sites) to track the progress of data collection in the study, and to share our data with the public, while keeping individual patient health information (PHI) secure. An example of what we are trying to produce can be found here: a public website, updated daily with the latest data from the REDCap database, that summarizes the data collected in an observational study at five sites.
50.1 Why Use REDCap?
REDCap (Research Electronic Data Capture) is a secure web application for building and managing online surveys and databases. It is widely used in academic and clinical research settings for data collection, especially when dealing with sensitive information such as patient health information (PHI). It is especially helpful for multicenter projects, in which you can share a common database and definitions. The {REDCapTidieR} package provides a convenient way to interact with REDCap projects from R, allowing users to extract, manipulate, and analyze data efficiently.
50.2 Preparation for Data Extraction
To prepare for data extraction, we need the following components:
A REDCap database: Usually for an observational study, longitudinal study, or a randomized controlled trial. The database should be set up with the necessary fields and data types to capture the required information.
A REDCap API Key: This is a jumble of letters and numbers that can be used as a passkey to access the REDCap database. It is usually generated by the REDCap administrator and should be kept secure. It should generally be stored on your secure computer or a firewalled server in the .Rprofile file.
A Firewalled Server: You will be extracting PHI from the REDCap database, so the data should go to a secured computer. This can be a desktop computer, but it is often useful to use a firewalled server, such as a Linux server, to store the data securely. The server should be configured to allow access only to authorized users, and should have appropriate security measures in place to protect the data. Having a server that is up and running 24/7 makes it easier to schedule data extraction tasks and run analyses without needing to keep your personal computer on all the time.
cron jobs: A short program, run on a schedule from a server, is known as a cron job. A set of cron jobs are stored in a crontab. A crontab is a file used by the cron utility in Unix and Linux systems to schedule tasks to run at specific times or intervals. It’s essentially a table of commands that are executed by the operating system on a regular schedule. The crontab file is typically located in the /etc/ directory, and can be edited using the crontab command. Cron jobs are often used to automate repetitive tasks, such as data extraction, data processing, and data analysis. They can be scheduled to run at specific times, such as daily, weekly, or monthly, or at specific intervals, such as every hour or every minute.
A publication site: If you want to publish public summaries (without individual PHI) of your data, you will need a publication site. This can be a website, a GitHub repository, or any other platform that allows you to share your data with the public. The publication site should be secure and should not contain any individual PHI. It is important to ensure that the data is anonymized and de-identified before publishing it to the public. It is common to use sites like RPubs, Quarto Pub, or GitHub Pages to publish public summaries of your data. These sites allow you to create static web pages that can be easily shared with others, and they provide a convenient way to share your summarized data and study progress with the public without exposing individual PHI.
Let’s walk through each step.
50.3 The REDCap Database
We are assuming that you have access to REDCap and have a REDCap administrator at your research site. The administrator can help you set up a new project, add fields to the project, and manage the data collected in the project. The administrator can also help you generate an API key for the project, which is needed to access the data from R. Most REDCap administrators are overburdened, so you will often have to seek out help to build your database and set up your project. It is important to have a clear understanding of the data you want to collect and how you want to organize it before you start building your database; this will help you avoid major changes later on, which can be time-consuming and frustrating. It is helpful to outline the data fields to be collected, and to organize a data dictionary that describes each field, including the field name, field type, and any validation rules that should be applied. This will help you ensure that the data is collected consistently and accurately, and will make it easier to analyze the data later on. You should organize your data fields into forms, or case report forms (CRFs), that are used to collect data from participants. Generally each visit or encounter will have its own CRF. You should generally structure your fields as numeric with validation rules (ranges), or as dropdown categorical values. You should minimize free text fields, as these are difficult to analyze and can lead to inconsistencies in the data. If you need to collect free text data, consider using a separate field for comments or notes, rather than including it in the main data fields. You should also try to minimize missing data, by avoiding sensitive fields and requiring completion of all data fields. You can find a REDCap best practices document here.
There are many helpful guides and videos on the internet to help you get started, and your REDCap administrator can help you with any questions you have about building your database.
50.4 The REDCap API Key
Once your database is built, you can request an API key - open your database and click on the button on the lower left to request an API key from your administrator. The API key is a long string of letters and numbers that is used to authenticate your access to the database. It is important to keep this key secure, as it allows you to access the data in the database. You should store the API key in a secure location, such as a password manager or a secure file on your computer. You can also store it in your .Rprofile file, which is a hidden file in your home directory that is used to store R configuration settings.
To open and edit the .Rprofile file, you need to have the {usethis} package installed. When this is available, you can use the following command in R: usethis::edit_r_profile(). This will open up your .Rprofile file in your default text editor, where you can add the following line:
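The line will look something like this — a sketch in which the key shown is a placeholder; substitute the actual API key issued by your administrator:

```r
# .Rprofile — set the API key as an environment variable at each R startup
Sys.setenv(redcap_safety_study_api_key = "A1B2C3D4E5F6...")  # placeholder key
```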
Note that you need to assign a name to your api_key, and it should be specific to the particular REDCap database that you will be using (each REDCap database gets its own unique API key, even if you are the same user for each one).
Then save and close your .Rprofile file. This will set the environment variable redcap_safety_study_api_key to the value of your API key each time you start (or restart) R. Take a moment to restart R now. After this restart, your R session will run the updated .Rprofile file, and the API key will be available as an environment variable, which can be accessed in your R code using the Sys.getenv() function.
You can then access the API key in your R code by using the following command:
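This is the same assignment used in the extraction script later in this chapter:

```r
# retrieve the API key from the environment variable set in .Rprofile
api_key <- Sys.getenv("redcap_safety_study_api_key")
```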
This will retrieve the value of the redcap_safety_study_api_key environment variable and store it in a local api_key variable. You can then use this variable to access the REDCap database using the {REDCapTidieR} package.
50.5 Using REDCapTidieR to Extract Data from REDCap
Now you can use REDCapTidieR to extract data from your REDCap database. The {REDCapTidieR} package provides a set of functions that allow you to interact with REDCap projects from R, including functions for extracting data, managing projects, and working with data dictionaries.
We are planning to do this on a firewalled Linux server that can store PHI from REDCap securely. Your local IT administrator will often be able to rent you a small server for a reasonable price per month. In our case, we set up a new server with a free instance of RStudio Server. A blog post (full of jargon) on how to set this up (which may be helpful to your local IT person) can be found here.
We will assume that you have already installed the {REDCapTidieR} package, and have loaded it into your R session with the library(REDCapTidieR)
command. If you have not installed the package yet, you can do so by running the following command:
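The standard installation from CRAN:

```r
# install {REDCapTidieR} from CRAN
install.packages("REDCapTidieR")
```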
You then need to set up a dedicated R project for this study on your R server with File/New Project. Within this new project, open a new R file named something like safety_study_redcap_extraction.R. This will be the file that contains the code to extract data from your REDCap database and save it to a set of data tables.
Start this file by loading libraries, using
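A typical opening block for this script (assuming you also use {here} for file paths, as the saveRDS() calls below do):

```r
# load packages for data wrangling, REDCap extraction, and file paths
library(tidyverse)
library(REDCapTidieR)
library(here)
```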
## here() starts at /Users/peterhiggins/Documents/RCode/rmrwr-book
Then you will assign the api_key and the url of your local REDCap production website to variables.
api_key <- Sys.getenv("redcap_safety_study_api_key")
redcap_url <- "https://redcap.yourinstitution.edu/api/"
Note that you should check with your local REDCap administrator to get the right url for this step.
Now you can use the read_redcap() function to read in data from your REDCap database.
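A minimal sketch — the supertibble name datfr is the one used in the combine_checkboxes() example later in this chapter:

```r
# read the full REDCap project into a supertibble (one row per form)
datfr <- read_redcap(redcap_url, api_key)
```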
If this is working, you should be able to run the code up to this point and get output that is a REDCapTidieR Supertibble. This supertibble contains all of the data from your REDCap database, with one row for each case report form (aka instrument, form).
You can then break up the Supertibble into individual tibbles, one for each form, using the code below:
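One approach, assuming the supertibble is named datfr as above:

```r
# bind each form's data tibble into the global environment,
# one named tibble per case report form
datfr |>
  bind_tibbles()
```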
Now you should see new dataframe objects popping up in your Environment tab, one for each form in the Supertibble. These will have the same names (labels) that you applied to the forms you created in REDCap, and will often have names like demographics, labs, medications, etc.
You can then save these data tables for future use & analysis, using code for each data table like the code block below.
saveRDS(demographics, here("data", "safety_study_demographics.Rd"))
saveRDS(labs, here("data", "safety_study_labs.Rd"))
saveRDS(medications, here("data", "safety_study_medications.Rd"))
Optionally, you may have data (often for race) from REDCap that uses checkboxes, which results in a separate variable for each race. These are less helpful for analysis, and for each checkbox variable, you may want to use the combine_checkboxes() function from {REDCapTidieR} to make these into a single variable. You can do this with code like the code block below (note that this version includes the make_labelled and bind_tibbles functions to separate the output into distinct tibbles):
combine_checkboxes(
supertbl = datfr,
tbl = "demographics",
cols = starts_with("race__"),
names_prefix = "",
names_sep = "_",
names_glue = NULL,
names_repair = "check_unique",
multi_value_label = "multiple",
values_fill = NA,
raw_or_label = "label",
keep = TRUE) |>
make_labelled() |>
bind_tibbles()
50.6 Analyzing and Summarizing your Data
At this point, you have extracted data from your REDCap database and saved it to a set of data tables. You can now analyze and summarize your data using the {tidyverse} package in an Rmarkdown or quarto file.
For this example, we will assume that you are using a quarto file. Start a new document with File/New File/Quarto Document, and save it as safety_study_redcap_analysis.qmd
. This will be the file that contains the code to analyze and summarize your data. This will be a mix of tables and plots to summarize the progress of your study. When you render the quarto document, this will produce an HTML file that can be viewed locally, emailed to collaborators (weekly or monthly updates), or regularly published to a website.
To start the quarto dashboard document, we will use a YAML header, with code that looks like this block:
title: "Study Dashboard"
author: " `r glue::glue('Your Name ', 'Updated at: ', as.character(Sys.time()))` " # this will stick together the author name and when the file was last updated
format:
dashboard:
logo: images/Signature-Vertical-Bug.png # uses a png file to put a logo on the page if desired
orientation: columns
Next we will add a setup chunk to load libraries and set a theme for ggplots.
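A setup chunk might look like this (the specific theme choice is just one option):

```r
# load packages for wrangling, summary tables, and file paths
library(tidyverse)
library(gtsummary)
library(here)

# set a default theme for all ggplots in the document
theme_set(theme_linedraw(base_size = 14))
```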
Next we will read in the data tables that we saved from the previous step. This will allow us to use the data in our analysis and summaries.
demographics <- readRDS(here("data", "safety_study_demographics.Rd"))
labs <- readRDS(here("data", "safety_study_labs.Rd"))
medications <- readRDS(here("data", "safety_study_medications.Rd"))
Next we will count the number of participants enrolled at each site, to provide data for a valuebox page of the dashboard. This will give us a quick overview of the number of participants enrolled at each site. Note that you may want to peek at the site field in the demographics table to get the exact names of the sites, as these may vary depending on how your REDCap database is set up. Sometimes REDCap will cut off the names after a certain number of characters, so you may need to adjust the site names in the code below to match the names in your demographics table.
n_enrolled_umich <- demographics |>
filter(site == "University of Michigan") |>
nrow()
n_enrolled_oregon <- demographics |>
filter(site == "Oregon Health & Science University") |>
nrow()
n_enrolled_ucsf <- demographics |>
filter(site == "University of Cali-SF") |>
nrow()
n_enrolled_uw <- demographics |>
filter(site == "University of Washington") |>
nrow()
Next we will make the first page for the website: a set of value boxes on a single tab, using the counts calculated above. We will build 4 valueboxes, one for each study site, in an appropriate color for each site, and set each one to 25% width. We will also start with a header label (# Counts), which will appear before the valueboxes on the web page. Note that I added backticks around # Counts below to make sure it would not start a new chapter in this e-book. For your quarto webpage, you want to start a new tab, so you would not use the backticks for this level 1 header, # Counts.
# Counts
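Each value box gets its own chunk, placed under a `## Column {width=25%}` header to set the 25% width. A sketch of the first one — the icon and color choices here are illustrative, not prescribed:

```r
#| content: valuebox
#| title: "University of Michigan"
# a Quarto dashboard value box: returns a list of display options
list(
  icon = "people-fill",
  color = "primary",
  value = n_enrolled_umich
)
```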
The code in these 4 chunks creates a single webpage with 4 columns. If you want to add more information, you might want to add extra tabs for additional pages. Each time you add a level 1 header (with a single hashtag, like # Counts
above), this will create a new tab in the dashboard. You can also add additional content to each tab by adding more code chunks with the appropriate content. Let’s build a new tab for an # Enrollment Plot
that will show the number of participants enrolled over time by study site. This will be a line plot that shows the number of participants enrolled over time, and will help us track the progress of the study.
# Enrollment Plot
screening |>
mutate(date = as.Date(data_entry_dt)) |>
filter(!is.na(date)) |> # keep rows with non-missing dates
arrange(date) |> # sort by non-missing dates
group_by(redcap_data_access_group) |> # the study sites
mutate(c=1) |> # create a counter variable
mutate(enrolled = cumsum(c)) |> # add 1 with each enrollment
ggplot(aes(x = date, y = enrolled, group = redcap_data_access_group)) +
geom_line() +
geom_point(aes(fill = redcap_data_access_group),
pch = 21, size = 2.5) + # select shape and size of points
scale_x_date(limits = as.Date(c("2024-06-01", "2025-12-31")),
date_breaks = "2 months",
date_labels = "%b") + # set x-axis limits and labels
labs(title = "Enrollment By Month", subtitle = "Started June 2024",
x = "", fill = "Center",
y = "Enrolled") +
theme_linedraw(base_size = 18) + # use a linedraw theme with enlarged font
scale_fill_manual(values = c("blue", "green", "red", "purple"),
labels = c("University of Michigan" ,
"Oregon Health & Science University",
"University of California - SF",
"University of Washington" )) + # set colors and labels for each site
theme(legend.position = "top") + # set legend position
annotate("text", x = as.Date("2025-11-01"), y = 115,
label = glue::glue("Total\nEnrolled: ", nrow(screening)))
This code chunk will produce a plot like the one in the website here. You can add additional plots and tables to the dashboard by adding additional code chunks with the appropriate content.
Let’s add a summary table, using the {gtsummary} package, to summarize the demographics of the participants enrolled in the study. This will be a table that shows the number of participants enrolled by race, gender, ethnicity, and IBD type. We will start with a first level header to create a new tab. For the purposes of this e-book, I will put backticks around it to avoid starting a new chapter.
# Enrollment Table
# first merge two tables
demo_ibd <- left_join(demographics, ibd_characteristics)
demo_ibd |>
select(gender, ethnicity, race, ibd_type) |>
janitor::remove_empty(which = "rows") |>
tbl_summary(
label = list(gender ~ "Gender", # nicer labels
ethnicity ~ "Ethnicity",
ibd_type ~ "IBD Type"),
include = c(gender, ethnicity, ibd_type), # rows by group
by = race) |> # columns by race
bold_labels()
This code chunk will produce a table like the one in the website here. You can add additional tables and plots to the dashboard by adding additional tabs and code chunks with the appropriate content.
50.7 Rendering the Quarto Document
You can save your Quarto Document, or add additional tabs and plots. There are many options for quarto dashboards, which can be found here. You can add additional tabs, plots, and tables to the dashboard by adding additional code chunks with the appropriate content. Look at the gallery for some good examples. Other good examples for inspiration can be found in this blog post on monitoring clinical trials. Quarto dashboards have more options than some of the examples shown, but the MOSAIC study and the “later study” (Insight-ICU) provide some nice ideas of what you can track. It is especially nice that the source code is available for both of these dashboards, so you can see how they were built and adapt them for your own use.
When you are ready to render the document, you can click on the Render button at the top center of the Source pane in the RStudio IDE, or use the command quarto::quarto_render("safety_study_redcap_analysis.qmd")
in the console. This will produce an HTML file that can be viewed locally, emailed to collaborators, or published to a website.
50.8 Checking for PHI
Now that you have the rendered HTML, inspect it carefully to make sure that there is no individual-level data or PHI shared in the document. If you find any PHI in the document, you should remove it before publishing the document on a public website.
50.9 Publishing the Dashboard
You can stop here, and just view the HTML yourself, or email the generated HTML file to your collaborators. However, if you want to publish the dashboard to a website, you can use the quarto publish <filename.qmd>
command to publish the dashboard to a website. This will allow you to share the dashboard with the public and your collaborators, while keeping individual patient health information (PHI) secure. Quarto Pub is probably the easiest approach for quarto dashboards, but there are a lot of publishing options available here. Note that you need to create a free quarto pub account at Quarto Pub before you can publish your dashboard. Once this is created, website creation can be done manually by running quarto publish filename.qmd
each time you want to update the dashboard.
50.10 Emailing Update Notifications to Collaborators
When you update your dashboard, you probably want to notify your collaborators that you have made an update on the study progress, so that they will look at it and be inspired to enroll more participants. While you may be updating your dashboard daily or weekly, the emails with the link will probably be less frequent (weekly or monthly). You can do this manually of course, but you can also make this a programmatically generated email with the {blastula} package that is sent to a list of your collaborators.
This can be done with a R file. Start a new R file with File/New File/R Script
, and save it as safety_study_redcap_email.R
. This will be the file that contains the code to send an email to your collaborators with the link to the published dashboard. See the example below for how to set up the email. You will need to set up your email credentials in the {blastula} package, which can be done with the create_smtp_creds_key()
function. This will allow you to send emails from R using your server account. You can find more information on how to set this up here.
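A one-time credential setup might look like this — the id, host, and port shown are placeholders that your local IT expert will supply:

```r
library(blastula)

# store SMTP credentials securely in the system key-value store (one-time step)
create_smtp_creds_key(
  id = "bobs_server",              # a name you choose for this credential set
  user = "bobs_server@umich.edu",  # the sending account (placeholder)
  host = "localhost",              # ask your IT expert for the right host
  port = 25,
  use_ssl = FALSE
)
```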
Note that you can use \
on an isolated line or <br>
for line breaks in the email object.
library(blastula)
# generate the email object
my_email_object <- compose_email(
body = md("Hello Collaborators,
The Safety Study dashboard has been updated with the latest data from the REDCap database. You can view the dashboard at the following link: [Safety Study Dashboard](https://phiggins.quarto.pub/safety-study-dashboard/).
\
In today's news, we have now enrolled over 500 participants across the four sites.
\
thanks, <br>
Study PI Bob"))
# generate the list of recipients
email_list <- c("abel@oregon.edu", "cal@ucsf.edu", "wally@uw.edu")
# execute the send
blastula::smtp_send(my_email_object,
from = "bobs_server@umich.edu",
to = email_list,
subject = "Safety Study Dashboard Update",
credentials = creds_app(host = "localhost",
port=25, use_ssl = FALSE))
# you will probably need to ask your local IT expert for the right email credentials to send email from your server.
print(paste("The email script ran at ", Sys.time()))
# this last line confirms that the email script ran and will print the time that it ran.
50.11 Automating and Scheduling Your Dashboard Updates
If you are busy, or just lazy, you may want to go to the trouble of automating the steps of
- extracting the data from redcap daily
- analyzing the study progress in a quarto dashboard
- rendering the dashboard and publishing to quarto pub
- sending a weekly email to your collaborators with the link to the published dashboard.
This requires configuring several cron jobs on your server to run sequentially. You will do this with a local text editor on your server, like vim or nano. This will probably require some help and troubleshooting from your local IT expert who helped to set up the RStudio Server.
Start with the following steps:
log in to your R Studio Server
Open the Terminal tab in R Studio
At the cursor, type
crontab -e
to edit your crontab file. Each crontab line will be a separate cron job.
Each cron job starts with the time and date that the job should run, followed by the command to run.
The time and date are in a specific format, which you can learn about here. The format is:
minute hour day month weekday command
. You will list five time fields with one space between each, followed by the command to run. Asterisks can be used as wildcards, to run every month or every weekday. Also note that Sunday = 0. For example, to run a command every day at 4:00 AM, you would use 0 4 * * * command. For the commands, you want one for each program or script that you will run.
You will want to schedule them 1-3 minutes apart, depending on how long each will take to run.
The first file that you will want to run is the
safety_study_redcap_extraction.R
file, which will extract the data from REDCap and save it to the data tables. However, the file name alone is not a command. You first have to track down the path to your R version (the Rscript executable) on your server. You can do this by running the command
which Rscript
in the terminal. This will give you the path to the Rscript executable, which will look something like "/app/apps/rhel8/R/4.3.2/bin/Rscript". You also need the path to your R file on the server, which will look something like "/nfs/corexfs/bob_pi/safety_study_redcap_extraction.R". You can find this by navigating to the file's directory in the terminal and running the command
pwd
to get the path to the current directory. Now you can put this all together in your crontab file. The first line will look something like this, running at 4:00 AM:
0 4 * * * /app/apps/rhel8/R/4.3.2/bin/Rscript "/nfs/corexfs/bob_pi/safety_study_redcap_extraction.R"
The next line will be for the
safety_study_redcap_analysis.qmd
file, which will analyze the data and render the dashboard. Note that this will have a different path to reach quarto, and will use the quarto publish command, followed by the qmd file. This will look something like this, running at 4:03 AM:
3 4 * * * /app/apps/rhel8/quarto/1.5.54/bin/quarto publish quarto-pub "/nfs/corexfs/bob_pi/safety_study_redcap_analysis.qmd"
Note that sometimes the quarto-pub website will want you to manually authorize your access to the HTML webpage, which defeats the purpose of automating the process. If this happens, you will need to log in to the quarto-pub website and authorize your access the first time. You can also set up a personal access token (PAT) to avoid this issue, which can be done by following the instructions here. Once you have a token for quarto-pub, you can set it as an environment variable before the publish command in its own cron job, which will look something like this, running at 4:05 AM:
5 4 * * * export QUARTO_PUB_AUTH_TOKEN="your_personal_access_token"; /app/apps/rhel8/quarto/1.5.54/bin/quarto publish quarto-pub "/nfs/corexfs/bob_pi/safety_study_redcap_analysis.qmd"
The last line will be for the
safety_study_redcap_email.R
file, which will send the email to your collaborators. This will look something like this, running at 4:08 AM:
8 4 * * * /app/apps/rhel8/R/4.3.2/bin/Rscript "/nfs/corexfs/bob_pi/safety_study_redcap_email.R"
50.11.1 Explore REDCapTidieR and Quarto Dashboards
You can find out a lot more about REDCapTidieR here.
There is great documentation on Quarto Dashboards here.
You can also find a lot of great examples of Quarto dashboards in the gallery.
Some good examples of clinical study monitoring with dashboards can be found here and here. These are both examples of clinical trials that are being monitored with dashboards, and they provide a good overview of the study progress and data collection. The links to the source code for these dashboards are available here.
Details on publishing webpages with Quarto Pub (and how to get authorization tokens) can be found here.
You can find out more about crontab here, and more generally about scheduling cron jobs here.