Chapter 39 Protecting PHI (Protected Health Information)
If you are working with medical data, there is a good chance that you are frequently working with Protected Health Information (PHI). PHI is any information in a medical record that can be used to identify an individual, and that was created, used, or disclosed in the course of providing a healthcare service, such as a diagnosis or treatment. PHI includes many common identifiers, such as names, addresses, dates of birth, and Social Security numbers. PHI also includes any other information that could be used to identify a patient, such as medical record numbers, photographs, and biometric data.
It is CRITICAL to maintaining your credibility as a data analyst, as well as your access to medical data, that you take the necessary steps to protect PHI. This includes not sharing PHI, encrypting data, and using synthetic data when possible.
39.1 Protecting (Not Inadvertently Sharing) PHI
The first important step in protecting PHI is to not share it. This means not sharing data that contains PHI, and not sharing code that contains PHI. This is especially important when backing up data to the cloud, or sharing data with collaborators. A very common way to back up data and collaborate on a shared project is to establish a repository (repo) on GitHub.
It is important that you know how to use the .gitignore file to prevent PHI from being shared on GitHub. The .gitignore file is a text file that tells Git which files or folders to ignore in a project. The .gitignore file should be placed in the root directory of your project. You can include an R-specific .gitignore file when you create the repository on GitHub (this is a dropdown on the GitHub page when creating a new repo), or you can add it later.
You can create a .gitignore file by opening a text editor and saving the file as .gitignore. For this example, you can then add the following lines to the file:
*.csv
*.Rd
*.RData
data/
output/
These lines tell Git to ignore the following kinds of files:
- anything that ends in .csv
- anything that ends in .Rd
- anything that ends in .RData
- any files in the data directory
- any files in the output directory
This way, when you share your code on GitHub, you are not sharing any PHI.
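If you prefer to manage these entries from R, the following is a minimal sketch using only base R. The helper name add_to_gitignore() is hypothetical (it is not part of Git or of any package), and the demonstration writes to a temporary file rather than to a real project.

```r
# Append patterns to a .gitignore file, skipping any that are already listed.
# `add_to_gitignore` is a hypothetical helper written for this sketch.
add_to_gitignore <- function(patterns, path = ".gitignore") {
  existing <- if (file.exists(path)) readLines(path, warn = FALSE) else character(0)
  new_patterns <- setdiff(patterns, existing)
  if (length(new_patterns) > 0) {
    writeLines(c(existing, new_patterns), path)
  }
  invisible(new_patterns)
}

# Demonstrate on a temporary file so the sketch does not touch a real project
demo_path <- file.path(tempdir(), "demo_gitignore")
unlink(demo_path)
add_to_gitignore(c("*.csv", "*.Rd", "*.RData", "data/", "output/"), demo_path)
add_to_gitignore(c("data/", "data_with_phi/"), demo_path)  # only the new pattern is added
readLines(demo_path)
```

Keep in mind that .gitignore only affects files that Git is not already tracking. A file that was committed before its pattern was added stays in the repository history, so set up .gitignore before the first commit.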
39.2 Identifying PHI
Protected health information (PHI) is any information in the medical record or designated record set that can be used to identify an individual and that was created, used, or disclosed in the course of providing a health care service such as diagnosis or treatment. HIPAA regulations allow researchers to access and use PHI when necessary to conduct research. However, HIPAA applies only to research that uses, creates, or discloses PHI that enters the medical record or is used for healthcare services, such as treatment, payment, or operations.
Identifying PHI in your data files is the next step in protecting PHI. Under the HIPAA Privacy Rule in the US, there are 18 types of identifiers that are considered PHI:
- Names;
- All geographical subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code, if according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.
- All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
- Phone numbers;
- Fax numbers;
- Electronic mail addresses;
- Social Security numbers;
- Medical record numbers;
- Health plan beneficiary numbers;
- Account numbers;
- Certificate/license numbers;
- Vehicle identifiers and serial numbers, including license plate numbers;
- Device identifiers and serial numbers;
- Web Universal Resource Locators (URLs);
- Internet Protocol (IP) address numbers;
- Biometric identifiers, including finger and voice prints;
- Full face photographic images and any comparable images; and
- Any other unique identifying number, characteristic, or code (note this does not mean the unique code assigned by the investigator to code the data [aka study_id])
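A first pass at finding these identifiers can be automated by screening column names. The sketch below is an illustration only: the function flag_phi_columns(), the name patterns, and the toy data frame are all hypothetical, and a name-based screen will miss PHI hidden in free-text fields or in oddly named columns, so it does not replace a manual review.

```r
# Hypothetical name patterns; extend these to match your own data dictionary.
# The last pattern matches "ip" only as a whole word part (e.g. ip, ip_address).
phi_name_patterns <- c("name", "address", "zip", "dob", "birth", "phone", "fax",
                       "email", "ssn", "mrn", "account", "license", "url",
                       "(^|_)ip(_|$)")

# Return the names of columns whose names match any PHI pattern
flag_phi_columns <- function(df, patterns = phi_name_patterns) {
  regex <- paste(patterns, collapse = "|")
  names(df)[grepl(regex, names(df), ignore.case = TRUE)]
}

# Toy data, invented for this example
toy <- data.frame(
  study_id = 1:2,
  patient_name = c("A", "B"),
  DOB = as.Date(c("1950-01-01", "1960-02-02")),
  zip_code = c("48109", "48104"),
  hemoglobin = c(13.1, 11.8),
  prescription = c("x", "y")
)

flagged <- flag_phi_columns(toy)
flagged
```

Here the screen flags patient_name, DOB, and zip_code, and leaves study_id, hemoglobin, and prescription alone.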
39.3 Selectively Deleting PHI
If this will not affect your data wrangling, you can use the select() function in {dplyr} to remove PHI from your data files. For example, if you have a data file called data, you can use the following code to remove PHI:
library(dplyr)
# Drop every column that holds one of the PHI identifiers
data_no_phi <- data |>
  select(-name, -address, -dob, -phone, -email, -ssn,
         -mrn, -hpb, -account, -cert, -vehicle, -device,
         -url, -ip, -biometric, -image, -date_admitted)
Then make sure that the data file is listed in your .gitignore file. After that, you can share the data_no_phi file with collaborators.
You might even want to create a folder called data_with_phi and a folder called data_no_phi to keep track of your data files separately by PHI status. The data_with_phi folder can show up in your .gitignore file as data_with_phi/, so that you are not sharing PHI.
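One practical wrinkle: select() with bare negated names stops with an error if any listed column is missing from the data. A more forgiving variant, sketched below, keeps the PHI column names in a character vector and drops them with any_of(), which ignores names that are not present. The vector phi_columns and the toy tibble are made up for illustration.

```r
library(dplyr)

# Hypothetical list of PHI column names used across a project
phi_columns <- c("name", "address", "dob", "phone", "email", "ssn", "mrn")

# Toy data, invented for this example; it holds only some of the PHI columns
toy_data <- tibble(
  study_id = c(1, 2, 3),
  name = c("Ann", "Bob", "Cy"),
  mrn = c("0001", "0002", "0003"),
  hgb = c(13.1, 11.8, 14.2)
)

# any_of() drops the listed columns that exist and ignores the rest
toy_no_phi <- toy_data |>
  select(-any_of(phi_columns))

names(toy_no_phi)

# Guard against sharing a file that still holds a listed PHI column
stopifnot(!any(names(toy_no_phi) %in% phi_columns))
```

Keeping the list in one place means the same vector can be reused for every file in the project, and the final check fails loudly if a listed column slips through.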
39.4 Problems with PHI-free data
PHI-free data are great for sharing and collaborating, but may be problematic for data wrangling.
You may find that you need some PHI fields in your early steps of data wrangling to match and join data files from different sources.
It is not uncommon to need some PHI to be the unique IDs to conduct data wrangling and analyses. You may need a medical record number to join data from different data sources. You may need dates to determine the interval between events, like the last screening colonoscopy and a colon cancer diagnosis. You may need a patient’s age to determine if they are eligible for a study.
There are some workarounds. You can:
- Keep data with PHI in a separate folder (a data_with_phi folder) which gets added to your .gitignore file, then after the joins are done and PHI-free data are created, save the PHI-free data to a data_no_phi folder.
- Use phony dates shifted by a random number of days so that the dates are not real, but you can still calculate intervals (note that this is an export option in REDCap).
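The date-shifting idea can be sketched in base R as follows. This is an illustration under stated assumptions, not a reproduction of how REDCap does it: the toy visits table, the one-year shift window, and the choice of one offset per patient are all made up for the example.

```r
set.seed(2024)  # only to make the sketch reproducible

# Toy data, invented for this example
visits <- data.frame(
  mrn = c("A1", "A1", "B2", "B2"),
  event = c("screening", "diagnosis", "screening", "diagnosis"),
  event_date = as.Date(c("2019-03-10", "2021-06-01", "2018-11-20", "2020-01-15"))
)

# One random, non-zero offset (in days) per patient, so that all of a
# patient's dates move together and intervals within a patient are preserved
patients <- unique(visits$mrn)
offsets <- setNames(
  sample(c(-365:-1, 1:365), length(patients), replace = TRUE),
  patients
)

visits$shifted_date <- visits$event_date + unname(offsets[visits$mrn])

# Days between first and last event per patient, before and after shifting
real_gap <- tapply(as.numeric(visits$event_date), visits$mrn,
                   function(d) diff(range(d)))
shifted_gap <- tapply(as.numeric(visits$shifted_date), visits$mrn,
                      function(d) diff(range(d)))

all.equal(real_gap, shifted_gap)
```

The offsets themselves let anyone recover the real dates, so they belong with the PHI (for example in the data_with_phi folder) and not in the shared data.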
39.5 Encrypting PHI
You can also keep your data fields but encrypt them, using strong RSA (2048 bit) encryption, with the {encryptr} package. This package can be installed from CRAN with install.packages("encryptr"). You can find the full documentation at https://encrypt-r.org/.
RSA encryption is based on a public/private key pair, and it is the method used by many modern encryption applications. The public key can be shared with collaborators and is used to encrypt the information.
The private key is sensitive and should not be shared. The private key requires a password to be set. This password should follow modern rules on password complexity. You know what you should do. If lost, it cannot be recovered.
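To make the roles of the two keys concrete, here is a minimal round trip using the {openssl} package directly. Using {openssl} is an assumption of this sketch, not a description of how {encryptr} works internally, and the key here lives only in memory with no password, which is not how you would handle a real private key.

```r
library(openssl)

# Generate a 2048-bit RSA key pair in memory (no password in this sketch)
private_key <- rsa_keygen(bits = 2048)
public_key <- private_key$pubkey

# Anyone with the public key can encrypt a value...
cipher <- rsa_encrypt(charToRaw("DD2 5NH"), public_key)

# ...but only the holder of the private key can decrypt it
plain <- rawToChar(rsa_decrypt(cipher, private_key))
plain
```

The encrypted value is a raw vector that cannot be read without the private key, which is why the public key is safe to share and the private key is not.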
39.5.1 Generating Public and Private keys
The genkeys() function generates a public and private key pair. The public key id_rsa.pub can then be shared with collaborators. The private key id_rsa should be kept secure and not shared. It should be listed in your .gitignore file. Set up a new project repository on your GitHub site, and copy the SSH link to start a new project in RStudio using version control. Then run the code below (you will be asked to set a new password), open the .gitignore file in the project, and add id_rsa to the file.
library(encryptr)
genkeys()
# > Private key written with name 'id_rsa'
# > Public key written with name 'id_rsa.pub'
You can open a new project and test this with the gp dataset provided with the {encryptr} package. Try it out! The first rows of gp look like this:
## # A tibble: 6 × 12
## organisation_code name address1 address2 address3 city
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 S10002 MUIRHE… LIFF RO… MUIRHEAD <NA> DUND…
## 2 S10017 THE BL… CRIEFF … KING ST… <NA> CRIE…
## 3 S10036 ABERFE… TAYBRID… <NA> <NA> ABER…
## 4 S10060 ABERFE… TAYBRID… <NA> <NA> ABER…
## 5 S10106 GROVE … 129 DUN… BROUGHT… <NA> DUND…
## 6 S10125 ALYTH … NEW ALY… ALYTH <NA> BLAI…
## # ℹ 6 more variables: county <chr>, postcode <chr>,
## # opendate <date>, closedate <date>, telephone <chr>,
## # practice_type <dbl>
This is a listing of the 1,212 NHS general practices in Scotland, and you can imagine that you might want to encrypt some fields (telephone, etc.) and delete some unneeded ones in this data before sharing it publicly on GitHub.
library(dplyr)
gp_encrypt <- gp %>%
  select(-c(name, address1, address2, address3)) %>%
  encrypt(postcode, telephone)
gp_encrypt
## # A tibble: 1,212 × 8
## organisation_code city county postcode opendate
## <chr> <chr> <chr> <chr> <date>
## 1 S10002 DUNDEE ANGUS 33a88bb… 1995-05-01
## 2 S10017 CRIEFF PERTHS… 52ae1be… 1996-04-06
## 3 S10036 ABERFELDY PERTHS… 60137e6… 2008-04-01
## 4 S10060 ABERFELDY PERTHS… 595ce94… 1975-04-01
## 5 S10106 DUNDEE ANGUS 23f4ce3… 1996-07-08
## 6 S10125 BLAIRGOWRIE PERTHS… 3b5b8ad… 1979-10-01
## 7 S10182 ARBROATH ANGUS 11ef4c8… 1977-10-01
## 8 S10233 ARBROATH ANGUS 908fd37… 1986-08-01
## 9 S10286 ARBROATH ANGUS 3eaed1f… 1975-08-01
## 10 S10322 ARBROATH ANGUS 1801648… 1971-10-01
## # ℹ 1,202 more rows
## # ℹ 3 more variables: closedate <date>, telephone <chr>,
## # practice_type <dbl>
You can see that postcode and telephone are now encrypted. You can share this data on GitHub, and anyone who holds the private key and its password can decrypt the data when necessary.
Decryption requires the private key generated using genkeys() and the password set at the time. Neither the password nor the key file can be replaced, so both need to be kept safe and secure. Passing gp_encrypt to decrypt(postcode, telephone) will ask you for the password you set when you generated the keys before it returns the decrypted data:
## # A tibble: 1,212 × 8
## organisation_code city county postcode opendate
## <chr> <chr> <chr> <chr> <date>
## 1 S10002 DUNDEE ANGUS DD2 5NH 1995-05-01
## 2 S10017 CRIEFF PERTHS… PH7 3SA 1996-04-06
## 3 S10036 ABERFELDY PERTHS… PH15 2BL 2008-04-01
## 4 S10060 ABERFELDY PERTHS… PH15 2BH 1975-04-01
## 5 S10106 DUNDEE ANGUS DD5 1DU 1996-07-08
## 6 S10125 BLAIRGOWRIE PERTHS… PH11 8EQ 1979-10-01
## 7 S10182 ARBROATH ANGUS DD11 1AD 1977-10-01
## 8 S10233 ARBROATH ANGUS DD11 1EN 1986-08-01
## 9 S10286 ARBROATH ANGUS DD11 1ES 1975-08-01
## 10 S10322 ARBROATH ANGUS DD11 1ES 1971-10-01
## # ℹ 1,202 more rows
## # ℹ 3 more variables: closedate <date>, telephone <chr>,
## # practice_type <dbl>
As an alternative that increases data security, you can store the encrypted PHI in a separate ‘lookup table’ that is not shared on GitHub. This lookup table can be used to decrypt the data when necessary.
This can be accomplished by adding a lookup argument to the encrypt() function. The lookup argument creates a data frame that contains the PHI data that was encrypted.
gp_encrypt <- gp %>%
  select(-c(name, address1, address2, address3)) %>%
  encrypt(postcode, telephone, lookup = TRUE)
# Lookup table object created with name 'lookup'
# Lookup table written to file with name 'lookup.csv'
## # A tibble: 1,212 × 8
## organisation_code city county postcode opendate
## <chr> <chr> <chr> <chr> <date>
## 1 S10002 DUNDEE ANGUS 33a88bb… 1995-05-01
## 2 S10017 CRIEFF PERTHS… 52ae1be… 1996-04-06
## 3 S10036 ABERFELDY PERTHS… 60137e6… 2008-04-01
## 4 S10060 ABERFELDY PERTHS… 595ce94… 1975-04-01
## 5 S10106 DUNDEE ANGUS 23f4ce3… 1996-07-08
## 6 S10125 BLAIRGOWRIE PERTHS… 3b5b8ad… 1979-10-01
## 7 S10182 ARBROATH ANGUS 11ef4c8… 1977-10-01
## 8 S10233 ARBROATH ANGUS 908fd37… 1986-08-01
## 9 S10286 ARBROATH ANGUS 3eaed1f… 1975-08-01
## 10 S10322 ARBROATH ANGUS 1801648… 1971-10-01
## # ℹ 1,202 more rows
## # ℹ 3 more variables: closedate <date>, telephone <chr>,
## # practice_type <dbl>
You can then add lookup.csv to your .gitignore file, and share the main file with collaborators. Decryption is performed by passing the lookup object or file to the decrypt() function.
Learn more about how to encrypt PHI-containing fields from the documentation of {encryptr} at https://encrypt-r.org/.