8 Unique ID variable(s): checking ID variable(s) uniquely identify each row and tidying up any accidental repeats

8.1 Find ID variable(s) that uniquely identifies each row

This is important for understanding the structure of data, and making sure the same person isn’t inappropriately included twice. It is crucial prior to merging datasets.

It is important to know which ID variable, or combination of ID variables, uniquely identifies each row. Always keep ID variables in your dataset. Check for inappropriately duplicated ID values and tidy up as necessary. When an ID variable is missing, consider whether it is appropriate to delete those observations.

Perhaps we have (or should have) one line of data per person. Or perhaps we have repeated measures data, with several lines per person, where personid and date (or personid and visitno) together uniquely identify each row. Perhaps each line represents a family, a region or a hospital, with data on this larger unit. We may have rows for each person, with people clustered within region. Occasionally patientid may be unique within region, but not unique across different regions. Then we need the combination of region ID and patientid to uniquely identify patients.
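For instance, a quick check of this (a minimal sketch, assuming a dataframe named df with variables region and patientid) is to compare the number of rows with the number of distinct IDs or ID combinations:

# total number of rows
nrow(df)
# number of distinct patientid values – if smaller than nrow(df), patientid alone is not unique
length(unique(df$patientid))
# number of distinct region/patientid combinations
nrow(unique(df[, c("region", "patientid")]))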

Thus, understanding what unique IDs we have is a crucial part of understanding our data.

Making sure we do not inadvertently include some people's data twice is really important, so checking for this is an important aspect of data cleaning. If we then merge datasets within R, there is a risk that any accidental duplicates get multiplied up, leading to an even worse situation.
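As an illustration (a toy sketch with made-up dataframes, not from the anaemia data), an id duplicated in both datasets being merged gets multiplied up by merge():

# id 2 accidentally appears twice in each dataset
left  <- data.frame(id = c(1, 2, 2), x = c(10, 20, 21))
right <- data.frame(id = c(1, 2, 2), y = c("a", "b", "c"))
# id 2 now appears 2 x 2 = 4 times in the merged data
merge(left, right, by = "id")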

# vector of TRUEs/FALSEs indicating whether each observation is missing (TRUE)
# or not (FALSE) – to demonstrate the command
is.na(df$patientid) 
# counts number of missing (TRUE) and non-missing (FALSE) observations
table(is.na(df$patientid)) 

Show those with missing ID variables. The following four code lines all use the notation dataframe_name[row, column] – the condition given in the row position determines which rows to keep, and the blank after the comma in the column position indicates that all columns, i.e. all variables, are kept.

# show all variables on anyone with missing id2 variable
anaemia[is.na(anaemia$id2) == TRUE, ]
# keep only those without missing id2 in anaemia and save to anaemia2
anaemia2 <- anaemia[is.na(anaemia$id2) == FALSE, ]

An alternative to the two lines above, when two variables together form the “unique ID” (here patientid & date):

# lists those with missing patientid or missing date or both
df[is.na(df$patientid) == TRUE | is.na(df$date) == TRUE, ]
# keep only those without missing patientid or date -> save to df3
df3 <- df[complete.cases(df$patientid, df$date), ] 

Check whether the ID variable uniquely identifies each observation, i.e. is unique on each row.

# lists unique values of id2; head shows the first six entries only!
head(unique(anaemia$id2))
## [1] "AB392" "AB840" "AB429" "AB877" "AB794" "AB346"
# counts number of unique observations for this variable
length(unique(anaemia$id2)) 
## [1] 1025
# if the number of rows here agrees with the number of unique values above,
# then id2 is unique
dim(anaemia)
# counts number of unique combinations of these 2 variables
nrow(unique(df[, c("patientid", "date")]))
# if the number of rows agrees with the number of unique combinations above,
# then patientid/date combinations are unique
dim(df)
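A quicker yes/no check uses base R's anyDuplicated(), which returns 0 when there are no duplicated values (a small sketch, with df again standing for a dataframe containing patientid and date):

# TRUE if id2 uniquely identifies each row of anaemia
anyDuplicated(anaemia$id2) == 0
# the same idea for a two-variable ID
anyDuplicated(df[, c("patientid", "date")]) == 0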

Assessing & removing rows that are exact duplicates of another row, in case this is required.

# reports number of exact duplicated rows in the anaemia dataframe
sum(duplicated(anaemia)) 
## [1] 9
# creates variable named “dup”, which is a marker for duplicated rows
anaemia$dup <- duplicated(anaemia)
# view duplicates (the first occurrence of each is NOT shown, only the
# subsequent duplicated rows)
print(anaemia[anaemia$dup == TRUE, ])
# remove the dup marker again, so duplicated rows are once more identical
anaemia$dup <- NULL
# keep only unique rows – save revised dataframe to anaemia3
anaemia3 <- unique(anaemia)
# reports number of exact duplicated rows in anaemia3 – now zero
sum(duplicated(anaemia3))

What if observations are not exact duplicates? Perhaps we have two lines for a few patientids, and we need to take some information from each row to create one “correct” row per patientid, with as little missing data as possible. See Section 12.1 on creating a summary dataset.
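As a minimal sketch of that idea (assuming dplyr is loaded; hb_pre and hb_post are the haemoglobin variables used in the next section), one row per id2 could be built by taking the first non-missing value of each variable across its duplicated rows:

library(dplyr)
anaemia %>%
    group_by(id2) %>%
    summarise(hb_pre = first(na.omit(hb_pre)),    # first non-missing hb_pre for each id2
              hb_post = first(na.omit(hb_post)))  # first non-missing hb_post for each id2

Note that this keeps only id2 and the summarised variables; Section 12.1 covers this more fully.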

8.2 Exploring partial duplicates and finding summary statistics across such duplicates

Section 12.1 gives details of a way of collapsing two or more lines of data into one, which may be useful for tidying up duplicated ids into a single line. This enables data to be extracted from both lines into one line containing appropriate summary information.

You need some fundamentals of tidyverse dplyr to understand this section (not taught here); this is merely a hint/pointer that may help if you need something more advanced. You need library(tidyverse) or library(dplyr) loaded for these commands to work.

Finding partial duplicates:

library(tidyverse)

dup_ids <- anaemia %>%  # create new data set named dup_ids from anaemia dataframe
    count(id2) %>%  # create count variable n, counting how many rows there are for each id2
    filter(n>1) # keep only those with duplicate id2's
print(dup_ids)  # show dataset just created

filter(anaemia, id2 %in% dup_ids$id2) # print those where id2 is in the set of duplicated id2's

Removing partial duplicates:

anaemia %>%
    distinct(id2, .keep_all=TRUE)
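As written, this just displays the result. To keep the de-duplicated data, assign it to a new dataframe (anaemia_distinct is just an illustrative name):

# keep the first row for each id2 and save to a new dataframe
anaemia_distinct <- anaemia %>%
    distinct(id2, .keep_all=TRUE)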

Summarising partial duplicates:

anaemia.dup.sum <- anaemia %>%  # take anaemia dataset and save to anaemia.dup.sum dataset
    group_by(id2) %>%
    mutate(hb_pre.m=mean(hb_pre)) %>% # creates hb_pre.m var as mean across duplicates
    mutate(hb_post.m=mean(hb_post)) %>% # creates hb_post.m var as mean across duplicates
    distinct(id2, .keep_all=TRUE) %>%
    select(-hb_pre, -hb_post)  # removes original hb_pre & hb_post variables
print(anaemia.dup.sum)
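If any of the duplicated rows have missing hb values, mean() returns NA for that id2; use mean(hb_pre, na.rm=TRUE) (and similarly for hb_post) if you instead want the mean of the non-missing values.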

The following example uses a pair of IDs together (first_name and last_name; these variables are not in the anaemia dataset, so the dataframe is called aaa here).

library(tidyverse)

dup_ids <- aaa %>%
    count(first_name, last_name) %>%
    filter(n>1)

aaa %>%
    filter(first_name %in% dup_ids$first_name,
           last_name %in% dup_ids$last_name)
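Note that filtering each variable separately with %in% can also pick up rows where the first_name matches one duplicated pair and the last_name matches a different one. A stricter alternative (a sketch using dplyr's semi_join(), which matches on the combination of the two variables):

# keep only rows of aaa whose first_name/last_name combination appears in dup_ids
aaa %>%
    semi_join(dup_ids, by = c("first_name", "last_name"))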

Removing partial duplicates (with combination of 2 ID variables):

aaa %>%
    distinct(first_name, last_name, .keep_all=TRUE)

Summarising partial duplicates (with combination of 2 ID variables):

aaa %>%
    group_by(first_name, last_name) %>%
    mutate(mean_hb_pre=mean(hb_pre)) %>%
    distinct(first_name, last_name, .keep_all=TRUE) %>%
    select(-hb_pre)

The main dataset is called anaemia, available here: https://github.com/hcwatt/data_wrangling_open.

Data Wrangling Recipes in R: Hilary Watt. PCPH, Imperial College London.