11 Combining/ merging/ appending datasets
Data Wrangling Recipes in R: Hilary Watt and Tristan Naidoo
One PhD supervisor spent ages before they realised that the lack of an appropriate ID variable was the reason why their student couldn’t merge their data sets; too obvious for them to consider. Yet understandable from perspective of overwhelm when new to R. Hence the value of the list of preparation required, and key considerations for merging data for analysis, with instructions on how to undertake each step.
Firstly, decide whether you want to combine 2 datasets, by adding variables (merge
command) OR by adding one data set to the bottom of another data set (adding rows using “row bind” rbind
command). To add both rows and variables simultaneously, use merge
.
To combine more than 2 data sets, use a sequence of merge
and/or rbind
commands, since we can only combine two at a time.
11.1 Adding rows when combining datasets
Adds rows from data frame named df1
to bottom of data frame named df2
, to create data frame named df12.
This only works when both datasets have the same number of variables, with the same variable names.
df12 <- rbind(df1, df2) # add df2 to bottom of df1
dim(df12) # see resulting number of rows & columns
For example, data frames ihd_pain1
and ihd_pain2
can be combined with rbind.
This only works when both datasets have the same number of variables, with the same variable names. It is at least sometimes robust to differences in data-type between data-sets in one or more named variable. Naturally this requires R to select one data-format for the variable in the new appended dataset.
## [1] 6 5
## [1] 8 5
# rbind command adds the second files onto the bottom of the first file
ihd_pain12 <- rbind(ihd_pain1, ihd_pain2)
print(ihd_pain12)
## [1] 14 5
When the number of variables is different, we can use the merge command with option all=TRUE, to achieve this. This will combine columns from both data-sets into one column, provided both have the same name. It is necessary to specify an ID variable for matching - if the desire is to append one dataset to the bottom of the other, this needs to take distinct values in each dataset. For instance, we can specify to match by personid. When both data-sets contain different people with distinct personid’s, the merged dataset contains rows from each dataset as distinct rows in the new dataset, with each row representing a distinct personid.
The first two lines of code is the artificial introduction of a new variables for the purposes of illustrating that the merge command works in this situation.
ihd_pain_copy <- ihd_pain1 # copy to new dataset
ihd_pain_copy$newvar <- ihd_pain_copy$hserial # artificial for example - add new variable
# Note: rbind does not work in this situation since the columns no longer match
ihd_pain_copy_merged <- merge(ihd_pain_copy, ihd_pain2, by='pserial', all=TRUE)
str(ihd_pain_copy_merged)
## 'data.frame': 14 obs. of 10 variables:
## $ pserial : int 100101101 100103101 100104101 100104102 100105101 100106101 100115102 100202101 100202102 100202103 ...
## $ X.x : int 1 2 3 4 5 6 NA NA NA NA ...
## $ hserial.x : int 1001011 1001031 1001041 1001041 1001051 1001061 NA NA NA NA ...
## $ anxiety.x : chr "Not anxious or depressed" "Moderately anxious or depressed" "Moderately anxious or depressed" "" ...
## $ paingrade.x: chr "Grade II - High intensity" "Grade II - High intensity" "Grade 0" "Grade 0" ...
## $ newvar : int 1001011 1001031 1001041 1001041 1001051 1001061 NA NA NA NA ...
## $ X.y : int NA NA NA NA NA NA 1 2 3 4 ...
## $ hserial.y : int NA NA NA NA NA NA 1001151 1002021 1002021 1002021 ...
## $ anxiety.y : chr NA NA NA NA ...
## $ paingrade.y: chr NA NA NA NA ...
## [1] 14 10
If we want to append one dataset to the bottom of the other, then we may need to rename some variables in one dataset first, to make sure the names are the same in both data-sets, so that data from both variables in the initial data-sets ends up in one column in the combined dataset. Occasionally, we may want to rename one variable to be different to the name in the other dataset, to make sure they end up in different columns. But that is unlikely when both data-sets contain distinct people.
This code demonstrates copying dataset ihd_pain1
to ihd_pain_copy
, then renaming 3 variables simultaneously, since this may be useful for some different situations where data is merged or appended.
ihd_pain_copy <- ihd_pain1 # copy to new dataset
ihd_pain_copy <- rename(ihd_pain_copy, xxx= X, personid = pserial, household_id = hserial)
summary(ihd_pain_copy)
## xxx household_id personid anxiety
## Min. :1.00 Min. :1001011 Min. :100101101 Length:6
## 1st Qu.:2.25 1st Qu.:1001034 1st Qu.:100103351 Class :character
## Median :3.50 Median :1001041 Median :100104102 Mode :character
## Mean :3.50 Mean :1001039 Mean :100103934
## 3rd Qu.:4.75 3rd Qu.:1001049 3rd Qu.:100104851
## Max. :6.00 Max. :1001061 Max. :100106101
## paingrade
## Length:6
## Class :character
## Mode :character
##
##
##
AVOID: cbind( )
binds columns, but is dangerous for this purpose, since it doesn’t match on anything – rows in the combined dataset might erroneously contain data from different people!
11.2 Find ID(s) for merging
merge
command to add variables: find a variable/set of variables for matching rows. We often match by patient identifying variable, perhaps called patientid.
It is crucial to check you know which variable(s) are unique identifiers in each dataset. See Unique ID variables Chapter 8. Occasionally, we might want to divide a string ID into component parts, and use one component part for merging. See Chapter 5.4 on character/ string variables.
It is crucial distinguish between one-to-one & one-to-many matching & find matching variables.
When both data sets contain one line of data per patientid (or other matching variables), this is one-to-one matching by patientid. If both datasets contain repeated measurements, then perhaps the combination of patientid and date uniquely define rows and both are required for matching observations. This would be one-to-one matching using the ID combination patientid and date.
If one dataset contains one row per patient and the other dataset contains multiple lines per patient (for repeated measurements), this is one-to-many matching. Or perhaps one dataset contains one line per area/ family/ hospital and the other contains one line per patient, with several lines for each area/ family/ hospital; this is also one-to-many matching as shown:
WARNING: Many to many matching is dangerous and never useful in my personal experience (vary rarely useful for others). This can multiply up the number of observations in your dataset – if ID 1 is duplicated in both, then ID 1 will result in 4 rows of data in the merged dataset (see image below). Three copies of one ID in both datasets results in 9 copies in the merged dataset. If you don’t check for uniqueness of ID variables & tidy up as necessary prior to merging, the result may be erroneous duplication of data.
11.3 Comparable datatypes for ID variables required to merge
You need some consistency of storage-type (between datasets) for any variable that you want to merge on. If you have numeric IDs stored as numbers in one dataset, but stored as string in the other, merging on this ID is impossible.
If you want to merge by date (say by patientid and date combination for repeated measures data), then firstly convert dates to date format. It cannot work if date in one dataset is in string format and date format in the other. Similarly POSIX (date/time) format and R date format are incompatible and cannot merge. (AVOID - perhaps you can merge when both dates are in string format, but this makes matching impossible when there are any differences in precise character strings for each date).
It is also easier if the matching variables have the same name in each dataset, but there is a more advanced form of the merge commands that can cope when this not the case.
Matching using string variables: We might need to tidy these up before we merge, by dropping leading and trailing blanks (that might have been introduced when the variables are merged). See Chapter 5.2 .
Merging with numeric data: it is a good idea to specify as integer data-type if possible. Otherwise, I’m no idea why but we might find that 7 is saved as 6.999999999999999999999 and the merge doesn’t happen. Use round( )
if necessary before converting to integer, although probably this rounding happens automatically during the conversion.
11.4 Why specify variables to merge on?
In R, we don’t need to specify variables for merging. This would explain why I recently heard of someone who struggled to merge for ages because she did not specify the merging variable(s). It didn’t occur to her supervisor that this information might be lacking, resulting in a lot of wasted time.
If we don’t specify them, R finds variables which have the same name in each, and merges on them. However, this might include cholesterol (or other continuous variables), and issues with rounding might mean it doesn’t merge, even when numbers are effectively the same. It might include age, taken at 2 different time points. When people are questioned twice about their ethnic group, they may give different answers each time (particularly if mixed race). There might be many NA
entries for one variable in one dataset.
It is far better to specify the matching variables.
The merged dataset then contains 2 versions of any variable that has the same name in each. This gives us the opportunity to check them to consistency if we want. It gives us the opportunity to use the most complete, or to use data from both versions.
11.5 Choosing what to keep after merging
For R, we don’t specify whether we want 1:1 or 1: many merging (avoiding many:many merging). In R, we don’t get error messages when things go wrong (such as rows being duplicated with accidental repeats in each dataset). We don’t need to specify what variables to merge on.
In R, there is a focus on specifying which observations to keep – either those present in both merging datasets, or present in either one, or else we choose which dataset they need to be present in. Whilst it is useful to make such choices, I dislike R’s focus on this as if it is the most important thing to focus on when merging.
To merge dataframes df_a
and df_b
, matching on variable ID
, which must be present in both df_a
and df_b.
Inner join: Keeping only ids that are in both df_a
and df_b
:
Outer join: Keeping all ids that are in EITHER df_a
OR df_b
OR both:
Left join: Keeping all ids that are in df_a
:
Right join: Merge on combination of variables, keeping all ID/ gender combinations that are in df_b
:
The commands inner_join( )
, left_join( )
, outer_join( )
are equivalent to the above versions of merge( )
.
EXAMPLE: merging the following 2 datasets with inner join (keeping only hserial
and pserial
ids from BOTH dataset), using by=c(“hserial”, “pserial”)
:
This results in the following, keeping only those with data present in BOTH above datasets:
Notice that X
is repeated in the merged dataset. X.x
and X.y
represent the X
variable from ihd_pain12
and ihd_pain3
respectively. This is because X
is a variable in both datasets, but we don’t use it as part of the merging ID. In such cases, if we want to avoid the duplication, we need to drop the variable from one of the input datasets. Dropping variables is discussed in Section 1.6.
11.6 Reconciling number of rows and inspecting data that has not merged
anti_join( )
is useful to inspect what data has not merged. For each merge, there are 2 versions of this, since specifying the datasets to merge in a different order gives a different result.
anti_join( )
gives data from the first specified dataset, that do not appear in the merged (inner join) dataset. Below is the result for our example in the previous section. Notice, we need to specify the same ID columns in anti_join
as you did in our merge
command in the previous section.
# print rows from ihd_pain12, but not merged
print(anti_join(x=ihd_pain12, y=ihd_pain3, by=c("hserial", "pserial")))
# print rows ihd_pain3, but not merged
print(anti_join(x=ihd_pain3, y=ihd_pain12, by=c("hserial", "pserial")))
Because R merge can inadvertently create extra rows, it is good practice to reconcile total number of observations, to make sure this has not happened.
## [1] 14 5
## [1] 12 5
## [1] 10 8
dim(anti_join(x=ihd_pain12, y=ihd_pain3, by=c("hserial", "pserial"))) # from ihd_pain12, but not merged
## [1] 4 5
dim(anti_join(x=ihd_pain3, y=ihd_pain12, by=c("hserial", "pserial"))) # from ihd_pain3, but not merged
## [1] 2 5
Number of rows in first original datasets (ihd_pain12
) \(\small{= 14}\) which (without any accidental duplication introduced) equals:
SUM of number of rows in inner merged dataset, plus (ihd_pain12
)’s anti_joins \(\small{= 10 + 4 = 14}\) – agrees to above
Number of rows in first original datasets (ihd_pain3
) = 12 which (without any accidental duplication introduced) equals:
SUM of number of rows in inner merged dataset, plus (ihd_pain3
)’s anti_joins = \(\small{10 + 2 = 12}\)
It is valuable to reconcile numbers as above. If they don’t, this implies accidental duplication of observations, which implies that the ID variable(s) used for matching were not unique in either dataset. To avoid this, make sure you remove duplicate IDs prior to merging.
“inner join” implies keeping only rows with merged data from both datasets. For merges where more rows are kept, it is still useful to do the above reconciliation of numbers with inner_join
, to make sure merge is not accidentally duplicating observations. Left joins keeping all those from first named dataset and outer joins keeping all rows from any dataset
11.7 Using merge to compare variables between data sets
If we do not specify which variables to merge on, then R selects all variables with the same name in each data set. If this results in a data set with one row per personid (when original data sets have one row per personid), this proves that values on other variables agree. However, minor rounding issues might cause them to apparently not agree, when really the agreement is pretty good.
11.8 Exercises
Combine ihd_pain_first.csv and ihd_pain_more.csv, and save the result as ihd_pain.csv. Can you use merge or rbind?
Combine ihd_pain.csv, ihddata.csv and ihd_household.csv. Can you use merge or rbind?
Datasets are available here: https://github.com/hcwatt/data_wrangling_open.
Data Wrangling Recipes in R: Hilary Watt and Tristan Naidoo. Imperial College London.