10.2 merge(): Combining data

Argument Description
x, y Two dataframes to be merged
by A string vector of 1 or more columns to match the data by. For example, by = "id" will combine columns that have matching values in a column called "id". by = c("last.name", "first.name") will combine columns that have matching values in both "last.name" and "first.name"
all A logical value indicating whether or not to include rows with non-matching values of by.

One of the most common data management tasks is merging (aka combining) two data sets together. For example, imagine you conduct a study where 5 participants are given a score from 1 to 5 on a risk assessment task. We can represent these data in a dataframe called risk.survey:

Table 10.1: Results from a survey on risk.
participant risk.score
1 3
2 4
3 5
4 3
5 1
# Results from a risk survey
risk.survey <- data.frame(
  "participant" = c(1, 2, 3, 4, 5),
  "risk.score" = c(3, 4, 5, 3, 1))

Now, imagine that in a second study, you have participants complete a survey about their level of happiness (on a scale of 0 to 100). We can represent these data in a new dataframe called happiness.survey:

happiness.survey <- data.frame(
  "participant" = c(4, 2, 5, 1, 3),
  "happiness.score" = c(20, 40, 50, 90, 53))

Now, we’d like to combine these data into one data frame so that the two survey scores for each participant are contained in one object. To do this, use merge().

When you merge two dataframes, the result is a new dataframe that contains data from both dataframes. The key argument in merge() is by. The by argument specifies how rows should be matched during the merge. Usually, this will be something like an name, id number, or some other unique identifier.

Let’s combine our risk and happiness survey using merge(). Because we want to match rows by the participant.id column, we’ll specify by = "participant.id". Additionally, because we want to include rows with potentially non-matching values, we’ll include all = TRUE

# Combine the risk and happiness surveys by matching participant.id
combined.survey <- merge(x = risk.survey,
                         y = happiness.survey,
                         by = "participant",
                         all = TRUE)

# Print the result
combined.survey
##   participant risk.score happiness.score
## 1           1          3              90
## 2           2          4              40
## 3           3          5              53
## 4           4          3              20
## 5           5          1              50

For the rest of the chapter, we’ll cover data aggregation functions. These functions allow you to quickly and easily calculate aggregated summary statistics over groups of data in a data frame. For example, you can use them to answer questions such as “What was the mean crew age for each ship?”, or “What percentage of participants completed an attention check for each study condition?” We’ll start by going over the aggregate() function.