8.2 Essential join commands
As the functions for joining two tables x
and y
are implemented in the dplyr package (Wickham, François, Henry, & Müller, 2022), the two-table verbs use the same terminology as the one-table verbs discussed in Chapter 3 on Data transformation:
Mutating commands affect the variables (columns) of a table, whereas filtering commands affect the cases or observations (rows) of a table.
In the following, we distinguish between three types of joins:
Mutating joins combine the variables (columns) of two tables (Section 8.2.1);
Filtering joins affect only the cases or observations (rows) of two tables (Section 8.2.2);
Set operations on two tables expect both tables to have the same variables and treat them like sets that can be combined or intersected (Section 8.2.3).
Finally, we will see that the merge()
function from base R can also be used to accomplish mutating joins (Section 8.2.4).
8.2.1 Mutating joins
Mutating joins combine the variables (columns) of two tables x
and y
.
The term mutating is used in analogy to the dplyr::mutate()
command: Just as mutate()
creates new variables (columns),
a mutating join adds new variables (columns) to a table x
from matching cases (rows) of another table y
.
Key functions
There are four types of mutating joins to combine the cases (rows) and variables (columns) of tables x
and y
:
left_join(x, y)
: Return all rows fromx
, and all columns fromx
andy
. Rows inx
with no match iny
will haveNA
values in the new columns. If there are multiple matches betweenx
andy
, all combinations of the matches are returned.right_join(x, y)
: Same asleft_join(y, x)
, and hence redundant.full_join(x, y)
: Return all rows and all columns that exist inx
or iny
. If there are non-matching values, returnNA
for the ones missing.inner_join(x, y)
: Return all rows fromx
where there are matching values iny
, and all columns fromx
andy
. If there are multiple matches betweenx
andy
, all combination of the matches are returned.
In contrast to an inner_join(x, y)
, the first three joins are called outer joins. See Section 13.4 on mutating joins (Wickham & Grolemund, 2017) for graphical illustrations of these joins.
Inner vs. outer joins
All four types of mutating joins combine variables (columns) from two tables, but differ in preserving in the cases (rows) in the output table.
An inner join inner_join(x, y)
keeps only the cases (or rows) that appear in both tables (i.e., x
and y
).
By contrast, the first three join commands keep cases (observations) that appear in at least one of the tables and are also called outer joins.
The three types of outer joins also differ with respect to the cases they preserve:
left_join(x, y)
keeps all cases ofx
.right_join(x, y)
keeps all cases ofy
.full_join(x, y)
keeps all cases ofx
ory
.
Terminology of key variables
Beyond specifying the two tables to join, all *_join()
commands take a by
argument:
by
specifies a character vector of variables by which we want to join the tables (so-called keys).
In the r4ds textbook (Wickham & Grolemund, 2017), Section 13.3 on keys makes a distinction between primary and foreign keys, that stems from database terminology. These terms can be confusing, though the distinction behind it is actually quite simple:
A key is simply a variable that occurs in two distinct tables (
x
andy
) and by which we want to join both tables.Importantly, the variable of interest does not always need to have the same name in both tables. For instance, a variable may be called
surname
inx
andfamily_name
iny
, but both these variables actually have the same referent. (i.e., the name of a specific person). The distinction between primary vs. secondary now depends on our perspective:When looking from table
x
at tabley
,surname
is a primary key andfamily_name
a secondary key.When looking from table
y
at tablex
,family_name
is a primary key andsurname
a secondary key.
The primary key (in table
x
) and the corresponding foreign key in another table (y
) form a so-called relation. Such relations are typically one-to-many. For instance, eachname
orid
variable in one table corresponds to a particular person, but each person can contribute data to several measurement occasions in another table (see, e.g., the variablesid
andintervention
in the tableposPsy_p_info
vs. in the tableposPsy_long
of the ds4psy package). A special case occurs when there is a one-to-one relationship (see, e.g., the variablesid
andintervention
in the tablesposPsy_p_info
andposPsy_wide
of the ds4psy package).
Actually, the distinction between two different types of keys is irrelevant, as long as we know how we want to join the two tables (e.g., which table is our primary table) and which relation(s) exist between both tables (i.e., which variables occur in both tables).
The by
argument of any *_join()
command can either be NULL
or a named vector:
If
by = NULL
(as by default)*_join
will perform a natural join, using all variables with common names across the two tables. A message lists the variables that are used, so that we can verify that this is what we wanted.To join by differently named variables on
x
andy
, we can provide a named vector. For instance,by = c("a" = "b")
will match a (primary) key variablea
in tablex
to a (secondary) key variableb
in tabley
.
Examples of joining tables
To test these commands, we load the following data files:
- data_t1.csv, available as
data_t1
in the ds4psy package
- data_t2.csv, available as
data_t2
in the ds4psy package
# Data:
<- ds4psy::data_t1 # from ds4psy package
data_t1 <- ds4psy::data_t2 # from ds4psy package
data_t2
# Copy data:
<- data_t1
t1 <- data_t2
t2
# Inspect data:
head(t1)
#> # A tibble: 6 × 4
#> name gender like_1 bnt_1
#> <chr> <chr> <dbl> <dbl>
#> 1 R.S. female 2 NA
#> 2 M.Y. male 2 4
#> 3 K.R. male 4 1
#> 4 A.V. female 3 2
#> 5 Z.Y. female NA NA
#> 6 X.D. female 4 4
head(t2)
#> # A tibble: 6 × 4
#> name gender like_2 bnt_2
#> <chr> <chr> <dbl> <dbl>
#> 1 X.D. female 5 1
#> 2 T.D. male 3 NA
#> 3 Z.D. male 7 4
#> 4 L.S. male 7 4
#> 5 H.V. female 5 2
#> 6 Q.N. female 6 2
Note that the data in t1
and t2
are both 20 x 4 tibbles and appear to contain the responses or test scores of people.
This could suggest that they may be two measurements of the same people.
However, before jumping into joining these tables, we should check whether they really contain the same people:
# Do both tables contain common variables?
sum(names(t1) %in% names(t2)) # Number of common variables:
#> [1] 2
names(t1)[names(t1) %in% names(t2)] # Names of common variables:
#> [1] "name" "gender"
# Arrange the rows of tables by common variables:
<- t1 %>% arrange(name, gender)
t1 <- t2 %>% arrange(name, gender)
t2
# Are they equal in both tables?
all.equal(t1$name, t2$name)
#> [1] TRUE
all.equal(t1$gender, t2$gender)
#> [1] TRUE
This confirms our intuition: t1
and t2
contain two variables of demographic information (name
and gender
) of 20 people and each table contains two additional variables (measurements or test scores).
Joining two tables with identical cases
When both tables contain identical cases, the four mutating join commands all yield the same result:
# 1. left_join:
<- left_join(t1, t2)
m1
m1#> # A tibble: 20 × 6
#> name gender like_1 bnt_1 like_2 bnt_2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A.F. female 5 1 4 1
#> 2 A.V. female 3 2 5 3
#> 3 C.N. female 4 3 2 2
#> 4 H.V. female 1 4 5 2
#> 5 K.R. male 4 1 6 1
#> 6 K.X. male 3 3 5 4
#> 7 L.S. male 4 4 7 4
#> 8 L.V. female 4 2 2 4
#> 9 M.O. male 4 1 NA NA
#> 10 M.Y. male 2 4 6 2
#> 11 Q.N. female 6 1 6 2
#> 12 R.S. female 2 NA 7 3
#> 13 T.D. male 5 2 3 NA
#> 14 T.V. male 6 4 4 4
#> 15 X.D. female 4 4 5 1
#> 16 X.P. female 4 1 4 1
#> 17 X.Y. female 5 4 5 2
#> 18 Y.G. male 4 4 4 4
#> 19 Z.D. male 5 4 7 4
#> 20 Z.Y. female NA NA 7 1
# 2. right_join:
<- right_join(t1, t2)
m2
m2#> # A tibble: 20 × 6
#> name gender like_1 bnt_1 like_2 bnt_2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A.F. female 5 1 4 1
#> 2 A.V. female 3 2 5 3
#> 3 C.N. female 4 3 2 2
#> 4 H.V. female 1 4 5 2
#> 5 K.R. male 4 1 6 1
#> 6 K.X. male 3 3 5 4
#> 7 L.S. male 4 4 7 4
#> 8 L.V. female 4 2 2 4
#> 9 M.O. male 4 1 NA NA
#> 10 M.Y. male 2 4 6 2
#> 11 Q.N. female 6 1 6 2
#> 12 R.S. female 2 NA 7 3
#> 13 T.D. male 5 2 3 NA
#> 14 T.V. male 6 4 4 4
#> 15 X.D. female 4 4 5 1
#> 16 X.P. female 4 1 4 1
#> 17 X.Y. female 5 4 5 2
#> 18 Y.G. male 4 4 4 4
#> 19 Z.D. male 5 4 7 4
#> 20 Z.Y. female NA NA 7 1
# 3. full_join:
<- full_join(t1, t2)
m3
m3#> # A tibble: 20 × 6
#> name gender like_1 bnt_1 like_2 bnt_2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A.F. female 5 1 4 1
#> 2 A.V. female 3 2 5 3
#> 3 C.N. female 4 3 2 2
#> 4 H.V. female 1 4 5 2
#> 5 K.R. male 4 1 6 1
#> 6 K.X. male 3 3 5 4
#> 7 L.S. male 4 4 7 4
#> 8 L.V. female 4 2 2 4
#> 9 M.O. male 4 1 NA NA
#> 10 M.Y. male 2 4 6 2
#> 11 Q.N. female 6 1 6 2
#> 12 R.S. female 2 NA 7 3
#> 13 T.D. male 5 2 3 NA
#> 14 T.V. male 6 4 4 4
#> 15 X.D. female 4 4 5 1
#> 16 X.P. female 4 1 4 1
#> 17 X.Y. female 5 4 5 2
#> 18 Y.G. male 4 4 4 4
#> 19 Z.D. male 5 4 7 4
#> 20 Z.Y. female NA NA 7 1
# 4. inner_join:
<- inner_join(t1, t2)
m4
m4#> # A tibble: 20 × 6
#> name gender like_1 bnt_1 like_2 bnt_2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A.F. female 5 1 4 1
#> 2 A.V. female 3 2 5 3
#> 3 C.N. female 4 3 2 2
#> 4 H.V. female 1 4 5 2
#> 5 K.R. male 4 1 6 1
#> 6 K.X. male 3 3 5 4
#> 7 L.S. male 4 4 7 4
#> 8 L.V. female 4 2 2 4
#> 9 M.O. male 4 1 NA NA
#> 10 M.Y. male 2 4 6 2
#> 11 Q.N. female 6 1 6 2
#> 12 R.S. female 2 NA 7 3
#> 13 T.D. male 5 2 3 NA
#> 14 T.V. male 6 4 4 4
#> 15 X.D. female 4 4 5 1
#> 16 X.P. female 4 1 4 1
#> 17 X.Y. female 5 4 5 2
#> 18 Y.G. male 4 4 4 4
#> 19 Z.D. male 5 4 7 4
#> 20 Z.Y. female NA NA 7 1
# Verify equality:
all.equal(m1, m2)
#> [1] TRUE
all.equal(m1, m3)
#> [1] TRUE
all.equal(m1, m4)
#> [1] TRUE
Note that the join commands automatically used the two common variables as key
variables to join the tables (i.e., used
by = c("name", "gender")
by default). This raises the question: What would happen if we specified only one of these variables in the by
argument?
Let’s check to find out:
left_join(t1, t2, by = "name") # => 2 gender columns, distinguished by suffixes .x and .y
#> # A tibble: 20 × 7
#> name gender.x like_1 bnt_1 gender.y like_2 bnt_2
#> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 A.F. female 5 1 female 4 1
#> 2 A.V. female 3 2 female 5 3
#> 3 C.N. female 4 3 female 2 2
#> 4 H.V. female 1 4 female 5 2
#> 5 K.R. male 4 1 male 6 1
#> 6 K.X. male 3 3 male 5 4
#> 7 L.S. male 4 4 male 7 4
#> 8 L.V. female 4 2 female 2 4
#> 9 M.O. male 4 1 male NA NA
#> 10 M.Y. male 2 4 male 6 2
#> 11 Q.N. female 6 1 female 6 2
#> 12 R.S. female 2 NA female 7 3
#> 13 T.D. male 5 2 male 3 NA
#> 14 T.V. male 6 4 male 4 4
#> 15 X.D. female 4 4 female 5 1
#> 16 X.P. female 4 1 female 4 1
#> 17 X.Y. female 5 4 female 5 2
#> 18 Y.G. male 4 4 male 4 4
#> 19 Z.D. male 5 4 male 7 4
#> 20 Z.Y. female NA NA female 7 1
left_join(t1, t2, by = "gender") # => 2 name columns, distinguished by suffixes .x and .y
#> # A tibble: 202 × 7
#> name.x gender like_1 bnt_1 name.y like_2 bnt_2
#> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 A.F. female 5 1 A.F. 4 1
#> 2 A.F. female 5 1 A.V. 5 3
#> 3 A.F. female 5 1 C.N. 2 2
#> 4 A.F. female 5 1 H.V. 5 2
#> 5 A.F. female 5 1 L.V. 2 4
#> 6 A.F. female 5 1 Q.N. 6 2
#> 7 A.F. female 5 1 R.S. 7 3
#> 8 A.F. female 5 1 X.D. 5 1
#> 9 A.F. female 5 1 X.P. 4 1
#> 10 A.F. female 5 1 X.Y. 5 2
#> # … with 192 more rows
We see that common variables not used for joining the tables get duplicated in the joint table and distinguished by a unique suffix.
We can control the suffix of duplicate variables by providing a suffix
argument:
# To control the suffix of duplicate variables:
left_join(t1, t2, by = "name", suffix = c("_001", "_002"))
#> # A tibble: 20 × 7
#> name gender_001 like_1 bnt_1 gender_002 like_2 bnt_2
#> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 A.F. female 5 1 female 4 1
#> 2 A.V. female 3 2 female 5 3
#> 3 C.N. female 4 3 female 2 2
#> 4 H.V. female 1 4 female 5 2
#> 5 K.R. male 4 1 male 6 1
#> 6 K.X. male 3 3 male 5 4
#> 7 L.S. male 4 4 male 7 4
#> 8 L.V. female 4 2 female 2 4
#> 9 M.O. male 4 1 male NA NA
#> 10 M.Y. male 2 4 male 6 2
#> 11 Q.N. female 6 1 female 6 2
#> 12 R.S. female 2 NA female 7 3
#> 13 T.D. male 5 2 male 3 NA
#> 14 T.V. male 6 4 male 4 4
#> 15 X.D. female 4 4 female 5 1
#> 16 X.P. female 4 1 female 4 1
#> 17 X.Y. female 5 4 female 5 2
#> 18 Y.G. male 4 4 male 4 4
#> 19 Z.D. male 5 4 male 7 4
#> 20 Z.Y. female NA NA female 7 1
Joining two tables with different cases
To see and appreciate the differences between the four mutating joins, we need tables that contain different and repeated cases. We can easily create these by selecting and duplicating some cases in each of our tables:51
# Create t3 and t4: ------
<- data_t1 # copy
t3 <- data_t2 # copy
t4
# Modify both tables: ------
# (a) Draw n random rows (samples):
set.seed(12) # for replicability
<- 8
n <- sample(1:nrow(t3), size = n, replace = FALSE) # n random values from 1 to nrow(t3)
s1 <- sample(1:nrow(t4), size = n, replace = FALSE) # n random values from 1 to nrow(t4)
s2
# Reduce rows and arrange both tables:
<- t3[s1, ]
t3 <- t4[s2, ]
t4
# (b) Duplicate m random rows:
set.seed(23) # for replicability
<- 2
m <- sample(1:nrow(t3), size = m, replace = FALSE) # n random values from 1 to nrow(t3)
d1 <- sample(1:nrow(t4), size = m, replace = FALSE) # n random values from 1 to nrow(t4)
d2
# Duplicate sampled rows:
<- rbind(t3, t3[d1, ])
t3 <- rbind(t4, t4[d2, ])
t4
# Arrange tables:
<- t3 %>% arrange(name, gender)
t3 <- t4 %>% arrange(name, gender) t4
The resulting tables t3
and t4
are both 10 x 4 tibbles and contain some common and some unique cases (people).
In addition, each table contains two people twice:
# Inspect resulting tables: ------
# 10 x 4
t3 #> # A tibble: 10 × 4
#> name gender like_1 bnt_1
#> <chr> <chr> <dbl> <dbl>
#> 1 H.V. female 1 4
#> 2 L.V. female 4 2
#> 3 M.Y. male 2 4
#> 4 Q.N. female 6 1
#> 5 Q.N. female 6 1
#> 6 T.D. male 5 2
#> 7 X.P. female 4 1
#> 8 X.Y. female 5 4
#> 9 Z.Y. female NA NA
#> 10 Z.Y. female NA NA
# 10 x 4
t4 #> # A tibble: 10 × 4
#> name gender like_2 bnt_2
#> <chr> <chr> <dbl> <dbl>
#> 1 A.F. female 4 1
#> 2 A.V. female 5 3
#> 3 C.N. female 2 2
#> 4 L.V. female 2 4
#> 5 Q.N. female 6 2
#> 6 Q.N. female 6 2
#> 7 T.V. male 4 4
#> 8 X.P. female 4 1
#> 9 Y.G. male 4 4
#> 10 Y.G. male 4 4
# Common cases:
$name[t3$name %in% t4$name] # people in t3 that also appear in t4
t3#> [1] "L.V." "Q.N." "Q.N." "X.P."
$name[t4$name %in% t3$name] # people in t4 that also appear in t3
t4#> [1] "L.V." "Q.N." "Q.N." "X.P."
# Unique cases:
$name[!t3$name %in% t4$name] # people in t3 that are NOT in t4
t3#> [1] "H.V." "M.Y." "T.D." "X.Y." "Z.Y." "Z.Y."
$name[!t4$name %in% t3$name] # people in t4 that are NOT in t3
t4#> [1] "A.F." "A.V." "C.N." "T.V." "Y.G." "Y.G."
Now let’s re-do the four mutating joins with t3
and t4
and inspect the resulting tables:
# 1. left_join:
<- left_join(t3, t4)
m5 # => 12 x 6 table, containing all cases of t3, plus additional cases for duplicate cases.
m5 #> # A tibble: 12 × 6
#> name gender like_1 bnt_1 like_2 bnt_2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 H.V. female 1 4 NA NA
#> 2 L.V. female 4 2 2 4
#> 3 M.Y. male 2 4 NA NA
#> 4 Q.N. female 6 1 6 2
#> 5 Q.N. female 6 1 6 2
#> 6 Q.N. female 6 1 6 2
#> 7 Q.N. female 6 1 6 2
#> 8 T.D. male 5 2 NA NA
#> 9 X.P. female 4 1 4 1
#> 10 X.Y. female 5 4 NA NA
#> 11 Z.Y. female NA NA NA NA
#> 12 Z.Y. female NA NA NA NA
# 2. right_join:
<- right_join(t3, t4)
m6 # => 12 x 6 table, containing all cases of t4, plus additional cases for duplicate cases.
m6 #> # A tibble: 12 × 6
#> name gender like_1 bnt_1 like_2 bnt_2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 L.V. female 4 2 2 4
#> 2 Q.N. female 6 1 6 2
#> 3 Q.N. female 6 1 6 2
#> 4 Q.N. female 6 1 6 2
#> 5 Q.N. female 6 1 6 2
#> 6 X.P. female 4 1 4 1
#> 7 A.F. female NA NA 4 1
#> 8 A.V. female NA NA 5 3
#> 9 C.N. female NA NA 2 2
#> 10 T.V. male NA NA 4 4
#> 11 Y.G. male NA NA 4 4
#> 12 Y.G. male NA NA 4 4
all.equal(m5, m6) # shows differences, but
#> [1] "Component \"name\": 10 string mismatches"
#> [2] "Component \"gender\": 5 string mismatches"
#> [3] "Component \"like_1\": 'is.NA' value mismatch: 6 in current 2 in target"
#> [4] "Component \"bnt_1\": 'is.NA' value mismatch: 6 in current 2 in target"
#> [5] "Component \"like_2\": 'is.NA' value mismatch: 0 in current 6 in target"
#> [6] "Component \"bnt_2\": 'is.NA' value mismatch: 0 in current 6 in target"
all.equal(m5, right_join(t4, t3)) # is TRUE
#> [1] "Names: 4 string mismatches"
#> [2] "Component \"name\": 7 string mismatches"
#> [3] "Component \"gender\": 2 string mismatches"
#> [4] "Component 3: 'is.NA' value mismatch: 6 in current 2 in target"
#> [5] "Component 4: 'is.NA' value mismatch: 6 in current 2 in target"
#> [6] "Component 5: 'is.NA' value mismatch: 2 in current 6 in target"
#> [7] "Component 6: 'is.NA' value mismatch: 2 in current 6 in target"
# 3. full_join:
<- full_join(t3, t4)
m7 # => 17 x 6 table, i.e. all cases of t3 and of t4
m7 #> # A tibble: 18 × 6
#> name gender like_1 bnt_1 like_2 bnt_2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 H.V. female 1 4 NA NA
#> 2 L.V. female 4 2 2 4
#> 3 M.Y. male 2 4 NA NA
#> 4 Q.N. female 6 1 6 2
#> 5 Q.N. female 6 1 6 2
#> 6 Q.N. female 6 1 6 2
#> 7 Q.N. female 6 1 6 2
#> 8 T.D. male 5 2 NA NA
#> 9 X.P. female 4 1 4 1
#> 10 X.Y. female 5 4 NA NA
#> 11 Z.Y. female NA NA NA NA
#> 12 Z.Y. female NA NA NA NA
#> 13 A.F. female NA NA 4 1
#> 14 A.V. female NA NA 5 3
#> 15 C.N. female NA NA 2 2
#> 16 T.V. male NA NA 4 4
#> 17 Y.G. male NA NA 4 4
#> 18 Y.G. male NA NA 4 4
# 4. inner_join:
<- inner_join(t3, t4)
m8 # => 7 x 6 table, i.e. all cases that were in both t3 and t4
m8 #> # A tibble: 6 × 6
#> name gender like_1 bnt_1 like_2 bnt_2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 L.V. female 4 2 2 4
#> 2 Q.N. female 6 1 6 2
#> 3 Q.N. female 6 1 6 2
#> 4 Q.N. female 6 1 6 2
#> 5 Q.N. female 6 1 6 2
#> 6 X.P. female 4 1 4 1
Thus, when joining two tables with different cases, it really matters which join-command we are using.
8.2.2 Filtering joins
Filtering joins match cases (or observations) in the same way as mutating joins, but only affect the cases (observations or rows), not the variables (columns). There are two types of filtering joins:
semi_join(x, y)
keeps all observations inx
that have a match iny
(i.e., observations inx
without a match iny
are dropped).anti_join(x, y)
drops all observations inx
that have a match iny
(i.e., observations inx
without a match iny
are kept).
See Section 13.5 on filtering joins of the r4ds textbook (Wickham & Grolemund, 2017) for graphical illustrations.
Notes
anti_join(x, y)
yields the complement ofsemi_join(x, y)
(with respect tox
).Both types of filtering joins concern only the cases of
x
and filter them by the cases ofy
.The resulting table of a filtering join preserves the variables of
x
, not the variables ofy
.
Examples
semi_join(t3, t4) # keeps cases in t3 that are also in t4
#> # A tibble: 4 × 4
#> name gender like_1 bnt_1
#> <chr> <chr> <dbl> <dbl>
#> 1 L.V. female 4 2
#> 2 Q.N. female 6 1
#> 3 Q.N. female 6 1
#> 4 X.P. female 4 1
anti_join(t3, t4) # keeps cases in t3 that are NOT in t4
#> # A tibble: 6 × 4
#> name gender like_1 bnt_1
#> <chr> <chr> <dbl> <dbl>
#> 1 H.V. female 1 4
#> 2 M.Y. male 2 4
#> 3 T.D. male 5 2
#> 4 X.Y. female 5 4
#> 5 Z.Y. female NA NA
#> 6 Z.Y. female NA NA
# Note:
full_join(semi_join(t3, t4), anti_join(t3, t4))
#> # A tibble: 10 × 4
#> name gender like_1 bnt_1
#> <chr> <chr> <dbl> <dbl>
#> 1 L.V. female 4 2
#> 2 Q.N. female 6 1
#> 3 Q.N. female 6 1
#> 4 X.P. female 4 1
#> 5 H.V. female 1 4
#> 6 M.Y. male 2 4
#> 7 T.D. male 5 2
#> 8 X.Y. female 5 4
#> 9 Z.Y. female NA NA
#> 10 Z.Y. female NA NA
# contains the same cases as t3.
Use cases
Filtering joins typically reduce the cases of x
to the cases contained in y
(semi_join()
) or to the cases not contained in y
(anti_join()
).
Typical use cases include:
- Semi-joins are useful for matching filtered summary tables back to the original rows:
## Data:
<- dplyr::starwars
sw # sw # 87 x 13
# Determine the top 3 homeworlds of Humans:
<- sw %>%
top_3_human_homes filter(species == "Human") %>%
group_by(homeworld) %>%
count() %>%
arrange(desc(n)) %>%
head(3)
top_3_human_homes#> # A tibble: 3 × 2
#> # Groups: homeworld [3]
#> homeworld n
#> <chr> <int>
#> 1 Tatooine 8
#> 2 Naboo 5
#> 3 <NA> 5
# Which individuals (human and non-human) are from these homeworlds?
semi_join(sw, top_3_human_homes, by = "homeworld") # => 31 individuals
#> # A tibble: 31 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
#> 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
#> 4 Darth V… 202 136 none white yellow 41.9 male mascu…
#> 5 Owen La… 178 120 brown, gr… light blue 52 male mascu…
#> 6 Beru Wh… 165 75 brown light blue 47 fema… femin…
#> 7 R5-D4 97 32 <NA> white, red red NA none mascu…
#> 8 Biggs D… 183 84 black light brown 24 male mascu…
#> 9 Anakin … 188 84 blond fair blue 41.9 male mascu…
#> 10 Yoda 66 17 white green brown 896 male mascu…
#> # … with 21 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
- Anti-joins are useful for diagnosing join mismatches:
left_join(t3, t4) # adds variables (columns) of t4 to cases (rows) of t3
#> # A tibble: 12 × 6
#> name gender like_1 bnt_1 like_2 bnt_2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 H.V. female 1 4 NA NA
#> 2 L.V. female 4 2 2 4
#> 3 M.Y. male 2 4 NA NA
#> 4 Q.N. female 6 1 6 2
#> 5 Q.N. female 6 1 6 2
#> 6 Q.N. female 6 1 6 2
#> 7 Q.N. female 6 1 6 2
#> 8 T.D. male 5 2 NA NA
#> 9 X.P. female 4 1 4 1
#> 10 X.Y. female 5 4 NA NA
#> 11 Z.Y. female NA NA NA NA
#> 12 Z.Y. female NA NA NA NA
anti_join(t3, t4) # shows cases in t3 NOT in t4 (with NA values in new variables)
#> # A tibble: 6 × 4
#> name gender like_1 bnt_1
#> <chr> <chr> <dbl> <dbl>
#> 1 H.V. female 1 4
#> 2 M.Y. male 2 4
#> 3 T.D. male 5 2
#> 4 X.Y. female 5 4
#> 5 Z.Y. female NA NA
#> 6 Z.Y. female NA NA
anti_join(t4, t3) # shows cases in t4 NOT in t3 (missing from joined table)
#> # A tibble: 6 × 4
#> name gender like_2 bnt_2
#> <chr> <chr> <dbl> <dbl>
#> 1 A.F. female 4 1
#> 2 A.V. female 5 3
#> 3 C.N. female 2 2
#> 4 T.V. male 4 4
#> 5 Y.G. male 4 4
#> 6 Y.G. male 4 4
See Section 13.6 of the r4ds textbook (Wickham & Grolemund, 2017) for common join problems and ways of dealing with them.
8.2.3 Set operations
Set operations on two tables x
and y
expect both tables to have the same variables.
They always compare complete rows of x
and y
, and compare the values of every variable.
Treating the observations like sets yields three different commands:
intersect(x, y)
returns only observations that occur in bothx
and iny
.union(x, y)
returns all unique observations inx
andy
.setdiff(x, y)
return those observations that occur inx
, but not iny
.
Examples
# Data:
<- dplyr::starwars %>%
sw select(name:species) # only first 10 columns of starwars
# 87 x 10
sw #> # A tibble: 87 × 11
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke S… 172 77 blond fair blue 19 male mascu…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
#> 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
#> 4 Darth … 202 136 none white yellow 41.9 male mascu…
#> 5 Leia O… 150 49 brown light brown 19 fema… femin…
#> 6 Owen L… 178 120 brown, grey light blue 52 male mascu…
#> 7 Beru W… 165 75 brown light blue 47 fema… femin…
#> 8 R5-D4 97 32 <NA> white, red red NA none mascu…
#> 9 Biggs … 183 84 black light brown 24 male mascu…
#> 10 Obi-Wa… 182 77 auburn, wh… fair blue-gray 57 male mascu…
#> # … with 77 more rows, and 2 more variables: homeworld <chr>, species <chr>
# Create 2 filtered sub-sets:
<- sw %>% filter(species == "Human", height < 180)
s1 # 15 indivduals
s1 #> # A tibble: 15 × 11
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
#> 2 Leia Or… 150 49 brown light brown 19 fema… femin…
#> 3 Owen La… 178 120 brown, gr… light blue 52 male mascu…
#> 4 Beru Wh… 165 75 brown light blue 47 fema… femin…
#> 5 Wedge A… 170 77 brown fair hazel 21 male mascu…
#> 6 Palpati… 170 75 grey pale yellow 82 male mascu…
#> 7 Lando C… 177 79 black dark brown 31 male mascu…
#> 8 Lobot 175 79 none light blue 37 male mascu…
#> 9 Mon Mot… 150 NA auburn fair blue 48 fema… femin…
#> 10 Finis V… 170 NA blond fair blue 91 male mascu…
#> 11 Shmi Sk… 163 NA black fair brown 72 fema… femin…
#> 12 Cordé 157 NA brown light brown NA fema… femin…
#> 13 Dormé 165 NA brown light brown NA fema… femin…
#> 14 Jocasta… 167 NA white fair blue NA fema… femin…
#> 15 Padmé A… 165 45 brown light brown 46 fema… femin…
#> # … with 2 more variables: homeworld <chr>, species <chr>
<- sw %>% filter(homeworld == "Naboo" | homeworld == "Tatooine")
s2 # 21 individuals
s2 #> # A tibble: 21 × 11
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
#> 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
#> 4 Darth V… 202 136 none white yellow 41.9 male mascu…
#> 5 Owen La… 178 120 brown, gr… light blue 52 male mascu…
#> 6 Beru Wh… 165 75 brown light blue 47 fema… femin…
#> 7 R5-D4 97 32 <NA> white, red red NA none mascu…
#> 8 Biggs D… 183 84 black light brown 24 male mascu…
#> 9 Anakin … 188 84 blond fair blue 41.9 male mascu…
#> 10 Palpati… 170 75 grey pale yellow 82 male mascu…
#> # … with 11 more rows, and 2 more variables: homeworld <chr>, species <chr>
# Set operations:
intersect(s1, s2) # 8 individuals are in both s1 and s2
#> # A tibble: 8 × 11
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
#> 2 Owen La… 178 120 brown, grey light blue 52 male mascu…
#> 3 Beru Wh… 165 75 brown light blue 47 fema… femin…
#> 4 Palpati… 170 75 grey pale yellow 82 male mascu…
#> 5 Shmi Sk… 163 NA black fair brown 72 fema… femin…
#> 6 Cordé 157 NA brown light brown NA fema… femin…
#> 7 Dormé 165 NA brown light brown NA fema… femin…
#> 8 Padmé A… 165 45 brown light brown 46 fema… femin…
#> # … with 2 more variables: homeworld <chr>, species <chr>
union(s1, s2) # 28 individuals are in the union of both sets
#> # A tibble: 28 × 11
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
#> 2 Leia Or… 150 49 brown light brown 19 fema… femin…
#> 3 Owen La… 178 120 brown, gr… light blue 52 male mascu…
#> 4 Beru Wh… 165 75 brown light blue 47 fema… femin…
#> 5 Wedge A… 170 77 brown fair hazel 21 male mascu…
#> 6 Palpati… 170 75 grey pale yellow 82 male mascu…
#> 7 Lando C… 177 79 black dark brown 31 male mascu…
#> 8 Lobot 175 79 none light blue 37 male mascu…
#> 9 Mon Mot… 150 NA auburn fair blue 48 fema… femin…
#> 10 Finis V… 170 NA blond fair blue 91 male mascu…
#> # … with 18 more rows, and 2 more variables: homeworld <chr>, species <chr>
setdiff(s1, s2) # 7 individuals in s1 and not in s2
#> # A tibble: 7 × 11
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Leia Or… 150 49 brown light brown 19 female femin…
#> 2 Wedge A… 170 77 brown fair hazel 21 male mascu…
#> 3 Lando C… 177 79 black dark brown 31 male mascu…
#> 4 Lobot 175 79 none light blue 37 male mascu…
#> 5 Mon Mot… 150 NA auburn fair blue 48 female femin…
#> 6 Finis V… 170 NA blond fair blue 91 male mascu…
#> 7 Jocasta… 167 NA white fair blue NA female femin…
#> # … with 2 more variables: homeworld <chr>, species <chr>
setdiff(s2, s1) # 13 individuals in s2 and not in s1
#> # A tibble: 13 × 11
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 C-3PO 167 75 <NA> gold yellow 112 none mascu…
#> 2 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
#> 3 Darth V… 202 136 none white yellow 41.9 male mascu…
#> 4 R5-D4 97 32 <NA> white, red red NA none mascu…
#> 5 Biggs D… 183 84 black light brown 24 male mascu…
#> 6 Anakin … 188 84 blond fair blue 41.9 male mascu…
#> 7 Jar Jar… 196 66 none orange orange 52 male mascu…
#> 8 Roos Ta… 224 82 none grey orange NA male mascu…
#> 9 Rugor N… 206 NA none green orange NA male mascu…
#> 10 Ric Olié 183 NA brown fair blue NA <NA> <NA>
#> 11 Quarsh … 183 NA black dark brown 62 <NA> <NA>
#> 12 Gregar … 185 85 black dark brown NA male mascu…
#> 13 Cliegg … 183 NA brown fair blue 82 male mascu…
#> # … with 2 more variables: homeworld <chr>, species <chr>
8.2.4 Using merge()
The base R function base::merge()
can perform all four types of mutating joins:
dplyr: | base R: |
---|---|
inner_join(x, y) |
merge(x, y) |
left_join(x, y) |
merge(x, y, all.x = TRUE) |
right_join(x, y) |
merge(x, y, all.y = TRUE) |
full_join(x, y) |
merge(x, y, all.x = TRUE, all.y = TRUE) |
Note that merge()
also accepts various by
commands (see ?merge
for documentation).
The advantages of the dplyr verbs are that:
they more clearly convey the intended join (by verbally expressing what is implied by the arguments of
merge()
);they are faster and preserve the order of cases (rows).
See Chapter 13 of the r4ds textbook (Wickham & Grolemund, 2017) for other implementations of joins (e.g., in SQL).
References
It’s not crucial to understand the following code. It only serves to select some random cases from each table and create some duplicate cases in each.↩︎