8.2 Essential join commands

As the commands for joining 2 tables x and y are implemented in the dplyr package (Wickham et al., 2019b), the 2-table verbs use the same terminology as the 1-table verbs discussed in Chapter 3 on Data transformation: Mutating commands affect the variables (columns) of a table, whereas filtering commands affect the cases or observations (rows) of a table.

In the following, we distinguish between 3 types of joins:

  1. Mutating joins combine the variables (columns) of 2 tables (Section 8.2.1);

  2. Filtering joins affect only the cases or observations (rows) of 2 tables (Section 8.2.2);

  3. Set operations on 2 tables expect both tables to have the same variables and treat them like sets that can be combined or intersected (Section&nbsp8.2.3).

Finally, we will see that the base R command merge can also be used to accomplish mutating joins (Section 8.2.4).

8.2.1 Mutating joins

Mutating joins combine the variables (columns) of 2 tables x and y. The term mutating is used in analogy to the dplyr::mutate command: Just as mutate creates new variables (columns), a mutating join adds new variables (columns) to a table x from matching cases (rows) of another table y.

Commands

There are 4 types of mutating joins to combine the cases (rows) and variables (columns) of tables x and y:

  1. left_join(x, y): Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

  2. right_join(x, y): Same as left_join(y, x), and hence redundant.

  3. full_join(x, y): Return all rows and all columns that exist in x or in y. If there are non-matching values, return NA for the ones missing.

  4. inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

In contrast to the inner_join, the first 3 joins are called outer joins. See Section 13.4 on mutating joins (Wickham & Grolemund, 2017) for graphical illustrations of these joins.

Inner vs. outer joins

All 4 types of mutating joins combine variables (columns) from 2 tables, but differ in preserving in the cases (rows) in the output table. An inner join only keeps the cases (or rows) that appear in both tables. By contrast, the first three join commands keep cases (observations) that appear in at least one of the tables and are also called outer joins. The 3 types of outer joins also differ with respect to the cases they preserve:

  1. left_join keeps all cases of x.
  2. right_join keeps all cases of y.
  3. full_join keeps all cases of x or y.

Key variables

Beyond specifying the 2 tables to join, all commands take a by argument:

  • by specifies a character vector of variables by which we want to join the tables (so-called keys).
    (See Section 13.3 on keys for the distinction between primary and foreign keys.)

  • If by = NULL (as by default) *_join will perform a natural join, using all variables with common names across the 2 tables. A message lists the variables that are used, so that we can verify that this is what we wanted.

  • To join by differently named variables on x and y, we can provide a named vector. For instance, by = c("a" = "b") will match a key variable a in table x to a key variable b in table y.

Examples

To test these commands, we load the following data files:

  • data_t1 (also available as data_t1 in the ds4psy package)
  • data_t2 (also available as data_t2 in the ds4psy package)
# Data:
data_t1 <- ds4psy::data_t1  # from ds4psy package
data_t2 <- ds4psy::data_t2  # from ds4psy package
# data_t1 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t1.csv")  # online
# data_t2 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t2.csv")  # online

# Copy data:
t1 <- data_t1
t2 <- data_t2

# Inspect data:
t1
#> # A tibble: 20 x 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 R.S.  female      2    NA
#>  2 M.Y.  male        2     4
#>  3 K.R.  male        4     1
#>  4 A.V.  female      3     2
#>  5 Z.Y.  female     NA    NA
#>  6 X.D.  female      4     4
#>  7 A.F.  female      5     1
#>  8 X.Y.  female      5     4
#>  9 K.X.  male        3     3
#> 10 M.O.  male        4     1
#> 11 T.V.  male        6     4
#> 12 X.P.  female      4     1
#> 13 Z.D.  male        5     4
#> 14 T.D.  male        5     2
#> 15 C.N.  female      4     3
#> 16 H.V.  female      1     4
#> 17 Q.N.  female      6     1
#> 18 Y.G.  male        4     4
#> 19 L.S.  male        4     4
#> 20 L.V.  female      4     2
t2
#> # A tibble: 20 x 4
#>    name  gender like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>
#>  1 X.D.  female      5     1
#>  2 T.D.  male        3    NA
#>  3 Z.D.  male        7     4
#>  4 L.S.  male        7     4
#>  5 H.V.  female      5     2
#>  6 Q.N.  female      6     2
#>  7 X.P.  female      4     1
#>  8 K.R.  male        6     1
#>  9 R.S.  female      7     3
#> 10 C.N.  female      2     2
#> 11 Z.Y.  female      7     1
#> 12 T.V.  male        4     4
#> 13 Y.G.  male        4     4
#> 14 M.O.  male       NA    NA
#> 15 M.Y.  male        6     2
#> 16 A.F.  female      4     1
#> 17 K.X.  male        5     4
#> 18 L.V.  female      2     4
#> 19 X.Y.  female      5     2
#> 20 A.V.  female      5     3

Note that t1 and t2 are both 20 x 4 tibbles and appear to contain the responses or test scores of people. This suggests that they may be 2 measurements of the same people. However, before jumping into joining the tables, we should check whether they really contain the same people:

# Do both tables contain common variables?
sum(names(t1) %in% names(t2))        # Number of common variables: 
#> [1] 2
names(t1)[names(t1) %in% names(t2)]  # Names of common variables: 
#> [1] "name"   "gender"

# Arrange the rows of tables by common variables:
t1 <- t1 %>% arrange(name, gender) 
t2 <- t2 %>% arrange(name, gender)

# Are they equal in both tables?
all.equal(t1$name, t2$name)
#> [1] TRUE
all.equal(t1$gender, t2$gender)
#> [1] TRUE

This confirms our intuition: t1 and t2 contain 2 variables of demographic information (name and gender) of 20 people and each table contains 2 additional variables (measurements or test scores).

Joining 2 tables with identical cases

When both tables contain identical cases, the 4 mutating join commands all yield the same result:

# 1. left_join:
m1 <- left_join(t1, t2)
m1
#> # A tibble: 20 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 2. right_join:
m2 <- right_join(t1, t2)
m2
#> # A tibble: 20 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 3. full_join:
m3 <- full_join(t1, t2)
m3
#> # A tibble: 20 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 4. inner_join:
m4 <- inner_join(t1, t2)
m4
#> # A tibble: 20 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# Verify equality:
all.equal(m1, m2)
#> [1] TRUE
all.equal(m1, m3)
#> [1] TRUE
all.equal(m1, m4)
#> [1] TRUE

Note that the join commands automatically used the 2 common variables as key variables to join the tables (i.e., used by = c("name", "gender")). This raises the question: What would happen if we specified only one of them? Let’s check:

left_join(t1, t2, by = "name")   # => 2 gender columns, distinguished by suffixes .x and .y
#> # A tibble: 20 x 7
#>    name  gender.x like_1 bnt_1 gender.y like_2 bnt_2
#>    <chr> <chr>     <dbl> <dbl> <chr>     <dbl> <dbl>
#>  1 A.F.  female        5     1 female        4     1
#>  2 A.V.  female        3     2 female        5     3
#>  3 C.N.  female        4     3 female        2     2
#>  4 H.V.  female        1     4 female        5     2
#>  5 K.R.  male          4     1 male          6     1
#>  6 K.X.  male          3     3 male          5     4
#>  7 L.S.  male          4     4 male          7     4
#>  8 L.V.  female        4     2 female        2     4
#>  9 M.O.  male          4     1 male         NA    NA
#> 10 M.Y.  male          2     4 male          6     2
#> 11 Q.N.  female        6     1 female        6     2
#> 12 R.S.  female        2    NA female        7     3
#> 13 T.D.  male          5     2 male          3    NA
#> 14 T.V.  male          6     4 male          4     4
#> 15 X.D.  female        4     4 female        5     1
#> 16 X.P.  female        4     1 female        4     1
#> 17 X.Y.  female        5     4 female        5     2
#> 18 Y.G.  male          4     4 male          4     4
#> 19 Z.D.  male          5     4 male          7     4
#> 20 Z.Y.  female       NA    NA female        7     1
left_join(t1, t2, by = "gender") # => 2 name columns, distinguished by suffixes .x and .y
#> # A tibble: 202 x 7
#>    name.x gender like_1 bnt_1 name.y like_2 bnt_2
#>    <chr>  <chr>   <dbl> <dbl> <chr>   <dbl> <dbl>
#>  1 A.F.   female      5     1 A.F.        4     1
#>  2 A.F.   female      5     1 A.V.        5     3
#>  3 A.F.   female      5     1 C.N.        2     2
#>  4 A.F.   female      5     1 H.V.        5     2
#>  5 A.F.   female      5     1 L.V.        2     4
#>  6 A.F.   female      5     1 Q.N.        6     2
#>  7 A.F.   female      5     1 R.S.        7     3
#>  8 A.F.   female      5     1 X.D.        5     1
#>  9 A.F.   female      5     1 X.P.        4     1
#> 10 A.F.   female      5     1 X.Y.        5     2
#> # … with 192 more rows

# To control the suffix:
left_join(t1, t2, by = "name", suffix = c("_1", "_2"))
#> # A tibble: 20 x 7
#>    name  gender_1 like_1 bnt_1 gender_2 like_2 bnt_2
#>    <chr> <chr>     <dbl> <dbl> <chr>     <dbl> <dbl>
#>  1 A.F.  female        5     1 female        4     1
#>  2 A.V.  female        3     2 female        5     3
#>  3 C.N.  female        4     3 female        2     2
#>  4 H.V.  female        1     4 female        5     2
#>  5 K.R.  male          4     1 male          6     1
#>  6 K.X.  male          3     3 male          5     4
#>  7 L.S.  male          4     4 male          7     4
#>  8 L.V.  female        4     2 female        2     4
#>  9 M.O.  male          4     1 male         NA    NA
#> 10 M.Y.  male          2     4 male          6     2
#> 11 Q.N.  female        6     1 female        6     2
#> 12 R.S.  female        2    NA female        7     3
#> 13 T.D.  male          5     2 male          3    NA
#> 14 T.V.  male          6     4 male          4     4
#> 15 X.D.  female        4     4 female        5     1
#> 16 X.P.  female        4     1 female        4     1
#> 17 X.Y.  female        5     4 female        5     2
#> 18 Y.G.  male          4     4 male          4     4
#> 19 Z.D.  male          5     4 male          7     4
#> 20 Z.Y.  female       NA    NA female        7     1

We see that common variables not used for joining the tables get duplicated in the joint table and distinguished by a unique suffix.

Joining 2 tables with different cases

To see and appreciate the differences between the 4 mutating joins, we need tables that contain different and repeated cases. We can easily create these by selecting and duplicating some cases in each of our tables:33

# Create t3 and t4: ------ 
t3 <- data_t1  # copy
t4 <- data_t2  # copy 

# Modify both tables: ------ 
# (a) Draw n random rows (samples):
set.seed(12)  # for replicability
n <- 8
s1 <- sample(1:nrow(t3), size = n, replace = FALSE)  # n random values from 1 to nrow(t3)
s2 <- sample(1:nrow(t4), size = n, replace = FALSE)  # n random values from 1 to nrow(t4)

# Reduce rows and arrange both tables:
t3 <- t3[s1, ]
t4 <- t4[s2, ]

# (b) Duplicate m random rows:
set.seed(23)  # for replicability
m <- 2
d1 <- sample(1:nrow(t3), size = m, replace = FALSE)  # n random values from 1 to nrow(t3)
d2 <- sample(1:nrow(t4), size = m, replace = FALSE)  # n random values from 1 to nrow(t4)

# Duplicate sampled rows:
t3 <- rbind(t3, t3[d1, ])
t4 <- rbind(t4, t4[d2, ])

# Arrange tables:
t3 <- t3 %>% arrange(name, gender)
t4 <- t4 %>% arrange(name, gender)

The resulting tables t3 and t4 are both 10 x 4 tibbles and contain some common and some unique cases (people). In addition, each table contains 2 people twice:

# Inspect resulting tables: ------ 
t3  # 10 x 4
#> # A tibble: 10 x 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 H.V.  female      1     4
#>  2 L.V.  female      4     2
#>  3 M.Y.  male        2     4
#>  4 Q.N.  female      6     1
#>  5 Q.N.  female      6     1
#>  6 T.D.  male        5     2
#>  7 X.P.  female      4     1
#>  8 X.Y.  female      5     4
#>  9 Z.Y.  female     NA    NA
#> 10 Z.Y.  female     NA    NA
t4  # 10 x 4
#> # A tibble: 10 x 4
#>    name  gender like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>
#>  1 A.F.  female      4     1
#>  2 A.V.  female      5     3
#>  3 C.N.  female      2     2
#>  4 L.V.  female      2     4
#>  5 Q.N.  female      6     2
#>  6 Q.N.  female      6     2
#>  7 T.V.  male        4     4
#>  8 X.P.  female      4     1
#>  9 Y.G.  male        4     4
#> 10 Y.G.  male        4     4

# Common cases:
t3$name[t3$name %in% t4$name]  # people in t3 that also appear in t4
#> [1] "L.V." "Q.N." "Q.N." "X.P."
t4$name[t4$name %in% t3$name]  # people in t4 that also appear in t3
#> [1] "L.V." "Q.N." "Q.N." "X.P."

# Unique cases:
t3$name[!t3$name %in% t4$name]  # people in t3 that are NOT in t4
#> [1] "H.V." "M.Y." "T.D." "X.Y." "Z.Y." "Z.Y."
t4$name[!t4$name %in% t3$name]  # people in t4 that are NOT in t3
#> [1] "A.F." "A.V." "C.N." "T.V." "Y.G." "Y.G."

Now let’s re-do the 4 mutating joins with t3 and t4 and inspect the resulting tables:

# 1. left_join:
m5 <- left_join(t3, t4)
m5  # => 12 x 6 table, containing all cases of t3, plus additional cases for duplicate cases.
#> # A tibble: 12 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 H.V.  female      1     4     NA    NA
#>  2 L.V.  female      4     2      2     4
#>  3 M.Y.  male        2     4     NA    NA
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 T.D.  male        5     2     NA    NA
#>  9 X.P.  female      4     1      4     1
#> 10 X.Y.  female      5     4     NA    NA
#> 11 Z.Y.  female     NA    NA     NA    NA
#> 12 Z.Y.  female     NA    NA     NA    NA

# 2. right_join:
m6 <- right_join(t3, t4)
m6  # => 12 x 6 table, containing all cases of t4, plus additional cases for duplicate cases.
#> # A tibble: 12 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female     NA    NA      4     1
#>  2 A.V.  female     NA    NA      5     3
#>  3 C.N.  female     NA    NA      2     2
#>  4 L.V.  female      4     2      2     4
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 Q.N.  female      6     1      6     2
#>  9 T.V.  male       NA    NA      4     4
#> 10 X.P.  female      4     1      4     1
#> 11 Y.G.  male       NA    NA      4     4
#> 12 Y.G.  male       NA    NA      4     4

all.equal(m5, m6)  # shows differences, but 
#> [1] "Rows in x but not y: 11, 10, 8, 3, 1. Rows in y but not x: 11, 9, 3, 2, 1. "
all.equal(m5, right_join(t4, t3)) # is TRUE
#> [1] TRUE

# 3. full_join:
m7 <- full_join(t3, t4)
m7  # => 17 x 6 table, i.e. all cases of t3 and of t4
#> # A tibble: 18 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 H.V.  female      1     4     NA    NA
#>  2 L.V.  female      4     2      2     4
#>  3 M.Y.  male        2     4     NA    NA
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 T.D.  male        5     2     NA    NA
#>  9 X.P.  female      4     1      4     1
#> 10 X.Y.  female      5     4     NA    NA
#> 11 Z.Y.  female     NA    NA     NA    NA
#> 12 Z.Y.  female     NA    NA     NA    NA
#> 13 A.F.  female     NA    NA      4     1
#> 14 A.V.  female     NA    NA      5     3
#> 15 C.N.  female     NA    NA      2     2
#> 16 T.V.  male       NA    NA      4     4
#> 17 Y.G.  male       NA    NA      4     4
#> 18 Y.G.  male       NA    NA      4     4

# 4. inner_join:
m8 <- inner_join(t3, t4)
m8  # => 7 x 6 table, i.e. all cases that were in both t3 and t4
#> # A tibble: 6 x 6
#>   name  gender like_1 bnt_1 like_2 bnt_2
#>   <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#> 1 L.V.  female      4     2      2     4
#> 2 Q.N.  female      6     1      6     2
#> 3 Q.N.  female      6     1      6     2
#> 4 Q.N.  female      6     1      6     2
#> 5 Q.N.  female      6     1      6     2
#> 6 X.P.  female      4     1      4     1

Thus, when joining 2 tables with different cases, it really matters which join-command we’re using.

8.2.2 Filtering joins

Filtering joins match cases (or observations) in the same way as mutating joins, but only affect the cases (observations or rows), not the variables (columns). There are 2 types of filtering joins:

  1. semi_join(x, y) keeps all observations in x that have a match in y (i.e., observations in x without a match in y are dropped).

  2. anti_join(x, y) drops all observations in x that have a match in y (i.e., observations in x without a match in y are kept).

See Section 13.5 on filtering joins (Wickham & Grolemund, 2017) for graphical illustrations.

Notes

  • anti_join(x, y) yields the complement of semi_join(x, y) (with respect to x).

  • Both types of filtering joins concern only the cases of x and filter them by the cases of y.

  • The resulting table of a filtering join preserves the variables of x, not the variables of y.

Examples

semi_join(t3, t4) # keeps cases in t3 that are also in t4
#> # A tibble: 4 x 4
#>   name  gender like_1 bnt_1
#>   <chr> <chr>   <dbl> <dbl>
#> 1 L.V.  female      4     2
#> 2 Q.N.  female      6     1
#> 3 Q.N.  female      6     1
#> 4 X.P.  female      4     1

anti_join(t3, t4) # keeps cases in t3 that are NOT in t4
#> # A tibble: 6 x 4
#>   name  gender like_1 bnt_1
#>   <chr> <chr>   <dbl> <dbl>
#> 1 H.V.  female      1     4
#> 2 M.Y.  male        2     4
#> 3 T.D.  male        5     2
#> 4 X.Y.  female      5     4
#> 5 Z.Y.  female     NA    NA
#> 6 Z.Y.  female     NA    NA

# Note: 
full_join(semi_join(t3, t4), anti_join(t3, t4))
#> # A tibble: 10 x 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 L.V.  female      4     2
#>  2 Q.N.  female      6     1
#>  3 Q.N.  female      6     1
#>  4 X.P.  female      4     1
#>  5 H.V.  female      1     4
#>  6 M.Y.  male        2     4
#>  7 T.D.  male        5     2
#>  8 X.Y.  female      5     4
#>  9 Z.Y.  female     NA    NA
#> 10 Z.Y.  female     NA    NA
# contains the same cases as t3.

Use cases

Filtering joins typically reduce the cases of x to the cases contained in y (semi_join) or to the cases not contained in y (anti_join). Typical use cases include:

  • Semi-joins are useful for matching filtered summary tables back to the original rows:
## Data: 
sw <- dplyr::starwars
# sw  # 87 x 13

# Determine the top 3 homeworlds of Humans:
top_3_human_homes <- sw %>% 
  filter(species == "Human") %>%
  group_by(homeworld) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(3)
top_3_human_homes
#> # A tibble: 3 x 2
#> # Groups:   homeworld [3]
#>   homeworld     n
#>   <chr>     <int>
#> 1 Tatooine      8
#> 2 Naboo         5
#> 3 <NA>          5

# Which individuals (human and non-human) are from these homeworlds?
semi_join(sw, top_3_human_homes, by = "homeworld")  # => 31 individuals
#> # A tibble: 31 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
#>  4 Dart…    202   136 none       white      yellow          41.9 male  
#>  5 Owen…    178   120 brown, gr… light      blue            52   male  
#>  6 Beru…    165    75 brown      light      blue            47   female
#>  7 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
#>  8 Bigg…    183    84 black      light      brown           24   male  
#>  9 Anak…    188    84 blond      fair       blue            41.9 male  
#> 10 Yoda      66    17 white      green      brown          896   male  
#> # … with 21 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
  • Anti-joins are useful for diagnosing join mismatches:
left_join(t3, t4)  # adds variables (columns) of t4 to cases (rows) of t3
#> # A tibble: 12 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 H.V.  female      1     4     NA    NA
#>  2 L.V.  female      4     2      2     4
#>  3 M.Y.  male        2     4     NA    NA
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 T.D.  male        5     2     NA    NA
#>  9 X.P.  female      4     1      4     1
#> 10 X.Y.  female      5     4     NA    NA
#> 11 Z.Y.  female     NA    NA     NA    NA
#> 12 Z.Y.  female     NA    NA     NA    NA

anti_join(t3, t4)  # shows cases in t3 NOT in t4 (with NA values in new variables)
#> # A tibble: 6 x 4
#>   name  gender like_1 bnt_1
#>   <chr> <chr>   <dbl> <dbl>
#> 1 H.V.  female      1     4
#> 2 M.Y.  male        2     4
#> 3 T.D.  male        5     2
#> 4 X.Y.  female      5     4
#> 5 Z.Y.  female     NA    NA
#> 6 Z.Y.  female     NA    NA
anti_join(t4, t3)  # shows cases in t4 NOT in t3 (missing from joined table)
#> # A tibble: 6 x 4
#>   name  gender like_2 bnt_2
#>   <chr> <chr>   <dbl> <dbl>
#> 1 A.F.  female      4     1
#> 2 A.V.  female      5     3
#> 3 C.N.  female      2     2
#> 4 T.V.  male        4     4
#> 5 Y.G.  male        4     4
#> 6 Y.G.  male        4     4

See Section 13.6 for common join problems (Wickham & Grolemund, 2017) and ways of dealing with them.

8.2.3 Set operations

Set operations on 2 tables x and y expect both tables to have the same variables. They always compare complete rows of x and y, and compare the values of every variable. Treating the observations like sets yields 3 different commands:

  1. intersect(x, y) returns only observations that occur in both x and in y.

  2. union(x, y) returns all unique observations in x and y.

  3. setdiff(x, y) return those observations that occur in x, but not in y.

Examples

# Data: 
sw <- dplyr::starwars %>%
  select(name:species)  # only first 10 columns of starwars
sw  # 87 x 10
#> # A tibble: 87 x 10
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
#>  4 Dart…    202   136 none       white      yellow          41.9 male  
#>  5 Leia…    150    49 brown      light      brown           19   female
#>  6 Owen…    178   120 brown, gr… light      blue            52   male  
#>  7 Beru…    165    75 brown      light      blue            47   female
#>  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
#>  9 Bigg…    183    84 black      light      brown           24   male  
#> 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
#> # … with 77 more rows, and 2 more variables: homeworld <chr>, species <chr>

# Create 2 filtered sub-sets:
s1 <- sw %>% filter(species == "Human", height < 180)
s1  # 15 indivduals
#> # A tibble: 15 x 10
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue              19 male  
#>  2 Leia…    150    49 brown      light      brown             19 female
#>  3 Owen…    178   120 brown, gr… light      blue              52 male  
#>  4 Beru…    165    75 brown      light      blue              47 female
#>  5 Wedg…    170    77 brown      fair       hazel             21 male  
#>  6 Palp…    170    75 grey       pale       yellow            82 male  
#>  7 Land…    177    79 black      dark       brown             31 male  
#>  8 Lobot    175    79 none       light      blue              37 male  
#>  9 Mon …    150    NA auburn     fair       blue              48 female
#> 10 Fini…    170    NA blond      fair       blue              91 male  
#> 11 Shmi…    163    NA black      fair       brown             72 female
#> 12 Cordé    157    NA brown      light      brown             NA female
#> 13 Dormé    165    NA brown      light      brown             NA female
#> 14 Joca…    167    NA white      fair       blue              NA female
#> 15 Padm…    165    45 brown      light      brown             46 female
#> # … with 2 more variables: homeworld <chr>, species <chr>

s2 <- sw %>% filter(homeworld == "Naboo" | homeworld == "Tatooine")
s2  # 21 individuals
#> # A tibble: 21 x 10
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
#>  4 Dart…    202   136 none       white      yellow          41.9 male  
#>  5 Owen…    178   120 brown, gr… light      blue            52   male  
#>  6 Beru…    165    75 brown      light      blue            47   female
#>  7 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
#>  8 Bigg…    183    84 black      light      brown           24   male  
#>  9 Anak…    188    84 blond      fair       blue            41.9 male  
#> 10 Palp…    170    75 grey       pale       yellow          82   male  
#> # … with 11 more rows, and 2 more variables: homeworld <chr>, species <chr>

# Set operations:
intersect(s1, s2)  #  8 individuals are in both s1 and s2
#> # A tibble: 0 x 0
union(s1, s2)      # 28 individuals are in the union of both sets
#> [[1]]
#>  [1] "Luke Skywalker"     "Leia Organa"        "Owen Lars"         
#>  [4] "Beru Whitesun lars" "Wedge Antilles"     "Palpatine"         
#>  [7] "Lando Calrissian"   "Lobot"              "Mon Mothma"        
#> [10] "Finis Valorum"      "Shmi Skywalker"     "Cordé"             
#> [13] "Dormé"              "Jocasta Nu"         "Padmé Amidala"     
#> 
#> [[2]]
#>  [1] 172 150 178 165 170 170 177 175 150 170 163 157 165 167 165
#> 
#> [[3]]
#>  [1]  77  49 120  75  77  75  79  79  NA  NA  NA  NA  NA  NA  45
#> 
#> [[4]]
#>  [1] "blond"       "brown"       "brown, grey" "brown"       "brown"      
#>  [6] "grey"        "black"       "none"        "auburn"      "blond"      
#> [11] "black"       "brown"       "brown"       "white"       "brown"      
#> 
#> [[5]]
#>  [1] "fair"  "light" "light" "light" "fair"  "pale"  "dark"  "light" "fair" 
#> [10] "fair"  "fair"  "light" "light" "fair"  "light"
#> 
#> [[6]]
#>  [1] "blue"   "brown"  "blue"   "blue"   "hazel"  "yellow" "brown"  "blue"  
#>  [9] "blue"   "blue"   "brown"  "brown"  "brown"  "blue"   "brown" 
#> 
#> [[7]]
#>  [1] 19 19 52 47 21 82 31 37 48 91 72 NA NA NA 46
#> 
#> [[8]]
#>  [1] "male"   "female" "male"   "female" "male"   "male"   "male"   "male"  
#>  [9] "female" "male"   "female" "female" "female" "female" "female"
#> 
#> [[9]]
#>  [1] "Tatooine"  "Alderaan"  "Tatooine"  "Tatooine"  "Corellia"  "Naboo"    
#>  [7] "Socorro"   "Bespin"    "Chandrila" "Coruscant" "Tatooine"  "Naboo"    
#> [13] "Naboo"     "Coruscant" "Naboo"    
#> 
#> [[10]]
#>  [1] "Human" "Human" "Human" "Human" "Human" "Human" "Human" "Human" "Human"
#> [10] "Human" "Human" "Human" "Human" "Human" "Human"
#> 
#> [[11]]
#>  [1] "Luke Skywalker"     "C-3PO"              "R2-D2"             
#>  [4] "Darth Vader"        "Owen Lars"          "Beru Whitesun lars"
#>  [7] "R5-D4"              "Biggs Darklighter"  "Anakin Skywalker"  
#> [10] "Palpatine"          "Jar Jar Binks"      "Roos Tarpals"      
#> [13] "Rugor Nass"         "Ric Olié"           "Quarsh Panaka"     
#> [16] "Shmi Skywalker"     "Gregar Typho"       "Cordé"             
#> [19] "Cliegg Lars"        "Dormé"              "Padmé Amidala"     
#> 
#> [[12]]
#>  [1] 172 167  96 202 178 165  97 183 188 170 196 224 206 183 183 163 185 157 183
#> [20] 165 165
#> 
#> [[13]]
#>  [1]  77  75  32 136 120  75  32  84  84  75  66  82  NA  NA  NA  NA  85  NA  NA
#> [20]  NA  45
#> 
#> [[14]]
#>  [1] "blond"       NA            NA            "none"        "brown, grey"
#>  [6] "brown"       NA            "black"       "blond"       "grey"       
#> [11] "none"        "none"        "none"        "brown"       "black"      
#> [16] "black"       "black"       "brown"       "brown"       "brown"      
#> [21] "brown"      
#> 
#> [[15]]
#>  [1] "fair"        "gold"        "white, blue" "white"       "light"      
#>  [6] "light"       "white, red"  "light"       "fair"        "pale"       
#> [11] "orange"      "grey"        "green"       "fair"        "dark"       
#> [16] "fair"        "dark"        "light"       "fair"        "light"      
#> [21] "light"      
#> 
#> [[16]]
#>  [1] "blue"   "yellow" "red"    "yellow" "blue"   "blue"   "red"    "brown" 
#>  [9] "blue"   "yellow" "orange" "orange" "orange" "blue"   "brown"  "brown" 
#> [17] "brown"  "brown"  "blue"   "brown"  "brown" 
#> 
#> [[17]]
#>  [1]  19.0 112.0  33.0  41.9  52.0  47.0    NA  24.0  41.9  82.0  52.0    NA
#> [13]    NA    NA  62.0  72.0    NA    NA  82.0    NA  46.0
#> 
#> [[18]]
#>  [1] "male"   NA       NA       "male"   "male"   "female" NA       "male"  
#>  [9] "male"   "male"   "male"   "male"   "male"   "male"   "male"   "female"
#> [17] "male"   "female" "male"   "female" "female"
#> 
#> [[19]]
#>  [1] "Tatooine" "Tatooine" "Naboo"    "Tatooine" "Tatooine" "Tatooine"
#>  [7] "Tatooine" "Tatooine" "Tatooine" "Naboo"    "Naboo"    "Naboo"   
#> [13] "Naboo"    "Naboo"    "Naboo"    "Tatooine" "Naboo"    "Naboo"   
#> [19] "Tatooine" "Naboo"    "Naboo"   
#> 
#> [[20]]
#>  [1] "Human"  "Droid"  "Droid"  "Human"  "Human"  "Human"  "Droid"  "Human" 
#>  [9] "Human"  "Human"  "Gungan" "Gungan" "Gungan" NA       NA       "Human" 
#> [17] "Human"  "Human"  "Human"  "Human"  "Human"
setdiff(s1, s2)    #  7 individuals in s1 and not in s2
#> # A tibble: 15 x 10
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue              19 male  
#>  2 Leia…    150    49 brown      light      brown             19 female
#>  3 Owen…    178   120 brown, gr… light      blue              52 male  
#>  4 Beru…    165    75 brown      light      blue              47 female
#>  5 Wedg…    170    77 brown      fair       hazel             21 male  
#>  6 Palp…    170    75 grey       pale       yellow            82 male  
#>  7 Land…    177    79 black      dark       brown             31 male  
#>  8 Lobot    175    79 none       light      blue              37 male  
#>  9 Mon …    150    NA auburn     fair       blue              48 female
#> 10 Fini…    170    NA blond      fair       blue              91 male  
#> 11 Shmi…    163    NA black      fair       brown             72 female
#> 12 Cordé    157    NA brown      light      brown             NA female
#> 13 Dormé    165    NA brown      light      brown             NA female
#> 14 Joca…    167    NA white      fair       blue              NA female
#> 15 Padm…    165    45 brown      light      brown             46 female
#> # … with 2 more variables: homeworld <chr>, species <chr>
setdiff(s2, s1)    # 13 individuals in s2 and not in s1
#> # A tibble: 21 x 10
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
#>  4 Dart…    202   136 none       white      yellow          41.9 male  
#>  5 Owen…    178   120 brown, gr… light      blue            52   male  
#>  6 Beru…    165    75 brown      light      blue            47   female
#>  7 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
#>  8 Bigg…    183    84 black      light      brown           24   male  
#>  9 Anak…    188    84 blond      fair       blue            41.9 male  
#> 10 Palp…    170    75 grey       pale       yellow          82   male  
#> # … with 11 more rows, and 2 more variables: homeworld <chr>, species <chr>

8.2.4 Using merge

The base R command base::merge() can perform all 4 types of mutating join:

dplyr: base R merge:
inner_join(x, y) merge(x, y)
left_join(x, y) merge(x, y, all.x = TRUE)
right_join(x, y) merge(x, y, all.y = TRUE)
full_join(x, y) merge(x, y, all.x = TRUE, all.y = TRUE)

Note that merge also accepts various by commands (see ?merge for documentation).

The advantages of the dplyr verbs are that:

  1. they more clearly convey the intended join (by verbally expressing what is implied by the arguments of merge);

  2. they are faster and preserve the order of cases (rows).

(See Section 13.4.7 for other implementations (Wickham & Grolemund, 2017) of joins, e.g., in SQL.)

References

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz

Wickham, H., François, R., Henry, L., & Müller, K. (2019b). dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr


  1. It’s not crucial to understand the following code. It only serves to select some random cases from each table and create some duplicate cases in each.