8.2 Essential join commands

As the functions for joining two tables x and y are implemented in the dplyr package (Wickham, François, Henry, & Müller, 2022), the two-table verbs use the same terminology as the one-table verbs discussed in Chapter 3 on Data transformation: Mutating commands affect the variables (columns) of a table, whereas filtering commands affect the cases or observations (rows) of a table.

In the following, we distinguish between three types of joins:

  1. Mutating joins combine the variables (columns) of two tables (Section 8.2.1);

  2. Filtering joins affect only the cases or observations (rows) of two tables (Section 8.2.2);

  3. Set operations on two tables expect both tables to have the same variables and treat them like sets that can be combined or intersected (Section 8.2.3).

Finally, we will see that the merge() function from base R can also be used to accomplish mutating joins (Section 8.2.4).

8.2.1 Mutating joins

Mutating joins combine the variables (columns) of two tables x and y. The term mutating is used in analogy to the dplyr::mutate() command: Just as mutate() creates new variables (columns), a mutating join adds new variables (columns) to a table x from matching cases (rows) of another table y.

Key functions

There are four types of mutating joins to combine the cases (rows) and variables (columns) of tables x and y:

  1. left_join(x, y): Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

  2. right_join(x, y): Same as left_join(y, x), and hence redundant.

  3. full_join(x, y): Return all rows and all columns that exist in x or in y. If there are non-matching values, return NA for the ones missing.

  4. inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

In contrast to an inner_join(x, y), the first three joins are called outer joins. See Section 13.4 on mutating joins (Wickham & Grolemund, 2017) for graphical illustrations of these joins.

Inner vs. outer joins

All four types of mutating joins combine variables (columns) from two tables, but differ in preserving in the cases (rows) in the output table. An inner join inner_join(x, y) keeps only the cases (or rows) that appear in both tables (i.e., x and y). By contrast, the first three join commands keep cases (observations) that appear in at least one of the tables and are also called outer joins. The three types of outer joins also differ with respect to the cases they preserve:

  1. left_join(x, y) keeps all cases of x.
  2. right_join(x, y) keeps all cases of y.
  3. full_join(x, y) keeps all cases of x or y.

Terminology of key variables

Beyond specifying the two tables to join, all *_join() commands take a by argument: by specifies a character vector of variables by which we want to join the tables (so-called keys).

In the r4ds textbook (Wickham & Grolemund, 2017), Section 13.3 on keys makes a distinction between primary and foreign keys, that stems from database terminology. These terms can be confusing, though the distinction behind it is actually quite simple:

  • A key is simply a variable that occurs in two distinct tables (x and y) and by which we want to join both tables.

  • Importantly, the variable of interest does not always need to have the same name in both tables. For instance, a variable may be called surname in x and family_name in y, but both these variables actually have the same referent. (i.e., the name of a specific person). The distinction between primary vs. secondary now depends on our perspective:

    • When looking from table x at table y, surname is a primary key and family_name a secondary key.

    • When looking from table y at table x, family_name is a primary key and surname a secondary key.

  • The primary key (in table x) and the corresponding foreign key in another table (y) form a so-called relation. Such relations are typically one-to-many. For instance, each name or id variable in one table corresponds to a particular person, but each person can contribute data to several measurement occasions in another table (see, e.g., the variables id and intervention in the table posPsy_p_info vs. in the table posPsy_long of the ds4psy package). A special case occurs when there is a one-to-one relationship (see, e.g., the variables id and intervention in the tables posPsy_p_info and posPsy_wide of the ds4psy package).

Actually, the distinction between two different types of keys is irrelevant, as long as we know how we want to join the two tables (e.g., which table is our primary table) and which relation(s) exist between both tables (i.e., which variables occur in both tables).

The by argument of any *_join() command can either be NULL or a named vector:

  • If by = NULL (as by default) *_join will perform a natural join, using all variables with common names across the two tables. A message lists the variables that are used, so that we can verify that this is what we wanted.

  • To join by differently named variables on x and y, we can provide a named vector. For instance, by = c("a" = "b") will match a (primary) key variable a in table x to a (secondary) key variable b in table y.

Examples of joining tables

To test these commands, we load the following data files:

  • data_t1.csv, available as data_t1 in the ds4psy package
  • data_t2.csv, available as data_t2 in the ds4psy package
# Data:
data_t1 <- ds4psy::data_t1  # from ds4psy package
data_t2 <- ds4psy::data_t2  # from ds4psy package

# Copy data:
t1 <- data_t1
t2 <- data_t2

# Inspect data:
head(t1)
#> # A tibble: 6 × 4
#>   name  gender like_1 bnt_1
#>   <chr> <chr>   <dbl> <dbl>
#> 1 R.S.  female      2    NA
#> 2 M.Y.  male        2     4
#> 3 K.R.  male        4     1
#> 4 A.V.  female      3     2
#> 5 Z.Y.  female     NA    NA
#> 6 X.D.  female      4     4
head(t2)
#> # A tibble: 6 × 4
#>   name  gender like_2 bnt_2
#>   <chr> <chr>   <dbl> <dbl>
#> 1 X.D.  female      5     1
#> 2 T.D.  male        3    NA
#> 3 Z.D.  male        7     4
#> 4 L.S.  male        7     4
#> 5 H.V.  female      5     2
#> 6 Q.N.  female      6     2

Note that the data in t1 and t2 are both 20 x 4 tibbles and appear to contain the responses or test scores of people. This could suggest that they may be two measurements of the same people. However, before jumping into joining these tables, we should check whether they really contain the same people:

# Do both tables contain common variables?
sum(names(t1) %in% names(t2))        # Number of common variables: 
#> [1] 2
names(t1)[names(t1) %in% names(t2)]  # Names of common variables: 
#> [1] "name"   "gender"

# Arrange the rows of tables by common variables:
t1 <- t1 %>% arrange(name, gender) 
t2 <- t2 %>% arrange(name, gender)

# Are they equal in both tables?
all.equal(t1$name, t2$name)
#> [1] TRUE
all.equal(t1$gender, t2$gender)
#> [1] TRUE

This confirms our intuition: t1 and t2 contain two variables of demographic information (name and gender) of 20 people and each table contains two additional variables (measurements or test scores).

Joining two tables with identical cases

When both tables contain identical cases, the four mutating join commands all yield the same result:

# 1. left_join:
m1 <- left_join(t1, t2)
m1
#> # A tibble: 20 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 2. right_join:
m2 <- right_join(t1, t2)
m2
#> # A tibble: 20 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 3. full_join:
m3 <- full_join(t1, t2)
m3
#> # A tibble: 20 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 4. inner_join:
m4 <- inner_join(t1, t2)
m4
#> # A tibble: 20 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# Verify equality:
all.equal(m1, m2)
#> [1] TRUE
all.equal(m1, m3)
#> [1] TRUE
all.equal(m1, m4)
#> [1] TRUE

Note that the join commands automatically used the two common variables as key variables to join the tables (i.e., used by = c("name", "gender") by default). This raises the question: What would happen if we specified only one of these variables in the by argument? Let’s check to find out:

left_join(t1, t2, by = "name")   # => 2 gender columns, distinguished by suffixes .x and .y
#> # A tibble: 20 × 7
#>    name  gender.x like_1 bnt_1 gender.y like_2 bnt_2
#>    <chr> <chr>     <dbl> <dbl> <chr>     <dbl> <dbl>
#>  1 A.F.  female        5     1 female        4     1
#>  2 A.V.  female        3     2 female        5     3
#>  3 C.N.  female        4     3 female        2     2
#>  4 H.V.  female        1     4 female        5     2
#>  5 K.R.  male          4     1 male          6     1
#>  6 K.X.  male          3     3 male          5     4
#>  7 L.S.  male          4     4 male          7     4
#>  8 L.V.  female        4     2 female        2     4
#>  9 M.O.  male          4     1 male         NA    NA
#> 10 M.Y.  male          2     4 male          6     2
#> 11 Q.N.  female        6     1 female        6     2
#> 12 R.S.  female        2    NA female        7     3
#> 13 T.D.  male          5     2 male          3    NA
#> 14 T.V.  male          6     4 male          4     4
#> 15 X.D.  female        4     4 female        5     1
#> 16 X.P.  female        4     1 female        4     1
#> 17 X.Y.  female        5     4 female        5     2
#> 18 Y.G.  male          4     4 male          4     4
#> 19 Z.D.  male          5     4 male          7     4
#> 20 Z.Y.  female       NA    NA female        7     1
left_join(t1, t2, by = "gender") # => 2 name columns, distinguished by suffixes .x and .y
#> # A tibble: 202 × 7
#>    name.x gender like_1 bnt_1 name.y like_2 bnt_2
#>    <chr>  <chr>   <dbl> <dbl> <chr>   <dbl> <dbl>
#>  1 A.F.   female      5     1 A.F.        4     1
#>  2 A.F.   female      5     1 A.V.        5     3
#>  3 A.F.   female      5     1 C.N.        2     2
#>  4 A.F.   female      5     1 H.V.        5     2
#>  5 A.F.   female      5     1 L.V.        2     4
#>  6 A.F.   female      5     1 Q.N.        6     2
#>  7 A.F.   female      5     1 R.S.        7     3
#>  8 A.F.   female      5     1 X.D.        5     1
#>  9 A.F.   female      5     1 X.P.        4     1
#> 10 A.F.   female      5     1 X.Y.        5     2
#> # … with 192 more rows

We see that common variables not used for joining the tables get duplicated in the joint table and distinguished by a unique suffix. We can control the suffix of duplicate variables by providing a suffix argument:

# To control the suffix of duplicate variables:
left_join(t1, t2, by = "name", suffix = c("_001", "_002"))
#> # A tibble: 20 × 7
#>    name  gender_001 like_1 bnt_1 gender_002 like_2 bnt_2
#>    <chr> <chr>       <dbl> <dbl> <chr>       <dbl> <dbl>
#>  1 A.F.  female          5     1 female          4     1
#>  2 A.V.  female          3     2 female          5     3
#>  3 C.N.  female          4     3 female          2     2
#>  4 H.V.  female          1     4 female          5     2
#>  5 K.R.  male            4     1 male            6     1
#>  6 K.X.  male            3     3 male            5     4
#>  7 L.S.  male            4     4 male            7     4
#>  8 L.V.  female          4     2 female          2     4
#>  9 M.O.  male            4     1 male           NA    NA
#> 10 M.Y.  male            2     4 male            6     2
#> 11 Q.N.  female          6     1 female          6     2
#> 12 R.S.  female          2    NA female          7     3
#> 13 T.D.  male            5     2 male            3    NA
#> 14 T.V.  male            6     4 male            4     4
#> 15 X.D.  female          4     4 female          5     1
#> 16 X.P.  female          4     1 female          4     1
#> 17 X.Y.  female          5     4 female          5     2
#> 18 Y.G.  male            4     4 male            4     4
#> 19 Z.D.  male            5     4 male            7     4
#> 20 Z.Y.  female         NA    NA female          7     1

Joining two tables with different cases

To see and appreciate the differences between the four mutating joins, we need tables that contain different and repeated cases. We can easily create these by selecting and duplicating some cases in each of our tables:51

# Create t3 and t4: ------ 
t3 <- data_t1  # copy
t4 <- data_t2  # copy 

# Modify both tables: ------ 
# (a) Draw n random rows (samples):
set.seed(12)  # for replicability
n <- 8
s1 <- sample(1:nrow(t3), size = n, replace = FALSE)  # n random values from 1 to nrow(t3)
s2 <- sample(1:nrow(t4), size = n, replace = FALSE)  # n random values from 1 to nrow(t4)

# Reduce rows and arrange both tables:
t3 <- t3[s1, ]
t4 <- t4[s2, ]

# (b) Duplicate m random rows:
set.seed(23)  # for replicability
m <- 2
d1 <- sample(1:nrow(t3), size = m, replace = FALSE)  # n random values from 1 to nrow(t3)
d2 <- sample(1:nrow(t4), size = m, replace = FALSE)  # n random values from 1 to nrow(t4)

# Duplicate sampled rows:
t3 <- rbind(t3, t3[d1, ])
t4 <- rbind(t4, t4[d2, ])

# Arrange tables:
t3 <- t3 %>% arrange(name, gender)
t4 <- t4 %>% arrange(name, gender)

The resulting tables t3 and t4 are both 10 x 4 tibbles and contain some common and some unique cases (people). In addition, each table contains two people twice:

# Inspect resulting tables: ------ 
t3  # 10 x 4
#> # A tibble: 10 × 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 H.V.  female      1     4
#>  2 L.V.  female      4     2
#>  3 M.Y.  male        2     4
#>  4 Q.N.  female      6     1
#>  5 Q.N.  female      6     1
#>  6 T.D.  male        5     2
#>  7 X.P.  female      4     1
#>  8 X.Y.  female      5     4
#>  9 Z.Y.  female     NA    NA
#> 10 Z.Y.  female     NA    NA
t4  # 10 x 4
#> # A tibble: 10 × 4
#>    name  gender like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>
#>  1 A.F.  female      4     1
#>  2 A.V.  female      5     3
#>  3 C.N.  female      2     2
#>  4 L.V.  female      2     4
#>  5 Q.N.  female      6     2
#>  6 Q.N.  female      6     2
#>  7 T.V.  male        4     4
#>  8 X.P.  female      4     1
#>  9 Y.G.  male        4     4
#> 10 Y.G.  male        4     4

# Common cases:
t3$name[t3$name %in% t4$name]  # people in t3 that also appear in t4
#> [1] "L.V." "Q.N." "Q.N." "X.P."
t4$name[t4$name %in% t3$name]  # people in t4 that also appear in t3
#> [1] "L.V." "Q.N." "Q.N." "X.P."

# Unique cases:
t3$name[!t3$name %in% t4$name]  # people in t3 that are NOT in t4
#> [1] "H.V." "M.Y." "T.D." "X.Y." "Z.Y." "Z.Y."
t4$name[!t4$name %in% t3$name]  # people in t4 that are NOT in t3
#> [1] "A.F." "A.V." "C.N." "T.V." "Y.G." "Y.G."

Now let’s re-do the four mutating joins with t3 and t4 and inspect the resulting tables:

# 1. left_join:
m5 <- left_join(t3, t4)
m5  # => 12 x 6 table, containing all cases of t3, plus additional cases for duplicate cases.
#> # A tibble: 12 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 H.V.  female      1     4     NA    NA
#>  2 L.V.  female      4     2      2     4
#>  3 M.Y.  male        2     4     NA    NA
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 T.D.  male        5     2     NA    NA
#>  9 X.P.  female      4     1      4     1
#> 10 X.Y.  female      5     4     NA    NA
#> 11 Z.Y.  female     NA    NA     NA    NA
#> 12 Z.Y.  female     NA    NA     NA    NA

# 2. right_join:
m6 <- right_join(t3, t4)
m6  # => 12 x 6 table, containing all cases of t4, plus additional cases for duplicate cases.
#> # A tibble: 12 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 L.V.  female      4     2      2     4
#>  2 Q.N.  female      6     1      6     2
#>  3 Q.N.  female      6     1      6     2
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 X.P.  female      4     1      4     1
#>  7 A.F.  female     NA    NA      4     1
#>  8 A.V.  female     NA    NA      5     3
#>  9 C.N.  female     NA    NA      2     2
#> 10 T.V.  male       NA    NA      4     4
#> 11 Y.G.  male       NA    NA      4     4
#> 12 Y.G.  male       NA    NA      4     4

all.equal(m5, m6)  # shows differences, but 
#> [1] "Component \"name\": 10 string mismatches"                              
#> [2] "Component \"gender\": 5 string mismatches"                             
#> [3] "Component \"like_1\": 'is.NA' value mismatch: 6 in current 2 in target"
#> [4] "Component \"bnt_1\": 'is.NA' value mismatch: 6 in current 2 in target" 
#> [5] "Component \"like_2\": 'is.NA' value mismatch: 0 in current 6 in target"
#> [6] "Component \"bnt_2\": 'is.NA' value mismatch: 0 in current 6 in target"
all.equal(m5, right_join(t4, t3)) # is TRUE
#> [1] "Names: 4 string mismatches"                                   
#> [2] "Component \"name\": 7 string mismatches"                      
#> [3] "Component \"gender\": 2 string mismatches"                    
#> [4] "Component 3: 'is.NA' value mismatch: 6 in current 2 in target"
#> [5] "Component 4: 'is.NA' value mismatch: 6 in current 2 in target"
#> [6] "Component 5: 'is.NA' value mismatch: 2 in current 6 in target"
#> [7] "Component 6: 'is.NA' value mismatch: 2 in current 6 in target"

# 3. full_join:
m7 <- full_join(t3, t4)
m7  # => 17 x 6 table, i.e. all cases of t3 and of t4
#> # A tibble: 18 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 H.V.  female      1     4     NA    NA
#>  2 L.V.  female      4     2      2     4
#>  3 M.Y.  male        2     4     NA    NA
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 T.D.  male        5     2     NA    NA
#>  9 X.P.  female      4     1      4     1
#> 10 X.Y.  female      5     4     NA    NA
#> 11 Z.Y.  female     NA    NA     NA    NA
#> 12 Z.Y.  female     NA    NA     NA    NA
#> 13 A.F.  female     NA    NA      4     1
#> 14 A.V.  female     NA    NA      5     3
#> 15 C.N.  female     NA    NA      2     2
#> 16 T.V.  male       NA    NA      4     4
#> 17 Y.G.  male       NA    NA      4     4
#> 18 Y.G.  male       NA    NA      4     4

# 4. inner_join:
m8 <- inner_join(t3, t4)
m8  # => 7 x 6 table, i.e. all cases that were in both t3 and t4
#> # A tibble: 6 × 6
#>   name  gender like_1 bnt_1 like_2 bnt_2
#>   <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#> 1 L.V.  female      4     2      2     4
#> 2 Q.N.  female      6     1      6     2
#> 3 Q.N.  female      6     1      6     2
#> 4 Q.N.  female      6     1      6     2
#> 5 Q.N.  female      6     1      6     2
#> 6 X.P.  female      4     1      4     1

Thus, when joining two tables with different cases, it really matters which join-command we are using.

8.2.2 Filtering joins

Filtering joins match cases (or observations) in the same way as mutating joins, but only affect the cases (observations or rows), not the variables (columns). There are two types of filtering joins:

  1. semi_join(x, y) keeps all observations in x that have a match in y (i.e., observations in x without a match in y are dropped).

  2. anti_join(x, y) drops all observations in x that have a match in y (i.e., observations in x without a match in y are kept).

See Section 13.5 on filtering joins of the r4ds textbook (Wickham & Grolemund, 2017) for graphical illustrations.

Notes

  • anti_join(x, y) yields the complement of semi_join(x, y) (with respect to x).

  • Both types of filtering joins concern only the cases of x and filter them by the cases of y.

  • The resulting table of a filtering join preserves the variables of x, not the variables of y.

Examples

semi_join(t3, t4) # keeps cases in t3 that are also in t4
#> # A tibble: 4 × 4
#>   name  gender like_1 bnt_1
#>   <chr> <chr>   <dbl> <dbl>
#> 1 L.V.  female      4     2
#> 2 Q.N.  female      6     1
#> 3 Q.N.  female      6     1
#> 4 X.P.  female      4     1

anti_join(t3, t4) # keeps cases in t3 that are NOT in t4
#> # A tibble: 6 × 4
#>   name  gender like_1 bnt_1
#>   <chr> <chr>   <dbl> <dbl>
#> 1 H.V.  female      1     4
#> 2 M.Y.  male        2     4
#> 3 T.D.  male        5     2
#> 4 X.Y.  female      5     4
#> 5 Z.Y.  female     NA    NA
#> 6 Z.Y.  female     NA    NA

# Note: 
full_join(semi_join(t3, t4), anti_join(t3, t4))
#> # A tibble: 10 × 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 L.V.  female      4     2
#>  2 Q.N.  female      6     1
#>  3 Q.N.  female      6     1
#>  4 X.P.  female      4     1
#>  5 H.V.  female      1     4
#>  6 M.Y.  male        2     4
#>  7 T.D.  male        5     2
#>  8 X.Y.  female      5     4
#>  9 Z.Y.  female     NA    NA
#> 10 Z.Y.  female     NA    NA
# contains the same cases as t3.

Use cases

Filtering joins typically reduce the cases of x to the cases contained in y (semi_join()) or to the cases not contained in y (anti_join()). Typical use cases include:

  • Semi-joins are useful for matching filtered summary tables back to the original rows:
## Data: 
sw <- dplyr::starwars
# sw  # 87 x 13

# Determine the top 3 homeworlds of Humans:
top_3_human_homes <- sw %>% 
  filter(species == "Human") %>%
  group_by(homeworld) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(3)
top_3_human_homes
#> # A tibble: 3 × 2
#> # Groups:   homeworld [3]
#>   homeworld     n
#>   <chr>     <int>
#> 1 Tatooine      8
#> 2 Naboo         5
#> 3 <NA>          5

# Which individuals (human and non-human) are from these homeworlds?
semi_join(sw, top_3_human_homes, by = "homeworld")  # => 31 individuals
#> # A tibble: 31 × 14
#>    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
#>  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
#>  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
#>  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
#>  5 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
#>  6 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
#>  7 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
#>  8 Biggs D…    183    84 black      light      brown           24   male  mascu…
#>  9 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
#> 10 Yoda         66    17 white      green      brown          896   male  mascu…
#> # … with 21 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
  • Anti-joins are useful for diagnosing join mismatches:
left_join(t3, t4)  # adds variables (columns) of t4 to cases (rows) of t3
#> # A tibble: 12 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 H.V.  female      1     4     NA    NA
#>  2 L.V.  female      4     2      2     4
#>  3 M.Y.  male        2     4     NA    NA
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 T.D.  male        5     2     NA    NA
#>  9 X.P.  female      4     1      4     1
#> 10 X.Y.  female      5     4     NA    NA
#> 11 Z.Y.  female     NA    NA     NA    NA
#> 12 Z.Y.  female     NA    NA     NA    NA

anti_join(t3, t4)  # shows cases in t3 NOT in t4 (with NA values in new variables)
#> # A tibble: 6 × 4
#>   name  gender like_1 bnt_1
#>   <chr> <chr>   <dbl> <dbl>
#> 1 H.V.  female      1     4
#> 2 M.Y.  male        2     4
#> 3 T.D.  male        5     2
#> 4 X.Y.  female      5     4
#> 5 Z.Y.  female     NA    NA
#> 6 Z.Y.  female     NA    NA
anti_join(t4, t3)  # shows cases in t4 NOT in t3 (missing from joined table)
#> # A tibble: 6 × 4
#>   name  gender like_2 bnt_2
#>   <chr> <chr>   <dbl> <dbl>
#> 1 A.F.  female      4     1
#> 2 A.V.  female      5     3
#> 3 C.N.  female      2     2
#> 4 T.V.  male        4     4
#> 5 Y.G.  male        4     4
#> 6 Y.G.  male        4     4

See Section 13.6 of the r4ds textbook (Wickham & Grolemund, 2017) for common join problems and ways of dealing with them.

8.2.3 Set operations

Set operations on two tables x and y expect both tables to have the same variables. They always compare complete rows of x and y, and compare the values of every variable. Treating the observations like sets yields three different commands:

  1. intersect(x, y) returns only observations that occur in both x and in y.

  2. union(x, y) returns all unique observations in x and y.

  3. setdiff(x, y) return those observations that occur in x, but not in y.

Examples

# Data: 
sw <- dplyr::starwars %>%
  select(name:species)  # only first 10 columns of starwars
sw  # 87 x 10
#> # A tibble: 87 × 11
#>    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
#>    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke S…    172    77 blond       fair       blue            19   male  mascu…
#>  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu…
#>  3 R2-D2       96    32 <NA>        white, bl… red             33   none  mascu…
#>  4 Darth …    202   136 none        white      yellow          41.9 male  mascu…
#>  5 Leia O…    150    49 brown       light      brown           19   fema… femin…
#>  6 Owen L…    178   120 brown, grey light      blue            52   male  mascu…
#>  7 Beru W…    165    75 brown       light      blue            47   fema… femin…
#>  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu…
#>  9 Biggs …    183    84 black       light      brown           24   male  mascu…
#> 10 Obi-Wa…    182    77 auburn, wh… fair       blue-gray       57   male  mascu…
#> # … with 77 more rows, and 2 more variables: homeworld <chr>, species <chr>

# Create 2 filtered sub-sets:
s1 <- sw %>% filter(species == "Human", height < 180)
s1  # 15 indivduals
#> # A tibble: 15 × 11
#>    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke Sk…    172    77 blond      fair       blue              19 male  mascu…
#>  2 Leia Or…    150    49 brown      light      brown             19 fema… femin…
#>  3 Owen La…    178   120 brown, gr… light      blue              52 male  mascu…
#>  4 Beru Wh…    165    75 brown      light      blue              47 fema… femin…
#>  5 Wedge A…    170    77 brown      fair       hazel             21 male  mascu…
#>  6 Palpati…    170    75 grey       pale       yellow            82 male  mascu…
#>  7 Lando C…    177    79 black      dark       brown             31 male  mascu…
#>  8 Lobot       175    79 none       light      blue              37 male  mascu…
#>  9 Mon Mot…    150    NA auburn     fair       blue              48 fema… femin…
#> 10 Finis V…    170    NA blond      fair       blue              91 male  mascu…
#> 11 Shmi Sk…    163    NA black      fair       brown             72 fema… femin…
#> 12 Cordé       157    NA brown      light      brown             NA fema… femin…
#> 13 Dormé       165    NA brown      light      brown             NA fema… femin…
#> 14 Jocasta…    167    NA white      fair       blue              NA fema… femin…
#> 15 Padmé A…    165    45 brown      light      brown             46 fema… femin…
#> # … with 2 more variables: homeworld <chr>, species <chr>

s2 <- sw %>% filter(homeworld == "Naboo" | homeworld == "Tatooine")
s2  # 21 individuals
#> # A tibble: 21 × 11
#>    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
#>  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
#>  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
#>  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
#>  5 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
#>  6 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
#>  7 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
#>  8 Biggs D…    183    84 black      light      brown           24   male  mascu…
#>  9 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
#> 10 Palpati…    170    75 grey       pale       yellow          82   male  mascu…
#> # … with 11 more rows, and 2 more variables: homeworld <chr>, species <chr>

# Set operations:
intersect(s1, s2)  #  8 individuals are in both s1 and s2
#> # A tibble: 8 × 11
#>   name     height  mass hair_color  skin_color eye_color birth_year sex   gender
#>   <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke Sk…    172    77 blond       fair       blue              19 male  mascu…
#> 2 Owen La…    178   120 brown, grey light      blue              52 male  mascu…
#> 3 Beru Wh…    165    75 brown       light      blue              47 fema… femin…
#> 4 Palpati…    170    75 grey        pale       yellow            82 male  mascu…
#> 5 Shmi Sk…    163    NA black       fair       brown             72 fema… femin…
#> 6 Cordé       157    NA brown       light      brown             NA fema… femin…
#> 7 Dormé       165    NA brown       light      brown             NA fema… femin…
#> 8 Padmé A…    165    45 brown       light      brown             46 fema… femin…
#> # … with 2 more variables: homeworld <chr>, species <chr>
union(s1, s2)      # 28 individuals are in the union of both sets
#> # A tibble: 28 × 11
#>    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke Sk…    172    77 blond      fair       blue              19 male  mascu…
#>  2 Leia Or…    150    49 brown      light      brown             19 fema… femin…
#>  3 Owen La…    178   120 brown, gr… light      blue              52 male  mascu…
#>  4 Beru Wh…    165    75 brown      light      blue              47 fema… femin…
#>  5 Wedge A…    170    77 brown      fair       hazel             21 male  mascu…
#>  6 Palpati…    170    75 grey       pale       yellow            82 male  mascu…
#>  7 Lando C…    177    79 black      dark       brown             31 male  mascu…
#>  8 Lobot       175    79 none       light      blue              37 male  mascu…
#>  9 Mon Mot…    150    NA auburn     fair       blue              48 fema… femin…
#> 10 Finis V…    170    NA blond      fair       blue              91 male  mascu…
#> # … with 18 more rows, and 2 more variables: homeworld <chr>, species <chr>
setdiff(s1, s2)    #  7 individuals in s1 and not in s2
#> # A tibble: 7 × 11
#>   name     height  mass hair_color skin_color eye_color birth_year sex    gender
#>   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr> 
#> 1 Leia Or…    150    49 brown      light      brown             19 female femin…
#> 2 Wedge A…    170    77 brown      fair       hazel             21 male   mascu…
#> 3 Lando C…    177    79 black      dark       brown             31 male   mascu…
#> 4 Lobot       175    79 none       light      blue              37 male   mascu…
#> 5 Mon Mot…    150    NA auburn     fair       blue              48 female femin…
#> 6 Finis V…    170    NA blond      fair       blue              91 male   mascu…
#> 7 Jocasta…    167    NA white      fair       blue              NA female femin…
#> # … with 2 more variables: homeworld <chr>, species <chr>
setdiff(s2, s1)    # 13 individuals in s2 and not in s1
#> # A tibble: 13 × 11
#>    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
#>  2 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
#>  3 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
#>  4 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
#>  5 Biggs D…    183    84 black      light      brown           24   male  mascu…
#>  6 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
#>  7 Jar Jar…    196    66 none       orange     orange          52   male  mascu…
#>  8 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
#>  9 Rugor N…    206    NA none       green      orange          NA   male  mascu…
#> 10 Ric Olié    183    NA brown      fair       blue            NA   <NA>  <NA>  
#> 11 Quarsh …    183    NA black      dark       brown           62   <NA>  <NA>  
#> 12 Gregar …    185    85 black      dark       brown           NA   male  mascu…
#> 13 Cliegg …    183    NA brown      fair       blue            82   male  mascu…
#> # … with 2 more variables: homeworld <chr>, species <chr>

8.2.4 Using merge()

The base R function base::merge() can perform all four types of mutating joins:

Table 8.1: Mapping the base R function merge() to the join commands of dplyr.
dplyr: base R:
inner_join(x, y) merge(x, y)
left_join(x, y) merge(x, y, all.x = TRUE)
right_join(x, y) merge(x, y, all.y = TRUE)
full_join(x, y) merge(x, y, all.x = TRUE, all.y = TRUE)

Note that merge() also accepts various by commands (see ?merge for documentation).

The advantages of the dplyr verbs are that:

  1. they more clearly convey the intended join (by verbally expressing what is implied by the arguments of merge());

  2. they are faster and preserve the order of cases (rows).

See Chapter 13 of the r4ds textbook (Wickham & Grolemund, 2017) for other implementations of joins (e.g., in SQL).

References

Wickham, H., François, R., Henry, L., & Müller, K. (2022). Dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz

  1. It’s not crucial to understand the following code. It only serves to select some random cases from each table and create some duplicate cases in each.↩︎