8.2 Essential join commands

As the functions for joining two tables x and y are implemented in the dplyr package (Wickham, François, et al., 2023), the two-table verbs use the same terminology as the one-table verbs discussed in Chapter 3 on Data transformation: Mutating commands affect the variables (columns) of a table, whereas filtering commands affect the cases or observations (rows) of a table.

In the following, we distinguish between three types of joins:

  1. Mutating joins combine the variables (columns) of two tables (Section 8.2.1);

  2. Filtering joins affect only the cases or observations (rows) of two tables (Section 8.2.2);

  3. Set operations on two tables expect both tables to have the same variables and treat them like sets that can be combined or intersected (Section 8.2.3).

Finally, we will see that the merge() function from base R can also be used to accomplish mutating joins (Section 8.2.4).

8.2.1 Mutating joins

Mutating joins combine the variables (columns) of two tables x and y. The term mutating is used in analogy to the dplyr::mutate() command: Just as mutate() creates new variables (columns), a mutating join adds new variables (columns) to a table x from matching cases (rows) of another table y.

Key functions

There are four types of mutating joins to combine the cases (rows) and variables (columns) of tables x and y:

  1. left_join(x, y): Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

  2. right_join(x, y): Same as left_join(y, x), and hence redundant.

  3. full_join(x, y): Return all rows and all columns that exist in x or in y. If there are non-matching values, return NA for the ones missing.

  4. inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

In contrast to an inner_join(x, y), the first three joins are called outer joins. See Section 13.4 on mutating joins (Wickham & Grolemund, 2017) for graphical illustrations of these joins.

Inner vs. outer joins

All four types of mutating joins combine variables (columns) from two tables, but differ in preserving in the cases (rows) in the output table. An inner join inner_join(x, y) keeps only the cases (or rows) that appear in both tables (i.e., x and y). By contrast, the first three join commands keep cases (observations) that appear in at least one of the tables and are also called outer joins. The three types of outer joins also differ with respect to the cases they preserve:

  1. left_join(x, y) keeps all cases of x.
  2. right_join(x, y) keeps all cases of y.
  3. full_join(x, y) keeps all cases of x or y.

Terminology of key variables

Beyond specifying the two tables to join, all *_join() commands take a by argument: by specifies a character vector of variables by which we want to join the tables (so-called keys).

In the r4ds textbook (Wickham & Grolemund, 2017), Section 13.3 on keys makes a distinction between primary and foreign keys, that stems from database terminology. These terms can be confusing, though the distinction behind it is actually quite simple:

  • A key is simply a variable that occurs in two distinct tables (x and y) and by which we want to join both tables.

  • Importantly, the variable of interest does not always need to have the same name in both tables. For instance, a variable may be called surname in x and family_name in y, but both these variables actually have the same referent. (i.e., the name of a specific person). The distinction between primary vs. secondary now depends on our perspective:

    • When looking from table x at table y, surname is a primary key and family_name a secondary key.

    • When looking from table y at table x, family_name is a primary key and surname a secondary key.

  • The primary key (in table x) and the corresponding foreign key in another table (y) form a so-called relation. Such relations are typically one-to-many. For instance, each name or id variable in one table corresponds to a particular person, but each person can contribute data to several measurement occasions in another table (see, e.g., the variables id and intervention in the table posPsy_p_info vs. in the table posPsy_long of the ds4psy package). A special case occurs when there is a one-to-one relationship (see, e.g., the variables id and intervention in the tables posPsy_p_info and posPsy_wide of the ds4psy package).

Actually, the distinction between two different types of keys is irrelevant, as long as we know how we want to join the two tables (e.g., which table is our primary table) and which relation(s) exist between both tables (i.e., which variables occur in both tables).

The by argument of any *_join() command can either be NULL or a named vector:

  • If by = NULL (as by default) *_join will perform a natural join, using all variables with common names across the two tables. A message lists the variables that are used, so that we can verify that this is what we wanted.

  • To join by differently named variables on x and y, we can provide a named vector. For instance, by = c("a" = "b") will match a (primary) key variable a in table x to a (secondary) key variable b in table y.

Examples of joining tables

To test these commands, we load the following data files:

  • data_t1.csv, available as data_t1 in the ds4psy package
  • data_t2.csv, available as data_t2 in the ds4psy package
# Data:
data_t1 <- ds4psy::data_t1  # from ds4psy package
data_t2 <- ds4psy::data_t2  # from ds4psy package

# Copy data:
t1 <- data_t1
t2 <- data_t2

# Inspect data:
head(t1)
#> # A tibble: 6 × 4
#>   name  gender like_1 bnt_1
#>   <chr> <chr>   <dbl> <dbl>
#> 1 R.S.  female      2    NA
#> 2 M.Y.  male        2     4
#> 3 K.R.  male        4     1
#> 4 A.V.  female      3     2
#> 5 Z.Y.  female     NA    NA
#> 6 X.D.  female      4     4
head(t2)
#> # A tibble: 6 × 4
#>   name  gender like_2 bnt_2
#>   <chr> <chr>   <dbl> <dbl>
#> 1 X.D.  female      5     1
#> 2 T.D.  male        3    NA
#> 3 Z.D.  male        7     4
#> 4 L.S.  male        7     4
#> 5 H.V.  female      5     2
#> 6 Q.N.  female      6     2

Note that the data in t1 and t2 are both 20 x 4 tibbles and appear to contain the responses or test scores of people. This could suggest that they may be two measurements of the same people. However, before jumping into joining these tables, we should check whether they really contain the same people:

# Do both tables contain common variables?
sum(names(t1) %in% names(t2))        # Number of common variables: 
#> [1] 2
names(t1)[names(t1) %in% names(t2)]  # Names of common variables: 
#> [1] "name"   "gender"

# Arrange the rows of tables by common variables:
t1 <- t1 %>% arrange(name, gender) 
t2 <- t2 %>% arrange(name, gender)

# Are they equal in both tables?
all.equal(t1$name, t2$name)
#> [1] TRUE
all.equal(t1$gender, t2$gender)
#> [1] TRUE

This confirms our intuition: t1 and t2 contain two variables of demographic information (name and gender) of 20 people and each table contains two additional variables (measurements or test scores).

Joining two tables with identical cases

When both tables contain identical cases, the four mutating join commands all yield the same result:

# 1. left_join:
m1 <- left_join(t1, t2)
m1
#> # A tibble: 20 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 2. right_join:
m2 <- right_join(t1, t2)
m2
#> # A tibble: 20 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 3. full_join:
m3 <- full_join(t1, t2)
m3
#> # A tibble: 20 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 4. inner_join:
m4 <- inner_join(t1, t2)
m4
#> # A tibble: 20 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# Verify equality:
all.equal(m1, m2)
#> [1] TRUE
all.equal(m1, m3)
#> [1] TRUE
all.equal(m1, m4)
#> [1] TRUE

Note that the join commands automatically used the two common variables as key variables to join the tables (i.e., used by = c("name", "gender") by default). This raises the question: What would happen if we specified only one of these variables in the by argument? Let’s check to find out:

left_join(t1, t2, by = "name")   # => 2 gender columns, distinguished by suffixes .x and .y
#> # A tibble: 20 × 7
#>    name  gender.x like_1 bnt_1 gender.y like_2 bnt_2
#>    <chr> <chr>     <dbl> <dbl> <chr>     <dbl> <dbl>
#>  1 A.F.  female        5     1 female        4     1
#>  2 A.V.  female        3     2 female        5     3
#>  3 C.N.  female        4     3 female        2     2
#>  4 H.V.  female        1     4 female        5     2
#>  5 K.R.  male          4     1 male          6     1
#>  6 K.X.  male          3     3 male          5     4
#>  7 L.S.  male          4     4 male          7     4
#>  8 L.V.  female        4     2 female        2     4
#>  9 M.O.  male          4     1 male         NA    NA
#> 10 M.Y.  male          2     4 male          6     2
#> 11 Q.N.  female        6     1 female        6     2
#> 12 R.S.  female        2    NA female        7     3
#> 13 T.D.  male          5     2 male          3    NA
#> 14 T.V.  male          6     4 male          4     4
#> 15 X.D.  female        4     4 female        5     1
#> 16 X.P.  female        4     1 female        4     1
#> 17 X.Y.  female        5     4 female        5     2
#> 18 Y.G.  male          4     4 male          4     4
#> 19 Z.D.  male          5     4 male          7     4
#> 20 Z.Y.  female       NA    NA female        7     1
left_join(t1, t2, by = "gender") # => 2 name columns, distinguished by suffixes .x and .y
#> # A tibble: 202 × 7
#>    name.x gender like_1 bnt_1 name.y like_2 bnt_2
#>    <chr>  <chr>   <dbl> <dbl> <chr>   <dbl> <dbl>
#>  1 A.F.   female      5     1 A.F.        4     1
#>  2 A.F.   female      5     1 A.V.        5     3
#>  3 A.F.   female      5     1 C.N.        2     2
#>  4 A.F.   female      5     1 H.V.        5     2
#>  5 A.F.   female      5     1 L.V.        2     4
#>  6 A.F.   female      5     1 Q.N.        6     2
#>  7 A.F.   female      5     1 R.S.        7     3
#>  8 A.F.   female      5     1 X.D.        5     1
#>  9 A.F.   female      5     1 X.P.        4     1
#> 10 A.F.   female      5     1 X.Y.        5     2
#> # … with 192 more rows

We see that common variables not used for joining the tables get duplicated in the joint table and distinguished by a unique suffix. We can control the suffix of duplicate variables by providing a suffix argument:

# To control the suffix of duplicate variables:
left_join(t1, t2, by = "name", suffix = c("_001", "_002"))
#> # A tibble: 20 × 7
#>    name  gender_001 like_1 bnt_1 gender_002 like_2 bnt_2
#>    <chr> <chr>       <dbl> <dbl> <chr>       <dbl> <dbl>
#>  1 A.F.  female          5     1 female          4     1
#>  2 A.V.  female          3     2 female          5     3
#>  3 C.N.  female          4     3 female          2     2
#>  4 H.V.  female          1     4 female          5     2
#>  5 K.R.  male            4     1 male            6     1
#>  6 K.X.  male            3     3 male            5     4
#>  7 L.S.  male            4     4 male            7     4
#>  8 L.V.  female          4     2 female          2     4
#>  9 M.O.  male            4     1 male           NA    NA
#> 10 M.Y.  male            2     4 male            6     2
#> 11 Q.N.  female          6     1 female          6     2
#> 12 R.S.  female          2    NA female          7     3
#> 13 T.D.  male            5     2 male            3    NA
#> 14 T.V.  male            6     4 male            4     4
#> 15 X.D.  female          4     4 female          5     1
#> 16 X.P.  female          4     1 female          4     1
#> 17 X.Y.  female          5     4 female          5     2
#> 18 Y.G.  male            4     4 male            4     4
#> 19 Z.D.  male            5     4 male            7     4
#> 20 Z.Y.  female         NA    NA female          7     1

Joining two tables with different cases

To see and appreciate the differences between the four mutating joins, we need tables that contain different and repeated cases. We can easily create these by selecting and duplicating some cases in each of our tables:51

# Create t3 and t4: ------ 
t3 <- data_t1  # copy
t4 <- data_t2  # copy 

# Modify both tables: ------ 
# (a) Draw n random rows (samples):
set.seed(12)  # for replicability
n <- 8
s1 <- sample(1:nrow(t3), size = n, replace = FALSE)  # n random values from 1 to nrow(t3)
s2 <- sample(1:nrow(t4), size = n, replace = FALSE)  # n random values from 1 to nrow(t4)

# Reduce rows and arrange both tables:
t3 <- t3[s1, ]
t4 <- t4[s2, ]

# (b) Duplicate m random rows:
set.seed(23)  # for replicability
m <- 2
d1 <- sample(1:nrow(t3), size = m, replace = FALSE)  # n random values from 1 to nrow(t3)
d2 <- sample(1:nrow(t4), size = m, replace = FALSE)  # n random values from 1 to nrow(t4)

# Duplicate sampled rows:
t3 <- rbind(t3, t3[d1, ])
t4 <- rbind(t4, t4[d2, ])

# Arrange tables:
t3 <- t3 %>% arrange(name, gender)
t4 <- t4 %>% arrange(name, gender)

The resulting tables t3 and t4 are both 10 x 4 tibbles and contain some common and some unique cases (people). In addition, each table contains two people twice:

# Inspect resulting tables: ------ 
t3  # 10 x 4
#> # A tibble: 10 × 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 H.V.  female      1     4
#>  2 L.V.  female      4     2
#>  3 M.Y.  male        2     4
#>  4 Q.N.  female      6     1
#>  5 Q.N.  female      6     1
#>  6 T.D.  male        5     2
#>  7 X.P.  female      4     1
#>  8 X.Y.  female      5     4
#>  9 Z.Y.  female     NA    NA
#> 10 Z.Y.  female     NA    NA
t4  # 10 x 4
#> # A tibble: 10 × 4
#>    name  gender like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>
#>  1 A.F.  female      4     1
#>  2 A.V.  female      5     3
#>  3 C.N.  female      2     2
#>  4 L.V.  female      2     4
#>  5 Q.N.  female      6     2
#>  6 Q.N.  female      6     2
#>  7 T.V.  male        4     4
#>  8 X.P.  female      4     1
#>  9 Y.G.  male        4     4
#> 10 Y.G.  male        4     4

# Common cases:
t3$name[t3$name %in% t4$name]  # people in t3 that also appear in t4
#> [1] "L.V." "Q.N." "Q.N." "X.P."
t4$name[t4$name %in% t3$name]  # people in t4 that also appear in t3
#> [1] "L.V." "Q.N." "Q.N." "X.P."

# Unique cases:
t3$name[!t3$name %in% t4$name]  # people in t3 that are NOT in t4
#> [1] "H.V." "M.Y." "T.D." "X.Y." "Z.Y." "Z.Y."
t4$name[!t4$name %in% t3$name]  # people in t4 that are NOT in t3
#> [1] "A.F." "A.V." "C.N." "T.V." "Y.G." "Y.G."

Now let’s re-do the four mutating joins with t3 and t4 and inspect the resulting tables:

# 1. left_join:
m5 <- left_join(t3, t4)
m5  # => 12 x 6 table, containing all cases of t3, plus additional cases for duplicate cases.
#> # A tibble: 12 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 H.V.  female      1     4     NA    NA
#>  2 L.V.  female      4     2      2     4
#>  3 M.Y.  male        2     4     NA    NA
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 T.D.  male        5     2     NA    NA
#>  9 X.P.  female      4     1      4     1
#> 10 X.Y.  female      5     4     NA    NA
#> 11 Z.Y.  female     NA    NA     NA    NA
#> 12 Z.Y.  female     NA    NA     NA    NA

# 2. right_join:
m6 <- right_join(t3, t4)
m6  # => 12 x 6 table, containing all cases of t4, plus additional cases for duplicate cases.
#> # A tibble: 12 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 L.V.  female      4     2      2     4
#>  2 Q.N.  female      6     1      6     2
#>  3 Q.N.  female      6     1      6     2
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 X.P.  female      4     1      4     1
#>  7 A.F.  female     NA    NA      4     1
#>  8 A.V.  female     NA    NA      5     3
#>  9 C.N.  female     NA    NA      2     2
#> 10 T.V.  male       NA    NA      4     4
#> 11 Y.G.  male       NA    NA      4     4
#> 12 Y.G.  male       NA    NA      4     4

all.equal(m5, m6)  # shows differences, but 
#> [1] "Component \"name\": 10 string mismatches"                              
#> [2] "Component \"gender\": 5 string mismatches"                             
#> [3] "Component \"like_1\": 'is.NA' value mismatch: 6 in current 2 in target"
#> [4] "Component \"bnt_1\": 'is.NA' value mismatch: 6 in current 2 in target" 
#> [5] "Component \"like_2\": 'is.NA' value mismatch: 0 in current 6 in target"
#> [6] "Component \"bnt_2\": 'is.NA' value mismatch: 0 in current 6 in target"
all.equal(m5, right_join(t4, t3)) # is TRUE
#> [1] "Names: 4 string mismatches"                                   
#> [2] "Component \"name\": 7 string mismatches"                      
#> [3] "Component \"gender\": 2 string mismatches"                    
#> [4] "Component 3: 'is.NA' value mismatch: 6 in current 2 in target"
#> [5] "Component 4: 'is.NA' value mismatch: 6 in current 2 in target"
#> [6] "Component 5: 'is.NA' value mismatch: 2 in current 6 in target"
#> [7] "Component 6: 'is.NA' value mismatch: 2 in current 6 in target"

# 3. full_join:
m7 <- full_join(t3, t4)
m7  # => 17 x 6 table, i.e. all cases of t3 and of t4
#> # A tibble: 18 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 H.V.  female      1     4     NA    NA
#>  2 L.V.  female      4     2      2     4
#>  3 M.Y.  male        2     4     NA    NA
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 T.D.  male        5     2     NA    NA
#>  9 X.P.  female      4     1      4     1
#> 10 X.Y.  female      5     4     NA    NA
#> 11 Z.Y.  female     NA    NA     NA    NA
#> 12 Z.Y.  female     NA    NA     NA    NA
#> 13 A.F.  female     NA    NA      4     1
#> 14 A.V.  female     NA    NA      5     3
#> 15 C.N.  female     NA    NA      2     2
#> 16 T.V.  male       NA    NA      4     4
#> 17 Y.G.  male       NA    NA      4     4
#> 18 Y.G.  male       NA    NA      4     4

# 4. inner_join:
m8 <- inner_join(t3, t4)
m8  # => 7 x 6 table, i.e. all cases that were in both t3 and t4
#> # A tibble: 6 × 6
#>   name  gender like_1 bnt_1 like_2 bnt_2
#>   <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#> 1 L.V.  female      4     2      2     4
#> 2 Q.N.  female      6     1      6     2
#> 3 Q.N.  female      6     1      6     2
#> 4 Q.N.  female      6     1      6     2
#> 5 Q.N.  female      6     1      6     2
#> 6 X.P.  female      4     1      4     1

Thus, when joining two tables with different cases, it really matters which join-command we are using.

8.2.2 Filtering joins

Filtering joins match cases (or observations) in the same way as mutating joins, but only affect the cases (observations or rows), not the variables (columns). There are two types of filtering joins:

  1. semi_join(x, y) keeps all observations in x that have a match in y (i.e., observations in x without a match in y are dropped).

  2. anti_join(x, y) drops all observations in x that have a match in y (i.e., observations in x without a match in y are kept).

See Section 13.5 on filtering joins of the r4ds textbook (Wickham & Grolemund, 2017) for graphical illustrations.

Notes

  • anti_join(x, y) yields the complement of semi_join(x, y) (with respect to x).

  • Both types of filtering joins concern only the cases of x and filter them by the cases of y.

  • The resulting table of a filtering join preserves the variables of x, not the variables of y.

Examples

semi_join(t3, t4) # keeps cases in t3 that are also in t4
#> # A tibble: 4 × 4
#>   name  gender like_1 bnt_1
#>   <chr> <chr>   <dbl> <dbl>
#> 1 L.V.  female      4     2
#> 2 Q.N.  female      6     1
#> 3 Q.N.  female      6     1
#> 4 X.P.  female      4     1

anti_join(t3, t4) # keeps cases in t3 that are NOT in t4
#> # A tibble: 6 × 4
#>   name  gender like_1 bnt_1
#>   <chr> <chr>   <dbl> <dbl>
#> 1 H.V.  female      1     4
#> 2 M.Y.  male        2     4
#> 3 T.D.  male        5     2
#> 4 X.Y.  female      5     4
#> 5 Z.Y.  female     NA    NA
#> 6 Z.Y.  female     NA    NA

# Note: 
full_join(semi_join(t3, t4), anti_join(t3, t4))
#> # A tibble: 10 × 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 L.V.  female      4     2
#>  2 Q.N.  female      6     1
#>  3 Q.N.  female      6     1
#>  4 X.P.  female      4     1
#>  5 H.V.  female      1     4
#>  6 M.Y.  male        2     4
#>  7 T.D.  male        5     2
#>  8 X.Y.  female      5     4
#>  9 Z.Y.  female     NA    NA
#> 10 Z.Y.  female     NA    NA
# contains the same cases as t3.

Use cases

Filtering joins typically reduce the cases of x to the cases contained in y (semi_join()) or to the cases not contained in y (anti_join()). Typical use cases include:

  • Semi-joins are useful for matching filtered summary tables back to the original rows:
## Data: 
sw <- dplyr::starwars
# sw  # 87 x 13

# Determine the top 3 homeworlds of Humans:
top_3_human_homes <- sw %>% 
  filter(species == "Human") %>%
  group_by(homeworld) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(3)
top_3_human_homes
#> # A tibble: 3 × 2
#> # Groups:   homeworld [3]
#>   homeworld     n
#>   <chr>     <int>
#> 1 Tatooine      8
#> 2 Naboo         5
#> 3 <NA>          5

# Which individuals (human and non-human) are from these homeworlds?
semi_join(sw, top_3_human_homes, by = "homeworld")  # => 31 individuals
#> # A tibble: 31 × 14
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
#>  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
#>  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
#>  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
#>  5 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
#>  6 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
#>  7 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
#>  8 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
#>  9 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
#> 10 Yoda            66    17 white   green   brown     896   male  mascu… <NA>   
#> # … with 21 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
  • Anti-joins are useful for diagnosing join mismatches:
left_join(t3, t4)  # adds variables (columns) of t4 to cases (rows) of t3
#> # A tibble: 12 × 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 H.V.  female      1     4     NA    NA
#>  2 L.V.  female      4     2      2     4
#>  3 M.Y.  male        2     4     NA    NA
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 T.D.  male        5     2     NA    NA
#>  9 X.P.  female      4     1      4     1
#> 10 X.Y.  female      5     4     NA    NA
#> 11 Z.Y.  female     NA    NA     NA    NA
#> 12 Z.Y.  female     NA    NA     NA    NA

anti_join(t3, t4)  # shows cases in t3 NOT in t4 (with NA values in new variables)
#> # A tibble: 6 × 4
#>   name  gender like_1 bnt_1
#>   <chr> <chr>   <dbl> <dbl>
#> 1 H.V.  female      1     4
#> 2 M.Y.  male        2     4
#> 3 T.D.  male        5     2
#> 4 X.Y.  female      5     4
#> 5 Z.Y.  female     NA    NA
#> 6 Z.Y.  female     NA    NA
anti_join(t4, t3)  # shows cases in t4 NOT in t3 (missing from joined table)
#> # A tibble: 6 × 4
#>   name  gender like_2 bnt_2
#>   <chr> <chr>   <dbl> <dbl>
#> 1 A.F.  female      4     1
#> 2 A.V.  female      5     3
#> 3 C.N.  female      2     2
#> 4 T.V.  male        4     4
#> 5 Y.G.  male        4     4
#> 6 Y.G.  male        4     4

See Section 13.6 of the r4ds textbook (Wickham & Grolemund, 2017) for common join problems and ways of dealing with them.

8.2.3 Set operations

Set operations on two tables x and y expect both tables to have the same variables. They always compare complete rows of x and y, and compare the values of every variable. Treating the observations like sets yields three different commands:

  1. intersect(x, y) returns only observations that occur in both x and in y.

  2. union(x, y) returns all unique observations in x and y.

  3. setdiff(x, y) return those observations that occur in x, but not in y.

Examples

# Data: 
sw <- dplyr::starwars %>%
  select(name:species)  # only first 10 columns of starwars
sw  # 87 x 10
#> # A tibble: 87 × 11
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
#>  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
#>  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
#>  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
#>  5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
#>  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
#>  7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
#>  8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
#>  9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
#> 10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
#> # … with 77 more rows, 1 more variable: species <chr>, and abbreviated variable
#> #   names ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

# Create 2 filtered sub-sets:
s1 <- sw %>% filter(species == "Human", height < 180)
s1  # 15 indivduals
#> # A tibble: 15 × 11
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 Luke Skywa…    172    77 blond   fair    blue         19 male  mascu… Tatooi…
#>  2 Leia Organa    150    49 brown   light   brown        19 fema… femin… Aldera…
#>  3 Owen Lars      178   120 brown,… light   blue         52 male  mascu… Tatooi…
#>  4 Beru White…    165    75 brown   light   blue         47 fema… femin… Tatooi…
#>  5 Wedge Anti…    170    77 brown   fair    hazel        21 male  mascu… Corell…
#>  6 Palpatine      170    75 grey    pale    yellow       82 male  mascu… Naboo  
#>  7 Lando Calr…    177    79 black   dark    brown        31 male  mascu… Socorro
#>  8 Lobot          175    79 none    light   blue         37 male  mascu… Bespin 
#>  9 Mon Mothma     150    NA auburn  fair    blue         48 fema… femin… Chandr…
#> 10 Finis Valo…    170    NA blond   fair    blue         91 male  mascu… Corusc…
#> 11 Shmi Skywa…    163    NA black   fair    brown        72 fema… femin… Tatooi…
#> 12 Cordé          157    NA brown   light   brown        NA fema… femin… Naboo  
#> 13 Dormé          165    NA brown   light   brown        NA fema… femin… Naboo  
#> 14 Jocasta Nu     167    NA white   fair    blue         NA fema… femin… Corusc…
#> 15 Padmé Amid…    165    45 brown   light   brown        46 fema… femin… Naboo  
#> # … with 1 more variable: species <chr>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

s2 <- sw %>% filter(homeworld == "Naboo" | homeworld == "Tatooine")
s2  # 21 individuals
#> # A tibble: 21 × 11
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
#>  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
#>  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
#>  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
#>  5 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
#>  6 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
#>  7 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
#>  8 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
#>  9 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
#> 10 Palpatine      170    75 grey    pale    yellow     82   male  mascu… Naboo  
#> # … with 11 more rows, 1 more variable: species <chr>, and abbreviated variable
#> #   names ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

# Set operations:
intersect(s1, s2)  #  8 individuals are in both s1 and s2
#> # A tibble: 8 × 11
#>   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#> 1 Luke Skywal…    172    77 blond   fair    blue         19 male  mascu… Tatooi…
#> 2 Owen Lars       178   120 brown,… light   blue         52 male  mascu… Tatooi…
#> 3 Beru Whites…    165    75 brown   light   blue         47 fema… femin… Tatooi…
#> 4 Palpatine       170    75 grey    pale    yellow       82 male  mascu… Naboo  
#> 5 Shmi Skywal…    163    NA black   fair    brown        72 fema… femin… Tatooi…
#> 6 Cordé           157    NA brown   light   brown        NA fema… femin… Naboo  
#> 7 Dormé           165    NA brown   light   brown        NA fema… femin… Naboo  
#> 8 Padmé Amida…    165    45 brown   light   brown        46 fema… femin… Naboo  
#> # … with 1 more variable: species <chr>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
union(s1, s2)      # 28 individuals are in the union of both sets
#> # A tibble: 28 × 11
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 Luke Skywa…    172    77 blond   fair    blue         19 male  mascu… Tatooi…
#>  2 Leia Organa    150    49 brown   light   brown        19 fema… femin… Aldera…
#>  3 Owen Lars      178   120 brown,… light   blue         52 male  mascu… Tatooi…
#>  4 Beru White…    165    75 brown   light   blue         47 fema… femin… Tatooi…
#>  5 Wedge Anti…    170    77 brown   fair    hazel        21 male  mascu… Corell…
#>  6 Palpatine      170    75 grey    pale    yellow       82 male  mascu… Naboo  
#>  7 Lando Calr…    177    79 black   dark    brown        31 male  mascu… Socorro
#>  8 Lobot          175    79 none    light   blue         37 male  mascu… Bespin 
#>  9 Mon Mothma     150    NA auburn  fair    blue         48 fema… femin… Chandr…
#> 10 Finis Valo…    170    NA blond   fair    blue         91 male  mascu… Corusc…
#> # … with 18 more rows, 1 more variable: species <chr>, and abbreviated variable
#> #   names ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
setdiff(s1, s2)    #  7 individuals in s1 and not in s2
#> # A tibble: 7 × 11
#>   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#> 1 Leia Organa     150    49 brown   light   brown        19 fema… femin… Aldera…
#> 2 Wedge Antil…    170    77 brown   fair    hazel        21 male  mascu… Corell…
#> 3 Lando Calri…    177    79 black   dark    brown        31 male  mascu… Socorro
#> 4 Lobot           175    79 none    light   blue         37 male  mascu… Bespin 
#> 5 Mon Mothma      150    NA auburn  fair    blue         48 fema… femin… Chandr…
#> 6 Finis Valor…    170    NA blond   fair    blue         91 male  mascu… Corusc…
#> 7 Jocasta Nu      167    NA white   fair    blue         NA fema… femin… Corusc…
#> # … with 1 more variable: species <chr>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
setdiff(s2, s1)    # 13 individuals in s2 and not in s1
#> # A tibble: 13 × 11
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
#>  2 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
#>  3 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
#>  4 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
#>  5 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
#>  6 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
#>  7 Jar Jar Bi…    196    66 none    orange  orange     52   male  mascu… Naboo  
#>  8 Roos Tarpa…    224    82 none    grey    orange     NA   male  mascu… Naboo  
#>  9 Rugor Nass     206    NA none    green   orange     NA   male  mascu… Naboo  
#> 10 Ric Olié       183    NA brown   fair    blue       NA   <NA>  <NA>   Naboo  
#> 11 Quarsh Pan…    183    NA black   dark    brown      62   <NA>  <NA>   Naboo  
#> 12 Gregar Typ…    185    85 black   dark    brown      NA   male  mascu… Naboo  
#> 13 Cliegg Lars    183    NA brown   fair    blue       82   male  mascu… Tatooi…
#> # … with 1 more variable: species <chr>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

8.2.4 Using merge()

The base R function base::merge() can perform all four types of mutating joins:

Table 8.1: Mapping the base R function merge() to the join commands of dplyr.
dplyr: base R:
inner_join(x, y) merge(x, y)
left_join(x, y) merge(x, y, all.x = TRUE)
right_join(x, y) merge(x, y, all.y = TRUE)
full_join(x, y) merge(x, y, all.x = TRUE, all.y = TRUE)

Note that merge() also accepts various by commands (see ?merge for documentation).

The advantages of the dplyr verbs are that:

  1. they more clearly convey the intended join (by verbally expressing what is implied by the arguments of merge());

  2. they are faster and preserve the order of cases (rows).

See Chapter 13 of the r4ds textbook (Wickham & Grolemund, 2017) for other implementations of joins (e.g., in SQL).

References

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz

  1. It’s not crucial to understand the following code. It only serves to select some random cases from each table and create some duplicate cases in each.↩︎