8.2 Essential join commands

As the commands for joining 2 tables x and y are implemented in the dplyr package (Wickham et al., 2020), the 2-table verbs use the same terminology as the 1-table verbs discussed in Chapter 3 on Data transformation: Mutating commands affect the variables (columns) of a table, whereas filtering commands affect the cases or observations (rows) of a table.

In the following, we distinguish between 3 types of joins:

  1. Mutating joins combine the variables (columns) of 2 tables (Section 8.2.1);

  2. Filtering joins affect only the cases or observations (rows) of 2 tables (Section 8.2.2);

  3. Set operations on 2 tables expect both tables to have the same variables and treat them like sets that can be combined or intersected (Section&nbsp8.2.3).

Finally, we will see that the base R command merge can also be used to accomplish mutating joins (Section 8.2.4).

8.2.1 Mutating joins

Mutating joins combine the variables (columns) of 2 tables x and y. The term mutating is used in analogy to the dplyr::mutate command: Just as mutate creates new variables (columns), a mutating join adds new variables (columns) to a table x from matching cases (rows) of another table y.

Commands

There are 4 types of mutating joins to combine the cases (rows) and variables (columns) of tables x and y:

  1. left_join(x, y): Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

  2. right_join(x, y): Same as left_join(y, x), and hence redundant.

  3. full_join(x, y): Return all rows and all columns that exist in x or in y. If there are non-matching values, return NA for the ones missing.

  4. inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

In contrast to the inner_join, the first 3 joins are called outer joins. See Section 13.4 on mutating joins (Wickham & Grolemund, 2017) for graphical illustrations of these joins.

Inner vs. outer joins

All 4 types of mutating joins combine variables (columns) from 2 tables, but differ in preserving in the cases (rows) in the output table. An inner join only keeps the cases (or rows) that appear in both tables. By contrast, the first three join commands keep cases (observations) that appear in at least one of the tables and are also called outer joins. The 3 types of outer joins also differ with respect to the cases they preserve:

  1. left_join keeps all cases of x.
  2. right_join keeps all cases of y.
  3. full_join keeps all cases of x or y.

Terminology of key variables

Beyond specifying the 2 tables to join, all *_join() commands take a by argument: by specifies a character vector of variables by which we want to join the tables (so-called keys).

In the r4ds textbook (Wickham & Grolemund, 2017), Section 13.3 on keys makes a distinction between primary and foreign keys, that stems from database terminology. These terms can be confusing, though the distinction behind it is actually quite simple:

  • A key is simply a variable that occurs in two distinct tables (x and y) and by which we want to join both tables.

  • Importantly, the variable of interest does not always need to have the same name in both tables. For instance, a variable may be called surname in x and family_name in y, but both these variables actually have the same referent. (i.e., the name of a specific person). The distinction between primary vs. secondary now depends on our perspective:

    • When looking from table x at table y, surname is a primary key and family_name a secondary key.

    • When looking from table y at table x, family_name is a primary key and surname a secondary key.

  • The primary key (in table x) and the corresponding foreign key in another table (y) form a so-called relation. Such relations are typically 1-to-many. For instance, each name or id variable in one table corresponds to a particular person, but each person can contribute data to several measurement occasions in another table (see, e.g., the variables id and intervention in the table posPsy_p_info vs. in the table posPsy_long of the ds4psy package). A special case occurs when there is a 1-to-1 relationship (see, e.g., the variables id and intervention in the tables posPsy_p_info and posPsy_wide of the ds4psy package).

Actually, the distinction between two different types of keys is irrelevant, as long as we know how we want to join the two tables (e.g., which table is our primary table) and which relation(s) exist between both tables (i.e., which variables occur in both tables).

The by argument of any *_join() command can either be NULL or a named vector:

  • If by = NULL (as by default) *_join will perform a natural join, using all variables with common names across the 2 tables. A message lists the variables that are used, so that we can verify that this is what we wanted.

  • To join by differently named variables on x and y, we can provide a named vector. For instance, by = c("a" = "b") will match a (primary) key variable a in table x to a (secondary) key variable b in table y.

Examples of joining tables

To test these commands, we load the following data files:

  • data_t1.csv, available as data_t1 in the ds4psy package
  • data_t2.csv, available as data_t2 in the ds4psy package

Note that t1 and t2 are both 20 x 4 tibbles and appear to contain the responses or test scores of people. This could suggest that they may be 2 measurements of the same people. However, before jumping into joining these tables, we should check whether they really contain the same people:

This confirms our intuition: t1 and t2 contain 2 variables of demographic information (name and gender) of 20 people and each table contains 2 additional variables (measurements or test scores).

Joining 2 tables with identical cases

When both tables contain identical cases, the 4 mutating join commands all yield the same result:

# 1. left_join:
m1 <- left_join(t1, t2)
m1
#> # A tibble: 20 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 2. right_join:
m2 <- right_join(t1, t2)
m2
#> # A tibble: 20 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 3. full_join:
m3 <- full_join(t1, t2)
m3
#> # A tibble: 20 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 4. inner_join:
m4 <- inner_join(t1, t2)
m4
#> # A tibble: 20 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# Verify equality:
all.equal(m1, m2)
#> [1] TRUE
all.equal(m1, m3)
#> [1] TRUE
all.equal(m1, m4)
#> [1] TRUE

Note that the join commands automatically used the 2 common variables as key variables to join the tables (i.e., used by = c("name", "gender") by default). This raises the question: What would happen if we specified only one of these variables in the by argument? Let’s check to find out:

We see that common variables not used for joining the tables get duplicated in the joint table and distinguished by a unique suffix. We can control the suffix of duplicate variables by providing a suffix argument:

Joining 2 tables with different cases

To see and appreciate the differences between the 4 mutating joins, we need tables that contain different and repeated cases. We can easily create these by selecting and duplicating some cases in each of our tables:35

The resulting tables t3 and t4 are both 10 x 4 tibbles and contain some common and some unique cases (people). In addition, each table contains 2 people twice:

Now let’s re-do the 4 mutating joins with t3 and t4 and inspect the resulting tables:

# 1. left_join:
m5 <- left_join(t3, t4)
m5  # => 12 x 6 table, containing all cases of t3, plus additional cases for duplicate cases.
#> # A tibble: 12 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 H.V.  female      1     4     NA    NA
#>  2 L.V.  female      4     2      2     4
#>  3 M.Y.  male        2     4     NA    NA
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 T.D.  male        5     2     NA    NA
#>  9 X.P.  female      4     1      4     1
#> 10 X.Y.  female      5     4     NA    NA
#> 11 Z.Y.  female     NA    NA     NA    NA
#> 12 Z.Y.  female     NA    NA     NA    NA

# 2. right_join:
m6 <- right_join(t3, t4)
m6  # => 12 x 6 table, containing all cases of t4, plus additional cases for duplicate cases.
#> # A tibble: 12 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female     NA    NA      4     1
#>  2 A.V.  female     NA    NA      5     3
#>  3 C.N.  female     NA    NA      2     2
#>  4 L.V.  female      4     2      2     4
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 Q.N.  female      6     1      6     2
#>  9 T.V.  male       NA    NA      4     4
#> 10 X.P.  female      4     1      4     1
#> 11 Y.G.  male       NA    NA      4     4
#> 12 Y.G.  male       NA    NA      4     4

all.equal(m5, m6)  # shows differences, but 
#> [1] "Rows in x but not y: 11, 10, 8, 3, 1. Rows in y but not x: 11, 3, 2, 1, 9. "
all.equal(m5, right_join(t4, t3)) # is TRUE
#> [1] TRUE

# 3. full_join:
m7 <- full_join(t3, t4)
m7  # => 17 x 6 table, i.e. all cases of t3 and of t4
#> # A tibble: 18 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 H.V.  female      1     4     NA    NA
#>  2 L.V.  female      4     2      2     4
#>  3 M.Y.  male        2     4     NA    NA
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 T.D.  male        5     2     NA    NA
#>  9 X.P.  female      4     1      4     1
#> 10 X.Y.  female      5     4     NA    NA
#> 11 Z.Y.  female     NA    NA     NA    NA
#> 12 Z.Y.  female     NA    NA     NA    NA
#> 13 A.F.  female     NA    NA      4     1
#> 14 A.V.  female     NA    NA      5     3
#> 15 C.N.  female     NA    NA      2     2
#> 16 T.V.  male       NA    NA      4     4
#> 17 Y.G.  male       NA    NA      4     4
#> 18 Y.G.  male       NA    NA      4     4

# 4. inner_join:
m8 <- inner_join(t3, t4)
m8  # => 7 x 6 table, i.e. all cases that were in both t3 and t4
#> # A tibble: 6 x 6
#>   name  gender like_1 bnt_1 like_2 bnt_2
#>   <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#> 1 L.V.  female      4     2      2     4
#> 2 Q.N.  female      6     1      6     2
#> 3 Q.N.  female      6     1      6     2
#> 4 Q.N.  female      6     1      6     2
#> 5 Q.N.  female      6     1      6     2
#> 6 X.P.  female      4     1      4     1

Thus, when joining 2 tables with different cases, it really matters which join-command we’re using.

8.2.2 Filtering joins

Filtering joins match cases (or observations) in the same way as mutating joins, but only affect the cases (observations or rows), not the variables (columns). There are 2 types of filtering joins:

  1. semi_join(x, y) keeps all observations in x that have a match in y (i.e., observations in x without a match in y are dropped).

  2. anti_join(x, y) drops all observations in x that have a match in y (i.e., observations in x without a match in y are kept).

See Section 13.5 on filtering joins (Wickham & Grolemund, 2017) for graphical illustrations.

Notes

  • anti_join(x, y) yields the complement of semi_join(x, y) (with respect to x).

  • Both types of filtering joins concern only the cases of x and filter them by the cases of y.

  • The resulting table of a filtering join preserves the variables of x, not the variables of y.

Use cases

Filtering joins typically reduce the cases of x to the cases contained in y (semi_join) or to the cases not contained in y (anti_join). Typical use cases include:

  • Semi-joins are useful for matching filtered summary tables back to the original rows:
  • Anti-joins are useful for diagnosing join mismatches:

See Section 13.6 for common join problems (Wickham & Grolemund, 2017) and ways of dealing with them.

8.2.3 Set operations

Set operations on 2 tables x and y expect both tables to have the same variables. They always compare complete rows of x and y, and compare the values of every variable. Treating the observations like sets yields 3 different commands:

  1. intersect(x, y) returns only observations that occur in both x and in y.

  2. union(x, y) returns all unique observations in x and y.

  3. setdiff(x, y) return those observations that occur in x, but not in y.

Examples

# Data: 
sw <- dplyr::starwars %>%
  select(name:species)  # only first 10 columns of starwars
sw  # 87 x 10
#> # A tibble: 87 x 10
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
#>  4 Dart…    202   136 none       white      yellow          41.9 male  
#>  5 Leia…    150    49 brown      light      brown           19   female
#>  6 Owen…    178   120 brown, gr… light      blue            52   male  
#>  7 Beru…    165    75 brown      light      blue            47   female
#>  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
#>  9 Bigg…    183    84 black      light      brown           24   male  
#> 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
#> # … with 77 more rows, and 2 more variables: homeworld <chr>, species <chr>

# Create 2 filtered sub-sets:
s1 <- sw %>% filter(species == "Human", height < 180)
s1  # 15 indivduals
#> # A tibble: 15 x 10
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue              19 male  
#>  2 Leia…    150    49 brown      light      brown             19 female
#>  3 Owen…    178   120 brown, gr… light      blue              52 male  
#>  4 Beru…    165    75 brown      light      blue              47 female
#>  5 Wedg…    170    77 brown      fair       hazel             21 male  
#>  6 Palp…    170    75 grey       pale       yellow            82 male  
#>  7 Land…    177    79 black      dark       brown             31 male  
#>  8 Lobot    175    79 none       light      blue              37 male  
#>  9 Mon …    150    NA auburn     fair       blue              48 female
#> 10 Fini…    170    NA blond      fair       blue              91 male  
#> 11 Shmi…    163    NA black      fair       brown             72 female
#> 12 Cordé    157    NA brown      light      brown             NA female
#> 13 Dormé    165    NA brown      light      brown             NA female
#> 14 Joca…    167    NA white      fair       blue              NA female
#> 15 Padm…    165    45 brown      light      brown             46 female
#> # … with 2 more variables: homeworld <chr>, species <chr>

s2 <- sw %>% filter(homeworld == "Naboo" | homeworld == "Tatooine")
s2  # 21 individuals
#> # A tibble: 21 x 10
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
#>  4 Dart…    202   136 none       white      yellow          41.9 male  
#>  5 Owen…    178   120 brown, gr… light      blue            52   male  
#>  6 Beru…    165    75 brown      light      blue            47   female
#>  7 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
#>  8 Bigg…    183    84 black      light      brown           24   male  
#>  9 Anak…    188    84 blond      fair       blue            41.9 male  
#> 10 Palp…    170    75 grey       pale       yellow          82   male  
#> # … with 11 more rows, and 2 more variables: homeworld <chr>, species <chr>

# Set operations:
intersect(s1, s2)  #  8 individuals are in both s1 and s2
#> # A tibble: 8 x 10
#>   name  height  mass hair_color skin_color eye_color birth_year gender homeworld
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>    
#> 1 Luke…    172    77 blond      fair       blue              19 male   Tatooine 
#> 2 Owen…    178   120 brown, gr… light      blue              52 male   Tatooine 
#> 3 Beru…    165    75 brown      light      blue              47 female Tatooine 
#> 4 Palp…    170    75 grey       pale       yellow            82 male   Naboo    
#> 5 Shmi…    163    NA black      fair       brown             72 female Tatooine 
#> 6 Cordé    157    NA brown      light      brown             NA female Naboo    
#> 7 Dormé    165    NA brown      light      brown             NA female Naboo    
#> 8 Padm…    165    45 brown      light      brown             46 female Naboo    
#> # … with 1 more variable: species <chr>
union(s1, s2)      # 28 individuals are in the union of both sets
#> # A tibble: 28 x 10
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue              19 male  
#>  2 Leia…    150    49 brown      light      brown             19 female
#>  3 Owen…    178   120 brown, gr… light      blue              52 male  
#>  4 Beru…    165    75 brown      light      blue              47 female
#>  5 Wedg…    170    77 brown      fair       hazel             21 male  
#>  6 Palp…    170    75 grey       pale       yellow            82 male  
#>  7 Land…    177    79 black      dark       brown             31 male  
#>  8 Lobot    175    79 none       light      blue              37 male  
#>  9 Mon …    150    NA auburn     fair       blue              48 female
#> 10 Fini…    170    NA blond      fair       blue              91 male  
#> # … with 18 more rows, and 2 more variables: homeworld <chr>, species <chr>
setdiff(s1, s2)    #  7 individuals in s1 and not in s2
#> # A tibble: 7 x 10
#>   name  height  mass hair_color skin_color eye_color birth_year gender homeworld
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>    
#> 1 Leia…    150    49 brown      light      brown             19 female Alderaan 
#> 2 Wedg…    170    77 brown      fair       hazel             21 male   Corellia 
#> 3 Land…    177    79 black      dark       brown             31 male   Socorro  
#> 4 Lobot    175    79 none       light      blue              37 male   Bespin   
#> 5 Mon …    150    NA auburn     fair       blue              48 female Chandrila
#> 6 Fini…    170    NA blond      fair       blue              91 male   Coruscant
#> 7 Joca…    167    NA white      fair       blue              NA female Coruscant
#> # … with 1 more variable: species <chr>
setdiff(s2, s1)    # 13 individuals in s2 and not in s1
#> # A tibble: 13 x 10
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  2 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
#>  3 Dart…    202   136 none       white      yellow          41.9 male  
#>  4 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
#>  5 Bigg…    183    84 black      light      brown           24   male  
#>  6 Anak…    188    84 blond      fair       blue            41.9 male  
#>  7 Jar …    196    66 none       orange     orange          52   male  
#>  8 Roos…    224    82 none       grey       orange          NA   male  
#>  9 Rugo…    206    NA none       green      orange          NA   male  
#> 10 Ric …    183    NA brown      fair       blue            NA   male  
#> 11 Quar…    183    NA black      dark       brown           62   male  
#> 12 Greg…    185    85 black      dark       brown           NA   male  
#> 13 Clie…    183    NA brown      fair       blue            82   male  
#> # … with 2 more variables: homeworld <chr>, species <chr>

8.2.4 Using merge

The base R function base::merge() can perform all 4 types of mutating join:

Table 8.1: Mapping the base R function merge() to the join commands of dplyr.
dplyr: base R:
inner_join(x, y) merge(x, y)
left_join(x, y) merge(x, y, all.x = TRUE)
right_join(x, y) merge(x, y, all.y = TRUE)
full_join(x, y) merge(x, y, all.x = TRUE, all.y = TRUE)

Note that merge also accepts various by commands (see ?merge for documentation).

The advantages of the dplyr verbs are that:

  1. they more clearly convey the intended join (by verbally expressing what is implied by the arguments of merge);

  2. they are faster and preserve the order of cases (rows).

See Chapter 13 of r4ds (Wickham & Grolemund, 2017) for other implementations of joins (e.g., in SQL).

References

Wickham, H., François, R., Henry, L., & Müller, K. (2020). dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz


  1. It’s not crucial to understand the following code. It only serves to select some random cases from each table and create some duplicate cases in each.