8.2 Essential join commands

As the commands for joining two tables x and y are implemented in the dplyr package (Wickham et al., 2020), the two-table verbs use the same terminology as the one-table verbs discussed in Chapter 3 on Data transformation: Mutating commands affect the variables (columns) of a table, whereas filtering commands affect the cases or observations (rows) of a table.

In the following, we distinguish between 3 types of joins:

  1. Mutating joins combine the variables (columns) of 2 tables (Section 8.2.1);

  2. Filtering joins affect only the cases or observations (rows) of 2 tables (Section 8.2.2);

  3. Set operations on 2 tables expect both tables to have the same variables and treat them like sets that can be combined or intersected (Section&nbsp8.2.3).

Finally, we will see that the base R command merge can also be used to accomplish mutating joins (Section 8.2.4).

8.2.1 Mutating joins

Mutating joins combine the variables (columns) of 2 tables x and y. The term mutating is used in analogy to the dplyr::mutate command: Just as mutate creates new variables (columns), a mutating join adds new variables (columns) to a table x from matching cases (rows) of another table y.

Commands

There are 4 types of mutating joins to combine the cases (rows) and variables (columns) of tables x and y:

  1. left_join(x, y): Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

  2. right_join(x, y): Same as left_join(y, x), and hence redundant.

  3. full_join(x, y): Return all rows and all columns that exist in x or in y. If there are non-matching values, return NA for the ones missing.

  4. inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

In contrast to the inner_join, the first 3 joins are called outer joins. See Section 13.4 on mutating joins (Wickham & Grolemund, 2017) for graphical illustrations of these joins.

Inner vs. outer joins

All 4 types of mutating joins combine variables (columns) from 2 tables, but differ in preserving in the cases (rows) in the output table. An inner join only keeps the cases (or rows) that appear in both tables. By contrast, the first three join commands keep cases (observations) that appear in at least one of the tables and are also called outer joins. The 3 types of outer joins also differ with respect to the cases they preserve:

  1. left_join keeps all cases of x.
  2. right_join keeps all cases of y.
  3. full_join keeps all cases of x or y.

Terminology of key variables

Beyond specifying the 2 tables to join, all *_join() commands take a by argument: by specifies a character vector of variables by which we want to join the tables (so-called keys).

In the r4ds textbook (Wickham & Grolemund, 2017), Section 13.3 on keys makes a distinction between primary and foreign keys, that stems from database terminology. These terms can be confusing, though the distinction behind it is actually quite simple:

  • A key is simply a variable that occurs in two distinct tables (x and y) and by which we want to join both tables.

  • Importantly, the variable of interest does not always need to have the same name in both tables. For instance, a variable may be called surname in x and family_name in y, but both these variables actually have the same referent. (i.e., the name of a specific person). The distinction between primary vs. secondary now depends on our perspective:

    • When looking from table x at table y, surname is a primary key and family_name a secondary key.

    • When looking from table y at table x, family_name is a primary key and surname a secondary key.

  • The primary key (in table x) and the corresponding foreign key in another table (y) form a so-called relation. Such relations are typically one-to-many. For instance, each name or id variable in one table corresponds to a particular person, but each person can contribute data to several measurement occasions in another table (see, e.g., the variables id and intervention in the table posPsy_p_info vs. in the table posPsy_long of the ds4psy package). A special case occurs when there is a one-to-one relationship (see, e.g., the variables id and intervention in the tables posPsy_p_info and posPsy_wide of the ds4psy package).

Actually, the distinction between two different types of keys is irrelevant, as long as we know how we want to join the two tables (e.g., which table is our primary table) and which relation(s) exist between both tables (i.e., which variables occur in both tables).

The by argument of any *_join() command can either be NULL or a named vector:

  • If by = NULL (as by default) *_join will perform a natural join, using all variables with common names across the 2 tables. A message lists the variables that are used, so that we can verify that this is what we wanted.

  • To join by differently named variables on x and y, we can provide a named vector. For instance, by = c("a" = "b") will match a (primary) key variable a in table x to a (secondary) key variable b in table y.

Examples of joining tables

To test these commands, we load the following data files:

  • data_t1.csv, available as data_t1 in the ds4psy package
  • data_t2.csv, available as data_t2 in the ds4psy package

Note that t1 and t2 are both 20 x 4 tibbles and appear to contain the responses or test scores of people. This could suggest that they may be 2 measurements of the same people. However, before jumping into joining these tables, we should check whether they really contain the same people:

This confirms our intuition: t1 and t2 contain 2 variables of demographic information (name and gender) of 20 people and each table contains 2 additional variables (measurements or test scores).

Joining 2 tables with identical cases

When both tables contain identical cases, the 4 mutating join commands all yield the same result:

# 1. left_join:
m1 <- left_join(t1, t2)
m1
#> # A tibble: 20 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 2. right_join:
m2 <- right_join(t1, t2)
m2
#> # A tibble: 20 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 3. full_join:
m3 <- full_join(t1, t2)
m3
#> # A tibble: 20 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# 4. inner_join:
m4 <- inner_join(t1, t2)
m4
#> # A tibble: 20 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 A.F.  female      5     1      4     1
#>  2 A.V.  female      3     2      5     3
#>  3 C.N.  female      4     3      2     2
#>  4 H.V.  female      1     4      5     2
#>  5 K.R.  male        4     1      6     1
#>  6 K.X.  male        3     3      5     4
#>  7 L.S.  male        4     4      7     4
#>  8 L.V.  female      4     2      2     4
#>  9 M.O.  male        4     1     NA    NA
#> 10 M.Y.  male        2     4      6     2
#> 11 Q.N.  female      6     1      6     2
#> 12 R.S.  female      2    NA      7     3
#> 13 T.D.  male        5     2      3    NA
#> 14 T.V.  male        6     4      4     4
#> 15 X.D.  female      4     4      5     1
#> 16 X.P.  female      4     1      4     1
#> 17 X.Y.  female      5     4      5     2
#> 18 Y.G.  male        4     4      4     4
#> 19 Z.D.  male        5     4      7     4
#> 20 Z.Y.  female     NA    NA      7     1

# Verify equality:
all.equal(m1, m2)
#> [1] TRUE
all.equal(m1, m3)
#> [1] TRUE
all.equal(m1, m4)
#> [1] TRUE

Note that the join commands automatically used the 2 common variables as key variables to join the tables (i.e., used by = c("name", "gender") by default). This raises the question: What would happen if we specified only one of these variables in the by argument? Let’s check to find out:

We see that common variables not used for joining the tables get duplicated in the joint table and distinguished by a unique suffix. We can control the suffix of duplicate variables by providing a suffix argument:

Joining 2 tables with different cases

To see and appreciate the differences between the 4 mutating joins, we need tables that contain different and repeated cases. We can easily create these by selecting and duplicating some cases in each of our tables:41

The resulting tables t3 and t4 are both 10 x 4 tibbles and contain some common and some unique cases (people). In addition, each table contains 2 people twice:

Now let’s re-do the 4 mutating joins with t3 and t4 and inspect the resulting tables:

# 1. left_join:
m5 <- left_join(t3, t4)
m5  # => 12 x 6 table, containing all cases of t3, plus additional cases for duplicate cases.
#> # A tibble: 12 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 H.V.  female      1     4     NA    NA
#>  2 L.V.  female      4     2      2     4
#>  3 M.Y.  male        2     4     NA    NA
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 T.D.  male        5     2     NA    NA
#>  9 X.P.  female      4     1      4     1
#> 10 X.Y.  female      5     4     NA    NA
#> 11 Z.Y.  female     NA    NA     NA    NA
#> 12 Z.Y.  female     NA    NA     NA    NA

# 2. right_join:
m6 <- right_join(t3, t4)
m6  # => 12 x 6 table, containing all cases of t4, plus additional cases for duplicate cases.
#> # A tibble: 12 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 L.V.  female      4     2      2     4
#>  2 Q.N.  female      6     1      6     2
#>  3 Q.N.  female      6     1      6     2
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 X.P.  female      4     1      4     1
#>  7 A.F.  female     NA    NA      4     1
#>  8 A.V.  female     NA    NA      5     3
#>  9 C.N.  female     NA    NA      2     2
#> 10 T.V.  male       NA    NA      4     4
#> 11 Y.G.  male       NA    NA      4     4
#> 12 Y.G.  male       NA    NA      4     4

all.equal(m5, m6)  # shows differences, but 
#> [1] "Component \"name\": 10 string mismatches"                              
#> [2] "Component \"gender\": 5 string mismatches"                             
#> [3] "Component \"like_1\": 'is.NA' value mismatch: 6 in current 2 in target"
#> [4] "Component \"bnt_1\": 'is.NA' value mismatch: 6 in current 2 in target" 
#> [5] "Component \"like_2\": 'is.NA' value mismatch: 0 in current 6 in target"
#> [6] "Component \"bnt_2\": 'is.NA' value mismatch: 0 in current 6 in target"
all.equal(m5, right_join(t4, t3)) # is TRUE
#> [1] "Names: 4 string mismatches"                                   
#> [2] "Component \"name\": 7 string mismatches"                      
#> [3] "Component \"gender\": 2 string mismatches"                    
#> [4] "Component 3: 'is.NA' value mismatch: 6 in current 2 in target"
#> [5] "Component 4: 'is.NA' value mismatch: 6 in current 2 in target"
#> [6] "Component 5: 'is.NA' value mismatch: 2 in current 6 in target"
#> [7] "Component 6: 'is.NA' value mismatch: 2 in current 6 in target"

# 3. full_join:
m7 <- full_join(t3, t4)
m7  # => 17 x 6 table, i.e. all cases of t3 and of t4
#> # A tibble: 18 x 6
#>    name  gender like_1 bnt_1 like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#>  1 H.V.  female      1     4     NA    NA
#>  2 L.V.  female      4     2      2     4
#>  3 M.Y.  male        2     4     NA    NA
#>  4 Q.N.  female      6     1      6     2
#>  5 Q.N.  female      6     1      6     2
#>  6 Q.N.  female      6     1      6     2
#>  7 Q.N.  female      6     1      6     2
#>  8 T.D.  male        5     2     NA    NA
#>  9 X.P.  female      4     1      4     1
#> 10 X.Y.  female      5     4     NA    NA
#> 11 Z.Y.  female     NA    NA     NA    NA
#> 12 Z.Y.  female     NA    NA     NA    NA
#> 13 A.F.  female     NA    NA      4     1
#> 14 A.V.  female     NA    NA      5     3
#> 15 C.N.  female     NA    NA      2     2
#> 16 T.V.  male       NA    NA      4     4
#> 17 Y.G.  male       NA    NA      4     4
#> 18 Y.G.  male       NA    NA      4     4

# 4. inner_join:
m8 <- inner_join(t3, t4)
m8  # => 7 x 6 table, i.e. all cases that were in both t3 and t4
#> # A tibble: 6 x 6
#>   name  gender like_1 bnt_1 like_2 bnt_2
#>   <chr> <chr>   <dbl> <dbl>  <dbl> <dbl>
#> 1 L.V.  female      4     2      2     4
#> 2 Q.N.  female      6     1      6     2
#> 3 Q.N.  female      6     1      6     2
#> 4 Q.N.  female      6     1      6     2
#> 5 Q.N.  female      6     1      6     2
#> 6 X.P.  female      4     1      4     1

Thus, when joining 2 tables with different cases, it really matters which join-command we’re using.

8.2.2 Filtering joins

Filtering joins match cases (or observations) in the same way as mutating joins, but only affect the cases (observations or rows), not the variables (columns). There are 2 types of filtering joins:

  1. semi_join(x, y) keeps all observations in x that have a match in y (i.e., observations in x without a match in y are dropped).

  2. anti_join(x, y) drops all observations in x that have a match in y (i.e., observations in x without a match in y are kept).

See Section 13.5 on filtering joins (Wickham & Grolemund, 2017) for graphical illustrations.

Notes

  • anti_join(x, y) yields the complement of semi_join(x, y) (with respect to x).

  • Both types of filtering joins concern only the cases of x and filter them by the cases of y.

  • The resulting table of a filtering join preserves the variables of x, not the variables of y.

Use cases

Filtering joins typically reduce the cases of x to the cases contained in y (semi_join) or to the cases not contained in y (anti_join). Typical use cases include:

  • Semi-joins are useful for matching filtered summary tables back to the original rows:
  • Anti-joins are useful for diagnosing join mismatches:

See Section 13.6 for common join problems (Wickham & Grolemund, 2017) and ways of dealing with them.

8.2.3 Set operations

Set operations on 2 tables x and y expect both tables to have the same variables. They always compare complete rows of x and y, and compare the values of every variable. Treating the observations like sets yields 3 different commands:

  1. intersect(x, y) returns only observations that occur in both x and in y.

  2. union(x, y) returns all unique observations in x and y.

  3. setdiff(x, y) return those observations that occur in x, but not in y.

Examples

# Data: 
sw <- dplyr::starwars %>%
  select(name:species)  # only first 10 columns of starwars
sw  # 87 x 10
#> # A tibble: 87 x 11
#>    name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  mascu…
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu…
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu…
#>  4 Dart…    202   136 none       white      yellow          41.9 male  mascu…
#>  5 Leia…    150    49 brown      light      brown           19   fema… femin…
#>  6 Owen…    178   120 brown, gr… light      blue            52   male  mascu…
#>  7 Beru…    165    75 brown      light      blue            47   fema… femin…
#>  8 R5-D4     97    32 <NA>       white, red red             NA   none  mascu…
#>  9 Bigg…    183    84 black      light      brown           24   male  mascu…
#> 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
#> # … with 77 more rows, and 2 more variables: homeworld <chr>, species <chr>

# Create 2 filtered sub-sets:
s1 <- sw %>% filter(species == "Human", height < 180)
s1  # 15 indivduals
#> # A tibble: 15 x 11
#>    name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke…    172    77 blond      fair       blue              19 male  mascu…
#>  2 Leia…    150    49 brown      light      brown             19 fema… femin…
#>  3 Owen…    178   120 brown, gr… light      blue              52 male  mascu…
#>  4 Beru…    165    75 brown      light      blue              47 fema… femin…
#>  5 Wedg…    170    77 brown      fair       hazel             21 male  mascu…
#>  6 Palp…    170    75 grey       pale       yellow            82 male  mascu…
#>  7 Land…    177    79 black      dark       brown             31 male  mascu…
#>  8 Lobot    175    79 none       light      blue              37 male  mascu…
#>  9 Mon …    150    NA auburn     fair       blue              48 fema… femin…
#> 10 Fini…    170    NA blond      fair       blue              91 male  mascu…
#> 11 Shmi…    163    NA black      fair       brown             72 fema… femin…
#> 12 Cordé    157    NA brown      light      brown             NA fema… femin…
#> 13 Dormé    165    NA brown      light      brown             NA fema… femin…
#> 14 Joca…    167    NA white      fair       blue              NA fema… femin…
#> 15 Padm…    165    45 brown      light      brown             46 fema… femin…
#> # … with 2 more variables: homeworld <chr>, species <chr>

s2 <- sw %>% filter(homeworld == "Naboo" | homeworld == "Tatooine")
s2  # 21 individuals
#> # A tibble: 21 x 11
#>    name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  mascu…
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu…
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu…
#>  4 Dart…    202   136 none       white      yellow          41.9 male  mascu…
#>  5 Owen…    178   120 brown, gr… light      blue            52   male  mascu…
#>  6 Beru…    165    75 brown      light      blue            47   fema… femin…
#>  7 R5-D4     97    32 <NA>       white, red red             NA   none  mascu…
#>  8 Bigg…    183    84 black      light      brown           24   male  mascu…
#>  9 Anak…    188    84 blond      fair       blue            41.9 male  mascu…
#> 10 Palp…    170    75 grey       pale       yellow          82   male  mascu…
#> # … with 11 more rows, and 2 more variables: homeworld <chr>, species <chr>

# Set operations:
intersect(s1, s2)  #  8 individuals are in both s1 and s2
#> # A tibble: 8 x 11
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke…    172    77 blond      fair       blue              19 male  mascu…
#> 2 Owen…    178   120 brown, gr… light      blue              52 male  mascu…
#> 3 Beru…    165    75 brown      light      blue              47 fema… femin…
#> 4 Palp…    170    75 grey       pale       yellow            82 male  mascu…
#> 5 Shmi…    163    NA black      fair       brown             72 fema… femin…
#> 6 Cordé    157    NA brown      light      brown             NA fema… femin…
#> 7 Dormé    165    NA brown      light      brown             NA fema… femin…
#> 8 Padm…    165    45 brown      light      brown             46 fema… femin…
#> # … with 2 more variables: homeworld <chr>, species <chr>
union(s1, s2)      # 28 individuals are in the union of both sets
#> # A tibble: 28 x 11
#>    name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke…    172    77 blond      fair       blue              19 male  mascu…
#>  2 Leia…    150    49 brown      light      brown             19 fema… femin…
#>  3 Owen…    178   120 brown, gr… light      blue              52 male  mascu…
#>  4 Beru…    165    75 brown      light      blue              47 fema… femin…
#>  5 Wedg…    170    77 brown      fair       hazel             21 male  mascu…
#>  6 Palp…    170    75 grey       pale       yellow            82 male  mascu…
#>  7 Land…    177    79 black      dark       brown             31 male  mascu…
#>  8 Lobot    175    79 none       light      blue              37 male  mascu…
#>  9 Mon …    150    NA auburn     fair       blue              48 fema… femin…
#> 10 Fini…    170    NA blond      fair       blue              91 male  mascu…
#> # … with 18 more rows, and 2 more variables: homeworld <chr>, species <chr>
setdiff(s1, s2)    #  7 individuals in s1 and not in s2
#> # A tibble: 7 x 11
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Leia…    150    49 brown      light      brown             19 fema… femin…
#> 2 Wedg…    170    77 brown      fair       hazel             21 male  mascu…
#> 3 Land…    177    79 black      dark       brown             31 male  mascu…
#> 4 Lobot    175    79 none       light      blue              37 male  mascu…
#> 5 Mon …    150    NA auburn     fair       blue              48 fema… femin…
#> 6 Fini…    170    NA blond      fair       blue              91 male  mascu…
#> 7 Joca…    167    NA white      fair       blue              NA fema… femin…
#> # … with 2 more variables: homeworld <chr>, species <chr>
setdiff(s2, s1)    # 13 individuals in s2 and not in s1
#> # A tibble: 13 x 11
#>    name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu…
#>  2 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu…
#>  3 Dart…    202   136 none       white      yellow          41.9 male  mascu…
#>  4 R5-D4     97    32 <NA>       white, red red             NA   none  mascu…
#>  5 Bigg…    183    84 black      light      brown           24   male  mascu…
#>  6 Anak…    188    84 blond      fair       blue            41.9 male  mascu…
#>  7 Jar …    196    66 none       orange     orange          52   male  mascu…
#>  8 Roos…    224    82 none       grey       orange          NA   male  mascu…
#>  9 Rugo…    206    NA none       green      orange          NA   male  mascu…
#> 10 Ric …    183    NA brown      fair       blue            NA   <NA>  <NA>  
#> 11 Quar…    183    NA black      dark       brown           62   <NA>  <NA>  
#> 12 Greg…    185    85 black      dark       brown           NA   male  mascu…
#> 13 Clie…    183    NA brown      fair       blue            82   male  mascu…
#> # … with 2 more variables: homeworld <chr>, species <chr>

8.2.4 Using merge

The base R function base::merge() can perform all 4 types of mutating join:

Table 8.1: Mapping the base R function merge() to the join commands of dplyr.
dplyr: base R:
inner_join(x, y) merge(x, y) |
left_join(x, y) merge(x, y, all.x = TRUE) |
right_join(x, y) merge(x, y, all.y = TRUE) |
full_join(x, y) merge(x, y, all.x = TRUE, all.y = TRUE) |

Note that merge also accepts various by commands (see ?merge for documentation).

The advantages of the dplyr verbs are that:

  1. they more clearly convey the intended join (by verbally expressing what is implied by the arguments of merge);

  2. they are faster and preserve the order of cases (rows).

See Chapter 13 of r4ds (Wickham & Grolemund, 2017) for other implementations of joins (e.g., in SQL).

References

Wickham, H., François, R., Henry, L., & Müller, K. (2020). dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz


  1. It’s not crucial to understand the following code. It only serves to select some random cases from each table and create some duplicate cases in each.