9 How to join two data tables

In my opinion, the need to join two data tables together, is one of indicators (of course there are others) that we are dealing with complex data analysis. In practice, sometimes we may need to join many, say five, or even ten tables together. Hadley Wickham’s dplyr package provides left_join(), right_join(), inner_join(), full_join(), semi_join() and anti_join(); among them, the most useful one probably is left_join(). Let me quote a couple of sentences from R for Data Science (by Garrett Grolemund and Hadley Wickham),

The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.

The syntax for left_join() is

left_join(table_1, table_2, by = c("ID_1" = "ID_2"))

The key point to keep in mind is that table_1 is to the left of table_2, which means table_1 is the main table, or to be more exact, which means all the rows and columns of table_1 will be kept.

Let’s have an example.

Example.

rm(list = ls())

# load packages
library(dplyr)

# create a fake data sets
table_1 <- data_frame(ID_1 = LETTERS[1:6], x = rep(1:3, each = 2))
table_2 <- data_frame(ID_2 = c(LETTERS[1:3], LETTERS[7:8], "A", "B"), y = 1:7)

# join tables 1 and 2
table_3 <- left_join(table_1, table_2, by = c("ID_1" = "ID_2"))

# shwo the three tables
list(table_1, table_2, table_3)
## [[1]]
## # A tibble: 6 x 2
##    ID_1     x
##   <chr> <int>
## 1     A     1
## 2     B     1
## 3     C     2
## 4     D     2
## 5     E     3
## 6     F     3
## 
## [[2]]
## # A tibble: 7 x 2
##    ID_2     y
##   <chr> <int>
## 1     A     1
## 2     B     2
## 3     C     3
## 4     G     4
## 5     H     5
## 6     A     6
## 7     B     7
## 
## [[3]]
## # A tibble: 8 x 3
##    ID_1     x     y
##   <chr> <int> <int>
## 1     A     1     1
## 2     A     1     6
## 3     B     1     2
## 4     B     1     7
## 5     C     2     3
## 6     D     2    NA
## 7     E     3    NA
## 8     F     3    NA

Exercises.

  1. Redo the above Example to recreate table_3 but using right_join.

  2. Try inner_join() and semi_join() on table_1 and table_2 in the above example, and observe the difference between the two resulted tables.

  3. Try full_join() on table_1 and table_2 in the above example.

  4. Try anti_join() on table_1 and table_2 in the above example.

Answer to the exercises:

rm(list = ls())

# load packages
library(dplyr)

# create a fake data sets
table_1 <- data_frame(ID_1 = LETTERS[1:6], x = rep(1:3, each = 2))
table_2 <- data_frame(ID_2 = c(LETTERS[1:3], LETTERS[7:8], "A", "B"), y = 1:7)

# Exercise 1
table_3 <- 
  table_2 %>% 
  right_join(table_1, by = c("ID_2" = "ID_1")) %>% 
  # change the order of columns
  select(ID_2, x, y)

# Exercise 2
table_4 <- inner_join(table_1, table_2,  by = c("ID_1" = "ID_2"))
table_5 <- semi_join(table_1, table_2,  by = c("ID_1" = "ID_2"))

# Exercise 3
table_6 <- full_join(table_1, table_2,  by = c("ID_1" = "ID_2"))

# Exercise 4
table_7 <- anti_join(table_1, table_2,  by = c("ID_1" = "ID_2"))

# show tables 4 to 7
list(table_4, table_5, table_6, table_7)