8.1 Introduction

All the data we considered so far appeared in the form of a single table (or multiple versions of the same table, some of which needed transforming and tidying etc.). In reality, however, many datasets are distributed over multiple tables. To analyze such data, an important first step is to combine those tables without messing up their contents. The dplyr package (Wickham et al., 2020b) contains some 2-table functions that allow just that.

8.1.1 Objectives

After working through this chapter, you should be able to use 2-table verbs of dplyr to:

  1. perform various mutating joins on 2 tables;
  2. perform various filtering joins on 2 tables;
  3. combine the cases of 2 tables with set operations; and
  4. know that base::merge also allows performing mutating joins.

8.1.2 Overview

Joining 2 tables is simple and straightforward when variables have common variables (names and types) and both tables describe the same cases (rows). However, it’s clear that we may encounter problems in which one table uses different names or types than the other one and that both tables can differ in the number of cases they contain (e.g., one may contain a sub-set of cases of the other one, or both contain some common but also some unique cases). In these instances, it’s absolutely crucial that we first understand each table by itself (its dimensions, cases and variables, contents) and their relation to each other (e.g., which cases and variables occur in both tables) before trying to join them.33

Terminology

To combine data from multiple tables, we need to specify how they are related (which is why this chapter is called “relational data”). A relation between tables typically is a variable that occurs in both tables: Provided that this variable has the same meaning in both tables, it is called a key (see Chapter 13.3 for details) and allows linking the observations (rows) from one table to those of the other one.

The verb “join” and the term “relational data” comes from relational databases (see Wikipedia and relational algebra (see Wikipedia).

Base R vs. tidyverse tools

When we are ready to join tables, base R offers basic base::rbind and base::cbind commands (to bind rows or columns of data), but these are intended for tables with identical variables or cases, respectively. Mutating joins of 2 data frames can be achieved by using base::merge (see below).

Alternatively, the tidyverse package dplyr, whose so-called 1-table verbs we have used for transforming data (see Chapter 3), contains additional 2-table verbs that allow joining 2 tables of data with 3 different types of commands:

  1. Mutating joins add new variables (columns) to one table from matching cases (rows) of another table (Section 8.2.1).

  2. Filtering joins filter observations (rows) from one table based on whether or not they match a case (row) in another table (Section 8.2.2).

  3. Set operations combine the observations in 2 tables (with the same variables) as if they were set elements (Section 8.2.4).

In this chapter, we will briefly describe the essential commands for each type of join. (See the vignette("two-table") and Chapter 13: Relational data of the r4ds textbook (Wickham & Grolemund, 2017) for additional details.)

8.1.3 Data used

In this chapter, we start practicing various join commands with some artificially generated data tables (available in the ds4psy package or from http://rpository.com/ds4psy/). In later sections and exercises we will use datasets contained in dplyr (e.g., the starwars dataset) and the real datasets described in Section B.1 and Section B.2 of Appendix B.

8.1.4 Getting ready

This chapter formerly assumed that you have read and worked through Chapter 13: Relational data of the r4ds textbook (Wickham & Grolemund, 2017). It now can be read by itself, but reading Chapter 13 of r4ds is still recommended.

Please do the following to get started:

  • Create an R Markdown (.Rmd) document (for instructions, see Appendix E and the templates linked in Section E.2).

  • Structure your document by inserting headings and empty lines between different parts. Here’s an example how your initial file could look:

---
title: "Chapter 8: Joining data"
author: "Your name"
date: "2020 May 25"
output: html_document
---

Add text or code chunks here.

# Exercises (08: Joining data)
  
## Exercise 1
  
## Exercise 2
  
etc. 
  
<!-- The end (eof). -->
  • Create an initial code chunk below the header of your .Rmd file that loads the R packages of the tidyverse (and see Section E.3.3 if you want to get rid of the messages and warnings of this chunk in your HTML output).

  • Save your file (e.g., as 08_join.Rmd in the R folder of your current project) and remember saving and knitting it regularly as you keep adding content to it.

So how can we join 2 tables of data?

References

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz

Wickham, H., François, R., Henry, L., & Müller, K. (2020b). dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr


  1. For this reason, most real-world joins are preceded by sequences of select, mutate, and arrange commands on one or both tables.