8.1 Introduction
All the data we considered so far appeared in the form of a single table (or multiple versions of the same table, some of which needed transforming and tidying etc.). In reality, however, many datasets are distributed over multiple tables. To analyze such data, an important first step is to combine those tables without messing up their contents. The dplyr package (Wickham, François, et al., 2023) contains some two-table functions that allow just that.
8.1.1 Objectives
After working through this chapter, you should be able to use two-table verbs of dplyr to:
- perform various mutating joins on two tables;
- perform various filtering joins on two tables;
- combine the cases of two tables with set operations; and
- know that the base R function
merge()
also allows performing mutating joins.
8.1.2 Overview
Joining data tables is simple and straightforward when their variables have common variables (names and types) and both tables describe the same cases (rows). However, it’s clear that we may encounter problems in which one table uses different names or types than the other one and that both tables can differ in the number of cases they contain (e.g., one may contain a sub-set of cases of the other one, or both contain some common but also some unique cases). In these instances, it’s absolutely crucial that we first understand each table by itself (its dimensions, cases and variables, contents) and their relation to each other (e.g., which cases and variables occur in both tables) before trying to join them.50
Terminology
To combine data from multiple tables, we need to specify how they are related (which is why this chapter is called “relational data”). A relation between tables typically is a variable that occurs in both tables: Provided that this variable has the same meaning in both tables, it is called a key (see Chapter 13.3 for details) and allows linking the observations (rows) from one table to those of the other one.
The verb “join” and the term “relational data” comes from relational databases (see Wikipedia and relational algebra (see Wikipedia).
Base R vs. tidyverse tools
When we are ready to join tables, base R offers basic base::rbind()
and base::cbind()
commands (to bind rows or columns of data), but these are intended for tables with identical variables or cases, respectively. Mutating joins of two data frames can be achieved by using base::merge()
(see below).
Alternatively, the tidyverse package dplyr, whose so-called one-table verbs we have used for transforming data (see Chapter 3), contains additional two-table verbs that allow joining two tables of data with three different types of commands:
Mutating joins add new variables (columns) to one table from matching cases (rows) of another table (Section 8.2.1).
Filtering joins filter observations (rows) from one table based on whether or not they match a case (row) in another table (Section 8.2.2).
Set operations combine the observations in two tables (with the same variables) as if they were set elements (Section 8.2.4).
In this chapter, we will briefly describe the essential commands for each type of join.
(See the vignette("two-table")
and Chapter 13: Relational data of the r4ds textbook (Wickham & Grolemund, 2017) for additional details.)
8.1.3 Data used
In this chapter, we start practicing various join commands with some artificially generated data tables of the ds4psy package (Neth, 2023). In later sections and exercises, we will use datasets contained in dplyr (e.g., the starwars
dataset) (Wickham, François, et al., 2023)
and the real datasets described in Section B.1 and Section B.2 of Appendix B.
8.1.4 Getting ready
This chapter formerly assumed that you have read and worked through Chapter 13: Relational data of the r4ds textbook (Wickham & Grolemund, 2017). It now can be read by itself, but reading Chapter 13 of r4ds is still recommended.
Please do the following to get started:
Create an R Markdown (
.Rmd
) document (for instructions, see Appendix F and the templates linked in Section F.2).Structure your document by inserting headings and empty lines between different parts. Here’s an example how your initial file could look:
---
title: "Chapter 8: Joining data"
author: "Your name"
date: "2024 October 20"
output: html_document
---
Add text or code chunks here.
# Exercises (08: Joining data)
## Exercise 1
## Exercise 2
etc.
<!-- The end (eof). -->
Create an initial code chunk below the header of your
.Rmd
file that loads the R packages of the tidyverse (and see Section F.3.3 if you want to get rid of the messages and warnings of this chunk in your HTML output).Save your file (e.g., as
08_join.Rmd
in the R folder of your current project) and remember saving and knitting it regularly as you keep adding content to it.
So how can we join two tables of data?
References
For this reason, real-world joins are often preceded by sequences of
select()
,mutate()
, andarrange()
functions on one or both tables.↩︎