All the data we considered so far appeared in the form of a single table (or multiple versions of the same table, some of which needed transforming and tidying etc.). In reality, however, many datasets are distributed over multiple tables. To analyze such data, an important first step is to combine those tables without messing up their contents. The dplyr package (Wickham, François, Henry, & Müller, 2022) contains some two-table functions that allow just that.
After working through this chapter, you should be able to use two-table verbs of dplyr to:
- perform various mutating joins on two tables;
- perform various filtering joins on two tables;
- combine the cases of two tables with set operations; and
- know that the base R function
merge()also allows performing mutating joins.
Joining data tables is simple and straightforward when their variables have common variables (names and types) and both tables describe the same cases (rows). However, it’s clear that we may encounter problems in which one table uses different names or types than the other one and that both tables can differ in the number of cases they contain (e.g., one may contain a sub-set of cases of the other one, or both contain some common but also some unique cases). In these instances, it’s absolutely crucial that we first understand each table by itself (its dimensions, cases and variables, contents) and their relation to each other (e.g., which cases and variables occur in both tables) before trying to join them.50
To combine data from multiple tables, we need to specify how they are related (which is why this chapter is called “relational data”). A relation between tables typically is a variable that occurs in both tables: Provided that this variable has the same meaning in both tables, it is called a key (see Chapter 13.3 for details) and allows linking the observations (rows) from one table to those of the other one.
Base R vs. tidyverse tools
When we are ready to join tables, base R offers basic
base::cbind() commands (to bind rows or columns of data), but these are intended for tables with identical variables or cases, respectively. Mutating joins of two data frames can be achieved by using
base::merge() (see below).
Alternatively, the tidyverse package dplyr, whose so-called one-table verbs we have used for transforming data (see Chapter 3), contains additional two-table verbs that allow joining two tables of data with three different types of commands:
Mutating joins add new variables (columns) to one table from matching cases (rows) of another table (Section 8.2.1).
Filtering joins filter observations (rows) from one table based on whether or not they match a case (row) in another table (Section 8.2.2).
Set operations combine the observations in two tables (with the same variables) as if they were set elements (Section 8.2.4).
In this chapter, we will briefly describe the essential commands for each type of join.
vignette("two-table") and Chapter 13: Relational data of the r4ds textbook (Wickham & Grolemund, 2017) for additional details.)
8.1.3 Data used
In this chapter, we start practicing various join commands with some artificially generated data tables of the ds4psy package (Neth, 2022). In later sections and exercises, we will use datasets contained in dplyr (e.g., the
starwars dataset) (Wickham, François, Henry, & Müller, 2022)
and the real datasets described in Section B.1 and Section B.2 of Appendix B.
8.1.4 Getting ready
This chapter formerly assumed that you have read and worked through Chapter 13: Relational data of the r4ds textbook (Wickham & Grolemund, 2017). It now can be read by itself, but reading Chapter 13 of r4ds is still recommended.
Please do the following to get started:
Structure your document by inserting headings and empty lines between different parts. Here’s an example how your initial file could look:
--- : "Chapter 8: Joining data" title: "Your name" author: "2022 July 15" date: html_document output--- Add text or code chunks here. # Exercises (08: Joining data) ## Exercise 1 ## Exercise 2 etc. <!-- The end (eof). -->
Create an initial code chunk below the header of your
.Rmdfile that loads the R packages of the tidyverse (and see Section F.3.3 if you want to get rid of the messages and warnings of this chunk in your HTML output).
Save your file (e.g., as
08_join.Rmdin the R folder of your current project) and remember saving and knitting it regularly as you keep adding content to it.
So how can we join two tables of data?
For this reason, real-world joins are often preceded by sequences of
arrange()functions on one or both tables.↩︎