3.3 Powerful pipes
Although each of the essential dplyr commands is useful, the examples above have shown that we rarely encounter them in isolation. Instead, we can think of these commands as verbs of a language for basic data manipulation.
The so-called pipe operator %>% of the magrittr package (Bache & Wickham, 2022) allows chaining commands in which the current result is passed to the next command (from left to right). With such forward pipes, seemingly simple commands become so powerful that we will rarely use just one command.
Whereas our introduction of essential dplyr commands in Section 3.2 relied on an intuitive understanding of the pipe, we should take a brief look at how pipes work and when we should use or avoid them.
3.3.1 Ceci est un pipe
What is a pipe? While some people see pipes primarily as smoking devices, others are reminded of representational statements in art history (see Wikipedia: The treachery of images). In the applied sub-area of physics known as plumbing, a pipe is a device for directing fluid and solid substances from one location to another.
In R, the pipe is an operator that allows re-writing a nested call to multiple functions as a chain of individual functions.
Historically, the native forward pipe operator |> of base R (introduced in R version 4.1.0, published on 2021-05-18) was preceded by the %>% operator of the magrittr package (Bache & Wickham, 2022).
Despite some differences, both pipes allow turning a nested expression of function calls into a sequence of processing steps that is easier to understand and avoids the need for saving intermediate results.
In the following, we will use the %>% operator of magrittr, but most examples would also work for the native pipe operator |>.
Basic usage
For our present purposes, it is sufficient to think of the pipe operator %>% as passing whatever is on its left (the left-hand side, lhs) to the first argument of the function on its right (the right-hand side, rhs):
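Schematically, and as a minimal sketch in which the numeric vector is merely an illustrative assumption:

```r
library(magrittr)  # provides the %>% operator

# General form:  lhs %>% rhs
# The value on the left becomes the first argument of the call on the right:
c(1, 4, 9) %>% sqrt()   # equivalent to sqrt(c(1, 4, 9))
```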
Here, lhs is an expression that yields some value (e.g., a number, vector, or table), and rhs is an expression that uses this value as an input (i.e., an R call expression or function).
This description sounds more complicated than it is in practice. Actually, we are quite familiar with R expressions that contain and combine multiple steps. But so far, we have been nesting them in arithmetic formulas (with parentheses indicating operator precedence) or in hierarchical function calls, as in:
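For illustration, both kinds of nesting can be sketched as follows (the values of x and v are arbitrary assumptions):

```r
x <- 3
v <- c(1, 2, 3)

# Arithmetic formula (parentheses make the order of operations explicit):
((x * 4) + 20) - 7

# Hierarchical function calls (evaluated from the inside out):
round(sqrt(sum(v)), 2)
```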
Using the pipe operator %>% of magrittr allows us to re-write the nested function calls into a linear chain of steps:
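As a minimal sketch (the vector v and the chosen functions are illustrative assumptions), a nested call like round(sqrt(sum(v)), 2) becomes a chain that can be read from left to right:

```r
library(magrittr)

v <- c(1, 2, 3)

# Nested version, read from the inside out:
nested <- round(sqrt(sum(v)), 2)

# Piped version, read from left to right:
piped <- v %>% sum() %>% sqrt() %>% round(2)

identical(nested, piped)   # both versions perform the same computation
```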
Thus, given three functions a(), b() and c(), the following pipe would compute the result of the compound expression c(b(a(x))):
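A runnable sketch of this pipe, in which three simple arithmetic functions are hypothetical stand-ins for a(), b(), and c(). They are defined inside local() so that base R's c() is not masked afterwards:

```r
library(magrittr)

same <- local({
  a <- function(x) x + 1   # hypothetical first step
  b <- function(x) x * 2   # hypothetical second step
  c <- function(x) x^2     # hypothetical third step
  x <- 1

  nested <- c(b(a(x)))                  # compound expression
  piped  <- x %>% a() %>% b() %>% c()   # equivalent pipe
  identical(nested, piped)
})
same
```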
As the intermediate steps get longer and more complicated, we typically re-write the same pipe sequence as follows:
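In this layout, each step occupies its own line and every line except the last ends with %>%. A sketch that uses hypothetical stand-in functions for a(), b(), and c():

```r
library(magrittr)

result <- local({
  a <- function(x) x + 1   # hypothetical first step
  b <- function(x) x * 2   # hypothetical second step
  c <- function(x) x^2     # hypothetical third step (masks base c() only inside local())
  x <- 1

  x %>%
    a() %>%
    b() %>%
    c()
})
result
```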
3.3.2 Example pipes
Whereas a description of the pipe operator may sound complicated, the underlying idea is quite simple: We often want to perform several operations in a row. This is familiar in arithmetic expressions. For instance, consider the following step-by-step instruction:
- Start with a number x (e.g., x = 3). Then,
- multiply it by 4,
- add 20 to the result,
- subtract 7 from the result, and finally
- take the result’s square root.
This instruction can easily be translated into the following R expression:
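A minimal sketch of such an expression, assuming x = 3 as in the instruction above:

```r
x <- 3
sqrt((x * 4) + 20 - 7)
```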
#> [1] 5
In this expression, the order of operations is determined by parentheses, arithmetic rules (e.g., left to right, multiplying before adding and subtracting, etc.), and functions. Avoiding the infix operators * and +, we can re-write the expression as a sequence of R functions:
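One way of doing so, sketched here with base R's prod() and sum() standing in for the infix operators (and assuming x = 3):

```r
x <- 3
sqrt(sum(prod(x, 4), 20, -7))
```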
#> [1] 5
The order of function application is determined by their level of encapsulation in parentheses.
The pipe operator %>% allows us to re-write the sequence of functions as a chain:
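A sketch of such a chain, assuming x = 3 and using prod() and sum() in place of the infix operators:

```r
library(magrittr)

x <- 3
x %>% prod(4) %>% sum(20, -7) %>% sqrt()
```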
#> [1] 5
Note that this pipe is fairly close to the step-by-step instruction above, particularly when we re-format the pipe to span multiple lines:
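A multi-line sketch in which each line corresponds to one step of the instruction (assuming x = 3, and splitting the addition and the subtraction into two separate calls to sum()):

```r
library(magrittr)

x <- 3              # start with a number x
x %>%
  prod(4) %>%       # multiply it by 4
  sum(20) %>%       # add 20 to the result
  sum(-7) %>%       # subtract 7 from the result
  sqrt()            # take the result's square root
```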
#> [1] 5
Thus, the pipe operator lets us express chains of function applications in a way that matches their natural language description.
If we find the lack of an explicit representation of each step’s result on the right-hand side of %>% confusing, we can re-write the piped command as follows:
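A sketch of this more explicit notation, assuming x = 3 and the same prod() and sum() steps as in the instruction above:

```r
library(magrittr)

x <- 3
x %>% prod(., 4) %>% sum(., 20, -7) %>% sqrt(.)
```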
#> [1] 5
Here, the dot . represents whatever was passed (or “piped”) from the left to the right (here: the current value of x).
While the pipe initially may seem somewhat similar to our assignment operator <-, they are actually quite different.
For instance, the pipe does not assign new objects, but rather applies functions to an existing object that serves as the input.
The input object changes as functions are being applied and eventually results in an output object.
Assuming there is no function y(), the following code would not assign anything to y, but yield an error:
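A sketch of such a failing pipe, assuming x = 3 and a session in which no function y() is defined. The call is wrapped in tryCatch() only so that the example runs to completion:

```r
library(magrittr)

x <- 3

# Piping into y fails, as %>% treats y as a function to be called:
res <- tryCatch(
  x %>% prod(4) %>% sum(20, -7) %>% sqrt() %>% y,
  error = function(e) e   # capture the error instead of stopping
)

inherits(res, "error")   # was an error signalled?
conditionMessage(res)    # the corresponding error message
```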
Thus, for assigning the result of a pipe to an object y, we need to use our standard assignment operator <- on the left (or at the beginning) of the pipe:
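A sketch of such an assignment, assuming x = 3 and the same prod() and sum() steps as in the instruction above:

```r
library(magrittr)

x <- 3
y <- x %>% prod(4) %>% sum(20, -7) %>% sqrt()
y
```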
#> [1] 5
Overall, the pipe operator %>% does not allow us to do anything we could not do before, but allows us to re-write chains of commands in a more natural fashion.
This is particularly useful when generating and transforming data objects (e.g., vectors or tables) by a series of functions that all share the same type of inputs and outputs (e.g., vectors or tables).
3.3.3 When (not) to pipe
Stringing together several dplyr commands allows slicing and dicing data (tibbles or data frames) in a step-wise fashion to run non-trivial data analyses on the fly. Such pipes work not only with dplyr verbs, but with any R function that takes its key data structure as its first argument, so we will soon solve quite sophisticated tasks by chaining together simple commands. For instance, pipes allow selecting and sorting sub-groups of data, computing descriptive statistics (e.g., counts, mean, median, or standard deviation), and providing details about specific variables (e.g., describing their statistics and visualizing them) by a sequence of simple commands.
Unfortunately, not all functions in R can be used with the pipe.
While the tools in the tidyverse are designed to support piping (see The tidy tools manifesto, e.g., by evaluating vignette('manifesto')), there are many functions in R that are not.
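For example, lm() from base R expects a formula as its first argument and the data only as a later one, so a piped data frame would land in the wrong argument unless we place it explicitly. A minimal sketch, assuming the built-in mtcars data and magrittr's dot placeholder:

```r
library(magrittr)

# lm() takes a formula first, so the piped data frame is
# routed to the data argument via the dot placeholder:
fit_piped <- mtcars %>% lm(mpg ~ wt, data = .)

# Equivalent call without a pipe:
fit_plain <- lm(mpg ~ wt, data = mtcars)

all.equal(coef(fit_piped), coef(fit_plain))
```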
And before getting overly enthusiastic about pipes, we should realize that they are fundamentally linear and typically link one input to one output.
Thus, pipes are wonderful tools whenever we start with some data table and incrementally transform it to obtain some specific output object (e.g., some resulting value, table, or graph). However, when we are solving tasks that routinely involve multiple inputs or multiple outputs, pipes are probably not the best tools to use.
In summary, pipes are great tools for transforming vectors or data tables, but we mostly use them for linear sequences of tidyverse commands that have one input and one output. Let’s practice them further by drilling some pipes into our sws data table.
Practice
Try to answer each of the following questions by a pipe of dplyr commands:
- What is the number and mean height and mass of individuals from Tatooine by species and gender?
- Which humans are more than 5cm taller than the average human overall?
- Which humans are more than 5cm taller than the average human of their own gender?
Solution
Here are possible ways of answering these questions:
# What is the number and mean height and mass of individuals
# from Tatooine (filter) by species and gender (groups):
sws %>%
  filter(homeworld == "Tatooine") %>%
  group_by(species, gender) %>%
  summarise(count = n(),
            mn_height = mean(height),
            mn_mass = mean(mass, na.rm = TRUE)
  )

# Which humans are more than 5cm taller than
# the average human overall (filter humans,
# then compute mean and test for taller individuals):
sws %>%
  filter(species == "Human") %>%
  mutate(mn_height = mean(height, na.rm = TRUE),
         taller = height > mn_height + 5) %>%
  filter(taller == TRUE)

# Which humans (filter) are more than 5cm taller
# than the average human of their own gender
# (first group by gender, then mutate):
sws %>%
  filter(species == "Human") %>%
  group_by(gender) %>%
  mutate(mn_height_2 = mean(height, na.rm = TRUE),
         taller_2 = height > mn_height_2 + 5) %>%
  filter(taller_2 == TRUE)
More about pipes
Additional details and opinions on the pipe operator are available at the following links:
- see vignette('magrittr') of the magrittr package (Bache & Wickham, 2022)
- read Chapter 18: Pipes of the r4ds textbook
- for programming with pipes, see this discussion at Stack Overflow