3.3 Powerful pipes
Although each of the essential dplyr commands is useful, the examples above have shown that we rarely encounter them in isolation. Instead, we can think of these commands as verbs of a language for basic data manipulation.
The so-called pipe operator %>% of the magrittr package (Bache & Wickham, 2022) allows chaining commands so that the current result is passed to the next command (from left to right). With such forward pipes, seemingly simple commands become so powerful that we will rarely use just one command.
Whereas our introduction of essential dplyr commands in Section 3.2 relied on an intuitive understanding of the pipe, we should take a brief look at how pipes work and when we should use or avoid them.
3.3.1 Ceci est un pipe
What is a pipe?
In the subdiscipline of physics known as plumbing, a pipe is a device for directing fluid and solid substances from one location to another.
In R, the pipe is an operator that allows re-writing nested function calls as a chain of individual function calls.
For our present purposes, it is sufficient to think of the pipe operator %>% as passing whatever is on its left to the first argument of the function on its right. For instance, given three functions a(), b() and c(), the following pipe would compute the result of the compound expression c(b(a(x))):
# Apply a to x, then b, then c:
x %>% a() %>% b() %>% c()
# typically written as:
x %>%
  a() %>%
  b() %>%
  c()
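As a concrete illustration of this scheme, the following sketch replaces the placeholder functions a(), b() and c() with base R functions. It assumes that the magrittr package is installed; the vector v is made up for this example.

```r
library(magrittr)  # provides the pipe operator %>%

v <- c(1, 4, 9)    # hypothetical input vector

# Nested call, evaluated from the inside out:
rev(cumsum(sqrt(v)))
#> [1] 6 3 1

# The same computation as a pipe, read from left to right:
v %>% sqrt() %>% cumsum() %>% rev()
#> [1] 6 3 1
```

Both versions take the square roots (1, 2, 3), accumulate them (1, 3, 6), and reverse the result; only the reading order differs.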
To assign the result of a pipe to some object y, we need to use the assignment operator at the (top) left of the pipe:
# Apply a to x, then b, then c
# and assign the result to y:
y <- x %>% a() %>% b() %>% c()
# typically written as:
y <- x %>%
  a() %>%
  b() %>%
  c()
3.3.2 Example pipes
Whereas a description of the pipe operator may sound complicated, the underlying idea is quite simple: We often want to perform several operations in a row. This is familiar in arithmetic expressions. For instance, consider the following step-by-step instruction:
- Start with a number x (e.g., x = 3). Then,
- multiply it by 4,
- add 20 to the result,
- subtract 7 from the result, and finally
- take the result’s square root.
This instruction can easily be translated into the following R expression:
x <- 3
sqrt((x * 4) + 20 - 7)
#> [1] 5
In this expression, the order of operations is determined by parentheses, arithmetic rules (e.g., left to right, multiplying before adding and subtracting, etc.), and functions. Avoiding the infix operators * and +, we can re-write the expression as a sequence of R functions:
sqrt(sum(prod(x, 4), 20, -7))
#> [1] 5
The order of function application is determined by their level of encapsulation in parentheses.
The pipe operator %>% allows us to re-write the sequence of functions as a chain:
x %>% prod(4) %>% sum(20, -7) %>% sqrt()
#> [1] 5
Note that this pipe is fairly close to the step-by-step instruction above, particularly when we re-format the pipe to span multiple lines:
x %>%
  prod(4) %>%
  sum(20, -7) %>%
  sqrt()
#> [1] 5
Thus, the pipe operator lets us express chains of function applications in a way that matches their natural language description.
If we find the lack of an explicit representation of each step’s result on the right-hand side of %>% confusing, we can re-write the piped command as follows:
x %>% prod(., 4) %>% sum(., 20, -7) %>% sqrt(.)
#> [1] 5
Here, the dot . represents whatever was passed (or “piped”) from the left to the right (here: initially the value of x, then the result of each preceding step).
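The dot is more than a visual aid when the piped value should not end up in the first argument. The following sketch relies on magrittr's documented behavior that the piped value is not additionally inserted as the first argument when the dot appears as a direct argument of the call; the numbers are made up for this example.

```r
library(magrittr)

x <- 3

# Without a dot, x becomes the first argument, i.e., seq(3, 12):
x %>% seq(12)
#> [1]  3  4  5  6  7  8  9 10 11 12

# With a dot, x is placed where the dot is, i.e., seq(1, 12, by = 3):
x %>% seq(1, 12, by = .)
#> [1]  1  4  7 10
```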
While the pipe initially may seem somewhat similar to our assignment operator <-, they are actually quite different.
For instance, the pipe does not assign new objects, but rather applies functions to an existing object that serves as the input.
The value passed along the pipe changes as functions are applied and eventually results in an output object, whereas the original input object itself is not modified.
Assuming there is no function y(), the following code would not assign anything to y, but yield an error:
x %>%
  prod(4) %>%
  sum(20, -7) %>%
  sqrt() %>%
  y
Thus, for assigning the result of a pipe to an object y, we need to use our standard assignment operator on the left (or at the beginning) of the pipe:
y <- x %>%
  prod(4) %>%
  sum(20, -7) %>%
  sqrt()
y
#> [1] 5
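A quick check, sketched here under the assumption that magrittr is loaded, suggests that piping by itself leaves the input object untouched and that only an explicit assignment stores the result:

```r
library(magrittr)

x <- 3

# Piping x through functions computes a result but does not modify x:
x %>% prod(4) %>% sum(20, -7) %>% sqrt()
#> [1] 5
x
#> [1] 3

# Only the assignment on the left stores the result (here in y):
y <- x %>% prod(4) %>% sum(20, -7) %>% sqrt()
```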
Overall, the pipe operator %>% does not allow us to do anything we could not do before, but allows us to re-write chains of commands in a more natural fashion. This is particularly useful when generating and transforming data objects (e.g., vectors or tables) by a series of functions that all share the same type of inputs and outputs (e.g., vectors or tables).
3.3.3 When (not) to pipe
Stringing together several dplyr commands allows slicing and dicing data (tibbles or data frames) in a step-wise fashion to run non-trivial data analyses on the fly. Because such pipes work not only with dplyr verbs, but with any R function that takes its key data structure as its first argument, we will soon solve quite sophisticated tasks by chaining together simple commands. For instance, pipes allow selecting and sorting sub-groups of data, computing descriptive statistics (e.g., counts, mean, median, standard deviation, etc.), and providing details about specific variables (e.g., describing their statistics and visualizing them) by a sequence of simple commands.
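As a small illustration that pipes are not limited to dplyr verbs, the following sketch chains two base R functions that take a data frame as their first argument. It assumes that magrittr is loaded and uses R's built-in mtcars data rather than the sws table of this chapter.

```r
library(magrittr)

# Count the cars with 4 cylinders in the built-in mtcars data:
mtcars %>%
  subset(cyl == 4) %>%
  nrow()
#> [1] 11
```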
Unfortunately, not all functions in R can be used with the pipe. While the tools in the tidyverse are designed to support piping (see The tidy tools manifesto, e.g., by evaluating vignette('manifesto')), there are many functions in R that are not.
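A familiar example of a function that does not follow the data-first pattern is base R's lm(), which expects a formula as its first argument and the data as a later one. One possible workaround, sketched here with the built-in mtcars data and assuming that magrittr is loaded, uses the dot placeholder to state where the piped data should go:

```r
library(magrittr)

# lm() takes a formula first, so the piped data is directed
# to its data argument via the dot:
fit <- mtcars %>%
  lm(mpg ~ wt, data = .)

class(fit)
#> [1] "lm"
```

Whether such a workaround is clearer than an ordinary function call is a matter of taste; for functions that need several data inputs, it usually is not.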
And before getting overly enthusiastic about pipes, we should realize that they are fundamentally linear and typically link one input to one output.
Thus, pipes are wonderful tools whenever we start with some data table and incrementally transform it to obtain some specific output object (e.g., some resulting value, table, or graph). However, when we are solving tasks that routinely involve multiple inputs or multiple outputs, pipes are probably not the best tools to use.
In summary, pipes are great tools for transforming vectors or data tables, but we mostly use them for linear sequences of tidyverse commands that have one input and one output. Let’s practice them further by drilling some pipes into our sws data table.
Practice
Try to answer each of the following questions by a pipe of dplyr commands:
What is the number and mean height and mass of individuals from Tatooine by species and gender?
Which humans are more than 5 cm taller than the average human overall?
Which humans are more than 5 cm taller than the average human of their own gender?
Solution
Here are possible ways of answering these questions:
# What is the number and mean height and mass of individuals
# from Tatooine (filter) by species and gender (groups):
sws %>%
  filter(homeworld == "Tatooine") %>%
  group_by(species, gender) %>%
  summarise(count = n(),
            mn_height = mean(height),
            mn_mass = mean(mass, na.rm = TRUE)
  )
# Which humans are more than 5cm taller than
# the average human overall (filter humans,
# then compute mean and test for taller individuals):
sws %>%
  filter(species == "Human") %>%
  mutate(mn_height = mean(height, na.rm = TRUE),
         taller = height > mn_height + 5) %>%
  filter(taller == TRUE)
# Which humans (filter) are more than 5cm taller
# than the average human of their own gender
# (first group by gender, then mutate):
sws %>%
  filter(species == "Human") %>%
  group_by(gender) %>%
  mutate(mn_height_2 = mean(height, na.rm = TRUE),
         taller_2 = height > mn_height_2 + 5) %>%
  filter(taller_2 == TRUE)
More about pipes
Additional details and opinions on the pipe operator are available at the following links:
- see vignette('magrittr') of the magrittr package (Bache & Wickham, 2022)
- read Chapter 18: Pipes of the r4ds textbook
- see this discussion at stackoverflow