3.3 Powerful pipes
Although each of the essential dplyr commands is useful, the examples above have shown that we rarely encounter them in isolation. Instead, we can think of these commands as verbs of a language for basic data manipulation. The so-called pipe operator
%>% of the package magrittr (Bache & Wickham, 2014) allows chaining commands in which the current result is passed to the next command (from left to right). With such forward pipes, seemingly simple commands become so powerful that we will rarely use just one command. Whereas our introduction of essential dplyr commands in Section 3.2 relied on an intuitive understanding of the pipe, we should take a brief look at how pipes work and when we should use or avoid them.
3.3.1 Ceci est un pipe
What is a pipe? For our present purposes, it is sufficient to think of the pipe operator %>% as passing whatever is on its left to the first argument of the function on its right. For instance, given three functions a(), b(), and c(), the following pipe would compute the result of c(b(a(x))):
# Apply a to x, then b, then c:
x %>% a() %>% b() %>% c()

# typically written as:
x %>%
  a() %>%
  b() %>%
  c()
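To make the equivalence between a pipe and nested function calls concrete, the following minimal sketch defines three small helper functions. The names add_one(), times_two(), and square() are hypothetical stand-ins for a(), b(), and c(), invented here for illustration only; they avoid masking R's built-in c().

```r
library(magrittr)  # provides the %>% operator

# Three hypothetical helper functions (not part of any package):
add_one   <- function(v) v + 1
times_two <- function(v) v * 2
square    <- function(v) v^2

x <- 3

# Nested calls are read from the inside out:
nested <- square(times_two(add_one(x)))

# The same computation as a pipe is read from left to right:
piped <- x %>%
  add_one() %>%
  times_two() %>%
  square()

identical(nested, piped)  # both compute ((3 + 1) * 2)^2
```

Both versions yield the same value; the pipe merely changes the order in which we read and write the steps.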
To assign the result of a pipe to some object y, we need to use the assignment operator at the (top) left of the pipe:
# Apply a to x, then b, then c
# and assign the result to y:
y <- x %>% a() %>% b() %>% c()

# typically written as:
y <- x %>%
  a() %>%
  b() %>%
  c()
The pipe in action can be demonstrated by the following example:
x <- 4
x %>% sum(5) %>% sqrt()
#> [1] 3
If you find the lack of an explicit representation of x on the right-hand side of %>% confusing, you could re-write the piped command as follows:
x %>% sum(., 5) %>% sqrt(.)
#> [1] 3
Here, the dot . represents whatever was passed (or “piped”) from the left to the right (here: the value of x for sum(), and the result of that sum for sqrt()).
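The dot becomes more than a reading aid when the piped value should not land in the first argument. The following sketch uses seq() as an arbitrary example function (chosen here for illustration) to contrast both cases. It relies on the magrittr rule that a dot used directly as an argument replaces the default first-argument placement.

```r
library(magrittr)

x <- 4

# Without a dot, x becomes the first argument, i.e., seq(4, 10):
first_arg <- x %>% seq(10)

# With a dot, x is placed where the dot stands, i.e., seq(1, 4):
dot_arg <- x %>% seq(1, .)

first_arg  # counts from 4 up to 10
dot_arg    # counts from 1 up to 4
```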
While the pipe initially may seem somewhat similar to our assignment operator <-, they are actually quite different. The pipe does not assign new objects, but rather applies functions to an existing object that serves as the input. The value passed along the pipe changes as each function is applied and eventually results in an output, while the input object itself remains unchanged. Assuming there is no function y(), the following code would not assign anything to y, but yield an error:
x %>% sum(., 5) %>% sqrt(.) %>% y
Thus, for assigning the result of a pipe to an object y, we need to use our standard assignment operator <- on the left (or at the beginning) of the pipe:
y <- x %>% sum(., 5) %>% sqrt(.)
y
#> [1] 3
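The difference between piping and assigning can be checked directly. The sketch below assumes that no function named undefined_fn() exists in the current session (the name is made up for this illustration) and uses tryCatch() to capture the resulting error rather than stopping the script.

```r
library(magrittr)

x <- 4

# Piping x through functions computes a value, but leaves x untouched:
x %>% sum(5) %>% sqrt()
x  # still 4

# Piping into a function that does not exist fails.
# tryCatch() returns the error message instead of halting:
msg <- tryCatch(
  x %>% sum(5) %>% sqrt() %>% undefined_fn(),
  error = function(e) conditionMessage(e)
)
msg

# Only an explicit assignment stores the result of the pipe:
y <- x %>% sum(5) %>% sqrt()
y
```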
3.3.3 When (not) to pipe
Stringing together several dplyr commands allows slicing and dicing data (tibbles or data frames) in a step-wise fashion to run non-trivial data analyses on the fly. As pipes work not only with dplyr verbs, but with the functions of all tidyverse packages, we will soon solve quite sophisticated tasks by chaining together simple commands. For instance, pipes allow selecting and sorting sub-groups of data, computing descriptive statistics (e.g., counts, means, medians, or standard deviations), and providing details about specific variables (e.g., describing their statistics and visualizing them) by a sequence of simple commands.
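As a minimal sketch of such a chain, the following pipe filters, groups, summarises, and sorts a small table. The tibble toy and its values are invented for illustration and are not part of the data used in this chapter.

```r
library(dplyr)

# Hypothetical toy data (invented for illustration):
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, NA)
)

result <- toy %>%
  filter(!is.na(value)) %>%            # drop the row with a missing value
  group_by(group) %>%                  # form sub-groups
  summarise(n = n(),                   # count cases per group
            mn_value = mean(value)) %>%
  arrange(desc(mn_value))              # sort groups by descending mean

result
```

Each step receives the table produced by the previous step, so the pipe reads like a recipe: filter, then group, then summarise, then sort.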
Unfortunately, not all functions in R can be used with the pipe. While the tools in the tidyverse are designed to support piping (see The tidy tools manifesto, e.g., by evaluating vignette('manifesto')), many other functions in R are not, for instance because their first argument is not the data to be transformed. And before getting overly enthusiastic about pipes, we should realize that they are fundamentally linear and typically link one input to one output. Thus, pipes are wonderful tools whenever we start with some data table and incrementally transform it to obtain some specific output object (e.g., some resulting value, table, or graph). However, when we are solving tasks that routinely involve multiple inputs or multiple outputs, pipes are probably not the best tools.
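A typical case of a function that was not designed for piping is base R's lm(), which expects a formula as its first argument and the data as its second. The sketch below uses lm() and the built-in mtcars data purely as an illustration (neither appears in this chapter) and shows how the dot placeholder can redirect the piped data to the data argument.

```r
library(magrittr)

# A plain pipe would pass mtcars into the formula slot of lm().
# The dot sends the piped data to the data argument instead:
fit_piped <- mtcars %>% lm(mpg ~ wt, data = .)

# The same model without a pipe, for comparison:
fit_plain <- lm(mpg ~ wt, data = mtcars)

all.equal(coef(fit_piped), coef(fit_plain))
```

When a step needs this kind of workaround, or when several tables must be combined, storing intermediate results in named objects is often clearer than forcing everything into a single pipe.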
In summary, pipes are great tools for transforming a data table, but we mostly use them for linear sequences of tidyverse commands that have one input and one output. Let’s practice them further by drilling some pipes into our
sws data table.
Try to answer each of the following questions by a pipe of dplyr commands:
What is the number and mean height and mass of individuals from Tatooine by species and gender?
Which humans are more than 5cm taller than the average human overall?
Which humans are more than 5cm taller than the average human of their own gender?
Here are possible ways of answering these questions:
# What is the number and mean height and mass of individuals
# from Tatooine (filter) by species and gender (groups):
sws %>%
  filter(homeworld == "Tatooine") %>%
  group_by(species, gender) %>%
  summarise(count = n(),
            mn_height = mean(height),
            mn_mass = mean(mass, na.rm = TRUE)
            )

# Which humans are more than 5cm taller than
# the average human overall (filter humans,
# then compute mean and test for taller individuals):
sws %>%
  filter(species == "Human") %>%
  mutate(mn_height = mean(height, na.rm = TRUE),
         taller = height > mn_height + 5) %>%
  filter(taller == TRUE)

# Which humans (filter) are more than 5cm taller
# than the average human of their own gender
# (first group by gender, then mutate):
sws %>%
  filter(species == "Human") %>%
  group_by(gender) %>%
  mutate(mn_height_2 = mean(height, na.rm = TRUE),
         taller_2 = height > mn_height_2 + 5) %>%
  filter(taller_2 == TRUE)
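The only difference between the last two solutions is the group_by() step, which makes mutate() compute the mean within each group rather than over all rows. The following sketch verifies this on a small table. The tibble toy, its names, and its heights are invented for illustration and are not taken from sws.

```r
library(dplyr)

# Hypothetical toy data (invented for illustration):
toy <- tibble(
  name   = c("A", "B", "C", "D", "E", "F"),
  gender = c("f", "f", "f", "m", "m", "m"),
  height = c(160, 170, 180, 180, 190, 200)
)

# Overall mean height is 180, so the threshold is 185 for everyone:
overall <- toy %>%
  mutate(mn_height = mean(height, na.rm = TRUE)) %>%
  filter(height > mn_height + 5)

# Group means are 170 (f) and 190 (m), so thresholds are 175 and 195:
by_gender <- toy %>%
  group_by(gender) %>%
  mutate(mn_height = mean(height, na.rm = TRUE)) %>%
  filter(height > mn_height + 5) %>%
  ungroup()

overall$name    # E and F exceed the overall threshold
by_gender$name  # C and F exceed the threshold of their own group
```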
Bache, S. M., & Wickham, H. (2014). magrittr: A forward-pipe operator for R. Retrieved from https://CRAN.R-project.org/package=magrittr