6.4 Transforming tables with tidyr
Tidy datasets are all alike
but every messy dataset is messy in its own way.Hadley Wickham (2014b, p. 2)
This quote by Hadley Wickham is a variation of the initial sentence of Tolstoy’s novel Anna Karenina. In both version, it is implied that all entities (i.e., happy families or tidy datasets) share some characteristic feature, whereas there is a variety of ways in which things can go wrong (i.e., become unhappy or messy). Both sentences convey great insights, of course, but explaining why they are true is a bit like explaining a joke — and not quite as catchy.
The term “tidy data” comes from the related concept of “data cleaning.” Cleaning up data is an effortful and expensive phase of data analysis. The “tidy” concept was defined in contrast to “messy data” by Wickham (2014b) and is the eponym for the tidyverse and the tidyr package.
The following sections merely provide a summary of essential tidyr functions. More extensive resources for this section include:
Chapter 7: Tidying data of the ds4psy book (Neth, 2022a).
Chapter 12: Tidy data of the r4ds book (Wickham & Grolemund, 2017).
The documentation of the tidyr package (Wickham & Henry, 2020).
6.4.1 What is tidy data?
What is tidy data? The notion of tidy data is a key inspiration for the tidyverse. The concept of “tidy” is neat and intuitive, but also a bit vague. Rather than attempting a formal definition, we will characterize the term by describing the intentions behind its uses:
Informally, tidy data serves as the antagonist of messy data. As messy data is data that is difficult to work with, data is considered “tidy” when it is easy to work with.
The key idea of tidy data is that each variable should be in its own column.
Both these descriptions sound rather trivial, of course. The second point raises the question: When is a variable not in its own column?
The trick here is that the term “variable” is used in a functional sense: A variable is some measure or description that we want to use as a variable in an analysis. For instance, depending on the particular task at hand, a “variable” could be a particular date, or the month, year, or century that corresponds to a date. Thus, what is considered to be “tidy” partly lies in the eyes of the beholder and depends on what someone wants to do with data, rather than on some inherent property of the data itself.
More generally, the notion of tidy data is illuminated within our framework of ecological rationality (see Section 1.2.6): The difference between messy and tidy data depends on (a) our goals or the intended use of the data (e.g., which task do we want to address?) and (b) the tools with which we typically carry out our tasks (e.g., which functions are we familiar with?). Given the tools provided by R and prominent R packages (e.g., dplyr or ggplot2), it makes sense to first identify the variables of our analysis and then reshape the data so that each variable is in its own column. Although this format is informationally equivalent to many alternative formats, it provides practical benefits for further transforming the data. For instance, we can use the variables to filter, select, group, or pivot the data to reshape or reduce it to answer our questions.
6.4.2 Essential tidyr functions
The tidyr package (Wickham & Henry, 2020) provides commands to create and transform messy into tidy data (or vice versa). Although the package provides many functions, we consider two pairs of two complementary tidyr commands as essential:
separate()
splits one variable into two variables;unite()
combines two variables into one variable;pivot_longer()
orgather()
make (“wide”) data longer (by gathering many variables into one variable);pivot_wider()
orspread()
make (“long”) data wider (by spreading one variable into many variables).
The first two commands allow splitting or combining variables:
separate()
is the complement to or inverse/opposite of unite()
.
The second two pairs of commands allow changing the data layout by making a given table longer or wider: spread()
and pivot_wider()
are the complements to or inverses/opposites of gather()
and pivot_longer()
.
In the following sections, we briefly explain how to use these four essential commands.
The table5
tibble of the tidyr package presents two obstacles to tidy data:
table5#> # A tibble: 6 × 4
#> country century year rate
#> * <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745/19987071
#> 2 Afghanistan 20 00 2666/20595360
#> 3 Brazil 19 99 37737/172006362
#> 4 Brazil 20 00 80488/174504898
#> 5 China 19 99 212258/1272915272
#> 6 China 20 00 213766/1280428583
Instead of solving typical tasks in terms of
century
and a 2-digit code ofyear
, we normally would encode the year as a 4-digit number. Doing so would require combining two variables into one — which is what theunite()
function is for.The variable
rate
suffers from the opposite problem: It contains both the number of TB cases and the population count, separated by a forward slash “/.” Splitting one variable into two is the job of theseparate()
function.
6.4.3 Unite variables
To combine the variables century
and year
of table5
into a single variable, we can use the following unite()
expression:
%>%
table5 unite(col = "yr", c("century", "year"), sep = "")
#> # A tibble: 6 × 3
#> country yr rate
#> <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
Note that this unite()
function is preceded by the pipe operator, so that table5
is used as its first argument (run ?unite
to discover its arguments as unite(data, col, ..., sep, remove, na.rm)
).
Evaluating the pipe creates a 6 x 3
tibble with a new variable yr
and removes the two original variables (century
and year
) from the table, as the remove
argument of unite()
is set to TRUE
by default.
Incidentally, omitting the quotation marks around the variable names (here: yr
, century
, and year
) would work as well.
Additionally, the new variable yr
is of type “character.”
To create a corresponding number or factor, we could use mutate()
to create additional variables:
%>%
table5 unite(col = "yr", c("century", "year"), sep = "") %>%
mutate(yr_num = as.numeric(yr),
yr_fac = as.factor(yr))
#> # A tibble: 6 × 5
#> country yr rate yr_num yr_fac
#> <chr> <chr> <chr> <dbl> <fct>
#> 1 Afghanistan 1999 745/19987071 1999 1999
#> 2 Afghanistan 2000 2666/20595360 2000 2000
#> 3 Brazil 1999 37737/172006362 1999 1999
#> 4 Brazil 2000 80488/174504898 2000 2000
#> 5 China 1999 212258/1272915272 1999 1999
#> 6 China 2000 213766/1280428583 2000 2000
or use transmute()
to drop yr
in favor of an alternative variable.
In case of wondering why we specified sep = ""
and whether the order of variables in the expression matters, we can try the following variants:
%>%
table5 unite(col = "yr", c("century", "year"))
%>%
table5 unite(col = yr, c(year, century), sep = "")
Thus, sep = "_"
by default, and the order of variables matters, but variable names can be quoted or unquoted.
6.4.4 Separate variables
The second issue of table5
involved the rate
variable.
The separate()
function of tidyr allows creating two variables:
%>%
table5 separate(rate, into = c("cases", "popu"))
#> # A tibble: 6 × 5
#> country century year cases popu
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745 19987071
#> 2 Afghanistan 20 00 2666 20595360
#> 3 Brazil 19 99 37737 172006362
#> 4 Brazil 20 00 80488 174504898
#> 5 China 19 99 212258 1272915272
#> 6 China 20 00 213766 1280428583
Note that we did not need to specify sep = "/"
, as separate()
was smart enough to identify this as the only plausible splitting point. If we wanted to separate a variable without a dedicated symbol that signals the separation, we can also specify the numeric position at which we want to split a variable:
%>%
table5 separate(country, into = c("first_3", "rest"), sep = 3)
#> # A tibble: 6 × 5
#> first_3 rest century year rate
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Afg hanistan 19 99 745/19987071
#> 2 Afg hanistan 20 00 2666/20595360
#> 3 Bra zil 19 99 37737/172006362
#> 4 Bra zil 20 00 80488/174504898
#> 5 Chi na 19 99 212258/1272915272
#> 6 Chi na 20 00 213766/1280428583
As the rate
and country
variables that were separated are no longer part of the resulting tibble, we now have remove = FALSE
by default. Finally, the new variables are still of type “character” (i.e., would require a round of “as.numeric()” to be turned into numbers).
Practice
- Combine the above
unite()
step (to create theyr
variable) with aseparate()
step that reverses its effect in a single dplyr pipe.
Solution:
%>%
table5 unite(col = "yr", c("century", "year"), sep = "") %>%
separate(yr, into = c("century", "year"), sep = 2)
- Combine the above
separate()
step (to split therate
variable) with aunite()
step that reverses its effect in a single dplyr pipe.
Solution:
%>%
table5 separate(rate, into = c("cases", "popu")) %>%
unite(col = "rate", c("cases", "popu"), sep = "/")
See Section 7.2 Essential tidyr commands for additional examples and practice tasks.
6.4.5 Making tables longer
When describing the shape of datasets, notions like “wide” and “long” make more sense when describing changes than in absolute terms. For instance, data tables in “wide format” can contain thousands of cases (which most people would consider quite “long”) and tables in “long format” can contain a large number of variables (i.e., be considered quite “wide”). However, the relative terms “longer” and “wider” make sense when describing a change in format of the same data table.
The first change considered here makes a (“wide”) data table (e.g.,, a table in which a single variable is distributed over multiple columns) longer. A simple case for such a task is provided by the data of table4a
of the tidyr package:
table4a#> # A tibble: 3 × 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
Using gather()
for making wide tables longer
Making wide tables longer can be achieved with the gather()
command of tidyr.
Gathering multiple columns requires that each cell value to be moved is described by a so-called key
(e.g., the name of a variable that describes or identifies the value) and its value
(e.g., the name of the variable that is being measured). Additionally, we need to specify which variables of a table are to be gathered (by specifying either a vector or range of variables). In the case of table4a
, the corresponding gather()
command could be:
%>%
table4a gather(key = "year", value = "cases", '1999':'2000')
#> # A tibble: 6 × 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766
Note that the range of to-be-gathered variables (i.e., 1999
:2000
) lacks an argument name, uses the :
-notation to specify a range of variables (but could also be expressed as a two-variable vector c('1999', '2000')
), and encloses the variable name in single quotes because the variables '1999'
and '2000'
start with a number, rather than a letter (as variables in R normally should).
A potential limitation of gather()
is that it is primarily designed for cases in which the values of one variable are spread over multiple columns. However, a common case in many datasets is that there are several variables whose values are scattered over multiple columns. For instance, consider the following table8
from the ds4psy package:
::table8
ds4psy#> # A tibble: 3 × 5
#> country cases_1999 cases_2000 popu_1999 popu_2000
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Afghanistan 745 2666 19987071 20595360
#> 2 Brazil 37737 80488 172006362 174504898
#> 3 China 212258 213766 1272915272 1280428583
In table8
, the contents of table4a
are followed by two more columns from table4b
, and the meaning of each column (and the variables it contains) is signaled by the column names. One way of reshaping table8
into a longer format would be to split it into two parts (i.e., table4a
and table4b
), use gather()
on each part, and then join the two resulting tables (see the Practice exercises below). An alternative consists in using gather()
on all variables with values and later separate()
the key
variable (containing the previous column names) into two variables:
%>%
table8 gather(key = "key", value = "nr", cases_1999:popu_2000) %>%
separate(col = key, into = c("type", "year"))
#> # A tibble: 12 × 4
#> country type year nr
#> <chr> <chr> <chr> <dbl>
#> 1 Afghanistan cases 1999 745
#> 2 Brazil cases 1999 37737
#> 3 China cases 1999 212258
#> 4 Afghanistan cases 2000 2666
#> 5 Brazil cases 2000 80488
#> 6 China cases 2000 213766
#> # … with 6 more rows
However, this only works for table8
because the columns had sensible and systematic column names.
A more flexibel range of solutions is supported by the pivot_longer()
function.
Using pivot_longer()
rather than gather()
In recent versions of tidyr (e.g., version 1.1.0+), the documentation of gather()
carries a warning label: “Development on gather()
is complete, and for new code we recommend switching to pivot_longer()
, which is easier to use, more featureful, and still under active development.”
Hence, we can still use gather()
, but using the more recent pivot_longer()
command is a safer choice.
Making table4a
wider with this function can be achieved as follows:
%>%
table4a pivot_longer(cols = `1999`:`2000`,
names_to = "year", values_to = "cases")
#> # A tibble: 6 × 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Afghanistan 2000 2666
#> 3 Brazil 1999 37737
#> 4 Brazil 2000 80488
#> 5 China 1999 212258
#> 6 China 2000 213766
For simple cases, pivot_longer()
works just like gather()
: The range of variables (columns) to be changed to a longer format are specified by the cols
argument, and the cryptic key
and value
arguments of gather()
are replaced by more intuitive names_to
and values_to
arguments.
The main benefit of pivot_longer()
is that it provides functionalities beyond those of gather()
.
For instance, the function can gather the two distributed variables of table8
in one expression:
::table8 %>%
ds4psypivot_longer(cols = cases_1999:popu_2000,
names_to = c("type", "year"),
names_sep = "_",
# names_pattern = "(.*)_(.*)",
values_to = "nr")
#> # A tibble: 12 × 4
#> country type year nr
#> <chr> <chr> <chr> <dbl>
#> 1 Afghanistan cases 1999 745
#> 2 Afghanistan cases 2000 2666
#> 3 Afghanistan popu 1999 19987071
#> 4 Afghanistan popu 2000 20595360
#> 5 Brazil cases 1999 37737
#> 6 Brazil cases 2000 80488
#> # … with 6 more rows
As names_to
argument specified two variables, we need to say how the column names are to be mapped to the two new variables.
This can either be done by specifying a separation point sep
(which can be a character or a numeric value) or by providing a names_pattern
(as a regular expression, see Appendix E: Using regular expressions for an introduction).
Overall, pivot_longer()
can do more than the gather()
function, but the price of using the more advanced features is additional complexity within the function.
More advanced features of pivot_longer()
Here are some helpful examples from the documentation of pivot_longer()
(see the vignette("pivot")
for explanations):
Using the
relig_income
data:Using the
billboard
data:
The tidyr::billboard
data provides top-100 song rankings for the year 2000.
The data provides variables for the artist
name, the track
(or song) name, and the date on which a song first appeared in the top-100 ratings (date.enter
). After this, a range of variables from wk1
to wk76
notes a track’s rank in each week after it entered. As most songs disappear from the rankings after some weeks, later variables (i.e., columns on the right side of the billbard
table) contain an increasing number of missing (or NA
) values:
# Data documentation:
?billboard
# Inspect data:
billboardglimpse(billboard)
# Note:
sum(!is.na(billboard)) # existing values
sum(is.na(billboard)) # missing values
The following pivot_longer()
command identifies the variables (or cols
) to be gathered (or moved from a wide to a longer format) by their common prefix "wk"
. Specifying names_prefix = "wk"
further ensures that the values in the new variable week
drop the "wk"
prefix (and could thus be transformed into numbers).
Finally, the values_drop_na = TRUE
argument ensures that the NA
values are removed from the resulting tibble:
%>%
billboard pivot_longer(
cols = starts_with("wk"),
names_to = "week",
names_prefix = "wk",
values_to = "rank",
values_drop_na = TRUE
)#> # A tibble: 5,307 × 5
#> artist track date.entered week rank
#> <chr> <chr> <date> <chr> <dbl>
#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
#> 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
#> 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
#> 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
#> # … with 5,301 more rows
Note that the new week
variable is of type “character,” but we could easily turn it into a numeric variable by adding a mutate()
step of mutate(nr_wk = as.numeric(week))
.
- Using the
who
data:
The tidyr::who
data encodes a lot of information (regarding the diagnosis, gender, and age, of the people captured in the corresponding frequency counts) in the name of its 56 rightmost variables. To extract this information from the variable names, the names_to
argument of pivot_longer()
can use multiple names of new variables and the names_pattern
argument accepts a regular expression that parses the names according to some pattern:
# Data:
# who
dim(who)
# Multiple variables stored in column names:
%>% pivot_longer(
who cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age_group"),
names_pattern = "new_?(.*)_(.)(.*)",
values_to = "count"
)
This is nicely illustrates the superior powers of pivot_longer()
, but requires some expertise in using regular expressions in R (see Appendix E: Using regular expressions).
- Using the
anscombe
data:
An example only intelligible to expert users is the following (using the anscombe
data from the datasets package):
::anscombe
datasets
# Multiple observations per row
%>%
anscombe pivot_longer(everything(),
names_to = c(".value", "set"),
names_pattern = "(.)(.)"
)
An explanation of this example, and many others, is available in vignette("pivot")
. The examples show that people spend a lot of time and effort on reshaping data files. Our more modest goal is to understand the basics…
Practice: Making tables longer
The shape of table4b
is identical to table4a
, but the values represent population
counts, rather than counts of cases
.
- Use both
gather()
andpivot_longer()
to maketable4b
(containing the countries’population
values) longer:
Solution:
table4b
%>%
table4b gather(key = "year", value = "population", `1999`:`2000`)
%>%
table4b pivot_longer(c(`1999`, `2000`),
names_to = "year", values_to = "population")
- What would we need to create
table1
from the longer versions of the same data intable4a
andtable4b
?
table1#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
Solution:
We could create table1
from table4a
and table4b
in two steps:
- First, we would transform both
table4a
andtable4b
from their wide format into a longer format, so thatcases
andpopulation
would be a variable with a single column (as we have just done). - In a second step, we would need to combine both resulting tables: Provided that both outputs in longer formats are arranged the same way, we could add one of the variables (either
cases
orpopulation
) to the other table.
A safer way of combining both tables would join them based on their common variables (e.g., by using the merge()
function of base R or one of the join
functions of dplyr, see Chapter 8: Joining data of the ds4psy book).
6.4.6 Making tables wider
The opposite of making (“wide”) tables longer is making (“long”) tables wider.
And the inverse of gather()
in the tidyr package is spread()
.
Using spread()
for making long tables wider
The simplest case for the spread()
function is that we have a single key
variable (whose values are to be turned into new variable names) and a single value
variable (whose values are to be distributed over multiple variables).
For instance, if we removed the population
variable from table1
, we could spread the values of the cases
variable over two new variables whose names are generated from the year
variable (to obtain table4a
):
%>%
table1 select(-population) %>%
spread(key = year, value = cases)
#> # A tibble: 3 × 3
#> country `1999` `2000`
#> <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
# same as: table4a
Removing the cases
variable would allow spreading the remaining values of table1
so that the values of the population
variable are spread over two new variables whose names are generated from the year
variable (to obtain table4b
):
%>%
table1 select(-cases) %>%
spread(key = year, value = population)
#> # A tibble: 3 × 3
#> country `1999` `2000`
#> <chr> <int> <int>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583
# same as: table4b
Unfortunately, just like we saw for gather()
above, the spread()
function is designed for a single dependent variable.
However, we often deal with multiple variables (e.g., table1
contains the variables cases
and populations
, both of which contain frequency counts that we may want to spread into a wider table format. How could we spread multiple variables at once?
Using pivot_wider()
rather than spread()
Perhaps not surprisingly, the more potent replacement of spread()
in the tidyr package is called pivot_wider()
.
To illustrate its potential, we can distinguish between two cases:
Case 1: Our table1
data
table1#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
can be viewed as containing one independent variable (year
) and two dependent variables (cases
and population
).
When reshaping data by spreading it from a (“long”) table into a wider table, the independent variable should provide the names of a new variable and the two dependent variables should become the values of the new variable.
The following pivot_wider()
expression does this in one step:
# 1. One IV, two DVs:
%>%
table1 pivot_wider(names_from = year,
values_from = c(cases, population))
#> # A tibble: 3 × 5
#> country cases_1999 cases_2000 population_1999 population_2000
#> <chr> <int> <int> <int> <int>
#> 1 Afghanistan 745 2666 19987071 20595360
#> 2 Brazil 37737 80488 172006362 174504898
#> 3 China 212258 213766 1272915272 1280428583
Note that values_from
argument contains both dependent variables — and that they are combined with the values of the independent variable (year
) to form the names of the new variables (using names_sep = "_"
by default).
Case 2: Alternatively, our table2
data
table2#> # A tibble: 12 × 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> # … with 6 more rows
can be viewed as containing two independent variables (year
and type
) and
and one dependent variable (count
).
When spreading this table, the two independent variables should form the names of new variables and the values of the dependent variable should be distributed over these variables.
This task can be addressed by the following pivot_wider()
command:
%>%
table2 pivot_wider(names_from = c(year, type),
values_from = c(count))
#> # A tibble: 3 × 5
#> country `1999_cases` `1999_population` `2000_cases` `2000_population`
#> <chr> <int> <int> <int> <int>
#> 1 Afghanistan 745 19987071 2666 20595360
#> 2 Brazil 37737 172006362 80488 174504898
#> 3 China 212258 1272915272 213766 1280428583
Note that the names_from
argument now contains both independent variables.
Their values are combined into the names of four new variables (using names_sep = "_"
by default).
As the names of the new variables start with a number (rather than a letter), they are enclosed in quotes.
To obtain exactly the same wider table as above, we could add two dplyr steps for shuffling the order of columns (by using select()
) and renaming some variables (using rename()
):
%>%
table2 pivot_wider(names_from = c(year, type),
values_from = c(count)) %>%
select(country, `1999_cases`, `2000_cases`, `1999_population`, everything()) %>%
rename(cases_1999 = `1999_cases`,
cases_2000 = `2000_cases`,
pop_1999 = `1999_population`,
pop_2000 = `2000_population`)
#> # A tibble: 3 × 5
#> country cases_1999 cases_2000 pop_1999 pop_2000
#> <chr> <int> <int> <int> <int>
#> 1 Afghanistan 745 2666 19987071 20595360
#> 2 Brazil 37737 80488 172006362 174504898
#> 3 China 212258 213766 1272915272 1280428583
Thus, both the names_from()
and the values_from()
arguments of pivot_wider()
can contain multiple variables.
More advanced features of pivot_wider()
Here are some more advanced examples from the documentation of pivot_wider()
(see the vignette("pivot")
for explanations):
- Using the
fish_encounters
data:
# Data:
fish_encounters
# Simplest case:
%>%
fish_encounters pivot_wider(names_from = station, values_from = seen)
# Fill in missing values (as 0):
%>%
fish_encounters pivot_wider(names_from = station, values_from = seen,
values_fill = 0)
- Using the
us_rent_income
data:
# Data:
us_rent_income
# Generate column names from multiple variables:
%>%
us_rent_income pivot_wider(names_from = variable, values_from = c(estimate, moe))
# When there are multiple `names_from` or `values_from`, we can use
# `names_sep` or `names_glue` to control the output variable names:
%>%
us_rent_income pivot_wider(
names_from = variable,
names_sep = ".",
values_from = c(estimate, moe)
)
%>%
us_rent_income pivot_wider(
names_from = variable,
names_glue = "{variable}_{.value}",
values_from = c(estimate, moe)
)
- Using the
warpbreaks
data:
# Data:
<- as_tibble(warpbreaks[c("wool", "tension", "breaks")])
warpbreaks
warpbreaks
# Can perform aggregation with values_fn:
%>%
warpbreaks pivot_wider(
names_from = wool,
values_from = breaks,
values_fn = mean
)
Overall, these examples show that pivot_wider()
is a more flexible replacement for spread()
.
Practice: Making tables wider
- Demonstrate that spreading is the opposite of gathering by starting with
table4a
and using firstgather()
and thenspread()
to re-create the same table.
Solution:
%>%
table4a gather(key = year, value = cases, `1999`:`2000`) %>%
spread(key = year, value = cases)
- Make the
table8
from ds4psy package tidy.
Solution:
::table8 %>%
ds4psy::pivot_longer(cols = cases_1999:popu_2000,
tidyrnames_to = c(".value", "year"),
names_sep = "_")
This concludes our summary of essential tidyr functions (Wickham & Henry, 2020). To sum up:
separate()
splits one variable into two;unite()
combines two variables into one;
pivot_longer()
orgather()
make (“wide”) data longer;
pivot_wider()
orspread()
make (“long”) data wider.
The combination of dplyr and tidyr provides a powerful combination for transforming, summarizing, and reshaping data.