7.2 Essential tidyr commands
The tidyr package (Wickham & Girlich, 2024) provides commands to create and transform tidy data. Although the package provides many functions, we consider two pairs of two complementary tidyr commands as essential:
separate()
splits one variable into two variables;
unite()
combines two variables into one variable;
gather()
orpivot_longer()
make wide data longer (by gathering many variables into one variable);
spread()
orpivot_wider()
make long data wider (by spreading one variable into many variables).
The first two commands allow splitting or combining variables: separate
is the complement to or inverse/opposite of unite
.
The second two pairs of commands allow changing the data layout by making a given table longer or wider:
spread()
and pivot_wider()
are the complements to or inverses/opposites of gather()
and pivot_longer()
.
In the following sections, we explain how to use these four essential commands.
7.2.1 separate()
1 variable into 2
The separate()
function splits 1 variable (column) into multiple variables (columns) — at a position where some separator character appears — and is the complement of unite()
.
Using separate()
requires the following arguments:
- some tibble/data frame
data
;
- the variable (column)
col
to be separated (specified by its name or column number);
- the names of the new variables (columns)
into
whichcol
is to be split (specified as a character vector);
- the separator character
sep
(as a character/regular expression).
An additional argument remove
regulates whether the original columns are dropped from the output tibble. By default, remove = TRUE
.
Example
In tidyr::table3
(a 6 x 3 tibble), the column rate
contains two numbers, separated by a forward slash /
:
# Data to use:
tidyr::table3 # Note that column rate contains 2 numbers, separated by "/".
#> # A tibble: 6 × 3
#> country year rate
#> * <chr> <int> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
We can use separate()
to split the rate
variable at the /
character into two variables (cases
and population
):
# Full separate command:
separate(data = table3, col = rate, into = c("cases", "population"), sep = "/", remove = TRUE)
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <int> <chr> <chr>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
The output is a 6 x 4 tibble that contains separate columns for cases
and population
.
Note that both the original rate
column and the /
character disappeared from the output tibble.
The following commands are shorter variants that achieve the same result:
# Omitting argument names, allowing to guess sep character, using remove default:
separate(table3, rate, c("cases", "population"))
# Using the pipe:
table3 %>%
separate(rate, c("cases", "population"))
# Specifying the variable to be split (rate) by its column number (3):
table3 %>%
separate(3, c("cases", "population"))
These examples shows that the argument names (data
, col
, and into
) can be omitted (but still require appropriate arguments in the correct order) and sep
can be left unspecified when tidyr can make a good guess what the separator character might be. Also, the variable to be separated can be specified by its numeric index (column number).
If we wanted to keep the rate
variable, we could call:
# Not dropping the original rate variable:
t <- table3 %>%
separate(rate, c("cases", "population"), remove = FALSE)
Finally, since the rate
variable of table3
was of character type, all the tables so far contained variables for cases
and population
that also were of type character. To get these variables as numbers, we could either explicitly convert them (e.g., by applying the as.integer()
function to these columns) or let separate()
do this job by providing the convert = TRUE
argument:
# Not dropping the original rate variable:
table3 %>%
separate(rate, c("cases", "population"), convert = TRUE)
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
Practice
Consider table6
of the ds4psy package (also available online as a csv-file at http://rpository.com/ds4psy/data/table6.csv):
# Copy data from ds4psy package:
table6 <- ds4psy::table6
# Load data (from csv file online):
# table6b <- readr::read_csv("http://rpository.com/ds4psy/data/table6.csv") # from online source
# Alternatively (from local source "data/table6.csv"):
# table6c <- readr::read_csv("data/table6.csv") # from local directory
# all.equal(table6, table6b) # verify equality
- Inspect
table6
and describe what difficulty may occur when trying to applyseparate()
to it.
table6 # Note that column when_what contains several splitting options.
#> # A tibble: 6 × 2
#> country when_what
#> <chr> <chr>
#> 1 Afghanistan 19_99.745/19987071
#> 2 Afghanistan 20_00.2666/20595360
#> 3 Brazil 19_99.37737/172006362
#> 4 Brazil 20_00.80488/174504898
#> 5 China 19_99.212258/1272915272
#> 6 China 20_00.213766/1280428583
In table6
, the variable when_what
contains several plausible separation characters: _
, .
, and /
.
- What happens when we apply
separate
towhen_what
without providing a separating charactersep
?
# What happens when we do not specify "sep"?
table6 %>%
separate(col = when_what, into = c("var_1", "var_2")) # sep is not provided!
#> # A tibble: 6 × 3
#> country var_1 var_2
#> <chr> <chr> <chr>
#> 1 Afghanistan 19 99
#> 2 Afghanistan 20 00
#> 3 Brazil 19 99
#> 4 Brazil 20 00
#> 5 China 19 99
#> 6 China 20 00
The variable when_what
is split at the first plausible option (_
), but a warning is issued and some data is lost from the output table!
- Split the variable
when_what
in three different ways (by providing differentsep
arguments).
# Specifying different splitting characters:
# (a) split at "_":
table6 %>%
separate(col = when_what, into = c("var_1", "var_2"), sep = "_") #
#> # A tibble: 6 × 3
#> country var_1 var_2
#> <chr> <chr> <chr>
#> 1 Afghanistan 19 99.745/19987071
#> 2 Afghanistan 20 00.2666/20595360
#> 3 Brazil 19 99.37737/172006362
#> 4 Brazil 20 00.80488/174504898
#> 5 China 19 99.212258/1272915272
#> 6 China 20 00.213766/1280428583
# (b) split at "." (specified as a regular expression "\\."):
table6 %>%
separate(col = when_what, into = c("var_1", "var_2"), sep = "\\.")
#> # A tibble: 6 × 3
#> country var_1 var_2
#> <chr> <chr> <chr>
#> 1 Afghanistan 19_99 745/19987071
#> 2 Afghanistan 20_00 2666/20595360
#> 3 Brazil 19_99 37737/172006362
#> 4 Brazil 20_00 80488/174504898
#> 5 China 19_99 212258/1272915272
#> 6 China 20_00 213766/1280428583
# (c) split at "/":
table6 %>%
separate(col = when_what, into = c("var_1", "var_2"), sep = "/")
#> # A tibble: 6 × 3
#> country var_1 var_2
#> <chr> <chr> <chr>
#> 1 Afghanistan 19_99.745 19987071
#> 2 Afghanistan 20_00.2666 20595360
#> 3 Brazil 19_99.37737 172006362
#> 4 Brazil 20_00.80488 174504898
#> 5 China 19_99.212258 1272915272
#> 6 China 20_00.213766 1280428583
Note that using the point or period (.
) as a splitting character sep = "."
would not work.
Instead, we need to use the corresponding regular expression sep = "\\."
.
(See Appendix E for a primer on using regular expressions.)
None of the solutions so far is satisfactory.
This shows that we may need to use a sequence of separate()
commands if a table contains condensed columns as table6
.
- Split the
when_what
variable oftable6
three times to create a tibbletable6a
that contains five variables (columns) and reasonable variable names.
Here is a solution that takes three steps (separating one or two variables at a time):
table6a <- table6 %>%
separate(col = when_what, into = c("century", "stuff"), sep = "_") %>% # (a)
separate(col = stuff, into = c("year", "stuff"), sep = "\\.") %>% # (b)
separate(col = stuff, into = c("cases", "population"), sep = "/") # (c)
table6a
#> # A tibble: 6 × 5
#> country century year cases population
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745 19987071
#> 2 Afghanistan 20 00 2666 20595360
#> 3 Brazil 19 99 37737 172006362
#> 4 Brazil 20 00 80488 174504898
#> 5 China 19 99 212258 1272915272
#> 6 China 20 00 213766 1280428583
and here is a solution that achieves the same result in a single step (separating four variables at once):
table6 %>%
separate(when_what, c("century", "year", "cases", "population")) # 4 variables in 1 step
#> # A tibble: 6 × 5
#> country century year cases population
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745 19987071
#> 2 Afghanistan 20 00 2666 20595360
#> 3 Brazil 19 99 37737 172006362
#> 4 Brazil 20 00 80488 174504898
#> 5 China 19 99 212258 1272915272
#> 6 China 20 00 213766 1280428583
Note that the resulting 6 x 5 tibble (here: table6a
) is similar to table3
and table5
(if we applied separate()
to their rate
variable). Also, we may want to convert some variables from their current character format into numbers before analyzing the data further.
7.2.2 unite()
2 variables into 1
unite()
is the complement to (or opposite of) separate()
:
unite()
combines two variables (columns) into one variable (column) and allows adding an optional separator character.
Using unite()
requires the following arguments:
- some tibble/data frame
data
;
- the name of the new compound variable (column)
col
(specified as a character);
- the names of the variables (columns) to be combined (specified by their names or column numbers);
- an optional separator character
sep
(as a character/regular expression).
An additional argument remove
determines whether the original columns are dropped from the output tibble. By default, remove = TRUE
.
Example
In tidyr::table5
(a 6 x 4 tibble), the two columns century
and year
contain two numbers (represented as characters) that we may want to combine into one variable:
# Data to use:
tidyr::table5 # Note that columns 2 and 3 contain 2 values (as characters!) that belong together.
#> # A tibble: 6 × 4
#> country century year rate
#> * <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745/19987071
#> 2 Afghanistan 20 00 2666/20595360
#> 3 Brazil 19 99 37737/172006362
#> 4 Brazil 20 00 80488/174504898
#> 5 China 19 99 212258/1272915272
#> 6 China 20 00 213766/1280428583
The full unite()
command for combining century
and year
into a variable yr
is as follows:
# Full separate command:
unite(data = table5, col = "yr", century, year, sep = "")
#> # A tibble: 6 × 3
#> country yr rate
#> <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
Note that we needed to explicitly specify sep = ""
(i.e., no separating character) to prevent unite()
from adding a default character (_
).
Also, the century
and year
variables disappeared from the 6 x 3 output tibble.
As before, we can use various shorter variants of the same command:
# (a) Omitting argument names:
unite(table5, "yr", century, year, sep = "")
#> # A tibble: 6 × 3
#> country yr rate
#> <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
# (b) Using the pipe:
table5 %>%
unite("yr", century, year, sep = "")
#> # A tibble: 6 × 3
#> country yr rate
#> <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
# (c) Specifying the variables to be combined by column numbers (2 & 3):
table5 %>%
unite("yr", 2, 3, sep = "")
#> # A tibble: 6 × 3
#> country yr rate
#> <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
Finally, here are some variants that illustrate the effects of using different sep
and remove
arguments:
# Providing a different separation character:
table5 %>%
unite("yr", century, year, sep = "<--|-->")
#> # A tibble: 6 × 3
#> country yr rate
#> <chr> <chr> <chr>
#> 1 Afghanistan 19<--|-->99 745/19987071
#> 2 Afghanistan 20<--|-->00 2666/20595360
#> 3 Brazil 19<--|-->99 37737/172006362
#> 4 Brazil 20<--|-->00 80488/174504898
#> 5 China 19<--|-->99 212258/1272915272
#> 6 China 20<--|-->00 213766/1280428583
# Keeping the original variables:
table5 %>%
unite("yr", century, year, sep = "", remove = FALSE)
#> # A tibble: 6 × 5
#> country yr century year rate
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 1999 19 99 745/19987071
#> 2 Afghanistan 2000 20 00 2666/20595360
#> 3 Brazil 1999 19 99 37737/172006362
#> 4 Brazil 2000 20 00 80488/174504898
#> 5 China 1999 19 99 212258/1272915272
#> 6 China 2000 20 00 213766/1280428583
Practice
- To turn
tidyr::table5
into tidy data, it may be indicated to apply bothunite()
andseparate()
to it. Can you achieve this in one pipe?
Solution
tidyr::table5 %>%
unite("year", century, year, sep = "") %>%
separate("rate", into = c("cases", "population"))
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
- Take the data from
dplyr::storms
and unite the variablesyear
,month
,day
into one more complex variabledate
.
Solution
# Data to use:
# dplyr::storms
# Select relevant subset of variables:
storm_small <- dplyr::storms %>%
select(name:day)
## Uniting year:day into date:
storm_small %>%
unite("date", year:day, sep = "/") %>%
head()
#> # A tibble: 6 × 2
#> name date
#> <chr> <chr>
#> 1 Amy 1975/6/27
#> 2 Amy 1975/6/27
#> 3 Amy 1975/6/27
#> 4 Amy 1975/6/27
#> 5 Amy 1975/6/28
#> 6 Amy 1975/6/28
- Consider
table7
of the ds4psy package (also available online as a csv-file at http://rpository.com/ds4psy/data/table7.csv):
Load or read the data into R (as table7
) and inspect it:
#> # A tibble: 6 × 1
#> where_when_what
#> <chr>
#> 1 "Afghanistan@19:99$745\\19987071"
#> 2 "Afghanistan@20:00$2666\\20595360"
#> 3 "Brazil@19:99$37737\\172006362"
#> 4 "Brazil@20:00$80488\\174504898"
#> 5 "China@19:99$212258\\1272915272"
#> 6 "China@20:00$213766\\1280428583"
Then transform table7
in the following ways:
Use multiple (4)
separate()
commands to splittable7
into a tibbletable7a
with multiple (5) columns.Use multiple (4)
unite()
commands ontable7a
to re-create a tibbletable7b
that contains all data in one column.
Solution
Possible solutions for table7a
and table7b
:
- Separating
table7
intotable7a
:
#> # A tibble: 6 × 5
#> country century year rate population
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745 19987071
#> 2 Afghanistan 20 00 2666 20595360
#> 3 Brazil 19 99 37737 172006362
#> 4 Brazil 20 00 80488 174504898
#> 5 China 19 99 212258 1272915272
#> 6 China 20 00 213766 1280428583
- Re-uniting
table7a
intotable7b
:
#> # A tibble: 6 × 1
#> where_when_what
#> <chr>
#> 1 Afghanistan:1999_745/19987071
#> 2 Afghanistan:2000_2666/20595360
#> 3 Brazil:1999_37737/172006362
#> 4 Brazil:2000_80488/174504898
#> 5 China:1999_212258/1272915272
#> 6 China:2000_213766/1280428583
7.2.3 Wide vs. long format
Although separate()
and unite()
change the number of columns in a data table, these commands do not change the basic layout of the table.
By contrast, gather()
and its complement spread()
do change the overall layout by either collecting (or “gathering”) multiple variables into 1,
or distributing (or “spreading”) 1 variable across multiple variables.
A clear indication that the layout of a table has changed is that its number of cases or observations (i.e., rows) has changed.
To illustrate what this means, re-consider the example of table4a
:
country | 1999 | 2000 |
---|---|---|
Afghanistan | 745 | 2666 |
Brazil | 37737 | 80488 |
China | 212258 | 213766 |
We noted above (in Sections 7.1.2 and 7.1.3) that this appears to be a natural representation of the TB cases in 3 countries over 2 years.
However, we also noted that table4a
does not qualify as tidy data (as the year
variable is spread over 2 variables) and impractical in many contexts (e.g., it does not allow using a year
variable in a ggplot()
call). To remedy both shortcomings, we would need a re-formatted table like the following:
country | year | cases |
---|---|---|
Afghanistan | 1999 | 745 |
Brazil | 1999 | 37737 |
China | 1999 | 212258 |
Afghanistan | 2000 | 2666 |
Brazil | 2000 | 80488 |
China | 2000 | 213766 |
Both tables obviously contain the same information, so how do they differ?
The difference between both tables is best characterized by describing the layout of the year
variable’s values:
The 1st table (
table4a
) is in a so-called wider format. Different TB cases per year are represented in different columns (einstances ofyear
.The 2nd table represents the same data in a so-called longer format. The count of TB
cases
are stored in a dedicated variable and the different columns oftable4a
are distinguished by a separateyear
variable. (Theyear
variable is sometimes called akey
, as it qualifies the values of thecases
variable.)
Note that the terms “wide” and “long” only make sense in relative terms:
A 6 x 3 table really is not very long, but still longer than (or twice as long as) a 3 x 3 table that contains the same information.
And the 3 x 3 table is actually not wider than the 6 x 3 table (as the latter requires an additional year
variable to qualify the cases
).
But as the values of cases
are distributed over 2 columns in table4a
, we would still consider it to be in a wider format relative to the longer 6 x 3 table.
In psychology, we often represent data collected over multiple measurements (e.g., several tests of one person or the same test administered at different times) in wide format.
Many software packages (like SPSS) require data to be in long format for certain analyses (e.g., MANOVA), but in wide format for others (like multiple regression).
As we have seen, R and ggplot2 frequently require data to be in long format.
Thus, learning to tranform data from wide to long (and vice versa) is an essential data science skill.
The tidyr package provides the gather()
and spread()
commands to achieve this transformation.
7.2.4 gather()
or pivot_longer()
make wide data longer
Gathering is the opposite of spreading (to be discussed in Section 7.2.5 below) and used when observations that are distributed over multiple columns should be contained in one variable (column). More specifically, gather
moves the values of several variables (columns) into one column value
and describes this value by the value of a new key
variable. When gathering more than two variables, this reduces the number of columns by increasing the number of rows (i.e., makes a wide data set longer).47
Using gather()
requires the following arguments:
data
is a data frame or tibble;
key
is the name of the variable that describes the values of the gathered columns (or name of the independent variable);
value
is the name of the variable that is contained in the gathered columns (or the name of the dependent variable);
...
(orvar_x:var_y
) is a selection (or block) of variables (columns) to be gathered.
Example
We illustrate the gather()
command with table4a
:
# Data:
table4a <- tidyr::table4a
table4a
#> # A tibble: 3 × 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
As the counts of cases are distributed over 2 variables (columns) for each country, table4a
is in wide format.
Here is how we can gather()
the two columns 1999
and 2000
into a count of cases
and qualify them by a new year
variable:
# gather 2 variables into 1 variable:
gather(data = table4a,
key = year, value = cases,
`1999`:`2000`)
#> # A tibble: 6 × 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766
# ?gather # provides documentation
As with other tidyr commands, there are many variants of gather()
that yield the same result.
For instance, we can omit the argument names, use the pipe operator to simplify the command, or provide the variables to be gathered in numeric format:
# Omitting argument names:
gather(table4a, year, cases, `1999`:`2000`)
# The same command using the pipe:
table4a %>% gather(year, cases, `1999`:`2000`)
# The same command specifying the numbers of the columns to gather:
table4a %>% gather(2:3, key = year, value = cases)
Note that the key variable year
is of type character in the above example.
If we wanted our key
variable to be converted into a number (here: an integer), we can add the optional argument convert = TRUE
:
## Default: convert = FALSE:
# table4a %>%
# gather(key = year, value = cases, `1999`:`2000`, convert = FALSE)
## => year is a character vector.
## Converting year into an integer:
table4a %>%
gather(key = year, value = cases, `1999`:`2000`, convert = TRUE)
#> # A tibble: 6 × 3
#> country year cases
#> <chr> <int> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766
# => year is a vector of integers.
Since version 1.0.0 of tidyr onwards (released in September 2019), we can replace gather()
by the pivot_longer()
function.
Its documentation notes the following:
“pivot_longer()
is an updated approach to gather()
, designed to be both simpler to use and to handle more use cases.
We recommend you use pivot_longer()
for new code; gather()
isn’t going away but is no longer under active development.” (Wickham & Girlich, 2024)
Whereas, the key
and value
arguments of gather()
tend to be confusing, the corresponding names_to
and values_to
arguments of pivot_longer()
are more intuitive.
Thus, using the pivot_longer()
function to make table4a
longer is simple and straightforward:
# Data:
table4a <- tidyr::table4a
table4a
#> # A tibble: 3 × 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
# Using pivot_longer():
table4a %>%
pivot_longer(cols = `1999`:`2000`,
names_to = "year",
values_to = "cases")
#> # A tibble: 6 × 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Afghanistan 2000 2666
#> 3 Brazil 1999 37737
#> 4 Brazil 2000 80488
#> 5 China 1999 212258
#> 6 China 2000 213766
Overall, if you are new to tidyr and only want to study one command, learning pivot_longer()
is a better investment.
Practice
- Save the following data as a tibble
de
and then turn it into a tidy tibblede_2
(by using eithergather()
orpivot_longer()
to create a dependent variableshare
and listing the electionyear
as an additional variable).
Hint: First convert the table into a longer format and then use separate()
and select()
to obtain appropriate variables.
party | share_2013 | share_2017 |
---|---|---|
CDU/CSU | 0.415 | 0.330 |
SPD | 0.257 | 0.205 |
Others | 0.328 | 0.465 |
Solution
- Using
gather()
andseparate()
:
# (a) Data (to be saved as a tibble):
de_1
#> # A tibble: 3 × 3
#> party share_2013 share_2017
#> <fct> <dbl> <dbl>
#> 1 CDU/CSU 0.415 0.33
#> 2 SPD 0.257 0.205
#> 3 Others 0.328 0.465
# (b) Converting de into a tidy data table:
de_2 <- de_1 %>%
gather(share_2013:share_2017, key = "election", value = "share") %>%
separate(col = election, into = c("dummy", "year")) %>%
select(year, party, share) %>%
arrange(year, party)
de_2
#> # A tibble: 6 × 3
#> year party share
#> <chr> <fct> <dbl>
#> 1 2013 CDU/CSU 0.415
#> 2 2013 SPD 0.257
#> 3 2013 Others 0.328
#> 4 2017 CDU/CSU 0.33
#> 5 2017 SPD 0.205
#> 6 2017 Others 0.465
- Using
pivot_longer()
andseparate()
:
# (a) Data (to be saved as a tibble):
de_1
#> # A tibble: 3 × 3
#> party share_2013 share_2017
#> <fct> <dbl> <dbl>
#> 1 CDU/CSU 0.415 0.33
#> 2 SPD 0.257 0.205
#> 3 Others 0.328 0.465
# (b) Converting de into a tidy data table:
de_3 <- de_1 %>%
pivot_longer(share_2013:share_2017, names_to = "election", values_to = "share") %>%
separate(col = election, into = c("dummy", "year")) %>%
select(year, party, share) %>%
arrange(year, party)
de_3
#> # A tibble: 6 × 3
#> year party share
#> <chr> <fct> <dbl>
#> 1 2013 CDU/CSU 0.415
#> 2 2013 SPD 0.257
#> 3 2013 Others 0.328
#> 4 2017 CDU/CSU 0.33
#> 5 2017 SPD 0.205
#> 6 2017 Others 0.465
# Verify equality:
all.equal(de_2, de_3)
#> [1] TRUE
7.2.5 spread()
or pivot_wider()
make long data wider
Spreading is the opposite of gathering (see Section 7.2.4 above) and used when an observation that should be in 1 row is distributed over multiple rows (in 1 column). More specifically, spread
puts the values of several cases (rows) into different variables (columns) of 1 row. When spreading more than 2 rows per case, this decreases the number of rows by increasing the number of columns (i.e., makes a long data set wider).48
Using spread
requires the following arguments:
data
is a data frame or tibble;
key
is the name of the variable that describes the values of the gathered columns (or the names of the independent variables which become the names of the new columns);
value
is the name of the variable whose values should be spread over multiple columns (or the name of the dependent variable).
Note that we do not need to specify a range of new columns. The number and the names of new columns is determined by the number of different values in the key
variable.
Example
We illustrate the spread()
function with table2
:
country | year | type | count |
---|---|---|---|
Afghanistan | 1999 | cases | 745 |
Afghanistan | 1999 | population | 19987071 |
Afghanistan | 2000 | cases | 2666 |
Afghanistan | 2000 | population | 20595360 |
Brazil | 1999 | cases | 37737 |
Brazil | 1999 | population | 172006362 |
Brazil | 2000 | cases | 80488 |
Brazil | 2000 | population | 174504898 |
China | 1999 | cases | 212258 |
China | 1999 | population | 1272915272 |
China | 2000 | cases | 213766 |
China | 2000 | population | 1280428583 |
Note that the variable count
actually contains 2 different DVs, which are further qualified by the values of the key variable type
.
Here is how we can spread
the values of the count
variable into several columns:
# spread 2 rows into 2 columns of 1 row:
spread(data = table2, key = type, value = count)
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# ?spread # provides documentation
The resulting 6 x 4 tibble contains two new columns cases
and population
.
Here are some variants of the same spread
command:
# Omitting argument names:
spread(table2, type, count)
# The same command using the pipe:
table2 %>%
spread(key = type, value = count)
# Using the pipe without argument names:
table2 %>% spread(type, count)
If we want to make it clear that the new variables (i.e., the columns cases
and population
) were created by spreading a common key variable (type
), we could specify an additional sep
argument that inserts a character between the key and its value to create the new variable name:
# Use <key><sep><value> to create explicit column names:
table2 %>%
spread(key = type, value = count, sep = "_")
#> # A tibble: 6 × 4
#> country year type_cases type_population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
As we noted in Section 7.2.4, version 1.0.0 of tidyr (released in September 2019) offers new pivoting functions, that are simpler and more intuitive to use. The spread()
function corresponds to the pivot_wider()
function:
Whereas, the key
and value
arguments of gather()
tend to be confusing, the corresponding names_to
and values_to
arguments of pivot_wider()
are more intuitive. Thus, using the pivot_wider()
function to make table2
wider is straightforward:
# Data:
table2 <- tidyr::table2
table2
#> # A tibble: 12 × 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
# Using pivot_wider():
table2 %>%
pivot_wider(names_from = "type",
values_from = "count")
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
Again, learning pivot_wider()
is recommended over spread()
, as it is simpler and more powerful.
Practice
- Take the tidy 6 x 3 tibble
de_2
(from above) and usespread()
orpivot_wider()
to create a 3 x 3 tibblede_3
that re-creates the original tibblede_1
from it.
Solution
- Using
spread()
:
# (a) Data from above:
de_2
#> # A tibble: 6 × 3
#> year party share
#> <chr> <fct> <dbl>
#> 1 2013 CDU/CSU 0.415
#> 2 2013 SPD 0.257
#> 3 2013 Others 0.328
#> 4 2017 CDU/CSU 0.33
#> 5 2017 SPD 0.205
#> 6 2017 Others 0.465
# (b) Using spread() to put share by year into 2 columns/variables:
de_3 <- de_2 %>%
spread(key = year, value = share) %>%
rename(share_2013 = `2013`, # restore original variable names
share_2017 = `2017`)
de_3
#> # A tibble: 3 × 3
#> party share_2013 share_2017
#> <fct> <dbl> <dbl>
#> 1 CDU/CSU 0.415 0.33
#> 2 SPD 0.257 0.205
#> 3 Others 0.328 0.465
# (c) Comparing de_3 to de:
de_1
#> # A tibble: 3 × 3
#> party share_2013 share_2017
#> <fct> <dbl> <dbl>
#> 1 CDU/CSU 0.415 0.33
#> 2 SPD 0.257 0.205
#> 3 Others 0.328 0.465
all.equal(de_3, de_1)
#> [1] TRUE
- Using
pivot_wider()
:
# (a) Data from above:
de_2
#> # A tibble: 6 × 3
#> year party share
#> <chr> <fct> <dbl>
#> 1 2013 CDU/CSU 0.415
#> 2 2013 SPD 0.257
#> 3 2013 Others 0.328
#> 4 2017 CDU/CSU 0.33
#> 5 2017 SPD 0.205
#> 6 2017 Others 0.465
# (b) Using spread_wider() to put share by year into 2 columns/variables:
de_4 <- de_2 %>%
pivot_wider(names_from = year, values_from = share) %>%
rename(share_2013 = `2013`, # restore original variable names
share_2017 = `2017`)
de_4
#> # A tibble: 3 × 3
#> party share_2013 share_2017
#> <fct> <dbl> <dbl>
#> 1 CDU/CSU 0.415 0.33
#> 2 SPD 0.257 0.205
#> 3 Others 0.328 0.465
# (c) Comparing de_4 to de_1:
de_1
#> # A tibble: 3 × 3
#> party share_2013 share_2017
#> <fct> <dbl> <dbl>
#> 1 CDU/CSU 0.415 0.33
#> 2 SPD 0.257 0.205
#> 3 Others 0.328 0.465
all.equal(de_4, de_1)
#> [1] TRUE
7.2.6 Multiple DVs
Before getting too excited about the powers of gather()
and spread()
we should mention an important limitation:
Both commands are designed to gather and spread the values of a single dependent variable. In psychology, however, we are frequently dealing with multiple dependent variables. Fortunately, we can use a sequence of tidyr commands to deal with this case as well — or use the reshape()
function (from the stats package included in R) as an alternative.
Example
Yet another way of representing the data contained in table1
(and its variations in table2
to table7
) is provided by
table8
of the ds4psy package (also available online as a csv-file at http://rpository.com/ds4psy/data/table8.csv):
country | cases_1999 | cases_2000 | popu_1999 | popu_2000 |
---|---|---|---|---|
Afghanistan | 745 | 2666 | 19987071 | 20595360 |
Brazil | 37737 | 80488 | 172006362 | 174504898 |
China | 212258 | 213766 | 1272915272 | 1280428583 |
This 3 x 5 table is a version of table1
that seems quite natural to many social scientists:
The data for three countries are listed as rows (observations), with a variable
country
denoting the country name.The four measurements — 2 levels (
1999
and2000
) on each of 2 variables (cases
andpopu
) — are listed as different variables (columns).
As this format succinctly expresses the repeated measurement of a variable (e.g., cases
) on different years for each instance of country
, some statistical software packages (like SPSS) require this format for some analyses (e.g., MANOVA). In R, we typically don’t want nested formats that distribute instances of the same variable (e.g., cases
) over multiple columns (cases_1999
vs. cases_2000
). In the following, we show three alternative ways of tidying table8
.
1. Combining gather()
and spread()
Gather the four dependent variables, then separate the key into two variables, and spread the two different dependent variables into columns:
# Copy data from ds4psy package:
table8 <- ds4psy::table8
dim(table8) # 3 x 5
#> [1] 3 5
# Import data (from csv-file online):
# table8b <- readr::read_csv("http://rpository.com/ds4psy/data/table8.csv") # from online source
# all.equal(table8, table8b) # verify equality
## (1) Gather, separate, spread: --------
table8_tidy <- table8 %>%
gather(key = "key", value = "value", cases_1999:popu_2000) %>%
separate(col = key, into = c("type", "year")) %>%
spread(key = type, value = "value")
names(table8_tidy)[4] <- "population" # fix variable name
table8_tidy
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <chr> <dbl> <dbl>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
2. Using two tables
An alternative solution first splits table8
into two tidy sub-tables (one for cases
and one for population
counts), and then combines both sub-tables:
# Copy data from ds4psy package:
table8 <- ds4psy::table8
dim(table8) # 3 x 5
#> [1] 3 5
## (2) Split table8 into 2 sub-tables: --------
## (a) Tidy counts of cases per year:
t8a <- table8 %>%
select(country:cases_2000) %>%
gather(key = "key", value = "cases", cases_1999:cases_2000) %>%
separate(col = key, into = c("drop", "year")) %>%
select(country, year, cases) %>%
arrange(country, year)
t8a
#> # A tibble: 6 × 3
#> country year cases
#> <chr> <chr> <dbl>
#> 1 Afghanistan 1999 745
#> 2 Afghanistan 2000 2666
#> 3 Brazil 1999 37737
#> 4 Brazil 2000 80488
#> 5 China 1999 212258
#> 6 China 2000 213766
## (b) Tidy counts of populations per year:
t8b <- table8 %>%
select(country, popu_1999:popu_2000) %>%
gather(key = "key", value = "population", popu_1999:popu_2000) %>%
separate(col = key, into = c("drop", "year")) %>%
select(country, year, population) %>%
arrange(country, year)
t8b
#> # A tibble: 6 × 3
#> country year population
#> <chr> <chr> <dbl>
#> 1 Afghanistan 1999 19987071
#> 2 Afghanistan 2000 20595360
#> 3 Brazil 1999 172006362
#> 4 Brazil 2000 174504898
#> 5 China 1999 1272915272
#> 6 China 2000 1280428583
## (c) Join both sub-tables: --------
# Either by using join commands of dplyr (see the next chapter on "Relational data"):
table8_tidy_2 <- dplyr::left_join(t8a, t8b)
table8_tidy_2
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <chr> <dbl> <dbl>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# OR (adding new variables by assignment):
table8_tidy_2b <- t8a # copy
table8_tidy_2b$population <- t8b$population # CAREFUL: Ensure that both tables are in same order (see arrange above)!
table8_tidy_2b
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <chr> <dbl> <dbl>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
3. Using reshape()
(from base R stats)
Using the base R function reshape()
(included in the R stats package) also solves our problem (and even allows combining multiple blocks of DVs at once).
However, since reshape()
assumes a data frame, rather than a tibble, we will convert it with as.data.frame()
and re-convert the result into a tibble with as_tibble()
(and add some adjustments to recover the year
variable):
# Copy data from ds4psy package:
table8 <- ds4psy::table8
dim(table8) # 3 x 5
#> [1] 3 5
# Convert into data frame:
table8 <- as.data.frame(table8)
dim(table8)
#> [1] 3 5
# See
# ?reshape # for details
## From wide to long by using reshape: --------
table8_long <- stats::reshape(table8,
varying = list(cases = c("cases_1999", "cases_2000"), # 1st set of variables to combine into 1
population = c("popu_1999", "popu_2000")), # 2nd set of variables to combine into 1
direction = "long",
v.names = c("cases", "population"), # to rename combined variables
timevar = "time", # name of time variable
idvar = "nr" # name of id variable
)
table8_long
#> country time cases population nr
#> 1.1 Afghanistan 1 745 19987071 1
#> 2.1 Brazil 1 37737 172006362 2
#> 3.1 China 1 212258 1272915272 3
#> 1.2 Afghanistan 2 2666 20595360 1
#> 2.2 Brazil 2 80488 174504898 2
#> 3.2 China 2 213766 1280428583 3
## Fix some stuff: --------
# Add a year variable (based on the time value):
table8_long$year <- NA # initialize
table8_long$year[table8_long$time == 1] <- 1999
table8_long$year[table8_long$time == 2] <- 2000
# Select and re-arrange:
table8_tidy_3 <- table8_long %>%
select(country, year, cases, population) %>%
arrange(country, year)
# Make year a character (as in table8_tidy above):
table8_tidy_3$year <- as.character(table8_tidy_3$year)
# Convert into tibble:
table8_tidy_3 <- tibble::as_tibble(table8_tidy_3)
table8_tidy_3
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <chr> <dbl> <dbl>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
4. Using the pivot
functions from tidyr 1.0.0
Using the pivot_longer()
function (available from tidyr version 1.0.0 onwards):
# Copy data from ds4psy package:
table8 <- ds4psy::table8
dim(table8) # 3 x 5
#> [1] 3 5
# table8
# Using pivot_longer:
table8_tidy_4 <- table8 %>%
tidyr::pivot_longer(cols = cases_1999:popu_2000,
names_to = c(".value", "year"),
names_sep = "_")
# Make year a character (as in table8_tidy above):
table8_tidy_4$year <- as.character(table8_tidy_4$year)
# Rename popu column to population:
table8_tidy_4 <- table8_tidy_4 %>% rename(population = popu)
table8_tidy_4
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <chr> <dbl> <dbl>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
Using the pivot_longer()
function with the names_to
and names_sep
arguments (and the special ".value"
sentinel, which results in ignoring this value, and naming the value column from a part of the existing column names) is short, but rather cryptic again.
More explicit control about the transformation is provided by the pivot_longer_spec()
function.
This option, however, is so new (as of November 2020), that there are few examples of it yet.
However, the excellent examples in vignette("pivot")
provide useful starting points.
Verify the equality of solutions
To make sure that all four solutions yielded the same result, let’s verify their equality:
## Verify that all 4 of the above solutions are equal:
all.equal(table8_tidy, table8_tidy_2) # TRUE
#> [1] TRUE
all.equal(table8_tidy, table8_tidy_2b) # TRUE
#> [1] TRUE
all.equal(table8_tidy, table8_tidy_3, check.attributes = FALSE) # TRUE (except for attributes)
#> [1] TRUE
all.equal(table8_tidy, table8_tidy_4) # TRUE, qed.
#> [1] TRUE
As we have seen before, R provides many alternative ways to achieve a result. This is great, of course, but also implies that it is absolutely crucial that we know which particular result we want to achieve.
7.2.7 Beware of updates
As the previous sections have shown, the tidyr package (Wickham & Girlich, 2024) is still changing and improving, which can be really annoying, when working code suddenly starts failing. Its version 1.0.0 was released in September 2019 with many substantial changes, including new functionality:
- Pivoting essentially improves on and replaces
spread()
andgather()
. - Rectangling allows converting deeply nested lists into tidy data frames or tibbles.
To learn more about the corresponding functions, study the extensive vignettes:
As an additional bonus, tidyr contains many interesting practice datasets, including:
# tidyr datasets:
tidyr::billboard
tidyr::construction
tidyr::relig_income
tidyr::us_rent_income
tidyr::who
tidyr::population
See https://www.tidyverse.org/blog/2019/09/tidyr-1-0-0 for additional details and watch https://www.tidyverse.org/ for further updates.