7.2 Essential tidyr commands

The tidyr package (Wickham & Henry, 2020) provides commands to create and transform tidy data. Although the package provides many functions, we consider two pairs of two complementary tidyr commands as essential:

  1. separate() splits one variable into two variables;
  2. unite() combines two variables into one variable;
  3. gather() or pivot_longer() make wide data longer (by gathering many variables into one variable);
  4. spread() or pivot_wider() make long data wider (by spreading one variable into many variables).

The first two commands allow splitting or combining variables: separate is the complement to or inverse/opposite of unite. The second two pairs of commands allow changing the data layout by making a given table longer or wider: spread() and pivot_wider() are the complements to or inverses/opposites of gather() and pivot_longer().

In the following sections, we explain how to use these four essential commands.

7.2.1 separate() 1 variable into 2

The separate() function splits 1 variable (column) into multiple variables (columns) — at a position where some separator character appears — and is the complement of unite(). Using separate() requires the following arguments:

  • some tibble/data frame data;
  • the variable (column) col to be separated (specified by its name or column number);
  • the names of the new variables (columns) into which col is to be split (specified as a character vector);
  • the separator character sep (as a character/regular expression).

An additional argument remove regulates whether the original columns are dropped from the output tibble. By default, remove = TRUE.

Example

In tidyr::table3 (a 6 x 3 tibble), the column rate contains two numbers, separated by a forward slash /:

# Data to use: 
tidyr::table3  # Note that column rate contains 2 numbers, separated by "/". 
#> # A tibble: 6 × 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

We can use separate() to split the rate variable at the / character into two variables (cases and population):

# Full separate command:
separate(data = table3, col = rate, into = c("cases", "population"), sep = "/", remove = TRUE)
#> # A tibble: 6 × 4
#>   country      year cases  population
#>   <chr>       <int> <chr>  <chr>     
#> 1 Afghanistan  1999 745    19987071  
#> 2 Afghanistan  2000 2666   20595360  
#> 3 Brazil       1999 37737  172006362 
#> 4 Brazil       2000 80488  174504898 
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

The output is a 6 x 4 tibble that contains separate columns for cases and population. Note that both the original rate column and the / character disappeared from the output tibble.

The following commands are shorter variants that achieve the same result:

# Omitting argument names, allowing to guess sep character, using remove default:
separate(table3, rate, c("cases", "population"))

# Using the pipe: 
table3 %>% 
  separate(rate, c("cases", "population"))

# Specifying the variable to be split (rate) by its column number (3):
table3 %>% 
  separate(3, c("cases", "population"))

These examples shows that the argument names (data, col, and into) can be omitted (but still require appropriate arguments in the correct order) and sep can be left unspecified when tidyr can make a good guess what the separator character might be. Also, the variable to be separated can be specified by its numeric index (column number).

If we wanted to keep the rate variable, we could call:

# Not dropping the original rate variable:
t <- table3 %>% 
  separate(rate, c("cases", "population"), remove = FALSE)

Finally, since the rate variable of table3 was of character type, all the tables so far contained variables for cases and population that also were of type character. To get these variables as numbers, we could either explicitly convert them (e.g., by applying the as.integer() function to these columns) or let separate() do this job by providing the convert = TRUE argument:

# Not dropping the original rate variable:
table3 %>% 
  separate(rate, c("cases", "population"), convert = TRUE)
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

Practice

Consider table6 of the ds4psy package (also available online as a csv-file at http://rpository.com/ds4psy/data/table6.csv):

# Copy data from ds4psy package: 
table6 <- ds4psy::table6

# Load data (from csv file online): 
# table6b <- readr::read_csv("http://rpository.com/ds4psy/data/table6.csv")  # from online source

# Alternatively (from local source "data/table6.csv"): 
# table6c <- readr::read_csv("data/table6.csv")  # from local directory

# all.equal(table6, table6b)  # verify equality
  • Inspect table6 and describe what difficulty may occur when trying to apply separate() to it.
table6  # Note that column when_what contains several splitting options. 
#> # A tibble: 6 × 2
#>   country     when_what              
#>   <chr>       <chr>                  
#> 1 Afghanistan 19_99.745/19987071     
#> 2 Afghanistan 20_00.2666/20595360    
#> 3 Brazil      19_99.37737/172006362  
#> 4 Brazil      20_00.80488/174504898  
#> 5 China       19_99.212258/1272915272
#> 6 China       20_00.213766/1280428583

In table6, the variable when_what contains several plausible separation characters: _, ., and /.

  • What happens when we apply separate to when_what without providing a separating character sep?
# What happens when we do not specify "sep"? 
table6 %>%
  separate(col = when_what, into = c("var_1", "var_2"))  # sep is not provided!
#> # A tibble: 6 × 3
#>   country     var_1 var_2
#>   <chr>       <chr> <chr>
#> 1 Afghanistan 19    99   
#> 2 Afghanistan 20    00   
#> 3 Brazil      19    99   
#> 4 Brazil      20    00   
#> 5 China       19    99   
#> 6 China       20    00

The variable when_what is split at the first plausible option (_), but a warning is issued and some data is lost from the output table!

  • Split the variable when_what in three different ways (by providing different sep arguments).
# Specifying different splitting characters:
# (a) split at "_": 
table6 %>%
  separate(col = when_what, into = c("var_1", "var_2"), sep = "_")  # 
#> # A tibble: 6 × 3
#>   country     var_1 var_2               
#>   <chr>       <chr> <chr>               
#> 1 Afghanistan 19    99.745/19987071     
#> 2 Afghanistan 20    00.2666/20595360    
#> 3 Brazil      19    99.37737/172006362  
#> 4 Brazil      20    00.80488/174504898  
#> 5 China       19    99.212258/1272915272
#> 6 China       20    00.213766/1280428583

# (b) split at "." (specified as a regular expression "\\."):
table6 %>%
  separate(col = when_what, into = c("var_1", "var_2"), sep = "\\.")  
#> # A tibble: 6 × 3
#>   country     var_1 var_2            
#>   <chr>       <chr> <chr>            
#> 1 Afghanistan 19_99 745/19987071     
#> 2 Afghanistan 20_00 2666/20595360    
#> 3 Brazil      19_99 37737/172006362  
#> 4 Brazil      20_00 80488/174504898  
#> 5 China       19_99 212258/1272915272
#> 6 China       20_00 213766/1280428583

# (c) split at "/":
table6 %>%
  separate(col = when_what, into = c("var_1", "var_2"), sep = "/")
#> # A tibble: 6 × 3
#>   country     var_1        var_2     
#>   <chr>       <chr>        <chr>     
#> 1 Afghanistan 19_99.745    19987071  
#> 2 Afghanistan 20_00.2666   20595360  
#> 3 Brazil      19_99.37737  172006362 
#> 4 Brazil      20_00.80488  174504898 
#> 5 China       19_99.212258 1272915272
#> 6 China       20_00.213766 1280428583

Note that using the point or period (.) as a splitting character sep = "." would not work. Instead, we need to use the corresponding regular expression sep = "\\.". (See Appendix E for a primer on using regular expressions.)

None of the solutions so far is satisfactory. This shows that we may need to use a sequence of separate() commands if a table contains condensed columns as table6.

  • Split the when_what variable of table6 three times to create a tibble table6a that contains five variables (columns) and reasonable variable names.

Here is a solution that takes three steps (separating one or two variables at a time):

table6a <- table6 %>%
  separate(col = when_what, into = c("century", "stuff"), sep = "_") %>%  # (a)
  separate(col = stuff, into = c("year", "stuff"), sep = "\\.") %>%       # (b) 
  separate(col = stuff, into = c("cases", "population"), sep = "/")       # (c)

table6a
#> # A tibble: 6 × 5
#>   country     century year  cases  population
#>   <chr>       <chr>   <chr> <chr>  <chr>     
#> 1 Afghanistan 19      99    745    19987071  
#> 2 Afghanistan 20      00    2666   20595360  
#> 3 Brazil      19      99    37737  172006362 
#> 4 Brazil      20      00    80488  174504898 
#> 5 China       19      99    212258 1272915272
#> 6 China       20      00    213766 1280428583

and here is a solution that achieves the same result in a single step (separating four variables at once):

table6 %>% 
  separate(when_what, c("century", "year", "cases", "population"))  # 4 variables in 1 step
#> # A tibble: 6 × 5
#>   country     century year  cases  population
#>   <chr>       <chr>   <chr> <chr>  <chr>     
#> 1 Afghanistan 19      99    745    19987071  
#> 2 Afghanistan 20      00    2666   20595360  
#> 3 Brazil      19      99    37737  172006362 
#> 4 Brazil      20      00    80488  174504898 
#> 5 China       19      99    212258 1272915272
#> 6 China       20      00    213766 1280428583

Note that the resulting 6 x 5 tibble (here: table6a) is similar to table3 and table5 (if we applied separate() to their rate variable). Also, we may want to convert some variables from their current character format into numbers before analyzing the data further.

7.2.2 unite() 2 variables into 1

unite() is the complement to (or opposite of) separate(): unite() combines two variables (columns) into one variable (column) and allows adding an optional separator character. Using unite() requires the following arguments:

  • some tibble/data frame data;
  • the name of the new compound variable (column) col (specified as a character);
  • the names of the variables (columns) to be combined (specified by their names or column numbers);
  • an optional separator character sep (as a character/regular expression).

An additional argument remove determines whether the original columns are dropped from the output tibble. By default, remove = TRUE.

Example

In tidyr::table5 (a 6 x 4 tibble), the two columns century and year contain two numbers (represented as characters) that we may want to combine into one variable:

# Data to use: 
tidyr::table5  # Note that columns 2 and 3 contain 2 values (as characters!) that belong together. 
#> # A tibble: 6 × 4
#>   country     century year  rate             
#> * <chr>       <chr>   <chr> <chr>            
#> 1 Afghanistan 19      99    745/19987071     
#> 2 Afghanistan 20      00    2666/20595360    
#> 3 Brazil      19      99    37737/172006362  
#> 4 Brazil      20      00    80488/174504898  
#> 5 China       19      99    212258/1272915272
#> 6 China       20      00    213766/1280428583

The full unite() command for combining century and year into a variable yr is as follows:

# Full separate command:
unite(data = table5, col = "yr", century, year, sep = "")
#> # A tibble: 6 × 3
#>   country     yr    rate             
#>   <chr>       <chr> <chr>            
#> 1 Afghanistan 1999  745/19987071     
#> 2 Afghanistan 2000  2666/20595360    
#> 3 Brazil      1999  37737/172006362  
#> 4 Brazil      2000  80488/174504898  
#> 5 China       1999  212258/1272915272
#> 6 China       2000  213766/1280428583

Note that we needed to explicitly specify sep = "" (i.e., no separating character) to prevent unite() from adding a default character (_). Also, the century and year variables disappeared from the 6 x 3 output tibble.

As before, we can use various shorter variants of the same command:

# (a) Omitting argument names: 
unite(table5, "yr", century, year, sep = "")
#> # A tibble: 6 × 3
#>   country     yr    rate             
#>   <chr>       <chr> <chr>            
#> 1 Afghanistan 1999  745/19987071     
#> 2 Afghanistan 2000  2666/20595360    
#> 3 Brazil      1999  37737/172006362  
#> 4 Brazil      2000  80488/174504898  
#> 5 China       1999  212258/1272915272
#> 6 China       2000  213766/1280428583

# (b) Using the pipe: 
table5 %>%
  unite("yr", century, year, sep = "")
#> # A tibble: 6 × 3
#>   country     yr    rate             
#>   <chr>       <chr> <chr>            
#> 1 Afghanistan 1999  745/19987071     
#> 2 Afghanistan 2000  2666/20595360    
#> 3 Brazil      1999  37737/172006362  
#> 4 Brazil      2000  80488/174504898  
#> 5 China       1999  212258/1272915272
#> 6 China       2000  213766/1280428583

# (c) Specifying the variables to be combined by column numbers (2 & 3):
table5 %>% 
  unite("yr", 2, 3, sep = "")
#> # A tibble: 6 × 3
#>   country     yr    rate             
#>   <chr>       <chr> <chr>            
#> 1 Afghanistan 1999  745/19987071     
#> 2 Afghanistan 2000  2666/20595360    
#> 3 Brazil      1999  37737/172006362  
#> 4 Brazil      2000  80488/174504898  
#> 5 China       1999  212258/1272915272
#> 6 China       2000  213766/1280428583

Finally, here are some variants that illustrate the effects of using different sep and remove arguments:

# Providing a different separation character:
table5 %>%
  unite("yr", century, year, sep = "<--|-->")
#> # A tibble: 6 × 3
#>   country     yr          rate             
#>   <chr>       <chr>       <chr>            
#> 1 Afghanistan 19<--|-->99 745/19987071     
#> 2 Afghanistan 20<--|-->00 2666/20595360    
#> 3 Brazil      19<--|-->99 37737/172006362  
#> 4 Brazil      20<--|-->00 80488/174504898  
#> 5 China       19<--|-->99 212258/1272915272
#> 6 China       20<--|-->00 213766/1280428583

# Keeping the original variables:
table5 %>%
  unite("yr", century, year, sep = "", remove = FALSE)
#> # A tibble: 6 × 5
#>   country     yr    century year  rate             
#>   <chr>       <chr> <chr>   <chr> <chr>            
#> 1 Afghanistan 1999  19      99    745/19987071     
#> 2 Afghanistan 2000  20      00    2666/20595360    
#> 3 Brazil      1999  19      99    37737/172006362  
#> 4 Brazil      2000  20      00    80488/174504898  
#> 5 China       1999  19      99    212258/1272915272
#> 6 China       2000  20      00    213766/1280428583

Practice

  1. To turn tidyr::table5 into tidy data, it may be indicated to apply both unite() and separate() to it. Can you achieve this in one pipe?

Solution

tidyr::table5 %>%
  unite("year", century, year, sep = "") %>%
  separate("rate", into = c("cases", "population"))
#> # A tibble: 6 × 4
#>   country     year  cases  population
#>   <chr>       <chr> <chr>  <chr>     
#> 1 Afghanistan 1999  745    19987071  
#> 2 Afghanistan 2000  2666   20595360  
#> 3 Brazil      1999  37737  172006362 
#> 4 Brazil      2000  80488  174504898 
#> 5 China       1999  212258 1272915272
#> 6 China       2000  213766 1280428583
  1. Take the data from dplyr::storms and unite the variables year, month, day into one more complex variable date.

Solution

# Data to use: 
# dplyr::storms

# Select relevant subset of variables: 
storm_small <- dplyr::storms %>%
  select(name:day)

## Uniting year:day into date:
storm_small %>%
  unite("date", year:day, sep = "/") %>%
  head()
#> # A tibble: 6 × 2
#>   name  date     
#>   <chr> <chr>    
#> 1 Amy   1975/6/27
#> 2 Amy   1975/6/27
#> 3 Amy   1975/6/27
#> 4 Amy   1975/6/27
#> 5 Amy   1975/6/28
#> 6 Amy   1975/6/28
  1. Consider table7 of the ds4psy package (also available online as a csv-file at http://rpository.com/ds4psy/data/table7.csv):

Load or read the data into R (as table7) and inspect it:

#> # A tibble: 6 × 1
#>   where_when_what                   
#>   <chr>                             
#> 1 "Afghanistan@19:99$745\\19987071" 
#> 2 "Afghanistan@20:00$2666\\20595360"
#> 3 "Brazil@19:99$37737\\172006362"   
#> 4 "Brazil@20:00$80488\\174504898"   
#> 5 "China@19:99$212258\\1272915272"  
#> 6 "China@20:00$213766\\1280428583"

Then transform table7 in the following ways:

  1. Use multiple (4) separate() commands to split table7 into a tibble table7a with multiple (5) columns.

  2. Use multiple (4) unite() commands on table7a to re-create a tibble table7b that contains all data in one column.

Solution

Possible solutions for table7a and table7b:

  1. Separating table7 into table7a:
#> # A tibble: 6 × 5
#>   country     century year  rate   population
#>   <chr>       <chr>   <chr> <chr>  <chr>     
#> 1 Afghanistan 19      99    745    19987071  
#> 2 Afghanistan 20      00    2666   20595360  
#> 3 Brazil      19      99    37737  172006362 
#> 4 Brazil      20      00    80488  174504898 
#> 5 China       19      99    212258 1272915272
#> 6 China       20      00    213766 1280428583
  1. Re-uniting table7a into table7b:
#> # A tibble: 6 × 1
#>   where_when_what               
#>   <chr>                         
#> 1 Afghanistan:1999_745/19987071 
#> 2 Afghanistan:2000_2666/20595360
#> 3 Brazil:1999_37737/172006362   
#> 4 Brazil:2000_80488/174504898   
#> 5 China:1999_212258/1272915272  
#> 6 China:2000_213766/1280428583

7.2.3 Wide vs. long format

Although separate() and unite() change the number of columns in a data table, these commands do not change the basic layout of the table. By contrast, gather() and its complement spread() do change the overall layout by either collecting (or “gathering”) multiple variables into 1, or distributing (or “spreading”) 1 variable across multiple variables. A clear indication that the layout of a table has changed is that its number of cases or observations (i.e., rows) has changed.

To illustrate what this means, re-consider the example of table4a:

Table 7.9: table4a from the tidyr package.
country 1999 2000
Afghanistan 745 2666
Brazil 37737 80488
China 212258 213766

We noted above (in Sections 7.1.2 and 7.1.3) that this appears to be a natural representation of the TB cases in 3 countries over 2 years. However, we also noted that table4a does not qualify as tidy data (as the year variable is spread over 2 variables) and impractical in many contexts (e.g., it does not allow using a year variable in a ggplot() call). To remedy both shortcomings, we would need a re-formatted table like the following:

Table 7.10: A revised, longer version of table4a.
country year cases
Afghanistan 1999 745
Brazil 1999 37737
China 1999 212258
Afghanistan 2000 2666
Brazil 2000 80488
China 2000 213766

Both tables obviously contain the same information, so how do they differ? The difference between both tables is best characterized by describing the layout of the year variable’s values:

  • The 1st table (table4a) is in a so-called wider format. Different TB cases per year are represented in different columns (einstances of year.

  • The 2nd table represents the same data in a so-called longer format. The count of TB cases are stored in a dedicated variable and the different columns of table4a are distinguished by a separate year variable. (The year variable is sometimes called a key, as it qualifies the values of the cases variable.)

Note that the terms “wide” and “long” only make sense in relative terms: A 6 x 3 table really is not very long, but still longer than (or twice as long as) a 3 x 3 table that contains the same information. And the 3 x 3 table is actually not wider than the 6 x 3 table (as the latter requires an additional year variable to qualify the cases). But as the values of cases are distributed over 2 columns in table4a, we would still consider it to be in a wider format relative to the longer 6 x 3 table.

In psychology, we often represent data collected over multiple measurements (e.g., several tests of one person or the same test administered at different times) in wide format. Many software packages (like SPSS) require data to be in long format for certain analyses (e.g., MANOVA), but in wide format for others (like multiple regression). As we have seen, R and ggplot2 frequently require data to be in long format. Thus, learning to tranform data from wide to long (and vice versa) is an essential data science skill. The tidyr package provides the gather() and spread() commands to achieve this transformation.

7.2.4 gather() or pivot_longer() make wide data longer

Gathering is the opposite of spreading (to be discussed in Section 7.2.5 below) and used when observations that are distributed over multiple columns should be contained in one variable (column). More specifically, gather moves the values of several variables (columns) into one column value and describes this value by the value of a new key variable. When gathering more than two variables, this reduces the number of columns by increasing the number of rows (i.e., makes a wide data set longer).47

Using gather() requires the following arguments:

  • data is a data frame or tibble;
  • key is the name of the variable that describes the values of the gathered columns (or name of the independent variable);
  • value is the name of the variable that is contained in the gathered columns (or the name of the dependent variable);
  • ... (or var_x:var_y) is a selection (or block) of variables (columns) to be gathered.

Example

We illustrate the gather() command with table4a:

# Data: 
table4a <- tidyr::table4a
table4a
#> # A tibble: 3 × 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

As the counts of cases are distributed over 2 variables (columns) for each country, table4a is in wide format. Here is how we can gather() the two columns 1999 and 2000 into a count of cases and qualify them by a new year variable:

# gather 2 variables into 1 variable:
gather(data = table4a, 
       key = year, value = cases, 
       `1999`:`2000`)
#> # A tibble: 6 × 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Brazil      1999   37737
#> 3 China       1999  212258
#> 4 Afghanistan 2000    2666
#> 5 Brazil      2000   80488
#> 6 China       2000  213766
# ?gather # provides documentation

As with other tidyr commands, there are many variants of gather() that yield the same result. For instance, we can omit the argument names, use the pipe operator to simplify the command, or provide the variables to be gathered in numeric format:

# Omitting argument names:
gather(table4a, year, cases, `1999`:`2000`)

# The same command using the pipe:
table4a %>% gather(year, cases, `1999`:`2000`)

# The same command specifying the numbers of the columns to gather:
table4a %>% gather(2:3, key = year, value = cases)

Note that the key variable year is of type character in the above example. If we wanted our key variable to be converted into a number (here: an integer), we can add the optional argument convert = TRUE:

## Default: convert = FALSE: 
# table4a %>%
#  gather(key = year, value = cases, `1999`:`2000`, convert = FALSE)
## => year is a character vector.

## Converting year into an integer: 
table4a %>%
  gather(key = year, value = cases, `1999`:`2000`, convert = TRUE)
#> # A tibble: 6 × 3
#>   country      year  cases
#>   <chr>       <int>  <int>
#> 1 Afghanistan  1999    745
#> 2 Brazil       1999  37737
#> 3 China        1999 212258
#> 4 Afghanistan  2000   2666
#> 5 Brazil       2000  80488
#> 6 China        2000 213766
# => year is a vector of integers. 

Since version 1.0.0 of tidyr onwards (released in September 2019), we can replace gather() by the pivot_longer() function. Its documentation notes the following:

pivot_longer() is an updated approach to gather(), designed to be both simpler to use and to handle more use cases. We recommend you use pivot_longer() for new code; gather() isn’t going away but is no longer under active development.” (Wickham & Henry, 2020)

Whereas, the key and value arguments of gather() tend to be confusing, the corresponding names_to and values_to arguments of pivot_longer() are more intuitive. Thus, using the pivot_longer() function to make table4a longer is simple and straightforward:

# Data: 
table4a <- tidyr::table4a
table4a
#> # A tibble: 3 × 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

# Using pivot_longer(): 
table4a %>% 
  pivot_longer(cols = `1999`:`2000`, 
               names_to = "year",
               values_to = "cases")
#> # A tibble: 6 × 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Afghanistan 2000    2666
#> 3 Brazil      1999   37737
#> 4 Brazil      2000   80488
#> 5 China       1999  212258
#> 6 China       2000  213766

Overall, if you are new to tidyr and only want to study one command, learning pivot_longer() is a better investment.

Practice

  • Save the following data as a tibble de and then turn it into a tidy tibble de_2 (by using either gather() or pivot_longer() to create a dependent variable share and listing the election year as an additional variable).

Hint: First convert the table into a longer format and then use separate() and select() to obtain appropriate variables.

Table 7.11: Election results.
party share_2013 share_2017
CDU/CSU 0.415 0.330
SPD 0.257 0.205
Others 0.328 0.465

Solution

  • Using gather() and separate():
# (a) Data (to be saved as a tibble): 
de_1
#> # A tibble: 3 × 3
#>   party   share_2013 share_2017
#>   <fct>        <dbl>      <dbl>
#> 1 CDU/CSU      0.415      0.33 
#> 2 SPD          0.257      0.205
#> 3 Others       0.328      0.465

# (b) Converting de into a tidy data table:
de_2 <- de_1 %>%
  gather(share_2013:share_2017, key = "election", value = "share") %>%
  separate(col = election, into = c("dummy", "year")) %>%
  select(year, party, share) %>%
  arrange(year, party)

de_2
#> # A tibble: 6 × 3
#>   year  party   share
#>   <chr> <fct>   <dbl>
#> 1 2013  CDU/CSU 0.415
#> 2 2013  SPD     0.257
#> 3 2013  Others  0.328
#> 4 2017  CDU/CSU 0.33 
#> 5 2017  SPD     0.205
#> 6 2017  Others  0.465
  • Using pivot_longer() and separate():
# (a) Data (to be saved as a tibble): 
de_1
#> # A tibble: 3 × 3
#>   party   share_2013 share_2017
#>   <fct>        <dbl>      <dbl>
#> 1 CDU/CSU      0.415      0.33 
#> 2 SPD          0.257      0.205
#> 3 Others       0.328      0.465

# (b) Converting de into a tidy data table:
de_3 <- de_1 %>%
  pivot_longer(share_2013:share_2017, names_to = "election", values_to = "share") %>%
  separate(col = election, into = c("dummy", "year")) %>%
  select(year, party, share) %>%
  arrange(year, party)

de_3
#> # A tibble: 6 × 3
#>   year  party   share
#>   <chr> <fct>   <dbl>
#> 1 2013  CDU/CSU 0.415
#> 2 2013  SPD     0.257
#> 3 2013  Others  0.328
#> 4 2017  CDU/CSU 0.33 
#> 5 2017  SPD     0.205
#> 6 2017  Others  0.465

# Verify equality:
all.equal(de_2, de_3)
#> [1] TRUE

7.2.5 spread() or pivot_wider() make long data wider

Spreading is the opposite of gathering (see Section 7.2.4 above) and used when an observation that should be in 1 row is distributed over multiple rows (in 1 column). More specifically, spread puts the values of several cases (rows) into different variables (columns) of 1 row. When spreading more than 2 rows per case, this decreases the number of rows by increasing the number of columns (i.e., makes a long data set wider).48

Using spread requires the following arguments:

  • data is a data frame or tibble;
  • key is the name of the variable that describes the values of the gathered columns (or the names of the independent variables which become the names of the new columns);
  • value is the name of the variable whose values should be spread over multiple columns (or the name of the dependent variable).

Note that we do not need to specify a range of new columns. The number and the names of new columns is determined by the number of different values in the key variable.

Example

We illustrate the spread() function with table2:

Table 7.12: table2 from the tidyr package.
country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Brazil 2000 cases 80488
Brazil 2000 population 174504898
China 1999 cases 212258
China 1999 population 1272915272
China 2000 cases 213766
China 2000 population 1280428583

Note that the variable count actually contains 2 different DVs, which are further qualified by the values of the key variable type.

Here is how we can spread the values of the count variable into several columns:

# spread 2 rows into 2 columns of 1 row:
spread(data = table2, key = type, value = count)
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
# ?spread # provides documentation

The resulting 6 x 4 tibble contains two new columns cases and population.

Here are some variants of the same spread command:

# Omitting argument names:
spread(table2, type, count)

# The same command using the pipe:
table2 %>% 
  spread(key = type, value = count)

# Using the pipe without argument names: 
table2 %>% spread(type, count)

If we want to make it clear that the new variables (i.e., the columns cases and population) were created by spreading a common key variable (type), we could specify an additional sep argument that inserts a character between the key and its value to create the new variable name:

# Use <key><sep><value> to create explicit column names:
table2 %>% 
  spread(key = type, value = count, sep = "_")
#> # A tibble: 6 × 4
#>   country      year type_cases type_population
#>   <chr>       <int>      <int>           <int>
#> 1 Afghanistan  1999        745        19987071
#> 2 Afghanistan  2000       2666        20595360
#> 3 Brazil       1999      37737       172006362
#> 4 Brazil       2000      80488       174504898
#> 5 China        1999     212258      1272915272
#> 6 China        2000     213766      1280428583

As we noted in Section 7.2.4, version 1.0.0 of tidyr (released in September 2019) offers new pivoting functions, that are simpler and more intuitive to use. The spread() function corresponds to the pivot_wider() function:

Whereas, the key and value arguments of gather() tend to be confusing, the corresponding names_to and values_to arguments of pivot_wider() are more intuitive. Thus, using the pivot_wider() function to make table2 wider is straightforward:

# Data: 
table2 <- tidyr::table2
table2
#> # A tibble: 12 × 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583

# Using pivot_wider(): 
table2 %>%
  pivot_wider(names_from = "type",
              values_from = "count")
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

Again, learning pivot_wider() is recommended over spread(), as it is simpler and more powerful.

Practice

  • Take the tidy 6 x 3 tibble de_2 (from above) and use spread() or pivot_wider() to create a 3 x 3 tibble de_3 that re-creates the original tibble de_1 from it.

Solution

  • Using spread():
# (a) Data from above: 
de_2
#> # A tibble: 6 × 3
#>   year  party   share
#>   <chr> <fct>   <dbl>
#> 1 2013  CDU/CSU 0.415
#> 2 2013  SPD     0.257
#> 3 2013  Others  0.328
#> 4 2017  CDU/CSU 0.33 
#> 5 2017  SPD     0.205
#> 6 2017  Others  0.465

# (b) Using spread() to put share by year into 2 columns/variables:
de_3 <- de_2 %>% 
  spread(key = year, value = share) %>%
  rename(share_2013 = `2013`,  # restore original variable names
         share_2017 = `2017`)

de_3
#> # A tibble: 3 × 3
#>   party   share_2013 share_2017
#>   <fct>        <dbl>      <dbl>
#> 1 CDU/CSU      0.415      0.33 
#> 2 SPD          0.257      0.205
#> 3 Others       0.328      0.465

# (c) Comparing de_3 to de: 
de_1
#> # A tibble: 3 × 3
#>   party   share_2013 share_2017
#>   <fct>        <dbl>      <dbl>
#> 1 CDU/CSU      0.415      0.33 
#> 2 SPD          0.257      0.205
#> 3 Others       0.328      0.465
all.equal(de_3, de_1)
#> [1] TRUE
  • Using pivot_wider():
# (a) Data from above: 
de_2
#> # A tibble: 6 × 3
#>   year  party   share
#>   <chr> <fct>   <dbl>
#> 1 2013  CDU/CSU 0.415
#> 2 2013  SPD     0.257
#> 3 2013  Others  0.328
#> 4 2017  CDU/CSU 0.33 
#> 5 2017  SPD     0.205
#> 6 2017  Others  0.465

# (b) Using spread_wider() to put share by year into 2 columns/variables:
de_4 <- de_2 %>% 
  pivot_wider(names_from = year, values_from = share) %>%
  rename(share_2013 = `2013`,  # restore original variable names
         share_2017 = `2017`)

de_4
#> # A tibble: 3 × 3
#>   party   share_2013 share_2017
#>   <fct>        <dbl>      <dbl>
#> 1 CDU/CSU      0.415      0.33 
#> 2 SPD          0.257      0.205
#> 3 Others       0.328      0.465

# (c) Comparing de_4 to de_1: 
de_1
#> # A tibble: 3 × 3
#>   party   share_2013 share_2017
#>   <fct>        <dbl>      <dbl>
#> 1 CDU/CSU      0.415      0.33 
#> 2 SPD          0.257      0.205
#> 3 Others       0.328      0.465
all.equal(de_4, de_1)
#> [1] TRUE

7.2.6 Multiple DVs

Before getting too excited about the powers of gather() and spread() we should mention an important limitation: Both commands are designed to gather and spread the values of a single dependent variable. In psychology, however, we are frequently dealing with multiple dependent variables. Fortunately, we can use a sequence of tidyr commands to deal with this case as well — or use the reshape() function (from the stats package included in R) as an alternative.

Example

Yet another way of representing the data contained in table1 (and its variations in table2 to table7) is provided by table8 of the ds4psy package (also available online as a csv-file at http://rpository.com/ds4psy/data/table8.csv):

Table 7.13: table8: A version of table1 in a wider format.
country cases_1999 cases_2000 popu_1999 popu_2000
Afghanistan 745 2666 19987071 20595360
Brazil 37737 80488 172006362 174504898
China 212258 213766 1272915272 1280428583

This 3 x 5 table is a version of table1 that seems quite natural to many social scientists:

  • The data for three countries are listed as rows (observations), with a variable country denoting the country name.

  • The four measurements — 2 levels (1999 and 2000) on each of 2 variables (cases and popu) — are listed as different variables (columns).

As this format succinctly expresses the repeated measurement of a variable (e.g., cases) on different years for each instance of country, some statistical software packages (like SPSS) require this format for some analyses (e.g., MANOVA). In R, we typically don’t want nested formats that distribute instances of the same variable (e.g., cases) over multiple columns (cases_1999 vs. cases_2000). In the following, we show three alternative ways of tidying table8.

1. Combining gather() and spread()

Gather the four dependent variables, then separate the key into two variables, and spread the two different dependent variables into columns:

# Copy data from ds4psy package:
table8 <- ds4psy::table8
dim(table8)  # 3 x 5
#> [1] 3 5

# Import data (from csv-file online): 
# table8b <- readr::read_csv("http://rpository.com/ds4psy/data/table8.csv") # from online source

# all.equal(table8, table8b)  # verify equality


## (1) Gather, separate, spread: -------- 
table8_tidy <- table8 %>%
  gather(key = "key", value = "value", cases_1999:popu_2000) %>%
  separate(col = key, into = c("type", "year")) %>%
  spread(key = type, value = "value")

names(table8_tidy)[4] <- "population"  # fix variable name
table8_tidy 
#> # A tibble: 6 × 4
#>   country     year   cases population
#>   <chr>       <chr>  <dbl>      <dbl>
#> 1 Afghanistan 1999     745   19987071
#> 2 Afghanistan 2000    2666   20595360
#> 3 Brazil      1999   37737  172006362
#> 4 Brazil      2000   80488  174504898
#> 5 China       1999  212258 1272915272
#> 6 China       2000  213766 1280428583

2. Using two tables

An alternative solution first splits table8 into two tidy sub-tables (one for cases and one for population counts), and then combines both sub-tables:

# Copy data from ds4psy package:
table8 <- ds4psy::table8
dim(table8)  # 3 x 5
#> [1] 3 5

## (2) Split table8 into 2 sub-tables: -------- 

## (a) Tidy counts of cases per year:
t8a <- table8 %>%
  select(country:cases_2000) %>%
  gather(key = "key", value = "cases", cases_1999:cases_2000) %>%
  separate(col = key, into = c("drop", "year")) %>%
  select(country, year, cases) %>%
  arrange(country, year)
t8a
#> # A tibble: 6 × 3
#>   country     year   cases
#>   <chr>       <chr>  <dbl>
#> 1 Afghanistan 1999     745
#> 2 Afghanistan 2000    2666
#> 3 Brazil      1999   37737
#> 4 Brazil      2000   80488
#> 5 China       1999  212258
#> 6 China       2000  213766

## (b) Tidy counts of populations per year: 
t8b <- table8 %>%
  select(country, popu_1999:popu_2000) %>%
  gather(key = "key", value = "population", popu_1999:popu_2000) %>%
  separate(col = key, into = c("drop", "year")) %>%
  select(country, year, population) %>%
  arrange(country, year)
t8b
#> # A tibble: 6 × 3
#>   country     year  population
#>   <chr>       <chr>      <dbl>
#> 1 Afghanistan 1999    19987071
#> 2 Afghanistan 2000    20595360
#> 3 Brazil      1999   172006362
#> 4 Brazil      2000   174504898
#> 5 China       1999  1272915272
#> 6 China       2000  1280428583

## (c) Join both sub-tables: -------- 

# Either by using join commands of dplyr (see the next chapter on "Relational data"): 
table8_tidy_2 <- dplyr::left_join(t8a, t8b)
table8_tidy_2
#> # A tibble: 6 × 4
#>   country     year   cases population
#>   <chr>       <chr>  <dbl>      <dbl>
#> 1 Afghanistan 1999     745   19987071
#> 2 Afghanistan 2000    2666   20595360
#> 3 Brazil      1999   37737  172006362
#> 4 Brazil      2000   80488  174504898
#> 5 China       1999  212258 1272915272
#> 6 China       2000  213766 1280428583

# OR (adding new variables by assignment): 
table8_tidy_2b <- t8a  # copy
table8_tidy_2b$population <- t8b$population  # CAREFUL: Ensure that both tables are in same order (see arrange above)! 
table8_tidy_2b
#> # A tibble: 6 × 4
#>   country     year   cases population
#>   <chr>       <chr>  <dbl>      <dbl>
#> 1 Afghanistan 1999     745   19987071
#> 2 Afghanistan 2000    2666   20595360
#> 3 Brazil      1999   37737  172006362
#> 4 Brazil      2000   80488  174504898
#> 5 China       1999  212258 1272915272
#> 6 China       2000  213766 1280428583

3. Using reshape() (from base R stats)

Using the base R function reshape() (included in the R stats package) also solves our problem (and even allows combining multiple blocks of DVs at once).

However, since reshape() assumes a data frame, rather than a tibble, we will convert it with as.data.frame() and re-convert the result into a tibble with as_tibble() (and add some adjustments to recover the year variable):

# Copy data from ds4psy package:
table8 <- ds4psy::table8 
dim(table8)  # 3 x 5
#> [1] 3 5

# Convert into data frame: 
table8 <- as.data.frame(table8)
dim(table8)
#> [1] 3 5

# See 
# ?reshape # for details

## From wide to long by using reshape: -------- 
table8_long <- stats::reshape(table8, 
                              varying = list(cases = c("cases_1999", "cases_2000"),     # 1st set of variables to combine into 1
                                             population = c("popu_1999", "popu_2000")), # 2nd set of variables to combine into 1
                              direction = "long",
                              v.names = c("cases", "population"),  # to rename combined variables
                              timevar = "time",  # name of time variable 
                              idvar = "nr"       # name of id variable
)
table8_long
#>         country time  cases population nr
#> 1.1 Afghanistan    1    745   19987071  1
#> 2.1      Brazil    1  37737  172006362  2
#> 3.1       China    1 212258 1272915272  3
#> 1.2 Afghanistan    2   2666   20595360  1
#> 2.2      Brazil    2  80488  174504898  2
#> 3.2       China    2 213766 1280428583  3


## Fix some stuff: --------

# Add a year variable (based on the time value):
table8_long$year <- NA  # initialize
table8_long$year[table8_long$time == 1] <- 1999
table8_long$year[table8_long$time == 2] <- 2000

# Select and re-arrange:
table8_tidy_3 <- table8_long %>%
  select(country, year, cases, population) %>%
  arrange(country, year)

# Make year a character (as in table8_tidy above):
table8_tidy_3$year <- as.character(table8_tidy_3$year)

# Convert into tibble:
table8_tidy_3 <- tibble::as_tibble(table8_tidy_3)
table8_tidy_3
#> # A tibble: 6 × 4
#>   country     year   cases population
#>   <chr>       <chr>  <dbl>      <dbl>
#> 1 Afghanistan 1999     745   19987071
#> 2 Afghanistan 2000    2666   20595360
#> 3 Brazil      1999   37737  172006362
#> 4 Brazil      2000   80488  174504898
#> 5 China       1999  212258 1272915272
#> 6 China       2000  213766 1280428583

4. Using the pivot functions from tidyr 1.0.0

Using the pivot_longer() function (available from tidyr version 1.0.0 onwards):

# Copy data from ds4psy package:
table8 <- ds4psy::table8
dim(table8)  # 3 x 5
#> [1] 3 5
# table8

# Using pivot_longer: 
table8_tidy_4 <- table8 %>%
  tidyr::pivot_longer(cols = cases_1999:popu_2000,
                      names_to = c(".value", "year"),
                      names_sep = "_")

# Make year a character (as in table8_tidy above):
table8_tidy_4$year <- as.character(table8_tidy_4$year)

# Rename popu column to population:
table8_tidy_4 <- table8_tidy_4 %>% rename(population = popu)

table8_tidy_4
#> # A tibble: 6 × 4
#>   country     year   cases population
#>   <chr>       <chr>  <dbl>      <dbl>
#> 1 Afghanistan 1999     745   19987071
#> 2 Afghanistan 2000    2666   20595360
#> 3 Brazil      1999   37737  172006362
#> 4 Brazil      2000   80488  174504898
#> 5 China       1999  212258 1272915272
#> 6 China       2000  213766 1280428583

Using the pivot_longer() function with the names_to and names_sep arguments (and the special ".value" sentinel, which results in ignoring this value, and naming the value column from a part of the existing column names) is short, but rather cryptic again.

More explicit control about the transformation is provided by the pivot_longer_spec() function. This option, however, is so new (as of November 2020), that there are few examples of it yet. However, the excellent examples in vignette("pivot") provide useful starting points.

Verify the equality of solutions

To make sure that all four solutions yielded the same result, let’s verify their equality:

## Verify that all 4 of the above solutions are equal: 
all.equal(table8_tidy, table8_tidy_2)   # TRUE
#> [1] TRUE
all.equal(table8_tidy, table8_tidy_2b)  # TRUE
#> [1] TRUE
all.equal(table8_tidy, table8_tidy_3, check.attributes = FALSE)  # TRUE (except for attributes)
#> [1] TRUE
all.equal(table8_tidy, table8_tidy_4)   # TRUE, qed. 
#> [1] TRUE

As we have seen before, R provides many alternative ways to achieve a result. This is great, of course, but also implies that it is absolutely crucial that we know which particular result we want to achieve.

7.2.7 Beware of updates

As the previous sections have shown, the tidyr package (Wickham & Henry, 2020) is still changing and improving, which can be really annoying, when working code suddenly starts failing. Its version 1.0.0 was released in September 2019 with many substantial changes, including new functionality:

  • Pivoting essentially improves on and replaces spread() and gather().
  • Rectangling allows converting deeply nested lists into tidy data frames or tibbles.

To learn more about the corresponding functions, study the extensive vignettes:

# Documentation:
vignette("pivot")
vignette("rectangle") 
vignette("nest")
vignette("in-packages") 

As an additional bonus, tidyr contains many interesting practice datasets, including:

# tidyr datasets:
tidyr::billboard
tidyr::construction
tidyr::relig_income
tidyr::us_rent_income
tidyr::who
tidyr::population

See https://www.tidyverse.org/blog/2019/09/tidyr-1-0-0 for additional details and watch https://www.tidyverse.org/ for further updates.

References

Wickham, H., & Henry, L. (2020). tidyr: Tidy messy data. Retrieved from https://CRAN.R-project.org/package=tidyr

  1. The length and width of a data set are relative terms here: gathering tends to decrease data width (column) by increasing length (number of rows), spreading tends to decrease data length (rows) by increasing width (columns).↩︎

  2. Again, the length and width of data sets are relative terms.↩︎