7.2 Essential tidyr commands

The tidyr package (Wickham & Henry, 2020) provides commands to create and transform tidy data. Although the package provides many functions, we consider 2 pairs of 2 complementary tidyr commands as essential:

  1. separate() splits 1 variable into 2 variables;
  2. unite() combines 2 variables into 1 variable;
  3. gather() or pivot_longer() make wide data longer (by gathering many variables into 1);
  4. spread() or pivot_wider() make long data wider (by spreading 1 variable into many).

The first 2 commands allow splitting or combining variables: separate is the complement to or inverse/opposite of unite. The second 2 pairs of commands allow changing the data layout by making a given table longer or wider: spread() and pivot_wider() are the complements to or inverses/opposites of gather() and pivot_longer().

In the following sections, we explain how to use these 4 essential commands.

7.2.1 separate 1 variable into 2

The separate() function splits 1 variable (column) into multiple variables (columns) — at a position where some separator character appears — and is the complement to unite. Using separate() requires the following arguments:

  • some tibble/data frame data;
  • the variable (column) col to be separated (specified by its name or column number);
  • the names of the new variables (columns) into which col is to be split (specified as a character vector);
  • the separator character sep (as a character/regular expression).

An additional argument remove regulates whether the original columns are dropped from the output tibble. By default, remove = TRUE.

Example

In tidyr::table3 (a 6 x 3 tibble), the column rate contains 2 numbers, separated by a forward slash /:

We can use separate() to split the rate variable at the / character into 2 variables (cases and population):

The output is a 6 x 4 tibble that contains separate columns for cases and population. Note that both the original rate column and the / character disappeared from the output tibble.

The following commands are shorter variants that achieve the same result:

These examples shows that the argument names (data, col, and into) can be omitted (but still require appropriate arguments in the correct order) and sep can be left unspecified when tidyr can make a good guess what the separator character might be. Also, the variable to be separated can be specified by its numeric index (column number).

If we wanted to keep the rate variable, we could call:

Finally, since the rate variable of table3 was of character type, all the tables so far contained variables for cases and population that also were of type character. To get these variables as numbers, we could either explicitly convert them (e.g., by applying the as.integer() function to these columns) or let separate() do this job by providing the convert = TRUE argument:

Practice

Consider table6 of the ds4psy package (also available online as a csv-file at http://rpository.com/ds4psy/data/table6.csv):

  • Inspect table6 and describe what difficulty may occur when trying to apply separate() to it.

In table6, the variable when_what contains several plausible separation characters: _, ., and /.

  • What happens when we apply separate to when_what without providing a separating character sep?

The variable when_what is split at the 1st plausible option (_), but a warning is issued and some data is lost from the output table!

  • Split the variable when_what in 3 different ways (by providing different sep arguments).

Note that using the point or period (.) as a splitting character sep = "." would not work. Instead, we need to use the corresponding regular expression sep = "\\.". (See Chapter 14: Strings for details.)

None of the solutions so far is satisfactory. This shows that we may need to use a sequence of separate commands if a table contains condensed columns as table6.

  • Split the when_what variable of table6 3 times to create a tibble table6a that contains 5 variables (columns) and reasonable variable names.

Here is a solution that takes three steps (separating 1 or 2 variables at a time):

and here is a solution that achieves the same result in a single step (separating 4 variables at once):

Note that the resulting 6 x 5 tibble (here: table6a) is similar to table3 and table5 (if we applied separate to their rate variable). Also, we may want to convert some variables from their current character format into numbers before analyzing the data further.

7.2.2 unite 2 variables into 1

unite() is the complement to (or opposite of) separate(): unite() combines 2 variables (columns) into 1 variable (column) and allows adding an optional separator character. Using unite() requires the following arguments:

  • some tibble/data frame data;
  • the name of the new compound variable (column) col (specified as a character);
  • the names of the variables (columns) to be combined (specified by their names or column numbers);
  • an optional separator character sep (as a character/regular expression).

An additional argument remove determines whether the original columns are dropped from the output tibble. By default, remove = TRUE.

Example

In tidyr::table5 (a 6 x 4 tibble), the 2 columns century and year contains 2 numbers (represented as characters) that we may want to combine into 1 variable:

The full unite() command for combining century and year into a variable yr is as follows:

Note that we needed to explicitly specify sep = "" (i.e., no separating character) to prevent unite() from adding a default character (_). Also, the century and year variables disappeared from the 6 x 3 output tibble.

As before, we can use various shorter variants of the same command:

Finally, here are some variants that illustrate the effects of using different sep and remove arguments:

Practice

  1. To turn tidyr::table5 into tidy data, it may be indicated to apply both unite() and separate() to it. Can you achieve this in one pipe?
  1. Take the data from dplyr::storms and unite the variables year, month, day into 1 variable date?
#> # A tibble: 6 x 2
#>   name  date     
#>   <chr> <chr>    
#> 1 Amy   1975/6/27
#> 2 Amy   1975/6/27
#> 3 Amy   1975/6/27
#> 4 Amy   1975/6/27
#> 5 Amy   1975/6/28
#> 6 Amy   1975/6/28
  1. Consider table7 of the ds4psy package (also available online as a csv-file at http://rpository.com/ds4psy/data/table7.csv):
#> [1] TRUE
#> # A tibble: 6 x 1
#>   where_when_what                   
#>   <chr>                             
#> 1 "Afghanistan@19:99$745\\19987071" 
#> 2 "Afghanistan@20:00$2666\\20595360"
#> 3 "Brazil@19:99$37737\\172006362"   
#> 4 "Brazil@20:00$80488\\174504898"   
#> 5 "China@19:99$212258\\1272915272"  
#> 6 "China@20:00$213766\\1280428583"

Then transform it into the following ways:

  1. Use multiple (4) separate() commands to split table7 into a tibble table7a with multiple (5) columns.

  2. Use multiple (4) unite() commands on table7a to re-create a tibble table7b that contains all data in 1 column.

Possible solutions for table7a and table7b:

  1. Separating table7 into table7a:
#> # A tibble: 6 x 5
#>   country     century year  rate   population
#>   <chr>       <chr>   <chr> <chr>  <chr>     
#> 1 Afghanistan 19      99    745    19987071  
#> 2 Afghanistan 20      00    2666   20595360  
#> 3 Brazil      19      99    37737  172006362 
#> 4 Brazil      20      00    80488  174504898 
#> 5 China       19      99    212258 1272915272
#> 6 China       20      00    213766 1280428583
  1. Re-uniting table7a into table7b:
#> # A tibble: 6 x 1
#>   where_when_what               
#>   <chr>                         
#> 1 Afghanistan:1999_745/19987071 
#> 2 Afghanistan:2000_2666/20595360
#> 3 Brazil:1999_37737/172006362   
#> 4 Brazil:2000_80488/174504898   
#> 5 China:1999_212258/1272915272  
#> 6 China:2000_213766/1280428583

7.2.3 Wide vs. long format

Although separate() and unite() change the number of columns in a data table, these commands do not change the basic layout of the table. By contrast, gather() and its complement spread() do change the overall layout by either collecting (or “gathering”) multiple variables into 1, or distributing (or “spreading”) 1 variable across multiple variables. A clear indication that the layout of a table has changed is that its number of cases or observations (i.e., rows) has changed.

To illustrate what this means, re-consider the example of table4a:

Table 7.8: table4a from the tidyr package.
country 1999 2000
Afghanistan 745 2666
Brazil 37737 80488
China 212258 213766

We noted above (in Sections 7.1.2 and 7.1.3) that this appears to be a natural representation of the TB cases in 3 countries over 2 years. However, we also noted that table4a does not qualify as tidy data (as the year variable is spread over 2 variables) and impractical in many contexts (e.g., it does not allow using a year variable in a ggplot() call). To remedy both shortcomings, we would need a re-formatted table like the following:

Table 7.9: A revised, longer version of table4a.
country year cases
Afghanistan 1999 745
Brazil 1999 37737
China 1999 212258
Afghanistan 2000 2666
Brazil 2000 80488
China 2000 213766

Both tables obviously contain the same information, so how do they differ? The difference between both tables is best characterized by describing the layout of the year variable’s values:

  • The 1st table (table4a) is in a so-called wider format. Different TB cases per year are represented in different columns (einstances of year.

  • The 2nd table represents the same data in a so-called longer format. The count of TB cases are stored in a dedicated variable and the different columns of table4a are distinguished by a separate year variable. (The year variable is sometimes called a key, as it qualifies the values of the cases variable.)

Note that the terms “wide” and “long” only make sense in relative terms: A 6 x 3 table really is not very long, but still longer than (or twice as long as) a 3 x 3 table that contains the same information. And the 3 x 3 table is actually not wider than the 6 x 3 table (as the latter requires an additional year variable to qualify the cases). But as the values of cases are distributed over 2 columns in table4a, we would still consider it to be in a wider format relative to the longer 6 x 3 table.

In psychology, we often represent data collected over multiple measurements (e.g., several tests of one person or the same test administered at different times) in wide format. Many software packages (like SPSS) require data to be in long format for certain analyses (e.g., MANOVA), but in wide format for others (like multiple regression). As we have seen, R and ggplot2 frequently require data to be in long format. Thus, learning to tranform data from wide to long (and vice versa) is an essential data science skill. The tidyr package provides the gather() and spread() commands to achieve this transformation.

7.2.4 gather or pivot_longer make wide data longer

Gathering is the opposite of spreading (to be discussed in Section 7.2.5 below) and used when observations that are distributed over multiple columns should be contained in 1 variable (column). More specifically, gather moves the values of several variables (columns) into 1 column value and describes this value by the value of a new key variable. When gathering more than 2 variables, this reduces the number of columns by increasing the number of rows (i.e., makes a wide data set longer).31

Using gather() requires the following arguments:

  • data is a data frame or tibble;
  • key is the name of the variable that describes the values of the gathered columns (or name of the independent variable);
  • value is the name of the variable that is contained in the gathered columns (or the name of the dependent variable);
  • ... (or var_x:var_y) is a selection (or block) of variables (columns) to be gathered.

Example

We illustrate the gather() command with table4a:

As the counts of cases are distributed over 2 variables (columns) for each country, table4a is in wide format. Here is how we can gather() the 2 columns 1999 and 2000 into a count of cases and qualify them by a new year variable:

As with other tidyr commands, there are many variants of gather() that yield the same result. For instance, we can omit the argument names, use the pipe operator to simplify the command, or provide the variables to be gathered in numeric format:

Note that the key variable year is of type character in the above example. If we wanted our key variable to be converted into a number (here: an integer), we can add the optional argument convert = TRUE:

Since version 1.0.0 of tidyr (released in September 2019), we can replace gather() by the pivot_longer() function. Its documentation notes the following:

pivot_longer() is an updated approach to gather(), designed to be both simpler to use and to handle more use cases. We recomend you use pivot_longer() for new code; gather() isn’t going away but is no longer under active development.” (Wickham & Henry, 2020)

Whereas, the key and value arguments of gather() tend to be confusing, the corresponding names_to and values_to arguments of pivot_longer() are more intuitive. Thus, using the pivot_longer() function to make table4a longer is simple and straightforward:

Overall, if you are new to tidyr and only want to study one command, learning pivot_longer() is a better investment.

Practice

  • Save the following data as a tibble de and then turn it into a tidy tibble de_2 (by using either gather() or pivot_longer() to create a dependent variable share and listing the election year as an additional variable).

Hint: First convert the table into a longer format and then use separate() and select() to obtain appropriate variables.

Table 7.10: Election results.
party share_2013 share_2017
CDU/CSU 0.415 0.330
SPD 0.257 0.205
Others 0.328 0.465

7.2.5 spread or pivot_wider make long data wider

Spreading is the opposite of gathering (see Section 7.2.4 above) and used when an observation that should be in 1 row is distributed over multiple rows (in 1 column). More specifically, spread puts the values of several cases (rows) into different variables (columns) of 1 row. When spreading more than 2 rows per case, this decreases the number of rows by increasing the number of columns (i.e., makes a long data set wider).32

Using spread requires the following arguments:

  • data is a data frame or tibble;
  • key is the name of the variable that describes the values of the gathered columns (or the names of the independent variables which become the names of the new columns);
  • value is the name of the variable whose values should be spread over multiple columns (or the name of the dependent variable).

Note that we do not need to specify a range of new columns. The number and the names of new columns is determined by the number of different values in the key variable.

Example

We illustrate the spread() function with table2:

Table 7.11: table2 from the tidyr package.
country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Brazil 2000 cases 80488
Brazil 2000 population 174504898
China 1999 cases 212258
China 1999 population 1272915272
China 2000 cases 213766
China 2000 population 1280428583

Note that the variable count actually contains 2 different DVs, which are further qualified by the values of the key variable type.

Here is how we can spread the values of the count variable into several columns:

The resulting 6 x 4 tibble contains 2 new columns cases and population.

Here are some variants of the same spread command:

If we want to make it clear that the new variables (i.e., the columns cases and population) were created by spreading a common key variable (type), we could specify an additional sep argument that inserts a character between the key and its value to create the new variable name:

As we noted in Section 7.2.4, version 1.0.0 of tidyr (released in September 2019) offers new pivoting functions, that are simpler and more intuitive to use. The spread() function corresponds to the pivot_wider() function:

Whereas, the key and value arguments of gather() tend to be confusing, the corresponding names_to and values_to arguments of pivot_longer() are more intuitive. Thus, using the pivot_longer() function to make table4a longer is straightforward:

Again, learning pivot_wider() is recommended over spread(), as it is simpler and more powerful.

Practice

  • Take the tidy 6 x 3 tibble de_2 (from above) and use spread() or pivot_wider() to create a 3 x 3 tibble de_3 that re-creates the original tibble de_1 from it.

Solution

  • Using spread():
  • Using pivot_wider():

7.2.6 Multiple DVs

Before getting too excited about the powers of gather and spread we should mention an important limitation: Both commands are designed to gather and spread the values of a single dependent variable. In psychology, however, we are frequently dealing with multiple dependent variables. Fortunately, we can use a sequence of tidyr commands to deal with this case as well — or use the reshape() function (from the stats package included in R) as an alternative.

Example

Yet another way of representing the data contained in table1 (and its variations in table2 to table7) is provided by table8 of the ds4psy package (also available online as a csv-file at http://rpository.com/ds4psy/data/table8.csv):

Table 7.12: table8: A version of table1 in a wider format.
country cases_1999 cases_2000 popu_1999 popu_2000
Afghanistan 745 2666 19987071 20595360
Brazil 37737 80488 172006362 174504898
China 212258 213766 1272915272 1280428583

This 3 x 5 table is a version of table1 that seems quite natural to many social scientists:

  • The data for 3 countries are listed as rows (observations), with a variable country denoting the country name.

  • The 4 measurements – 2 levels (1999 and 2000) on each of 2 variables (cases and popu) – are listed as different variables (columns).

As this format succinctly expresses the repeated measurement of a variable (e.g., cases) on different years for each instance of country, some statistical software packages (like SPSS) require this format for some analyses (e.g., MANOVA). In R, we typically don’t want nested formats that distribute instances of the same variable (e.g., cases) over multiple columns (cases_1999 vs. cases_2000). In the following, we show 3 alternative ways of tidying table8.

2. Using 2 tables

An alternative solution first splits table8 into 2 tidy sub-tables (one for cases and one for population counts), and then combines both sub-tables:

# Copy data from ds4psy package:
table8 <- ds4psy::table8
dim(table8)  # 3 x 5
#> [1] 3 5

## (2) Split table8 into 2 sub-tables: -------- 

## (a) Tidy counts of cases per year:
t8a <- table8 %>%
  select(country:cases_2000) %>%
  gather(key = "key", value = "cases", cases_1999:cases_2000) %>%
  separate(col = key, into = c("drop", "year")) %>%
  select(country, year, cases) %>%
  arrange(country, year)
t8a
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <dbl>
#> 1 Afghanistan 1999     745
#> 2 Afghanistan 2000    2666
#> 3 Brazil      1999   37737
#> 4 Brazil      2000   80488
#> 5 China       1999  212258
#> 6 China       2000  213766

## (b) Tidy counts of populations per year: 
t8b <- table8 %>%
  select(country, popu_1999:popu_2000) %>%
  gather(key = "key", value = "population", popu_1999:popu_2000) %>%
  separate(col = key, into = c("drop", "year")) %>%
  select(country, year, population) %>%
  arrange(country, year)
t8b
#> # A tibble: 6 x 3
#>   country     year  population
#>   <chr>       <chr>      <dbl>
#> 1 Afghanistan 1999    19987071
#> 2 Afghanistan 2000    20595360
#> 3 Brazil      1999   172006362
#> 4 Brazil      2000   174504898
#> 5 China       1999  1272915272
#> 6 China       2000  1280428583

## (c) Join both sub-tables: -------- 

# Either by using join commands of dplyr (see the next chapter on "Relational data"): 
table8_tidy_2 <- dplyr::left_join(t8a, t8b)
table8_tidy_2
#> # A tibble: 6 x 4
#>   country     year   cases population
#>   <chr>       <chr>  <dbl>      <dbl>
#> 1 Afghanistan 1999     745   19987071
#> 2 Afghanistan 2000    2666   20595360
#> 3 Brazil      1999   37737  172006362
#> 4 Brazil      2000   80488  174504898
#> 5 China       1999  212258 1272915272
#> 6 China       2000  213766 1280428583

# OR (adding new variables by assignment): 
table8_tidy_2b <- t8a  # copy
table8_tidy_2b$population <- t8b$population  # CAREFUL: Ensure that both tables are in same order (see arrange above)! 
table8_tidy_2b
#> # A tibble: 6 x 4
#>   country     year   cases population
#>   <chr>       <chr>  <dbl>      <dbl>
#> 1 Afghanistan 1999     745   19987071
#> 2 Afghanistan 2000    2666   20595360
#> 3 Brazil      1999   37737  172006362
#> 4 Brazil      2000   80488  174504898
#> 5 China       1999  212258 1272915272
#> 6 China       2000  213766 1280428583

3. Using reshape() (from base R stats)

Using the base R function reshape() (included in the R stats package) also solves our problem (and even allows combining multiple blocks of DVs at once).

However, since reshape() assumes a data frame, rather than a tibble, we will convert it with as.data.frame() and re-convert the result into a tibble with as_tibble() (and add some adjustments to recover the year variable):

# Copy data from ds4psy package:
table8 <- ds4psy::table8 
dim(table8)  # 3 x 5
#> [1] 3 5

# Convert into data frame: 
table8 <- as.data.frame(table8)
dim(table8)
#> [1] 3 5

# See 
# ?reshape # for details

## From wide to long by using reshape: -------- 
table8_long <- stats::reshape(table8, 
                              varying = list(cases = c("cases_1999", "cases_2000"),     # 1st set of variables to combine into 1
                                             population = c("popu_1999", "popu_2000")), # 2nd set of variables to combine into 1
                              direction = "long",
                              v.names = c("cases", "population"),  # to rename combined variables
                              timevar = "time",  # name of time variable 
                              idvar = "nr"       # name of id variable
)
table8_long
#>         country time  cases population nr
#> 1.1 Afghanistan    1    745   19987071  1
#> 2.1      Brazil    1  37737  172006362  2
#> 3.1       China    1 212258 1272915272  3
#> 1.2 Afghanistan    2   2666   20595360  1
#> 2.2      Brazil    2  80488  174504898  2
#> 3.2       China    2 213766 1280428583  3


## Fix some stuff: --------

# Add a year variable (based on the time value):
table8_long$year <- NA  # initialize
table8_long$year[table8_long$time == 1] <- 1999
table8_long$year[table8_long$time == 2] <- 2000

# Select and re-arrange:
table8_tidy_3 <- table8_long %>%
  select(country, year, cases, population) %>%
  arrange(country, year)

# Make year a character (as in table8_tidy above):
table8_tidy_3$year <- as.character(table8_tidy_3$year)

# Convert into tibble:
table8_tidy_3 <- tibble::as_tibble(table8_tidy_3)
table8_tidy_3
#> # A tibble: 6 x 4
#>   country     year   cases population
#>   <chr>       <chr>  <dbl>      <dbl>
#> 1 Afghanistan 1999     745   19987071
#> 2 Afghanistan 2000    2666   20595360
#> 3 Brazil      1999   37737  172006362
#> 4 Brazil      2000   80488  174504898
#> 5 China       1999  212258 1272915272
#> 6 China       2000  213766 1280428583

4. Using pivot_ functions from tidyr 1.0.0

Using pivot_longer() function (from tidyr version 1.0.0, released in September 2019):

Using the pivot_longer() function with the names_to and names_sep arguments (and the special ".value" sentinel, which results in ignoring this value, and naming the value column from a part of the existing column names) is short, but rather cryptic again.

More explicit control about the transformation is provided by the pivot_longer_spec() function. This option, however, is so new (as of November 2019), that there are few examples of it yet, but vignette("pivot") is a useful starting point.

Verify the equality of solutions

To make sure that all 4 solutions yielded the same result, let’s verify their equality:

As always, R provides many alternative ways to achieve a result. This is great, of course, but also implies that it is absolutely crucial that we know which particular result we want to achieve.

7.2.7 Beware of updates

As the previous sections have shown, the tidyr package (Wickham & Henry, 2020) is still changing and improving, which can be really annoying, when working code suddenly starts failing. Its version 1.0.0 was released in September 2019 with many substantial changes, including new functionality:

  • Pivoting essentially improves on and replaces spread() and gather().
  • Rectangling allows converting deeply nested lists into tidy data frames or tibbles.

To learn more about the corresponding functions, study the extensive vignettes:

As an additional bonus, tidyr contains many interesting practice datasets, including:

See https://www.tidyverse.org/blog/2019/09/tidyr-1-0-0 for additional details and watch https://www.tidyverse.org/ for further updates.

References

Wickham, H., & Henry, L. (2020). tidyr: Tidy messy data. Retrieved from https://CRAN.R-project.org/package=tidyr


  1. The length and width of a data set are relative terms here: gathering tends to decrease data width (column) by increasing length (number of rows), spreading tends to decrease data length (rows) by increasing width (columns).

  2. Again, the length and width of data sets are relative terms.