Chapter 7 Reshaping Data

Data do not always come in the form that we would like them to. For example, when collecting data for subjects in multiple waves, it is typical to enter the data with a single row for each participant. A separate column is then created for each wave of measurement. This is referred to as wide format. Here is an example what this format looks like:

##   ID      age t_1 t_2 t_3
## 1  1 27.40234   1   2   1
## 2  2 22.26648   1   3   3
## 3  3 21.51467   2   1   3
## 4  4 27.07017   3   3   2
## 5  5 26.21972   2   1   2

However, many functions in R, such as the lmer() function used to fit mixed-effects/multilevel models, expect data to be in long format. where measures of the same variable only take up one column, there is an additional column representing time, and observations of each subject are spread across multiple rows. For the example above, this is what the same data look like in long format:

##    ID      age time score
## 1   1 27.40234    1     1
## 2   2 22.26648    1     1
## 3   3 21.51467    1     2
## 4   4 27.07017    1     3
## 5   5 26.21972    1     2
## 6   1 27.40234    2     2
## 7   2 22.26648    2     3
## 8   3 21.51467    2     1
## 9   4 27.07017    2     3
## 10  5 26.21972    2     1
## 11  1 27.40234    3     1
## 12  2 22.26648    3     3
## 13  3 21.51467    3     3
## 14  4 27.07017    3     2
## 15  5 26.21972    3     2

We started with a format where there was a single row per participant and reshaped it into a format with a single row value of score. We thus needed an identifying variable time to tell us which observation event the given value of score comes from.

As with everything else in R, there are numerous ways to do this. We are going to do this task using a dialect of R called “tidyverse” that has become very popular in data science. There is nothing wrong with doing this the “normal” R way but many find tidyverse quite intuitive (though many prefer base R) and it provides a powerful ecosystem of functions for data processing.

7.1 Tidyverse

As the name suggests, tidyverse is a bit of a parallel universe to base R. It is still R and all the functions tidyverse encompasses can be used the “normal” way but they come to their full strength when used with the tidyverse syntax. So yes, tidyverse is still the R language but it’s a dialect within the language that’s somewhat different to base R.

The main difference between base R and tidyverse is that while the former reads “inside-out”, the latter presents a pipeline of operations that flows from left to right.

What do we mean by “inside-out”? Let’s look at this base R code:

# base R - inside out
round(mean(rnorm(100)), 3)

## [1] 0.109

Here, we generated 100 random standard normal numbers, took their mean, and rounded it to 3 decimal places. The first function is the innermost one and the sequence proceed outwards. Here is the same thing written the tidyverse way:

# tidyverse - left to right
rnorm(100) %>% mean() %>% round(3)

## [1] 0.109

Here, the sequence goes from left to right. The pipe operator %>% takes the output of whatever is to the left of it and pushes it to the function to the right of it as its first argument. rnorm(100) %>% mean() is thus the same as mean(rnorm(100)). So we first took the output of rnorm(100) and pushed it to the mean() function as its first argument. Then took the output of this step and pushed it to the round() function, again as its first argument, and, as we wanted to round to 3 decimal places, we specified the second argument of round() to be 3.

Pipes are quite useful when your manipulation of a single object involves a sequence of several steps where none of the intermediate steps is particularly useful to you. They make the code legible in a way that many find more intuitive than the inside-out way base R works.

Apart from this principle, tidyverse includes several packages that were written with a uniform approach to data processing and visualisation in mind. Some of these packages are ggplot2, tidyr, dplyr, tibble. The latter introduces an additional data structure called tibble that is a kind of data.frame+ and the former presents an alternative approach to graphics in R.

You can install the entire suite as a single package, aptly named "tidyverse" using the install.packages("tidyverse") command. Just remember you also need to load the package if you want to use the functions therein! (for more information, see the tidyverse website)

7.2 Wide to long

With the intro out of the way, how do we use tidyverse to reshape our data from a wide format to long? The gather() and separate() functions will come in handy here. Remember to load tidyverse before you try to use them!

7.2.1 `gather()`

The gather() function takes several columns and creates a single column out of them. It also creates an identifier variable that includes as its values the names of the original columns. All we need to provide is data, the name of the identifier (AKA key) variable, the name of the resulting single column, and the columns we want to merge into one. If the columns to be merged are adjacent in the original data, the latter argument can be a range of columns, e.g., col1:col8.

To use the example at the very top, we want to create a single column out of the values in t_1, t_2, and t_3 called score and a key variable called time. If the data frame in question is called df, then:

df %>% gather(key = time, value = score, t_1:t_3)

##    ID      age time score
## 1   1 27.40234  t_1     1
## 2   2 22.26648  t_1     1
## 3   3 21.51467  t_1     2
## 4   4 27.07017  t_1     3
## 5   5 26.21972  t_1     2
## 6   1 27.40234  t_2     2
## 7   2 22.26648  t_2     3
## 8   3 21.51467  t_2     1
## 9   4 27.07017  t_2     3
## 10  5 26.21972  t_2     1
## 11  1 27.40234  t_3     1
## 12  2 22.26648  t_3     3
## 13  3 21.51467  t_3     3
## 14  4 27.07017  t_3     2
## 15  5 26.21972  t_3     2

Bear in mind that the resulting data frame will not be saved in the environment unless you don’t assign the entire line of code to some name using <-!

You may have noticed that we didn’t put the variable names into quotes. This is another feature of tidyverse functions – they do not use quotes for names of columns.

7.2.2 `separate()`

Making progress, but we don’t really want to have “t_” and the number pasted together in this way. Let’s get rid of the prefix using tidyverse’s separate() function. It takes the following arguments:

data: data frame you want to do this separation to
col: column you want to do the separation to
into: character vector of the new column names you want to create
sep: character or regex (regular expression) pattern you want to use to split up the values in col

Let’s see how it works:

df %>%
  gather(time, score, t_1:t_3) %>%
  separate(time, c("x", "time"), "_")

##    ID      age x time score
## 1   1 27.40234 t    1     1
## 2   2 22.26648 t    1     1
## 3   3 21.51467 t    1     2
## 4   4 27.07017 t    1     3
## 5   5 26.21972 t    1     2
## 6   1 27.40234 t    2     2
## 7   2 22.26648 t    2     3
## 8   3 21.51467 t    2     1
## 9   4 27.07017 t    2     3
## 10  5 26.21972 t    2     1
## 11  1 27.40234 t    3     1
## 12  2 22.26648 t    3     3
## 13  3 21.51467 t    3     3
## 14  4 27.07017 t    3     2
## 15  5 26.21972 t    3     2

Notice that we can just extend our command pipeline by tacking the separate() function at the end of the line using the %>% pipe.

7.2.3 `select()`

Good but now we have a redundant column x. It can, however, be eliminated in any number of ways. The tidyverse way would be to use the select() function, like this:

df_long <- df %>%
  gather(time, score, t_1:t_3) %>%
  separate(time, c("x", "time"), "_") %>%
  select(-x)

The - sign tells R that we want to drop this column.

A slightly more straightforward way of achieving the same result would be using tidyverse’s mutate() function to modify the time column like this:

df %>%
  gather(time, score, t_1:t_3) %>%
  mutate(time = as.numeric(gsub("t_", "", time)))

The reason we did it the roundabout way was to introduce the select() and separate() functions. The latter is especially handy in situations when a column represents a combination of two variables. Let’s say my experimental design has 2 conditions (A and B) and 2 measurement events for each participant. The data might look like this:

#let's create some data
my_data <- data.frame(id = 1:5, age = rnorm(5, 25, 3),
                 A_1 = round(rnorm(5, 7, 1)),
                 A_2 = round(rnorm(5, 8, 1.2)),
                 B_1 = round(rnorm(5, 7, .9)),
                 B_2 = round(rnorm(5, 6, 1.7))
)

my_data

##   id      age A_1 A_2 B_1 B_2
## 1  1 23.71973   9   8   8   7
## 2  2 24.32916   7   8   7   6
## 3  3 27.15282   7   6   6   6
## 4  4 27.52434   8   9   6   7
## 5  5 24.61493   6   7   6   6

In this situation, separate() is very useful:

my_data_long <- my_data %>%
  gather(cond_measure, score, A_1:B_2) %>%
  separate(cond_measure, c("condition", "measure"), "_", convert = T)
my_data_long

##    id      age condition measure score
## 1   1 23.71973         A       1     9
## 2   2 24.32916         A       1     7
## 3   3 27.15282         A       1     7
## 4   4 27.52434         A       1     8
## 5   5 24.61493         A       1     6
## 6   1 23.71973         A       2     8
## 7   2 24.32916         A       2     8
## 8   3 27.15282         A       2     6
## 9   4 27.52434         A       2     9
## 10  5 24.61493         A       2     7
## 11  1 23.71973         B       1     8
## 12  2 24.32916         B       1     7
## 13  3 27.15282         B       1     6
## 14  4 27.52434         B       1     6
## 15  5 24.61493         B       1     6
## 16  1 23.71973         B       2     7
## 17  2 24.32916         B       2     6
## 18  3 27.15282         B       2     6
## 19  4 27.52434         B       2     7
## 20  5 24.61493         B       2     6

7.3 Descriptives in tidyverse

Another neat feature of tidyverse is its arsenal of functions for data description and summary statistics by groups. The logic is pretty straightforward and it chimes with the overarching philosophy of tidyverse pipelines. Two functions in particular are very useful and work very well together: group_by() %>% summarise().

group_by is pretty self-explanatory – it groups the data by levels of a given variable or combinations of variables. On its own it doesn’t do anything else and its action is not even visible but using it is the first step to getting descriptive statistics by factor(s).

summarise() should follow group_by in the pipeline. It is used to create summary data frame (well, actually a tibble) with specified variables. For instance, if we want get the mean score per each time point from our long-format df from the beginning of this lab sheet, we would just push it into %>% group_by(time) %>% summarise(mean_score = mean(score)). We can even create several summary variables, each with its own statistic. We don’t even need df to be in the long format. Instead, we can just extend our pipeline from earlier on like this:

df %>%
  gather(time, score, t_1:t_3) %>% 
  separate(time, c("x", "time"), "_") %>% 
  select(-x) %>% 
  group_by(time) %>%
  summarise(n = n(),
            mean_score = mean(score))

## # A tibble: 3 x 3
##   time      n mean_score
##   <chr> <int>      <dbl>
## 1 1         5        1.8
## 2 2         5        2  
## 3 3         5        2.2

Pretty neat, huh?

As mentioned above, the output is a tibble (augmented data frame) and can be kept for later by storing it in an R object. Objects like the one you’ve just created can be used for creating pretty fancy plots.