13 Reshaping/ restructuring repeated measures/ clustered datasets

Data Wrangling Recipes in R: Hilary Watt

“Fabulous… practical guidance. Makes the hard work of restructuring and preparing data significantly easier … I haven’t seen this focused content presented in one place in such an easy to access form before.” Dr Thomas Woodcock, senior research fellow, School of Public Health.

13.1 Long and wide data formats and basic commands

Both formats are really helpful. For any given analysis/ table/ graph, you will prefer/ require a specific format. For an analysis project with clustered data, you will probably require both formats. Switching between the two is relatively straight-forward when you know how. It takes some effort initially to figure it out.

One key purpose of this introductory section is to illustrate what the command can do. The other key purpose is to illustrate the command, with a simple example. If your example isn’t so simple, you need to pay attention to later sections.

These require library(tidyverse) – more specifically library(tidyr).

People who don’t know these commands can take hours/ days/ weeks doing tasks which can be done quickly. Without some prior knowledge/ knowing command names, googling can be challenging. These commands are useful when you have repeated measurements on the same individuals, or perhaps different measurements taken on different members of the household, or other types of cluster. They will move between the two formats illustrated here:

Here we have cat litters, with kittens weights by birth order (for 1st, 2nd, 3rd, 4th born kittens).

Long format of data (several lines per cluster – here cluster = cat litter):

print(kits)

Restructuring from long to wider format:
The following command says: take data-frame kits –save result to kitten data-frame/ tibble

  • takes data from variable named kit_no to become (part of) variable names in new wider version of dataset
  • prefix the variables names by kit
  • Take values for new variable kit from values in weight variable
# take dataframe kits and pivot wider
kitten <- pivot_wider(kits, # name data-frame to be restructured - save result to kitten
            names_from = kit_no, # take data from kitno to become part of new variable names
            names_prefix = "kit", # new column names start with kit
            values_from = weight) # data from weight goes into newly created columns
# The names_prefix option is required since "kit_no" does not contain valid 
# variable names

This data is now in wide format (one version of variable(s) for each member of the cluster): Now, we have the same data, but with one line of data per litter/ cat family, with 4 weight of kitten variables, representing weight of kitten1, kitten2, kitten3, kitten4 (if any). The cat’s mum’s name is also given.

Wide format

print(kitten)

Restructuring from wide to longer format:
Now, we are translating kitten back to long format (& saving to data-frame/ tibblekits). The variables starting with kit will be restructured in this way. The names of thesekit variables (excluding theirkit prefix, leaving numbers 1, 2, 3, 4) become data in the variable namedkit_no. The values/ data from within thesekit variables ends up in variable namedweight.

The last optional line dropped any rows where weight=NA.

# restructure "kittens" to longer version, save as "kits"
kits <-   pivot_longer(
    kittens, # name the data-frame to be restructured to long format
    cols = starts_with("kit"), # name the cols to be restructured
    names_to = "kit_no",     # data from variable names starting kit ends up here 
                        # (this then contains numbers on end of names, stripping out prefix below)
    names_prefix = "kit", # gives prefix for var names to be restructured 
                         # this implies "kit" part of variable names is not included in "kit_no" data
    values_to = "weight", # name the variable that will be created to contain 
                          # data from above named columns
        values_drop_na = TRUE # OPTIONALLY drop new rows with NA for newly created 
                          # weight variable 
  )

Here is the long format again:

print(kits)

It is possible to restructure several variables to long format at the same time; an example of this is shown in the next section. Similarly, restructuring wide can cope with 2 or more variables to restructure.

The following version of pivot_wider shows just how easy it is to extend the syntax to cope with restructuring several variables simultaneously (all those named invalues_from option).

head(phq3)
phq4 <- pivot_wider(phq3, # name dataframe to be restructured
                    values_from=c("phq", "time_since_baseline",  # name set of cols to be restructured
                                  "phq_change_from_baseline",
                                  "phq_change_from_previous_time",
                                  "time_from_previous", "datev"), 
                    names_from="nnn") # numbers from this column become latter part of variable names
head(phq4)

However, pivot_longer is slightly more complex when more than one set of columns is restructured, as shown in the next section.

13.2 Requirements for pivot_longer

We need a cluster ID variable (perhaps family ID/ region ID/ person ID for repeated measures data). This needs to be unique, so different on each row (with data in wide format).

The above format of the command requires variable names which all start the same but have a different ending (usually different numbers). In the above example, the names were kit1, kit2, kit3, kit4.

When we have 2 or more sets of variables to restructure, we need a way to separate these variable names (for these sets of variables) into a component that become variable names in the “long” version of the dataset, and a component that forms data in a new column. The following form of the command works, for variables named birthweight1, birthweight2, birthweight3, birthweight4 and then name1, name2, name3, name4, name5. Any other set of variables could be added, with variable names of the forms ###1, ###2, ###3, ###4, ###5 where ### is one or more letters.

print(pups, font_size = 7)

The syntax names the columns to be restructured, with the cols=starts_with option: cols=starts_with( c(“birthweight”, “name”) )

The above names 2 sets of variables to restructure. More names can be added to restructure more variables.

Whilst the names_to & names_sep are both required, it should be sufficient to copy them from that shown below, provided each set of variable names to restructure is of the form ###1, ###2, ###3, ###4,… where ### is any letter(s).

# Restructuring to long format with 2 groups of variables to restructure
# where variable names have format "name_stub" then number (where number
# denotes member of group)
pups2 <- pivot_longer(
    pups,
    cols=starts_with(c("birthweight", "name")), # define variables to be restructured
    names_to = c(".value", "pupno"), # 2nd element of var names become contents of var "pupno"
    #".value" above implies that column names comes from info in first part of var names as defined below:
           names_sep = "(?<=[A-Za-z])(?=[0-9])" # separate var names at first point a letter is followed by a number
    # names_sep is applied to variable names from cols= element
    # because we specified ".value" in names_to command we don't need to specify "values_to"
)

print(pups2)

For instance, with an additional set of columns named wt1, wt2, wt3, wt4, wt5 to restructure, the syntax is identical, except for the addition of wt in the cols=starts_with option. Therefore, with names of this format, it is easy to extend the syntax to cope with very many sets of variables to restructure.

print(pups, font_size=5)
# The command is the same as the previous block, except all columns that also start "wt" are transformed from wide to long format
pups2 <- pivot_longer(
    pups,
    cols=starts_with(c("birthweight", "name", "wt")),
    names_to = c(".value", "pupno"),
    names_sep = "(?<=[A-Za-z])(?=[0-9])" )

print(pups2)

Just in case it is useful, this is an example of output that is not fully restructured, so 5 sets of the wt variable remain. This is unlikely to be useful:

# The command is the same as the previous block, except all columns that also start "wt" are transformed from wide to long format
pups2 <- pivot_longer(
    pups,
    cols=starts_with(c("birthweight", "name")),
    names_to = c(".value", "pupno"),
    names_sep = "(?<=[A-Za-z])(?=[0-9])" )

print(pups2)

13.3 Requirements for pivot_wider

  1. You need a variable which uniquely defines the clusters, which could be family, region, hospital or (for repeated measures data) personid. The above example had kittens clustered within cat family/ litter. Either of the variables catlitterno or catmum_name or both could be selected, although catlitterno is the obvious choice (as used above).

  2. We need another variable (herekitno), which counts people/ items/ repeat visits within clusters. It usually takes usually a small number of integer values, although we could theoretically have 50 items within one cluster (requiring items numbers 1, 2, 3….50). The number of separate values it takes (1, 2, 3, 4 = 4 values in the kittens example above) corresponds to the number of new variables created (per variable to be restructured). Because there is one restructured variable (weight) in the kitten example, there are 4 new variables created (one for each value of kitno, called kit1, kit2, kit3, kit4). You might need to create this – see next section.

  3. All variables that vary within cluster (catlitterno here; patientid for repeated measures data) need to be either:

    • Dropped prior to pivot_wider
    • OR specified in the pivot_wider command

EXAMPLE OF pivot_wider NOT WORKING violating 3. above. Resulting dataset is wider but is just as long. This is rarely of any use. For example, if the above kit dataset contained values for the length as well as weight of each kitten, the identical pivot_wider command to above results in this useless format:

print(kitten)

If pivot_wider does not result in a shorter dataset, we need to either drop variable(s) prior to pivot_wider (such as length variable here): kits$length <- NULL or else specify all variables that vary by cluster as needing restructuring; the command becomes a little more complex.

13.3.1 Creating a count within cluster to prepare for pivot_wider

Suppose you have patientid, then a date variable to represent different visits, but no visit number variables. You need to initially create a visitno variable, before you can successfully pivot_wider. Otherwise, you end up with multiple versions of each restructured variable, one for each different date (from your date variable – all dates might possibly be different.

The code below usesarrange from tidyverse dplyr to sort the data. In the example, it is sorted by hserial (=household ID number, a form a cluster ID), then by pserial (=person ID number, where there are a few lines for each person within clusters/households).

Then the data is grouped by the cluster variable, using dpylr’s group_by( ) command. Use library(tidyverse) or library(dplyr). Then a new command is created, and since this is relying on dpylr’s grouping, we need to use dpylr’s mutate to create the new variable. The code below specifies the namepersonno for this new variable, which is a sequence from 1, going up in steps of 1 (by hserial, the group_by variable), until it reaches the total number (in that hserial cluster).

ihd_pain3 <- read.csv("ihd_pain3.csv")

ihd_pain3 # data on people (id pserial) within households (id hserial)

# The above does not have a variable numbering data within groups
ihd_pain4 <- arrange(ihd_pain3, hserial, pserial) # sort by hserial then pserial

ihd_pain4 <- group_by(ihd_pain4, hserial)  # group by hserial

# Note text "Group: hserial[8]" denoting 8 groups

# NOTE: mutate is Tidyverse command to create new var - must be used for n() to work
# create new var personno var that counts people within household
ihd_pain4 <- mutate(ihd_pain4, personno=seq(from=1,by=1, to=n())) 

print(ihd_pain4)

To help you understand the code above:

seq(from=1, by=1, to=9) gives a sequence from 1 to 9 in steps of 1, hence it results in: 1,2,3,4,5,6,7,8,9.

seq(from=1, by=1, to=n()) results in 1,2,3,4,5,6,7,…… n() where n() is the total in the cluster/ group_by variable.

If you are not familiar with “pipe %>%” format of commands, the above code is useful. Many people would write the above codein the following pipe format. It can be trickier to debug pipes, since you might not be certain where the error lies. To debug pipes, you can add in one line at a time to make sure each works (or remove lines one at a time, until you find code that runs).

ihd_pain4 <- ihd_pain3 %>% 
    arrange(hserial, pserial) %>% 
    group_by(hserial) %>% 
    mutate(personno=seq(from=1, by=1, to=n()))

The above syntax will define new variable visitno to equal 1, 2, 3, 4, 5…, starting at 1 again for each new pserial (personid).

For this specific example, several clusters contain only one person, so personno (which counts people within clusters) takes the value 1. There is just one hserial (household) which has 3 people in it (3 pserial’s) in this example 1022021 – with personno 1, 2 and 3 against it.

print(ihd_pain4)

13.4 Creating change variables for repeated measures data

Here is an example of creating changes from baseline and changes from previous measurement. This works on data in long format. This uses syntax from tidyverse’s dplyr so requires library(dplyr) or library(tidyverse).

# this enables dmy function to create var that R recognises as a date
require(lubridate)
# Creating variables that give differences from baseline and from last
# measurement for both dates and numeric data. This is easiest when data is in
# long format

# The following commands convert the string "date_visit" variable into date format
# then group by personid, and sort by personid and date
# Then creates a sequence that counts 1, 2, 3... within each personid

# keep only rows with non-missing personid
phq3 <- subset(phq, is.na(personid)==FALSE) 

# create date var from string "date" var "date_visit", with date in order date,
# month, year
phq3$datev <- dmy(phq3$date_visit)

# sort dataframe phq by personid and then datev variables
phq3 <- arrange(phq3, personid, datev)

# group by personid
phq3 <- group_by(phq3, personid) 

# DROP this variable - note negative - sign before its name
phq3 <- select(phq3, -date_visit) 

phq3<- mutate(phq3,
              # count visitno within group_by variable
              nnn=seq(from=1,by=1, to=n()),
              # first is first value (within group_by var)
              baseline_phq=first(phq), 
              # find minimum date (within group_by var)
              baseline_date=min((datev), na.rm=TRUE), 
              # find diff from baseline date
              time_since_baseline=(datev - baseline_date),
              # find diff in phq from baseline
              phq_change_from_baseline=phq-baseline_phq, 
              # lag(phq) gives the value of phq on row above (provided it has 
              # the same personid = same group_by variable) it results in 
              # missing value for the first row within each personid 
              # (our group_by variable)
              # ifelse fills in zero diff at baseline rather than missing
              phq_change_from_previous_time=ifelse(nnn==1,0,phq - lag(phq)),
              # similarly for date
              time_from_previous=ifelse(nnn==1, 0, datev - lag(datev)) 
              ) 

print(phq3, font_size=5)
# Restructuring to wide format with several groups of variables to restructure
# Note that it is important to specify all variables that vary by cluster so 
# that the following code does not produce data with lots of NAs with more lines
# than desired alternatively, variables that are not needed can be dropped 
# before this transformation
phq4 <- pivot_wider(phq3, 
                    values_from=c("phq", "time_since_baseline",
                                  "phq_change_from_baseline",
                                  "phq_change_from_previous_time",
                                  "time_from_previous", "datev"  ),
                    names_from="nnn"
  )

print(phq4, font_size=3)

This can be converted back to long format as follows:

# Restructuring back to long format
phq5 <- pivot_longer(phq4, 
              cols=starts_with(c("phq", "time_since_baseline",
                                 "phq_change_from_baseline",
                                 "phq_change_from_previous",
                                 "time_from_previous", "datev")),
               # take latter part of names and put into var="visitno"
               # ".value" above implies that column names comes from info in
               names_to = c(".value", "visitno"), 
               # first part of var names as defined below:
               names_sep = "(?<=[A-Za-z])_(?=[0-9])" # separate var names 
               # separate where there is a letter [A-Za-z] followed by "_" 
               # followed by a number [0-9] 
               # because we specified ".value" in names_to command we don't need
               # to specify "values_to"
  )

print(phq5, font_size=5)

13.5 Exercises

  1. Open puppies.csv, drop the names. Now reshape so that each puppy has its own line of data showing its birth-weight.

  2. Reopen the dataset puppies.csv, and this time keep the names. Now reshape so that each puppy has its own line of data, showing its birth-weight and name.

  3. Reshape kittens.csv, so that each kitten has its own line of data with its own birth-weight. Hint: you need to rename some variables first. Now use the reshape command again (changing “long” to “wide” in the command) to get back to the original data structure. Notice how the coding does and does not relate to variable names, in each version of the command.

  4. Reshape kittens_puppies.csv, so that each baby (kitten or puppy) has its birth-weight on its own line.

  5. Reshape husband_anx_deprn.csv so that the baseline and follow-up PHQ scores both appear on the same line.

  6. Reshape wife_deprn.csv so that the baseline and follow-up PHQ scores both appear on the same line. Hint: you will need to firstly create a variable that contains the visit number, using _n.

  7. Reshape couple_deprn_anxiety2.csv dataset, so that the baseline and the follow-up visits occur on separate lines. Reshape again, so that data for husbands and wives also occur on separate lines. (wife/baseline, wife/follow-up, husband/baseline and husband/follow-up, all on separate rows for each family).

  8. Reshape the ihddata.csv into wide format.


Datasets are available here: https://github.com/hcwatt/data_wrangling_open.

Data Wrangling Recipes in R: Hilary Watt. PCPH, Imperial College London.