6.4 Rectangling

Rectangling is the art and craft of taking a deeply nested list (often sourced from wild-caught JSON or XML) and taming it into a tidy data set of rows and columns. There are four functions from tidyr that are particularly useful for rectangling:

unnest_longer() takes each element of a list-column and makes a new row.
unnest_wider() takes each element of a list-column and makes a new column.
unnest_auto() guesses whether you want unnest_longer() or unnest_wider().
hoist() is similar to unnest_wider() but only plucks out selected components, and can reach down multiple levels.
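
As a minimal sketch of how these verbs differ, the toy example below applies them to a small hand-made list. The people tibble and its component names (name, langs) are hypothetical and not part of repurrrsive.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: one named list per person.
people <- tibble(person = list(
  list(name = "Ada", langs = c("R", "C")),
  list(name = "Bo",  langs = "Python")
))

# unnest_wider(): every named component becomes its own column.
wide <- people %>% unnest_wider(person)

# unnest_longer(): every element of the langs list-column becomes its own row.
long <- wide %>% unnest_longer(langs)

# hoist(): pull out only the components you ask for; the rest stay in person.
picked <- people %>% hoist(person, name = "name")
```

Here wide has one row per person, long has one row per person-language pair, and picked keeps the unselected components inside the person list-column.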

A very large number of data rectangling problems can be solved by combining these functions with a splash of dplyr (largely eliminating prior approaches that combined mutate() with multiple purrr::map()s).
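
To make the comparison with the older mutate() plus purrr::map() style concrete, here is a small sketch on made-up data. The users tibble and its stats component are assumptions for illustration only.

```r
library(tidyr)
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical nested data with a two-level component (stats -> followers).
users <- tibble(user = list(
  list(login = "ada", stats = list(followers = 10L)),
  list(login = "bo",  stats = list(followers = 3L))
))

# Older style: one typed map_*() call per component to extract.
via_map <- users %>%
  mutate(
    login = map_chr(user, "login"),
    followers = map_int(user, list("stats", "followers"))
  )

# Rectangling style: a single hoist() call plucks both components.
via_hoist <- users %>%
  hoist(user, login = "login", followers = list("stats", "followers"))
```

Both approaches yield the same login and followers columns. The difference is that hoist() also removes the extracted components from the list-column and does not require choosing a typed map variant for each one.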

To illustrate these techniques, we’ll use the repurrrsive package, which provides a number of deeply nested lists, most of them originally captured from web APIs.

library(tidyverse)    # provides tidyr, dplyr, tibble and the %>% pipe used below
library(repurrrsive)

6.4.1 GitHub users

We’ll start with gh_users, a list which contains information about six GitHub users.

listviewer::jsonedit(gh_users)

Each user is a named list, where each element represents a column.

To begin, we put the gh_users list into a data frame:

(users <- tibble(user = gh_users))
#> # A tibble: 6 x 1
#>   user             
#>   <list>           
#> 1 <named list [30]>
#> 2 <named list [30]>
#> 3 <named list [30]>
#> 4 <named list [30]>
#> 5 <named list [30]>
#> 6 <named list [30]>

Each element of the user column is itself a named list, and each of its components represents a column:

names(users$user[[1]])
#>  [1] "login"               "id"                  "avatar_url"         
#>  [4] "gravatar_id"         "url"                 "html_url"           
#>  [7] "followers_url"       "following_url"       "gists_url"          
#> [10] "starred_url"         "subscriptions_url"   "organizations_url"  
#> [13] "repos_url"           "events_url"          "received_events_url"
#> [16] "type"                "site_admin"          "name"               
#> [19] "company"             "blog"                "location"           
#> [22] "email"               "hireable"            "bio"                
#> [25] "public_repos"        "public_gists"        "followers"          
#> [28] "following"           "created_at"          "updated_at"

One option is to use unnest_wider() to turn every list component into a column:

users %>% unnest_wider(user)
#> # A tibble: 6 x 30
#>   login     id avatar_url gravatar_id url   html_url followers_url following_url
#>   <chr>  <int> <chr>      <chr>       <chr> <chr>    <chr>         <chr>        
#> 1 gabo~ 6.60e5 https://a~ ""          http~ https:/~ https://api.~ https://api.~
#> 2 jenn~ 5.99e5 https://a~ ""          http~ https:/~ https://api.~ https://api.~
#> 3 jtle~ 1.57e6 https://a~ ""          http~ https:/~ https://api.~ https://api.~
#> 4 juli~ 1.25e7 https://a~ ""          http~ https:/~ https://api.~ https://api.~
#> 5 leep~ 3.51e6 https://a~ ""          http~ https:/~ https://api.~ https://api.~
#> 6 masa~ 8.36e6 https://a~ ""          http~ https:/~ https://api.~ https://api.~
#> # ... with 22 more variables: gists_url <chr>, starred_url <chr>,
#> #   subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
#> #   events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
#> #   name <chr>, company <chr>, blog <chr>, location <chr>, email <chr>,
#> #   public_repos <int>, public_gists <int>, followers <int>, following <int>,
#> #   created_at <chr>, updated_at <chr>, bio <chr>, hireable <lgl>

But in this case, there are many components and we don’t need most of them, so we can instead use hoist(). hoist() allows us to pull out selected components using the same syntax as purrr::pluck():

users %>% hoist(user,
                followers = "followers",
                login = "login",
                url = "html_url")
#> # A tibble: 6 x 4
#>   followers login       url                            user             
#>       <int> <chr>       <chr>                          <list>           
#> 1       303 gaborcsardi https://github.com/gaborcsardi <named list [27]>
#> 2       780 jennybc     https://github.com/jennybc     <named list [27]>
#> 3      3958 jtleek      https://github.com/jtleek      <named list [27]>
#> 4       115 juliasilge  https://github.com/juliasilge  <named list [27]>
#> 5       213 leeper      https://github.com/leeper      <named list [27]>
#> 6        34 masalmon    https://github.com/masalmon    <named list [27]>

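hoist() extracts the specified components from the list-column as new columns and keeps the remaining components in the list-column.

hoist() removes the named components from the user list-column (it now holds 27 components per user instead of 30), so you can think of it as moving components out of the inner list into the top-level data frame.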
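
The sketch below illustrates this "moving" behaviour on a toy list-column, and shows the .remove argument of hoist(), which keeps the original components in place when set to FALSE. The users tibble here is a hypothetical cut-down stand-in, not the gh_users data.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for gh_users with only three components per user.
users <- tibble(user = list(
  list(login = "ada", id = 1L, site_admin = FALSE),
  list(login = "bo",  id = 2L, site_admin = TRUE)
))

# Default behaviour: login is moved out of the inner lists.
moved <- users %>% hoist(user, login = "login")
lengths(moved$user)

# With .remove = FALSE, login is copied and the inner lists are left untouched.
copied <- users %>% hoist(user, login = "login", .remove = FALSE)
lengths(copied$user)
```

Comparing the two lengths() results shows that the default drops one component from every inner list, while .remove = FALSE leaves them at their original size.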

6.4.2 GitHub repos

We start off with gh_repos similarly, by putting it in a tibble:

repos <- tibble(repo = gh_repos)
repos
#> # A tibble: 6 x 1
#>   repo       
#>   <list>     
#> 1 <list [30]>
#> 2 <list [30]>
#> 3 <list [30]>
#> 4 <list [26]>
#> 5 <list [30]>
#> 6 <list [30]>

By comparison, gh_repos is nested one level deeper than gh_users: each top-level element is a list of the repositories owned by one user, and each repository is in turn a named list of attributes. Recording a single repo therefore takes one more level of nesting.

listviewer::jsonedit(gh_repos)

This time the elements of repo are lists of repositories that belong to one user. These are observations, so they should become new rows, and we use unnest_longer() rather than unnest_wider():

repos <- repos %>% unnest_longer(repo)
repos
#> # A tibble: 176 x 1
#>   repo             
#>   <list>           
#> 1 <named list [68]>
#> 2 <named list [68]>
#> 3 <named list [68]>
#> 4 <named list [68]>
#> 5 <named list [68]>
#> 6 <named list [68]>
#> # ... with 170 more rows

Now each row represents a repository, so we can use unnest_wider() or hoist():

repos %>% hoist(repo, 
  login = list("owner", "login"), 
  name = "name",
  homepage = "homepage",
  watchers = "watchers_count"
)
#> # A tibble: 176 x 5
#>   login       name        homepage watchers repo             
#>   <chr>       <chr>       <chr>       <int> <list>           
#> 1 gaborcsardi after        <NA>           5 <named list [65]>
#> 2 gaborcsardi argufy       <NA>          19 <named list [65]>
#> 3 gaborcsardi ask          <NA>           5 <named list [65]>
#> 4 gaborcsardi baseimports  <NA>           0 <named list [65]>
#> 5 gaborcsardi citest       <NA>           0 <named list [65]>
#> 6 gaborcsardi clisymbols  ""             18 <named list [65]>
#> # ... with 170 more rows

Note the use of list("owner", "login"): this allows us to reach two levels deep inside of a list using the same syntax as purrr::pluck(). An alternative approach would be to pull out just owner and then put each element of it in a column:

repos %>% 
  hoist(repo, owner = "owner") %>% 
  unnest_wider(owner)
#> # A tibble: 176 x 18
#>   login     id avatar_url gravatar_id url   html_url followers_url following_url
#>   <chr>  <int> <chr>      <chr>       <chr> <chr>    <chr>         <chr>        
#> 1 gabo~ 660288 https://a~ ""          http~ https:/~ https://api.~ https://api.~
#> 2 gabo~ 660288 https://a~ ""          http~ https:/~ https://api.~ https://api.~
#> 3 gabo~ 660288 https://a~ ""          http~ https:/~ https://api.~ https://api.~
#> 4 gabo~ 660288 https://a~ ""          http~ https:/~ https://api.~ https://api.~
#> 5 gabo~ 660288 https://a~ ""          http~ https:/~ https://api.~ https://api.~
#> 6 gabo~ 660288 https://a~ ""          http~ https:/~ https://api.~ https://api.~
#> # ... with 170 more rows, and 10 more variables: gists_url <chr>,
#> #   starred_url <chr>, subscriptions_url <chr>, organizations_url <chr>,
#> #   repos_url <chr>, events_url <chr>, received_events_url <chr>, type <chr>,
#> #   site_admin <lgl>, repo <list>
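
To see why the list("owner", "login") path behaves like purrr::pluck(), the following sketch compares the two on a tiny stand-in for the unnested repos. The two-row tibble is a hand-made simplification with far fewer components than the real data.

```r
library(tidyr)
library(purrr)
library(tibble)

# Simplified stand-in: each repo has a name and a nested owner list.
repos <- tibble(repo = list(
  list(name = "after", owner = list(login = "gaborcsardi", id = 660288L)),
  list(name = "ask",   owner = list(login = "gaborcsardi", id = 660288L))
))

# pluck() walks the path "owner" -> "login" inside a single element.
one_login <- pluck(repos$repo[[1]], "owner", "login")

# hoist() applies the same path to every element of the list-column.
flat <- repos %>% hoist(repo, login = list("owner", "login"))
```

The value in flat$login for the first row is the same one that pluck() returns for the first element, which is a quick way to debug a hoist() path before applying it to a whole column.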

Instead of looking at the list and carefully thinking about whether it needs to become rows or columns, you can use unnest_auto(). It uses a handful of heuristics to figure out whether unnest_longer() or unnest_wider() is appropriate, and tells you about its reasoning.

tibble(repo = gh_repos) %>% 
  unnest_auto(repo) %>% 
  unnest_auto(repo)
#> Using `unnest_longer(repo)`; no element has names
#> Using `unnest_wider(repo)`; elements have 68 names in common
#> # A tibble: 176 x 67
#>       id name  full_name owner private html_url description fork  url  
#>    <int> <chr> <chr>     <lis> <lgl>   <chr>    <chr>       <lgl> <chr>
#> 1 6.12e7 after gaborcsa~ <nam~ FALSE   https:/~ Run Code i~ FALSE http~
#> 2 4.05e7 argu~ gaborcsa~ <nam~ FALSE   https:/~ Declarativ~ FALSE http~
#> 3 3.64e7 ask   gaborcsa~ <nam~ FALSE   https:/~ Friendly C~ FALSE http~
#> 4 3.49e7 base~ gaborcsa~ <nam~ FALSE   https:/~ Do we get ~ FALSE http~
#> 5 6.16e7 cite~ gaborcsa~ <nam~ FALSE   https:/~ Test R pac~ TRUE  http~
#> 6 3.39e7 clis~ gaborcsa~ <nam~ FALSE   https:/~ Unicode sy~ FALSE http~
#> # ... with 170 more rows, and 58 more variables: forks_url <chr>,
#> #   keys_url <chr>, collaborators_url <chr>, teams_url <chr>, hooks_url <chr>,
#> #   issue_events_url <chr>, events_url <chr>, assignees_url <chr>,
#> #   branches_url <chr>, tags_url <chr>, blobs_url <chr>, git_tags_url <chr>,
#> #   git_refs_url <chr>, trees_url <chr>, statuses_url <chr>,
#> #   languages_url <chr>, stargazers_url <chr>, contributors_url <chr>,
#> #   subscribers_url <chr>, subscription_url <chr>, commits_url <chr>,
#> #   git_commits_url <chr>, comments_url <chr>, issue_comment_url <chr>,
#> #   contents_url <chr>, compare_url <chr>, merges_url <chr>, archive_url <chr>,
#> #   downloads_url <chr>, issues_url <chr>, pulls_url <chr>,
#> #   milestones_url <chr>, notifications_url <chr>, labels_url <chr>,
#> #   releases_url <chr>, deployments_url <chr>, created_at <chr>,
#> #   updated_at <chr>, pushed_at <chr>, git_url <chr>, ssh_url <chr>,
#> #   clone_url <chr>, svn_url <chr>, size <int>, stargazers_count <int>,
#> #   watchers_count <int>, language <chr>, has_issues <lgl>,
#> #   has_downloads <lgl>, has_wiki <lgl>, has_pages <lgl>, forks_count <int>,
#> #   open_issues_count <int>, forks <int>, open_issues <int>, watchers <int>,
#> #   default_branch <chr>, homepage <chr>
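
The two messages printed above correspond to two of the heuristics: unnamed elements suggest rows, while elements with names in common suggest columns. A minimal sketch on hypothetical toy data shows each case in isolation.

```r
library(tidyr)
library(tibble)

# Elements without names: unnest_auto() should choose unnest_longer().
unnamed <- tibble(x = list(c(1, 2), 3))
longer <- unnest_auto(unnamed, x)

# Elements sharing the names a and b: unnest_auto() should choose unnest_wider().
named <- tibble(x = list(list(a = 1, b = "u"), list(a = 2, b = "v")))
wider <- unnest_auto(named, x)
```

Because unnest_auto() is guessing, it is best suited to interactive exploration. Once the message tells you which verb it picked, writing that verb explicitly makes the final script easier to read.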

6.4.3 Game of Thrones characters

got_chars has a similar structure to gh_users: it’s a list of named lists, where each element of the inner list describes some attribute of a GoT character.

listviewer::jsonedit(got_chars)

We start in the same way, first by creating a data frame and then by unnesting each component into a column:

chars <- tibble(char = got_chars)
chars
#> # A tibble: 30 x 1
#>   char             
#>   <list>           
#> 1 <named list [18]>
#> 2 <named list [18]>
#> 3 <named list [18]>
#> 4 <named list [18]>
#> 5 <named list [18]>
#> 6 <named list [18]>
#> # ... with 24 more rows

chars2 <- chars %>% unnest_wider(char)
chars2
#> # A tibble: 30 x 18
#>   url      id name  gender culture born  died  alive titles aliases father
#>   <chr> <int> <chr> <chr>  <chr>   <chr> <chr> <lgl> <list> <list>  <chr> 
#> 1 http~  1022 Theo~ Male   "Ironb~ "In ~ ""    TRUE  <chr ~ <chr [~ ""    
#> 2 http~  1052 Tyri~ Male   ""      "In ~ ""    TRUE  <chr ~ <chr [~ ""    
#> 3 http~  1074 Vict~ Male   "Ironb~ "In ~ ""    TRUE  <chr ~ <chr [~ ""    
#> 4 http~  1109 Will  Male   ""      ""    "In ~ FALSE <chr ~ <chr [~ ""    
#> 5 http~  1166 Areo~ Male   "Norvo~ "In ~ ""    TRUE  <chr ~ <chr [~ ""    
#> 6 http~  1267 Chett Male   ""      "At ~ "In ~ FALSE <chr ~ <chr [~ ""    
#> # ... with 24 more rows, and 7 more variables: mother <chr>, spouse <chr>,
#> #   allegiances <list>, books <list>, povBooks <list>, tvSeries <list>,
#> #   playedBy <list>

This is more complex than gh_users because some components of char are themselves lists, giving us a collection of list-columns:

chars2 %>% select_if(is.list)
#> select_if: dropped 11 variables (url, id, name, gender, culture, …)
#> # A tibble: 30 x 7
#>   titles    aliases    allegiances books     povBooks  tvSeries  playedBy 
#>   <list>    <list>     <list>      <list>    <list>    <list>    <list>   
#> 1 <chr [3]> <chr [4]>  <chr [1]>   <chr [3]> <chr [2]> <chr [6]> <chr [1]>
#> 2 <chr [2]> <chr [11]> <chr [1]>   <chr [2]> <chr [4]> <chr [6]> <chr [1]>
#> 3 <chr [2]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [1]> <chr [1]>
#> 4 <chr [1]> <chr [1]>  <??? [1]>   <chr [1]> <chr [1]> <chr [1]> <chr [1]>
#> 5 <chr [1]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [2]> <chr [1]>
#> 6 <chr [1]> <chr [1]>  <??? [1]>   <chr [2]> <chr [1]> <chr [1]> <chr [1]>
#> # ... with 24 more rows
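
select_if() is superseded in dplyr 1.0.0 and later. Assuming such a version is installed, the same selection of list-columns can be sketched with select() and where(), shown here on a hypothetical toy tibble.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with one atomic column and one list-column.
chars <- tibble(
  name = c("A", "B"),
  titles = list(c("Title 1", "Title 2"), "Title 3")
)

# where() takes a predicate function and keeps the columns for which it is TRUE.
list_cols <- chars %>% select(where(is.list))
```

The result contains only the titles column, since name is an ordinary character vector.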

What you do next will depend on the purposes of the analysis. Maybe you want a row for every book and TV series that the character appears in:

chars2 %>% 
  select(name, books, tvSeries) %>% 
  pivot_longer(c(books, tvSeries), names_to = "media", values_to = "value") %>% 
  unnest_longer(value)
#> select: dropped 15 variables (url, id, gender, culture, born, …)
#> pivot_longer: reorganized (books, tvSeries) into (media, value) [was 30x3, now 60x3]
#> # A tibble: 180 x 3
#>   name          media    value            
#>   <chr>         <chr>    <chr>            
#> 1 Theon Greyjoy books    A Game of Thrones
#> 2 Theon Greyjoy books    A Storm of Swords
#> 3 Theon Greyjoy books    A Feast for Crows
#> 4 Theon Greyjoy tvSeries Season 1         
#> 5 Theon Greyjoy tvSeries Season 2         
#> 6 Theon Greyjoy tvSeries Season 3         
#> # ... with 174 more rows
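
The row counts in this pipeline follow a simple pattern: pivot_longer() first produces one row per character and media type, and unnest_longer() then produces one row per individual value. A small sketch with made-up characters and values makes the arithmetic easy to check.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: two characters, two list-columns.
chars <- tibble(
  name = c("A", "B"),
  books = list(c("Book 1", "Book 2"), "Book 1"),
  tvSeries = list("Season 1", c("Season 1", "Season 2"))
)

# Step 1: stack the two list-columns, giving 2 characters x 2 media types = 4 rows.
stacked <- chars %>%
  pivot_longer(c(books, tvSeries), names_to = "media", values_to = "value")

# Step 2: expand each list element into its own row (2 + 1 + 1 + 2 = 6 rows).
long <- stacked %>% unnest_longer(value)
```

After step 1 the value column is still a list-column; only step 2 turns it into a plain character column.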

Or maybe you want to build a table that lets you match title to name:

chars2 %>% 
  select(name, title = titles) %>% 
  unnest_longer(title)
#> select: renamed one variable (title) and dropped 16 variables
#> # A tibble: 60 x 2
#>   name              title                                               
#>   <chr>             <chr>                                               
#> 1 Theon Greyjoy     Prince of Winterfell                                
#> 2 Theon Greyjoy     Captain of Sea Bitch                                
#> 3 Theon Greyjoy     Lord of the Iron Islands (by law of the green lands)
#> 4 Tyrion Lannister  Acting Hand of the King (former)                    
#> 5 Tyrion Lannister  Master of Coin (former)                             
#> 6 Victarion Greyjoy Lord Captain of the Iron Fleet                      
#> # ... with 54 more rows

6.4.4 Sharla Gelfand’s discography

We’ll finish off with the most complex list, from Sharla Gelfand’s discography. We’ll start the usual way: putting the list into a single-column data frame, and then widening so each component is a column. We also parse the date_added column into a real date-time:

listviewer::jsonedit(discog)

discs <- tibble(disc = discog) %>% 
  unnest_wider(disc) %>% 
  mutate(date_added = as.POSIXct(strptime(date_added, "%Y-%m-%dT%H:%M:%S"))) 
#> mutate: converted 'date_added' from character to double (0 new NA)

discs 
#> # A tibble: 155 x 5
#>   instance_id date_added          basic_information       id rating
#>         <int> <dttm>              <list>               <int>  <int>
#> 1   354823933 2019-02-16 17:48:59 <named list [11]>  7496378      0
#> 2   354092601 2019-02-13 14:13:11 <named list [11]>  4490852      0
#> 3   354091476 2019-02-13 14:07:23 <named list [11]>  9827276      0
#> 4   351244906 2019-02-02 11:39:58 <named list [11]>  9769203      0
#> 5   351244801 2019-02-02 11:39:37 <named list [11]>  7237138      0
#> 6   351052065 2019-02-01 20:40:53 <named list [11]> 13117042      0
#> # ... with 149 more rows
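
The date parsing step can be tried in isolation with base R. The sketch below uses a single timestamp string in the same format and fixes the time zone to UTC so the result is reproducible; the tz argument is an assumption added here, since the pipeline above relies on the session's default time zone.

```r
# A timestamp string in the same layout as date_added (time zone offset omitted).
stamp <- "2019-02-16T17:48:59"

# strptime() parses to POSIXlt; as.POSIXct() converts to the compact date-time
# type that tibble prints as <dttm>.
parsed <- as.POSIXct(strptime(stamp, "%Y-%m-%dT%H:%M:%S", tz = "UTC"))

# as.POSIXct() can also take the format directly.
direct <- as.POSIXct(stamp, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
```

Both parsed and direct represent the same instant, so either spelling can be used inside mutate().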

At this level, we see information about when each disc was added to Sharla’s discography, but no information about the disc itself. To see that, we need to widen the basic_information column:

discs %>% unnest_wider(basic_information)
#> Error: Column name `id` must not be duplicated.

Unfortunately that fails because there’s an id column inside basic_information. We can quickly see what’s going on by setting names_repair = "unique" (the default, "check_unique", performs no name repair but checks that the names are unique):

discs %>% unnest_wider(basic_information, names_repair = "unique")
#> New names:
#> * id -> id...6
#> * id -> id...14
#> # A tibble: 155 x 15
#>   instance_id date_added          labels  year artists id...6 thumb title
#>         <int> <dttm>              <list> <int> <list>   <int> <chr> <chr>
#> 1   354823933 2019-02-16 17:48:59 <list~  2015 <list ~ 7.50e6 http~ Demo 
#> 2   354092601 2019-02-13 14:13:11 <list~  2013 <list ~ 4.49e6 http~ Obse~
#> 3   354091476 2019-02-13 14:07:23 <list~  2017 <list ~ 9.83e6 http~ I    
#> 4   351244906 2019-02-02 11:39:58 <list~  2017 <list ~ 9.77e6 http~ Oído~
#> 5   351244801 2019-02-02 11:39:37 <list~  2015 <list ~ 7.24e6 http~ A Ca~
#> 6   351052065 2019-02-01 20:40:53 <list~  2019 <list ~ 1.31e7 http~ Tash~
#> # ... with 149 more rows, and 7 more variables: formats <list>,
#> #   cover_image <chr>, resource_url <chr>, master_id <int>, master_url <chr>,
#> #   id...14 <int>, rating <int>
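
The name clash can be reproduced on a two-row toy tibble, which also makes it easy to compare names_repair = "unique" with the names_sep argument of unnest_wider(). The discs_toy data and its info column are hypothetical.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: id exists both at the top level and inside info.
discs_toy <- tibble(
  id = c(1L, 2L),
  info = list(list(id = 1L, title = "Demo"), list(id = 2L, title = "I"))
)

# Default names_repair = "check_unique": the duplicated id is an error.
err <- tryCatch(unnest_wider(discs_toy, info), error = function(e) e)

# "unique" keeps both columns and renames them to make the names unique.
repaired <- unnest_wider(discs_toy, info, names_repair = "unique")

# names_sep prefixes the inner names with the outer column name instead.
prefixed <- unnest_wider(discs_toy, info, names_sep = "_")
```

names_sep avoids the clash without any repair step, at the cost of longer column names such as info_id and info_title.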

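The problem is that basic_information repeats the id column that’s also stored at the top level. The first snippet below confirms that the two columns hold the same values; the second drops the top-level id before widening: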

discs %>% 
  unnest_wider(basic_information, names_repair = "unique") %>% 
  select(starts_with("id"))
#> New names:
#> * id -> id...6
#> * id -> id...14
#> select: dropped 13 variables (instance_id, date_added, labels, year, artists, …)
#> # A tibble: 155 x 2
#>     id...6  id...14
#>      <int>    <int>
#> 1  7496378  7496378
#> 2  4490852  4490852
#> 3  9827276  9827276
#> 4  9769203  9769203
#> 5  7237138  7237138
#> 6 13117042 13117042
#> # ... with 149 more rows

discs %>% 
  select(-id) %>% 
  unnest_wider(basic_information)
#> select: dropped one variable (id)
#> # A tibble: 155 x 14
#>   instance_id date_added          labels  year artists     id thumb title
#>         <int> <dttm>              <list> <int> <list>   <int> <chr> <chr>
#> 1   354823933 2019-02-16 17:48:59 <list~  2015 <list ~ 7.50e6 http~ Demo 
#> 2   354092601 2019-02-13 14:13:11 <list~  2013 <list ~ 4.49e6 http~ Obse~
#> 3   354091476 2019-02-13 14:07:23 <list~  2017 <list ~ 9.83e6 http~ I    
#> 4   351244906 2019-02-02 11:39:58 <list~  2017 <list ~ 9.77e6 http~ Oído~
#> 5   351244801 2019-02-02 11:39:37 <list~  2015 <list ~ 7.24e6 http~ A Ca~
#> 6   351052065 2019-02-01 20:40:53 <list~  2019 <list ~ 1.31e7 http~ Tash~
#> # ... with 149 more rows, and 6 more variables: formats <list>,
#> #   cover_image <chr>, resource_url <chr>, master_id <int>, master_url <chr>,
#> #   rating <int>

Alternatively, we could use hoist():

discs %>% 
  hoist(basic_information,
    title = "title",
    year = "year",
    label = list("labels", 1, "name"),
    artist = list("artists", 1, "name")
  )
#> # A tibble: 155 x 9
#>   instance_id date_added          title  year label artist basic_informati~
#>         <int> <dttm>              <chr> <int> <chr> <chr>  <list>          
#> 1   354823933 2019-02-16 17:48:59 Demo   2015 Tobi~ Mollot <named list [9]>
#> 2   354092601 2019-02-13 14:13:11 Obse~  2013 La V~ Una B~ <named list [9]>
#> 3   354091476 2019-02-13 14:07:23 I      2017 La V~ S.H.I~ <named list [9]>
#> 4   351244906 2019-02-02 11:39:58 Oído~  2017 La V~ Rata ~ <named list [9]>
#> 5   351244801 2019-02-02 11:39:37 A Ca~  2015 Kato~ Ivy (~ <named list [9]>
#> 6   351052065 2019-02-01 20:40:53 Tash~  2019 High~ Tashme <named list [9]>
#> # ... with 149 more rows, and 2 more variables: id <int>, rating <int>
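
A path such as list("labels", 1, "name") mixes names and positions: it enters labels, takes the first element, then reads its name. The sketch below shows the consequence on hypothetical toy data, where one disc has two labels.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: the first disc has two labels, the second has one.
discs_toy <- tibble(info = list(
  list(title = "Demo", labels = list(list(name = "Label A"), list(name = "Label B"))),
  list(title = "I",    labels = list(list(name = "Label C")))
))

# Position 1 selects only the first label of each disc.
first_label <- discs_toy %>%
  hoist(info, title = "title", label = list("labels", 1, "name"))
```

Note that "Label B" is not hoisted: indexing by position keeps the result at one row per disc, but silently ignores any additional labels or artists.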

A more systematic approach would be to create separate tables for artist and format, each keyed by the disc id:

# table for artist
discs %>% 
  hoist(basic_information, artist = "artists") %>% 
  select(disk_id = id, artist) %>% 
  unnest_longer(artist) %>% 
  unnest_wider(artist)
#> select: renamed one variable (disk_id) and dropped 4 variables
#> # A tibble: 167 x 8
#>    disk_id join  name          anv   tracks role  resource_url                id
#>      <int> <chr> <chr>         <chr> <chr>  <chr> <chr>                    <int>
#> 1  7496378 ""    Mollot        ""    ""     ""    https://api.discogs.co~ 4.62e6
#> 2  4490852 ""    Una Bèstia I~ ""    ""     ""    https://api.discogs.co~ 3.19e6
#> 3  9827276 ""    S.H.I.T. (3)  ""    ""     ""    https://api.discogs.co~ 2.77e6
#> 4  9769203 ""    Rata Negra    ""    ""     ""    https://api.discogs.co~ 4.28e6
#> 5  7237138 ""    Ivy (18)      ""    ""     ""    https://api.discogs.co~ 3.60e6
#> 6 13117042 ""    Tashme        ""    ""     ""    https://api.discogs.co~ 5.21e6
#> # ... with 161 more rows

# table for format
discs %>% 
  hoist(basic_information, format = "formats") %>% 
  select(disk_id = id, format) %>% 
  unnest_longer(format) %>% 
  unnest_wider(format) %>% 
  unnest_longer(descriptions)
#> select: renamed one variable (disk_id) and dropped 4 variables
#> # A tibble: 281 x 5
#>   disk_id descriptions text  name     qty  
#>     <int> <chr>        <chr> <chr>    <chr>
#> 1 7496378 "Numbered"   Black Cassette 1    
#> 2 4490852 "LP"         <NA>  Vinyl    1    
#> 3 9827276 "7\""        <NA>  Vinyl    1    
#> 4 9827276 "45 RPM"     <NA>  Vinyl    1    
#> 5 9827276 "EP"         <NA>  Vinyl    1    
#> 6 9769203 "LP"         <NA>  Vinyl    1    
#> # ... with 275 more rows
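
Because each of these tables carries disk_id, it can be joined back to disc-level information when needed. The following is a minimal sketch of that idea on hypothetical toy data; the join step is an illustration and not part of the original walkthrough.

```r
library(tidyr)
library(dplyr)
library(tibble)

# Hypothetical toy data: the second disc has two artists.
discs_toy <- tibble(
  id = c(1L, 2L),
  info = list(
    list(title = "Demo",  artists = list(list(name = "X", id = 10L))),
    list(title = "Split", artists = list(list(name = "Y", id = 20L),
                                         list(name = "Z", id = 30L)))
  )
)

# Child table: one row per disc-artist pair, keyed by disk_id.
artists <- discs_toy %>%
  hoist(info, artist = "artists") %>%
  select(disk_id = id, artist) %>%
  unnest_longer(artist) %>%
  unnest_wider(artist)

# Parent table: one row per disc.
titles <- discs_toy %>%
  hoist(info, title = "title") %>%
  select(disk_id = id, title)

# Join the child table back to the parent on the shared key.
joined <- artists %>% left_join(titles, by = "disk_id")
```

Keeping artists in their own table avoids the information loss of position-based hoisting, since every artist gets a row regardless of how many a disc has.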