6.4 Rectangling
Rectangling is the art and craft of taking a deeply nested list (often sourced from wild caught JSON or XML) and taming it into a tidy data set of rows and columns. There are three functions from tidyr that are particularly useful for rectangling:
unnest_longer()
takes each element of a list-column and makes a new row.unnest_wider()
takes each element of a list-column and makes a new column.unnest_auto()
guesses whether you wantunnest_longer()
orunnest_wider()
.hoist()
is similar to unnest_wider() but only plucks out selected components, and can reach down multiple levels.
A very large number of data rectangling problems can be solved by combining these functions with a splash of dplyr
(largely eliminating prior approaches that combined mutate()
with multiple purrr::map()
s).
To illustrate these techniques, we’ll use the repurrrsive package, which provides a number deeply nested lists originally mostly captured from web APIs.
6.4.1 Github users
We’ll start with gh_users
, a list which contains information about six GitHub users.
To begin, we put the gh_users list into a data frame:
(users <- tibble(user = gh_users))
#> # A tibble: 6 x 1
#> user
#> <list>
#> 1 <named list [30]>
#> 2 <named list [30]>
#> 3 <named list [30]>
#> 4 <named list [30]>
#> 5 <named list [30]>
#> 6 <named list [30]>
Each element of column user
is yet another list, where each element represents a column.
names(users$user[[1]])
#> [1] "login" "id" "avatar_url"
#> [4] "gravatar_id" "url" "html_url"
#> [7] "followers_url" "following_url" "gists_url"
#> [10] "starred_url" "subscriptions_url" "organizations_url"
#> [13] "repos_url" "events_url" "received_events_url"
#> [16] "type" "site_admin" "name"
#> [19] "company" "blog" "location"
#> [22] "email" "hireable" "bio"
#> [25] "public_repos" "public_gists" "followers"
#> [28] "following" "created_at" "updated_at"
Obviously we could use unnest_wider()
to turn the list components into columns:
users %>% unnest_wider(user)
#> # A tibble: 6 x 30
#> login id avatar_url gravatar_id url html_url followers_url following_url
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 gabo~ 6.60e5 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 2 jenn~ 5.99e5 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 3 jtle~ 1.57e6 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 4 juli~ 1.25e7 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 5 leep~ 3.51e6 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 6 masa~ 8.36e6 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> # ... with 22 more variables: gists_url <chr>, starred_url <chr>,
#> # subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
#> # events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
#> # name <chr>, company <chr>, blog <chr>, location <chr>, email <chr>,
#> # public_repos <int>, public_gists <int>, followers <int>, following <int>,
#> # created_at <chr>, updated_at <chr>, bio <chr>, hireable <lgl>
But in this case, there are many components and we don’t need most of them so we can instead use hoist()
. hoist()
allows us to pull out selected components using the same syntax as purrr::pluck()
:
users %>% hoist(user,
followers = "followers",
login = "login",
url = "html_url")
#> # A tibble: 6 x 4
#> followers login url user
#> <int> <chr> <chr> <list>
#> 1 303 gaborcsardi https://github.com/gaborcsardi <named list [27]>
#> 2 780 jennybc https://github.com/jennybc <named list [27]>
#> 3 3958 jtleek https://github.com/jtleek <named list [27]>
#> 4 115 juliasilge https://github.com/juliasilge <named list [27]>
#> 5 213 leeper https://github.com/leeper <named list [27]>
#> 6 34 masalmon https://github.com/masalmon <named list [27]>
hoist()
从列表列中提取出指明的元素作为新变量,保留余下的元素
hoist()
removes the named components from the user list-column, so you can think of it as moving components out of the inner list into the top-level data frame
6.4.2 Github repos
We start off gh_repos
similarly, by putting it in a tibble:
repos <- tibble(repo = gh_repos)
repos
#> # A tibble: 6 x 1
#> repo
#> <list>
#> 1 <list [30]>
#> 2 <list [30]>
#> 3 <list [30]>
#> 4 <list [26]>
#> 5 <list [30]>
#> 6 <list [30]>
By comparison, gh_repos
is more nested than gh_users
, with elements in the 2nd hierarchy being repositorys that gh_users
own, and thus requires one more level of information to record each repo.
This time the elements of user are a list of repositories that belong to that user. These are observations, so should become new rows, so we use unnest_longer()
rather than unnest_wider()
:
repos <- repos %>% unnest_longer(repo)
repos
#> # A tibble: 176 x 1
#> repo
#> <list>
#> 1 <named list [68]>
#> 2 <named list [68]>
#> 3 <named list [68]>
#> 4 <named list [68]>
#> 5 <named list [68]>
#> 6 <named list [68]>
#> # ... with 170 more rows
Now each rwo representes a repository, then we can use unnest_wider()
or hoist()
:
repos %>% hoist(repo,
login = list("owner", "login"),
name = "name",
homepage = "homepage",
watchers = "watchers_count"
)
#> # A tibble: 176 x 5
#> login name homepage watchers repo
#> <chr> <chr> <chr> <int> <list>
#> 1 gaborcsardi after <NA> 5 <named list [65]>
#> 2 gaborcsardi argufy <NA> 19 <named list [65]>
#> 3 gaborcsardi ask <NA> 5 <named list [65]>
#> 4 gaborcsardi baseimports <NA> 0 <named list [65]>
#> 5 gaborcsardi citest <NA> 0 <named list [65]>
#> 6 gaborcsardi clisymbols "" 18 <named list [65]>
#> # ... with 170 more rows
Note the use of list("owner", "login")
: this allows us to reach two levels deep inside of a list using the same syntax as purrr::pluck()
. An alternative approach would be to pull out just owner and then put each element of it in a column:
repos %>%
hoist(repo, owner = "owner") %>%
unnest_wider(owner)
#> # A tibble: 176 x 18
#> login id avatar_url gravatar_id url html_url followers_url following_url
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 gabo~ 660288 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 2 gabo~ 660288 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 3 gabo~ 660288 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 4 gabo~ 660288 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 5 gabo~ 660288 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 6 gabo~ 660288 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> # ... with 170 more rows, and 10 more variables: gists_url <chr>,
#> # starred_url <chr>, subscriptions_url <chr>, organizations_url <chr>,
#> # repos_url <chr>, events_url <chr>, received_events_url <chr>, type <chr>,
#> # site_admin <lgl>, repo <list>
Instead of looking at the list and carefully thinking about whether it needs to become rows or columns, you can use unnest_auto()
. It uses a handful of heuristics to figure out whether unnest_longer()
or unnest_wider()
is appropriate, and tells you about its reasoning.
tibble(repo = gh_repos) %>%
unnest_auto(repo) %>%
unnest_auto(repo)
#> Using `unnest_longer(repo)`; no element has names
#> Using `unnest_wider(repo)`; elements have 68 names in common
#> # A tibble: 176 x 67
#> id name full_name owner private html_url description fork url
#> <int> <chr> <chr> <lis> <lgl> <chr> <chr> <lgl> <chr>
#> 1 6.12e7 after gaborcsa~ <nam~ FALSE https:/~ Run Code i~ FALSE http~
#> 2 4.05e7 argu~ gaborcsa~ <nam~ FALSE https:/~ Declarativ~ FALSE http~
#> 3 3.64e7 ask gaborcsa~ <nam~ FALSE https:/~ Friendly C~ FALSE http~
#> 4 3.49e7 base~ gaborcsa~ <nam~ FALSE https:/~ Do we get ~ FALSE http~
#> 5 6.16e7 cite~ gaborcsa~ <nam~ FALSE https:/~ Test R pac~ TRUE http~
#> 6 3.39e7 clis~ gaborcsa~ <nam~ FALSE https:/~ Unicode sy~ FALSE http~
#> # ... with 170 more rows, and 58 more variables: forks_url <chr>,
#> # keys_url <chr>, collaborators_url <chr>, teams_url <chr>, hooks_url <chr>,
#> # issue_events_url <chr>, events_url <chr>, assignees_url <chr>,
#> # branches_url <chr>, tags_url <chr>, blobs_url <chr>, git_tags_url <chr>,
#> # git_refs_url <chr>, trees_url <chr>, statuses_url <chr>,
#> # languages_url <chr>, stargazers_url <chr>, contributors_url <chr>,
#> # subscribers_url <chr>, subscription_url <chr>, commits_url <chr>,
#> # git_commits_url <chr>, comments_url <chr>, issue_comment_url <chr>,
#> # contents_url <chr>, compare_url <chr>, merges_url <chr>, archive_url <chr>,
#> # downloads_url <chr>, issues_url <chr>, pulls_url <chr>,
#> # milestones_url <chr>, notifications_url <chr>, labels_url <chr>,
#> # releases_url <chr>, deployments_url <chr>, created_at <chr>,
#> # updated_at <chr>, pushed_at <chr>, git_url <chr>, ssh_url <chr>,
#> # clone_url <chr>, svn_url <chr>, size <int>, stargazers_count <int>,
#> # watchers_count <int>, language <chr>, has_issues <lgl>,
#> # has_downloads <lgl>, has_wiki <lgl>, has_pages <lgl>, forks_count <int>,
#> # open_issues_count <int>, forks <int>, open_issues <int>, watchers <int>,
#> # default_branch <chr>, homepage <chr>
6.4.3 Game of Throne characters
got_chars
has a similar structure to gh_users
: it’s a list of named lists, where each element of the inner list describes some attribute of a GoT character.
We start in the same way, first by creating a data frame and then by unnesting each component into a column:
chars <- tibble(char = got_chars)
chars
#> # A tibble: 30 x 1
#> char
#> <list>
#> 1 <named list [18]>
#> 2 <named list [18]>
#> 3 <named list [18]>
#> 4 <named list [18]>
#> 5 <named list [18]>
#> 6 <named list [18]>
#> # ... with 24 more rows
chars2 <- chars %>% unnest_wider(char)
chars2
#> # A tibble: 30 x 18
#> url id name gender culture born died alive titles aliases father
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <list> <list> <chr>
#> 1 http~ 1022 Theo~ Male "Ironb~ "In ~ "" TRUE <chr ~ <chr [~ ""
#> 2 http~ 1052 Tyri~ Male "" "In ~ "" TRUE <chr ~ <chr [~ ""
#> 3 http~ 1074 Vict~ Male "Ironb~ "In ~ "" TRUE <chr ~ <chr [~ ""
#> 4 http~ 1109 Will Male "" "" "In ~ FALSE <chr ~ <chr [~ ""
#> 5 http~ 1166 Areo~ Male "Norvo~ "In ~ "" TRUE <chr ~ <chr [~ ""
#> 6 http~ 1267 Chett Male "" "At ~ "In ~ FALSE <chr ~ <chr [~ ""
#> # ... with 24 more rows, and 7 more variables: mother <chr>, spouse <chr>,
#> # allegiances <list>, books <list>, povBooks <list>, tvSeries <list>,
#> # playedBy <list>
This is more complex than gh_users
because some component of char are themselves a list, giving us a collection of list-columns:
chars2 %>% select_if(is.list)
#> select_if: dropped 11 variables (url, id, name, gender, culture, …)
#> # A tibble: 30 x 7
#> titles aliases allegiances books povBooks tvSeries playedBy
#> <list> <list> <list> <list> <list> <list> <list>
#> 1 <chr [3]> <chr [4]> <chr [1]> <chr [3]> <chr [2]> <chr [6]> <chr [1]>
#> 2 <chr [2]> <chr [11]> <chr [1]> <chr [2]> <chr [4]> <chr [6]> <chr [1]>
#> 3 <chr [2]> <chr [1]> <chr [1]> <chr [3]> <chr [2]> <chr [1]> <chr [1]>
#> 4 <chr [1]> <chr [1]> <??? [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
#> 5 <chr [1]> <chr [1]> <chr [1]> <chr [3]> <chr [2]> <chr [2]> <chr [1]>
#> 6 <chr [1]> <chr [1]> <??? [1]> <chr [2]> <chr [1]> <chr [1]> <chr [1]>
#> # ... with 24 more rows
What you do next will depend on the purposes of the analysis. Maybe you want a row for every book and TV series that the character appears in:
chars2 %>%
select(name, books, tvSeries) %>%
pivot_longer(c(books, tvSeries), names_to = "media", values_to = "value") %>%
unnest_longer(value)
#> select: dropped 15 variables (url, id, gender, culture, born, …)
#> pivot_longer: reorganized (books, tvSeries) into (media, value) [was 30x3, now 60x3]
#> # A tibble: 180 x 3
#> name media value
#> <chr> <chr> <chr>
#> 1 Theon Greyjoy books A Game of Thrones
#> 2 Theon Greyjoy books A Storm of Swords
#> 3 Theon Greyjoy books A Feast for Crows
#> 4 Theon Greyjoy tvSeries Season 1
#> 5 Theon Greyjoy tvSeries Season 2
#> 6 Theon Greyjoy tvSeries Season 3
#> # ... with 174 more rows
Or maybe you want to build a table that lets you match title to name:
chars2 %>%
select(name, title = titles) %>%
unnest_longer(title)
#> select: renamed one variable (title) and dropped 16 variables
#> # A tibble: 60 x 2
#> name title
#> <chr> <chr>
#> 1 Theon Greyjoy Prince of Winterfell
#> 2 Theon Greyjoy Captain of Sea Bitch
#> 3 Theon Greyjoy Lord of the Iron Islands (by law of the green lands)
#> 4 Tyrion Lannister Acting Hand of the King (former)
#> 5 Tyrion Lannister Master of Coin (former)
#> 6 Victarion Greyjoy Lord Captain of the Iron Fleet
#> # ... with 54 more rows