6.3 Nesting

Nesting creates a list-column of data frames; unnesting flattens it back out into regular columns. Nesting is an implicitly summarising operation: you get one row for each group defined by the non-nested columns. This is useful in conjunction with other summaries that work with whole datasets, most notably models.

A nested data frame is no more than a data frame with one (or more) list-columns of data frames, so you can create simple nested data frames by hand:

# df1 is a nested data frame
(df1 <- tibble(
  g = c(1, 2, 3),
  data = list(
    tibble(x = 1, y = 2),
    tibble(x = 4:5, y = 6:7),
    tibble(x = 10)
  )
))
#> # A tibble: 3 x 2
#>       g data            
#>   <dbl> <list>          
#> 1     1 <tibble [1 x 2]>
#> 2     2 <tibble [2 x 2]>
#> 3     3 <tibble [1 x 1]>

More commonly, we create nested data frames with tidyr::nest(). df %>% nest(data = c(x, y)) specifies the columns to be nested, i.e. the columns that will appear in the inner data frame. Alternatively, you can nest() a grouped data frame created by dplyr::group_by(): the grouping variables remain in the outer data frame and the others are nested. The result preserves the grouping of the input.

Variables supplied to nest() will override grouping variables, so that df %>% group_by(x, y) %>% nest(data = z) will be equivalent to df %>% nest(data = z).

df2 <- tribble(
  ~g, ~x, ~y,
   1,  1,  2,
   2,  4,  6,
   2,  5,  7,
   3, 10,  NA
)
df2 %>% nest(data = c(x, y))
#> # A tibble: 3 x 2
#>       g data            
#>   <dbl> <list>          
#> 1     1 <tibble [1 x 2]>
#> 2     2 <tibble [2 x 2]>
#> 3     3 <tibble [1 x 2]>

# grouped nesting
df2 %>% 
  group_by(g) %>% 
  nest()
#> group_by: one grouping variable (g)
#> # A tibble: 3 x 2
#> # Groups:   g [3]
#>       g data            
#>   <dbl> <list>          
#> 1     1 <tibble [1 x 2]>
#> 2     2 <tibble [2 x 2]>
#> 3     3 <tibble [1 x 2]>

# equivalent, except that the result is ungrouped
df2 %>%
  group_nest(g)
#> # A tibble: 3 x 2
#>       g data            
#>   <dbl> <list>          
#> 1     1 <tibble [1 x 2]>
#> 2     2 <tibble [2 x 2]>
#> 3     3 <tibble [1 x 2]>

Nesting is easiest to understand in connection with grouped data: each row in the output corresponds to one group in the input. We’ll see shortly that this is particularly convenient when you have other per-group objects.
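
As a small illustration of the "one row per group" idea, the sketch below nests df2 and then computes a per-group value from each inner tibble. The names nested_sizes and n_rows are made up for this example; purrr::map_int() is used simply because each inner tibble yields a single integer.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)

# Same toy data as df2 above
df2 <- tribble(
  ~g, ~x, ~y,
   1,  1,  2,
   2,  4,  6,
   2,  5,  7,
   3, 10, NA
)

# One row per value of g; n_rows (illustrative name) holds the
# number of rows of the inner tibble stored in that row
nested_sizes <- df2 %>%
  nest(data = c(x, y)) %>%
  mutate(n_rows = map_int(data, nrow))

nested_sizes
```

The same pattern works for any function that takes a data frame and returns a single value per group.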

The opposite of nest() is unnest(). You give it the name of a list-column containing data frames, and it row-binds the data frames together, repeating the outer columns the right number of times to line up.

df1 %>% unnest(data)
#> # A tibble: 4 x 3
#>       g     x     y
#>   <dbl> <dbl> <dbl>
#> 1     1     1     2
#> 2     2     4     6
#> 3     2     5     7
#> 4     3    10    NA
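
To check that unnest() really undoes nest(), a round trip on df2 can be compared with the original. This is only a sketch: round_trip is an illustrative name, and the comparison goes through as.data.frame() so that it does not depend on how a particular dplyr version compares tibbles.

```r
library(tibble)
library(dplyr)
library(tidyr)

df2 <- tribble(
  ~g, ~x, ~y,
   1,  1,  2,
   2,  4,  6,
   2,  5,  7,
   3, 10, NA
)

# Nest x and y, then flatten them back out again
round_trip <- df2 %>%
  nest(data = c(x, y)) %>%
  unnest(data)

# Compare as plain data frames; all.equal() returns TRUE when they match
all.equal(as.data.frame(round_trip), as.data.frame(df2))
```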

dplyr::group_split() returns the rows of each group as a separate tibble in a list, similar to base::split():

df2 %>% group_split(g)
#> [[1]]
#> # A tibble: 1 x 3
#>       g     x     y
#>   <dbl> <dbl> <dbl>
#> 1     1     1     2
#> 
#> [[2]]
#> # A tibble: 2 x 3
#>       g     x     y
#>   <dbl> <dbl> <dbl>
#> 1     2     4     6
#> 2     2     5     7
#> 
#> [[3]]
#> # A tibble: 1 x 3
#>       g     x     y
#>   <dbl> <dbl> <dbl>
#> 1     3    10    NA
#> 
#> attr(,"ptype")
#> # A tibble: 0 x 3
#> # ... with 3 variables: g <dbl>, x <dbl>, y <dbl>
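
For comparison, here is a short sketch that runs both group_split() and base::split() on df2. The names by_dplyr and by_base are illustrative. One practical difference is that split() names the list elements after the grouping values, while group_split() returns an unnamed list.

```r
library(tibble)
library(dplyr)

df2 <- tribble(
  ~g, ~x, ~y,
   1,  1,  2,
   2,  4,  6,
   2,  5,  7,
   3, 10, NA
)

by_dplyr <- df2 %>% group_split(g)  # unnamed list of tibbles
by_base  <- split(df2, df2$g)       # list named after the values of g

names(by_dplyr)
names(by_base)

# Both contain the same number of rows per piece
vapply(by_dplyr, nrow, integer(1))
vapply(by_base, nrow, integer(1))
```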

6.3.1 Example: Managing multiple models

Nested data is a great fit for problems where you have one of something for each group. A common place this arises is when you’re fitting multiple models.

gapminder <- gapminder::gapminder
gapminder_nest <- gapminder %>% 
  mutate(year1950 = year - 1950) %>% 
  group_nest(continent, country)
#> mutate: new variable 'year1950' with 12 unique values and 0% NA

gapminder_nest
#> # A tibble: 142 x 3
#>   continent country      data             
#>   <fct>     <fct>        <list>           
#> 1 Africa    Algeria      <tibble [12 x 5]>
#> 2 Africa    Angola       <tibble [12 x 5]>
#> 3 Africa    Benin        <tibble [12 x 5]>
#> 4 Africa    Botswana     <tibble [12 x 5]>
#> 5 Africa    Burkina Faso <tibble [12 x 5]>
#> 6 Africa    Burundi      <tibble [12 x 5]>
#> # ... with 136 more rows

Now gapminder_nest is a tibble with 142 rows, one per country, with each country's time series from 1952 to 2007 stored in the list-column data. We can then combine mutate() and purrr::map() to create a new column holding a linear model fitted to each country:

mod_fit <- function(data) {
  lm(lifeExp ~ year1950, data = data)
}

gapminder_model <- gapminder_nest %>% 
  mutate(model = map(data, mod_fit))
#> mutate: new variable 'model' with 142 unique values and 0% NA
gapminder_model
#> # A tibble: 142 x 4
#>   continent country      data              model 
#>   <fct>     <fct>        <list>            <list>
#> 1 Africa    Algeria      <tibble [12 x 5]> <lm>  
#> 2 Africa    Angola       <tibble [12 x 5]> <lm>  
#> 3 Africa    Benin        <tibble [12 x 5]> <lm>  
#> 4 Africa    Botswana     <tibble [12 x 5]> <lm>  
#> 5 Africa    Burkina Faso <tibble [12 x 5]> <lm>  
#> 6 Africa    Burundi      <tibble [12 x 5]> <lm>  
#> # ... with 136 more rows
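
Because each model sits in its own row, single numbers can be pulled out with a typed map function. The following is a minimal sketch, assuming the gapminder package is installed; gapminder_slopes and slope are illustrative names, and map_dbl() with coef() is just one way to extract the fitted slope.

```r
library(dplyr)
library(purrr)

# Rebuild the nested data and the per-country models
gapminder_nest <- gapminder::gapminder %>%
  mutate(year1950 = year - 1950) %>%
  group_nest(continent, country)

mod_fit <- function(data) {
  lm(lifeExp ~ year1950, data = data)
}

# slope = estimated yearly change in life expectancy for each country
gapminder_slopes <- gapminder_nest %>%
  mutate(
    model = map(data, mod_fit),
    slope = map_dbl(model, ~ coef(.x)[["year1950"]])
  ) %>%
  select(continent, country, slope) %>%
  arrange(desc(slope))

gapminder_slopes
```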

Then use broom functions to generate “tidy” model summaries:

gapminder_summary <- gapminder_model %>% 
  mutate(
    glance = map(model, broom::glance),
    tidy = map(model, broom::tidy),
    augment = map(model, broom::augment)
  )
#> mutate: new variable 'glance' with 142 unique values and 0% NA
#>         new variable 'tidy' with 142 unique values and 0% NA
#>         new variable 'augment' with 142 unique values and 0% NA

gapminder_summary
#> # A tibble: 142 x 7
#>   continent country    data         model  glance       tidy        augment     
#>   <fct>     <fct>      <list>       <list> <list>       <list>      <list>      
#> 1 Africa    Algeria    <tibble [12~ <lm>   <tibble [1 ~ <tibble [2~ <tibble [12~
#> 2 Africa    Angola     <tibble [12~ <lm>   <tibble [1 ~ <tibble [2~ <tibble [12~
#> 3 Africa    Benin      <tibble [12~ <lm>   <tibble [1 ~ <tibble [2~ <tibble [12~
#> 4 Africa    Botswana   <tibble [12~ <lm>   <tibble [1 ~ <tibble [2~ <tibble [12~
#> 5 Africa    Burkina F~ <tibble [12~ <lm>   <tibble [1 ~ <tibble [2~ <tibble [12~
#> 6 Africa    Burundi    <tibble [12~ <lm>   <tibble [1 ~ <tibble [2~ <tibble [12~
#> # ... with 136 more rows

Finally, unnest() each summary column in turn:

# which country has the best fit
gapminder_summary %>% 
  unnest(glance) %>% 
  arrange(desc(r.squared))
#> # A tibble: 142 x 17
#>   continent country data  model r.squared adj.r.squared sigma statistic  p.value
#>   <fct>     <fct>   <lis> <lis>     <dbl>         <dbl> <dbl>     <dbl>    <dbl>
#> 1 Americas  Brazil  <tib~ <lm>      0.998         0.998 0.326     5111. 6.99e-15
#> 2 Africa    Maurit~ <tib~ <lm>      0.998         0.997 0.408     4290. 1.68e-14
#> 3 Europe    France  <tib~ <lm>      0.998         0.997 0.220     4200. 1.86e-14
#> 4 Europe    Switze~ <tib~ <lm>      0.997         0.997 0.215     3823. 2.98e-14
#> 5 Asia      Pakist~ <tib~ <lm>      0.997         0.997 0.403     3626. 3.88e-14
#> 6 Asia      Indone~ <tib~ <lm>      0.997         0.997 0.646     3455. 4.93e-14
#> # ... with 136 more rows, and 8 more variables: df <int>, logLik <dbl>,
#> #   AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>, tidy <list>,
#> #   augment <list>


gapminder_summary %>% 
  unnest(tidy)
#> # A tibble: 284 x 11
#>   continent country data  model glance term  estimate std.error statistic
#>   <fct>     <fct>   <lis> <lis> <list> <chr>    <dbl>     <dbl>     <dbl>
#> 1 Africa    Algeria <tib~ <lm>  <tibb~ (Int~   42.2      0.756      55.8 
#> 2 Africa    Algeria <tib~ <lm>  <tibb~ year~    0.569    0.0221     25.7 
#> 3 Africa    Angola  <tib~ <lm>  <tibb~ (Int~   31.7      0.804      39.4 
#> 4 Africa    Angola  <tib~ <lm>  <tibb~ year~    0.209    0.0235      8.90
#> 5 Africa    Benin   <tib~ <lm>  <tibb~ (Int~   38.9      0.671      58.0 
#> 6 Africa    Benin   <tib~ <lm>  <tibb~ year~    0.334    0.0196     17.0 
#> # ... with 278 more rows, and 2 more variables: p.value <dbl>, augment <list>

gapminder_summary %>% 
  unnest(augment)
#> # A tibble: 1,704 x 15
#>   continent country data  model glance tidy  lifeExp year1950 .fitted .se.fit
#>   <fct>     <fct>   <lis> <lis> <list> <lis>   <dbl>    <dbl>   <dbl>   <dbl>
#> 1 Africa    Algeria <tib~ <lm>  <tibb~ <tib~    43.1        2    43.4   0.718
#> 2 Africa    Algeria <tib~ <lm>  <tibb~ <tib~    45.7        7    46.2   0.627
#> 3 Africa    Algeria <tib~ <lm>  <tibb~ <tib~    48.3       12    49.1   0.544
#> 4 Africa    Algeria <tib~ <lm>  <tibb~ <tib~    51.4       17    51.9   0.472
#> 5 Africa    Algeria <tib~ <lm>  <tibb~ <tib~    54.5       22    54.8   0.416
#> 6 Africa    Algeria <tib~ <lm>  <tibb~ <tib~    58.0       27    57.6   0.386
#> # ... with 1,698 more rows, and 5 more variables: .resid <dbl>, .hat <dbl>,
#> #   .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
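
The unnested results above still carry the remaining list-columns along. If only one summary is needed, one option is to select() the identifying columns and that summary before unnesting. This is a sketch, assuming the gapminder and broom packages are installed; coefs is an illustrative name.

```r
library(dplyr)
library(tidyr)
library(purrr)

coefs <- gapminder::gapminder %>%
  mutate(year1950 = year - 1950) %>%
  group_nest(continent, country) %>%
  mutate(
    model = map(data, ~ lm(lifeExp ~ year1950, data = .x)),
    tidy  = map(model, broom::tidy)
  ) %>%
  # keep only the keys and the coefficient table before flattening
  select(continent, country, tidy) %>%
  unnest(tidy)

coefs
```

The result has two rows per country (intercept and slope) and no list-columns, which is usually easier to filter and plot.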

A similar case can be found in Section 9.1.

6.3.2 Example: Multiple choice data

multiple_choice <- tibble(method = c(
  "CNNs",
  "Bayesian, Logistic Regression",
  "Data Visualization, Decision Trees",
  "Linear Regression, A/B Testing",
  "Data Visualization, Text Analytics"
))

multiple_choice %>% 
  mutate(method = str_split(method, ",")) %>% 
  unnest(method)
#> mutate: converted 'method' from character to list (0 new NA)
#> # A tibble: 9 x 1
#>   method                
#>   <chr>                 
#> 1 "CNNs"                
#> 2 "Bayesian"            
#> 3 " Logistic Regression"
#> 4 "Data Visualization"  
#> 5 " Decision Trees"     
#> 6 "Linear Regression"   
#> # ... with 3 more rows

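The trick here is that str_split() creates a list-column, which unnest() then flattens into one row per method. Note that splitting on "," alone leaves a leading space on some pieces (for example " Logistic Regression"), so trim them with str_trim() before counting. The more general separate_rows(), which handles this case directly, is covered in Section 6.5.1.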
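
For reference, a minimal sketch of the separate_rows() approach on the same data. The separator ",\\s*" is a choice made for this example so that the whitespace after each comma is consumed; methods_long is an illustrative name.

```r
library(tibble)
library(tidyr)

multiple_choice <- tibble(method = c(
  "CNNs",
  "Bayesian, Logistic Regression",
  "Data Visualization, Decision Trees",
  "Linear Regression, A/B Testing",
  "Data Visualization, Text Analytics"
))

# Split each answer on a comma plus any following whitespace,
# producing one row per method
methods_long <- multiple_choice %>%
  separate_rows(method, sep = ",\\s*")

methods_long
```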

From there, count() gives the frequency of each method, and the most frequently mentioned methods can be plotted.
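
The counting step could be sketched as follows. It reuses the str_split() and unnest() approach from above and adds str_trim() so that, for instance, " Decision Trees" and "Decision Trees" would be counted as the same method; method_counts is an illustrative name, and the plotting step is left out.

```r
library(dplyr)
library(tidyr)
library(stringr)

multiple_choice <- tibble(method = c(
  "CNNs",
  "Bayesian, Logistic Regression",
  "Data Visualization, Decision Trees",
  "Linear Regression, A/B Testing",
  "Data Visualization, Text Analytics"
))

method_counts <- multiple_choice %>%
  mutate(method = str_split(method, ",")) %>%
  unnest(method) %>%
  mutate(method = str_trim(method)) %>%  # drop the space left after each comma
  count(method, sort = TRUE)             # most frequent methods first

method_counts
```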