6.3 Nesting

Nesting creates a list-column of data frames; unnesting flattens it back out into regular columns. Nesting is implicitly a summarising operation: you get one row for each group defined by the non-nested columns. This is useful in conjunction with other summaries that work with whole datasets, most notably models.

A nested data frame is no more than a data frame with one (or more) list-columns of data frames, so you can create simple nested data frames by hand:
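As a minimal sketch of the hand-built case, the block below constructs a small nested tibble. The name df1 and the toy values are hypothetical, chosen only for illustration.

```r
library(tibble)

# Hypothetical toy data: the outer frame has one row per group,
# and the list-column `data` holds one tibble per row.
df1 <- tibble(
  g = c(1, 2, 3),
  data = list(
    tibble(x = 1, y = 2),
    tibble(x = 4:5, y = 6:7),
    tibble(x = 10)
  )
)

df1
```

The inner tibbles do not need the same number of rows, or even the same columns; the outer frame only requires one list element per row.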

More commonly, we create nested data frames with tidyr::nest(). df %>% nest(x, y) specifies the columns to be nested, i.e. the columns that will appear in the inner data frame (since tidyr 1.0.0 the preferred spelling is df %>% nest(data = c(x, y))). Alternatively, you can nest() a grouped data frame created by dplyr::group_by(). The grouping variables remain in the outer data frame and the others are nested. The result preserves the grouping of the input.

Variables supplied to nest() will override grouping variables so that df %>% group_by(x, y) %>% nest(z) will be equivalent to df %>% nest(z).
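The two ways of calling nest() described above can be sketched as follows. The data frame df2 and its columns are hypothetical, and the calls use the tidyr 1.0.0 and later spelling nest(data = c(...)).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with three groups in `g`.
df2 <- tibble(
  g = c(1, 2, 2, 3, 3),
  x = 1:5,
  y = 6:10
)

# Name the columns that go into the inner data frames.
nested_by_cols <- df2 %>% nest(data = c(x, y))

# Or group first: the grouping variable stays in the outer frame
# and every other column is nested.
nested_by_group <- df2 %>% group_by(g) %>% nest()
```

Both results have one row per value of g and a list-column named data; they differ in that the second one is still grouped by g.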

Nesting is easiest to understand in connection with grouped data: each row in the output corresponds to one group in the input. We’ll see shortly that this is particularly convenient when you have other per-group objects.

The opposite of nest() is unnest(). You give it the name of a list-column containing data frames, and it row-binds the data frames together, repeating the outer columns the right number of times to line up.
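A small sketch of the round trip back to a flat data frame. The tibble nested is a hypothetical example, and the call uses the explicit cols argument that tidyr 1.0.0 and later expect.

```r
library(dplyr)
library(tidyr)

# Hypothetical nested input: two groups with 2 and 1 inner rows.
nested <- tibble(
  g = c("a", "b"),
  data = list(
    tibble(x = 1:2, y = c(10, 20)),
    tibble(x = 3L, y = 30)
  )
)

# Row-bind the inner tibbles; `g` is repeated once per inner row.
flat <- nested %>% unnest(cols = data)
```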

dplyr::group_split() puts each group's tibble in a list, similar to base::split():
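For comparison, a brief sketch of both functions on the same hypothetical data frame df3:

```r
library(dplyr)

# Hypothetical toy data with two groups.
df3 <- tibble(g = c("a", "b", "b"), x = 1:3)

pieces_dplyr <- df3 %>% group_split(g)  # unnamed list of tibbles
pieces_base  <- split(df3, df3$g)       # list named by group value
```

The main practical difference is that split() names the list elements after the group values, while group_split() returns an unnamed list.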

6.3.1 Example: Managing multiple models

Nested data is a great fit for problems where you have one of something for each group. A common place this arises is when you’re fitting multiple models.

Here gapminder_nest is a tibble with 142 rows, one per country, with each country's time series from 1952 to 2007 stored in the list-column data. We can then combine mutate() and purrr::map() to add a new column holding a linear model fitted to each country:
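One way this step might look is sketched below. It assumes the gapminder package is installed, and it assumes the predictor year1950 seen in the printed output further down was computed as year - 1950. The name gapminder_models is hypothetical.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(gapminder)  # assumed to be installed; provides the `gapminder` data

# Assumption: `year1950` is years since 1950, inferred from the printed output.
gapminder_nest <- gapminder %>%
  mutate(year1950 = year - 1950) %>%
  group_by(continent, country) %>%
  nest()

# Fit one linear model per country; `.x` is that country's inner tibble.
gapminder_models <- gapminder_nest %>%
  mutate(model = map(data, ~ lm(lifeExp ~ year1950, data = .x)))
```

Keeping the models in a list-column next to the data means each fitted object stays attached to the country it belongs to.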

Then use broom functions to generate “tidy” model summaries:
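A sketch of how gapminder_summary could be built, under the same assumptions as before (gapminder installed, year1950 taken to be year - 1950). The nesting and modelling steps are repeated so the block runs on its own.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(gapminder)  # assumed to be installed

# `year1950` is an assumed name for years since 1950,
# taken from the printed output.
gapminder_summary <- gapminder %>%
  mutate(year1950 = year - 1950) %>%
  group_by(continent, country) %>%
  nest() %>%
  mutate(
    model   = map(data, ~ lm(lifeExp ~ year1950, data = .x)),
    glance  = map(model, broom::glance),  # one-row model-level summary
    tidy    = map(model, broom::tidy),    # one row per coefficient
    augment = map(model, broom::augment)  # one row per observation
  )
```

The broom:: prefix is used because the new columns share their names with the functions, which avoids any ambiguity inside mutate().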

Finally, unnest() each summary column in turn:

# which country has the best fit
gapminder_summary %>% 
  unnest(glance) %>% 
  arrange(desc(r.squared))
#> # A tibble: 142 x 17
#>   continent country data  model r.squared adj.r.squared sigma statistic  p.value
#>   <fct>     <fct>   <lis> <lis>     <dbl>         <dbl> <dbl>     <dbl>    <dbl>
#> 1 Americas  Brazil  <tib~ <lm>      0.998         0.998 0.326     5111. 6.99e-15
#> 2 Africa    Maurit~ <tib~ <lm>      0.998         0.997 0.408     4290. 1.68e-14
#> 3 Europe    France  <tib~ <lm>      0.998         0.997 0.220     4200. 1.86e-14
#> 4 Europe    Switze~ <tib~ <lm>      0.997         0.997 0.215     3823. 2.98e-14
#> 5 Asia      Pakist~ <tib~ <lm>      0.997         0.997 0.403     3626. 3.88e-14
#> 6 Asia      Indone~ <tib~ <lm>      0.997         0.997 0.646     3455. 4.93e-14
#> # ... with 136 more rows, and 8 more variables: df <int>, logLik <dbl>,
#> #   AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>, tidy <list>,
#> #   augment <list>


gapminder_summary %>% 
  unnest(tidy)
#> # A tibble: 284 x 11
#>   continent country data  model glance term  estimate std.error statistic
#>   <fct>     <fct>   <lis> <lis> <list> <chr>    <dbl>     <dbl>     <dbl>
#> 1 Africa    Algeria <tib~ <lm>  <tibb~ (Int~   42.2      0.756      55.8 
#> 2 Africa    Algeria <tib~ <lm>  <tibb~ year~    0.569    0.0221     25.7 
#> 3 Africa    Angola  <tib~ <lm>  <tibb~ (Int~   31.7      0.804      39.4 
#> 4 Africa    Angola  <tib~ <lm>  <tibb~ year~    0.209    0.0235      8.90
#> 5 Africa    Benin   <tib~ <lm>  <tibb~ (Int~   38.9      0.671      58.0 
#> 6 Africa    Benin   <tib~ <lm>  <tibb~ year~    0.334    0.0196     17.0 
#> # ... with 278 more rows, and 2 more variables: p.value <dbl>, augment <list>

gapminder_summary %>% 
  unnest(augment)
#> # A tibble: 1,704 x 15
#>   continent country data  model glance tidy  lifeExp year1950 .fitted .se.fit
#>   <fct>     <fct>   <lis> <lis> <list> <lis>   <dbl>    <dbl>   <dbl>   <dbl>
#> 1 Africa    Algeria <tib~ <lm>  <tibb~ <tib~    43.1        2    43.4   0.718
#> 2 Africa    Algeria <tib~ <lm>  <tibb~ <tib~    45.7        7    46.2   0.627
#> 3 Africa    Algeria <tib~ <lm>  <tibb~ <tib~    48.3       12    49.1   0.544
#> 4 Africa    Algeria <tib~ <lm>  <tibb~ <tib~    51.4       17    51.9   0.472
#> 5 Africa    Algeria <tib~ <lm>  <tibb~ <tib~    54.5       22    54.8   0.416
#> 6 Africa    Algeria <tib~ <lm>  <tibb~ <tib~    58.0       27    57.6   0.386
#> # ... with 1,698 more rows, and 5 more variables: .resid <dbl>, .hat <dbl>,
#> #   .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>

A similar case can be found in Section 9.1.