Chapter 8 Describing data sets

When describing a data set, the goal is often just to get a quick feel for the data. While dplyr’s summarize() is a great weapon for summarizing single variables (and I have already shown you how to do this in the chapter on data wrangling), you need a hassle-free tool to quickly get an overview of an entire data set. For this use case, skimr is my personal weapon of choice. It is not part of the tidyverse, yet super-compatible with it, so we first need to load (and perhaps install) it.

if (!"skimr" %in% installed.packages()[,1]) install.packages("skimr")
library(skimr)
library(tidyverse)

8.1 Basic descriptives

skimr is designed around its main function skim(). This function handles data frames and tibbles as well as vectors. It returns a so-called skim_df, which is basically a tibble with some added columns. Its biggest strength is that it provides descriptives fast and in an easily comprehensible manner. Moreover, you can go on and modify the result using tidyverse functions.

skim(mtcars) # it excels with data frames and tibbles as input
## ── Data Summary ────────────────────────
##                            Values
## Name                       mtcars
## Number of rows             32    
## Number of columns          11    
## _______________________          
## Column type frequency:           
##   numeric                  11    
## ________________________         
## Group variables            None  
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_variable n_missing complete_rate    mean      sd    p0    p25    p50
##  1 mpg                   0             1  20.1     6.03  10.4   15.4   19.2 
##  2 cyl                   0             1   6.19    1.79   4      4      6   
##  3 disp                  0             1 231.    124.    71.1  121.   196.  
##  4 hp                    0             1 147.     68.6   52     96.5  123   
##  5 drat                  0             1   3.60    0.535  2.76   3.08   3.70
##  6 wt                    0             1   3.22    0.978  1.51   2.58   3.32
##  7 qsec                  0             1  17.8     1.79  14.5   16.9   17.7 
##  8 vs                    0             1   0.438   0.504  0      0      0   
##  9 am                    0             1   0.406   0.499  0      0      0   
## 10 gear                  0             1   3.69    0.738  3      3      4   
## 11 carb                  0             1   2.81    1.62   1      2      2   
##       p75   p100 hist 
##  1  22.8   33.9  ▃▇▅▁▂
##  2   8      8    ▆▁▃▁▇
##  3 326    472    ▇▃▃▃▂
##  4 180    335    ▇▇▆▃▁
##  5   3.92   4.93 ▇▃▇▅▁
##  6   3.61   5.42 ▃▃▇▁▂
##  7  18.9   22.9  ▃▇▇▂▁
##  8   1      1    ▇▁▁▁▆
##  9   1      1    ▇▁▁▁▆
## 10   4      5    ▇▁▆▁▂
## 11   4      8    ▇▂▅▁▁
skim(mtcars$mpg) # vectors are fine, too
## ── Data Summary ────────────────────────
##                            Values    
## Name                       mtcars$mpg
## Number of rows             32        
## Number of columns          1         
## _______________________              
## Column type frequency:               
##   numeric                  1         
## ________________________             
## Group variables            None      
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75
## 1 data                  0             1  20.1  6.03  10.4  15.4  19.2  22.8
##    p100 hist 
## 1  33.9 ▃▇▅▁▂
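
To make the point about tidyverse compatibility concrete, here is a minimal sketch. It assumes that the summary columns carry the variable type as a prefix (for example numeric.mean), which is how they appear once a skim_df is converted to a tibble; the object names skimmed and high_mean are made up for illustration.

```r
library(skimr)
library(dplyr)

skimmed <- skim(mtcars)

# A skim_df inherits from the tibble class, so dplyr verbs apply to it
inherits(skimmed, "tbl_df")

# Keep only the variables whose mean exceeds 100 and return their names
high_mean <- skimmed %>%
  filter(numeric.mean > 100) %>%
  pull(skim_variable)

high_mean
```

In mtcars, only disp and hp have a mean above 100, so these are the two names that come back.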

Usually, data sets come with columns in different flavors. As an example, we can look at the IMDb data set. This set contains categorical and numeric variables:

imdb_raw <- read_csv("data/imdb2006-2016.csv")
## Rows: 1000 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Title, Genre, Description, Director, Actors
## dbl (7): Rank, Year, Runtime (Minutes), Rating, Votes, Revenue (Millions), M...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
imdb_raw %>% skim()
## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             1000      
## Number of columns          12        
## _______________________              
## Column type frequency:               
##   character                5         
##   numeric                  7         
## ________________________             
## Group variables            None      
## 
## ── Variable type: character ────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 Title                 0             1     2    61     0      999          0
## 2 Genre                 0             1     5    26     0      207          0
## 3 Description           0             1    42   421     0     1000          0
## 4 Director              0             1     3    32     0      644          0
## 5 Actors                0             1    43    77     0      996          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable      n_missing complete_rate      mean         sd     p0     p25
## 1 Rank                       0         1        500.      289.       1     251. 
## 2 Year                       0         1       2013.        3.21  2006    2010  
## 3 Runtime (Minutes)          0         1        113.       18.8     66     100  
## 4 Rating                     0         1          6.72      0.945    1.9     6.2
## 5 Votes                      0         1     169808.   188763.      61   36309  
## 6 Revenue (Millions)       128         0.872     83.0     103.       0      13.3
## 7 Metascore                 64         0.936     59.0      17.2     11      47  
##        p50      p75     p100 hist 
## 1    500.     750.     1000  ▇▇▇▇▇
## 2   2014     2016      2016  ▃▂▂▃▇
## 3    111      123       191  ▂▇▅▁▁
## 4      6.8      7.4       9  ▁▁▃▇▃
## 5 110799   239910.  1791916  ▇▁▁▁▁
## 6     48.0    114.      937. ▇▁▁▁▁
## 7     59.5     72       100  ▁▅▇▇▂

Of course, the type of variable determines the type of operation that can be performed and skimr smartly distinguishes between variable types.
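
A small sketch of this type handling, using a made-up tibble (mixed and its column names are purely illustrative) with one column per common type:

```r
library(skimr)
library(tibble)

# Hypothetical toy data: one numeric, one factor, one logical, one character column
mixed <- tibble(
  score  = c(1.5, 2.5, NA),
  group  = factor(c("a", "b", "a")),
  passed = c(TRUE, FALSE, TRUE),
  label  = c("x", "y", "z")
)

skimmed_mixed <- skim(mixed)

# skim_type records which set of summary functions was applied to each column
sort(unique(skimmed_mixed$skim_type))
```

Each type gets its own block in the printed output, with measures that make sense for it (for instance, counts of unique values for characters and quantiles for numerics).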

8.1.1 Grouped descriptives

You can also look at descriptives by group. In this case, skim() returns the descriptives for each group in a row-wise fashion. Just call group_by() before passing the tibble to skim().

mtcars %>% 
  group_by(cyl) %>% 
  skim()
## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             32        
## Number of columns          11        
## _______________________              
## Column type frequency:               
##   numeric                  10        
## ________________________             
## Group variables            cyl       
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_variable   cyl n_missing complete_rate    mean     sd     p0    p25
##  1 mpg               4         0             1  26.7    4.51   21.4   22.8 
##  2 mpg               6         0             1  19.7    1.45   17.8   18.6 
##  3 mpg               8         0             1  15.1    2.56   10.4   14.4 
##  4 disp              4         0             1 105.    26.9    71.1   78.8 
##  5 disp              6         0             1 183.    41.6   145    160   
##  6 disp              8         0             1 353.    67.8   276.   302.  
##  7 hp                4         0             1  82.6   20.9    52     65.5 
##  8 hp                6         0             1 122.    24.3   105    110   
##  9 hp                8         0             1 209.    51.0   150    176.  
## 10 drat              4         0             1   4.07   0.365   3.69   3.81
## 11 drat              6         0             1   3.59   0.476   2.76   3.35
## 12 drat              8         0             1   3.23   0.372   2.76   3.07
## 13 wt                4         0             1   2.29   0.570   1.51   1.88
## 14 wt                6         0             1   3.12   0.356   2.62   2.82
## 15 wt                8         0             1   4.00   0.759   3.17   3.53
## 16 qsec              4         0             1  19.1    1.68   16.7   18.6 
## 17 qsec              6         0             1  18.0    1.71   15.5   16.7 
## 18 qsec              8         0             1  16.8    1.20   14.5   16.1 
## 19 vs                4         0             1   0.909  0.302   0      1   
## 20 vs                6         0             1   0.571  0.535   0      0   
## 21 vs                8         0             1   0      0       0      0   
## 22 am                4         0             1   0.727  0.467   0      0.5 
## 23 am                6         0             1   0.429  0.535   0      0   
## 24 am                8         0             1   0.143  0.363   0      0   
## 25 gear              4         0             1   4.09   0.539   3      4   
## 26 gear              6         0             1   3.86   0.690   3      3.5 
## 27 gear              8         0             1   3.29   0.726   3      3   
## 28 carb              4         0             1   1.55   0.522   1      1   
## 29 carb              6         0             1   3.43   1.81    1      2.5 
## 30 carb              8         0             1   3.5    1.56    2      2.25
##       p50    p75   p100 hist 
##  1  26     30.4   33.9  ▇▃▂▃▃
##  2  19.7   21     21.4  ▅▂▂▁▇
##  3  15.2   16.2   19.2  ▂▁▇▃▂
##  4 108    121.   147.   ▇▂▂▆▃
##  5 168.   196.   258    ▇▁▁▂▂
##  6 350.   390    472    ▇▅▃▂▅
##  7  91     96    113    ▃▆▁▇▃
##  8 110    123    175    ▇▃▁▁▂
##  9 192.   241.   335    ▇▂▃▁▁
## 10   4.08   4.16   4.93 ▇▅▃▁▂
## 11   3.9    3.91   3.92 ▂▂▁▂▇
## 12   3.12   3.22   4.22 ▃▇▁▁▁
## 13   2.2    2.62   3.19 ▇▅▇▂▅
## 14   3.22   3.44   3.46 ▅▂▁▂▇
## 15   3.76   4.01   5.42 ▇▇▁▁▃
## 16  18.9   20.0   22.9  ▃▇▇▁▂
## 17  18.3   19.2   20.2  ▃▇▃▃▇
## 18  17.2   17.6   18    ▂▂▁▅▇
## 19   1      1      1    ▁▁▁▁▇
## 20   1      1      1    ▆▁▁▁▇
## 21   0      0      0    ▁▁▇▁▁
## 22   1      1      1    ▃▁▁▁▇
## 23   0      1      1    ▇▁▁▁▆
## 24   0      0      1    ▇▁▁▁▁
## 25   4      4      5    ▁▁▇▁▂
## 26   4      4      5    ▃▁▇▁▂
## 27   3      3      5    ▇▁▁▁▁
## 28   2      2      2    ▇▁▁▁▇
## 29   4      4      6    ▃▁▇▁▂
## 30   3.5    4      8    ▇▇▁▁▁
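
Grouping is not limited to one variable. As a sketch (the object name by_cyl_am is made up), grouping by both cyl and am should yield one row per existing combination of the two:

```r
library(skimr)
library(dplyr)

# One row per combination of cylinder count and transmission type
by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%
  skim(mpg)

nrow(by_cyl_am)
```

Since all six combinations of cyl (4, 6, 8) and am (0, 1) occur in mtcars, the result has six rows.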

8.1.2 Modifying the skim() output

You can also limit the variables that should be described in the call:

mtcars %>% 
  group_by(cyl) %>% 
  skim(mpg, hp, wt)
## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             32        
## Number of columns          11        
## _______________________              
## Column type frequency:               
##   numeric                  3         
## ________________________             
## Group variables            cyl       
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable   cyl n_missing complete_rate   mean     sd     p0    p25    p50
## 1 mpg               4         0             1  26.7   4.51   21.4   22.8   26   
## 2 mpg               6         0             1  19.7   1.45   17.8   18.6   19.7 
## 3 mpg               8         0             1  15.1   2.56   10.4   14.4   15.2 
## 4 hp                4         0             1  82.6  20.9    52     65.5   91   
## 5 hp                6         0             1 122.   24.3   105    110    110   
## 6 hp                8         0             1 209.   51.0   150    176.   192.  
## 7 wt                4         0             1   2.29  0.570   1.51   1.88   2.2 
## 8 wt                6         0             1   3.12  0.356   2.62   2.82   3.22
## 9 wt                8         0             1   4.00  0.759   3.17   3.53   3.76
##      p75   p100 hist 
## 1  30.4   33.9  ▇▃▂▃▃
## 2  21     21.4  ▅▂▂▁▇
## 3  16.2   19.2  ▂▁▇▃▂
## 4  96    113    ▃▆▁▇▃
## 5 123    175    ▇▃▁▁▂
## 6 241.   335    ▇▂▃▁▁
## 7   2.62   3.19 ▇▅▇▂▅
## 8   3.44   3.46 ▅▂▁▂▇
## 9   4.01   5.42 ▇▇▁▁▃
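
The variable selection in skim() also accepts tidyselect helpers, which saves typing in wide data sets. A minimal sketch (d_vars is a made-up name):

```r
library(skimr)
library(dplyr) # provides starts_with() and friends

# Describe only the columns whose names start with "d"
d_vars <- skim(mtcars, starts_with("d"))

d_vars$skim_variable
```

In mtcars, this picks up disp and drat.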

When you transform the skim_df into a normal tibble, you can see that additional columns are added:

mtcars %>% 
  group_by(cyl) %>% 
  skim(mpg, hp, wt) %>% 
  as_tibble()
## # A tibble: 9 × 13
##   skim_type skim_variable   cyl n_missing complete_rate numeric.mean numeric.sd
##   <chr>     <chr>         <dbl>     <int>         <dbl>        <dbl>      <dbl>
## 1 numeric   mpg               4         0             1        26.7       4.51 
## 2 numeric   mpg               6         0             1        19.7       1.45 
## 3 numeric   mpg               8         0             1        15.1       2.56 
## 4 numeric   hp                4         0             1        82.6      20.9  
## 5 numeric   hp                6         0             1       122.       24.3  
## 6 numeric   hp                8         0             1       209.       51.0  
## 7 numeric   wt                4         0             1         2.29      0.570
## 8 numeric   wt                6         0             1         3.12      0.356
## 9 numeric   wt                8         0             1         4.00      0.759
## # … with 6 more variables: numeric.p0 <dbl>, numeric.p25 <dbl>,
## #   numeric.p50 <dbl>, numeric.p75 <dbl>, numeric.p100 <dbl>,
## #   numeric.hist <chr>

Those are: skim_type, which denotes the type of variable (for example “character”, “factor”, or “numeric”), and skim_variable, which contains the name of the variable that is summarized. If there are grouping variables, those are included under their original name (in this case, the number of cylinders, “cyl”).

Therefore, normal dplyr syntax works in a pipeline with a skim_df:

mtcars %>% 
  group_by(cyl) %>% 
  skim(mpg, hp, wt) %>%
  select(skim_type:numeric.sd)
## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             32        
## Number of columns          11        
## _______________________              
## Column type frequency:               
##   numeric                  3         
## ________________________             
## Group variables            cyl       
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable   cyl n_missing complete_rate   mean     sd
## 1 mpg               4         0             1  26.7   4.51 
## 2 mpg               6         0             1  19.7   1.45 
## 3 mpg               8         0             1  15.1   2.56 
## 4 hp                4         0             1  82.6  20.9  
## 5 hp                6         0             1 122.   24.3  
## 6 hp                8         0             1 209.   51.0  
## 7 wt                4         0             1   2.29  0.570
## 8 wt                6         0             1   3.12  0.356
## 9 wt                8         0             1   4.00  0.759

There are also two handy shortcuts for separating the different types of data. partition() takes a skim_df object and returns a list containing one tibble per variable type. yank() extracts the summary for a single variable type that you name upfront.

imdb_raw %>% 
  skim() %>% 
  partition() 
## $character
## 
## ── Variable type: character ────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 Title                 0             1     2    61     0      999          0
## 2 Genre                 0             1     5    26     0      207          0
## 3 Description           0             1    42   421     0     1000          0
## 4 Director              0             1     3    32     0      644          0
## 5 Actors                0             1    43    77     0      996          0
## 
## $numeric
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable      n_missing complete_rate      mean         sd     p0     p25
## 1 Rank                       0         1        500.      289.       1     251. 
## 2 Year                       0         1       2013.        3.21  2006    2010  
## 3 Runtime (Minutes)          0         1        113.       18.8     66     100  
## 4 Rating                     0         1          6.72      0.945    1.9     6.2
## 5 Votes                      0         1     169808.   188763.      61   36309  
## 6 Revenue (Millions)       128         0.872     83.0     103.       0      13.3
## 7 Metascore                 64         0.936     59.0      17.2     11      47  
##        p50      p75     p100 hist 
## 1    500.     750.     1000  ▇▇▇▇▇
## 2   2014     2016      2016  ▃▂▂▃▇
## 3    111      123       191  ▂▇▅▁▁
## 4      6.8      7.4       9  ▁▁▃▇▃
## 5 110799   239910.  1791916  ▇▁▁▁▁
## 6     48.0    114.      937. ▇▁▁▁▁
## 7     59.5     72       100  ▁▅▇▇▂
imdb_raw %>% 
  skim() %>% 
  yank("numeric")
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable      n_missing complete_rate      mean         sd     p0     p25
## 1 Rank                       0         1        500.      289.       1     251. 
## 2 Year                       0         1       2013.        3.21  2006    2010  
## 3 Runtime (Minutes)          0         1        113.       18.8     66     100  
## 4 Rating                     0         1          6.72      0.945    1.9     6.2
## 5 Votes                      0         1     169808.   188763.      61   36309  
## 6 Revenue (Millions)       128         0.872     83.0     103.       0      13.3
## 7 Metascore                 64         0.936     59.0      17.2     11      47  
##        p50      p75     p100 hist 
## 1    500.     750.     1000  ▇▇▇▇▇
## 2   2014     2016      2016  ▃▂▂▃▇
## 3    111      123       191  ▂▇▅▁▁
## 4      6.8      7.4       9  ▁▁▃▇▃
## 5 110799   239910.  1791916  ▇▁▁▁▁
## 6     48.0    114.      937. ▇▁▁▁▁
## 7     59.5     72       100  ▁▅▇▇▂
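
Because partition() returns a named list, its elements can be accessed like those of any other list. The following sketch uses the built-in iris data so that it runs without the IMDb file; parts and factor_part are made-up names.

```r
library(skimr)
library(dplyr)

parts <- iris %>%
  skim() %>%
  partition()

# One list element per variable type present in the data
names(parts)

# yank() returns the same sub-table as the matching list element
factor_part <- iris %>%
  skim() %>%
  yank("factor")

names(factor_part)
```

Note that in the partitioned tables the type prefix is dropped from the column names, as the printed output above also suggests.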

8.1.3 Further descriptives

Note that skimr’s default output covers only a limited number of measures. If you want to add further descriptives, you can use dplyr::summarize() combined with the across() function to compute them. You then reshape the resulting one-row tibble with tidyr::pivot_longer() so that each variable gets its own row and, finally, join it with the skim() output. For the IMDb data set and the measures median and variance, this looks as follows:

imdb_raw %>% 
  summarize(across(where(is.numeric), 
                   list(
                     median = ~median(.x, na.rm = TRUE),
                     variance = ~var(.x, na.rm = TRUE)
  ))) %>% 
  pivot_longer(
    cols = everything(),
    names_to = c("skim_variable", ".value"),
    names_sep = "_"
  ) %>% 
  right_join(imdb_raw %>% skim() %>% yank("numeric"))
## Joining, by = "skim_variable"
## # A tibble: 7 × 13
##   skim_variable    median variance n_missing complete_rate   mean      sd     p0
##   <chr>             <dbl>    <dbl>     <int>         <dbl>  <dbl>   <dbl>  <dbl>
## 1 Rank             5.00e2 8.34e+ 4         0         1     5.00e2 2.89e+2    1  
## 2 Year             2.01e3 1.03e+ 1         0         1     2.01e3 3.21e+0 2006  
## 3 Runtime (Minut…  1.11e2 3.54e+ 2         0         1     1.13e2 1.88e+1   66  
## 4 Rating           6.8 e0 8.94e- 1         0         1     6.72e0 9.45e-1    1.9
## 5 Votes            1.11e5 3.56e+10         0         1     1.70e5 1.89e+5   61  
## 6 Revenue (Milli…  4.80e1 1.07e+ 4       128         0.872 8.30e1 1.03e+2    0  
## 7 Metascore        5.95e1 2.96e+ 2        64         0.936 5.90e1 1.72e+1   11  
## # … with 5 more variables: p25 <dbl>, p50 <dbl>, p75 <dbl>, p100 <dbl>,
## #   hist <chr>
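
An alternative to the summarize-and-join route is to define a custom skimming function with skimr's skim_with() and sfl(). The sketch below is only one way to set this up; my_skim and extended are made-up names, and it uses mtcars so that it runs without the IMDb file.

```r
library(skimr)

# Build a skim function that appends median and variance to the numeric defaults
my_skim <- skim_with(
  numeric = sfl(
    median   = ~ median(., na.rm = TRUE),
    variance = ~ var(., na.rm = TRUE)
  ),
  append = TRUE # keep the default measures and add the new ones
)

extended <- my_skim(mtcars)

# The new measures show up as additional, type-prefixed columns
names(extended)
```

The advantage of this approach is that the extra measures live in the same skim_df as the defaults, so no reshaping or joining is needed.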

8.2 Communicating results

By default, when used in RMarkdown documents, skimr outputs quite decent-looking tables like this one:

imdb_raw %>% 
  skim() %>% 
  yank("numeric")

Variable type: numeric

skim_variable       n_missing  complete_rate       mean         sd      p0       p25        p50        p75        p100  hist
Rank                        0           1.00     500.50     288.82     1.0    250.75     500.50     750.25     1000.00  ▇▇▇▇▇
Year                        0           1.00    2012.78       3.21  2006.0   2010.00    2014.00    2016.00     2016.00  ▃▂▂▃▇
Runtime (Minutes)           0           1.00     113.17      18.81    66.0    100.00     111.00     123.00      191.00  ▂▇▅▁▁
Rating                      0           1.00       6.72       0.95     1.9      6.20       6.80       7.40        9.00  ▁▁▃▇▃
Votes                       0           1.00  169808.26  188762.65    61.0  36309.00  110799.00  239909.75  1791916.00  ▇▁▁▁▁
Revenue (Millions)        128           0.87      82.96     103.25     0.0     13.27      47.98     113.72      936.63  ▇▁▁▁▁
Metascore                  64           0.94      58.99      17.19    11.0     47.00      59.50      72.00      100.00  ▁▅▇▇▂

As you may have noticed, the output looked different before. This is because, in the RMarkdown chunks above, I set the chunk option render = knitr::normal_print so that the raw console output is printed instead of the decent-looking table.
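
If you want this behavior for every chunk rather than for individual ones, the option can also be set globally. This is a sketch of one way to do it, typically placed in a setup chunk at the top of the document:

```r
library(knitr)

# Print objects as they would appear in the console, for all following chunks
opts_chunk$set(render = normal_print)
```

Individual chunks can still override the global setting in their own chunk header.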

8.2.1 Tables with knitr::kable() and kableExtra

The problem with the default output is its lack of modifiability. The column names, for instance, suffice for small reports but are inappropriate for something you want to hand in or even publish. In RMarkdown, the proper tool for modifying and printing tables is the knitr::kable() function, extended by the kableExtra package. In the following, I provide a brief and coarse introduction to the package. It is so coarse that you will basically have to work through the vignettes yourself: for tables, everything in them is relevant, and I am not willing to copy them in their entirety. So, hit the internet and look at the kableExtra vignettes for HTML and PDF output.

I will show what a table with descriptives can look like in an HTML document. I will also describe how you can get it into a Word or \(\LaTeX\) document so that you are not limited to RMarkdown when reporting results.

The kableExtra package needs to be loaded (or even installed) first:

if (!"kableExtra" %in% installed.packages()[,1]) install.packages("kableExtra")
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows

The basic command is kable(). It takes a tibble (or data frame) and, in an HTML document, outputs it as an HTML table, which kableExtra can then style in the so-called Twitter Bootstrap look.

mtcars %>% kable()
                      mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4            21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag        21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Datsun 710           22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
Hornet 4 Drive       21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
Hornet Sportabout    18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
Valiant              18.1    6  225.0  105  2.76  3.460  20.22   1   0     3     1
Duster 360           14.3    8  360.0  245  3.21  3.570  15.84   0   0     3     4
Merc 240D            24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
Merc 230             22.8    4  140.8   95  3.92  3.150  22.90   1   0     4     2
Merc 280             19.2    6  167.6  123  3.92  3.440  18.30   1   0     4     4
Merc 280C            17.8    6  167.6  123  3.92  3.440  18.90   1   0     4     4
Merc 450SE           16.4    8  275.8  180  3.07  4.070  17.40   0   0     3     3
Merc 450SL           17.3    8  275.8  180  3.07  3.730  17.60   0   0     3     3
Merc 450SLC          15.2    8  275.8  180  3.07  3.780  18.00   0   0     3     3
Cadillac Fleetwood   10.4    8  472.0  205  2.93  5.250  17.98   0   0     3     4
Lincoln Continental  10.4    8  460.0  215  3.00  5.424  17.82   0   0     3     4
Chrysler Imperial    14.7    8  440.0  230  3.23  5.345  17.42   0   0     3     4
Fiat 128             32.4    4   78.7   66  4.08  2.200  19.47   1   1     4     1
Honda Civic          30.4    4   75.7   52  4.93  1.615  18.52   1   1     4     2
Toyota Corolla       33.9    4   71.1   65  4.22  1.835  19.90   1   1     4     1
Toyota Corona        21.5    4  120.1   97  3.70  2.465  20.01   1   0     3     1
Dodge Challenger     15.5    8  318.0  150  2.76  3.520  16.87   0   0     3     2
AMC Javelin          15.2    8  304.0  150  3.15  3.435  17.30   0   0     3     2
Camaro Z28           13.3    8  350.0  245  3.73  3.840  15.41   0   0     3     4
Pontiac Firebird     19.2    8  400.0  175  3.08  3.845  17.05   0   0     3     2
Fiat X1-9            27.3    4   79.0   66  4.08  1.935  18.90   1   1     4     1
Porsche 914-2        26.0    4  120.3   91  4.43  2.140  16.70   0   1     5     2
Lotus Europa         30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2
Ford Pantera L       15.8    8  351.0  264  4.22  3.170  14.50   0   1     5     4
Ferrari Dino         19.7    6  145.0  175  3.62  2.770  15.50   0   1     5     6
Maserati Bora        15.0    8  301.0  335  3.54  3.570  14.60   0   1     5     8
Volvo 142E           21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2
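
kable() itself already takes a few useful arguments before any kableExtra function comes into play. A minimal sketch, with the output format set explicitly to "html" so that the result does not depend on where the code is run (html_table is a made-up name):

```r
library(knitr)

# Round to one digit and add a caption; row names are kept by default
html_table <- kable(
  head(mtcars, 3),
  format  = "html",
  digits  = 1,
  caption = "First three rows of mtcars"
)

html_table
```

Inside an RMarkdown document that is knitted to HTML, the format argument can usually be left out.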

8.2.2 Use case: reporting skimr results

8.2.2.1 Basics for HTML and PDF

First, I need to get the results in a proper tibble. I use the mtcars data set.

desc_mtcars <- skim(mtcars) %>% as_tibble()

Next, I select and rename the variables I deem interesting. To save myself some time, I choose variables based on their position:

for_table <- desc_mtcars %>% 
  select(Variable = 2,
         Mean = 5,
         SD = 6,
         Minimum = 7,
         Maximum = 11) %>% 
  mutate(across(where(is.numeric), ~round(., 1))) # round the output

Now I have a tibble that, in theory, can simply be printed as a table (you might wish to change the names of the described variables, but let’s save this for another time; if you want to try, case_when() may be your best shot). I proceed with the kable() call. Adding kable_styling() lets you make some general specifications, column_spec() lets you customize individual columns, and footnote() lets you customize footnotes. These “kables” can be customized much further with other functions you will find in the kableExtra vignettes.

for_table %>% 
  kable(caption = "Example for a table with some descriptives") %>% 
  kable_styling(
    bootstrap_options = "striped", # several design options
    full_width = FALSE, # defaults to TRUE
    position = "center", # where is it positioned?
    fixed_thead = TRUE # whether header is fixed when scrolling through -- only for longer tables
  ) %>% 
  column_spec(1, bold = TRUE, border_right = TRUE) %>% # column specifications can be easily modified
  footnote(general = "You can add some footnotes with certain signs, too.", # this is how you add a footnote
           number = c("Footnote 1; ", "Footnote 2; "),
           alphabet = c("Footnote A; ", "Footnote B; "),
           symbol = c("Footnote Symbol 1; ", "Footnote Symbol 2"),
           general_title = "General: ", number_title = "Type I: ",
           alphabet_title = "Type II: ", symbol_title = "Type III: ",
           footnote_as_chunk = TRUE, title_format = c("italic", "underline")
           )
Table 8.1: Example for a table with some descriptives
Variable   Mean     SD  Minimum  Maximum
mpg        20.1    6.0     10.4     33.9
cyl         6.2    1.8      4.0      8.0
disp      230.7  123.9     71.1    472.0
hp        146.7   68.6     52.0    335.0
drat        3.6    0.5      2.8      4.9
wt          3.2    1.0      1.5      5.4
qsec       17.8    1.8     14.5     22.9
vs          0.4    0.5      0.0      1.0
am          0.4    0.5      0.0      1.0
gear        3.7    0.7      3.0      5.0
carb        2.8    1.6      1.0      8.0
General: You can add some footnotes with certain signs, too.
Type I: 1 Footnote 1; 2 Footnote 2;
Type II: a Footnote A; b Footnote B;
Type III: * Footnote Symbol 1; Footnote Symbol 2
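
For the renaming of the described variables mentioned above, a case_when() call could look like the following sketch. The tibble for_table_demo and the chosen labels are made up for illustration; the labels follow the descriptions in the mtcars help page.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the table of descriptives
for_table_demo <- tibble(
  Variable = c("mpg", "hp", "wt", "qsec"),
  Mean     = c(20.1, 146.7, 3.2, 17.8)
)

labelled <- for_table_demo %>%
  mutate(Variable = case_when(
    Variable == "mpg" ~ "Miles per gallon",
    Variable == "hp"  ~ "Gross horsepower",
    Variable == "wt"  ~ "Weight (1000 lbs)",
    TRUE              ~ Variable # leave all other names untouched
  ))

labelled
```

The final TRUE branch matters: without it, every variable that is not matched explicitly would end up as NA.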

For \(\LaTeX\) tables this would generally look the same. Depending on whether you knit the RMarkdown file to PDF or HTML, the output will change. You also have some different specification options for \(\LaTeX\) output.
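
As a rough sketch of the \(\LaTeX\) side, the format can be requested explicitly; booktabs = TRUE is one of the \(\LaTeX\)-specific options and switches to the rules of the booktabs package (latex_table is a made-up name):

```r
library(knitr)

# Produce LaTeX code for the table instead of HTML
latex_table <- kable(
  head(mtcars, 3),
  format   = "latex",
  booktabs = TRUE,
  digits   = 1
)

latex_table
```

When knitting to PDF, the document needs to load the booktabs \(\LaTeX\) package for this to compile; loading kableExtra in the document typically takes care of that.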

8.2.2.2 Getting it into a Word file

Unfortunately, you cannot yet output the tables produced with kable() into a proper Word file. To achieve this, you can use flextable instead and then export the tables in different formats. flextable is about as capable as kable(). An introduction can be found in the package’s online documentation.

A coarse example follows.

First, it needs to be loaded.

if (!"flextable" %in% installed.packages()[,1]) install.packages("flextable")
library(flextable)
## 
## Attaching package: 'flextable'
## The following objects are masked from 'package:kableExtra':
## 
##     as_image, footnote
## The following object is masked from 'package:purrr':
## 
##     compose

Then, I can produce the table and export it to DOCX or PPTX format using the save_as_*() functions.

for_table %>% 
  flextable() %>% 
  save_as_docx(path = "example_table_docx.docx") # path is the output file

for_table %>% 
  flextable() %>% 
  save_as_pptx(path = "example_table_pptx.pptx") # path is the output file
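
Before exporting, a flextable can be adjusted in much the same spirit as a kable. The following sketch uses a small made-up data frame (demo) and writes to a temporary file so that nothing in your working directory is overwritten; the header labels are arbitrary.

```r
library(flextable)

# Hypothetical stand-in for the table of descriptives
demo <- data.frame(
  Variable = c("mpg", "hp"),
  Mean     = c(20.09, 146.69),
  SD       = c(6.03, 68.56)
)

ft <- flextable(demo)
ft <- set_header_labels(ft, Variable = "Measure", Mean = "M") # relabel the header
ft <- colformat_double(ft, digits = 1)                        # one decimal place
ft <- autofit(ft)                                             # fit column widths

out_path <- tempfile(fileext = ".docx")
save_as_docx(ft, path = out_path)

file.exists(out_path)
```

Replace out_path with a regular file name once you are happy with the result.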

Then, further manipulations can be performed in MS Word or PowerPoint (or LibreOffice or whatever).

Also, if you have a working version of Word or PowerPoint on your machine, you can use the following commands to open a preview of the table directly in the respective program:

for_table %>% 
  flextable() %>% 
  print(preview = "docx")

for_table %>% 
  flextable() %>% 
  print(preview = "pptx")