Chapter 2 Everyday data wrangling

Suppose there exists a test dataset and the user is tasked with proving their proficiency in exploratory data analysis (EDA). Alternatively, the user may be handed a piece of data (e.g., maybe an Excel sheet) and is asked to run some preliminary analysis. Prior to jumping straight into EDA and statistical modeling, it’s a good idea to transform the original data into a desirable format. This includes, but is not limited to, missing data imputation, scalar transformation, feature selection, group aggregation, and data filtering. The following notebook outlines the fundamentals of data wrangling, using practical base R and tidyverse solutions interchangeably.

library(tidyverse)

2.1 Renaming column headers

The first few steps in data analysis is to gain intuition around the data at hand - typically, a m x n dataset is structured in a way such that there are m records (also referred to as instances or samples) and m features (also referred to as predictors). Getting a good sense and understanding of the features we’re working with often requires cleaning the data before jumping into things like feature selection and feature engineering.

Suppose we have a 32 x 11 dataset mtcars where rownames correspond to the car model:

data(mtcars)
mtcars <- mtcars %>% as_tibble(rownames = 'CAR')

There are 11 columns here and since this is a toy dataset, it is already neat and tidy. However, sometimes we may need to rename the columns to trim specific strings or to make downstream analysis easier. Renaming column headers with dplyr::rename() is as simple as rename(new_name = old_name), but with rename_with() there’s more flexibility:

mtcars %>% rename_with(toupper)
## # A tibble: 32 × 12
##    CAR      MPG   CYL  DISP    HP  DRAT    WT  QSEC    VS    AM
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda…  21       6  160    110  3.9   2.62  16.5     0     1
##  2 Mazda…  21       6  160    110  3.9   2.88  17.0     0     1
##  3 Datsu…  22.8     4  108     93  3.85  2.32  18.6     1     1
##  4 Horne…  21.4     6  258    110  3.08  3.22  19.4     1     0
##  5 Horne…  18.7     8  360    175  3.15  3.44  17.0     0     0
##  6 Valia…  18.1     6  225    105  2.76  3.46  20.2     1     0
##  7 Duste…  14.3     8  360    245  3.21  3.57  15.8     0     0
##  8 Merc …  24.4     4  147.    62  3.69  3.19  20       1     0
##  9 Merc …  22.8     4  141.    95  3.92  3.15  22.9     1     0
## 10 Merc …  19.2     6  168.   123  3.92  3.44  18.3     1     0
## # ℹ 22 more rows
## # ℹ 2 more variables: GEAR <dbl>, CARB <dbl>

You can pass any function within rename_with() to alter the column headers and use the .cols = argument to define which columns you’d apply the function to:

mtcars %>% rename_with(function(x) paste0(x, "_1"), .cols = -1)
## # A tibble: 32 × 12
##    CAR       mpg_1 cyl_1 disp_1  hp_1 drat_1  wt_1 qsec_1  vs_1
##    <chr>     <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
##  1 Mazda RX4  21       6   160    110   3.9   2.62   16.5     0
##  2 Mazda RX…  21       6   160    110   3.9   2.88   17.0     0
##  3 Datsun 7…  22.8     4   108     93   3.85  2.32   18.6     1
##  4 Hornet 4…  21.4     6   258    110   3.08  3.22   19.4     1
##  5 Hornet S…  18.7     8   360    175   3.15  3.44   17.0     0
##  6 Valiant    18.1     6   225    105   2.76  3.46   20.2     1
##  7 Duster 3…  14.3     8   360    245   3.21  3.57   15.8     0
##  8 Merc 240D  24.4     4   147.    62   3.69  3.19   20       1
##  9 Merc 230   22.8     4   141.    95   3.92  3.15   22.9     1
## 10 Merc 280   19.2     6   168.   123   3.92  3.44   18.3     1
## # ℹ 22 more rows
## # ℹ 3 more variables: am_1 <dbl>, gear_1 <dbl>, carb_1 <dbl>

Alternatively with base R, you can call the names attribute and alter that directly:

names(mtcars)[-1] <- paste0(names(mtcars[-1]), '_1') # not run

2.2 Grouped operations

Grouped operations are most useful when you’re working with data where there is a categorical variable. Grouped summaries are powerful tools easily done with dplyr::summarise():

mtcars %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg))
## # A tibble: 3 × 2
##     cyl avg_mpg
##   <dbl>   <dbl>
## 1     4    26.7
## 2     6    19.7
## 3     8    15.1

Using multiple columns at once and renaming the output columns:

mtcars %>% group_by(cyl) %>% 
  summarise(across(c(mpg, disp, hp), mean, .names = "mean_{col}"))
## # A tibble: 3 × 4
##     cyl mean_mpg mean_disp mean_hp
##   <dbl>    <dbl>     <dbl>   <dbl>
## 1     4     26.7      105.    82.6
## 2     6     19.7      183.   122. 
## 3     8     15.1      353.   209.
mtcars %>% group_by(cyl) %>% 
  summarise(across(where(is.numeric), mean, .names = "mean_{col}"))
## # A tibble: 3 × 11
##     cyl mean_mpg mean_disp mean_hp mean_drat mean_wt mean_qsec
##   <dbl>    <dbl>     <dbl>   <dbl>     <dbl>   <dbl>     <dbl>
## 1     4     26.7      105.    82.6      4.07    2.29      19.1
## 2     6     19.7      183.   122.       3.59    3.12      18.0
## 3     8     15.1      353.   209.       3.23    4.00      16.8
## # ℹ 4 more variables: mean_vs <dbl>, mean_am <dbl>,
## #   mean_gear <dbl>, mean_carb <dbl>

Alternatively, in base R, you can use aggregate() with the ~; you can read this as grouping mpg by cyl where by corresponds to the ~:

aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
##   cyl      mpg
## 1   4 26.66364
## 2   6 19.74286
## 3   8 15.10000

Since aggregate() uses the formula syntax (~), using the . placeholder on the LHS calls all variables not called in the RHS:

aggregate(. ~ cyl, data = mtcars, FUN = mean)

When calculating the size of the groups, you can use n():

mtcars %>% group_by(cyl) %>% summarise(count = n())
## # A tibble: 3 × 2
##     cyl count
##   <dbl> <int>
## 1     4    11
## 2     6     7
## 3     8    14

The n() function is useful when filtering by groups of size greater than a certain value:

mtcars %>% group_by(cyl) %>% filter(n() > 7) %>% ungroup()
## # A tibble: 25 × 12
##    CAR      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Datsu…  22.8     4  108     93  3.85  2.32  18.6     1     1
##  2 Horne…  18.7     8  360    175  3.15  3.44  17.0     0     0
##  3 Duste…  14.3     8  360    245  3.21  3.57  15.8     0     0
##  4 Merc …  24.4     4  147.    62  3.69  3.19  20       1     0
##  5 Merc …  22.8     4  141.    95  3.92  3.15  22.9     1     0
##  6 Merc …  16.4     8  276.   180  3.07  4.07  17.4     0     0
##  7 Merc …  17.3     8  276.   180  3.07  3.73  17.6     0     0
##  8 Merc …  15.2     8  276.   180  3.07  3.78  18       0     0
##  9 Cadil…  10.4     8  472    205  2.93  5.25  18.0     0     0
## 10 Linco…  10.4     8  460    215  3     5.42  17.8     0     0
## # ℹ 15 more rows
## # ℹ 2 more variables: gear <dbl>, carb <dbl>

Using n() gives us access to finding the group sizes using tally() or count():

mtcars %>% group_by(cyl) %>% tally()
## # A tibble: 3 × 2
##     cyl     n
##   <dbl> <int>
## 1     4    11
## 2     6     7
## 3     8    14

We can label encode (i.e., give a unique integer identifier for the current groups) using cur_group_id():

mtcars %>% group_by(cyl) %>%
  mutate(ID = cur_group_id()) %>% select(ID, cyl)
## # A tibble: 32 × 2
## # Groups:   cyl [3]
##       ID   cyl
##    <int> <dbl>
##  1     2     6
##  2     2     6
##  3     1     4
##  4     2     6
##  5     3     8
##  6     2     6
##  7     3     8
##  8     1     4
##  9     1     4
## 10     2     6
## # ℹ 22 more rows

2.3 Data transformation

Sometimes it’s useful to transform numeric columns; this is particularly helpful when performing feature transformations (e.g., logarithmic, absolute value) or creating new columns entirely from preexisting columns.

mtcars %>% mutate(mpg = mpg*2)
## # A tibble: 32 × 12
##    CAR      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda…  42       6  160    110  3.9   2.62  16.5     0     1
##  2 Mazda…  42       6  160    110  3.9   2.88  17.0     0     1
##  3 Datsu…  45.6     4  108     93  3.85  2.32  18.6     1     1
##  4 Horne…  42.8     6  258    110  3.08  3.22  19.4     1     0
##  5 Horne…  37.4     8  360    175  3.15  3.44  17.0     0     0
##  6 Valia…  36.2     6  225    105  2.76  3.46  20.2     1     0
##  7 Duste…  28.6     8  360    245  3.21  3.57  15.8     0     0
##  8 Merc …  48.8     4  147.    62  3.69  3.19  20       1     0
##  9 Merc …  45.6     4  141.    95  3.92  3.15  22.9     1     0
## 10 Merc …  38.4     6  168.   123  3.92  3.44  18.3     1     0
## # ℹ 22 more rows
## # ℹ 2 more variables: gear <dbl>, carb <dbl>

Using across() allows you to specifically select which columns you’d like to apply a given function on. The first part of the across() argument should therefore specify the columns of interest. Here, where() is particularly useful as it returns indices (kind of like grep()) after evaluating a logical expression, such as is.numeric(). The second part of across() takes the function you’d like to apply, while the optional .names argument allows you to create new columns rather than overwrite the old ones.

mtcars %>% 
  mutate(across(where(is.numeric), function(x) x*2, .names = "double_{col}"))
## # A tibble: 32 × 23
##    CAR      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda…  21       6  160    110  3.9   2.62  16.5     0     1
##  2 Mazda…  21       6  160    110  3.9   2.88  17.0     0     1
##  3 Datsu…  22.8     4  108     93  3.85  2.32  18.6     1     1
##  4 Horne…  21.4     6  258    110  3.08  3.22  19.4     1     0
##  5 Horne…  18.7     8  360    175  3.15  3.44  17.0     0     0
##  6 Valia…  18.1     6  225    105  2.76  3.46  20.2     1     0
##  7 Duste…  14.3     8  360    245  3.21  3.57  15.8     0     0
##  8 Merc …  24.4     4  147.    62  3.69  3.19  20       1     0
##  9 Merc …  22.8     4  141.    95  3.92  3.15  22.9     1     0
## 10 Merc …  19.2     6  168.   123  3.92  3.44  18.3     1     0
## # ℹ 22 more rows
## # ℹ 13 more variables: gear <dbl>, carb <dbl>,
## #   double_mpg <dbl>, double_cyl <dbl>, double_disp <dbl>,
## #   double_hp <dbl>, double_drat <dbl>, double_wt <dbl>,
## #   double_qsec <dbl>, double_vs <dbl>, double_am <dbl>,
## #   double_gear <dbl>, double_carb <dbl>
mtcars %>%
  mutate(across(where(is.character),
                function(x) gsub('[a-z]$', '', x))) %>%
  select(CAR)
## # A tibble: 32 × 1
##    CAR             
##    <chr>           
##  1 Mazda RX4       
##  2 Mazda RX4 Wa    
##  3 Datsun 710      
##  4 Hornet 4 Driv   
##  5 Hornet Sportabou
##  6 Valian          
##  7 Duster 360      
##  8 Merc 240D       
##  9 Merc 230        
## 10 Merc 280        
## # ℹ 22 more rows

Conditional mutate() can also be done using mutate_if(), though the _if() functions, along with variants _at() and _all() have been effectively replaced by across(), as seen above.

You can overwrite the existing columns completely; also keep in mind that you can pass any function within across() as such:

mtcars %>% mutate(across(c(mpg, disp), function(x) x*3))
## # A tibble: 32 × 12
##    CAR      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda…  63       6  480    110  3.9   2.62  16.5     0     1
##  2 Mazda…  63       6  480    110  3.9   2.88  17.0     0     1
##  3 Datsu…  68.4     4  324     93  3.85  2.32  18.6     1     1
##  4 Horne…  64.2     6  774    110  3.08  3.22  19.4     1     0
##  5 Horne…  56.1     8 1080    175  3.15  3.44  17.0     0     0
##  6 Valia…  54.3     6  675    105  2.76  3.46  20.2     1     0
##  7 Duste…  42.9     8 1080    245  3.21  3.57  15.8     0     0
##  8 Merc …  73.2     4  440.    62  3.69  3.19  20       1     0
##  9 Merc …  68.4     4  422.    95  3.92  3.15  22.9     1     0
## 10 Merc …  57.6     6  503.   123  3.92  3.44  18.3     1     0
## # ℹ 22 more rows
## # ℹ 2 more variables: gear <dbl>, carb <dbl>

Iterating across the columns using purrr::modify() instead:

mtcars %>% modify_if(is.numeric, ~ .x*3)
## # A tibble: 32 × 12
##    CAR      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda…  63      18  480    330 11.7   7.86  49.4     0     3
##  2 Mazda…  63      18  480    330 11.7   8.62  51.1     0     3
##  3 Datsu…  68.4    12  324    279 11.6   6.96  55.8     3     3
##  4 Horne…  64.2    18  774    330  9.24  9.64  58.3     3     0
##  5 Horne…  56.1    24 1080    525  9.45 10.3   51.1     0     0
##  6 Valia…  54.3    18  675    315  8.28 10.4   60.7     3     0
##  7 Duste…  42.9    24 1080    735  9.63 10.7   47.5     0     0
##  8 Merc …  73.2    12  440.   186 11.1   9.57  60       3     0
##  9 Merc …  68.4    12  422.   285 11.8   9.45  68.7     3     0
## 10 Merc …  57.6    18  503.   369 11.8  10.3   54.9     3     0
## # ℹ 22 more rows
## # ℹ 2 more variables: gear <dbl>, carb <dbl>

Mutating columns of characters is also straightforward; this is helpful when working with messy columns that require cleaning.

mtcars %>% mutate(CAR = gsub('Merc', 'Mercedes', CAR))
## # A tibble: 32 × 12
##    CAR      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda…  21       6  160    110  3.9   2.62  16.5     0     1
##  2 Mazda…  21       6  160    110  3.9   2.88  17.0     0     1
##  3 Datsu…  22.8     4  108     93  3.85  2.32  18.6     1     1
##  4 Horne…  21.4     6  258    110  3.08  3.22  19.4     1     0
##  5 Horne…  18.7     8  360    175  3.15  3.44  17.0     0     0
##  6 Valia…  18.1     6  225    105  2.76  3.46  20.2     1     0
##  7 Duste…  14.3     8  360    245  3.21  3.57  15.8     0     0
##  8 Merce…  24.4     4  147.    62  3.69  3.19  20       1     0
##  9 Merce…  22.8     4  141.    95  3.92  3.15  22.9     1     0
## 10 Merce…  19.2     6  168.   123  3.92  3.44  18.3     1     0
## # ℹ 22 more rows
## # ℹ 2 more variables: gear <dbl>, carb <dbl>
mtcars %>% mutate(CAR = gsub(' ', '_', CAR)) %>% mutate(across(CAR, tolower))
## # A tibble: 32 × 12
##    CAR      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 mazda…  21       6  160    110  3.9   2.62  16.5     0     1
##  2 mazda…  21       6  160    110  3.9   2.88  17.0     0     1
##  3 datsu…  22.8     4  108     93  3.85  2.32  18.6     1     1
##  4 horne…  21.4     6  258    110  3.08  3.22  19.4     1     0
##  5 horne…  18.7     8  360    175  3.15  3.44  17.0     0     0
##  6 valia…  18.1     6  225    105  2.76  3.46  20.2     1     0
##  7 duste…  14.3     8  360    245  3.21  3.57  15.8     0     0
##  8 merc_…  24.4     4  147.    62  3.69  3.19  20       1     0
##  9 merc_…  22.8     4  141.    95  3.92  3.15  22.9     1     0
## 10 merc_…  19.2     6  168.   123  3.92  3.44  18.3     1     0
## # ℹ 22 more rows
## # ℹ 2 more variables: gear <dbl>, carb <dbl>

2.4 Joining and separating character columns

For the CAR column comprised of strings, you can separate the individual strings into multiple columns; the extra = 'merge' argument tells the function to separate the string based on the first instance of ” ” and merge the rest:

mtcars %>% separate(CAR, c('Brand', 'Model'), sep = " ", extra = 'merge')
## # A tibble: 32 × 13
##    Brand  Model   mpg   cyl  disp    hp  drat    wt  qsec    vs
##    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda  RX4    21       6  160    110  3.9   2.62  16.5     0
##  2 Mazda  RX4 …  21       6  160    110  3.9   2.88  17.0     0
##  3 Datsun 710    22.8     4  108     93  3.85  2.32  18.6     1
##  4 Hornet 4 Dr…  21.4     6  258    110  3.08  3.22  19.4     1
##  5 Hornet Spor…  18.7     8  360    175  3.15  3.44  17.0     0
##  6 Valia… <NA>   18.1     6  225    105  2.76  3.46  20.2     1
##  7 Duster 360    14.3     8  360    245  3.21  3.57  15.8     0
##  8 Merc   240D   24.4     4  147.    62  3.69  3.19  20       1
##  9 Merc   230    22.8     4  141.    95  3.92  3.15  22.9     1
## 10 Merc   280    19.2     6  168.   123  3.92  3.44  18.3     1
## # ℹ 22 more rows
## # ℹ 3 more variables: am <dbl>, gear <dbl>, carb <dbl>

Chaining multiple tidyverse verbs can be done using the pipe %>%; this %>% symbol is part of the magrittr package, which is loaded with tidyverse. Recently as of R version 4.1.0, the native pipe |> was introduced, which behaves in mostly a similar manner as the magrittr pipe (though there are a few important differences). Here, I will stick to the %>% notation.

mtcars %>% separate(CAR, c('Brand', 'Model'), sep = " ", extra = 'merge') %>%
  group_by(Brand) %>% summarise(count = n(), mean_mpg = mean(mpg)) %>%
  arrange(desc(count))
## # A tibble: 22 × 3
##    Brand    count mean_mpg
##    <chr>    <int>    <dbl>
##  1 Merc         7     19.0
##  2 Fiat         2     29.8
##  3 Hornet       2     20.0
##  4 Mazda        2     21  
##  5 Toyota       2     27.7
##  6 AMC          1     15.2
##  7 Cadillac     1     10.4
##  8 Camaro       1     13.3
##  9 Chrysler     1     14.7
## 10 Datsun       1     22.8
## # ℹ 12 more rows

The pipe notation is frequently used between tidyverse verbs but you can also use it in some base R operations as well.

Separation of columns by strings, using base R instead involves using strsplit() which outputs a list and therefore we need to unpack it afterwards:

mtcars$Brand <- unlist(
  lapply(strsplit(mtcars$CAR, split = ' '), function(x) x[1])
) # not run
mtcars$Model <- unlist(
  lapply(strsplit(mtcars$CAR, split = ' '), function(x) x[2])
) # not run

The opposite of separate() is unite() which combines columns into a single column:

data(iris)
iris <- as_tibble(iris)
iris %>% unite('Petal.Dimensions', c(`Petal.Length`, `Petal.Width`),
               sep = " x ", remove = FALSE)
## # A tibble: 150 × 6
##    Sepal.Length Sepal.Width Petal.Dimensions Petal.Length
##           <dbl>       <dbl> <chr>                   <dbl>
##  1          5.1         3.5 1.4 x 0.2                 1.4
##  2          4.9         3   1.4 x 0.2                 1.4
##  3          4.7         3.2 1.3 x 0.2                 1.3
##  4          4.6         3.1 1.5 x 0.2                 1.5
##  5          5           3.6 1.4 x 0.2                 1.4
##  6          5.4         3.9 1.7 x 0.4                 1.7
##  7          4.6         3.4 1.4 x 0.3                 1.4
##  8          5           3.4 1.5 x 0.2                 1.5
##  9          4.4         2.9 1.4 x 0.2                 1.4
## 10          4.9         3.1 1.5 x 0.1                 1.5
## # ℹ 140 more rows
## # ℹ 2 more variables: Petal.Width <dbl>, Species <fct>

In base R, we create a new column using the $ notation:

iris$Petal.Dimensions <- paste(iris$Petal.Length, iris$Petal.Width, sep = " x ") # not run

2.5 Filtering rows

Typically, filtering involves a condition based on a column to select the corresponding rows. Here I am combining stringr::str_detect() with dplyr::filter() since the column I am filtering on is a character vector.

mtcars %>% filter(str_detect(CAR, 'Mazda'))
## # A tibble: 2 × 12
##   CAR       mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda …    21     6   160   110   3.9  2.62  16.5     0     1
## 2 Mazda …    21     6   160   110   3.9  2.88  17.0     0     1
## # ℹ 2 more variables: gear <dbl>, carb <dbl>

This function accepts logical operators and regex as well:

mtcars %>% filter(str_detect(CAR, 'Mazda | Merc'))
## # A tibble: 2 × 12
##   CAR       mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda …    21     6   160   110   3.9  2.62  16.5     0     1
## 2 Mazda …    21     6   160   110   3.9  2.88  17.0     0     1
## # ℹ 2 more variables: gear <dbl>, carb <dbl>

Ignoring upper/lower case distinction using regex(..., ignore_case = TRUE):

mtcars %>% filter(str_detect(CAR, regex('mazda', ignore_case = TRUE)))
## # A tibble: 2 × 12
##   CAR       mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda …    21     6   160   110   3.9  2.62  16.5     0     1
## 2 Mazda …    21     6   160   110   3.9  2.88  17.0     0     1
## # ℹ 2 more variables: gear <dbl>, carb <dbl>

Alternatively, using base R using grepl():

mtcars[grepl('Mazda', mtcars$CAR),] # not run
mtcars[grepl('Mazda|Merc', mtcars$CAR),] # not run

Using tolower() to make sure we’re on the same page in regards to case:

mtcars[grepl(tolower('Mazda'), tolower(mtcars$CAR)),]
## # A tibble: 2 × 12
##   CAR       mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda …    21     6   160   110   3.9  2.62  16.5     0     1
## 2 Mazda …    21     6   160   110   3.9  2.88  17.0     0     1
## # ℹ 2 more variables: gear <dbl>, carb <dbl>

Filtering rows based on a numeric column is also trivial:

mtcars %>% filter(between(mpg, 18, 20))
## # A tibble: 5 × 12
##   CAR       mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Hornet…  18.7     8  360    175  3.15  3.44  17.0     0     0
## 2 Valiant  18.1     6  225    105  2.76  3.46  20.2     1     0
## 3 Merc 2…  19.2     6  168.   123  3.92  3.44  18.3     1     0
## 4 Pontia…  19.2     8  400    175  3.08  3.84  17.0     0     0
## 5 Ferrar…  19.7     6  145    175  3.62  2.77  15.5     0     1
## # ℹ 2 more variables: gear <dbl>, carb <dbl>
mtcars %>% filter(cyl %in% c(6, 8))
## # A tibble: 21 × 12
##    CAR      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda…  21       6  160    110  3.9   2.62  16.5     0     1
##  2 Mazda…  21       6  160    110  3.9   2.88  17.0     0     1
##  3 Horne…  21.4     6  258    110  3.08  3.22  19.4     1     0
##  4 Horne…  18.7     8  360    175  3.15  3.44  17.0     0     0
##  5 Valia…  18.1     6  225    105  2.76  3.46  20.2     1     0
##  6 Duste…  14.3     8  360    245  3.21  3.57  15.8     0     0
##  7 Merc …  19.2     6  168.   123  3.92  3.44  18.3     1     0
##  8 Merc …  17.8     6  168.   123  3.92  3.44  18.9     1     0
##  9 Merc …  16.4     8  276.   180  3.07  4.07  17.4     0     0
## 10 Merc …  17.3     8  276.   180  3.07  3.73  17.6     0     0
## # ℹ 11 more rows
## # ℹ 2 more variables: gear <dbl>, carb <dbl>

A useful function is the negation of %in%, which we can define manually:

`%nin%` <- Negate(`%in%`)

mtcars %>% filter(cyl %nin% c(6, 8))
## # A tibble: 11 × 12
##    CAR      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Datsu…  22.8     4 108      93  3.85  2.32  18.6     1     1
##  2 Merc …  24.4     4 147.     62  3.69  3.19  20       1     0
##  3 Merc …  22.8     4 141.     95  3.92  3.15  22.9     1     0
##  4 Fiat …  32.4     4  78.7    66  4.08  2.2   19.5     1     1
##  5 Honda…  30.4     4  75.7    52  4.93  1.62  18.5     1     1
##  6 Toyot…  33.9     4  71.1    65  4.22  1.84  19.9     1     1
##  7 Toyot…  21.5     4 120.     97  3.7   2.46  20.0     1     0
##  8 Fiat …  27.3     4  79      66  4.08  1.94  18.9     1     1
##  9 Porsc…  26       4 120.     91  4.43  2.14  16.7     0     1
## 10 Lotus…  30.4     4  95.1   113  3.77  1.51  16.9     1     1
## 11 Volvo…  21.4     4 121     109  4.11  2.78  18.6     1     1
## # ℹ 2 more variables: gear <dbl>, carb <dbl>

In base R we can use which() instead to subset:

mtcars[which(mtcars$mpg > 18 & mtcars$mpg < 20),] # not run
mtcars[which(mtcars$cyl %in% c(6,8)),] # not run

Base R function subset() also works:

subset(mtcars, cyl == 6 & vs == 0)
## # A tibble: 3 × 12
##   CAR       mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda …  21       6   160   110  3.9   2.62  16.5     0     1
## 2 Mazda …  21       6   160   110  3.9   2.88  17.0     0     1
## 3 Ferrar…  19.7     6   145   175  3.62  2.77  15.5     0     1
## # ℹ 2 more variables: gear <dbl>, carb <dbl>

2.6 Subsetting columns based on strings

Subsetting columns typically involves using square brackets [ ] in base R, but can be done easily with dplyr::select(); select() supports multiple selector functions such as starts_with(), ends_with(), contains(), and match(). The selector everything() also exists, which is used to - you guessed it - select every column.

As a word of caution: sometimes it’s good practice to explicitly state select() we’re calling is from dplyr (i.e., dplyr::select()) to avoid confusion when working with many different packages at once that could potentially introduce conflicts.

mtcars %>% select(contains('m'))
## # A tibble: 32 × 2
##      mpg    am
##    <dbl> <dbl>
##  1  21       1
##  2  21       1
##  3  22.8     1
##  4  21.4     0
##  5  18.7     0
##  6  18.1     0
##  7  14.3     0
##  8  24.4     0
##  9  22.8     0
## 10  19.2     0
## # ℹ 22 more rows

You can use multiple helper functions at once:

mtcars %>% select(starts_with('m'), ends_with('c'))
## # A tibble: 32 × 2
##      mpg  qsec
##    <dbl> <dbl>
##  1  21    16.5
##  2  21    17.0
##  3  22.8  18.6
##  4  21.4  19.4
##  5  18.7  17.0
##  6  18.1  20.2
##  7  14.3  15.8
##  8  24.4  20  
##  9  22.8  22.9
## 10  19.2  18.3
## # ℹ 22 more rows

Using regex allows more flexibility in terms of pattern matching, as well as with anchors ^ and $:

mtcars %>% select(matches('^m'), matches('c$'))
## # A tibble: 32 × 2
##      mpg  qsec
##    <dbl> <dbl>
##  1  21    16.5
##  2  21    17.0
##  3  22.8  18.6
##  4  21.4  19.4
##  5  18.7  17.0
##  6  18.1  20.2
##  7  14.3  15.8
##  8  24.4  20  
##  9  22.8  22.9
## 10  19.2  18.3
## # ℹ 22 more rows

Alternatively, in base R and regex:

mtcars[,grepl('^m|c$', names(mtcars))]
## # A tibble: 32 × 2
##      mpg  qsec
##    <dbl> <dbl>
##  1  21    16.5
##  2  21    17.0
##  3  22.8  18.6
##  4  21.4  19.4
##  5  18.7  17.0
##  6  18.1  20.2
##  7  14.3  15.8
##  8  24.4  20  
##  9  22.8  22.9
## 10  19.2  18.3
## # ℹ 22 more rows

We can also define a vector containing the column names and use select() using the helper functions such as any_of() and all_of().

my_cols <- c('mpg', 'cyl', 'am')
mtcars %>% select(any_of(my_cols))
## # A tibble: 32 × 3
##      mpg   cyl    am
##    <dbl> <dbl> <dbl>
##  1  21       6     1
##  2  21       6     1
##  3  22.8     4     1
##  4  21.4     6     0
##  5  18.7     8     0
##  6  18.1     6     0
##  7  14.3     8     0
##  8  24.4     4     0
##  9  22.8     4     0
## 10  19.2     6     0
## # ℹ 22 more rows

Standard slicing and dicing works just as well:

mtcars[,my_cols]
## # A tibble: 32 × 3
##      mpg   cyl    am
##    <dbl> <dbl> <dbl>
##  1  21       6     1
##  2  21       6     1
##  3  22.8     4     1
##  4  21.4     6     0
##  5  18.7     8     0
##  6  18.1     6     0
##  7  14.3     8     0
##  8  24.4     4     0
##  9  22.8     4     0
## 10  19.2     6     0
## # ℹ 22 more rows

any_of() is particularly useful when the vector of strings contains columns that don’t exist in the data. For example, if my_cols contained elements called banana and apple, running mtcars[,my_cols] will throw an error - you can’t subset columns that don’t exist. Using any_of() safely finds columns that do exist in my_cols and returns them safely.

2.7 Long and wide data formats

Transforming datasets into long/wide formats is a typical task in exploratory data analysis and this can be done using tidyr:

mtcars_long <- mtcars %>% pivot_longer(-1, names_to = 'Metric', values_to = 'Values')
mtcars_long
## # A tibble: 352 × 3
##    CAR       Metric Values
##    <chr>     <chr>   <dbl>
##  1 Mazda RX4 mpg     21   
##  2 Mazda RX4 cyl      6   
##  3 Mazda RX4 disp   160   
##  4 Mazda RX4 hp     110   
##  5 Mazda RX4 drat     3.9 
##  6 Mazda RX4 wt       2.62
##  7 Mazda RX4 qsec    16.5 
##  8 Mazda RX4 vs       0   
##  9 Mazda RX4 am       1   
## 10 Mazda RX4 gear     4   
## # ℹ 342 more rows

Having the dataset in this long format allows us to create visualizations such as the boxplot much easier.

Pivoting the long format back to a wide format as also straightforward with tidyr:

mtcars_wide <- mtcars_long %>% 
  pivot_wider(names_from = 'Metric', values_from = 'Values')
mtcars_wide
## # A tibble: 32 × 12
##    CAR      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda…  21       6  160    110  3.9   2.62  16.5     0     1
##  2 Mazda…  21       6  160    110  3.9   2.88  17.0     0     1
##  3 Datsu…  22.8     4  108     93  3.85  2.32  18.6     1     1
##  4 Horne…  21.4     6  258    110  3.08  3.22  19.4     1     0
##  5 Horne…  18.7     8  360    175  3.15  3.44  17.0     0     0
##  6 Valia…  18.1     6  225    105  2.76  3.46  20.2     1     0
##  7 Duste…  14.3     8  360    245  3.21  3.57  15.8     0     0
##  8 Merc …  24.4     4  147.    62  3.69  3.19  20       1     0
##  9 Merc …  22.8     4  141.    95  3.92  3.15  22.9     1     0
## 10 Merc …  19.2     6  168.   123  3.92  3.44  18.3     1     0
## # ℹ 22 more rows
## # ℹ 2 more variables: gear <dbl>, carb <dbl>

Having the data in a wide format may be useful when plotting visualizations such as a heatmap, where row and column annotations correspond to the heatmap perimeter.

In base R, you can use reshape() with the argument direction = but I don’t find its syntax very intuitive. Alternatively, using data.table gets us the solution much easier:

library(data.table)
melt(setDT(mtcars), id.vars = c('CAR'), variable.name = 'Metric')
##                    CAR Metric value
##   1:         Mazda RX4    mpg  21.0
##   2:     Mazda RX4 Wag    mpg  21.0
##   3:        Datsun 710    mpg  22.8
##   4:    Hornet 4 Drive    mpg  21.4
##   5: Hornet Sportabout    mpg  18.7
##  ---                               
## 348:      Lotus Europa   carb   2.0
## 349:    Ford Pantera L   carb   4.0
## 350:      Ferrari Dino   carb   6.0
## 351:     Maserati Bora   carb   8.0
## 352:        Volvo 142E   carb   2.0

Note that using data.table returns a data.table object (due to the setDT() function), which differs from the tibble we’ve been using for packages within tidyverse. There are some advantages to using data.table instead, especially when working with very large datasets. For the sake of this book I will mostly use base R and tidyverse, other than exceptional cases. However, for a brief discourse on data.table, refer to Chapter 4.

2.8 Trimming strings

Trimming character columns, then re-encoding them as factors:

iris %>% mutate(Species = strtrim(Species, 3)) %>%
  mutate(Species = factor(Species, levels = c('set', 'ver', 'vir')))
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 set    
##  2          4.9         3            1.4         0.2 set    
##  3          4.7         3.2          1.3         0.2 set    
##  4          4.6         3.1          1.5         0.2 set    
##  5          5           3.6          1.4         0.2 set    
##  6          5.4         3.9          1.7         0.4 set    
##  7          4.6         3.4          1.4         0.3 set    
##  8          5           3.4          1.5         0.2 set    
##  9          4.4         2.9          1.4         0.2 set    
## 10          4.9         3.1          1.5         0.1 set    
## # ℹ 140 more rows

Sometimes when you’re importing datasets from external and/or untrustworthy sources (e.g., an Excel sheet) it’s worth checking for any whitespaces. In that case you can use stringr::str_trim() to remove all whitespaces prior to data analysis.

iris %>% mutate(Species = str_trim(Species))
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

2.9 Iterating over list of dataframes

Analogously to base R’s split(), using group_split() allows us to convert the dataset to a list based on a column; the length of this output list would correspond to the unique number of column entries:

mtcars <- as_tibble(mtcars)
mtcars_lst <- mtcars %>% group_split(cyl)

Iteration will be covered in more detail in a future chapter (see Chapter 3), but using base R’s apply family is straightforward. In this case since we’re working with lists, we will use lapply() and pass a function (in this case, rename_with()):

mtcars_lst <- lapply(mtcars_lst, function(x) 
  rename_with(x, function(y) paste0(y, "_", as.character(unique(x$cyl))), .cols = -1))
mtcars_lst[[1]]
## # A tibble: 11 × 12
##    CAR       mpg_4 cyl_4 disp_4  hp_4 drat_4  wt_4 qsec_4  vs_4
##    <chr>     <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
##  1 Datsun 7…  22.8     4  108      93   3.85  2.32   18.6     1
##  2 Merc 240D  24.4     4  147.     62   3.69  3.19   20       1
##  3 Merc 230   22.8     4  141.     95   3.92  3.15   22.9     1
##  4 Fiat 128   32.4     4   78.7    66   4.08  2.2    19.5     1
##  5 Honda Ci…  30.4     4   75.7    52   4.93  1.62   18.5     1
##  6 Toyota C…  33.9     4   71.1    65   4.22  1.84   19.9     1
##  7 Toyota C…  21.5     4  120.     97   3.7   2.46   20.0     1
##  8 Fiat X1-9  27.3     4   79      66   4.08  1.94   18.9     1
##  9 Porsche …  26       4  120.     91   4.43  2.14   16.7     0
## 10 Lotus Eu…  30.4     4   95.1   113   3.77  1.51   16.9     1
## 11 Volvo 14…  21.4     4  121     109   4.11  2.78   18.6     1
## # ℹ 3 more variables: am_4 <dbl>, gear_4 <dbl>, carb_4 <dbl>

A tidyverse alternative to apply() is purrr::map() and its variations:

mtcars_lst <- mtcars %>% group_split(cyl)
mtcars_lst <- mtcars_lst %>% 
  map(~ rename_with(.x, function(y) paste0(y, "_", as.character(unique(.x$cyl))), 
                    .cols = -1))
mtcars_lst[[1]]
## # A tibble: 11 × 12
##    CAR       mpg_4 cyl_4 disp_4  hp_4 drat_4  wt_4 qsec_4  vs_4
##    <chr>     <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
##  1 Datsun 7…  22.8     4  108      93   3.85  2.32   18.6     1
##  2 Merc 240D  24.4     4  147.     62   3.69  3.19   20       1
##  3 Merc 230   22.8     4  141.     95   3.92  3.15   22.9     1
##  4 Fiat 128   32.4     4   78.7    66   4.08  2.2    19.5     1
##  5 Honda Ci…  30.4     4   75.7    52   4.93  1.62   18.5     1
##  6 Toyota C…  33.9     4   71.1    65   4.22  1.84   19.9     1
##  7 Toyota C…  21.5     4  120.     97   3.7   2.46   20.0     1
##  8 Fiat X1-9  27.3     4   79      66   4.08  1.94   18.9     1
##  9 Porsche …  26       4  120.     91   4.43  2.14   16.7     0
## 10 Lotus Eu…  30.4     4   95.1   113   3.77  1.51   16.9     1
## 11 Volvo 14…  21.4     4  121     109   4.11  2.78   18.6     1
## # ℹ 3 more variables: am_4 <dbl>, gear_4 <dbl>, carb_4 <dbl>

Fitting a linear model is easy using iterations; in this case we fit lm() on the variables mpg and gear to identify their relationship:

mtcars_lst <- mtcars %>% group_split(cyl)
mtcars_lst %>% 
  map(~ lm(mpg ~ gear, data = .x)) %>%
  map(coef)
## [[1]]
## (Intercept)        gear 
##    15.08125     2.83125 
## 
## [[2]]
## (Intercept)        gear 
##       19.82       -0.02 
## 
## [[3]]
## (Intercept)        gear 
##      14.525       0.175

Using broom::tidy() to clean up modelling result and output the model estimates:

library(broom)
mtcars_lst %>%
  map(~ lm(mpg ~ gear, data = .x)) %>%
  map(tidy) %>%
  bind_rows()
## # A tibble: 6 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  15.1       10.8      1.39   0.197  
## 2 gear          2.83       2.62     1.08   0.308  
## 3 (Intercept)  19.8        3.68     5.38   0.00299
## 4 gear         -0.0200     0.942   -0.0212 0.984  
## 5 (Intercept)  14.5        3.41     4.25   0.00112
## 6 gear          0.175      1.02     0.172  0.866

Using base R and the apply() family instead:

models <- lapply(mtcars_lst, function(x) lm(mpg ~ gear, data = x))
coefs <- lapply(models, coef)
coefs[[1]]
## (Intercept)        gear 
##    15.08125     2.83125

2.10 Rowwise operations

df <- tibble(name = c('Brian', 'Connor'),
             coffee = sample(1:10, 2),
             wine = sample(1:5, 2),
             juice = sample(1:5, 2))
df
## # A tibble: 2 × 4
##   name   coffee  wine juice
##   <chr>   <int> <int> <int>
## 1 Brian       3     2     3
## 2 Connor      9     4     2

Sometimes it’s useful to run calculations in a row-wise fashion; for example using mutate() with the helper function c_across():

df %>% rowwise() %>% mutate(total = sum(c_across(coffee:juice)),
                            avg = mean(c_across(coffee:juice)))
## # A tibble: 2 × 6
## # Rowwise: 
##   name   coffee  wine juice total   avg
##   <chr>   <int> <int> <int> <int> <dbl>
## 1 Brian       3     2     3     8  2.67
## 2 Connor      9     4     2    15  5

Calculating the proportions using a combination of row-wise and column-wise operations with c_across() and across():

df %>% rowwise() %>% mutate(total = sum(c_across(coffee:juice))) %>%
  ungroup() %>%
  mutate(across(coffee:juice, function(x) x/total,
                .names = "{col}_prop."))
## # A tibble: 2 × 8
##   name   coffee  wine juice total coffee_prop. wine_prop.
##   <chr>   <int> <int> <int> <int>        <dbl>      <dbl>
## 1 Brian       3     2     3     8        0.375      0.25 
## 2 Connor      9     4     2    15        0.6        0.267
## # ℹ 1 more variable: juice_prop. <dbl>

2.11 Conditional transformation

Conditional transformations are useful when creating categories based on a series of logical statements; this is done using case_when() to define the conditions:

mtcars %>% mutate(mileage_class = 
  case_when(mpg > 20 ~ 'High',
            mpg < 20 ~ 'Low')) %>%
  relocate(mileage_class, .after = mpg)
## # A tibble: 32 × 13
##    CAR          mpg mileage_class   cyl  disp    hp  drat    wt
##    <chr>      <dbl> <chr>         <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda RX4   21   High              6  160    110  3.9   2.62
##  2 Mazda RX4…  21   High              6  160    110  3.9   2.88
##  3 Datsun 710  22.8 High              4  108     93  3.85  2.32
##  4 Hornet 4 …  21.4 High              6  258    110  3.08  3.22
##  5 Hornet Sp…  18.7 Low               8  360    175  3.15  3.44
##  6 Valiant     18.1 Low               6  225    105  2.76  3.46
##  7 Duster 360  14.3 Low               8  360    245  3.21  3.57
##  8 Merc 240D   24.4 High              4  147.    62  3.69  3.19
##  9 Merc 230    22.8 High              4  141.    95  3.92  3.15
## 10 Merc 280    19.2 Low               6  168.   123  3.92  3.44
## # ℹ 22 more rows
## # ℹ 5 more variables: qsec <dbl>, vs <dbl>, am <dbl>,
## #   gear <dbl>, carb <dbl>

In cases where we’re working with character columns, it’s useful to use grepl to match specific strings. After all, formula within case_when() needs to evaluate to a boolean on the left-hand side (e.g., mpg > 20). Additionally, in cases where the condition is not specified or met, you can use TRUE ~ within case_when() as such:

mtcars %>% mutate(CAR = case_when(grepl('Merc', CAR) ~ toupper(CAR), TRUE ~ CAR))
## # A tibble: 32 × 12
##    CAR      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda…  21       6  160    110  3.9   2.62  16.5     0     1
##  2 Mazda…  21       6  160    110  3.9   2.88  17.0     0     1
##  3 Datsu…  22.8     4  108     93  3.85  2.32  18.6     1     1
##  4 Horne…  21.4     6  258    110  3.08  3.22  19.4     1     0
##  5 Horne…  18.7     8  360    175  3.15  3.44  17.0     0     0
##  6 Valia…  18.1     6  225    105  2.76  3.46  20.2     1     0
##  7 Duste…  14.3     8  360    245  3.21  3.57  15.8     0     0
##  8 MERC …  24.4     4  147.    62  3.69  3.19  20       1     0
##  9 MERC …  22.8     4  141.    95  3.92  3.15  22.9     1     0
## 10 MERC …  19.2     6  168.   123  3.92  3.44  18.3     1     0
## # ℹ 22 more rows
## # ℹ 2 more variables: gear <dbl>, carb <dbl>

Conditional mutate using base R involves using ifelse():

mtcars$mileage_class <- ifelse(
  mtcars$mpg > 20, 'High', 'Low'
)
subset(mtcars, select = c(CAR, mpg, mileage_class, cyl:carb))
## # A tibble: 32 × 13
##    CAR          mpg mileage_class   cyl  disp    hp  drat    wt
##    <chr>      <dbl> <chr>         <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda RX4   21   High              6  160    110  3.9   2.62
##  2 Mazda RX4…  21   High              6  160    110  3.9   2.88
##  3 Datsun 710  22.8 High              4  108     93  3.85  2.32
##  4 Hornet 4 …  21.4 High              6  258    110  3.08  3.22
##  5 Hornet Sp…  18.7 Low               8  360    175  3.15  3.44
##  6 Valiant     18.1 Low               6  225    105  2.76  3.46
##  7 Duster 360  14.3 Low               8  360    245  3.21  3.57
##  8 Merc 240D   24.4 High              4  147.    62  3.69  3.19
##  9 Merc 230    22.8 High              4  141.    95  3.92  3.15
## 10 Merc 280    19.2 Low               6  168.   123  3.92  3.44
## # ℹ 22 more rows
## # ℹ 5 more variables: qsec <dbl>, vs <dbl>, am <dbl>,
## #   gear <dbl>, carb <dbl>

2.12 Missing values

Handling missing values is tedious but often required when working with dodgy data.

First, for the sake of our exercise we insert NAs randomly in the mtcars dataset:

mtcars_NA <- map_df(mtcars, function(x) 
  {x[sample(c(TRUE, NA), prob = c(0.95, 0.01), size = length(x), replace = TRUE)]})
mtcars_NA
## # A tibble: 32 × 13
##    CAR      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda…  21       6  160    110  3.9   2.62  16.5     0     1
##  2 Mazda…  21       6  160    110  3.9   2.88  17.0     0     1
##  3 Datsu…  22.8     4  108     93  3.85  2.32  18.6     1     1
##  4 Horne…  21.4     6  258    110  3.08  3.22  19.4     1     0
##  5 Horne…  18.7     8  360    175  3.15  3.44  17.0     0     0
##  6 Valia…  18.1     6  225    105  2.76  3.46  20.2     1     0
##  7 Duste…  14.3     8  360    245  3.21  3.57  15.8     0     0
##  8 Merc …  24.4     4  147.    62  3.69  3.19  20       1     0
##  9 Merc …  22.8     4  141.    95  3.92  3.15  22.9     1     0
## 10 Merc …  19.2     6  168.   123  3.92  3.44  18.3     1     0
## # ℹ 22 more rows
## # ℹ 3 more variables: gear <dbl>, carb <dbl>,
## #   mileage_class <chr>

Remove rows where any NA occurs:

mtcars_NA %>% na.omit()
## # A tibble: 26 × 13
##    CAR      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda…  21       6  160    110  3.9   2.62  16.5     0     1
##  2 Mazda…  21       6  160    110  3.9   2.88  17.0     0     1
##  3 Datsu…  22.8     4  108     93  3.85  2.32  18.6     1     1
##  4 Horne…  21.4     6  258    110  3.08  3.22  19.4     1     0
##  5 Horne…  18.7     8  360    175  3.15  3.44  17.0     0     0
##  6 Valia…  18.1     6  225    105  2.76  3.46  20.2     1     0
##  7 Duste…  14.3     8  360    245  3.21  3.57  15.8     0     0
##  8 Merc …  24.4     4  147.    62  3.69  3.19  20       1     0
##  9 Merc …  22.8     4  141.    95  3.92  3.15  22.9     1     0
## 10 Merc …  19.2     6  168.   123  3.92  3.44  18.3     1     0
## # ℹ 16 more rows
## # ℹ 3 more variables: gear <dbl>, carb <dbl>,
## #   mileage_class <chr>

Identify columns with NAs and the number of occurrences:

vapply(mtcars_NA, function(x) sum(is.na(x)), double(1))
##           CAR           mpg           cyl          disp 
##             1             0             0             0 
##            hp          drat            wt          qsec 
##             0             0             1             1 
##            vs            am          gear          carb 
##             0             1             0             0 
## mileage_class 
##             3

Remove columns with more than one missing value:

mtcars_NA %>% select(where(function(x) sum(is.na(x)) < 1))
## # A tibble: 32 × 8
##      mpg   cyl  disp    hp  drat    vs  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9      0     4     4
##  2  21       6  160    110  3.9      0     4     4
##  3  22.8     4  108     93  3.85     1     4     1
##  4  21.4     6  258    110  3.08     1     3     1
##  5  18.7     8  360    175  3.15     0     3     2
##  6  18.1     6  225    105  2.76     1     3     1
##  7  14.3     8  360    245  3.21     0     3     4
##  8  24.4     4  147.    62  3.69     1     4     2
##  9  22.8     4  141.    95  3.92     1     4     2
## 10  19.2     6  168.   123  3.92     1     4     4
## # ℹ 22 more rows

Replace missing values with zero using tidyr::replace_na():

mtcars_NA %>% map_dfc(~ replace_na(.x, 0))

Base R and using is.na() instead:

mtcars_NA[is.na(mtcars_NA)] <- 0 # not run

2.13 Joining dataframes

Mutating joins in tidyverse are analogous to the inner and outer joins in SQL syntax:

df1 <- tibble(
  name = c('Brian', 'Connor', 'Jon'),
  city = c('Tokyo', 'London', 'Milan'),
  age = c(28, 25, 21)
)
df2 <- tibble(
  name = c('Brian', 'Connor'),
  hair = c('black', 'brown'),
  eyes = c('dark', 'hazel')
)
df1 %>% inner_join(df2, by = 'name')
## # A tibble: 2 × 5
##   name   city     age hair  eyes 
##   <chr>  <chr>  <dbl> <chr> <chr>
## 1 Brian  Tokyo     28 black dark 
## 2 Connor London    25 brown hazel
df1 %>% left_join(df2, by = 'name')
## # A tibble: 3 × 5
##   name   city     age hair  eyes 
##   <chr>  <chr>  <dbl> <chr> <chr>
## 1 Brian  Tokyo     28 black dark 
## 2 Connor London    25 brown hazel
## 3 Jon    Milan     21 <NA>  <NA>

For a great visualization of the different join functions in tidyverse, check out the chapter on relational data in Hadley’s book R for Data Science here.

Base R uses merge() with the argument all.x = and all_y =:

merge(df1, df2, by = 'name')
##     name   city age  hair  eyes
## 1  Brian  Tokyo  28 black  dark
## 2 Connor London  25 brown hazel
merge(df1, df2, by = 'name', all.x = TRUE)
##     name   city age  hair  eyes
## 1  Brian  Tokyo  28 black  dark
## 2 Connor London  25 brown hazel
## 3    Jon  Milan  21  <NA>  <NA>