Chapter 2 Everyday data wrangling
Suppose there exists a test dataset and the user is tasked with proving their proficiency in exploratory data analysis (EDA). Alternatively, the user may be handed a piece of data (e.g., maybe an Excel sheet) and is asked to run some preliminary analysis. Prior to jumping straight into EDA and statistical modeling, it’s a good idea to transform the original data into a desirable format. This includes, but is not limited to, missing data imputation, scalar transformation, feature selection, group aggregation, and data filtering. The following notebook outlines the fundamentals of data wrangling, using practical base R and tidyverse
solutions interchangeably.
2.1 Renaming column headers
The first few steps in data analysis is to gain intuition around the data at hand - typically, a m x n
dataset is structured in a way such that there are m
records (also referred to as instances or samples) and m
features (also referred to as predictors). Getting a good sense and understanding of the features we’re working with often requires cleaning the data before jumping into things like feature selection and feature engineering.
Suppose we have a 32 x 11
dataset mtcars
where rownames
correspond to the car model:
There are 11 columns here and since this is a toy dataset, it is already neat and tidy. However, sometimes we may need to rename the columns to trim specific strings or to make downstream analysis easier. Renaming column headers with dplyr::rename()
is as simple as rename(new_name = old_name)
, but with rename_with()
there’s more flexibility:
## # A tibble: 32 × 12
## CAR MPG CYL DISP HP DRAT WT QSEC VS AM GEAR
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda… 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda… 21 6 160 110 3.9 2.88 17.0 0 1 4
## 3 Datsu… 22.8 4 108 93 3.85 2.32 18.6 1 1 4
## 4 Horne… 21.4 6 258 110 3.08 3.22 19.4 1 0 3
## 5 Horne… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
## 6 Valia… 18.1 6 225 105 2.76 3.46 20.2 1 0 3
## 7 Duste… 14.3 8 360 245 3.21 3.57 15.8 0 0 3
## 8 Merc … 24.4 4 147. 62 3.69 3.19 20 1 0 4
## 9 Merc … 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
## 10 Merc … 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
## # ℹ 22 more rows
## # ℹ 1 more variable: CARB <dbl>
You can pass any function within rename_with()
to alter the column headers and use the .cols =
argument to define which columns you’d apply the function to:
## # A tibble: 32 × 12
## CAR mpg_1 cyl_1 disp_1 hp_1 drat_1 wt_1 qsec_1 vs_1 am_1
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1
## 2 Mazda RX… 21 6 160 110 3.9 2.88 17.0 0 1
## 3 Datsun 7… 22.8 4 108 93 3.85 2.32 18.6 1 1
## 4 Hornet 4… 21.4 6 258 110 3.08 3.22 19.4 1 0
## 5 Hornet S… 18.7 8 360 175 3.15 3.44 17.0 0 0
## 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0
## 7 Duster 3… 14.3 8 360 245 3.21 3.57 15.8 0 0
## 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0
## 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0
## 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0
## # ℹ 22 more rows
## # ℹ 2 more variables: gear_1 <dbl>, carb_1 <dbl>
Alternatively with base R, you can call the names
attribute and alter that directly:
2.2 Grouped operations
Grouped operations are most useful when you’re working with data where there is a categorical variable. Grouped summaries are powerful tools easily done with dplyr::summarise()
:
## # A tibble: 3 × 2
## cyl avg_mpg
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
Using multiple columns at once and renaming the output columns:
## # A tibble: 3 × 4
## cyl mean_mpg mean_disp mean_hp
## <dbl> <dbl> <dbl> <dbl>
## 1 4 26.7 105. 82.6
## 2 6 19.7 183. 122.
## 3 8 15.1 353. 209.
## # A tibble: 3 × 11
## cyl mean_mpg mean_disp mean_hp mean_drat mean_wt mean_qsec
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 26.7 105. 82.6 4.07 2.29 19.1
## 2 6 19.7 183. 122. 3.59 3.12 18.0
## 3 8 15.1 353. 209. 3.23 4.00 16.8
## # ℹ 4 more variables: mean_vs <dbl>, mean_am <dbl>, mean_gear <dbl>,
## # mean_carb <dbl>
Alternatively, in base R, you can use aggregate()
with the ~
; you can read this as grouping mpg by cyl where by corresponds to the ~
:
## cyl mpg
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
Since aggregate()
uses the formula syntax (~
), using the .
placeholder on the LHS calls all variables not called in the RHS:
When calculating the size of the groups, you can use n()
:
## # A tibble: 3 × 2
## cyl count
## <dbl> <int>
## 1 4 11
## 2 6 7
## 3 8 14
The n()
function is useful when filtering by groups of size greater than a certain value:
## # A tibble: 25 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Datsu… 22.8 4 108 93 3.85 2.32 18.6 1 1 4
## 2 Horne… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
## 3 Duste… 14.3 8 360 245 3.21 3.57 15.8 0 0 3
## 4 Merc … 24.4 4 147. 62 3.69 3.19 20 1 0 4
## 5 Merc … 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
## 6 Merc … 16.4 8 276. 180 3.07 4.07 17.4 0 0 3
## 7 Merc … 17.3 8 276. 180 3.07 3.73 17.6 0 0 3
## 8 Merc … 15.2 8 276. 180 3.07 3.78 18 0 0 3
## 9 Cadil… 10.4 8 472 205 2.93 5.25 18.0 0 0 3
## 10 Linco… 10.4 8 460 215 3 5.42 17.8 0 0 3
## # ℹ 15 more rows
## # ℹ 1 more variable: carb <dbl>
Using n()
gives us access to finding the group sizes using tally()
or count()
:
## # A tibble: 3 × 2
## cyl n
## <dbl> <int>
## 1 4 11
## 2 6 7
## 3 8 14
We can label encode (i.e., give a unique integer identifier for the current groups) using cur_group_id()
:
## # A tibble: 32 × 2
## # Groups: cyl [3]
## ID cyl
## <int> <dbl>
## 1 2 6
## 2 2 6
## 3 1 4
## 4 2 6
## 5 3 8
## 6 2 6
## 7 3 8
## 8 1 4
## 9 1 4
## 10 2 6
## # ℹ 22 more rows
2.3 Data transformation
Sometimes it’s useful to transform numeric columns; this is particularly helpful when performing feature transformations (e.g., logarithmic, absolute value) or creating new columns entirely from preexisting columns.
## # A tibble: 32 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda… 42 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda… 42 6 160 110 3.9 2.88 17.0 0 1 4
## 3 Datsu… 45.6 4 108 93 3.85 2.32 18.6 1 1 4
## 4 Horne… 42.8 6 258 110 3.08 3.22 19.4 1 0 3
## 5 Horne… 37.4 8 360 175 3.15 3.44 17.0 0 0 3
## 6 Valia… 36.2 6 225 105 2.76 3.46 20.2 1 0 3
## 7 Duste… 28.6 8 360 245 3.21 3.57 15.8 0 0 3
## 8 Merc … 48.8 4 147. 62 3.69 3.19 20 1 0 4
## 9 Merc … 45.6 4 141. 95 3.92 3.15 22.9 1 0 4
## 10 Merc … 38.4 6 168. 123 3.92 3.44 18.3 1 0 4
## # ℹ 22 more rows
## # ℹ 1 more variable: carb <dbl>
Using across()
allows you to specifically select which columns you’d like to apply a given function on. The first part of the across()
argument should therefore specify the columns of interest. Here, where()
is particularly useful as it returns indices (kind of like grep()
) after evaluating a logical expression, such as is.numeric()
. The second part of across()
takes the function you’d like to apply, while the optional .names
argument allows you to create new columns rather than overwrite the old ones.
## # A tibble: 32 × 23
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda… 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda… 21 6 160 110 3.9 2.88 17.0 0 1 4
## 3 Datsu… 22.8 4 108 93 3.85 2.32 18.6 1 1 4
## 4 Horne… 21.4 6 258 110 3.08 3.22 19.4 1 0 3
## 5 Horne… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
## 6 Valia… 18.1 6 225 105 2.76 3.46 20.2 1 0 3
## 7 Duste… 14.3 8 360 245 3.21 3.57 15.8 0 0 3
## 8 Merc … 24.4 4 147. 62 3.69 3.19 20 1 0 4
## 9 Merc … 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
## 10 Merc … 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
## # ℹ 22 more rows
## # ℹ 12 more variables: carb <dbl>, double_mpg <dbl>,
## # double_cyl <dbl>, double_disp <dbl>, double_hp <dbl>,
## # double_drat <dbl>, double_wt <dbl>, double_qsec <dbl>,
## # double_vs <dbl>, double_am <dbl>, double_gear <dbl>,
## # double_carb <dbl>
## # A tibble: 32 × 1
## CAR
## <chr>
## 1 Mazda RX4
## 2 Mazda RX4 Wa
## 3 Datsun 710
## 4 Hornet 4 Driv
## 5 Hornet Sportabou
## 6 Valian
## 7 Duster 360
## 8 Merc 240D
## 9 Merc 230
## 10 Merc 280
## # ℹ 22 more rows
Conditional mutate()
can also be done using mutate_if()
, though the _if()
functions, along with variants _at()
and _all()
have been effectively replaced by across()
, as seen above.
You can overwrite the existing columns completely; also keep in mind that you can pass any function within across()
as such:
## # A tibble: 32 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda… 63 6 480 110 3.9 2.62 16.5 0 1 4
## 2 Mazda… 63 6 480 110 3.9 2.88 17.0 0 1 4
## 3 Datsu… 68.4 4 324 93 3.85 2.32 18.6 1 1 4
## 4 Horne… 64.2 6 774 110 3.08 3.22 19.4 1 0 3
## 5 Horne… 56.1 8 1080 175 3.15 3.44 17.0 0 0 3
## 6 Valia… 54.3 6 675 105 2.76 3.46 20.2 1 0 3
## 7 Duste… 42.9 8 1080 245 3.21 3.57 15.8 0 0 3
## 8 Merc … 73.2 4 440. 62 3.69 3.19 20 1 0 4
## 9 Merc … 68.4 4 422. 95 3.92 3.15 22.9 1 0 4
## 10 Merc … 57.6 6 503. 123 3.92 3.44 18.3 1 0 4
## # ℹ 22 more rows
## # ℹ 1 more variable: carb <dbl>
Iterating across the columns using purrr::modify()
instead:
## # A tibble: 32 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda… 63 18 480 330 11.7 7.86 49.4 0 3 12
## 2 Mazda… 63 18 480 330 11.7 8.62 51.1 0 3 12
## 3 Datsu… 68.4 12 324 279 11.6 6.96 55.8 3 3 12
## 4 Horne… 64.2 18 774 330 9.24 9.64 58.3 3 0 9
## 5 Horne… 56.1 24 1080 525 9.45 10.3 51.1 0 0 9
## 6 Valia… 54.3 18 675 315 8.28 10.4 60.7 3 0 9
## 7 Duste… 42.9 24 1080 735 9.63 10.7 47.5 0 0 9
## 8 Merc … 73.2 12 440. 186 11.1 9.57 60 3 0 12
## 9 Merc … 68.4 12 422. 285 11.8 9.45 68.7 3 0 12
## 10 Merc … 57.6 18 503. 369 11.8 10.3 54.9 3 0 12
## # ℹ 22 more rows
## # ℹ 1 more variable: carb <dbl>
Mutating columns of characters is also straightforward; this is helpful when working with messy columns that require cleaning.
## # A tibble: 32 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda… 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda… 21 6 160 110 3.9 2.88 17.0 0 1 4
## 3 Datsu… 22.8 4 108 93 3.85 2.32 18.6 1 1 4
## 4 Horne… 21.4 6 258 110 3.08 3.22 19.4 1 0 3
## 5 Horne… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
## 6 Valia… 18.1 6 225 105 2.76 3.46 20.2 1 0 3
## 7 Duste… 14.3 8 360 245 3.21 3.57 15.8 0 0 3
## 8 Merce… 24.4 4 147. 62 3.69 3.19 20 1 0 4
## 9 Merce… 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
## 10 Merce… 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
## # ℹ 22 more rows
## # ℹ 1 more variable: carb <dbl>
## # A tibble: 32 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mazda… 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 mazda… 21 6 160 110 3.9 2.88 17.0 0 1 4
## 3 datsu… 22.8 4 108 93 3.85 2.32 18.6 1 1 4
## 4 horne… 21.4 6 258 110 3.08 3.22 19.4 1 0 3
## 5 horne… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
## 6 valia… 18.1 6 225 105 2.76 3.46 20.2 1 0 3
## 7 duste… 14.3 8 360 245 3.21 3.57 15.8 0 0 3
## 8 merc_… 24.4 4 147. 62 3.69 3.19 20 1 0 4
## 9 merc_… 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
## 10 merc_… 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
## # ℹ 22 more rows
## # ℹ 1 more variable: carb <dbl>
2.4 Joining and separating character columns
For the CAR
column comprised of strings, you can separate the individual strings into multiple columns; the extra = 'merge'
argument tells the function to separate the string based on the first instance of ” ” and merge the rest:
## # A tibble: 32 × 13
## Brand Model mpg cyl disp hp drat wt qsec vs am
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1
## 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1
## 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1
## 4 Hornet 4 Dr… 21.4 6 258 110 3.08 3.22 19.4 1 0
## 5 Hornet Spor… 18.7 8 360 175 3.15 3.44 17.0 0 0
## 6 Valia… <NA> 18.1 6 225 105 2.76 3.46 20.2 1 0
## 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0
## 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0
## 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0
## 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0
## # ℹ 22 more rows
## # ℹ 2 more variables: gear <dbl>, carb <dbl>
Chaining multiple tidyverse
verbs can be done using the pipe %>%
; this %>%
symbol is part of the magrittr
package, which is loaded with tidyverse
. Recently as of R version 4.1.0, the native pipe |>
was introduced, which behaves in mostly a similar manner as the magrittr
pipe (though there are a few important differences). Here, I will stick to the %>%
notation.
mtcars %>% separate(CAR, c('Brand', 'Model'), sep = " ", extra = 'merge') %>%
group_by(Brand) %>% summarise(count = n(), mean_mpg = mean(mpg)) %>%
arrange(desc(count))
## # A tibble: 22 × 3
## Brand count mean_mpg
## <chr> <int> <dbl>
## 1 Merc 7 19.0
## 2 Fiat 2 29.8
## 3 Hornet 2 20.0
## 4 Mazda 2 21
## 5 Toyota 2 27.7
## 6 AMC 1 15.2
## 7 Cadillac 1 10.4
## 8 Camaro 1 13.3
## 9 Chrysler 1 14.7
## 10 Datsun 1 22.8
## # ℹ 12 more rows
The pipe notation is frequently used between tidyverse
verbs but you can also use it in some base R operations as well.
Separation of columns by strings, using base R instead involves using strsplit()
which outputs a list and therefore we need to unpack it afterwards:
mtcars$Brand <- unlist(
lapply(strsplit(mtcars$CAR, split = ' '), function(x) x[1])
) # not run
mtcars$Model <- unlist(
lapply(strsplit(mtcars$CAR, split = ' '), function(x) x[2])
) # not run
The opposite of separate()
is unite()
which combines columns into a single column:
data(iris)
iris <- as_tibble(iris)
iris %>% unite('Petal.Dimensions', c(`Petal.Length`, `Petal.Width`),
sep = " x ", remove = FALSE)
## # A tibble: 150 × 6
## Sepal.Length Sepal.Width Petal.Dimensions Petal.Length Petal.Width
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 5.1 3.5 1.4 x 0.2 1.4 0.2
## 2 4.9 3 1.4 x 0.2 1.4 0.2
## 3 4.7 3.2 1.3 x 0.2 1.3 0.2
## 4 4.6 3.1 1.5 x 0.2 1.5 0.2
## 5 5 3.6 1.4 x 0.2 1.4 0.2
## 6 5.4 3.9 1.7 x 0.4 1.7 0.4
## 7 4.6 3.4 1.4 x 0.3 1.4 0.3
## 8 5 3.4 1.5 x 0.2 1.5 0.2
## 9 4.4 2.9 1.4 x 0.2 1.4 0.2
## 10 4.9 3.1 1.5 x 0.1 1.5 0.1
## # ℹ 140 more rows
## # ℹ 1 more variable: Species <fct>
In base R, we create a new column using the $
notation:
2.5 Filtering rows
Typically, filtering involves a condition based on a column to select the corresponding rows. Here I am combining stringr::str_detect()
with dplyr::filter()
since the column I am filtering on is a character vector.
## # A tibble: 2 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4
## # ℹ 1 more variable: carb <dbl>
This function accepts logical operators and regex as well:
## # A tibble: 2 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4
## # ℹ 1 more variable: carb <dbl>
Ignoring upper/lower case distinction using regex(..., ignore_case = TRUE)
:
## # A tibble: 2 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4
## # ℹ 1 more variable: carb <dbl>
Alternatively, using base R using grepl()
:
Using tolower()
to make sure we’re on the same page in regards to case:
## # A tibble: 2 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4
## # ℹ 1 more variable: carb <dbl>
Filtering rows based on a numeric column is also trivial:
## # A tibble: 5 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Hornet… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
## 2 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3
## 3 Merc 2… 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
## 4 Pontia… 19.2 8 400 175 3.08 3.84 17.0 0 0 3
## 5 Ferrar… 19.7 6 145 175 3.62 2.77 15.5 0 1 5
## # ℹ 1 more variable: carb <dbl>
## # A tibble: 21 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda… 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda… 21 6 160 110 3.9 2.88 17.0 0 1 4
## 3 Horne… 21.4 6 258 110 3.08 3.22 19.4 1 0 3
## 4 Horne… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
## 5 Valia… 18.1 6 225 105 2.76 3.46 20.2 1 0 3
## 6 Duste… 14.3 8 360 245 3.21 3.57 15.8 0 0 3
## 7 Merc … 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
## 8 Merc … 17.8 6 168. 123 3.92 3.44 18.9 1 0 4
## 9 Merc … 16.4 8 276. 180 3.07 4.07 17.4 0 0 3
## 10 Merc … 17.3 8 276. 180 3.07 3.73 17.6 0 0 3
## # ℹ 11 more rows
## # ℹ 1 more variable: carb <dbl>
A useful function is the negation of %in%
, which we can define manually:
## # A tibble: 11 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Datsu… 22.8 4 108 93 3.85 2.32 18.6 1 1 4
## 2 Merc … 24.4 4 147. 62 3.69 3.19 20 1 0 4
## 3 Merc … 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
## 4 Fiat … 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4
## 5 Honda… 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4
## 6 Toyot… 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4
## 7 Toyot… 21.5 4 120. 97 3.7 2.46 20.0 1 0 3
## 8 Fiat … 27.3 4 79 66 4.08 1.94 18.9 1 1 4
## 9 Porsc… 26 4 120. 91 4.43 2.14 16.7 0 1 5
## 10 Lotus… 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5
## 11 Volvo… 21.4 4 121 109 4.11 2.78 18.6 1 1 4
## # ℹ 1 more variable: carb <dbl>
In base R we can use which()
instead to subset:
mtcars[which(mtcars$mpg > 18 & mtcars$mpg < 20),] # not run
mtcars[which(mtcars$cyl %in% c(6,8)),] # not run
Base R function subset()
also works:
## # A tibble: 3 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4
## 3 Ferrar… 19.7 6 145 175 3.62 2.77 15.5 0 1 5
## # ℹ 1 more variable: carb <dbl>
2.6 Subsetting columns based on strings
Subsetting columns typically involves using square brackets [ ]
in base R, but can be done easily with dplyr::select()
; select()
supports multiple selector functions such as starts_with()
, ends_with()
, contains()
, and match()
. The selector everything()
also exists, which is used to - you guessed it - select every column.
As a word of caution: sometimes it’s good practice to explicitly state select()
we’re calling is from dplyr
(i.e., dplyr::select()
) to avoid confusion when working with many different packages at once that could potentially introduce conflicts.
## # A tibble: 32 × 2
## mpg am
## <dbl> <dbl>
## 1 21 1
## 2 21 1
## 3 22.8 1
## 4 21.4 0
## 5 18.7 0
## 6 18.1 0
## 7 14.3 0
## 8 24.4 0
## 9 22.8 0
## 10 19.2 0
## # ℹ 22 more rows
You can use multiple helper functions at once:
## # A tibble: 32 × 2
## mpg qsec
## <dbl> <dbl>
## 1 21 16.5
## 2 21 17.0
## 3 22.8 18.6
## 4 21.4 19.4
## 5 18.7 17.0
## 6 18.1 20.2
## 7 14.3 15.8
## 8 24.4 20
## 9 22.8 22.9
## 10 19.2 18.3
## # ℹ 22 more rows
Using regex allows more flexibility in terms of pattern matching, as well as with anchors ^
and $
:
## # A tibble: 32 × 2
## mpg qsec
## <dbl> <dbl>
## 1 21 16.5
## 2 21 17.0
## 3 22.8 18.6
## 4 21.4 19.4
## 5 18.7 17.0
## 6 18.1 20.2
## 7 14.3 15.8
## 8 24.4 20
## 9 22.8 22.9
## 10 19.2 18.3
## # ℹ 22 more rows
Alternatively, in base R and regex:
## # A tibble: 32 × 2
## mpg qsec
## <dbl> <dbl>
## 1 21 16.5
## 2 21 17.0
## 3 22.8 18.6
## 4 21.4 19.4
## 5 18.7 17.0
## 6 18.1 20.2
## 7 14.3 15.8
## 8 24.4 20
## 9 22.8 22.9
## 10 19.2 18.3
## # ℹ 22 more rows
We can also define a vector containing the column names and use select()
using the helper functions such as any_of()
and all_of()
.
## # A tibble: 32 × 3
## mpg cyl am
## <dbl> <dbl> <dbl>
## 1 21 6 1
## 2 21 6 1
## 3 22.8 4 1
## 4 21.4 6 0
## 5 18.7 8 0
## 6 18.1 6 0
## 7 14.3 8 0
## 8 24.4 4 0
## 9 22.8 4 0
## 10 19.2 6 0
## # ℹ 22 more rows
Standard slicing and dicing works just as well:
## # A tibble: 32 × 3
## mpg cyl am
## <dbl> <dbl> <dbl>
## 1 21 6 1
## 2 21 6 1
## 3 22.8 4 1
## 4 21.4 6 0
## 5 18.7 8 0
## 6 18.1 6 0
## 7 14.3 8 0
## 8 24.4 4 0
## 9 22.8 4 0
## 10 19.2 6 0
## # ℹ 22 more rows
any_of()
is particularly useful when the vector of strings contains columns that don’t exist in the data. For example, if my_cols
contained elements called banana
and apple
, running mtcars[,my_cols]
will throw an error - you can’t subset columns that don’t exist. Using any_of()
safely finds columns that do exist in my_cols
and returns them safely.
2.7 Long and wide data formats
Transforming datasets into long/wide formats is a typical task in exploratory data analysis and this can be done using tidyr
:
## # A tibble: 352 × 3
## CAR Metric Values
## <chr> <chr> <dbl>
## 1 Mazda RX4 mpg 21
## 2 Mazda RX4 cyl 6
## 3 Mazda RX4 disp 160
## 4 Mazda RX4 hp 110
## 5 Mazda RX4 drat 3.9
## 6 Mazda RX4 wt 2.62
## 7 Mazda RX4 qsec 16.5
## 8 Mazda RX4 vs 0
## 9 Mazda RX4 am 1
## 10 Mazda RX4 gear 4
## # ℹ 342 more rows
Having the dataset in this long format allows us to create visualizations such as the boxplot much easier.
Pivoting the long format back to a wide format as also straightforward with tidyr
:
mtcars_wide <- mtcars_long %>%
pivot_wider(names_from = 'Metric', values_from = 'Values')
mtcars_wide
## # A tibble: 32 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda… 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda… 21 6 160 110 3.9 2.88 17.0 0 1 4
## 3 Datsu… 22.8 4 108 93 3.85 2.32 18.6 1 1 4
## 4 Horne… 21.4 6 258 110 3.08 3.22 19.4 1 0 3
## 5 Horne… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
## 6 Valia… 18.1 6 225 105 2.76 3.46 20.2 1 0 3
## 7 Duste… 14.3 8 360 245 3.21 3.57 15.8 0 0 3
## 8 Merc … 24.4 4 147. 62 3.69 3.19 20 1 0 4
## 9 Merc … 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
## 10 Merc … 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
## # ℹ 22 more rows
## # ℹ 1 more variable: carb <dbl>
Having the data in a wide format may be useful when plotting visualizations such as a heatmap, where row and column annotations correspond to the heatmap perimeter.
In base R, you can use reshape()
with the argument direction =
but I don’t find its syntax very intuitive. Alternatively, using data.table
gets us the solution much easier:
## CAR Metric value
## 1: Mazda RX4 mpg 21.0
## 2: Mazda RX4 Wag mpg 21.0
## 3: Datsun 710 mpg 22.8
## 4: Hornet 4 Drive mpg 21.4
## 5: Hornet Sportabout mpg 18.7
## ---
## 348: Lotus Europa carb 2.0
## 349: Ford Pantera L carb 4.0
## 350: Ferrari Dino carb 6.0
## 351: Maserati Bora carb 8.0
## 352: Volvo 142E carb 2.0
Note that using data.table
returns a data.table
object (due to the setDT()
function), which differs from the tibble
we’ve been using for packages within tidyverse
. There are some advantages to using data.table
instead, especially when working with very large datasets. For the sake of this book I will mostly use base R and tidyverse
, other than exceptional cases. However, for a brief discourse on data.table
, refer to Chapter 4.
2.8 Trimming strings
Trimming character columns, then re-encoding them as factors:
iris %>% mutate(Species = strtrim(Species, 3)) %>%
mutate(Species = factor(Species, levels = c('set', 'ver', 'vir')))
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 set
## 2 4.9 3 1.4 0.2 set
## 3 4.7 3.2 1.3 0.2 set
## 4 4.6 3.1 1.5 0.2 set
## 5 5 3.6 1.4 0.2 set
## 6 5.4 3.9 1.7 0.4 set
## 7 4.6 3.4 1.4 0.3 set
## 8 5 3.4 1.5 0.2 set
## 9 4.4 2.9 1.4 0.2 set
## 10 4.9 3.1 1.5 0.1 set
## # ℹ 140 more rows
Sometimes when you’re importing datasets from external and/or untrustworthy sources (e.g., an Excel sheet) it’s worth checking for any whitespaces. In that case you can use stringr::str_trim()
to remove all whitespaces prior to data analysis.
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
2.9 Iterating over list of dataframes
Analogously to base R’s split()
, using group_split()
allows us to convert the dataset to a list based on a column; the length of this output list would correspond to the unique number of column entries:
Iteration will be covered in more detail in a future chapter (see Chapter 3), but using base R’s apply
family is straightforward. In this case since we’re working with lists, we will use lapply()
and pass a function (in this case, rename_with()
):
mtcars_lst <- lapply(mtcars_lst, function(x)
rename_with(x, function(y) paste0(y, "_", as.character(unique(x$cyl))), .cols = -1))
mtcars_lst[[1]]
## # A tibble: 11 × 12
## CAR mpg_4 cyl_4 disp_4 hp_4 drat_4 wt_4 qsec_4 vs_4 am_4
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Datsun 7… 22.8 4 108 93 3.85 2.32 18.6 1 1
## 2 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0
## 3 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0
## 4 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1
## 5 Honda Ci… 30.4 4 75.7 52 4.93 1.62 18.5 1 1
## 6 Toyota C… 33.9 4 71.1 65 4.22 1.84 19.9 1 1
## 7 Toyota C… 21.5 4 120. 97 3.7 2.46 20.0 1 0
## 8 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1
## 9 Porsche … 26 4 120. 91 4.43 2.14 16.7 0 1
## 10 Lotus Eu… 30.4 4 95.1 113 3.77 1.51 16.9 1 1
## 11 Volvo 14… 21.4 4 121 109 4.11 2.78 18.6 1 1
## # ℹ 2 more variables: gear_4 <dbl>, carb_4 <dbl>
A tidyverse
alternative to apply()
is purrr::map()
and its variations:
mtcars_lst <- mtcars %>% group_split(cyl)
mtcars_lst <- mtcars_lst %>%
map(~ rename_with(.x, function(y) paste0(y, "_", as.character(unique(.x$cyl))),
.cols = -1))
mtcars_lst[[1]]
## # A tibble: 11 × 12
## CAR mpg_4 cyl_4 disp_4 hp_4 drat_4 wt_4 qsec_4 vs_4 am_4
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Datsun 7… 22.8 4 108 93 3.85 2.32 18.6 1 1
## 2 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0
## 3 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0
## 4 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1
## 5 Honda Ci… 30.4 4 75.7 52 4.93 1.62 18.5 1 1
## 6 Toyota C… 33.9 4 71.1 65 4.22 1.84 19.9 1 1
## 7 Toyota C… 21.5 4 120. 97 3.7 2.46 20.0 1 0
## 8 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1
## 9 Porsche … 26 4 120. 91 4.43 2.14 16.7 0 1
## 10 Lotus Eu… 30.4 4 95.1 113 3.77 1.51 16.9 1 1
## 11 Volvo 14… 21.4 4 121 109 4.11 2.78 18.6 1 1
## # ℹ 2 more variables: gear_4 <dbl>, carb_4 <dbl>
Fitting a linear model is easy using iterations; in this case we fit lm()
on the variables mpg
and gear
to identify their relationship:
mtcars_lst <- mtcars %>% group_split(cyl)
mtcars_lst %>%
map(~ lm(mpg ~ gear, data = .x)) %>%
map(coef)
## [[1]]
## (Intercept) gear
## 15.08125 2.83125
##
## [[2]]
## (Intercept) gear
## 19.82 -0.02
##
## [[3]]
## (Intercept) gear
## 14.525 0.175
Using broom::tidy()
to clean up modelling result and output the model estimates:
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 15.1 10.8 1.39 0.197
## 2 gear 2.83 2.62 1.08 0.308
## 3 (Intercept) 19.8 3.68 5.38 0.00299
## 4 gear -0.0200 0.942 -0.0212 0.984
## 5 (Intercept) 14.5 3.41 4.25 0.00112
## 6 gear 0.175 1.02 0.172 0.866
Using base R and the apply()
family instead:
models <- lapply(mtcars_lst, function(x) lm(mpg ~ gear, data = x))
coefs <- lapply(models, coef)
coefs[[1]]
## (Intercept) gear
## 15.08125 2.83125
2.10 Rowwise operations
df <- tibble(name = c('Brian', 'Connor'),
coffee = sample(1:10, 2),
wine = sample(1:5, 2),
juice = sample(1:5, 2))
df
## # A tibble: 2 × 4
## name coffee wine juice
## <chr> <int> <int> <int>
## 1 Brian 8 3 4
## 2 Connor 7 1 5
Sometimes it’s useful to run calculations in a row-wise fashion; for example using mutate()
with the helper function c_across()
:
df %>% rowwise() %>% mutate(total = sum(c_across(coffee:juice)),
avg = mean(c_across(coffee:juice)))
## # A tibble: 2 × 6
## # Rowwise:
## name coffee wine juice total avg
## <chr> <int> <int> <int> <int> <dbl>
## 1 Brian 8 3 4 15 5
## 2 Connor 7 1 5 13 4.33
Calculating the proportions using a combination of row-wise and column-wise operations with c_across()
and across()
:
df %>% rowwise() %>% mutate(total = sum(c_across(coffee:juice))) %>%
ungroup() %>%
mutate(across(coffee:juice, function(x) x/total,
.names = "{col}_prop."))
## # A tibble: 2 × 8
## name coffee wine juice total coffee_prop. wine_prop. juice_prop.
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Brian 8 3 4 15 0.533 0.2 0.267
## 2 Connor 7 1 5 13 0.538 0.0769 0.385
2.11 Conditional transformation
Conditional transformations are useful when creating categories based on a series of logical statements; this is done using case_when()
to define the conditions:
mtcars %>% mutate(mileage_class =
case_when(mpg > 20 ~ 'High',
mpg < 20 ~ 'Low')) %>%
relocate(mileage_class, .after = mpg)
## # A tibble: 32 × 13
## CAR mpg mileage_class cyl disp hp drat wt qsec
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 High 6 160 110 3.9 2.62 16.5
## 2 Mazda RX4… 21 High 6 160 110 3.9 2.88 17.0
## 3 Datsun 710 22.8 High 4 108 93 3.85 2.32 18.6
## 4 Hornet 4 … 21.4 High 6 258 110 3.08 3.22 19.4
## 5 Hornet Sp… 18.7 Low 8 360 175 3.15 3.44 17.0
## 6 Valiant 18.1 Low 6 225 105 2.76 3.46 20.2
## 7 Duster 360 14.3 Low 8 360 245 3.21 3.57 15.8
## 8 Merc 240D 24.4 High 4 147. 62 3.69 3.19 20
## 9 Merc 230 22.8 High 4 141. 95 3.92 3.15 22.9
## 10 Merc 280 19.2 Low 6 168. 123 3.92 3.44 18.3
## # ℹ 22 more rows
## # ℹ 4 more variables: vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>
When we want to deal with a simple if and else
conditional statement, we can use if_else()
, which is very similar to base R’s ifelse()
, which I will use shortly.
mtcars %>% mutate(Car_Brand = if_else(
str_detect(CAR, 'Mazda'), 'Mazda', 'Not Mazda')) %>%
relocate(Car_Brand, .after = CAR)
## # A tibble: 32 × 13
## CAR Car_Brand mpg cyl disp hp drat wt qsec vs
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda R… Mazda 21 6 160 110 3.9 2.62 16.5 0
## 2 Mazda R… Mazda 21 6 160 110 3.9 2.88 17.0 0
## 3 Datsun … Not Mazda 22.8 4 108 93 3.85 2.32 18.6 1
## 4 Hornet … Not Mazda 21.4 6 258 110 3.08 3.22 19.4 1
## 5 Hornet … Not Mazda 18.7 8 360 175 3.15 3.44 17.0 0
## 6 Valiant Not Mazda 18.1 6 225 105 2.76 3.46 20.2 1
## 7 Duster … Not Mazda 14.3 8 360 245 3.21 3.57 15.8 0
## 8 Merc 24… Not Mazda 24.4 4 147. 62 3.69 3.19 20 1
## 9 Merc 230 Not Mazda 22.8 4 141. 95 3.92 3.15 22.9 1
## 10 Merc 280 Not Mazda 19.2 6 168. 123 3.92 3.44 18.3 1
## # ℹ 22 more rows
## # ℹ 3 more variables: am <dbl>, gear <dbl>, carb <dbl>
In cases where we’re working with character columns, it’s useful to use grepl
to match specific strings. After all, formula within case_when()
needs to evaluate to a boolean on the left-hand side (e.g., mpg > 20
). Additionally, in cases where the condition is not specified or met, you can use TRUE ~
within case_when()
as such:
## # A tibble: 32 × 12
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda… 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda… 21 6 160 110 3.9 2.88 17.0 0 1 4
## 3 Datsu… 22.8 4 108 93 3.85 2.32 18.6 1 1 4
## 4 Horne… 21.4 6 258 110 3.08 3.22 19.4 1 0 3
## 5 Horne… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
## 6 Valia… 18.1 6 225 105 2.76 3.46 20.2 1 0 3
## 7 Duste… 14.3 8 360 245 3.21 3.57 15.8 0 0 3
## 8 MERC … 24.4 4 147. 62 3.69 3.19 20 1 0 4
## 9 MERC … 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
## 10 MERC … 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
## # ℹ 22 more rows
## # ℹ 1 more variable: carb <dbl>
Conditional mutate using base R involves using ifelse()
:
mtcars$mileage_class <- ifelse(
mtcars$mpg > 20, 'High', 'Low'
)
subset(mtcars, select = c(CAR, mpg, mileage_class, cyl:carb))
## # A tibble: 32 × 13
## CAR mpg mileage_class cyl disp hp drat wt qsec
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 High 6 160 110 3.9 2.62 16.5
## 2 Mazda RX4… 21 High 6 160 110 3.9 2.88 17.0
## 3 Datsun 710 22.8 High 4 108 93 3.85 2.32 18.6
## 4 Hornet 4 … 21.4 High 6 258 110 3.08 3.22 19.4
## 5 Hornet Sp… 18.7 Low 8 360 175 3.15 3.44 17.0
## 6 Valiant 18.1 Low 6 225 105 2.76 3.46 20.2
## 7 Duster 360 14.3 Low 8 360 245 3.21 3.57 15.8
## 8 Merc 240D 24.4 High 4 147. 62 3.69 3.19 20
## 9 Merc 230 22.8 High 4 141. 95 3.92 3.15 22.9
## 10 Merc 280 19.2 Low 6 168. 123 3.92 3.44 18.3
## # ℹ 22 more rows
## # ℹ 4 more variables: vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>
2.12 Missing values
Handling missing values is tedious but often required when working with dodgy data.
First, for the sake of our exercise we insert NAs randomly in the mtcars
dataset:
mtcars_NA <- map_df(mtcars, function(x)
{x[sample(c(TRUE, NA), prob = c(0.95, 0.01), size = length(x), replace = TRUE)]})
mtcars_NA
## # A tibble: 32 × 13
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda… 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda… 21 6 160 110 3.9 2.88 17.0 0 1 4
## 3 Datsu… 22.8 4 108 93 3.85 2.32 18.6 1 1 4
## 4 Horne… 21.4 6 258 110 3.08 3.22 19.4 1 0 3
## 5 Horne… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
## 6 Valia… 18.1 6 225 105 2.76 3.46 20.2 1 0 3
## 7 Duste… 14.3 8 360 NA 3.21 3.57 15.8 0 0 3
## 8 Merc … 24.4 4 147. 62 3.69 3.19 20 1 0 4
## 9 Merc … 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
## 10 Merc … 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
## # ℹ 22 more rows
## # ℹ 2 more variables: carb <dbl>, mileage_class <chr>
Remove rows where any NA occurs:
## # A tibble: 29 × 13
## CAR mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda… 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda… 21 6 160 110 3.9 2.88 17.0 0 1 4
## 3 Datsu… 22.8 4 108 93 3.85 2.32 18.6 1 1 4
## 4 Horne… 21.4 6 258 110 3.08 3.22 19.4 1 0 3
## 5 Horne… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
## 6 Valia… 18.1 6 225 105 2.76 3.46 20.2 1 0 3
## 7 Merc … 24.4 4 147. 62 3.69 3.19 20 1 0 4
## 8 Merc … 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
## 9 Merc … 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
## 10 Merc … 16.4 8 276. 180 3.07 4.07 17.4 0 0 3
## # ℹ 19 more rows
## # ℹ 2 more variables: carb <dbl>, mileage_class <chr>
Identify columns with NAs and the number of occurrences:
## CAR mpg cyl disp
## 0 0 0 0
## hp drat wt qsec
## 1 1 0 1
## vs am gear carb
## 0 0 0 0
## mileage_class
## 0
Remove columns with more than one missing value:
## # A tibble: 32 × 10
## CAR mpg cyl disp wt vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 160 2.62 0 1 4 4
## 2 Mazda RX4 Wag 21 6 160 2.88 0 1 4 4
## 3 Datsun 710 22.8 4 108 2.32 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 3.22 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360 3.44 0 0 3 2
## 6 Valiant 18.1 6 225 3.46 1 0 3 1
## 7 Duster 360 14.3 8 360 3.57 0 0 3 4
## 8 Merc 240D 24.4 4 147. 3.19 1 0 4 2
## 9 Merc 230 22.8 4 141. 3.15 1 0 4 2
## 10 Merc 280 19.2 6 168. 3.44 1 0 4 4
## # ℹ 22 more rows
## # ℹ 1 more variable: mileage_class <chr>
Replace missing values with zero using tidyr::replace_na()
:
Base R and using is.na()
instead:
2.13 Joining dataframes
Mutating joins in tidyverse
are analogous to the inner and outer joins in SQL syntax:
df1 <- tibble(
name = c('Brian', 'Connor', 'Jon'),
city = c('Tokyo', 'London', 'Milan'),
age = c(28, 25, 21)
)
df2 <- tibble(
name = c('Brian', 'Connor'),
hair = c('black', 'brown'),
eyes = c('dark', 'hazel')
)
## # A tibble: 2 × 5
## name city age hair eyes
## <chr> <chr> <dbl> <chr> <chr>
## 1 Brian Tokyo 28 black dark
## 2 Connor London 25 brown hazel
## # A tibble: 3 × 5
## name city age hair eyes
## <chr> <chr> <dbl> <chr> <chr>
## 1 Brian Tokyo 28 black dark
## 2 Connor London 25 brown hazel
## 3 Jon Milan 21 <NA> <NA>
For a great visualization of the different join functions in tidyverse
, check out the chapter on relational data in Hadley’s book R for Data Science here.
Base R uses merge()
with the argument all.x =
and all_y =
:
## name city age hair eyes
## 1 Brian Tokyo 28 black dark
## 2 Connor London 25 brown hazel
## name city age hair eyes
## 1 Brian Tokyo 28 black dark
## 2 Connor London 25 brown hazel
## 3 Jon Milan 21 <NA> <NA>