Chapter 4 Interlude I: A brief glimpse into data.table
In the first chapter, we saw various practical solutions in data wrangling using tidyverse
and base R. One topic that has not been discussed is the idea of computational efficiency and runtime - this matter has been trivial so far since the datasets have been considerably tiny. However, when working with large data, it may be in the user’s interest to try using an alternative package designed for reducing programming and computation time - data.table
. The vignette is available here.
Similarly to how a tibble
is an enhanced form of a data.frame
in tidyverse
, data.table
uses an object class called a data.table
. Using fread()
to read in data - whether the argument corresponds to a local file or a URL pointing to a dataset - automatically generates a data.table
object. Converting a regular data.frame
to a data.table
is done with setDT()
.
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 7: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 8: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 10: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## [1] "data.table"
4.1 Data wrangling operations
The vignette does a great job explaining the syntax of data.table
in detail, but the takeaway message is that it subsets the data by i
, performs an operation according to j
, then groups it using by =
. That is, DT[i, j, by]
.
## mean_mileage
## 1: 19.74286
Above, data.table
has subsetted the object based on cyl == 6
then calculated the mean mileage. The j
is wrapped around .()
, which is equivalent to list()
- this is because the columns in a table are analogous to a list
object and we want to return a data.table
as our output rather than an atomic vector.
## [1] "data.table" "data.frame"
Multiple calculations can be performed in j
:
## mean_mileage median_wt
## 1: 19.75 3.1575
Calculating the number of rows in j
uses a special variable .N
.
## [1] 4
The j
argument can be used to select columns after subsetting rows with i
; this is analogous to filter()
and select()
in dplyr
:
## mpg wt gear
## 1: 21.0 2.620 4
## 2: 21.0 2.875 4
## 3: 22.8 2.320 4
## 4: 21.4 3.215 3
## 5: 18.7 3.440 3
## 6: 18.1 3.460 3
## 7: 14.3 3.570 3
## 8: 24.4 3.190 4
## 9: 22.8 3.150 4
## 10: 19.2 3.440 4
## mpg wt gear
## 1: 21.0 2.620 4
## 2: 21.0 2.875 4
## 3: 22.8 2.320 4
## 4: 21.4 3.215 3
## 5: 18.7 3.440 3
## 6: 18.1 3.460 3
## 7: 14.3 3.570 3
## 8: 24.4 3.190 4
## 9: 22.8 3.150 4
## 10: 19.2 3.440 4
Using by =
argument is similar to group_by()
in dplyr
:
## cyl mean_mileage median_wt
## 1: 6 19.74286 3.215
## 2: 4 26.66364 2.200
## 3: 8 15.10000 3.755
## cyl N
## 1: 6 7
## 2: 4 11
## 3: 8 14
## cyl gear N
## 1: 6 4 2
## 2: 8 3 12
## 3: 4 5 1
## 4: 8 5 2
## 5: 6 5 1
Piping multiple operations together in data.table
is straightforward:
## cyl gear mean_mpg
## 1: 6 4 21.00
## 2: 8 3 15.05
## 3: 4 5 26.00
## 4: 8 5 15.40
## 5: 6 5 19.70
4.2 .SD
, .SDcols
, and :=
For slightly more difficult operations, we need to define three new concepts: firstly, the .SD
variable points to the current subset of data.
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 4: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 5: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 6: 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 7: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
In above context, the .SD
doesn’t do much. but this special variable is useful when you’re doing operations over multiple columns. Using .SDcols
with .SD
allows user to specifically point to columns across the current subset of data.
## disp hp drat
## 1: 160.0 110 3.90
## 2: 160.0 110 3.90
## 3: 258.0 110 3.08
## 4: 225.0 105 2.76
## 5: 167.6 123 3.92
## 6: 167.6 123 3.92
## 7: 145.0 175 3.62
This means we can easily perform operations across a subset of columns:
## cyl disp hp drat
## 1: 6 183.3143 122.28571 3.585714
## 2: 4 105.1364 82.63636 4.070909
## 3: 8 353.1000 209.21429 3.229286
.SDcols
is flexible because it also accepts indices:
col_idx <- colnames(mtcars) %in% c('disp', 'hp', 'drat')
mtcars[, lapply(.SD, mean), by = cyl, .SDcols = col_idx]
## cyl disp hp drat
## 1: 6 183.3143 122.28571 3.585714
## 2: 4 105.1364 82.63636 4.070909
## 3: 8 353.1000 209.21429 3.229286
Using the :=
operator allows user to define new columns in one of two ways: firstly, in a simple LHS := RHS
syntax; this creates a new column but does not print the result to the console.
## mpg cyl disp hp drat wt qsec vs am gear carb HpPerMpg
## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 5.238095
## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 5.238095
## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4.078947
## 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 5.140187
## 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 9.358289
## 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 5.801105
This allows users to remove columns by setting the RHS to NULL
:
Subsetting using i
allows for condition-based operations, similar to mutate(case_when())
in dplyr
:
## mpg cyl disp hp drat wt qsec vs am gear carb HpPerMpg
## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 5.238095
## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 5.238095
## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4.078947
## 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 5.140187
## 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 9.358289
## 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 5.801105
## CylThreshold
## 1: Over 6
## 2: Over 6
## 3: Under 6
## 4: Over 6
## 5: Under 6
## 6: Over 6
Secondly, :=
can be used in a functional form:
## mpg cyl disp hp drat wt qsec vs am gear carb HpPerMpg
## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 5.238095
## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 5.238095
## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4.078947
## 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 5.140187
## 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 9.358289
## 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 5.801105
## CylThreshold MpgXCyl
## 1: Over 6 126.0
## 2: Over 6 126.0
## 3: Under 6 91.2
## 4: Over 6 128.4
## 5: Under 6 149.6
## 6: Over 6 108.6
Combining the :=
with by =
:
## mpg cyl disp hp drat wt qsec vs am gear carb HpPerMpg
## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 5.238095
## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 5.238095
## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4.078947
## 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 5.140187
## 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 9.358289
## 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 5.801105
## CylThreshold MpgXCyl mean_mileage
## 1: Over 6 126.0 20.56667
## 2: Over 6 126.0 20.56667
## 3: Under 6 91.2 26.73000
## 4: Over 6 128.4 19.12500
## 5: Under 6 149.6 15.10000
## 6: Over 6 108.6 19.12500
Combining .SD
with the :=
operator:
mtcars[, c('max_disp', 'max_hp', 'max_wt') := lapply(.SD, max),
by = cyl, .SDcols = c('disp', 'hp', 'wt')]
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb HpPerMpg
## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 5.238095
## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 5.238095
## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4.078947
## 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 5.140187
## 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 9.358289
## 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 5.801105
## CylThreshold MpgXCyl mean_mileage max_disp max_hp max_wt
## 1: Over 6 126.0 20.56667 258.0 175 3.460
## 2: Over 6 126.0 20.56667 258.0 175 3.460
## 3: Under 6 91.2 26.73000 146.7 113 3.190
## 4: Over 6 128.4 19.12500 258.0 175 3.460
## 5: Under 6 149.6 15.10000 472.0 335 5.424
## 6: Over 6 108.6 19.12500 258.0 175 3.460
Finally, a strange behaviour is observed when we start making copies of data; for example:
## [1] TRUE
Now see what happens when we change one of the columns in iris2
using :=
:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 5.1 3.5 1.4 0.002 setosa
## 2: 4.9 3.0 1.4 0.002 setosa
## 3: 4.7 3.2 1.3 0.002 setosa
## 4: 4.6 3.1 1.5 0.002 setosa
## 5: 5.0 3.6 1.4 0.002 setosa
## ---
## 146: 6.7 3.0 5.2 0.023 virginica
## 147: 6.3 2.5 5.0 0.019 virginica
## 148: 6.5 3.0 5.2 0.020 virginica
## 149: 6.2 3.4 5.4 0.023 virginica
## 150: 5.9 3.0 5.1 0.018 virginica
It turns out that changing iris2
has also changed the original data iris
:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 5.1 3.5 1.4 0.002 setosa
## 2: 4.9 3.0 1.4 0.002 setosa
## 3: 4.7 3.2 1.3 0.002 setosa
## 4: 4.6 3.1 1.5 0.002 setosa
## 5: 5.0 3.6 1.4 0.002 setosa
## 6: 5.4 3.9 1.7 0.004 setosa
## [1] TRUE
However, if we use <-
to change one of the columns of iris2
, the original iris
data does not change:
## [1] FALSE
The rationale for this behaviour is well-explained in this stackoverflow post, but essentially what happens is that :=
operator modifies by reference. Both iris2
and iris
are pointing to the same location after copying initially with <-
. Thus when we modify the copy of iris
by reference, there is no need to copy the entire dataset iris
to alter its copy. On the other hand, changing iris2
using <-
will copy the entire thing even if we’re only changing just one column. This behaviour is undesirable when we’re working with very large data.
To avoid changing the original dataset but still use :=
to update a copy, data.table
uses the copy()
function:
iris3 <- copy(iris)
iris3[, Petal.Width := Petal.Width/100]
identical(iris, iris3) # only the iris3 object was changed here
## [1] FALSE
4.3 Reshaping data using melt
and dcast
In the first chapter, I briefly touched on melt()
as an alternative to tidr::pivot_longer()
. Base R’s equivalent reshape()
is rather clunky to use, so I much prefer the tidyr
or data.table
solutions.
DT <- data.table(
Team = c('Tottenham', 'Arsenal', 'Chelsea', 'ManUnited'),
Wins = c(7, 3, 4, 6),
Goals = c(29, 18, 22, 26),
CleanSheets = c(3, 1, 2, 3)
)
DT
## Team Wins Goals CleanSheets
## 1: Tottenham 7 29 3
## 2: Arsenal 3 18 1
## 3: Chelsea 4 22 2
## 4: ManUnited 6 26 3
DT_long <- melt(DT, id.vars = 'Team', measure.vars = c('Wins', 'Goals', 'CleanSheets'),
variable.name = 'Stat', value.name = 'Value')
DT_long
## Team Stat Value
## 1: Tottenham Wins 7
## 2: Arsenal Wins 3
## 3: Chelsea Wins 4
## 4: ManUnited Wins 6
## 5: Tottenham Goals 29
## ---
## 8: ManUnited Goals 26
## 9: Tottenham CleanSheets 3
## 10: Arsenal CleanSheets 1
## 11: Chelsea CleanSheets 2
## 12: ManUnited CleanSheets 3
The data.table
equivalent to tidyr::pivot_wider()
is dcast()
; this function takes in a formula as an argument (~
) where the LHS corresponds to the id.vars
and the RHS corresponds to the column that originated from the measure.vars
. Running this yields the original dataset:
## Team Wins Goals CleanSheets
## 1: Arsenal 3 18 1
## 2: Chelsea 4 22 2
## 3: ManUnited 6 26 3
## 4: Tottenham 7 29 3