Chapter 4 Interlude I: A brief glimpse into data.table

In the first chapter, we saw various practical solutions in data wrangling using tidyverse and base R. One topic that has not been discussed is the idea of computational efficiency and runtime - this matter has been trivial so far since the datasets have been considerably tiny. However, when working with large data, it may be in the user’s interest to try using an alternative package designed for reducing programming and computation time - data.table. The vignette is available here.

library(data.table)
options(datatable.print.nrows=10)

Similarly to how a tibble is an enhanced form of a data.frame in tidyverse, data.table uses an object class called a data.table. Using fread() to read in data - whether the argument corresponds to a local file or a URL pointing to a dataset - automatically generates a data.table object. Converting a regular data.frame to a data.table is done with setDT().

data(mtcars)
setDT(mtcars)
head(mtcars, 10)
##      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
##  1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
##  2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
##  3: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
##  4: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
##  5: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
##  6: 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
##  7: 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
##  8: 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
##  9: 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 10: 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
class(mtcars)[1]
## [1] "data.table"

4.1 Data wrangling operations

The vignette does a great job explaining the syntax of data.table in detail, but the takeaway message is that it subsets the data by i, performs an operation according to j, then groups it using by =. That is, DT[i, j, by].

mtcars[cyl == 6, .(mean_mileage = mean(mpg))]
##    mean_mileage
## 1:     19.74286

Above, data.table has subsetted the object based on cyl == 6 then calculated the mean mileage. The j is wrapped around .(), which is equivalent to list() - this is because the columns in a table are analogous to a list object and we want to return a data.table as our output rather than an atomic vector.

class(mtcars[cyl == 6, .(mean_mileage = mean(mpg))])
## [1] "data.table" "data.frame"

Multiple calculations can be performed in j:

mtcars[cyl == 6 & gear == 4, .(mean_mileage = mean(mpg), median_wt = median(wt))]
##    mean_mileage median_wt
## 1:        19.75    3.1575

Calculating the number of rows in j uses a special variable .N.

mtcars[cyl == 6 & gear == 4, .N]
## [1] 4

The j argument can be used to select columns after subsetting rows with i; this is analogous to filter() and select() in dplyr:

mtcars[, .(mpg, wt, gear)][1:10]
##      mpg    wt gear
##  1: 21.0 2.620    4
##  2: 21.0 2.875    4
##  3: 22.8 2.320    4
##  4: 21.4 3.215    3
##  5: 18.7 3.440    3
##  6: 18.1 3.460    3
##  7: 14.3 3.570    3
##  8: 24.4 3.190    4
##  9: 22.8 3.150    4
## 10: 19.2 3.440    4
my_cols <- c('mpg', 'wt', 'gear')
mtcars[, ..my_cols][1:10]
##      mpg    wt gear
##  1: 21.0 2.620    4
##  2: 21.0 2.875    4
##  3: 22.8 2.320    4
##  4: 21.4 3.215    3
##  5: 18.7 3.440    3
##  6: 18.1 3.460    3
##  7: 14.3 3.570    3
##  8: 24.4 3.190    4
##  9: 22.8 3.150    4
## 10: 19.2 3.440    4

Using by = argument is similar to group_by() in dplyr:

mtcars[, .(mean_mileage = mean(mpg), median_wt = median(wt)), by = cyl]
##    cyl mean_mileage median_wt
## 1:   6     19.74286     3.215
## 2:   4     26.66364     2.200
## 3:   8     15.10000     3.755
mtcars[, .N, by = cyl]
##    cyl  N
## 1:   6  7
## 2:   4 11
## 3:   8 14
mtcars[vs == 0, .N, by = .(cyl, gear)]
##    cyl gear  N
## 1:   6    4  2
## 2:   8    3 12
## 3:   4    5  1
## 4:   8    5  2
## 5:   6    5  1

Piping multiple operations together in data.table is straightforward:

mtcars[vs == 0, .(mpg, cyl, gear)][,.(mean_mpg = mean(mpg)), by = .(cyl, gear)]
##    cyl gear mean_mpg
## 1:   6    4    21.00
## 2:   8    3    15.05
## 3:   4    5    26.00
## 4:   8    5    15.40
## 5:   6    5    19.70

4.2 .SD, .SDcols, and :=

For slightly more difficult operations, we need to define three new concepts: firstly, the .SD variable points to the current subset of data.

mtcars[cyl == 6, .SD]
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 4: 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 5: 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 6: 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 7: 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

In above context, the .SD doesn’t do much. but this special variable is useful when you’re doing operations over multiple columns. Using .SDcols with .SD allows user to specifically point to columns across the current subset of data.

mtcars[cyl == 6, .SD, .SDcols = c('disp', 'hp', 'drat')]
##     disp  hp drat
## 1: 160.0 110 3.90
## 2: 160.0 110 3.90
## 3: 258.0 110 3.08
## 4: 225.0 105 2.76
## 5: 167.6 123 3.92
## 6: 167.6 123 3.92
## 7: 145.0 175 3.62

This means we can easily perform operations across a subset of columns:

mtcars[, lapply(.SD, mean), by = cyl, .SDcols = c('disp', 'hp', 'drat')]
##    cyl     disp        hp     drat
## 1:   6 183.3143 122.28571 3.585714
## 2:   4 105.1364  82.63636 4.070909
## 3:   8 353.1000 209.21429 3.229286

.SDcols is flexible because it also accepts indices:

col_idx <- colnames(mtcars) %in% c('disp', 'hp', 'drat')
mtcars[, lapply(.SD, mean), by = cyl, .SDcols = col_idx]
##    cyl     disp        hp     drat
## 1:   6 183.3143 122.28571 3.585714
## 2:   4 105.1364  82.63636 4.070909
## 3:   8 353.1000 209.21429 3.229286

Using the := operator allows user to define new columns in one of two ways: firstly, in a simple LHS := RHS syntax; this creates a new column but does not print the result to the console.

mtcars[, HpPerMpg := .(hp/mpg)]
head(mtcars)
##     mpg cyl disp  hp drat    wt  qsec vs am gear carb HpPerMpg
## 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 5.238095
## 2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 5.238095
## 3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 4.078947
## 4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 5.140187
## 5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 9.358289
## 6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 5.801105

This allows users to remove columns by setting the RHS to NULL:

# not run
mtcars[, HpPerMpg := NULL]

Subsetting using i allows for condition-based operations, similar to mutate(case_when()) in dplyr:

mtcars[cyl == 6, CylThreshold := 'Over 6'][cyl != 6, CylThreshold := 'Under 6']
head(mtcars)
##     mpg cyl disp  hp drat    wt  qsec vs am gear carb HpPerMpg
## 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 5.238095
## 2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 5.238095
## 3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 4.078947
## 4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 5.140187
## 5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 9.358289
## 6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 5.801105
##    CylThreshold
## 1:       Over 6
## 2:       Over 6
## 3:      Under 6
## 4:       Over 6
## 5:      Under 6
## 6:       Over 6

Secondly, := can be used in a functional form:

mtcars[, `:=`(HpPerMpg = hp/mpg, MpgXCyl = mpg*cyl)]
head(mtcars)
##     mpg cyl disp  hp drat    wt  qsec vs am gear carb HpPerMpg
## 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 5.238095
## 2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 5.238095
## 3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 4.078947
## 4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 5.140187
## 5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 9.358289
## 6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 5.801105
##    CylThreshold MpgXCyl
## 1:       Over 6   126.0
## 2:       Over 6   126.0
## 3:      Under 6    91.2
## 4:       Over 6   128.4
## 5:      Under 6   149.6
## 6:       Over 6   108.6

Combining the := with by =:

mtcars[, `:=`(mean_mileage = mean(mpg)), by = .(cyl, vs)]
head(mtcars)
##     mpg cyl disp  hp drat    wt  qsec vs am gear carb HpPerMpg
## 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 5.238095
## 2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 5.238095
## 3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 4.078947
## 4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 5.140187
## 5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 9.358289
## 6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 5.801105
##    CylThreshold MpgXCyl mean_mileage
## 1:       Over 6   126.0     20.56667
## 2:       Over 6   126.0     20.56667
## 3:      Under 6    91.2     26.73000
## 4:       Over 6   128.4     19.12500
## 5:      Under 6   149.6     15.10000
## 6:       Over 6   108.6     19.12500

Combining .SD with the := operator:

mtcars[, c('max_disp', 'max_hp', 'max_wt') := lapply(.SD, max), 
       by = cyl, .SDcols = c('disp', 'hp', 'wt')]
head(mtcars)
##     mpg cyl disp  hp drat    wt  qsec vs am gear carb HpPerMpg
## 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 5.238095
## 2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 5.238095
## 3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 4.078947
## 4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 5.140187
## 5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 9.358289
## 6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 5.801105
##    CylThreshold MpgXCyl mean_mileage max_disp max_hp max_wt
## 1:       Over 6   126.0     20.56667    258.0    175  3.460
## 2:       Over 6   126.0     20.56667    258.0    175  3.460
## 3:      Under 6    91.2     26.73000    146.7    113  3.190
## 4:       Over 6   128.4     19.12500    258.0    175  3.460
## 5:      Under 6   149.6     15.10000    472.0    335  5.424
## 6:       Over 6   108.6     19.12500    258.0    175  3.460

Finally, a strange behaviour is observed when we start making copies of data; for example:

data(iris)
setDT(iris)
iris2 <- iris
identical(iris, iris2)
## [1] TRUE

Now see what happens when we change one of the columns in iris2 using :=:

iris2[, Petal.Width := Petal.Width/100]
iris2
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
##   1:          5.1         3.5          1.4       0.002
##   2:          4.9         3.0          1.4       0.002
##   3:          4.7         3.2          1.3       0.002
##   4:          4.6         3.1          1.5       0.002
##   5:          5.0         3.6          1.4       0.002
##  ---                                                  
## 146:          6.7         3.0          5.2       0.023
## 147:          6.3         2.5          5.0       0.019
## 148:          6.5         3.0          5.2       0.020
## 149:          6.2         3.4          5.4       0.023
## 150:          5.9         3.0          5.1       0.018
##        Species
##   1:    setosa
##   2:    setosa
##   3:    setosa
##   4:    setosa
##   5:    setosa
##  ---          
## 146: virginica
## 147: virginica
## 148: virginica
## 149: virginica
## 150: virginica

It turns out that changing iris2 has also changed the original data iris:

head(iris)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          5.1         3.5          1.4       0.002  setosa
## 2:          4.9         3.0          1.4       0.002  setosa
## 3:          4.7         3.2          1.3       0.002  setosa
## 4:          4.6         3.1          1.5       0.002  setosa
## 5:          5.0         3.6          1.4       0.002  setosa
## 6:          5.4         3.9          1.7       0.004  setosa
identical(iris, iris2)
## [1] TRUE

However, if we use <- to change one of the columns of iris2, the original iris data does not change:

iris2$Petal.Length <- iris2$Petal.Length/100
identical(iris, iris2)
## [1] FALSE

The rationale for this behaviour is well-explained in this stackoverflow post, but essentially what happens is that := operator modifies by reference. Both iris2 and iris are pointing to the same location after copying initially with <-. Thus when we modify the copy of iris by reference, there is no need to copy the entire dataset iris to alter its copy. On the other hand, changing iris2 using <- will copy the entire thing even if we’re only changing just one column. This behaviour is undesirable when we’re working with very large data.

To avoid changing the original dataset but still use := to update a copy, data.table uses the copy() function:

iris3 <- copy(iris)
iris3[, Petal.Width := Petal.Width/100]
identical(iris, iris3) # only the iris3 object was changed here
## [1] FALSE

4.3 Reshaping data using melt and dcast

In the first chapter, I briefly touched on melt() as an alternative to tidr::pivot_longer(). Base R’s equivalent reshape() is rather clunky to use, so I much prefer the tidyr or data.table solutions.

DT <- data.table(
  Team = c('Tottenham', 'Arsenal', 'Chelsea', 'ManUnited'),
  Wins = c(7, 3, 4, 6),
  Goals = c(29, 18, 22, 26),
  CleanSheets = c(3, 1, 2, 3)
)
DT
##         Team Wins Goals CleanSheets
## 1: Tottenham    7    29           3
## 2:   Arsenal    3    18           1
## 3:   Chelsea    4    22           2
## 4: ManUnited    6    26           3
DT_long <- melt(DT, id.vars = 'Team', measure.vars = c('Wins', 'Goals', 'CleanSheets'),
                variable.name = 'Stat', value.name = 'Value')
DT_long
##          Team        Stat Value
##  1: Tottenham        Wins     7
##  2:   Arsenal        Wins     3
##  3:   Chelsea        Wins     4
##  4: ManUnited        Wins     6
##  5: Tottenham       Goals    29
## ---                            
##  8: ManUnited       Goals    26
##  9: Tottenham CleanSheets     3
## 10:   Arsenal CleanSheets     1
## 11:   Chelsea CleanSheets     2
## 12: ManUnited CleanSheets     3

The data.table equivalent to tidyr::pivot_wider() is dcast(); this function takes in a formula as an argument (~) where the LHS corresponds to the id.vars and the RHS corresponds to the column that originated from the measure.vars. Running this yields the original dataset:

dcast(DT_long, Team ~ Stat, value.var = 'Value')
##         Team Wins Goals CleanSheets
## 1:   Arsenal    3    18           1
## 2:   Chelsea    4    22           2
## 3: ManUnited    6    26           3
## 4: Tottenham    7    29           3