Chapter 4 data.table

data.table is an excellent extension of the data.frame class. If used as a data.frame it will look and feel like a data frame. If, however, it is used with it’s unique capabilities, it will prove faster and easier to manipulate.

Let’s start with importing some freely available car sales data from Kaggle.

library(data.table)
library(magrittr)
auto <- fread('data/autos.csv')

## Warning in fread("data/autos.csv"): Found and resolved improper
## quoting out-of-sample. First healed line 5263: <<2016-03-29
## 16:46:46,"_SPARDOSE"______Polo_1_4___6N1___60PS___5Tuerer____FESTPREIS,privat,Angebot,
## 500,control,limousine,1999,manuell,60,polo,150000,12,benzin,volkswagen,ja,
## 2016-03-25 00:00:00,0,59581,2016-03-30 11:46:58>>. If the fields are
## not quoted (e.g. field separator does not appear within any field), try
## quote="" to avoid this warning.

View(auto)

dim(auto) #  Rows and columns

## [1] 371824     20

names(auto) # Variable names

##  [1] "dateCrawled"         "name"                "seller"             
##  [4] "offerType"           "price"               "abtest"             
##  [7] "vehicleType"         "yearOfRegistration"  "gearbox"            
## [10] "powerPS"             "model"               "kilometer"          
## [13] "monthOfRegistration" "fuelType"            "brand"              
## [16] "notRepairedDamage"   "dateCreated"         "nrOfPictures"       
## [19] "postalCode"          "lastSeen"

class(auto) # Object class

## [1] "data.table" "data.frame"

file.info('data/autos.csv') # File info on disk

##                    size isdir mode               mtime               ctime
## data/autos.csv 68439217 FALSE  644 2019-02-05 17:38:33 2019-02-05 17:38:33
##                              atime  uid  gid uname grname
## data/autos.csv 2020-12-02 13:04:40 1026 1027   ron    ron

gdata::humanReadable(68439217)

## [1] "65.3 MiB"

object.size(auto) %>% print(units = 'auto') # File size in memory

## 103.3 Mb

Things to note:

The import has been done with fread instead of read.csv. This is more efficient, and directly creates a data.table object.
The import is very fast.
The data after import is slightly larger than when stored on disk (in this case).

Let’s start with verifying that it behaves like a data.frame when expected.

auto[,2] %>% head

##                                                  name
## 1:                                         Golf_3_1.6
## 2:                               A5_Sportback_2.7_Tdi
## 3:                     Jeep_Grand_Cherokee_"Overland"
## 4:                              GOLF_4_1_4__3T\xdcRER
## 5:                     Skoda_Fabia_1.4_TDI_PD_Classic
## 6: BMW_316i___e36_Limousine___Bastlerfahrzeug__Export

auto[[2]] %>% head

## [1] "Golf_3_1.6"                                        
## [2] "A5_Sportback_2.7_Tdi"                              
## [3] "Jeep_Grand_Cherokee_\"Overland\""                  
## [4] "GOLF_4_1_4__3T\xdcRER"                             
## [5] "Skoda_Fabia_1.4_TDI_PD_Classic"                    
## [6] "BMW_316i___e36_Limousine___Bastlerfahrzeug__Export"

auto[1,2] %>% head

##          name
## 1: Golf_3_1.6

But notice the difference between data.frame and data.table when subsetting multiple rows. Uhh!

auto[1:3] %>% dim # data.table will exctract *rows*

## [1]  3 20

as.data.frame(auto)[1:3] %>% dim # data.frame will exctract *columns*

## [1] 371824      3

Just use columns (,) and be explicit regarding the dimension you are extracting…

Now let’s do some data.table specific operations. The general syntax has the form DT[i,j,by]. SQL users may think of i as WHERE, j as SELECT, and by as GROUP BY. We don’t need to name the arguments explicitly. Also, the Tab key will typically help you to fill in column names.

auto[,vehicleType,] %>% table # Exctract column and tabulate

## .
##                andere        bus     cabrio      coupe kleinwagen 
##      37899       3362      30220      22914      19026      80098 
##      kombi  limousine        suv 
##      67626      95963      14716

auto[vehicleType=='coupe',,] %>% dim # Exctract rows

## [1] 19026    20

auto[,gearbox:model,] %>% head # exctract column range

##      gearbox powerPS model
## 1:   manuell       0  golf
## 2:   manuell     190      
## 3: automatik     163 grand
## 4:   manuell      75  golf
## 5:   manuell      69 fabia
## 6:   manuell     102   3er

auto[,gearbox,] %>% table

## .
##           automatik   manuell 
##     20223     77169    274432

auto[vehicleType=='coupe' & gearbox=='automatik',,] %>% dim # intersect conditions

## [1] 6008   20

auto[,table(vehicleType),] # uhh? why would this even work?!?

## vehicleType
##                andere        bus     cabrio      coupe kleinwagen 
##      37899       3362      30220      22914      19026      80098 
##      kombi  limousine        suv 
##      67626      95963      14716

auto[, mean(price), by=vehicleType] # average price by car group

## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.

##    vehicleType         V1
## 1:              20124.688
## 2:       coupe  25951.506
## 3:         suv  13252.392
## 4:  kleinwagen   5691.167
## 5:   limousine  11111.107
## 6:      cabrio  15072.998
## 7:         bus  10300.686
## 8:       kombi   7739.518
## 9:      andere 676327.100

The .N operator is very useful if you need to count the length of the result. Notice where I use it:

auto[.N-1,,] # will exctract the *last* row

##            dateCrawled                   name seller offerType price
## 1: 2016-03-20 19:41:08 VW_Golf_Kombi_1_9l_TDI privat   Angebot  3400
##    abtest vehicleType yearOfRegistration gearbox powerPS model kilometer
## 1:   test       kombi               2002 manuell     100  golf    150000
##    monthOfRegistration fuelType      brand notRepairedDamage
## 1:                   6   diesel volkswagen                  
##            dateCreated nrOfPictures postalCode            lastSeen
## 1: 2016-03-20 00:00:00            0      40764 2016-03-24 12:45:21

auto[,.N] # will count rows

## [1] 371824

auto[,.N, vehicleType] # will count rows by type

##    vehicleType     N
## 1:             37899
## 2:       coupe 19026
## 3:         suv 14716
## 4:  kleinwagen 80098
## 5:   limousine 95963
## 6:      cabrio 22914
## 7:         bus 30220
## 8:       kombi 67626
## 9:      andere  3362

You may concatenate results into a vector:

auto[,c(mean(price), mean(powerPS)),]

## [1] 17286.2996   115.5414

This c() syntax no longer behaves well if splitting:

auto[,c(mean(price), mean(powerPS)), by=vehicleType]

##     vehicleType           V1
##  1:              20124.68801
##  2:                 71.23249
##  3:       coupe  25951.50589
##  4:       coupe    172.97614
##  5:         suv  13252.39182
##  6:         suv    166.01903
##  7:  kleinwagen   5691.16738
##  8:  kleinwagen     68.75733
##  9:   limousine  11111.10661
## 10:   limousine    132.26936
## 11:      cabrio  15072.99782
## 12:      cabrio    145.17684
## 13:         bus  10300.68561
## 14:         bus    113.58137
## 15:       kombi   7739.51760
## 16:       kombi    136.40654
## 17:      andere 676327.09964
## 18:      andere    102.11154

Use a list() instead of c(), within data.table commands:

auto[,list(mean(price), mean(powerPS)), by=vehicleType]

## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.

##    vehicleType         V1        V2
## 1:              20124.688  71.23249
## 2:       coupe  25951.506 172.97614
## 3:         suv  13252.392 166.01903
## 4:  kleinwagen   5691.167  68.75733
## 5:   limousine  11111.107 132.26936
## 6:      cabrio  15072.998 145.17684
## 7:         bus  10300.686 113.58137
## 8:       kombi   7739.518 136.40654
## 9:      andere 676327.100 102.11154

You can add names to your new variables:

auto[,list(Price=mean(price), Power=mean(powerPS)), by=vehicleType]

## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.

##    vehicleType      Price     Power
## 1:              20124.688  71.23249
## 2:       coupe  25951.506 172.97614
## 3:         suv  13252.392 166.01903
## 4:  kleinwagen   5691.167  68.75733
## 5:   limousine  11111.107 132.26936
## 6:      cabrio  15072.998 145.17684
## 7:         bus  10300.686 113.58137
## 8:       kombi   7739.518 136.40654
## 9:      andere 676327.100 102.11154

You can use .() to replace the longer list() command:

auto[,.(Price=mean(price), Power=mean(powerPS)), by=vehicleType]

## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.

##    vehicleType      Price     Power
## 1:              20124.688  71.23249
## 2:       coupe  25951.506 172.97614
## 3:         suv  13252.392 166.01903
## 4:  kleinwagen   5691.167  68.75733
## 5:   limousine  11111.107 132.26936
## 6:      cabrio  15072.998 145.17684
## 7:         bus  10300.686 113.58137
## 8:       kombi   7739.518 136.40654
## 9:      andere 676327.100 102.11154

And split by multiple variables:

auto[,.(Price=mean(price), Power=mean(powerPS)), by=.(vehicleType,fuelType)] %>% head

## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.

##    vehicleType fuelType     Price     Power
## 1:               benzin 11820.443  70.14477
## 2:       coupe   diesel 51170.248 179.48704
## 3:         suv   diesel 15549.369 168.16115
## 4:  kleinwagen   benzin  5786.514  68.74309
## 5:  kleinwagen   diesel  4295.550  76.83666
## 6:   limousine   benzin  6974.360 127.87025

Compute with variables created on the fly:

auto[,sum(price<1e4),] # Count prices higher than 10,000

## [1] 310497

auto[,mean(price<1e4),] # Proportion of prices larger than 10,000

## [1] 0.8350644

auto[,.(Power=mean(powerPS)), by=.(PriceRange=price>1e4)]

##    PriceRange    Power
## 1:      FALSE 101.8838
## 2:       TRUE 185.9029

You may sort along one or more columns

auto[order(-price), price,] %>% head # Order along price. Descending

## [1] 2147483647   99999999   99999999   99999999   99999999   99999999

auto[order(price, -lastSeen), price,] %>% head# Order along price and last seen . Ascending and descending.

## [1] 0 0 0 0 0 0

You may apply a function to ALL columns using a Subset of the Data using .SD

count.uniques <- function(x) length(unique(x))
auto[,lapply(.SD, count.uniques), vehicleType]

##    vehicleType dateCrawled  name seller offerType price abtest
## 1:                   36714 32891      1         2  1378      2
## 2:       coupe       18745 13182      1         2  1994      2
## 3:         suv       14549  9707      1         1  1667      2
## 4:  kleinwagen       75591 49302      2         2  1927      2
## 5:   limousine       89352 58581      2         1  2986      2
## 6:      cabrio       22497 13411      1         1  2014      2
## 7:         bus       29559 19651      1         2  1784      2
## 8:       kombi       64415 41976      2         1  2529      2
## 9:      andere        3352  3185      1         1   562      2
##    yearOfRegistration gearbox powerPS model kilometer monthOfRegistration
## 1:                101       3     374   244        13                  13
## 2:                 75       3     414   117        13                  13
## 3:                 73       3     342   122        13                  13
## 4:                 75       3     317   163        13                  13
## 5:                 83       3     506   210        13                  13
## 6:                 88       3     363    95        13                  13
## 7:                 65       3     251   106        13                  13
## 8:                 64       3     393   177        13                  13
## 9:                 81       3     230   162        13                  13
##    fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode
## 1:        8    40                 3          65            1       6304
## 2:        8    35                 3          51            1       5159
## 3:        8    37                 3          61            1       4932
## 4:        8    38                 3          68            1       7343
## 5:        8    39                 3          82            1       7513
## 6:        7    38                 3          70            1       5524
## 7:        8    33                 3          63            1       6112
## 8:        8    38                 3          75            1       7337
## 9:        8    38                 3          41            1       2220
##    lastSeen
## 1:    32813
## 2:    16568
## 3:    13367
## 4:    59354
## 5:    65813
## 6:    19125
## 7:    26094
## 8:    50668
## 9:     3294

Things to note:

.SD is the data subset after splitting along the by argument.
Recall that lapply applies the same function to all elements of a list. In this example, to all columns of .SD.

If you want to apply a function only to a subset of columns, use the .SDcols argument

auto[,lapply(.SD, count.uniques), by=vehicleType, .SDcols=price:gearbox]

##    vehicleType price abtest vehicleType yearOfRegistration gearbox
## 1:              1378      2           1                101       3
## 2:       coupe  1994      2           1                 75       3
## 3:         suv  1667      2           1                 73       3
## 4:  kleinwagen  1927      2           1                 75       3
## 5:   limousine  2986      2           1                 83       3
## 6:      cabrio  2014      2           1                 88       3
## 7:         bus  1784      2           1                 65       3
## 8:       kombi  2529      2           1                 64       3
## 9:      andere   562      2           1                 81       3

4.1 Make your own variables

It is very easy to compute new variables

auto[,log(price/powerPS),] %>% head # This makes no sense

## [1]      Inf 4.567632 4.096387 2.995732 3.954583 1.852000

And if you want to store the result in a new variable, use the := operator

auto[,newVar:=log(price/powerPS),]

Or create multiple variables at once. The syntax c("A","B"):=.(expression1,expression2)is read “save the list of results from expression1 and expression2 using the vector of names A, and B”.

auto[,c('newVar','newVar2'):=.(log(price/powerPS),price^2/powerPS),]

4.2 Join

data.table can be used for joining. A join is the operation of aligning two (or more) data frames/tables along some index. The index can be a single variable, or a combination thereof.

Here is a simple example of aligning age and gender from two different data tables:

DT1 <- data.table(Names=c("Alice","Bob"), Age=c(29,31))
DT2 <- data.table(Names=c("Alice","Bob","Carl"), Gender=c("F","M","M"))
setkey(DT1, Names)
setkey(DT2, Names)
DT1[DT2,,]

##    Names Age Gender
## 1: Alice  29      F
## 2:   Bob  31      M
## 3:  Carl  NA      M

DT2[DT1,,]

##    Names Gender Age
## 1: Alice      F  29
## 2:   Bob      M  31

Things to note:

A join with data.tables is performed by indexing one data.table with another. Which is the outer and which is the inner will affect the result.
The indexing variable needs to be set using the setkey function.

There are several types of joins:

Inner join: Returns the rows along the intersection of keys, i.e., rows that appear in all data sets.
Outer join: Returns the rows along the union of keys, i.e., rows that appear in any of the data sets.
Left join: Returns the rows along the index of the “left” data set.
Right join: Returns the rows along the index of the “right” data set.

Assuming DT1 is the “left” data set, we see that DT1[DT2,,] is a right join, and DT2[DT1,,] is a left join. For an inner join use the nomath=0 argument:

DT1[DT2,,,nomatch=0]

##    Names Age Gender
## 1: Alice  29      F
## 2:   Bob  31      M

DT2[DT1,,,nomatch=0]

##    Names Gender Age
## 1: Alice      F  29
## 2:   Bob      M  31

For more about joining with data.table see this

4.3 Reshaping data

Data sets (i.e. frames or tables) may arrive in a “wide” form or a “long” form. The difference is best illustrated with an example. The ChickWeight data encodes the weight of various chicks. It is “long” in that a variable encodes the time of measurement, making the data, well, simply long:

ChickWeight %>%  head

## Grouped Data: weight ~ Time | Chick
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

The mtcars data encodes 10 characteristics of 32 types of automobiles. It is “wide” since the various characteristics are encoded in different variables, making the data, well, simply wide.

mtcars %>% head

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Most of R’s functions, with exceptions, will prefer data in the long format. There are thus various facilities to convert from one format to another. We will focus on the melt and dcast functions to convert from one format to another.

4.3.1 Wide to long

melt will convert from wide to long.

dimnames(mtcars)

## [[1]]
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"         
## 
## [[2]]
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

mtcars$type <- rownames(mtcars)
melt(mtcars, id.vars=c("type")) %>% head

##                type variable value
## 1         Mazda RX4      mpg  21.0
## 2     Mazda RX4 Wag      mpg  21.0
## 3        Datsun 710      mpg  22.8
## 4    Hornet 4 Drive      mpg  21.4
## 5 Hornet Sportabout      mpg  18.7
## 6           Valiant      mpg  18.1

Things to note:

The car type was originally encoded in the rows’ names, and not as a variable. We thus created an explicit variable with the cars’ type using the rownames function.
The id.vars of the melt function names the variables that will be used as identifiers. All other variables are assumed to be measurements. These can have been specified using their index instead of their name.
If not all variables are measurements, we could have names measurement variables explicitly using the measure.vars argument of the melt function. These can have been specified using their index instead of their name.
By default, the molten columns are automatically named variable and value.

We can replace the automatic namings using variable.name and value.name:

melt(mtcars, id.vars=c("type"), variable.name="Charachteristic", value.name="Measurement") %>% head

##                type Charachteristic Measurement
## 1         Mazda RX4             mpg        21.0
## 2     Mazda RX4 Wag             mpg        21.0
## 3        Datsun 710             mpg        22.8
## 4    Hornet 4 Drive             mpg        21.4
## 5 Hornet Sportabout             mpg        18.7
## 6           Valiant             mpg        18.1

4.3.2 Long to wide

dcast will conver from long to wide:

dcast(ChickWeight, Chick~Time, value.var="weight")

##    Chick  0  2  4  6   8  10  12  14  16  18  20  21
## 1     18 39 35 NA NA  NA  NA  NA  NA  NA  NA  NA  NA
## 2     16 41 45 49 51  57  51  54  NA  NA  NA  NA  NA
## 3     15 41 49 56 64  68  68  67  68  NA  NA  NA  NA
## 4     13 41 48 53 60  65  67  71  70  71  81  91  96
## 5      9 42 51 59 68  85  96  90  92  93 100 100  98
## 6     20 41 47 54 58  65  73  77  89  98 107 115 117
## 7     10 41 44 52 63  74  81  89  96 101 112 120 124
## 8      8 42 50 61 71  84  93 110 116 126 134 125  NA
## 9     17 42 51 61 72  83  89  98 103 113 123 133 142
## 10    19 43 48 55 62  65  71  82  88 106 120 144 157
## 11     4 42 49 56 67  74  87 102 108 136 154 160 157
## 12     6 41 49 59 74  97 124 141 148 155 160 160 157
## 13    11 43 51 63 84 112 139 168 177 182 184 181 175
## 14     3 43 39 55 67  84  99 115 138 163 187 198 202
## 15     1 42 51 59 64  76  93 106 125 149 171 199 205
## 16    12 41 49 56 62  72  88 119 135 162 185 195 205
## 17     2 40 49 58 72  84 103 122 138 162 187 209 215
## 18     5 41 42 48 60  79 106 141 164 197 199 220 223
## 19    14 41 49 62 79 101 128 164 192 227 248 259 266
## 20     7 41 49 57 71  89 112 146 174 218 250 288 305
## 21    24 42 52 58 74  66  68  70  71  72  72  76  74
## 22    30 42 48 59 72  85  98 115 122 143 151 157 150
## 23    22 41 55 64 77  90  95 108 111 131 148 164 167
## 24    23 43 52 61 73  90 103 127 135 145 163 170 175
## 25    27 39 46 58 73  87 100 115 123 144 163 185 192
## 26    28 39 46 58 73  92 114 145 156 184 207 212 233
## 27    26 42 48 57 74  93 114 136 147 169 205 236 251
## 28    25 40 49 62 78 102 124 146 164 197 231 259 265
## 29    29 39 48 59 74  87 106 134 150 187 230 279 309
## 30    21 40 50 62 86 125 163 217 240 275 307 318 331
## 31    33 39 50 63 77  96 111 137 144 151 146 156 147
## 32    37 41 48 56 68  80  83 103 112 135 157 169 178
## 33    36 39 48 61 76  98 116 145 166 198 227 225 220
## 34    31 42 53 62 73  85 102 123 138 170 204 235 256
## 35    39 42 50 61 78  89 109 130 146 170 214 250 272
## 36    38 41 49 61 74  98 109 128 154 192 232 280 290
## 37    32 41 49 65 82 107 129 159 179 221 263 291 305
## 38    40 41 55 66 79 101 120 154 182 215 262 295 321
## 39    34 41 49 63 85 107 134 164 186 235 294 327 341
## 40    35 41 53 64 87 123 158 201 238 287 332 361 373
## 41    44 42 51 65 86 103 118 127 138 145 146  NA  NA
## 42    45 41 50 61 78  98 117 135 141 147 174 197 196
## 43    43 42 55 69 96 131 157 184 188 197 198 199 200
## 44    41 42 51 66 85 103 124 155 153 175 184 199 204
## 45    47 41 53 66 79 100 123 148 157 168 185 210 205
## 46    49 40 53 64 85 108 128 152 166 184 203 233 237
## 47    46 40 52 62 82 101 120 144 156 173 210 231 238
## 48    50 41 54 67 84 105 122 155 175 205 234 264 264
## 49    42 42 49 63 84 103 126 160 174 204 234 269 281
## 50    48 39 50 62 80 104 125 154 170 222 261 303 322

Things to note:

dcast uses a formula interface (~) to specify the row identifier and the variables. The LHS is the row identifier, and the RHS for the variables to be created.
The measurement of each LHS at each RHS, is specified using the value.var argument.

4.4 Bibliographic Notes

data.table has excellent online documentation. See here. See here for joining data.tables. See here for more on reshaping data.tables. See here for a comparison of the data.frame way, versus the data.table way. For some advanced tips and tricks see Andrew Brooks’ blog.

Introduction to Data Science

Chapter 4 data.table

4.1 Make your own variables

4.2 Join

4.3 Reshaping data

4.3.1 Wide to long

4.3.2 Long to wide

4.4 Bibliographic Notes

4.5 Practice Yourself