Chapter 4 data.table
data.table
is an excellent extension of the data.frame
class.
If used as a data.frame
it will look and feel like a data frame.
If, however, it is used with it’s unique capabilities, it will prove faster and easier to manipulate.
Let’s start with importing some freely available car sales data from Kaggle.
## Warning in fread("data/autos.csv"): Found and resolved improper
## quoting out-of-sample. First healed line 5263: <<2016-03-29
## 16:46:46,"_SPARDOSE"______Polo_1_4___6N1___60PS___5Tuerer____FESTPREIS,privat,Angebot,
## 500,control,limousine,1999,manuell,60,polo,150000,12,benzin,volkswagen,ja,
## 2016-03-25 00:00:00,0,59581,2016-03-30 11:46:58>>. If the fields are
## not quoted (e.g. field separator does not appear within any field), try
## quote="" to avoid this warning.
## [1] 371824 20
## [1] "dateCrawled" "name" "seller"
## [4] "offerType" "price" "abtest"
## [7] "vehicleType" "yearOfRegistration" "gearbox"
## [10] "powerPS" "model" "kilometer"
## [13] "monthOfRegistration" "fuelType" "brand"
## [16] "notRepairedDamage" "dateCreated" "nrOfPictures"
## [19] "postalCode" "lastSeen"
## [1] "data.table" "data.frame"
## size isdir mode mtime ctime
## data/autos.csv 68439217 FALSE 644 2019-02-05 17:38:33 2019-02-05 17:38:33
## atime uid gid uname grname
## data/autos.csv 2020-12-02 13:04:40 1026 1027 ron ron
## [1] "65.3 MiB"
## 103.3 Mb
Things to note:
- The import has been done with
fread
instead ofread.csv
. This is more efficient, and directly creates adata.table
object. - The import is very fast.
- The data after import is slightly larger than when stored on disk (in this case).
Let’s start with verifying that it behaves like a data.frame
when expected.
## name
## 1: Golf_3_1.6
## 2: A5_Sportback_2.7_Tdi
## 3: Jeep_Grand_Cherokee_"Overland"
## 4: GOLF_4_1_4__3T\xdcRER
## 5: Skoda_Fabia_1.4_TDI_PD_Classic
## 6: BMW_316i___e36_Limousine___Bastlerfahrzeug__Export
## [1] "Golf_3_1.6"
## [2] "A5_Sportback_2.7_Tdi"
## [3] "Jeep_Grand_Cherokee_\"Overland\""
## [4] "GOLF_4_1_4__3T\xdcRER"
## [5] "Skoda_Fabia_1.4_TDI_PD_Classic"
## [6] "BMW_316i___e36_Limousine___Bastlerfahrzeug__Export"
## name
## 1: Golf_3_1.6
But notice the difference between data.frame
and data.table
when subsetting multiple rows. Uhh!
## [1] 3 20
## [1] 371824 3
Just use columns (,
) and be explicit regarding the dimension you are extracting…
Now let’s do some data.table
specific operations.
The general syntax has the form DT[i,j,by]
.
SQL users may think of i
as WHERE
, j
as SELECT
, and by
as GROUP BY
.
We don’t need to name the arguments explicitly.
Also, the Tab
key will typically help you to fill in column names.
## .
## andere bus cabrio coupe kleinwagen
## 37899 3362 30220 22914 19026 80098
## kombi limousine suv
## 67626 95963 14716
## [1] 19026 20
## gearbox powerPS model
## 1: manuell 0 golf
## 2: manuell 190
## 3: automatik 163 grand
## 4: manuell 75 golf
## 5: manuell 69 fabia
## 6: manuell 102 3er
## .
## automatik manuell
## 20223 77169 274432
## [1] 6008 20
## vehicleType
## andere bus cabrio coupe kleinwagen
## 37899 3362 30220 22914 19026 80098
## kombi limousine suv
## 67626 95963 14716
## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.
## vehicleType V1
## 1: 20124.688
## 2: coupe 25951.506
## 3: suv 13252.392
## 4: kleinwagen 5691.167
## 5: limousine 11111.107
## 6: cabrio 15072.998
## 7: bus 10300.686
## 8: kombi 7739.518
## 9: andere 676327.100
The .N
operator is very useful if you need to count the length of the result.
Notice where I use it:
## dateCrawled name seller offerType price
## 1: 2016-03-20 19:41:08 VW_Golf_Kombi_1_9l_TDI privat Angebot 3400
## abtest vehicleType yearOfRegistration gearbox powerPS model kilometer
## 1: test kombi 2002 manuell 100 golf 150000
## monthOfRegistration fuelType brand notRepairedDamage
## 1: 6 diesel volkswagen
## dateCreated nrOfPictures postalCode lastSeen
## 1: 2016-03-20 00:00:00 0 40764 2016-03-24 12:45:21
## [1] 371824
## vehicleType N
## 1: 37899
## 2: coupe 19026
## 3: suv 14716
## 4: kleinwagen 80098
## 5: limousine 95963
## 6: cabrio 22914
## 7: bus 30220
## 8: kombi 67626
## 9: andere 3362
You may concatenate results into a vector:
## [1] 17286.2996 115.5414
This c()
syntax no longer behaves well if splitting:
## vehicleType V1
## 1: 20124.68801
## 2: 71.23249
## 3: coupe 25951.50589
## 4: coupe 172.97614
## 5: suv 13252.39182
## 6: suv 166.01903
## 7: kleinwagen 5691.16738
## 8: kleinwagen 68.75733
## 9: limousine 11111.10661
## 10: limousine 132.26936
## 11: cabrio 15072.99782
## 12: cabrio 145.17684
## 13: bus 10300.68561
## 14: bus 113.58137
## 15: kombi 7739.51760
## 16: kombi 136.40654
## 17: andere 676327.09964
## 18: andere 102.11154
Use a list()
instead of c()
, within data.table
commands:
## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.
## vehicleType V1 V2
## 1: 20124.688 71.23249
## 2: coupe 25951.506 172.97614
## 3: suv 13252.392 166.01903
## 4: kleinwagen 5691.167 68.75733
## 5: limousine 11111.107 132.26936
## 6: cabrio 15072.998 145.17684
## 7: bus 10300.686 113.58137
## 8: kombi 7739.518 136.40654
## 9: andere 676327.100 102.11154
You can add names to your new variables:
## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.
## vehicleType Price Power
## 1: 20124.688 71.23249
## 2: coupe 25951.506 172.97614
## 3: suv 13252.392 166.01903
## 4: kleinwagen 5691.167 68.75733
## 5: limousine 11111.107 132.26936
## 6: cabrio 15072.998 145.17684
## 7: bus 10300.686 113.58137
## 8: kombi 7739.518 136.40654
## 9: andere 676327.100 102.11154
You can use .()
to replace the longer list()
command:
## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.
## vehicleType Price Power
## 1: 20124.688 71.23249
## 2: coupe 25951.506 172.97614
## 3: suv 13252.392 166.01903
## 4: kleinwagen 5691.167 68.75733
## 5: limousine 11111.107 132.26936
## 6: cabrio 15072.998 145.17684
## 7: bus 10300.686 113.58137
## 8: kombi 7739.518 136.40654
## 9: andere 676327.100 102.11154
And split by multiple variables:
## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.
## vehicleType fuelType Price Power
## 1: benzin 11820.443 70.14477
## 2: coupe diesel 51170.248 179.48704
## 3: suv diesel 15549.369 168.16115
## 4: kleinwagen benzin 5786.514 68.74309
## 5: kleinwagen diesel 4295.550 76.83666
## 6: limousine benzin 6974.360 127.87025
Compute with variables created on the fly:
## [1] 310497
## [1] 0.8350644
## PriceRange Power
## 1: FALSE 101.8838
## 2: TRUE 185.9029
You may sort along one or more columns
## [1] 2147483647 99999999 99999999 99999999 99999999 99999999
auto[order(price, -lastSeen), price,] %>% head# Order along price and last seen . Ascending and descending.
## [1] 0 0 0 0 0 0
You may apply a function to ALL columns using a Subset of the Data using .SD
## vehicleType dateCrawled name seller offerType price abtest
## 1: 36714 32891 1 2 1378 2
## 2: coupe 18745 13182 1 2 1994 2
## 3: suv 14549 9707 1 1 1667 2
## 4: kleinwagen 75591 49302 2 2 1927 2
## 5: limousine 89352 58581 2 1 2986 2
## 6: cabrio 22497 13411 1 1 2014 2
## 7: bus 29559 19651 1 2 1784 2
## 8: kombi 64415 41976 2 1 2529 2
## 9: andere 3352 3185 1 1 562 2
## yearOfRegistration gearbox powerPS model kilometer monthOfRegistration
## 1: 101 3 374 244 13 13
## 2: 75 3 414 117 13 13
## 3: 73 3 342 122 13 13
## 4: 75 3 317 163 13 13
## 5: 83 3 506 210 13 13
## 6: 88 3 363 95 13 13
## 7: 65 3 251 106 13 13
## 8: 64 3 393 177 13 13
## 9: 81 3 230 162 13 13
## fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode
## 1: 8 40 3 65 1 6304
## 2: 8 35 3 51 1 5159
## 3: 8 37 3 61 1 4932
## 4: 8 38 3 68 1 7343
## 5: 8 39 3 82 1 7513
## 6: 7 38 3 70 1 5524
## 7: 8 33 3 63 1 6112
## 8: 8 38 3 75 1 7337
## 9: 8 38 3 41 1 2220
## lastSeen
## 1: 32813
## 2: 16568
## 3: 13367
## 4: 59354
## 5: 65813
## 6: 19125
## 7: 26094
## 8: 50668
## 9: 3294
Things to note:
.SD
is the data subset after splitting along theby
argument.- Recall that
lapply
applies the same function to all elements of a list. In this example, to all columns of.SD
.
If you want to apply a function only to a subset of columns, use the .SDcols
argument
## vehicleType price abtest vehicleType yearOfRegistration gearbox
## 1: 1378 2 1 101 3
## 2: coupe 1994 2 1 75 3
## 3: suv 1667 2 1 73 3
## 4: kleinwagen 1927 2 1 75 3
## 5: limousine 2986 2 1 83 3
## 6: cabrio 2014 2 1 88 3
## 7: bus 1784 2 1 65 3
## 8: kombi 2529 2 1 64 3
## 9: andere 562 2 1 81 3
4.1 Make your own variables
It is very easy to compute new variables
## [1] Inf 4.567632 4.096387 2.995732 3.954583 1.852000
And if you want to store the result in a new variable, use the :=
operator
Or create multiple variables at once.
The syntax c("A","B"):=.(expression1,expression2)
is read “save the list of results from expression1 and expression2 using the vector of names A, and B”.
4.2 Join
data.table can be used for joining. A join is the operation of aligning two (or more) data frames/tables along some index. The index can be a single variable, or a combination thereof.
Here is a simple example of aligning age and gender from two different data tables:
DT1 <- data.table(Names=c("Alice","Bob"), Age=c(29,31))
DT2 <- data.table(Names=c("Alice","Bob","Carl"), Gender=c("F","M","M"))
setkey(DT1, Names)
setkey(DT2, Names)
DT1[DT2,,]
## Names Age Gender
## 1: Alice 29 F
## 2: Bob 31 M
## 3: Carl NA M
## Names Gender Age
## 1: Alice F 29
## 2: Bob M 31
Things to note:
- A join with
data.tables
is performed by indexing onedata.table
with another. Which is the outer and which is the inner will affect the result. - The indexing variable needs to be set using the
setkey
function.
There are several types of joins:
- Inner join: Returns the rows along the intersection of keys, i.e., rows that appear in all data sets.
- Outer join: Returns the rows along the union of keys, i.e., rows that appear in any of the data sets.
- Left join: Returns the rows along the index of the “left” data set.
- Right join: Returns the rows along the index of the “right” data set.
Assuming DT1
is the “left” data set, we see that DT1[DT2,,]
is a right join, and DT2[DT1,,]
is a left join.
For an inner join use the nomath=0
argument:
## Names Age Gender
## 1: Alice 29 F
## 2: Bob 31 M
## Names Gender Age
## 1: Alice F 29
## 2: Bob M 31
For more about joining with data.table see this
4.3 Reshaping data
Data sets (i.e. frames or tables) may arrive in a “wide” form or a “long” form.
The difference is best illustrated with an example.
The ChickWeight
data encodes the weight of various chicks. It is “long” in that a variable encodes the time of measurement, making the data, well, simply long:
## Grouped Data: weight ~ Time | Chick
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
The mtcars
data encodes 10 characteristics of 32 types of automobiles. It is “wide” since the various characteristics are encoded in different variables, making the data, well, simply wide.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Most of R’s functions, with exceptions, will prefer data in the long format.
There are thus various facilities to convert from one format to another.
We will focus on the melt
and dcast
functions to convert from one format to another.
4.3.1 Wide to long
melt
will convert from wide to long.
## [[1]]
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
##
## [[2]]
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
## type variable value
## 1 Mazda RX4 mpg 21.0
## 2 Mazda RX4 Wag mpg 21.0
## 3 Datsun 710 mpg 22.8
## 4 Hornet 4 Drive mpg 21.4
## 5 Hornet Sportabout mpg 18.7
## 6 Valiant mpg 18.1
Things to note:
- The car type was originally encoded in the rows’ names, and not as a variable. We thus created an explicit variable with the cars’ type using the
rownames
function. - The
id.vars
of themelt
function names the variables that will be used as identifiers. All other variables are assumed to be measurements. These can have been specified using their index instead of their name. - If not all variables are measurements, we could have names measurement variables explicitly using the
measure.vars
argument of themelt
function. These can have been specified using their index instead of their name. - By default, the molten columns are automatically named
variable
andvalue
.
We can replace the automatic namings using variable.name
and value.name
:
## type Charachteristic Measurement
## 1 Mazda RX4 mpg 21.0
## 2 Mazda RX4 Wag mpg 21.0
## 3 Datsun 710 mpg 22.8
## 4 Hornet 4 Drive mpg 21.4
## 5 Hornet Sportabout mpg 18.7
## 6 Valiant mpg 18.1
4.3.2 Long to wide
dcast
will conver from long to wide:
## Chick 0 2 4 6 8 10 12 14 16 18 20 21
## 1 18 39 35 NA NA NA NA NA NA NA NA NA NA
## 2 16 41 45 49 51 57 51 54 NA NA NA NA NA
## 3 15 41 49 56 64 68 68 67 68 NA NA NA NA
## 4 13 41 48 53 60 65 67 71 70 71 81 91 96
## 5 9 42 51 59 68 85 96 90 92 93 100 100 98
## 6 20 41 47 54 58 65 73 77 89 98 107 115 117
## 7 10 41 44 52 63 74 81 89 96 101 112 120 124
## 8 8 42 50 61 71 84 93 110 116 126 134 125 NA
## 9 17 42 51 61 72 83 89 98 103 113 123 133 142
## 10 19 43 48 55 62 65 71 82 88 106 120 144 157
## 11 4 42 49 56 67 74 87 102 108 136 154 160 157
## 12 6 41 49 59 74 97 124 141 148 155 160 160 157
## 13 11 43 51 63 84 112 139 168 177 182 184 181 175
## 14 3 43 39 55 67 84 99 115 138 163 187 198 202
## 15 1 42 51 59 64 76 93 106 125 149 171 199 205
## 16 12 41 49 56 62 72 88 119 135 162 185 195 205
## 17 2 40 49 58 72 84 103 122 138 162 187 209 215
## 18 5 41 42 48 60 79 106 141 164 197 199 220 223
## 19 14 41 49 62 79 101 128 164 192 227 248 259 266
## 20 7 41 49 57 71 89 112 146 174 218 250 288 305
## 21 24 42 52 58 74 66 68 70 71 72 72 76 74
## 22 30 42 48 59 72 85 98 115 122 143 151 157 150
## 23 22 41 55 64 77 90 95 108 111 131 148 164 167
## 24 23 43 52 61 73 90 103 127 135 145 163 170 175
## 25 27 39 46 58 73 87 100 115 123 144 163 185 192
## 26 28 39 46 58 73 92 114 145 156 184 207 212 233
## 27 26 42 48 57 74 93 114 136 147 169 205 236 251
## 28 25 40 49 62 78 102 124 146 164 197 231 259 265
## 29 29 39 48 59 74 87 106 134 150 187 230 279 309
## 30 21 40 50 62 86 125 163 217 240 275 307 318 331
## 31 33 39 50 63 77 96 111 137 144 151 146 156 147
## 32 37 41 48 56 68 80 83 103 112 135 157 169 178
## 33 36 39 48 61 76 98 116 145 166 198 227 225 220
## 34 31 42 53 62 73 85 102 123 138 170 204 235 256
## 35 39 42 50 61 78 89 109 130 146 170 214 250 272
## 36 38 41 49 61 74 98 109 128 154 192 232 280 290
## 37 32 41 49 65 82 107 129 159 179 221 263 291 305
## 38 40 41 55 66 79 101 120 154 182 215 262 295 321
## 39 34 41 49 63 85 107 134 164 186 235 294 327 341
## 40 35 41 53 64 87 123 158 201 238 287 332 361 373
## 41 44 42 51 65 86 103 118 127 138 145 146 NA NA
## 42 45 41 50 61 78 98 117 135 141 147 174 197 196
## 43 43 42 55 69 96 131 157 184 188 197 198 199 200
## 44 41 42 51 66 85 103 124 155 153 175 184 199 204
## 45 47 41 53 66 79 100 123 148 157 168 185 210 205
## 46 49 40 53 64 85 108 128 152 166 184 203 233 237
## 47 46 40 52 62 82 101 120 144 156 173 210 231 238
## 48 50 41 54 67 84 105 122 155 175 205 234 264 264
## 49 42 42 49 63 84 103 126 160 174 204 234 269 281
## 50 48 39 50 62 80 104 125 154 170 222 261 303 322
Things to note:
dcast
uses a formula interface (~
) to specify the row identifier and the variables. The LHS is the row identifier, and the RHS for the variables to be created.- The measurement of each LHS at each RHS, is specified using the
value.var
argument.
4.4 Bibliographic Notes
data.table
has excellent online documentation.
See here.
See here for joining data.tables.
See here for more on reshaping data.tables.
See here for a comparison of the data.frame
way, versus the data.table
way.
For some advanced tips and tricks see Andrew Brooks’ blog.