1.9 Working with Large Datasets
The learning objectives of this section are to:
- Read and manipulate large datasets
R now offers now offers a variety of options for working with large datasets. We won’t try to cover all these options in detail here, but rather give an overview of strategies to consider if you need to work with a large dataset, as well as point you to additional resources to learn more about working with large datasets in R.
While there are a variety of definitions of how large a dataset must be to qualify as “large”, in this section we don’t formally define a limit. Instead, this section is meant to give you some strategies anytime you work with a dataset large enough that you notice it’s causing problems. For example, data large enough for R to be noticeably slow to read or manipulate the data, or large enough it’s difficult to store the data locally on your computer.
1.9.1 In-memory strategies
In this section, we introduce the basics of why and how to use
data.table to work with large datasets in R. We have included a video demonstration online showing how functions from the
data.table package can be used to load and explore a large dataset more efficiently.
data.table package can help you read a large dataset into R and explore it more efficiently. The
fread function in this package, for example, can read large flat files in much more quickly than comparable base R packages. Since all of the
data.table functions will work with smaller datasets, as well, we’ll illustrate using
data.table with the Zika data accessed from GitHub in an earlier section of this chapter. We’ve saved that data locally to illustrate how to read it in and work with it using
First, to read this data in using
fread, you can run:
library(data.table) brazil_zika <- fread("data/COES_Microcephaly-2016-06-25.csv") head(brazil_zika, 2)
report_date location location_type data_field 1: 2016-06-25 Brazil-Acre state microcephaly_confirmed 2: 2016-06-25 Brazil-Alagoas state microcephaly_confirmed data_field_code time_period time_period_type value unit 1: BR0002 NA NA 2 cases 2: BR0002 NA NA 75 cases
 "data.table" "data.frame"
If you are working with a very large dataset,
data.table will provide a status bar showing your progress towards loading the code as you read it in using
If you have a large dataset for which you only want to read in certain columns, you can save time when using
data.table by only reading in the columns you want with the
select argument in
fread. This argument takes a vector of either the names or positions of the columns that you want to read in:
fread("data/COES_Microcephaly-2016-06-25.csv", select = c("location", "value", "unit")) %>% dplyr::slice(1:3)
# A tibble: 3 x 3 location value unit <chr> <int> <chr> 1 Brazil-Acre 2 cases 2 Brazil-Alagoas 75 cases 3 Brazil-Amapa 7 cases
Many of the
fread arguments are counterparts to arguments in the
read.table family of functions in base R (for example,
colClasses). One that is particular useful is
nrows. If you’re working with data that takes a while to read in, using
nrows = 20 or some other small number will allow you to make sure you have set all of the arguments in
fread appropriately for the dataset before you read in the full dataset.
If you already have a dataset loaded to your R session, you can use the
data.table function to convert a data frame into a
data.table object. (Note: if you use
fread, the data is automatically read into a
data.table object.) A
data.table object also has the class
data.frame; this means that you can use all of your usual methods for manipulating a data frame with a
data.table object. However, for extra speed, use
data.table functions to manipulate, clean, and explore the data in a
data.table object. You can find out more about using
data.table functions at the
Many of the functions in
data.table, like many in
ddplyr, use non-standard evaluation. This means that, while they’ll work fine in interactive programming, you’ll need to take some extra steps when you use them to write functions for packages. We’ll cover non-standard evaluation in the context of developing packages in a later section.
When you are working with datasets that are large, but can still fit in-memory, you’ll want to optimize your code as much as possible. There are more details on profiling and optimizing code in a later chapter, but one strategy for speeding up R code is to write some of the code in C++ and connect it to R using the
Rcpp package. Since C++ is a compiled rather than an interpreted language, it runs much faster than similar code written in R. If you are more comfortable coding in another compiled language (C or FORTRAN, for example), you can also use those, although the
Rcpp package is very nicely written and well-maintained, which makes C++ an excellent first choice for creating compiled code to speed up R.
Further, a variety of R packages have been written that help you run R code in parallel, either locally or on a cluster. Parallel strategies may be work pursuing if you are working with very large datasets, and if the coding tasks can be split to run in parallel. To get more ideas and find relevant packages, visit CRAN’s High-Performance and Parallel Computing with R task view.
1.9.2 Out-of-memory strategies
If you need to work with a very large dataset, there are also some options to explore and model the dataset without ever loading it into R, while still using R commands and working from the R console or an R script. These options can make working with large datasets more efficient, because they let other software handle the heavy lifting of sifting through the data and / or avoid loading large datasets into RAM, instead using data stored on hard drive.
For example, database management systems are optimized to more efficiently store and better search through large sets of data; popular examples include Oracle, MySQL, and PostgreSQL. There are several R packages that allow you to connect your R session to a database. With these packages, you can use functions from the R console or an R script to search and subset data without loading the whole dataset into R, and so take advantage of the improved efficiency of the database management system in handling data, as well as work with data too big to fit in memory.
DBI package is particularly convenient for interfacing R code with a database management system, as it provides a top-level interface to a number of different database management systems, with system-specific code applied by a lower-level, more specific R package (Figure 1.7).
DBI package therefore allows you to use the same commands for working with database-stored data in R, without worrying about details specific to the exact type of database management system you’re connecting to. The following table outlines the
DBI functions you can use to perform a variety of tasks when working with data stored in a database:
|Create a new driver object for an instance of a database||
|Connect to database instance||
|Find available tables in a connected database instance||
|Find available fields within a table||
|Query a connected database instance||
|Pull a data frame into R from a query result||
|Jointly query and pull data from a database instance||
|Close result set from a query||
|Write a new table in a database instance||
|Remove a table from a database instance||
|Disconnect from a database instance||
DBI package depends on lower-level R packages to translate its generic commands to work for specific database management systems. DBI-compliant R packages have not been written for every database management system, so there are some databases for which DBI commands will not work. DBI-compliant R packages that are available include:
|Database Management System||R packages|
|Microsoft SQL Server||
For more on the
DBI package, including its history, see the package’s GitHub README page.
The packages for working with database management systems require you to send commands to the database management system in that system’s command syntax (e.g., SQL). You can, however, do “SELECT” database queries directly using
dplyr syntax for some database systems, rather than with SQL syntax. While this functionality is limited to “SELECT” calls, often this is all you’ll need within a data analysis script. For more details, see the dplyr database vignette.
In addition to database management systems, there are other options for working with large data out-of-memory in R. For example, the
bigmemory and associated packages can be used to access and work with large matrices stored on hard drive rather than in RAM, by storing the data in a C++ matrix structure and loading to R pointers to the data, rather than the full dataset. This family of packages includes packages that can be used to summarize and model the data (
bigalgebra). One limitation is that these packages only work with matrices, not data frames; matrices require all elements share a class (e.g., all numeric).
Finally, there are some packages that allow you to write R code that uses other software to load and work with data through an R API provided by the other software. For example, the
h2o package allows you to write R code to load and fit machine learning models in H2O, which is open-source software that facilitates distributed machine learning. H2O includes functions to fit and evaluate numerous machine learning models, including ensemble models, which would take quite a while to fit within R with a large training dataset. Since processing is done using compiled code, models can be fit on large datasets more quickly. However, while the
h2o package allows you to use R-like code from within an R console to explore and model your data, it is not actually running R, but instead is using the R code, through the R API, to run Java-encoded functions. As a result, you only have access to a small subset of R’s total functionality, since you can only run the R-like functions written into H2O’s own software.