2.7 Very large data sets (a.k.a. “Big Data”)

Over the past decade or so there has been growing interest in so-called “Big Data”. See Kaisler et al. (2013) for an overview; although it is slightly dated in terms of volumes (we might now think in zettabytes rather than exabytes), the underlying concepts remain the same. For a more recent review and discussion of the business and technological challenges, see I. Lee (2017). However, one could make just as convincing an argument about data complexity as about volume. For this reason I don’t see much value in trying to be overly precise in defining “Big Data”.

2.7.1 So how suited is R for big data analysis?

On the face of it there are some restrictions.

  1. By default R can only process data that fits into your computer’s memory. Hardware advances have made this less of a problem, since most laptops now come with at least 8 GB of memory and often 16 GB or more.
  2. In practice, the situation is slightly worse because you will be manipulating your data, so a good rule of thumb is that your machine needs at least twice as much RAM as your data occupies (a short sketch of how to check this follows the list).
  3. R reads the entire data set into RAM in one go, whereas some other languages/environments can read sections of a file on demand. The time taken to pull a very large data set into memory can far exceed the time taken to run the analysis itself.
  4. There is a roughly 2 billion (\(2^{31}-1\)) limit on standard vector indices, because base R does not support a 64-bit integer type, so it isn’t possible to index objects with huge numbers of rows or columns.
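
To make points 1, 2 and 4 a little more concrete, here is a small base-R sketch (the data frame is just an illustrative stand-in for your own data, and the exact figures will differ from machine to machine):

```r
# Base-R sketch: checking how much memory your data occupy and the
# standard integer index limit mentioned above.

# An illustrative data frame: 1 million rows, two numeric columns
x <- data.frame(a = rnorm(1e6), b = rnorm(1e6))
print(object.size(x), units = "MB")   # roughly 15 MB

# Rough estimate for an all-numeric data set: 8 bytes per value,
# so 8 * rows * columns bytes in total
est_gb <- function(rows, cols) 8 * rows * cols / 1024^3
est_gb(1e8, 20)                       # ~15 GB -- and double it for working space

# The largest standard (32-bit) integer index
.Machine$integer.max                  # 2147483647, i.e. ~2.1 billion
```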

Gould (2020) describes three possible strategies that can be implemented in R, illustrated using the NYC 2013 flights data set, which has 19 variables and 336,776 rows, i.e. \(\sim6.4\) million data items. The reported execution times are quite impressive.
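
To get a sense of scale, the same data are shipped in the {nycflights13} package, so you can check the in-memory footprint yourself. A minimal sketch, assuming that package is installed:

```r
# Minimal sketch: assumes install.packages("nycflights13") has been run;
# the package contains the NYC 2013 flights data used by Gould (2020).
library(nycflights13)

dim(flights)                                # 336776 rows, 19 columns
print(object.size(flights), units = "MB")   # a few tens of MB -- comfortably in-memory
```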

There are also specialised packages (actually implemented in C++), e.g., {bigmemory}, that support access to a data set without reading the entire set into R. This enables analysis of data sets up to \(\sim10\) GB. Beyond this, one needs to consider parallelisation using a cloud facility or a Hadoop distributed system via the {RHadoop} package.
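
A minimal sketch of the {bigmemory} approach, assuming a large, all-numeric CSV file (the file name below is hypothetical): the data are memory-mapped from disk, so only the parts you touch are pulled into RAM.

```r
# Hedged sketch: assumes {bigmemory} is installed and that "big_data.csv"
# (a hypothetical, all-numeric file too large to read comfortably with read.csv)
# exists in the working directory.
library(bigmemory)

# Create a file-backed big.matrix; the backing and descriptor files let you
# re-attach the data later without re-reading the CSV.
big <- read.big.matrix("big_data.csv", header = TRUE, type = "double",
                       backingfile = "big_data.bin",
                       descriptorfile = "big_data.desc")

dim(big)         # dimensions without loading the whole data set
mean(big[, 1])   # work on one column at a time; big[, 1] is an ordinary vector
```

Note that a big.matrix stores values of a single type (here double), so it suits large numeric matrices rather than mixed-type data frames.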

So, despite some claims to the contrary, R does provide many options for handling very large data files. Nevertheless, our focus is on small, in-memory data sets. This is because you can’t tackle big data unless you have experience with small data, and it’s easier to explain principles with smaller data sets. The tools and methods you will learn in this module will easily scale to 1–2 GB of data.

References

Gould, Alex. 2020. “Three Strategies for Working with Big Data in R.” https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/.
Kaisler, Stephen, Frank Armour, J Alberto Espinosa, and William Money. 2013. “Big Data: Issues and Challenges Moving Forward.” In 2013 46th Hawaii International Conference on System Sciences, 995–1004. IEEE.
Lee, I. 2017. “Big Data: Dimensions, Evolution, Impacts, and Challenges.” Business Horizons 60 (3): 293–303.