Chapter 10 Handling Big Data with ff

Hello! This week, we’ll be talking about the ff package, which we can use to read large datasets into R. We’ll also talk a little about memory in R, as this is an important precursor to understanding why we even need ff.

First, you may want to adjust some of your option defaults. For example, when you read a file into R, you may notice that large numbers are shortened using scientific notation (a compact way of expressing very large or very small numbers, learn more here). However, this can be annoying when you just want to see the full 8-digit user ID of a Twitter account. You can turn off the scientific notation default using options().

options(scipen = 999) #use this so that numbers in scientific notation are written out

10.1 Memory in R

One important feature of R is that it holds nearly everything in RAM. If your computer does not have enough RAM, it will not be able to read or process large datasets. Several packages, including ff, allow you to work around this limitation using clever storage techniques (ff uses disk storage). Keep in mind that every new object you save to the environment also takes up RAM.

On a Windows PC, you may not be using the full memory limit. You can check and extend it using the memory.limit() function (note that memory.limit() is Windows-only, and it was deprecated in R 4.2).

#memory.limit() #use this to figure out the default memory limit of your computer
#memory.limit(size = 2500) #use this to increase the default memory limit of your computer

On a Mac, the maximum size is usually 16 GB. You can change this from the Terminal (learn more about how here).

Importantly, you cannot allocate a memory limit that is larger than your computer RAM (you can technically set a larger number, but you literally would not have enough RAM to get to that point).

10.2 Data Importing

One of the most frustrating things about working with bigger data in R is dealing with its import. While useful, read.csv() suffers from two issues in this respect: it can be slow to import a large file into the R environment, and it gives no feedback while reading, so you can’t tell how long the import will take.
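Want to check this for yourself? Here is a quick baseline you can run with the file we import below (your timings will vary by machine):

#system.time(tweet_base <- read.csv("data/rtweet_academictwitter_20210115.csv"))
#the line above times base R's read.csv(); compare it with the readr and data.table timings below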

Reading functions from other packages, like fread() from data.table and read_csv() from readr, allow you to import data much more quickly and with more feedback. Below, we’ll use read_csv() from the readr package to import the dataset we used in the Data Wrangling III tutorial (#academictwitter tweets).

library(tidyverse)
tweet_df <- readr::read_csv("data/rtweet_academictwitter_20210115.csv")

#library(data.table)
#tweet_dt <- data.table::fread("data/rtweet_academictwitter_20210115.csv")
#the line above imports the same data frame, but using the data.table package

Want to compare how long each function takes? You can use the system.time() function in base R. The system.time() function will return a proc_time object which tells you how long the CPU took to execute the line within the parentheses. Use ?proc.time to learn more!

system.time(tweet_df <- readr::read_csv("data/rtweet_academictwitter_20210115.csv"))
##    user  system elapsed 
##    1.45    0.20    0.59
system.time(tweet_dt <- data.table::fread("data/rtweet_academictwitter_20210115.csv"))                                  
##    user  system elapsed 
##    0.22    0.00    0.18

(If you get a warning to use bit64 in fread, you can install the package using install.packages("bit64").)

Based on the results of our system.time() test, the fread() function is notably faster. You can learn more about other ways to import data here.

10.3 pryr

As I mentioned, one limitation of R is that anything stored in the environment takes up RAM. Even running functions takes memory (for example, to remember the commands in your History). But the bulk of your memory use will come from storing data frames and creating new ones.

To understand what’s going on in your memory, it can be helpful to use the package pryr. pryr “pries” back the hood of R so you can see what is going on with your objects and computer memory.

To check the size of a data frame, you can use the object_size() function in the pryr package.

#install.packages("pryr")
library(pryr)

pryr::object_size(tweet_df)
## 30.62 MB
pryr::object_size(tweet_dt)
## 29.08 MB

As you can see, the dataset imported with fread() is slightly smaller than the one imported with read_csv().

object_size() can take more than one object, which allows you to see the collective size of multiple datasets.

object_size(tweet_df, tweet_dt)
## 43.87 MB

If the objects share components, their collective total may take up less memory than the simple sum of the two (if the two datasets are completely different, object_size() will likely just be the sum of their individual sizes).
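You can see this sharing in action with a minimal sketch: a plain copy of a data frame shares its memory with the original until one of them is modified, so the collective size barely grows.

tweet_copy <- tweet_df #a copy initially shares all of its memory with the original
object_size(tweet_df, tweet_copy) #roughly the size of one data frame, not two
rm(tweet_copy) #clean up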

What if you wanted to learn about the total memory used in R? You can use the mem_used() function.

mem_used()
## 138 MB

Woah! Why is it larger than the combination of the data frames? Well, R takes up some space for each data frame, some space for running functions, and some space for storing the history. R may also be holding memory that it hasn’t released back to the computer.

rm(tweet_dt)
mem_used()
## 125 MB

When you delete an object, R will release some of that memory back to the computer (though, since rm() is a function, this will still use a little memory). In general, this is not a big deal–again, the main memory hogs are creating and storing data. So, if you keep creating copies of the same data frame, this will take more RAM (especially if the data frame is large). Duplicating a data frame with 10 tweets is not bad, but 10,000 tweets takes more memory, and 1 million tweets even more.
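pryr also includes a mem_change() function, which reports how much total memory use rises or falls when a line of code runs. A small illustration (exact numbers will differ on your machine):

mem_change(x <- rnorm(1e6)) #memory gained by creating a vector of a million numbers
mem_change(rm(x)) #memory released by removing it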

You can learn more about memory here.

10.4 ff

Alrighty! So let’s talk about the ff package. ff is a package that helps you work with datasets that are too large for RAM. As I mentioned above, ff works by storing your data on disk in flat files (hence the “ff”); the object in your environment holds only small pointers and metadata, and chunks of the data are pulled into RAM as needed.

In this tutorial, we will also use ffbase, a helper package that allows you to perform simple functions with ff objects. Let’s go ahead and install these packages now.

#packages <- c("ff", "ffbase") #this is a list of the relevant packages
#install.packages(packages) #you can install all the packages together

library(ff)
library(ffbase)
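
Before importing a full csv, here is a minimal sketch of the flat-file idea using a tiny ff vector (the data live in a temporary file on disk, not in RAM):

small_ff <- ff(1:10) #a small integer vector backed by a flat file on disk
filename(small_ff) #the path of the flat file storing the data
small_ff[3] #indexing reads just the values you ask for into RAM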

To import a file, we will want to use a function like read.csv.ffdf(). If you search this function (?read.csv.ffdf), you’ll find that read.csv.ffdf() is a specialized version of read.table.ffdf(). These read.table.ffdf() functions work by reading chunks of rows (“row-chunks”) and storing them in flat files on disk.

By default, the first chunk of rows that is read will be 1,000 rows long. You can modify this with the first.rows argument. R determines a lot of information from this first chunk; for example, the factor levels are sorted based on it. Increasing first.rows will take up more RAM but requires fewer chunks to be read; decreasing it can make the import easier when you have many, many columns.

The next.rows argument sets the size of the subsequent row-chunks. Keep in mind that the subsequent row-chunks are structured according to the first row-chunk.

tweet_ff <- read.csv.ffdf(file="data/rtweet_academictwitter_20210115.csv", #file name
                          colClasses = "factor", #read every column in as a factor
                          first.rows = 100, #size of the first chunk of rows
                          next.rows = 5000) #size of each subsequent row-chunk
class(tweet_ff)
## [1] "ffdf"

In many ways, the ffdf object works very similarly to a data frame stored in RAM. For example, you can see that the length() and nrow() functions work with this object. However, there are some limitations: ffdf objects do not support character (string) columns, so all character vectors are converted into factors or another variable type. This is because character vectors take up a lot of memory (a factor is stored as integer codes plus a table of unique levels, whereas a string stores every character).

nrow(tweet_ff) #how many rows are there in this new object?
## [1] 20182
length(tweet_ff) #how many columns are there in this new object?
## [1] 91
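
You can also confirm the character-to-factor conversion. Pulling values out of an ff column with [] reads them into RAM, where you can inspect them like any other vector (here I assume the text column from the rtweet export):

class(tweet_df$text) #a character vector in the regular data frame
class(tweet_ff$text[1:5]) #the same column in the ffdf comes back as a factor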

Although it may have taken longer to read a file using ff, these objects tend to be smaller. For example, let’s use the object_size() function we just learned!

pryr::object_size(tweet_ff)
## 26.13 MB
pryr::object_size(tweet_df)
## 30.62 MB

As you can see above, the csv we brought in using read.csv.ffdf() is smaller than the one imported with read_csv(). With only about 20,000 tweets, this is not that substantive of a difference. But, with 4 or 5 million tweets (the equivalent of a few GB of data), it can matter a lot.

While some functions work, others will not. In some of these instances, the ff package or its helper packages have equivalents for working with your larger dataset. For example, if you wanted to get the number of tweets sent from verified accounts in the dataset, you can use the table.ff() function in the ffbase package.

#table(tweet_ff$verified) #this function will produce an error

ffbase::table.ff(tweet_ff$verified) 
## 
## FALSE  TRUE 
## 20040   142

10.4.1 subset.ffdf

Often, you don’t even need the whole data frame–you may only need a specific portion or subset. For example, if you had a corpus of tweets about COVID, you may only be interested in tweets in English, Spanish, or Mandarin. For this, it may be useful to use the subset.ffdf() function in ffbase.

verified_tweets <- subset.ffdf(tweet_ff, verified == "TRUE")
nrow(verified_tweets)
## [1] 142
verified_df <- as.data.frame(verified_tweets) #turn your subset into a data frame

Once you’ve subsetted the data, you can then transform it back into a data.frame using the as.data.frame() function.
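
By the way, the language example from above would look very similar (this sketch assumes the lang column included in rtweet exports, with its two-letter language codes):

#english_tweets <- subset.ffdf(tweet_ff, lang == "en") #keep only English-language tweets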

Want to save your new data frame into a separate csv? You can use write.csv()! If you haven’t used the write.csv() function, its two main arguments are the object and the name of the new file: write.csv(data.frame, "name_of_new_file.csv").

write.csv(verified_df, "data/rtweet_academictwitter_verified.csv")

You can also save an object in .rds form using saveRDS(). Learn more about how here.
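Here is a minimal sketch using base R’s saveRDS() and readRDS() (the file name is just illustrative):

saveRDS(verified_df, "data/rtweet_academictwitter_verified.rds") #save a single object as a compressed .rds file
#verified_df2 <- readRDS("data/rtweet_academictwitter_verified.rds") #read it back in later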

Want to learn about the other functions you can use with ff objects? Check out this tutorial.