Chapter 25 Handling Big Data with ff

Hello! This week, we’ll be talking about the ff package, which we can use to read large datasets into R. We’ll also talk a little about memory in R, as this is an important precursor to understanding why we even need ff.

First, you may want to adjust some of your option defaults. For example, when you read a file into R, you may notice that large numbers are displayed in scientific notation (a compact way of expressing very large or very small numbers; learn more here). This can be annoying when you just want to see the full 8-digit user number of a Twitter account. You can turn off the scientific notation default using options()

options(scipen = 999) #use this so that numbers in scientific notation are written out

25.1 Memory in R

One important feature of R is that it holds nearly everything it works with in RAM. If your computer does not have enough RAM, it will not be able to read or process large datasets. Several packages, including ff, let you work around this limitation using clever storage techniques (ff uses disk storage). Keep in mind that every new object you save in the environment also takes up RAM.

On a Windows PC, R may not be using all of the memory available to it. In versions of R before 4.2, you could check and extend the limit using the memory.limit() function (this function is defunct in R 4.2 and later, where memory is managed automatically).

#memory.limit() #use this to figure out the default memory limit of your computer
#memory.limit(size = 2500) #use this to increase the default memory limit of your computer

On a Mac, the maximum size is usually 16 GB. You can change this from the Terminal by setting the R_MAX_VSIZE environment variable before starting R (learn more about how here).
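
If you prefer to stay inside R, here is a small sketch (this assumes your build of R respects the R_MAX_VSIZE environment variable, which the CRAN macOS builds do; the 32Gb value below is just an example):

Sys.getenv("R_MAX_VSIZE") #check the current limit (an empty string means it is unset)
#to raise the limit, add a line like the one below to your ~/.Renviron file
#(usethis::edit_r_environ() opens it for you), then restart R:
#R_MAX_VSIZE=32Gb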

Importantly, allocating a memory limit larger than your computer's physical RAM will not help: you can technically set a larger number, but R still cannot use memory your machine does not have.

25.2 Data Importing

One of the most frustrating things about working with bigger data in R is importing it in the first place. While useful, read.csv() suffers from some issues here: it can be time consuming to import a file into the R Environment, and it gives you no indication of how long the import will take. Reading functions from other packages, such as fread() from data.table and read_csv() from readr, allow you to import data much more quickly and with more feedback. Below, we'll use read_csv() from the readr package to import the Barbenheimer dataset.

library(tidyverse)
fb_df <- readr::read_csv("data/crowdtangle_barbenheimer_2023.csv")

#library(data.table)
#fb_dt <- data.table::fread("data/crowdtangle_barbenheimer_2023.csv")
#the line above imports the same data frame, but using the data.table package

Want to compare how long each function takes? You can use the system.time() function in base R. system.time() returns a proc_time object that reports the CPU time (user and system) and the wall-clock time (elapsed) it took to execute the expression inside the parentheses.

system.time(fb_df <- read_csv("data/crowdtangle_barbenheimer_2023.csv"))
##    user  system elapsed 
##    0.86    0.08    0.41
system.time(fb_dt <- data.table::fread("data/crowdtangle_barbenheimer_2023.csv"))                                  
##    user  system elapsed 
##    0.28    0.00    0.27

(If fread() gives you a warning suggesting the bit64 package, you can install it using install.packages("bit64").)

Based on the results of our system.time() test, the fread() function is notably faster. You can learn more about other ways to import data here.

25.3 pryr

As I mentioned, one limitation of R is that anything stored in the environment takes up RAM. Even running functions takes memory (for example, to remember the commands you last wrote in your History). But the bulk of your memory use will really come from storing data frames and creating new ones.

To understand what’s going on in your memory, it can be helpful to use the package pryr. pryr “pries” back the hood of R so you can see what is going on with your objects and computer memory.

To check the size of a data frame, you can use the object_size() function in the pryr package.

#install.packages("pryr")
library(pryr)

pryr::object_size(fb_df)
## 22.27 MB
pryr::object_size(fb_dt)
## 23.30 MB

object_size() can take more than one object, which allows you to see the collective size of multiple datasets.

object_size(fb_df, fb_dt)
## 34.44 MB

If the objects share components, their collective total may be less than the simple sum of their individual sizes (if the two datasets share nothing, object_size() will simply return the sum of the two).
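
Here is a small illustration of that sharing (the objects are made up for this example):

x <- rnorm(1e6) #a vector of one million values (roughly 8 MB)
y <- list(a = x, b = x) #a list whose two elements both point to x
object_size(x) #size of the vector alone
object_size(x, y) #barely more than x alone, because the underlying data is shared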

What if you wanted to learn about the total memory used in R? You can use the mem_used() function.

mem_used()
## 123 MB

Woah! Why is it larger than the two data frames combined? Well, R takes up some space for each data frame, some space for running functions, and some space for storing the history. R may also be holding on to memory that it has not yet released back to the operating system.

rm(fb_dt)
mem_used()
## 110 MB

When you delete an object, R will release some of that memory back to the computer (though, since rm() is a function, even deleting uses a little memory). In general, this is not a big deal: again, the main memory hogs are creating and storing data. So, if you keep creating copies of the same data frame, this will take more RAM (especially if the data frame is large). Duplicating a data frame with 10 posts is not bad, but 10,000 posts takes more memory, and 1 million posts even more.
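
You can watch this copy-on-modify behavior with a small experiment like the one below (the objects are illustrative):

x <- rnorm(1e6) #roughly 8 MB of numbers
mem_used()
y <- x #no copy yet: x and y share the same underlying data
y[1] <- 0 #modifying y forces R to make a real copy
mem_used() #now roughly 8 MB higher than before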

You can learn more about memory here.

25.4 ff

Alrighty! So let's talk about the ff package. ff is a package that helps you work with datasets that are too large for RAM. As I mentioned above, ff works by keeping your data in disk storage. It does this using flat files (hence the "ff"): the actual values live in simple binary files on disk, and the ff objects in your R environment are lightweight pointers to those files.

In this tutorial, we will also use ffbase, a helper package that allows you to perform simple functions with ff objects. Let’s go ahead and install these packages now.

#packages <- c("ff", "ffbase") #this is a list of the relevant packages
#install.packages(packages) #you can install all the packages together

library(ff)
## Warning: package 'bit' was built under R version 4.2.2
library(ffbase)
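
Before we import a real dataset, here is a minimal sketch of the flat-file idea (the object name is arbitrary): an ff vector lives in a file on disk, and the filename() function tells you where.

x_ff <- ff(vmode = "double", length = 1e6) #one million doubles stored in a flat file, not RAM
filename(x_ff) #the path to the backing file on disk
x_ff[1:5] <- 1:5 #reads and writes pass through to that file
x_ff[1:5]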

To import a file, we will want to use a function like read.csv.ffdf(). If you pull up the help page for this function (?read.csv.ffdf), you'll find that read.csv.ffdf() is a specialized version of read.table.ffdf(). These read.table.ffdf() functions work by storing chunks of rows ("row-chunks") as flat files in disk storage.

By default, the first chunk of rows that is read is 1,000 rows long. You can modify this with the first.rows argument. R determines a lot of information from this first chunk; for example, the factor levels are sorted based on it. Increasing the first.rows argument will take up more RAM but will require fewer flat files. Decreasing the first.rows argument will make the initial read easier when you have many, many columns.

The next.rows argument sets the size of the subsequent flat files. Keep in mind that the subsequent row-chunks are structured according to the first row-chunk.

fb_ff <- read.csv.ffdf(file="data/crowdtangle_barbenheimer_2023.csv", #file name
                          colClasses = "factor", #read every column in as a factor
                          first.rows = 100, #first chunk of rows that are read
                          next.rows = 5000)
class(fb_ff)
## [1] "ffdf"

In many ways, the ffdf object works very similarly to a data frame stored in RAM. For example, you can see that the length() and nrow() functions work with this object. However, there are some limitations: ffdf objects do not support character (string) columns, so all character vectors are converted into factors or another variable type. This is because character vectors take up a lot of memory (a factor level is stored as a single integer, whereas a string stores every character).

nrow(fb_ff) #how many rows are there in this new object?
## [1] 16794
length(fb_ff) #how many columns are there in this new object?
## [1] 41
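
This factor conversion is also part of why ffdf objects can be compact: repeated strings collapse into small integer codes. A quick illustration (the vectors are made up for this example):

s <- rep(c("Link", "Photo", "Video"), times = 1e5) #300,000 strings
f <- factor(s) #the same values stored as integer codes plus three levels
pryr::object_size(s)
pryr::object_size(f) #roughly half the size of the character version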

Although it may take longer to read a file using ff, the resulting objects tend to be a little smaller. For example, let's use the object_size() function we just learned!

pryr::object_size(fb_ff)
## 21.69 MB
pryr::object_size(fb_df)
## 22.27 MB

As you can see above, the csv we brought in using read.csv.ffdf() is smaller than the one imported with read_csv(). With only about 16,000 posts, this is not a substantial difference. But with 4 or 5 million posts (the equivalent of a few GB of data), it can matter a lot.

While some functions work with ffdf objects, others will not. In some of these instances, the ff package or its helper packages have equivalents for working with your larger dataset; for example, the table.ff() function in the ffbase package can be used to count the different types of posts.

#table(fb_ff$Type) #this function will produce an error

ffbase::table.ff(fb_ff$Type) 
## 
##                 Link         Native Video                Photo 
##                 7439                  796                 8108 
##               Status  Live Video Complete Live Video Scheduled 
##                  137                   82                   38 
##                Video              YouTube 
##                   97                   97

25.4.1 subset.ffdf

Often, you don't even need the whole data frame; you may only need a specific portion or subset. For example, if you had a bunch of social media posts, you may only be interested in the video posts. For this, it may be useful to use the subset.ffdf() function in ffbase.

video_fb <- subset.ffdf(fb_ff, Type == "Video")
nrow(video_fb)
## [1] 97
video_df <- as.data.frame(video_fb) #turn your subset into a data frame

Once you’ve subsetted the data, you can then transform it back into a data.frame using the as.data.frame() function.

Want to save your new data frame into a separate csv? You can use write.csv()! If you haven't used the write.csv() function before, its two main arguments are the object and the file name: write.csv(data.frame, "name_of_new_file.csv"). (You may also want to add row.names = FALSE so that row numbers are not written out as an extra column.)

write.csv(video_df, "data/crowdtangle_barbenheimer_video_2023.csv")

You can also save an object in rds form, R's native single-object format. Learn more about how here.
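
Here is a minimal example using base R's saveRDS() and readRDS() (the file name is just illustrative):

saveRDS(video_df, "data/crowdtangle_barbenheimer_video_2023.rds") #save the data frame in rds form
video_df <- readRDS("data/crowdtangle_barbenheimer_video_2023.rds") #read it back in later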

Want to learn about the other functions you can use with ff objects? Check out this tutorial.