4.3 Lab: Three strategies: Local SQL database

Gold (2019) discusses “Three Strategies for Working with Big Data in R” namely:

Strategy 1: Sample and Model
Strategy 2: Chunk and Pull
Strategy 3: Push Compute to Data

Below we provide quick examples for all three strategies. Usually, such strategies are employed when we need to estimate more demanding models. To make the example somewhat simpler “the model” we estimate is simply the mean.

4.3.1 Strategy 1: Sample and Model

Using this strategy we sample from the data in the database, download the sample data and calculate our statistics (e.g., the mean) based on the sample.

First, using all the data in the database (if the database size is too large this is impossible). Second, we draw a sample an calculate the mean in the sample. The second strategy is feasible when the dataset is simply too large.

# Connect to the database
  library(DBI)
  db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
  dbListTables(db)

# Download all the data and calculate the mean for comparison
  data <- dbGetQuery(db, "SELECT retweet_count 
                     FROM table_tweets_en") %>%
  mutate(retweet_count = as.numeric(retweet_count))

  nrow(data) # check number of observations
  mean(data$retweet_count, na.rm=TRUE)
  
  
# Draw a random sample and calculate the mean
  data.sample <- dbGetQuery(db, "SELECT retweet_count
           FROM table_tweets_en
           ORDER BY RANDOM()
           LIMIT 5000") %>%
  mutate(retweet_count = as.numeric(retweet_count))

  nrow(data.sample) # check number of observations
  mean(data.sample$retweet_count, na.rm=TRUE) # not too bad

Q: Why is it useful to draw a random sample as opposed to analyzing the full population? (Tip: Statistical inference)

Here we do the same but now using the DBI and dplyr package. Using dplyr the sample_n() function does not work and we have to resort to the solution below.

# Connect to the database
  library(DBI)
  db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
  dbListTables(db)
  
# Inspect database
  dbListTables(db) # list tables
  data_db <- dplyr::tbl(db, "table_tweets_en") # create table from data source
  tally(data_db)
  data_db %>% select(text, retweet_count)

# Sample data, download it and calculate mean
  data.sample <- data_db %>% 
  select(text, retweet_count) %>% 
  mutate(x = random() %>% row_number()) %>%  # create rank based on random number
  filter(x <= 5000) %>% # only take first 5000
  collect() %>%
  mutate(retweet_count = as.numeric(retweet_count))

  nrow(data.sample) # check number of observations
  mean(data.sample$retweet_count, na.rm=TRUE) # not too bad

4.3.2 Strategy 2: Chunk and Pull

Using this strategy the data is chunked into separable units and each chunk is pulled separately and operated on serially, in parallel, or after recombining.

Below we will download chunks of the dataset and compute the mean for each chunk (you could also calculate a model or something similar).

# Let's calculate the mean retweets over months
  library(DBI)
  db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
  dbListTables(db)


# Add month year variable in db
  data_db <- data_db %>% 
    mutate(year_month = substr(tweet_created_at_date, 1,7)) %>% 
    compute() # add to db

# Inspect variable
  data_db %>% select(year_month)
  data_db %>% select(year_month) %>% collect() %>% unique()

# Get vector with unique values of year_month variable
  chunk_id <- data_db %>% select(year_month) %>% collect() %>% unique() %>% pull()
  
function_mean <- function(chunk_id) {
  # Pull a chunk of data
  mean <- data_db %>%
    dplyr::filter(year_month == chunk_id) %>%
    select(retweet_count) %>%
    collect() %>%
    summarize(retweet_count_mean = mean(retweet_count, na.rm=TRUE)) %>%
    pull(retweet_count_mean)
mean
}


set.seed(98765)
results <- lapply(chunk_id, function_mean) %>%
  suppressMessages()

names(results) <- chunk_id

# Calulate the overall mean from the chunk means
mean(unlist(results))

4.3.3 Strategy 3: Push Compute to Data

“In this strategy, the data is compressed on the database, and only the compressed data set is moved out of the database into R. It is often possible to obtain significant speedups simply by doing summarization or filtering in the database before pulling the data into R.”

# Connect to the database
  library(DBI)
  db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
  dbListTables(db)
  
# Calculate mean in database and store locally
average <- data_db %>% 
  select(text, retweet_count) %>% 
  summarize(mean_retweets = mean(retweet_count)) %>%
  collect()