4.3 Lab: Three strategies: Local SQL database
Gold (2019) discusses “Three Strategies for Working with Big Data in R” namely:
- Strategy 1: Sample and Model
- Strategy 2: Chunk and Pull
- Strategy 3: Push Compute to Data
Below we provide quick examples for all three strategies. Usually, such strategies are employed when we need to estimate more demanding models. To make the example somewhat simpler “the model” we estimate is simply the mean.
4.3.1 Strategy 1: Sample and Model
- Using this strategy we sample from the data in the database, download the sample data and calculate our statistics (e.g., the mean) based on the sample.
First, using all the data in the database (if the database size is too large this is impossible). Second, we draw a sample an calculate the mean in the sample. The second strategy is feasible when the dataset is simply too large.
# Connect to the database
library(DBI)
<- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
db dbListTables(db)
# Download all the data and calculate the mean for comparison
<- dbGetQuery(db, "SELECT retweet_count
data FROM table_tweets_en") %>%
mutate(retweet_count = as.numeric(retweet_count))
nrow(data) # check number of observations
mean(data$retweet_count, na.rm=TRUE)
# Draw a random sample and calculate the mean
<- dbGetQuery(db, "SELECT retweet_count
data.sample FROM table_tweets_en
ORDER BY RANDOM()
LIMIT 5000") %>%
mutate(retweet_count = as.numeric(retweet_count))
nrow(data.sample) # check number of observations
mean(data.sample$retweet_count, na.rm=TRUE) # not too bad
Q: Why is it useful to draw a random sample as opposed to analyzing the full population? (Tip: Statistical inference)
Here we do the same but now using the DBI and dplyr package. Using dplyr the sample_n()
function does not work and we have to resort to the solution below.
# Connect to the database
library(DBI)
<- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
db dbListTables(db)
# Inspect database
dbListTables(db) # list tables
<- dplyr::tbl(db, "table_tweets_en") # create table from data source
data_db tally(data_db)
%>% select(text, retweet_count)
data_db
# Sample data, download it and calculate mean
<- data_db %>%
data.sample select(text, retweet_count) %>%
mutate(x = random() %>% row_number()) %>% # create rank based on random number
filter(x <= 5000) %>% # only take first 5000
collect() %>%
mutate(retweet_count = as.numeric(retweet_count))
nrow(data.sample) # check number of observations
mean(data.sample$retweet_count, na.rm=TRUE) # not too bad
4.3.2 Strategy 2: Chunk and Pull
- Using this strategy the data is chunked into separable units and each chunk is pulled separately and operated on serially, in parallel, or after recombining.
Below we will download chunks of the dataset and compute the mean for each chunk (you could also calculate a model or something similar).
# Let's calculate the mean retweets over months
library(DBI)
<- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
db dbListTables(db)
# Add month year variable in db
<- data_db %>%
data_db mutate(year_month = substr(tweet_created_at_date, 1,7)) %>%
compute() # add to db
# Inspect variable
%>% select(year_month)
data_db %>% select(year_month) %>% collect() %>% unique()
data_db
# Get vector with unique values of year_month variable
<- data_db %>% select(year_month) %>% collect() %>% unique() %>% pull()
chunk_id
<- function(chunk_id) {
function_mean # Pull a chunk of data
<- data_db %>%
mean ::filter(year_month == chunk_id) %>%
dplyrselect(retweet_count) %>%
collect() %>%
summarize(retweet_count_mean = mean(retweet_count, na.rm=TRUE)) %>%
pull(retweet_count_mean)
mean
}
set.seed(98765)
<- lapply(chunk_id, function_mean) %>%
results suppressMessages()
names(results) <- chunk_id
# Calulate the overall mean from the chunk means
mean(unlist(results))
4.3.3 Strategy 3: Push Compute to Data
- “In this strategy, the data is compressed on the database, and only the compressed data set is moved out of the database into R. It is often possible to obtain significant speedups simply by doing summarization or filtering in the database before pulling the data into R.”
# Connect to the database
library(DBI)
<- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
db dbListTables(db)
# Calculate mean in database and store locally
<- data_db %>%
average select(text, retweet_count) %>%
summarize(mean_retweets = mean(retweet_count)) %>%
collect()