## 4.3 Lab: Three strategies: Local SQL database

Gold (2019) discusses “Three Strategies for Working with Big Data in R” namely:

• Strategy 1: Sample and Model
• Strategy 2: Chunk and Pull
• Strategy 3: Push Compute to Data

Below we provide quick examples for all three strategies. Usually, such strategies are employed when we need to estimate more demanding models. To make the example somewhat simpler “the model” we estimate is simply the mean.

### 4.3.1 Strategy 1: Sample and Model

• Using this strategy we sample from the data in the database, download the sample data and calculate our statistics (e.g., the mean) based on the sample.

First, using all the data in the database (if the database size is too large this is impossible). Second, we draw a sample an calculate the mean in the sample. The second strategy is feasible when the dataset is simply too large.

# Connect to the database
library(DBI)
db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
dbListTables(db)

data <- dbGetQuery(db, "SELECT retweet_count
FROM table_tweets_en") %>%
mutate(retweet_count = as.numeric(retweet_count))

nrow(data) # check number of observations
mean(data$retweet_count, na.rm=TRUE) # Draw a random sample and calculate the mean data.sample <- dbGetQuery(db, "SELECT retweet_count FROM table_tweets_en ORDER BY RANDOM() LIMIT 5000") %>% mutate(retweet_count = as.numeric(retweet_count)) nrow(data.sample) # check number of observations mean(data.sample$retweet_count, na.rm=TRUE) # not too bad

Q: Why is it useful to draw a random sample as opposed to analyzing the full population? (Tip: Statistical inference)

Here we do the same but now using the DBI and dplyr package. Using dplyr the sample_n() function does not work and we have to resort to the solution below.

# Connect to the database
library(DBI)
db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
dbListTables(db)

# Inspect database
dbListTables(db) # list tables
data_db <- dplyr::tbl(db, "table_tweets_en") # create table from data source
tally(data_db)
data_db %>% select(text, retweet_count)

data.sample <- data_db %>%
select(text, retweet_count) %>%
mutate(x = random() %>% row_number()) %>%  # create rank based on random number
filter(x <= 5000) %>% # only take first 5000
collect() %>%
mutate(retweet_count = as.numeric(retweet_count))

nrow(data.sample) # check number of observations
mean(data.sample\$retweet_count, na.rm=TRUE) # not too bad

### 4.3.2 Strategy 2: Chunk and Pull

• Using this strategy the data is chunked into separable units and each chunk is pulled separately and operated on serially, in parallel, or after recombining.

Below we will download chunks of the dataset and compute the mean for each chunk (you could also calculate a model or something similar).

# Let's calculate the mean retweets over months
library(DBI)
db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
dbListTables(db)

# Add month year variable in db
data_db <- data_db %>%
mutate(year_month = substr(tweet_created_at_date, 1,7)) %>%

# Inspect variable
data_db %>% select(year_month)
data_db %>% select(year_month) %>% collect() %>% unique()

# Get vector with unique values of year_month variable
chunk_id <- data_db %>% select(year_month) %>% collect() %>% unique() %>% pull()

function_mean <- function(chunk_id) {
# Pull a chunk of data
mean <- data_db %>%
dplyr::filter(year_month == chunk_id) %>%
select(retweet_count) %>%
collect() %>%
summarize(retweet_count_mean = mean(retweet_count, na.rm=TRUE)) %>%
pull(retweet_count_mean)
mean
}

set.seed(98765)
results <- lapply(chunk_id, function_mean) %>%
suppressMessages()

names(results) <- chunk_id

# Calulate the overall mean from the chunk means
mean(unlist(results))

### 4.3.3 Strategy 3: Push Compute to Data

• “In this strategy, the data is compressed on the database, and only the compressed data set is moved out of the database into R. It is often possible to obtain significant speedups simply by doing summarization or filtering in the database before pulling the data into R.”
# Connect to the database
library(DBI)
db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
dbListTables(db)

# Calculate mean in database and store locally
average <- data_db %>%
select(text, retweet_count) %>%
summarize(mean_retweets = mean(retweet_count)) %>%
collect()