20.1 SQL at scale

  • General Strategy: Store data in data warehouse
    • Pass subset of data from warehouse to R
    • Translate R code (e.g., dplyr) into SQL and pass it to the warehouse (see the sketch below)
    • Normally you use different packages for that communication (dplyr, DBI, RHadoop, SparkR)
  • Many many data warehouse solutions
  • We’ll have a look at…
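
A minimal sketch of the second strategy (an illustration, assuming dplyr and dbplyr are installed): dplyr verbs are translated into SQL that the warehouse executes, so only the (small) result travels back to R.

# Translate a dplyr pipeline into SQL without touching a real database
# (dbplyr's lazy_frame()/simulate_dbi() are used here purely for illustration)
pacman::p_load(dplyr, dbplyr)

lazy_tbl <- lazy_frame(year = 2018, trips = 1, con = simulate_dbi())
lazy_tbl %>%
  group_by(year) %>%
  summarise(trips = sum(trips, na.rm = TRUE)) %>%
  show_query()  # prints the SQL that would be sent to the warehouse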

20.1.1 SQL at scale: Google BigQuery

  • GBQ is one of many commercial SQL databases available (Amazon Redshift, Microsoft Azure, Oracle Live SQL, …)
  • Used by many financial and commercial companies
  • Advantages:
    • Integration with other Google Cloud products (Google Drive, Google Cloud Storage, Google Cloud ML products)
    • Scalable: same SQL (or R) syntax for datasets of any size
    • Easy to collaborate and export results
    • Affordable pricing and cost control
    • API access allows integration with R or python
    • Good documentation

20.1.2 Lab: Google Big Query & 3 strategies

20.1.2.0.1 Authentication

How do we set it up? Let’s first have a quick look at the Google BigQuery user interface.

The package we use to access BigQuery is called bigrquery and you can find a quick overview on its GitHub page.

To do anything on the platform we first have to authenticate. For this we’ll need the key for our service account, which we store on our hard drive in a JSON file. Then we can use the bq_auth() function to authenticate (it will open a browser window so you can grant the permissions). Alternatively, we could use an API key.

  • See this document on how to set up the Google Cloud for using your research credits and how to set up an API key.

Billing account/project: Importantly, whenever you interact with the Google Cloud you have to indicate which project in your Google account you are working in, in case you have to pay for any services. Below we always indicate the project/billing account by defining billing = "css-seminar-2022". Make sure that the corresponding name (below it is css-seminar-2022) is changed to the name of the project in your Google account. In your case it could be different, e.g., css-seminar-2022-487534.
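As a small optional convenience (not part of the original setup), you could define the billing project once and reuse that object in the calls below:

billing <- "css-seminar-2022"  # replace with the project name in your account
# then pass it along, e.g., bq_dataset_query(..., billing = billing)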

Also, for some of the steps below (e.g., to see some of the links), you have to be logged into your Google account.

In addition, the BigQuery API has to be enabled (you should be able to do that here).

# Load packages
pacman::p_load(bigrquery, DBI)


# Authenticate with the JSON file from setting up BigQuery
bq_auth(path = "your_JSON_file.json")

# or authenticate with API key
# bq_auth(token = "your_API_key")

# PB: bq_auth("C:/Users/Paul/Google Drive/2-Teaching/2022 Computational Social Science/keys/css-seminar-2022-a1e75382ae2c.json")
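
# Quick sanity check (a sketch): is a token cached, and for which account?
bq_has_token()  # should be TRUE after bq_auth()
bq_user()       # email of the authenticated (service) account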

Subsequently, we test whether authentication worked by accessing one of the BigQuery public datasets using an SQL query.

# Test whether authentication worked.

# Run the query and store the data in a dataframe
# Example data on Names in the U.S.: https://console.cloud.google.com/bigquery
tb <- bq_dataset_query(
  bq_dataset("bigquery-public-data", "usa_names"),
  query = "SELECT count(*) FROM usa_1910_current LIMIT 10",
  billing = "css-seminar-2022"
)
bq_table_fields(tb) # 1 variable "f0_"
bq_table_download(tb)
20.1.2.0.2 Running queries on public datasets

Now, let’s play with a massively large dataset – a table that contains all trips completed in Yellow and Green taxis in New York City from 2009 to present. You can find more information here.

This is one of the many publicly-available Google BigQuery tables (some exemplary queries). For a complete list of public datasets, see here. Note that when you use these datasets, storage is free but you still need to pay for the queries.

For any connection we have to define a certain set of parameters: billing, project, dataset.

Let’s connect with this database/table and see how big it is:

# how many taxi trips in this table?
# 1. Create reference to project/dataset (?bq_dataset)
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")

# 2. Submit/run a query
tb <- bq_dataset_query(ds,
  query = "SELECT count(*) FROM tlc_yellow_trips_2018",
  billing = "css-seminar-2022" # GC project name
)

# "css-seminar-2001" is the google cloud project name
# for a participant account it's normally followed by a number 
# e.g., "css-seminar-2022-487534"



# 3. Download result of query
bq_table_download(tb)

# Try to run this query directly in the BigQuery console (web UI)
# SELECT pickup_datetime FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018` LIMIT 10

Not bad! What is the distribution of trips by year and by month?

# number of trips per year?

ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
tb <- bq_dataset_query(ds,
  query = "SELECT EXTRACT(YEAR FROM pickup_datetime) AS year, 
    COUNT(*) AS trips FROM tlc_yellow_trips_2018
  GROUP BY year
  ORDER BY year",
  billing = "css-seminar-2022"
)
bq_table_download(tb)


# number of trips per month?
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
tb <- bq_dataset_query(ds,
  query = "SELECT EXTRACT(MONTH FROM pickup_datetime) AS month, 
    COUNT(*) AS trips 
    FROM tlc_yellow_trips_2018
  GROUP BY month
  ORDER BY month",
  billing = "css-seminar-2022"
)
bq_table_download(tb)
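
# Optional visualization (a sketch, assuming ggplot2 is installed):
# plot the number of trips per month
pacman::p_load(ggplot2)
monthly <- bq_table_download(tb)
ggplot(monthly, aes(x = factor(month), y = trips)) +
  geom_col() +
  labs(x = "Month (2018)", y = "Number of trips")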

How would we compute the average speed?

# First, let's compute distance and duration separately
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
tb <- bq_dataset_query(ds,
  query = "SELECT AVG(trip_distance) AS avg_distance_miles,
   AVG(DATETIME_DIFF(dropoff_datetime, pickup_datetime, HOUR)) AS avg_duration_hours
    FROM tlc_yellow_trips_2018",
  billing = "css-seminar-2022"
)
data <- bq_table_download(tb)

data$avg_distance_miles/data$avg_duration_hours # realistic?




# but it sounds like there might be some outliers, let's try to fix it:
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
tb <- bq_dataset_query(ds,
  query = "SELECT AVG(trip_distance) AS avg_distance_miles,
   AVG(DATETIME_DIFF(dropoff_datetime, pickup_datetime, HOUR)) AS avg_duration_hours
    FROM tlc_yellow_trips_2018
    WHERE 
    trip_distance > 0
    AND fare_amount/trip_distance BETWEEN 2 AND 10
    AND dropoff_datetime > pickup_datetime",
  billing = "css-seminar-2022"
)
data <- bq_table_download(tb)

data$avg_distance_miles/data$avg_duration_hours # realistic?
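
# Variation (a sketch, not part of the original lab): compute the speed per trip
# directly in SQL, using minutes for finer granularity and the same outlier filters
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
tb <- bq_dataset_query(ds,
  query = "SELECT AVG(trip_distance /
             (DATETIME_DIFF(dropoff_datetime, pickup_datetime, MINUTE) / 60)) AS avg_mph
    FROM tlc_yellow_trips_2018
    WHERE
    trip_distance > 0
    AND fare_amount/trip_distance BETWEEN 2 AND 10
    AND DATETIME_DIFF(dropoff_datetime, pickup_datetime, MINUTE) > 0",
  billing = "css-seminar-2022"
)
bq_table_download(tb)  # average miles per hour across the filtered trips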

And just like with our SQL queries earlier, we can compute averages over groups.

# average number of passengers depending on hour of day?
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
tb <- bq_dataset_query(ds,
  query = "SELECT EXTRACT(HOUR FROM pickup_datetime) AS hour, 
    AVG(passenger_count) AS passengers_avg
    FROM tlc_yellow_trips_2018
  GROUP BY hour
  ORDER BY hour",
  billing = "css-seminar-2022"
)
data <- bq_table_download(tb)

# average number of passengers depending on weekday?
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
tb <- bq_dataset_query(ds,
  query = "SELECT EXTRACT(DAYOFWEEK FROM pickup_datetime) AS dayofweek, 
    AVG(passenger_count) AS passengers_avg
    FROM tlc_yellow_trips_2018
  GROUP BY dayofweek
  ORDER BY dayofweek",
  billing = "css-seminar-2022"
)
data <- bq_table_download(tb)

# Overview of date functions:
# https://cloud.google.com/bigquery/docs/reference/standard-sql/date_functions

Another public dataset contains a large number of Hacker News stories:

# bigquery-public-data: contains databases that are publicly available
# hacker_news: a database that contains collections from a news website (Hacker News)
# stories: the table in the database that we worked with before

ds <- bq_dataset("bigquery-public-data", "hacker_news")
tb <- bq_dataset_query(ds,
  query = "SELECT count(*) FROM stories",
  billing = "css-seminar-2022"
)
bq_table_download(tb)



ds <- bq_dataset("bigquery-public-data", "hacker_news")
tb <- bq_dataset_query(ds,
  query = "SELECT text FROM stories LIMIT 5",
  billing = "css-seminar-2022"
)
bq_table_download(tb)
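
# A further sketch, using only the column already queried above:
# how many stories actually contain text?
ds <- bq_dataset("bigquery-public-data", "hacker_news")
tb <- bq_dataset_query(ds,
  query = "SELECT COUNT(*) AS n_stories,
    COUNTIF(text IS NOT NULL AND text != '') AS n_with_text
    FROM stories",
  billing = "css-seminar-2022"
)
bq_table_download(tb)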
20.1.2.0.3 Uploading data to BigQuery

Make sure you authenticate to be able to connect to BigQuery.

# Load packages (dplyr provides the %>% pipe used below)
pacman::p_load(bigrquery, dplyr)

# Authenticate
bq_auth(path = "your_JSON_file.json")
# or authenticate with API key
# bq_auth(token = "your_API_key")

# PB: bq_auth("C:/Users/Paul/Google Drive/2-Teaching/2022 Computational Social Science/keys/css-seminar-2022-a1e75382ae2c.json")

First we have to connect to our BigQuery project and create a dataset in which we can store tables (we authenticated above).

# Create a dataset (this is where you store the tables)
bq_dataset_create(bq_dataset(project = "css-seminar-2022",
                             dataset = "my_dataset"))

# "css-seminar-2001" is the google cloud project name
# Make sure to change it to the one in your google account (see start of lab)!
# For a participant account it's normally followed by a number 
# e.g., "css-seminar-2022-487534"

# bq_dataset_delete(bq_dataset(project = "css-seminar-2022", dataset = "my_dataset"))
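
# Quick check (a sketch): does the dataset now exist?
bq_dataset_exists(bq_dataset("css-seminar-2022", "my_dataset"))  # should be TRUE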

Then we can upload a table to that dataset. To start, we’ll generate a very small dataset that contains simulated data.

data <- data.frame(var1 = 1:26, var2 = letters)

# Create a table reference and upload the data to the dataset
bq_table(project = "css-seminar-2022",
         dataset = "my_dataset",
         table   = "my_table") %>%
  bq_table_upload(values = data,
                  create_disposition = "CREATE_IF_NEEDED",
                  write_disposition = "WRITE_APPEND",
                  fields = as_bq_fields(data))

# If you want to make sure you don't confuse tables, add a timestamp to the table name
  # table_name <- paste("data_news_", gsub(":|\\s|-", "_", Sys.time()), sep="")
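
# Optional check (a sketch): verify that the upload worked
tbl_ref <- bq_table("css-seminar-2022", "my_dataset", "my_table")
bq_table_exists(tbl_ref)  # TRUE once the table has been created
bq_table_nrow(tbl_ref)    # 26 rows for the simulated data above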
20.1.2.0.4 Downloading data

Above we discussed the following way to connect to and query a database:

# connect to a dataset
ds <- bq_dataset("css-seminar-2022", "my_dataset")
tb <- bq_dataset_query(ds,
  query = "SELECT * FROM my_table",
  billing = "css-seminar-2022"
)
data <- bq_table_download(tb)


# connect to a dataset
ds <- bq_dataset("css-seminar-2022", "my_dataset")
tb <- bq_dataset_query(ds,
  query = "SELECT count(*) FROM my_table",
  billing = "css-seminar-2022"
)
data <- bq_table_download(tb)
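
# Alternative (a sketch): download a whole table directly, without writing SQL,
# by passing a table reference to bq_table_download()
data_all <- bq_table_download(bq_table("css-seminar-2022", "my_dataset", "my_table"))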
20.1.2.0.5 DBI package

A more general way to connect to different databases is to use the DBI package and dbConnect(), analogous to working with SQL databases earlier (?DBI-package).

# Load packages (DBI for the connection functions, dplyr/dbplyr for tbl() and show_query())
pacman::p_load(bigrquery, DBI, dplyr, dbplyr)

# A more general function
ds <- dbConnect(
  bigrquery::bigquery(),
  project = "css-seminar-2022",
  dataset = "my_dataset",
  billing = "css-seminar-2022"
)

# Show tables in dataset and disconnect
dbListTables(ds)
# dbDisconnect(ds)

# List variables
dbListFields(ds, "my_table")


# Download the table
data <- dbReadTable(ds, "my_table")
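
# You can also send raw SQL through the DBI connection with dbGetQuery()
# (a sketch; the connection's default dataset "my_dataset" is used here)
dbGetQuery(ds, "SELECT count(*) AS n FROM my_table")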


# create table
dbCreateTable(ds, "iris", iris) # Bug...


# Show queries
tb_query <- tbl(ds, "my_dataset.my_table")
tb_query %>% show_query()

tb_query <- tbl(ds, "my_dataset.my_table") %>% summarise(n = n())
tb_query %>% show_query()

# Simply using tb_query does not work
# BUT try the expression below in the BigQuery console
# SELECT count(*) AS `n` FROM my_dataset.my_table

# And you can run the generated SQL with the bigrquery functions from above
# (note: bq_dataset_query() expects a bq_dataset reference, not the DBI connection)
tb <- bq_dataset_query(bq_dataset("css-seminar-2022", "my_dataset"),
  query = "SELECT count(*) AS `n` FROM my_dataset.my_table",
  billing = "css-seminar-2022"
)
data <- bq_table_download(tb)
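
# To actually run the translated query and fetch the result into R, use collect()
# (a sketch of the "translate R code, execute in the warehouse" strategy)
tb_query <- tbl(ds, "my_dataset.my_table") %>% summarise(n = n())
result <- collect(tb_query)
result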

20.1.3 Exercise: Setting up & querying Google BigQuery

  1. Start by setting up accounts on Google BigQuery and Google Cloud Storage.
    • Normally you just need a Google account and have to set up a payment scheme.
  2. Earlier you analyzed the dataset containing tweets coded according to sentiment. Read about this dataset here: https://www.kaggle.com/kazanova/sentiment140.
  3. Download the dataset from the course data folder (“tweets-sentiment140.csv”).
  4. Import the dataset, take a subsample of 10000 tweets from this dataset and store it in a new object called “data”.
  5. Then upload the data to BigQuery, into your project “css-seminar-2022”.
    • Thereby, create a new dataset called “data_tweets” and store the data in a table called “table_tweets”.
  6. Now you can access your stored data remotely and run different queries. (Tip: SELECT * FROM `css-seminar-2022.data_tweets.table_tweets`)
    1. How many rows are there in the table “table_tweets”?
    2. How many rows are there if you group by the first column/the sentiment score?
    3. Pull out a random sample of 50 rows and calculate the mean sentiment score. Compare that to the mean sentiment score of a random sample of 2000 rows.
    4. Aggregate/group by username, calculate the average sentiment per username, and order the table so as to see the 20 users with the worst average score across their tweets.
    5. Pull out a random sample of 5000 rows and store them locally in a .csv file.