3.18 Lab: Google BigQuery
3.18.0.1 Authentication
How do we set it up? First, let's have a quick look at the Google BigQuery user interface.
The package we use to access BigQuery is called bigrquery; you can find a quick overview on its GitHub page.
To do anything on the platform we first have to authenticate. For this we need the key for our service account, which we store on our hard drive in a JSON file. Then we can use the bq_auth() function to authenticate (it will open a browser window so you can grant the permissions). Alternatively, we could use an API key.
- See this document on how to set up Google Cloud for using your research credits and how to set up an API key.
Billing account/project: Importantly, whenever you interact with Google Cloud you have to indicate which project in your Google account you are working in, in case you have to pay for any services. Below we always indicate the project/billing account by defining billing = "css-seminar-2021". You have to make sure that the corresponding name (below it is css-seminar-2021) is changed to the name of the project in your account. In your case it could be different, e.g., css-seminar-2021-487534.
Also, for some of the steps below (e.g., to see some of the links) you have to be logged into your account. In addition, the BigQuery API has to be enabled (you should be able to do that here).
# Load packages
pacman::p_load(bigrquery, DBI)
# Authenticate with the JSON key file from setting up BigQuery
bq_auth(path = "your_JSON_file.json")
# or authenticate with an API key
# bq_auth(token = "your_API_key")
Subsequently, we test whether authentication worked by accessing one of the BigQuery public datasets using an SQL query.
# Test whether authentication worked.
# Run the query and store the data in a dataframe
# Example data on Names in the U.S.: https://console.cloud.google.com/bigquery
tb <- bq_dataset_query(bq_dataset("bigquery-public-data",
"usa_names"),
query = "SELECT count(*) FROM usa_1910_current LIMIT 10",
billing = "css-seminar-2021"
)
bq_table_fields(tb) # 1 variable "f0_"
bq_table_download(tb)
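By the way, to see which tables a public dataset contains, we can list them with bq_dataset_tables() (a quick sketch, using the usa_names dataset we just queried):
# List the tables in a public dataset
ds <- bq_dataset("bigquery-public-data", "usa_names")
bq_dataset_tables(ds)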
3.18.0.2 Running queries on public datasets
Now, let's play with a massive dataset: a table that contains all trips completed by yellow and green taxis in New York City from 2009 to the present. You can find more information here.
This is one of the many publicly-available Google BigQuery tables (some exemplary queries). For a complete list of public datasets, see here. Note that when you use these datasets, storage is free but you still need to pay for the queries.
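Because queries are billed by the amount of data they scan, it can be worth estimating the cost before running them. A minimal sketch, assuming a recent bigrquery version that provides bq_perform_query_dry_run() and the billing project defined above:
# Estimate how many bytes a query would scan (a dry run; nothing is executed or billed)
bq_perform_query_dry_run(
  query = "SELECT passenger_count
           FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`",
  billing = "css-seminar-2021"
)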
For any connection we have to define a certain set of parameters: billing, project, dataset.
Let’s connect with this database/table and see how big it is:
# how many taxi trips in this table?
# 1. Create reference to project/dataset (?bq_dataset)
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
# 2. Submit/run a query
tb <- bq_dataset_query(ds,
query = "SELECT count(*) FROM tlc_yellow_trips_2018",
billing = "css-seminar-2021" # GC project name
)
# "css-seminar-2001" is the google cloud project name
# for a participant account it's normally followed by a number
# e.g., "css-seminar-2021-487534"
# 3. Download result of query
bq_table_download(tb)
# Try to run the query below directly in the BigQuery web console
# SELECT pickup_datetime FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018` LIMIT 10
Not bad! What is the distribution of trips by year and by month?
# number of trips per year?
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
tb <- bq_dataset_query(ds,
query = "SELECT EXTRACT(YEAR FROM pickup_datetime) AS year,
COUNT(*) AS trips FROM tlc_yellow_trips_2018
GROUP BY year
ORDER BY year",
billing = "css-seminar-2021"
)
bq_table_download(tb)
# number of trips per month?
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
tb <- bq_dataset_query(ds,
query = "SELECT EXTRACT(MONTH FROM pickup_datetime) AS month,
COUNT(*) AS trips
FROM tlc_yellow_trips_2018
GROUP BY month
ORDER BY month",
billing = "css-seminar-2021"
)
bq_table_download(tb)
How would we compute the average speed?
# First, let's compute distance and duration separately
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
tb <- bq_dataset_query(ds,
query = "SELECT AVG(trip_distance) AS avg_distance_miles,
AVG(DATETIME_DIFF(dropoff_datetime, pickup_datetime, HOUR)) AS avg_duration_hours
FROM tlc_yellow_trips_2018",
billing = "css-seminar-2021"
)
data <- bq_table_download(tb)
data$avg_distance_miles/data$avg_duration_hours # realistic?
# but it looks like there might be some outliers; let's try to fix that:
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
tb <- bq_dataset_query(ds,
query = "SELECT AVG(trip_distance) AS avg_distance_miles,
AVG(DATETIME_DIFF(dropoff_datetime, pickup_datetime, HOUR)) AS avg_duration_hours
FROM tlc_yellow_trips_2018
WHERE
trip_distance > 0
AND fare_amount/trip_distance BETWEEN 2 AND 10
AND dropoff_datetime > pickup_datetime",
billing = "css-seminar-2021"
)
data <- bq_table_download(tb)
data$avg_distance_miles/data$avg_duration_hours # realistic?
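Keep in mind that DATETIME_DIFF(..., HOUR) truncates to whole hours, so most trips count as zero hours. A finer-grained sketch (same filters as above, duration measured in seconds, speed averaged per trip):
# Average speed per trip, with duration measured in seconds
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
tb <- bq_dataset_query(ds,
                       query = "SELECT AVG(SAFE_DIVIDE(trip_distance,
                                  DATETIME_DIFF(dropoff_datetime, pickup_datetime, SECOND) / 3600))
                                  AS avg_mph
                                FROM tlc_yellow_trips_2018
                                WHERE trip_distance > 0
                                AND fare_amount/trip_distance BETWEEN 2 AND 10
                                AND dropoff_datetime > pickup_datetime",
                       billing = "css-seminar-2021"
)
bq_table_download(tb)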
And just like with our SQL queries earlier, we can compute averages over groups.
# average number of passengers depending on hour of day?
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
tb <- bq_dataset_query(ds,
query = "SELECT EXTRACT(HOUR FROM pickup_datetime) AS hour,
AVG(passenger_count) AS passengers_avg
FROM tlc_yellow_trips_2018
GROUP BY hour
ORDER BY hour",
billing = "css-seminar-2021"
)
data <- bq_table_download(tb)
# average number of passengers depending on weekday?
ds <- bq_dataset("bigquery-public-data", "new_york_taxi_trips")
tb <- bq_dataset_query(ds,
query = "SELECT EXTRACT(DAYOFWEEK FROM pickup_datetime) AS dayofweek,
AVG(passenger_count) AS passengers_avg
FROM tlc_yellow_trips_2018
GROUP BY dayofweek
ORDER BY dayofweek",
billing = "css-seminar-2021"
)
data <- bq_table_download(tb)
# Overview of date functions:
# https://cloud.google.com/bigquery/docs/reference/standard-sql/date_functions
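To eyeball the weekday pattern we just downloaded, we can plot the result (a minimal sketch, assuming ggplot2 is installed; it uses the data object from the weekday query above):
# Plot average passenger numbers by day of week
pacman::p_load(ggplot2)
ggplot(data, aes(x = factor(dayofweek), y = passengers_avg)) +
  geom_col() +
  labs(x = "Day of week (1 = Sunday)", y = "Average passengers")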
Another public dataset contains a large number of Hacker News stories:
# bigquery-public-data: contains datasets that are publicly available
# hacker_news: a dataset with stories and comments from the Hacker News website
# stories: the table in that dataset that we query below
ds <- bq_dataset("bigquery-public-data", "hacker_news")
tb <- bq_dataset_query(ds,
query = "SELECT count(*) FROM stories",
billing = "css-seminar-2021"
)
bq_table_download(tb)
ds <- bq_dataset("bigquery-public-data", "hacker_news")
tb <- bq_dataset_query(ds,
query = "SELECT text FROM stories LIMIT 5",
billing = "css-seminar-2021"
)
bq_table_download(tb)
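The stories table also has columns such as title and score (check the table schema in the BigQuery console to be sure). A sketch of a simple keyword query, assuming those columns:
# Count stories whose title mentions "python" (case-insensitive)
ds <- bq_dataset("bigquery-public-data", "hacker_news")
tb <- bq_dataset_query(ds,
                       query = "SELECT count(*) AS n_stories
                                FROM stories
                                WHERE LOWER(title) LIKE '%python%'",
                       billing = "css-seminar-2021"
)
bq_table_download(tb)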
3.18.0.3 Uploading data to BigQuery
Make sure you authenticate to be able to connect to BigQuery.
# Load packages
pacman::p_load(bigrquery, dplyr) # dplyr provides the %>% pipe used below
# Authenticate
bq_auth(path = "your_JSON_file.json")
# or authenticate with an API key
# bq_auth(token = "your_API_key")
First, we have to connect to our BigQuery project and create a dataset in which we can store tables (we authenticated above).
# Create a dataset (this is where you store the tables)
bq_dataset_create(bq_dataset(project = "css-seminar-2021",
dataset = "my_dataset"))
# "css-seminar-2001" is the google cloud project name
# Make sure to change it to the one in your google account (see start of lab)!
# For a participant account it's normally followed by a number
# e.g., "css-seminar-2021-487534"
# bq_dataset_delete(bq_dataset(project = "css-seminar-2021", dataset = "my_dataset"))
Then we can upload a table to that dataset. To start, we'll generate a very small data frame that contains simulated data.
data <- data.frame(var1 = 1:26, var2 = letters)
# Create/upload table to dataset
bq_table(project = "css-seminar-2021", # Create table and upload it to dataset
dataset = "my_dataset",
table = "my_table") %>%
bq_table_upload(values = data,
create_disposition = "CREATE_IF_NEEDED",
write_disposition = "WRITE_APPEND",
fields= as_bq_fields(data))
# If you want to make sure you don't confuse tables, add a timestamp to the table name
# table_name <- paste("data_news_", gsub(":|\\s|-", "_", Sys.time()), sep="")
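To check that the upload worked, we can ask whether the table exists and how many rows it has (a quick sketch):
# Verify the upload
my_table <- bq_table(project = "css-seminar-2021",
                     dataset = "my_dataset",
                     table = "my_table")
bq_table_exists(my_table) # should be TRUE
bq_table_nrow(my_table)   # should be 26 (the number of rows we uploaded)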
3.18.0.4 Downloading data
Above we discussed the following way to connect to and query a dataset:
# connect to a dataset
ds <- bq_dataset("css-seminar-2021", "my_dataset")
tb <- bq_dataset_query(ds,
query = "SELECT * FROM my_table",
billing = "css-seminar-2021"
)
data <- bq_table_download(tb)
# the same pattern works for aggregate queries, e.g., counting rows
ds <- bq_dataset("css-seminar-2021", "my_dataset")
tb <- bq_dataset_query(ds,
query = "SELECT count(*) FROM my_table",
billing = "css-seminar-2021"
)
data <- bq_table_download(tb)
3.18.0.5 DBI package
A more general way to connect to different databases is to use the DBI package and dbConnect() (analogous to working with SQL; see ?DBI-package).
# Load packages
pacman::p_load(bigrquery, DBI, dplyr, dbplyr)
# A more general function
ds <- dbConnect(
bigrquery::bigquery(),
project = "css-seminar-2021",
dataset = "my_dataset",
billing = "css-seminar-2021"
)
# Show tables in the dataset (disconnect with dbDisconnect() when done)
dbListTables(ds)
# dbDisconnect(ds)
# List variables
dbListFields(ds, "my_table")
# Download the table
data <- dbReadTable(ds, "my_table")
# Create a table from a local data frame
dbCreateTable(ds, "iris", iris) # Bug... may fail; dbWriteTable(ds, "iris", iris) is an alternative that also uploads the rows
# Show queries
tb_query <- tbl(ds, "my_dataset.my_table")
tb_query %>% show_query()
tb_query <- tbl(ds, "my_dataset.my_table") %>% summarise(n = n())
tb_query %>% show_query()
# Simply running tb_query does not work here,
# but you can try the generated SQL directly in the BigQuery web console:
# SELECT count(*) AS `n` FROM my_dataset.my_table
# You can also plug it into bq_dataset_query() as above
# (note: bq_dataset_query() needs a bq_dataset reference, not the DBI connection ds)
ds_bq <- bq_dataset("css-seminar-2021", "my_dataset")
tb <- bq_dataset_query(ds_bq,
                       query = "SELECT count(*) AS `n` FROM my_dataset.my_table",
                       billing = "css-seminar-2021"
)
data <- bq_table_download(tb)