11.7 Lab 10: Setting up & using Google BigQuery
How to set it up…
- Let’s check out the platform
- Set up a BigQuery account (e.g., see this guide) and check out the user interface
- Create a new project
- The project name is important later for access
- I called mine "Big Data for Social Scientists"
- Load the necessary packages
# install.packages("bigrquery")
library(bigrquery)
To illustrate the potential of SQL queries on large-scale databases, we'll now work with tables hosted in BigQuery, Google's service for interactive analysis of massive datasets.
You can read more about how to set up a Google BigQuery database here and about its SQL syntax here.
Set up an account now! Normally, you will be asked to log in automatically once you start using the package; the sketch after the links below shows how to authenticate explicitly.
- Google BigQuery: https://cloud.google.com/bigquery/
- Google Cloud Storage: https://cloud.google.com/storage/
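If you prefer to trigger authentication yourself rather than waiting for the prompt, recent versions of bigrquery provide bq_auth(); a minimal sketch, where the email address is just a placeholder for your own account:
library(bigrquery)
# start the OAuth flow for a specific Google account
# (omit the email argument to choose interactively)
bq_auth(email = "your.name@gmail.com")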
We'll start by accessing one of the public datasets hosted on Google BigQuery: the same Hacker News dataset we analyzed locally before.
library(bigrquery)
project <- "big-data-for-social-scientists" # put your projectID here
sql <- 'SELECT COUNT(*) AS story_count FROM [bigquery-public-data:hacker_news.stories]'
# bigquery-public-data: a project hosting publicly available datasets
# hacker_news: a dataset with posts from the social news site Hacker News
# stories: the table we worked with before
# Execute the request (the bracketed table name is legacy SQL syntax)
query_exec(sql, project = project, use_legacy_sql = TRUE)
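Note that query_exec() belongs to the older bigrquery interface; newer releases also expose a DBI backend. A sketch of the same count through DBI, assuming a recent bigrquery where the connection's dataset becomes the default for unqualified table names:
library(DBI)
con <- dbConnect(bigrquery::bigquery(),
                 project = "bigquery-public-data",
                 dataset = "hacker_news",
                 billing = project)  # queries are billed to your own project
dbGetQuery(con, "SELECT COUNT(*) AS story_count FROM stories")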
I also stored part of the dataset we analyzed before in my own Google Cloud Storage (and linked it in Google BigQuery).
library(bigrquery)
project <- "big-data-for-social-scientists" # put your projectID here
sql <- 'SELECT score, time
FROM [big-data-for-social-scientists:hacker_news.stories_table]
LIMIT 10'
# hacker_news is the dataset
# stories_table is the table we worked with before
# Execute the request
query_exec(sql, project = project, use_legacy_sql = TRUE)
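The same query can also be written in standard SQL (dot-separated, backtick-quoted table names) with the newer bq_project_query() / bq_table_download() pair; a sketch, assuming the linked table sits under your own project as above:
tb <- bq_project_query(project,
  "SELECT score, time
  FROM `big-data-for-social-scientists.hacker_news.stories_table`
  LIMIT 10")
bq_table_download(tb)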
11.7.1 More advanced queries
Now that we're familiar with Google BigQuery, let's play with a massive dataset: a table containing all trips completed in Yellow and Green taxis in New York City from 2009 to the present. You can find more information here.
This is one of the many publicly available Google BigQuery tables, and you can see some example queries here. For a complete list of public datasets, see here. Note that when you use these datasets, storage is free, but you still pay for the queries.
Let’s connect with this database and see how big it is:
library(bigrquery)
project <- "big-data-for-social-scientists"
get_table(project="nyc-tlc",
dataset="yellow",
table="trips")
# how many taxi trips in this table?
query_exec(
"SELECT COUNT(*) AS count
FROM [nyc-tlc:yellow.trips]",
project = project, use_legacy_sql = TRUE)
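You can also read the row count straight from the table's metadata instead of running a query; a sketch using bigrquery's newer table helpers:
# row count taken from the table's metadata
bq_table_nrow(bq_table("nyc-tlc", "yellow", "trips"))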
Not bad! What is the distribution of trips by year and by month?
# number of trips per year?
query_exec(
"SELECT YEAR(pickup_datetime) AS year,
COUNT(*) AS trips
FROM [nyc-tlc:yellow.trips]
GROUP BY year
ORDER BY year",
project=project, use_legacy_sql = TRUE)
# number of trips per month?
query_exec(
"SELECT MONTH(pickup_datetime) AS month,
COUNT(*) AS trips
FROM [nyc-tlc:yellow.trips]
GROUP BY month
ORDER BY month",
project=project, use_legacy_sql = TRUE)
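A plot often makes these distributions easier to read. A sketch that stores the per-year counts and draws a simple barplot, assuming the query above runs as-is:
res <- query_exec(
"SELECT YEAR(pickup_datetime) AS year,
COUNT(*) AS trips
FROM [nyc-tlc:yellow.trips]
GROUP BY year
ORDER BY year",
project = project, use_legacy_sql = TRUE)
barplot(res$trips, names.arg = res$year, xlab = "Year", ylab = "Trips")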
How would we compute the average speed?
# First, let's compute distance and duration separately
query_exec(
"SELECT AVG(trip_distance) AS avg_distance_miles,
AVG((dropoff_datetime-pickup_datetime)/1000000/60/60) AS avg_duration_hours
FROM [nyc-tlc:yellow.trips]",
project = project, use_legacy_sql = TRUE)
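In legacy SQL, subtracting two TIMESTAMP values gives the difference in microseconds, which is why the query divides by 1,000,000 and then by 60 twice to get hours. A quick sanity check of that arithmetic in R, using a hypothetical 30-minute trip:
trip_us <- 30 * 60 * 1e6        # 30 minutes expressed in microseconds
trip_us / 1000000 / 60 / 60     # converts back to 0.5 hours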
# Now we can compute it in a single step
query_exec(
"SELECT AVG(trip_distance /
((dropoff_datetime-pickup_datetime)/1000000/60/60 ))
AS avg_speed
FROM [nyc-tlc:yellow.trips]",
project = project, use_legacy_sql = TRUE)
# but there might be some outliers (e.g., zero-distance trips or bad
# timestamps); let's filter them out:
query_exec(
"SELECT AVG(
trip_distance /
( (dropoff_datetime-pickup_datetime)/1000000/60/60 ) )
AS avg_speed
FROM [nyc-tlc:yellow.trips]
WHERE
trip_distance > 0
AND fare_amount/trip_distance BETWEEN 2 AND 10
AND dropoff_datetime > pickup_datetime",
project = project, use_legacy_sql = TRUE)
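For reference, the same filtered average can be written in standard SQL, where TIMESTAMP_DIFF() replaces the raw microsecond arithmetic; a sketch, assuming the table's standard-SQL name:
query_exec(
"SELECT AVG(
trip_distance /
(TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, SECOND) / 3600) )
AS avg_speed
FROM `nyc-tlc.yellow.trips`
WHERE
trip_distance > 0
AND fare_amount/trip_distance BETWEEN 2 AND 10
AND dropoff_datetime > pickup_datetime",
project = project, use_legacy_sql = FALSE)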
And just like with our SQL queries earlier, we can compute averages over groups.
# average number of passengers depending on the hour of day?
query_exec(
"SELECT HOUR(pickup_datetime) AS hour,
AVG(passenger_count) AS passengers_avg
FROM [nyc-tlc:yellow.trips]
GROUP BY hour
ORDER BY hour",
project=project, use_legacy_sql = TRUE)
# average duration per hour of day?
(res <- query_exec(
"SELECT
HOUR(pickup_datetime) AS hour,
COUNT(*) AS count,
AVG( (dropoff_datetime-pickup_datetime)/1000000/60 ) AS duration_minutes
FROM [nyc-tlc:yellow.trips]
WHERE
trip_distance > 0
AND fare_amount/trip_distance BETWEEN 2 AND 10
AND dropoff_datetime > pickup_datetime
GROUP BY hour
ORDER BY hour",
project=project, use_legacy_sql = TRUE))
plot(res$hour, res$duration_minutes, type = "l", xlab = "Hour of day", ylab = "Avg duration (minutes)")
# average duration by day of the week?
(res <- query_exec(
"SELECT
DAYOFWEEK(pickup_datetime) AS day,
COUNT(*) AS count,
AVG( (dropoff_datetime-pickup_datetime)/1000000/60 ) AS duration_minutes
FROM [nyc-tlc:yellow.trips]
WHERE
trip_distance > 0
AND fare_amount/trip_distance BETWEEN 2 AND 10
AND dropoff_datetime > pickup_datetime
GROUP BY day
ORDER BY day",
project=project, use_legacy_sql = TRUE))
plot(res$day, res$duration_minutes, type="l")
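Legacy SQL's DAYOFWEEK() returns 1 (Sunday) through 7 (Saturday), so the plot is easier to read with weekday names on the x-axis; a sketch reusing res from the query above:
plot(res$day, res$duration_minutes, type = "l", xaxt = "n",
     xlab = "Day of week", ylab = "Avg duration (minutes)")
axis(1, at = 1:7, labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))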
# average speed by day of week?
query_exec(
"SELECT
DAYOFWEEK(pickup_datetime) AS day,
COUNT(*) AS count,
AVG(
trip_distance /
( (dropoff_datetime-pickup_datetime)/1000000/60/60 ) )
AS avg_speed
FROM [nyc-tlc:yellow.trips]
WHERE
trip_distance > 0
AND fare_amount/trip_distance BETWEEN 2 AND 10
AND dropoff_datetime > pickup_datetime
GROUP BY day
ORDER BY day",
project=project, use_legacy_sql = TRUE)
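Finally, bigrquery also integrates with dplyr (via dbplyr), which writes the SQL for you; a minimal sketch of a simple aggregation over the same table, assuming the DBI connection style shown earlier:
library(DBI)
library(dplyr)
con <- dbConnect(bigrquery::bigquery(),
                 project = "nyc-tlc",
                 dataset = "yellow",
                 billing = project)
# dbplyr translates this pipeline into a single BigQuery query
tbl(con, "trips") %>%
  summarise(count = n(),
            avg_passengers = mean(passenger_count, na.rm = TRUE))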