11.7 Lab 10: Setting up & using Google BigQuery
How to set it up…
- Let’s check out the platform
- Set up a BigQuery account (e.g., see this guide) and check out the user interface
- Create a new project
- The project name is important later for access
- I called mine "Big Data for Social Scientists"
- Load the necessary packages
# install.packages("bigrquery")
library(bigrquery)
To illustrate the potential of SQL queries on large-scale databases, we'll now work with tables hosted in BigQuery, Google's service for interactive analysis of massive datasets.
You can read more about how to set up a Google BigQuery database here and about its SQL syntax here.
Set up an account now! Normally, you will be asked to log in automatically once you start using the package; the sketch after the links below shows how to authenticate explicitly.
- Google BigQuery: https://cloud.google.com/bigquery/
- Google Cloud Storage: https://cloud.google.com/storage/
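If you prefer to trigger authentication yourself rather than waiting for the prompt, recent versions of bigrquery provide bq_auth(); a minimal sketch, where the email address is just a placeholder for your own account:
library(bigrquery)
# start the OAuth flow for a specific Google account
# (omit the email argument to choose interactively)
bq_auth(email = "your.name@gmail.com")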
We'll start by accessing one of the public datasets hosted on Google BigQuery: the same Hacker News dataset we analyzed locally before.
library(bigrquery)
project <- "big-data-for-social-scientists" # put your projectID here
sql <- 'SELECT COUNT(*) AS story_count FROM [bigquery-public-data:hacker_news.stories]'
# bigquery-public-data: a project hosting publicly available datasets
# hacker_news: a dataset with posts from the social news site Hacker News
# stories: the table we worked with before
# Execute the request (the bracketed table name is legacy SQL syntax)
query_exec(sql, project = project, use_legacy_sql = TRUE)
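Note that query_exec() belongs to the older bigrquery interface; newer releases also expose a DBI backend. A sketch of the same count through DBI, assuming a recent bigrquery where the connection's dataset becomes the default for unqualified table names:
library(DBI)
con <- dbConnect(bigrquery::bigquery(),
                 project = "bigquery-public-data",
                 dataset = "hacker_news",
                 billing = project)  # queries are billed to your own project
dbGetQuery(con, "SELECT COUNT(*) AS story_count FROM stories")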
I also stored part of the dataset we analyzed before in my own Google Cloud Storage (and linked it in Google BigQuery).
library(bigrquery)
project <- "big-data-for-social-scientists" # put your projectID here
sql <- 'SELECT score, time
FROM [big-data-for-social-scientists:hacker_news.stories_table]
LIMIT 10'
# hacker_news is the dataset
# stories_table is the table we worked with before
# Execute the request
query_exec(sql, project = project, use_legacy_sql = TRUE)
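The same query can also be written in standard SQL (dot-separated, backtick-quoted table names) with the newer bq_project_query() / bq_table_download() pair; a sketch, assuming the linked table sits under your own project as above:
tb <- bq_project_query(project,
  "SELECT score, time
  FROM `big-data-for-social-scientists.hacker_news.stories_table`
  LIMIT 10")
bq_table_download(tb)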
11.7.1 More advanced queries
Now that we're familiar with Google BigQuery, let's play with a massive dataset: a table containing all trips completed in Yellow and Green taxis in New York City from 2009 to the present. You can find more information here.
This is one of the many publicly available Google BigQuery tables, and you can see some example queries here. For a complete list of public datasets, see here. Note that when you use these datasets, storage is free, but you still pay for the queries.
Let’s connect with this database and see how big it is:
library(bigrquery)
project <- "big-data-for-social-scientists"
get_table(project="nyc-tlc",
dataset="yellow",
table="trips")
# how many taxi trips in this table?
query_exec(
"SELECT COUNT(*) AS count
FROM [nyc-tlc:yellow.trips]",
project = project, use_legacy_sql = TRUE)
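You can also read the row count straight from the table's metadata instead of running a query; a sketch using bigrquery's newer table helpers:
# row count taken from the table's metadata
bq_table_nrow(bq_table("nyc-tlc", "yellow", "trips"))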
Not bad! What is the distribution of trips by year and by month?
# number of trips per year?
query_exec(
"SELECT YEAR(pickup_datetime) AS year,
COUNT(*) AS trips
FROM [nyc-tlc:yellow.trips]
GROUP BY year
ORDER BY year",
project=project, use_legacy_sql = TRUE)
# number of trips per month?
query_exec(
"SELECT MONTH(pickup_datetime) AS month,
COUNT(*) AS trips
FROM [nyc-tlc:yellow.trips]
GROUP BY month
ORDER BY month",
project=project, use_legacy_sql = TRUE)
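A plot often makes these distributions easier to read. A sketch that stores the per-year counts and draws a simple barplot, assuming the query above runs as-is:
res <- query_exec(
"SELECT YEAR(pickup_datetime) AS year,
COUNT(*) AS trips
FROM [nyc-tlc:yellow.trips]
GROUP BY year
ORDER BY year",
project = project, use_legacy_sql = TRUE)
barplot(res$trips, names.arg = res$year, xlab = "Year", ylab = "Trips")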
How would we compute the average speed?
# First, let's compute distance and duration separately
query_exec(
"SELECT AVG(trip_distance) AS avg_distance_miles,
AVG((dropoff_datetime-pickup_datetime)/1000000/60/60) AS avg_duration_hours
FROM [nyc-tlc:yellow.trips]",
project = project, use_legacy_sql = TRUE)
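In legacy SQL, subtracting two TIMESTAMP values gives the difference in microseconds, which is why the query divides by 1,000,000 and then by 60 twice to get hours. A quick sanity check of that arithmetic in R, using a hypothetical 30-minute trip:
trip_us <- 30 * 60 * 1e6        # 30 minutes expressed in microseconds
trip_us / 1000000 / 60 / 60     # converts back to 0.5 hours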
# Now we can compute it in a single step
query_exec(
"SELECT AVG(trip_distance /
((dropoff_datetime-pickup_datetime)/1000000/60/60 ))
AS avg_speed
FROM [nyc-tlc:yellow.trips]",
project = project, use_legacy_sql = TRUE)
# but there might be some outliers (e.g., zero-distance trips or bad
# timestamps); let's filter them out:
query_exec(
"SELECT AVG(
trip_distance /
( (dropoff_datetime-pickup_datetime)/1000000/60/60 ) )
AS avg_speed
FROM [nyc-tlc:yellow.trips]
WHERE
trip_distance > 0
AND fare_amount/trip_distance BETWEEN 2 AND 10
AND dropoff_datetime > pickup_datetime",
project = project, use_legacy_sql = TRUE)
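For reference, the same filtered average can be written in standard SQL, where TIMESTAMP_DIFF() replaces the raw microsecond arithmetic; a sketch, assuming the table's standard-SQL name:
query_exec(
"SELECT AVG(
trip_distance /
(TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, SECOND) / 3600) )
AS avg_speed
FROM `nyc-tlc.yellow.trips`
WHERE
trip_distance > 0
AND fare_amount/trip_distance BETWEEN 2 AND 10
AND dropoff_datetime > pickup_datetime",
project = project, use_legacy_sql = FALSE)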
And just like with our SQL queries earlier, we can compute averages over groups.
# average number of passengers depending on the hour of day?
query_exec(
"SELECT HOUR(pickup_datetime) AS hour,
AVG(passenger_count) AS passengers_avg
FROM [nyc-tlc:yellow.trips]
GROUP BY hour
ORDER BY hour",
project=project, use_legacy_sql = TRUE)
# average duration per hour of day?
(res <- query_exec(
"SELECT
HOUR(pickup_datetime) AS hour,
COUNT(*) AS count,
AVG( (dropoff_datetime-pickup_datetime)/1000000/60 ) AS duration_minutes
FROM [nyc-tlc:yellow.trips]
WHERE
trip_distance > 0
AND fare_amount/trip_distance BETWEEN 2 AND 10
AND dropoff_datetime > pickup_datetime
GROUP BY hour
ORDER BY hour",
project=project, use_legacy_sql = TRUE))
plot(res$hour, res$duration_minutes, type = "l", xlab = "Hour of day", ylab = "Avg duration (minutes)")
# average duration by day of the week?
(res <- query_exec(
"SELECT
DAYOFWEEK(pickup_datetime) AS day,
COUNT(*) AS count,
AVG( (dropoff_datetime-pickup_datetime)/1000000/60 ) AS duration_minutes
FROM [nyc-tlc:yellow.trips]
WHERE
trip_distance > 0
AND fare_amount/trip_distance BETWEEN 2 AND 10
AND dropoff_datetime > pickup_datetime
GROUP BY day
ORDER BY day",
project=project, use_legacy_sql = TRUE))
plot(res$day, res$duration_minutes, type="l")
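Legacy SQL's DAYOFWEEK() returns 1 (Sunday) through 7 (Saturday), so the plot is easier to read with weekday names on the x-axis; a sketch reusing res from the query above:
plot(res$day, res$duration_minutes, type = "l", xaxt = "n",
     xlab = "Day of week", ylab = "Avg duration (minutes)")
axis(1, at = 1:7, labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))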
# average speed by day of week?
query_exec(
"SELECT
DAYOFWEEK(pickup_datetime) AS day,
COUNT(*) AS count,
AVG(
trip_distance /
( (dropoff_datetime-pickup_datetime)/1000000/60/60 ) )
AS avg_speed
FROM [nyc-tlc:yellow.trips]
WHERE
trip_distance > 0
AND fare_amount/trip_distance BETWEEN 2 AND 10
AND dropoff_datetime > pickup_datetime
GROUP BY day
ORDER BY day",
project=project, use_legacy_sql = TRUE)
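Finally, bigrquery also integrates with dplyr (via dbplyr), which writes the SQL for you; a minimal sketch of a simple aggregation over the same table, assuming the DBI connection style shown earlier:
library(DBI)
library(dplyr)
con <- dbConnect(bigrquery::bigquery(),
                 project = "nyc-tlc",
                 dataset = "yellow",
                 billing = project)
# dbplyr translates this pipeline into a single BigQuery query
tbl(con, "trips") %>%
  summarise(count = n(),
            avg_passengers = mean(passenger_count, na.rm = TRUE))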