3.18 Lab: Google BigQuery Authentication

To see how to set things up, let's first have a quick look at the Google BigQuery user interface.

The package we use to access BigQuery is called bigrquery; you can find a quick overview on its GitHub page.

To do anything on the platform we first have to authenticate. For this we need the key for our service account, which we store on our hard drive in a JSON file. Then we can use the bq_auth() function to authenticate (it will open a browser window so you can grant the permissions). Alternatively, we could use an API key.
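A minimal sketch of both routes (the key-file path below is a placeholder for wherever you stored your own service-account JSON):

```r
library(bigrquery)

# Option 1: authenticate with a service-account key stored as a JSON file
# (replace the path with the location of your own key file)
bq_auth(path = "service-account-key.json")

# Option 2: interactive OAuth -- opens a browser window where you
# grant the permissions
# bq_auth()
```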

  • See this document on how to set up Google Cloud for using your research credits and how to set up an API key.

Billing account/project: Importantly, whenever you interact with the Google Cloud, you have to indicate which project in your Google account you are working in, in case you have to pay for any services. Below we always indicate the project/billing account by defining billing = "css-seminar-2021". Make sure that the corresponding name (below it is css-seminar-2021) is changed to the name of the project in your account. In your case it could be different, e.g., css-seminar-2021-487534.
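For example (css-seminar-2021 is the course project; substitute the project ID from your own account):

```r
# Project used for billing; replace with the project ID
# from your own Google Cloud account
billing <- "css-seminar-2021"
```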

Also, for some of the steps below (e.g., to see some of the links), you have to be logged into your account.

In addition, the BigQuery API has to be enabled (you should be able to do that here).

Subsequently, we test whether authentication worked by accessing one of the BigQuery public datasets using an SQL query.

Running queries on public datasets
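A minimal test might look as follows; the query below assumes the usa_names public dataset, but any public table works:

```r
library(bigrquery)
billing <- "css-seminar-2021"  # replace with your project ID

# Query a public dataset; if this returns data, authentication worked
sql <- "SELECT name, SUM(number) AS n
        FROM `bigquery-public-data.usa_names.usa_1910_current`
        GROUP BY name
        ORDER BY n DESC
        LIMIT 5"
bq_table_download(bq_project_query(billing, sql))
```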

Now, let's play with a massive dataset: a table that contains all trips completed in Yellow and Green taxis in New York City from 2009 to the present. You can find more information here.

This is one of the many publicly available Google BigQuery tables (some exemplary queries). For a complete list of public datasets, see here. Note that when you use these datasets, storage is free, but you still have to pay for the queries.

For any connection we have to define a certain set of parameters: billing, project, and dataset.
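With DBI, a connection sketch could look like this (the dataset name below is an assumption; adjust it to the public taxi dataset you want to query):

```r
library(DBI)
library(bigrquery)

con <- dbConnect(
  bigrquery::bigquery(),
  project = "bigquery-public-data",  # project that hosts the data
  dataset = "new_york_taxi_trips",   # dataset within that project (assumed name)
  billing = "css-seminar-2021"       # your own project, used for billing
)
```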

Let’s connect with this database/table and see how big it is:
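One way to sketch this (the table name is an assumption; pick the yellow- or green-taxi table you actually want):

```r
library(bigrquery)
billing <- "css-seminar-2021"  # replace with your project ID

# Count the rows in one of the yellow-taxi tables
sql <- "SELECT COUNT(*) AS n_trips
        FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`"
bq_table_download(bq_project_query(billing, sql))
```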

Not bad! What is the distribution of trips by year and by month?
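A sketch of such a query, assuming the table and column names below (adjust them to the table you connected to):

```r
library(bigrquery)
billing <- "css-seminar-2021"  # replace with your project ID

# Trips per year and month, extracted from the pickup timestamp
sql <- "SELECT EXTRACT(YEAR FROM pickup_datetime) AS year,
               EXTRACT(MONTH FROM pickup_datetime) AS month,
               COUNT(*) AS n_trips
        FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
        GROUP BY year, month
        ORDER BY year, month"
bq_table_download(bq_project_query(billing, sql))
```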

How would we compute the average speed?
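One possible sketch: divide trip distance by trip duration in hours (this assumes trip_distance is in miles and that the pickup/dropoff columns are TIMESTAMPs; the WHERE clause drops zero-duration trips to avoid division by zero):

```r
library(bigrquery)
billing <- "css-seminar-2021"  # replace with your project ID

# Average speed in miles per hour: distance / duration in hours
sql <- "SELECT AVG(trip_distance /
                  (TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, SECOND) / 3600))
               AS avg_mph
        FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
        WHERE TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, SECOND) > 0"
bq_table_download(bq_project_query(billing, sql))
```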

And just like with our SQL queries earlier, we can compute averages over groups.
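For instance, average trip distance by passenger count (table and column names are assumptions):

```r
library(bigrquery)
billing <- "css-seminar-2021"  # replace with your project ID

# Average trip distance grouped by passenger count
sql <- "SELECT passenger_count,
               AVG(trip_distance) AS avg_distance
        FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
        GROUP BY passenger_count
        ORDER BY passenger_count"
bq_table_download(bq_project_query(billing, sql))
```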

Another public dataset contains a large collection of Hacker News stories.

Uploading data to BigQuery

Make sure you authenticate to be able to connect to BigQuery.

First, we have to connect to our BigQuery database and create a dataset in which we can store tables (we authenticated above).
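A sketch of the dataset creation (project and dataset names are placeholders; use your own):

```r
library(bigrquery)

# Create a dataset in our own project to hold uploaded tables
ds <- bq_dataset("css-seminar-2021", "my_dataset")
bq_dataset_create(ds)
```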

Then we can upload a table to that dataset. To start, we’ll generate a very small dataset that contains simulated data.
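For example (this assumes a dataset named my_dataset already exists in your project; all names are placeholders):

```r
library(bigrquery)

# Simulate a tiny data frame
set.seed(123)
df <- data.frame(id = 1:5, value = rnorm(5))

# Upload it as a new table in the dataset
tbl <- bq_table("css-seminar-2021", "my_dataset", "simulated_data")
bq_table_upload(tbl, df)
```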