11.8 Exercise 10: Setting up & querying Google BigQuery

  1. Start by setting up accounts on Google BigQuery and Google Cloud Storage.
    • Normally you just need a Google account and have to set up a billing plan.
  2. Earlier, you analyzed the dataset containing the tweets coded according to sentiment. Read about this dataset here: https://www.kaggle.com/kazanova/sentiment140.
  3. Upload this dataset, i.e. the corresponding .csv file (“tweets-sentiment140.csv”), to your Google Cloud Storage: https://console.cloud.google.com/
    • First, create a project called “my-big-data-analysis”.
    • Second, create a Cloud Storage bucket called ‘my-big-data-bucket’ (options: ‘Regional’, Location: ‘europe-west1’).
    • Upload the data file “tweets-sentiment140.csv” to the bucket ‘my-big-data-bucket’.
  4. Then go to Tools -> BigQuery on the left and add the data (see “+ ADD DATA”).
    • When adding, choose ‘my-big-data-analysis’ and pin the project. Then click on “CREATE DATASET” and name the dataset “tweets_sentiment”.
  5. Then add a table to the dataset, importing it from your Cloud Storage bucket (a SQL alternative to these console steps is sketched at the end of this step).
    • Choose “Create table from: Google Cloud Storage”.
    • Browse to your bucket and choose the “tweets-sentiment140.csv” file.
    • Choose file format “CSV”.
    • Write in table name “table_tweets”.
    • Choose “Auto detect” for the schema.
    • Click “Create table” (it will take some time).
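    • As an alternative to the console steps 4 and 5, the table can also be created with a single SQL statement. This is a minimal sketch, assuming the dataset “tweets_sentiment” already exists and that BigQuery’s LOAD DATA statement is available in your project; with no explicit schema given, the schema is auto-detected, as in the console workflow above.

      -- Load the CSV from the bucket into a new table
      -- (created if it does not exist); the schema is
      -- auto-detected because none is specified.
      LOAD DATA INTO tweets_sentiment.table_tweets
      FROM FILES (
        format = 'CSV',
        uris = ['gs://my-big-data-bucket/tweets-sentiment140.csv']
      );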
  6. Now access your stored data remotely and run different queries. (Tip: in standard SQL, SELECT * FROM `my-big-data-analysis.tweets_sentiment.table_tweets`; the bracket notation [project.dataset.table] belongs to legacy SQL.) Sketches of possible queries for the five questions follow the list.
    1. How many rows are there in the table “table_tweets”?
    2. How many rows are there if you group by the first column, i.e. the sentiment score?
    3. Pull out a random sample of 100 rows and calculate the mean sentiment score. Compare that to the mean sentiment score of a random sample of 5000 rows.
    4. Aggregate/group by username, calculate the average sentiment per username, and order the table so that you can see which are the 20 users with the worst average score across their tweets.
    5. Pull out a random sample of 50000 rows and store them locally in a .csv file.
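    Below are sketches of possible queries for these questions. They assume standard SQL and the auto-detected column names of a headerless CSV; BigQuery then assigns generic names, e.g. ‘int64_field_0’ for the first column (the sentiment score) and ‘string_field_4’ for the fifth column (the username), so substitute the names your table actually received. For questions 1 and 2:

      -- 1. Total number of rows in the table.
      SELECT COUNT(*) AS n_rows
      FROM `my-big-data-analysis.tweets_sentiment.table_tweets`;

      -- 2. Number of rows per sentiment score (first column).
      SELECT int64_field_0 AS sentiment, COUNT(*) AS n_rows
      FROM `my-big-data-analysis.tweets_sentiment.table_tweets`
      GROUP BY sentiment;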
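    For question 3, one simple way to draw a (pseudo-)random sample is to order by RAND() and limit the result; note that this still scans the whole table:

      -- 3. Mean sentiment of a random sample of 100 rows;
      --    change LIMIT 100 to LIMIT 5000 for the larger sample.
      SELECT AVG(sentiment) AS mean_sentiment
      FROM (
        SELECT int64_field_0 AS sentiment
        FROM `my-big-data-analysis.tweets_sentiment.table_tweets`
        ORDER BY RAND()
        LIMIT 100
      ) AS sample;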
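    For question 4, group by the username column and sort in ascending order, so that the lowest (worst) averages come first:

      -- 4. The 20 users with the worst average sentiment.
      SELECT string_field_4 AS username,
             AVG(int64_field_0) AS avg_sentiment,
             COUNT(*) AS n_tweets
      FROM `my-big-data-analysis.tweets_sentiment.table_tweets`
      GROUP BY username
      ORDER BY avg_sentiment ASC
      LIMIT 20;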
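    For question 5, the sample is drawn as in question 3; the result can then be stored locally by clicking “SAVE RESULTS” above the result pane and choosing CSV (or Google Drive if the file is too large for a direct download):

      -- 5. Random sample of 50000 rows, to be saved as a
      --    .csv file via the console's "SAVE RESULTS" option.
      SELECT *
      FROM `my-big-data-analysis.tweets_sentiment.table_tweets`
      ORDER BY RAND()
      LIMIT 50000;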