Exercise 10: Setting up & querying Google BigQuery
- Start by setting up accounts on Google BigQuery and Google Cloud Storage.
- Normally you just need a Google account and have to set up billing.
- Earlier you analyzed the dataset containing tweets coded according to sentiment. Read about this dataset here: https://www.kaggle.com/kazanova/sentiment140.
- Upload this dataset, i.e. the corresponding .csv file (“tweets-sentiment140.csv”), to your Google Cloud Storage: https://console.cloud.google.com/
- First create a project called “my-big-data-analysis”.
- Second create a cloud storage bucket called ‘my-big-data-bucket’ (options: ‘Regional’, Location: ‘europe-west1’).
- Upload the data file “tweets-sentiment140.csv” to the bucket ‘my-big-data-bucket’.
- Then go to Tools -> BigQuery on the left and add the data (see “+ ADD DATA”).
- When adding, choose ‘my-big-data-analysis’ and pin the project. Then click on “CREATE DATASET” and name the dataset “tweets_sentiment”.
- Then add a table in the dataset, taking the table from your cloud storage.
- Choose “Create table from: Google Cloud Storage”.
- Browse to your bucket and choose the “tweets-sentiment140.csv” file.
- Choose file format “CSV”.
- Write in table name “table_tweets”.
- For the schema, choose “Auto detect”.
- Click “Create table” (it will take some time).
- Now access your stored data remotely and run different queries (tip, in standard SQL:
SELECT * FROM `my-big-data-analysis.tweets_sentiment.table_tweets`
). Possible query sketches for the tasks below are collected after this list.
- How many rows are there in the table “table_tweets”?
- How many rows are there in each group if you group by the first column, i.e. the sentiment score?
- Pull out a random sample of 100 rows and calculate the mean sentiment score. Compare that to the mean sentiment score of a random sample of 5000 rows.
- Aggregate/group by username, calculate the average sentiment per username, and order the table so that you can see which 20 users have the worst average score across their tweets.
- Pull out a random sample of 50,000 rows and store them locally in a .csv file (a possible Python route is sketched after this list).
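Below are sketches of possible queries for the tasks above, written in standard SQL. One assumption throughout: because the CSV has no header row, the auto-detected column names are generic; the sketches use int64_field_0 for the sentiment score (first column) and string_field_4 for the username, but check the schema of “table_tweets” and substitute the actual names. For the total row count and the per-sentiment counts, something along these lines should work:

    -- Total number of rows in the table.
    SELECT COUNT(*) AS n_rows
    FROM `my-big-data-analysis.tweets_sentiment.table_tweets`;

    -- Number of rows per sentiment score (first column); int64_field_0 is an assumed name.
    SELECT int64_field_0 AS sentiment, COUNT(*) AS n_rows
    FROM `my-big-data-analysis.tweets_sentiment.table_tweets`
    GROUP BY sentiment
    ORDER BY sentiment;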
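For the random samples, one simple (if not the most efficient) approach is to order by RAND() and limit the result; run it once with LIMIT 100 and once with LIMIT 5000 and compare the two means:

    -- Mean sentiment of a random sample of 100 rows; change 100 to 5000 for the second sample.
    SELECT AVG(int64_field_0) AS mean_sentiment
    FROM (
      SELECT int64_field_0
      FROM `my-big-data-analysis.tweets_sentiment.table_tweets`
      ORDER BY RAND()
      LIMIT 100
    );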
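For the per-user aggregation, grouping by the username column and sorting ascending by the average gives the 20 users with the worst average score:

    -- Average sentiment per user, worst 20 first; string_field_4 is an assumed column name.
    SELECT string_field_4 AS username,
           AVG(int64_field_0) AS avg_sentiment,
           COUNT(*) AS n_tweets
    FROM `my-big-data-analysis.tweets_sentiment.table_tweets`
    GROUP BY username
    ORDER BY avg_sentiment ASC
    LIMIT 20;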
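For storing a 50,000-row random sample locally, plain SQL is not enough, since the query runs remotely. One possible route is the google-cloud-bigquery Python client together with pandas, sketched below; the file name and the use of a DataFrame are illustrative choices, and the client library, pandas/pyarrow, and credentials for the project have to be set up first.

    # Sketch: pull a 50,000-row random sample and save it as a local CSV.
    # Assumes `pip install google-cloud-bigquery pandas pyarrow` and application default
    # credentials for the project (e.g. via `gcloud auth application-default login`).
    from google.cloud import bigquery

    client = bigquery.Client(project="my-big-data-analysis")

    sql = """
    SELECT *
    FROM `my-big-data-analysis.tweets_sentiment.table_tweets`
    ORDER BY RAND()
    LIMIT 50000
    """

    # Run the query remotely, fetch the result into a pandas DataFrame, write it to disk.
    df = client.query(sql).to_dataframe()
    df.to_csv("tweets_sample_50000.csv", index=False)  # hypothetical output file name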