Exercise: Setting up & querying Google BigQuery
- Start by setting up accounts on Google BigQuery and Google Cloud Storage.
- Normally you just need a Google account and to set up a billing plan.
- Earlier you analyzed the dataset containing tweets coded according to sentiment. Read about this dataset here: https://www.kaggle.com/kazanova/sentiment140.
- Download the dataset from the course data folder (“tweets-sentiment140.csv”).
- Import the dataset, take a subsample of 10000 tweets from it, and store the subsample in a new object called “data” (see Sketch 1 after this list).
- Then upload the subsample to BigQuery, into your project “css-seminar-2021”.
- In doing so, create a new dataset called “data_tweets” and store the data in a table called “table_tweets” (Sketch 2 below).
- Now you can access your stored data remotely and run different queries. (Tip, in standard SQL:
SELECT * FROM `css-seminar-2021.data_tweets.table_tweets`
)
- How many rows are there in the table “table_tweets”? (Sketch 3 below)
- How many rows are there per group if you group by the first column, i.e. the sentiment score? (Sketch 4 below)
- Pull out a random sample of 50 rows and calculate the mean sentiment score. Compare it to the mean sentiment score of a random sample of 2000 rows (Sketch 5 below).
- Aggregate/group by username, calculate the average sentiment per username, and order the table so that you see the 20 users with the worst average score across their tweets (Sketch 6 below).
- Pull out a random sample of 5000 rows and store it locally in a .csv file (Sketch 7 below).
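
Sketches for the steps above. These are hedged starting points rather than definitive solutions; names, seeds, and client libraries are assumptions you may need to adapt.

Sketch 1 — import and subsample. A minimal Python sketch assuming pandas is installed and that the file follows the raw Sentiment140 layout described on the Kaggle page (no header row; columns sentiment, id, date, flag, user, text). The latin-1 encoding, the column names, and the random seed are my own choices:

    import pandas as pd

    # Sentiment140 ships without a header row; these column names follow
    # the Kaggle description of the dataset.
    cols = ["sentiment", "id", "date", "flag", "user", "text"]
    tweets = pd.read_csv("tweets-sentiment140.csv", encoding="latin-1",
                         header=None, names=cols)

    # Draw a reproducible subsample of 10000 tweets.
    data = tweets.sample(n=10000, random_state=42)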
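
Sketch 2 — upload to BigQuery. A sketch using the google-cloud-bigquery Python client (an assumption; the web UI or other client libraries work just as well). It assumes you are authenticated for the project “css-seminar-2021” and have pyarrow installed for the DataFrame load:

    from google.cloud import bigquery

    client = bigquery.Client(project="css-seminar-2021")

    # Create the target dataset if it does not exist yet.
    client.create_dataset(bigquery.Dataset("css-seminar-2021.data_tweets"),
                          exists_ok=True)

    # Load the subsample ("data" from Sketch 1) into the target table.
    job = client.load_table_from_dataframe(
        data, "css-seminar-2021.data_tweets.table_tweets")
    job.result()  # block until the load job finishes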
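
Sketch 3 — counting rows (standard SQL):

    SELECT COUNT(*) AS n_rows
    FROM `css-seminar-2021.data_tweets.table_tweets`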
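
Sketch 4 — rows per sentiment score, assuming the first column was named “sentiment” on upload (as in Sketch 1):

    SELECT sentiment, COUNT(*) AS n_rows
    FROM `css-seminar-2021.data_tweets.table_tweets`
    GROUP BY sentiment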
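
Sketch 5 — mean sentiment over a random sample. ORDER BY RAND() draws a fresh sample on every run; rerun with LIMIT 2000 to compare the two means:

    SELECT AVG(sentiment) AS mean_sentiment
    FROM (
      SELECT sentiment
      FROM `css-seminar-2021.data_tweets.table_tweets`
      ORDER BY RAND()
      LIMIT 50
    )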
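
Sketch 6 — the 20 users with the worst average sentiment, assuming the username column was named “user” (as in Sketch 1). In Sentiment140, 0 codes negative and 4 positive sentiment, so ascending order puts the worst scores first:

    SELECT `user`,
           AVG(sentiment) AS avg_sentiment,
           COUNT(*) AS n_tweets
    FROM `css-seminar-2021.data_tweets.table_tweets`
    GROUP BY `user`
    ORDER BY avg_sentiment ASC
    LIMIT 20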
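
Sketch 7 — sample 5000 rows and store them locally, again via the google-cloud-bigquery client (the output filename is a placeholder):

    from google.cloud import bigquery

    client = bigquery.Client(project="css-seminar-2021")

    # Pull 5000 random rows into a DataFrame ...
    sample = client.query("""
        SELECT *
        FROM `css-seminar-2021.data_tweets.table_tweets`
        ORDER BY RAND()
        LIMIT 5000
    """).to_dataframe()

    # ... and write them to a local .csv file.
    sample.to_csv("tweets-sample-5000.csv", index=False)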