3.19 Exercise: Setting up & querying Google BigQuery

  1. Start by setting up accounts on Google BigQuery and Google Cloud Storage.
    • Normally you just need a Google account and to set up billing.
  2. Earlier you analyzed the dataset containing tweets coded by sentiment. Read about this dataset here: https://www.kaggle.com/kazanova/sentiment140.
  3. Download the dataset from the course data folder (“tweets-sentiment140.csv”).
  4. Import the dataset, take a subsample of 10000 tweets from it, and store the subsample in a new object called “data” (see the sketch below).
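
One way to do this in R, assuming the course file keeps the original Sentiment140 layout (six columns, no header row); the column names and the seed are choices made here, not prescribed by the exercise:

```r
# Read the full file; the original Sentiment140 export has no header row
tweets <- read.csv("tweets-sentiment140.csv",
                   header = FALSE, stringsAsFactors = FALSE)

# Column names following the Kaggle documentation of Sentiment140
names(tweets) <- c("sentiment", "id", "date", "query", "user", "text")

# Draw a reproducible subsample of 10000 tweets
set.seed(123)
data <- tweets[sample(nrow(tweets), 10000), ]
```
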
  5. Then upload the dataset to BigQuery, into your project “css-seminar-2021”.
    • In doing so, create a new dataset called “data_tweets” and store the data in a table called “table_tweets” (a sketch follows below).
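
A minimal sketch of the upload using the bigrquery package (one option for talking to BigQuery from R; the exercise does not mandate a particular package):

```r
library(bigrquery)

# Authenticate with your Google account (opens a browser the first time)
bq_auth()

project <- "css-seminar-2021"

# Create the dataset "data_tweets" in the project
ds <- bq_dataset(project, "data_tweets")
bq_dataset_create(ds)

# Upload the subsample into a new table "table_tweets"
tb <- bq_table(project, "data_tweets", "table_tweets")
bq_table_upload(tb, values = data)
```
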
  6. Now you can access your stored data remotely and run different queries (see the query sketches after this list). (Tip: SELECT * FROM `css-seminar-2021.data_tweets.table_tweets`)
    1. How many rows are there in the table “table_tweets”?
    2. How many rows are there when you group by the sentiment score (the first column)?
    3. Pull out a random sample of 50 rows and calculate the mean sentiment score. Compare that to the mean sentiment score of a random sample of 2000 rows.
    4. Aggregate/group by username, calculate the average sentiment per username, and order the table to see which 20 users have the worst average score across their tweets.
    5. Pull out a random sample of 5000 rows and store them locally in a .csv file.
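
The queries in step 6 could look as follows, again with bigrquery and BigQuery standard SQL; this is one possible solution sketch (the table reference and column names follow the setup above, and `run_query()` is a small convenience wrapper defined here, not a bigrquery function):

```r
library(bigrquery)

project <- "css-seminar-2021"
tbl     <- "`css-seminar-2021.data_tweets.table_tweets`"

# Helper: run a query in the project and download the result
run_query <- function(sql) {
  bq_table_download(bq_project_query(project, sql))
}

# 6.1 Number of rows in the table
run_query(paste("SELECT COUNT(*) AS n FROM", tbl))

# 6.2 Number of rows per sentiment score
run_query(paste("SELECT sentiment, COUNT(*) AS n FROM", tbl,
                "GROUP BY sentiment"))

# 6.3 Mean sentiment in a random sample of 50 vs. 2000 rows
run_query(paste("SELECT AVG(sentiment) AS mean_sentiment FROM",
                "(SELECT sentiment FROM", tbl, "ORDER BY RAND() LIMIT 50)"))
run_query(paste("SELECT AVG(sentiment) AS mean_sentiment FROM",
                "(SELECT sentiment FROM", tbl, "ORDER BY RAND() LIMIT 2000)"))

# 6.4 The 20 users with the worst average sentiment across their tweets
run_query(paste("SELECT user, AVG(sentiment) AS avg_sentiment FROM", tbl,
                "GROUP BY user ORDER BY avg_sentiment ASC LIMIT 20"))

# 6.5 Random sample of 5000 rows, stored locally as a .csv file
sample5000 <- run_query(paste("SELECT * FROM", tbl,
                              "ORDER BY RAND() LIMIT 5000"))
write.csv(sample5000, "tweets_sample_5000.csv", row.names = FALSE)
```

Note that `ORDER BY RAND() LIMIT n` draws a fresh random sample on every run, so the two means in 6.3 will vary; the larger sample should land closer to the table-wide mean.
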