3.14 Exercise: SQL database

  1. Use a new R or Rmarkdown script and call it ‘working-with-sql-databases.R’ (or .Rmd). Store all the code that you use now in this file.
  2. Start to code and set up a SQL database in your working directory called ‘tweets-sentiment-db.sqlite’.
  3. Download the file ‘tweets-sentiment140.csv’ from the course data folder.
  4. Read about this dataset here: https://www.kaggle.com/kazanova/sentiment140
  5. Load the dataset from this file
    • Tip: data <- readr::read_csv("tweets-sentiment140.csv") (adapt path in case you have another working directory!)
  6. Create a database called tweets-sentiment-db and the data to the a table called table_tweets.
  7. Check whether the dataset has really become part of your database: List the tables in your database and list the fields of the corresponding table.
  8. Disconnect from your database and reconnect to the database again.
  9. Now run different queries.
    1. How many rows are there in the table “table_tweets”?
    2. How many rows are there if you do the same but group by the first column/the sentiment score (= target column)?
    3. Pull out a random sample of 100 rows and calculate the mean sentiment score. Compare that to the mean sentiment score of a random sample of 5000 rows.
    4. Aggregate/group by username, calculate the average sentiment per username and order the table as to see which are the 20 users with the worst average sentiment score across their tweets.