Exercise: SQL database
- Use a new R or Rmarkdown script and call it ‘working-with-sql-databases.R’ (or .Rmd). Store all the code that you use now in this file.
- Start to code and set up a SQL database in your working directory called ‘tweets-sentiment-db.sqlite’.
- Download the file ‘tweets-sentiment140.csv’ from the course data folder.
- Read about this dataset here: https://www.kaggle.com/kazanova/sentiment140
- Load the dataset from this file
- Tip:
data <- readr::read_csv("tweets-sentiment140.csv")
(adapt path in case you have another working directory!)
- Create a database called
tweets-sentiment-db
and the data to the a table called table_tweets
.
- Check whether the dataset has really become part of your database: List the tables in your database and list the fields of the corresponding table.
- Disconnect from your database and reconnect to the database again.
- Now run different queries.
- How many rows are there in the table “table_tweets”?
- How many rows are there if you do the same but group by the first column/the sentiment score (= target column)?
- Pull out a random sample of 100 rows and calculate the mean sentiment score. Compare that to the mean sentiment score of a random sample of 5000 rows.
- Aggregate/group by username, calculate the average sentiment per username and order the table as to see which are the 20 users with the worst average sentiment score across their tweets.