3.13 Lab: Working with a SQL database

3.13.1 Creating an SQL database

  • COVID: Problem.. screen has not enough surface to watch and to really work in parallel in our lab…

For this first part of the class, we’ll be working with a SQLite database, which is self-contained in a file within our hard drive, without any need to set up a server. The dataset we will work with comes from Twitter (tweets by politicians). It contains various pieces of information such as the twitter handle screen_name, and id for the person user_id_str and for the tweet id_str, the single tweets stores in text, the time when the tweet was created created_at etc.

I will send you a link (by email) to a folder where your can download two datafiles data_tweets_en.csv and data_tweets_de.csv. For simplicity we’ll work with small datafiles here. Make sure to store them in your working directory. We’ll start by connecting to the database and adding the first table, with information on the tweets.

Note that we have used dbWriteTable to add the data frame to our database, and then check that it all worked with dbListFields and dbGetQuery. As we will see in a second, this second function is what we will use to query the database, and it will always return a table.

We now switch to the second dataset data_tweets_de.csv and also create a table in our database for this dataset simply repeating the steps above.

The two files are sufficiently small to load them into our memory. However, imagine together they would exceed your RAM (e.g. > 8 GB). Then we would need to add them to the database in a sequence, one by one (in sizes that we can still open in R).

Below we sequentially add our two datafiles (we have them still in memory) to a single table in the database (if you have many files use a loop!).

3.13.2 Querying an SQL database

Now that we have our tables in the database, let’s see how we can query them. First we connect using dbConnect and then we type our query with dbGetQuery. Generally, we will write the query clauses in capital letters and the column names in lowercase.

This is the list of query clauses that we will use below (you always need SELECT and FROM):

  • SELECT: determines which columns to include in the query’s result set

  • FROM: identifies the tables from which to draw data and how tables should be joined

  • LIMIT: in combination with SELECT, limits the number of records returned

  • WHERE: filter out unwanted data

  • GROUP BY: groups rows together by common column values

  • ORDER BY: sorts the rows in the final result set by one or more columns

Let’s start with some examples of SELECT:

Now some examples of how we can use WHERE to subset the database based on values of specific columns:

# selecting based on values of a column
dbGetQuery(db, "SELECT screen_name, text, tweet_created_at
           FROM table_tweets_en
           WHERE screen_name = 'SenatorEMurphy'
           LIMIT 10")

dbGetQuery(db, "SELECT screen_name, text, tweet_created_at
           FROM table_tweets_en
           WHERE screen_name = 'SenatorEMurphy'")

# greater than, working with dates
dbGetQuery(db, "SELECT screen_name, text, tweet_created_at
           FROM table_tweets_en
           WHERE tweet_created_at > '2020-04-27'")

# AND operator
dbGetQuery(db, "SELECT screen_name, text, tweet_created_at, retweet_count
           FROM table_tweets_en
           WHERE tweet_created_at > '2020-04-20' AND screen_name == 'SenatorEMurphy'
           LIMIT 10")

# OR operator
dbGetQuery(db, "SELECT screen_name, text, retweet_count
           FROM table_tweets_en
           WHERE retweet_count>100 AND (screen_name = 'SenatorEMurphy' OR screen_name = 'Rees4Neath')
           LIMIT 100")

# membership, IN
dbGetQuery(db, "SELECT  screen_name, tweet_created_at, text
           FROM table_tweets_en
           WHERE screen_name IN ('SenatorEMurphy', 'Rees4Neath')
           LIMIT 10")

# matching conditions with LIKE:
# _ matches exactly one character
# % matches any number of characters
dbGetQuery(db, "SELECT  screen_name, tweet_created_at_date, text
           FROM table_tweets_en
           WHERE tweet_created_at_date LIKE '2013-01-__'
           LIMIT 10")

# Exclude retweets (they start with 'RT ')
dbGetQuery(db, "SELECT  screen_name, tweet_created_at_date, text
           FROM table_tweets_en
           WHERE text NOT LIKE 'RT %'
           LIMIT 10")


dbGetQuery(db, "SELECT screen_name, tweet_created_at_date, text
           FROM table_tweets_en
           WHERE tweet_created_at_date LIKE '2013-03%'
           LIMIT 10")

# SQLite does not have regular expressions, but we can get creative...
dbGetQuery(db, "SELECT  screen_name, tweet_created_at, text, screen_name
           FROM table_tweets_en
           WHERE text LIKE '%Facebook%'
           LIMIT 5")

Sometimes we might just want to pull out one variable to calculate its mean:

Q: Why is it useful to draw a random sample as opposed to analyzing the full population? (Tip: Statistical inference)

Very useful for us – GROUP BY allows us to aggregate results based on values of a specific variables. We will come back to this later, so don’t worry about it for now.

And we can then order the results using ORDER BY. The default is alphabetical order for the first column, but we can change that:

3.13.3 Querying multiple SQL tables

Now’s where we get to see the magic of SQL. It makes it really easy to combine multiple SQL tables using JOIN. We just select columns from different tables and then specify what key should be used to match them using ON.

The example belows shows an inner join, but there are other types (left vs right outer joins, cross joins…) that we will not cover in this workshop.

Once we know how to merge tables, we can use columns from both tables. For example, let’s say we want to find the tweets that got the highest retweet_count (beware we have retweets included here). Here’s how we would do it:

3.13.4 Grouping and aggregating

SQL makes it really easy to compute aggregate statistics of one column based on values of a different column, with a series of built-in functions (COUNT, AVG, MAX, MIN…). Let’s now see a few examples of this type of queries:

Combining everything we’ve learned, we can for example count the total number of posts by screen_name, or by screen_name and retweet_count.

Here’s another example. Let’s say we want to find out which user (screen_name) wrote the most tweets (and add the sum of favorites given by other users). How would we do this?

There are other functions besides just COUNT. We can also compute averages, the maximum and minimum of a column… We can also use functions like SUBSTR to extract parts of a character string.