3.13 Lab: Working with an SQL database
3.13.1 Creating an SQL database
- Note (COVID/online teaching): a single screen may not offer enough space to watch the session and really work along in parallel during our lab…
For this first part of the class, we’ll be working with a SQLite database, which is self-contained in a file on our hard drive, without any need to set up a server. The dataset we will work with comes from Twitter (tweets by politicians). It contains various pieces of information such as the Twitter handle (screen_name), identifiers for the person (user_id_str) and for the tweet (id_str), the text of the individual tweets (text), the time when the tweet was created (created_at), etc.
I will send you a link (by email) to a folder where you can download two data files, data_tweets_en.csv and data_tweets_de.csv. For simplicity we’ll work with small data files here. Make sure to store them in your working directory. We’ll start by connecting to the database and adding the first table, with information on the tweets.
# install.packages("pacman")
pacman::p_load(DBI,
               readr,
               dplyr,
               lubridate,
               RSQLite)
# p_load replaces library/require
# Set working directory
# setwd("..")
# In my case:
# setwd("C:/Users/Paul/Google Drive/2-Teaching/2021 Computational Social Science/2021_computational_social_science/data_email")
# Load the data into R (check with str())
# Convert date columns to chr class
data_tweets_en <- read_csv("data_tweets_en.csv") %>%
  mutate(text2 = text,
         text = substr(text, 1, 50),
         tweet_created_at = as.character(tweet_created_at),
         tweet_created_at_date = as.character(tweet_created_at_date))
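# optional check (as suggested above): inspect the imported data with str()
str(data_tweets_en)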
# Create & connect to database: will create a file in our hard drive
# + connect to the db
db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
# adding the first table
dbWriteTable(db, "table_tweets_en", data_tweets_en, overwrite = TRUE)
# testing that it works with a simple query
dbListFields(db, "table_tweets_en")
# a first query
dbGetQuery(db, 'SELECT * FROM table_tweets_en LIMIT 5')
# and we close the connection for now
dbDisconnect(db)
Note that we have used dbWriteTable to add the data frame to our database, and then checked that it all worked with dbListFields and dbGetQuery. As we will see in a second, the latter function is what we will use to query the database, and it will always return a table.
We now switch to the second dataset, data_tweets_de.csv, and also create a table in our database for this dataset by simply repeating the steps above.
data_tweets_de <- read_csv("data_tweets_de.csv") %>%
  mutate(text2 = text,
         text = substr(text, 1, 50),
         tweet_created_at = as.character(tweet_created_at),
         tweet_created_at_date = as.character(tweet_created_at_date))
db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
dbWriteTable(db, "table_tweets_de", data_tweets_de, overwrite = TRUE)
dbListFields(db, "table_tweets_de")
dbGetQuery(db, 'SELECT * FROM table_tweets_de LIMIT 5')
dbListTables(db) # list tables
dbDisconnect(db)
The two files are small enough to load into memory. However, imagine that together they exceeded your RAM (e.g. > 8 GB). Then we would need to add them to the database sequentially, one by one (in chunks that we can still open in R).
Below we sequentially append our two data files (which are still in memory) to a single table in the database (if you have many files, use a loop; see the sketch after the code below!).
# Connect to db
db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
# appending the first data frame
dbWriteTable(db, "table_tweets",
             data_tweets_en,
             append = TRUE)
# appending the second data frame
dbWriteTable(db, "table_tweets",
             data_tweets_de,
             append = TRUE)
# testing that it works
dbListFields(db, "table_tweets")
dbGetQuery(db, 'SELECT * FROM table_tweets LIMIT 5')
# show all tables
dbListTables(db)
# what if we make a mistake and want to remove the table?
# dbRemoveTable(db, "table_tweets_en")
# and we close the connection for now
dbDisconnect(db)
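As mentioned above, with many files the same append pattern can be wrapped in a loop. Here is a minimal sketch, assuming the CSV files follow the naming pattern data_tweets_*.csv, share the columns used above, and go into a placeholder table called table_tweets_loop:
# Sketch: append many CSV files to one table, one file at a time
files <- list.files(pattern = "^data_tweets_.*\\.csv$")
db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
for (f in files) {
  chunk <- read_csv(f) %>%
    mutate(text2 = text,
           text = substr(text, 1, 50),
           tweet_created_at = as.character(tweet_created_at),
           tweet_created_at_date = as.character(tweet_created_at_date))
  dbWriteTable(db, "table_tweets_loop", chunk, append = TRUE)
}
dbDisconnect(db)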
3.13.2 Querying an SQL database
Now that we have our tables in the database, let’s see how we can query them. First we connect using dbConnect and then we type our query with dbGetQuery. Generally, we will write the query clauses in capital letters and the column names in lowercase.
db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
test <- dbGetQuery(db, 'SELECT *
FROM table_tweets_en
LIMIT 5')
str(test)
This is the list of query clauses that we will use below (you always need SELECT and FROM):
- SELECT: determines which columns to include in the query’s result set
- FROM: identifies the tables from which to draw data and how tables should be joined
- LIMIT: in combination with SELECT, limits the number of records returned
- WHERE: filters out unwanted data
- GROUP BY: groups rows together by common column values
- ORDER BY: sorts the rows in the final result set by one or more columns
Let’s start with some examples of SELECT:
# Get column names
dbListFields(db, "table_tweets")
dbListFields(db, "table_tweets_en")
# Check number of rows
dbGetQuery(db, "SELECT COUNT(*)
FROM table_tweets_en")
# querying just one column
dbGetQuery(db, "SELECT text
FROM table_tweets_en
LIMIT 10")
# multiple columns
dbGetQuery(db, "SELECT screen_name, text
FROM table_tweets_en
LIMIT 10")
# adding expressions
dbGetQuery(db, "SELECT text, UPPER(screen_name) as SCREEN_NAME
FROM table_tweets_en
LIMIT 5")
# adding aliases to the new columns
dbGetQuery(db, "SELECT tweet_created_at, text, text AS story
FROM table_tweets_en
LIMIT 5")
# it's often a good idea to write a query over multiple lines so that it's easier to read:
dbGetQuery(db, "SELECT text,
text AS story
FROM table_tweets_en
LIMIT 10")
# selecting unique values from a column
dbGetQuery(db, "SELECT DISTINCT screen_name
FROM table_tweets_en
LIMIT 100")
Now some examples of how we can use WHERE to subset the database based on values of specific columns:
# selecting based on values of a column
dbGetQuery(db, "SELECT screen_name, text, tweet_created_at
FROM table_tweets_en
WHERE screen_name = 'SenatorEMurphy'
LIMIT 10")
dbGetQuery(db, "SELECT screen_name, text, tweet_created_at
FROM table_tweets_en
WHERE screen_name = 'SenatorEMurphy'")
# greater than, working with dates
dbGetQuery(db, "SELECT screen_name, text, tweet_created_at
FROM table_tweets_en
WHERE tweet_created_at > '2020-04-27'")
# AND operator
dbGetQuery(db, "SELECT screen_name, text, tweet_created_at, retweet_count
FROM table_tweets_en
WHERE tweet_created_at > '2020-04-20' AND screen_name = 'SenatorEMurphy'
LIMIT 10")
# OR operator
dbGetQuery(db, "SELECT screen_name, text, retweet_count
FROM table_tweets_en
WHERE retweet_count > 100 AND (screen_name = 'SenatorEMurphy' OR screen_name = 'Rees4Neath')
LIMIT 100")
# membership, IN
dbGetQuery(db, "SELECT screen_name, tweet_created_at, text
FROM table_tweets_en
WHERE screen_name IN ('SenatorEMurphy', 'Rees4Neath')
LIMIT 10")
# matching conditions with LIKE:
# _ matches exactly one character
# % matches any number of characters
dbGetQuery(db, "SELECT screen_name, tweet_created_at_date, text
FROM table_tweets_en
WHERE tweet_created_at_date LIKE '2013-01-__'
LIMIT 10")
# Exclude retweets (they start with 'RT ')
dbGetQuery(db, "SELECT screen_name, tweet_created_at_date, text
FROM table_tweets_en
WHERE text NOT LIKE 'RT %'
LIMIT 10")
dbGetQuery(db, "SELECT screen_name, tweet_created_at_date, text
FROM table_tweets_en
WHERE tweet_created_at_date LIKE '2013-03%'
LIMIT 10")
# SQLite does not support regular expressions out of the box, but we can get creative with LIKE...
dbGetQuery(db, "SELECT screen_name, tweet_created_at, text, screen_name
FROM table_tweets_en
WHERE text LIKE '%Facebook%'
LIMIT 5")
Sometimes we might just want to pull out one variable to calculate its mean:
# Get one variable and calculate the mean (average retweet count)
data <- dbGetQuery(db, "SELECT retweet_count
FROM table_tweets_en")
data$retweet_count <- as.numeric(data$retweet_count)
mean(data$retweet_count, na.rm=TRUE)
# Draw a random sample and calculate the mean
data.sample <- dbGetQuery(db, "SELECT retweet_count
FROM table_tweets_en
ORDER BY RANDOM()
LIMIT 5000")
data.sample$retweet_count <- as.numeric(data.sample$retweet_count)
mean(data.sample$retweet_count, na.rm=TRUE) # not too bad
Q: Why is it useful to draw a random sample as opposed to analyzing the full population? (Tip: Statistical inference)
Very useful for us: GROUP BY allows us to aggregate results based on the values of specific variables. We will come back to this later, so don’t worry about it for now.
# What does the code below do?
dbGetQuery(db,
"SELECT screen_name, COUNT(screen_name) AS tweet_count, tweet_created_at_date
FROM table_tweets_en
GROUP BY screen_name, tweet_created_at_date
LIMIT 10")
And we can then order the results using ORDER BY. By default, ORDER BY sorts in ascending order (alphabetical for text columns), but we can change that with DESC:
# sort by tweet count
dbGetQuery(db,
"SELECT screen_name, COUNT(screen_name) AS tweet_count, tweet_created_at_date
FROM table_tweets_en
GROUP BY screen_name, tweet_created_at_date
ORDER BY tweet_count DESC
LIMIT 10")
# That was a retweet!
# what was the screen_name with the most posts in our data?
dbGetQuery(db,
"SELECT screen_name, COUNT(screen_name) AS tweet_count
FROM table_tweets_en
GROUP BY screen_name
ORDER BY tweet_count DESC
LIMIT 10")
# You can also specify the column number instead of the name
dbGetQuery(db,
"SELECT screen_name, COUNT(screen_name) AS tweet_count
FROM table_tweets_en
GROUP BY screen_name
ORDER BY 2 DESC
LIMIT 10")
# and we close the connection for now
dbDisconnect(db)
3.13.3 Querying multiple SQL tables
Now we get to see the magic of SQL. It makes it really easy to combine multiple SQL tables using JOIN. We just select columns from different tables and then specify what key should be used to match them using ON.
The example below shows an inner join, but there are other types (left vs. right outer joins, cross joins…) that we will not cover in this workshop; a short sketch of a left join follows the queries below.
library(DBI)
db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
# Check which tables are in there
dbListTables(db)
# We create two new tables in our database that we'll join later on
# id_str is the unique identifier of the individual tweets
table1 <- dbGetQuery(db,
"SELECT id_str, screen_name, retweet_count
FROM table_tweets_en
Limit 100000")
dbWriteTable(db, "table1", table1, overwrite = TRUE)
table2 <- dbGetQuery(db,
"SELECT id_str, screen_name, favorite_count
FROM table_tweets_en
Limit 100000")
dbWriteTable(db, "table2", table2, overwrite = TRUE)
# Check which tables are in there
dbListTables(db)
# Get column names
dbListFields(db, "table_tweets_en")
dbListFields(db, "table1")
dbListFields(db, "table2")
# Let's take table1 and merge it with data from table2
# Ideally the identifier variables would have different names in the two tables;
# since they share names here, we qualify them with the table name, as below
dbGetQuery(db, "SELECT id_str, retweet_count, screen_name
FROM table1
JOIN table2
ON table1.id_str = table2.id_str
LIMIT 10")
dbGetQuery(db, "SELECT table1.id_str,
table1.retweet_count,
table1.screen_name,
table2.favorite_count
FROM table1
JOIN table2
ON table1.id_str = table2.id_str
LIMIT 10")
Once we know how to merge tables, we can use columns from both of them. For example, let’s say we want to find the tweets that got the highest retweet_count (beware: retweets are included here). Here’s how we would do it:
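# One way to do it (a sketch): sort the joined tables by retweet_count
# (if retweet_count were stored as text, add CAST(retweet_count AS INTEGER) in ORDER BY)
dbGetQuery(db, "SELECT table1.id_str,
                       table1.screen_name,
                       table1.retweet_count,
                       table2.favorite_count
                FROM table1
                JOIN table2
                ON table1.id_str = table2.id_str
                ORDER BY table1.retweet_count DESC
                LIMIT 10")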
3.13.4 Grouping and aggregating
SQL makes it really easy to compute aggregate statistics of one column based on values of a different column, with a series of built-in functions (COUNT, AVG, MAX, MIN…). Let’s now see a few examples of this type of query:
# use COUNT with * to return the total number of rows
dbGetQuery(db, 'SELECT COUNT(*)
FROM table_tweets_en')
Combining everything we’ve learned, we can for example count the total number of posts by screen_name, or by screen_name and retweet_count (see the sketch after the next query).
dbGetQuery(db,
"SELECT retweet_count,
table1.screen_name,
COUNT(*) AS total_tweets_by_screen_name
FROM table1 JOIN table2
ON table1.id_str = table2.id_str
GROUP BY table1.screen_name
LIMIT 20")
Here’s another example. Let’s say we want to find out which user (screen_name) wrote the most tweets (and add the sum of favorites given by other users). How would we do this?
dbGetQuery(db,
"SELECT table1.screen_name,
COUNT(*) AS total_tweets,
SUM(favorite_count) AS favorite_count_total
FROM table1 JOIN table2
ON table1.id_str = table2.id_str
GROUP BY table1.screen_name
ORDER BY total_tweets DESC
LIMIT 20")
There are other functions besides just COUNT. We can also compute averages, or the maximum and minimum of a column. And we can use functions like SUBSTR to extract parts of a character string.
# average retweet_count per user (sorted), excluding retweets
dbGetQuery(db,
"SELECT screen_name, SUBSTR(screen_name, 1 , 5), AVG(retweet_count)
FROM table_tweets_en
WHERE text NOT LIKE 'RT %'
GROUP BY screen_name
ORDER BY AVG(retweet_count) DESC
LIMIT 20")
# Most retweeted original tweets
dbGetQuery(db,
"SELECT screen_name, SUBSTR(text, 1 , 40), retweet_count
FROM table_tweets_en
WHERE text NOT LIKE 'RT %'
ORDER BY retweet_count DESC
LIMIT 20")