11.4 Lab 9: Working with a SQL database
11.4.1 Creating an SQL database
For this first part of the class, we’ll be working with a SQLite database, which is self-contained in a file within our hard drive, without any need to set up a server. The dataset we will work with is news from the website Hacker News: https://news.ycombinator.com/ provided by Google BigQuery’s collection of public datasets. The data comes with various pieces of information such as authorname, the title of the story, how much points the story got etc.
Make sure you download the respective files into your working directory (see my email).
We’ll start by connecting to the database and adding the first table, with information on the stories.
library(DBI)
library(readr)
library(dplyr)
library(lubridate)
# create database: this will create a file in our hard drive
db <- dbConnect(RSQLite::SQLite(), "./www/hacker-news-db.sqlite")
# If the files are small we can add them relatively easy.
# There are tables with stories where this is possible.
# We read them into R. Merge them and then write them to the database.
stories0 <- read_csv("./www/hacker-news-data/stories/stories%2Fstories-000000000000.csv")
stories1 <- read_csv("./www/hacker-news-data/stories/stories%2Fstories-000000000001.csv")
stories2 <- read_csv("./www/hacker-news-data/stories/stories%2Fstories-000000000002.csv")
stories3 <- read_csv("./www/hacker-news-data/stories/stories%2Fstories-000000000003.csv")
stories4 <- read_csv("./www/hacker-news-data/stories/stories%2Fstories-000000000004.csv")
stories <- bind_rows(stories0,
stories1,
stories2,
stories3,
stories4)
# Does not work...
stories <- bind_rows(mutate_all(stories0, as.character),
mutate_all(stories1, as.character),
mutate_all(stories2, as.character),
mutate_all(stories3, as.character),
mutate_all(stories4, as.character))
# Convert date
stories$time_ts <- as.Date(stories$time_ts, "%Y-%m-%d")
stories$time_ts <- as.character(stories$time_ts)
# Add index variable
stories$identifier <- 1:nrow(stories)
#stories$identifier <- paste(stories$identifier, replicate(nrow(stories), paste(sample(letters, 10), collapse="")), sep="")
stories$identifier1 <- stories$identifier
stories$identifier2 <- stories$identifier
# adding the first table
dbWriteTable(db, "table_stories", stories, overwrite = TRUE)
# testing that it works with a simple query
dbListFields(db, "table_stories")
dbGetQuery(db, 'SELECT * FROM table_stories LIMIT 5')
# and we close the connection for now
dbDisconnect(db)
Note that we have used dbWriteTable
to add the data frame to our database, and then check that it all worked with dbListFields
and dbGetQuery
. As we will see in a second, this second function is what we will use to query the database, and it will always return a table.
We now switch to the second table. However, the files we want to have in that table (in the folder ‘full’) are too many and too big to open them all in memory (they contain not only the stories but also the comments). Instead, we will open them one by one, and then append
them directly to the table in the database. Let’s see how that works…
db <- dbConnect(RSQLite::SQLite(), "./www/hacker-news-db.sqlite")
files <- list.files("./www/hacker-news-data/full", full.names=TRUE)
for (i in files){
message(i)
# read file into memory
file.i <- read_csv(i)
# Convert date
file.i$timestamp <- as.Date(file.i$timestamp, "%Y-%m-%d")
file.i$timestamp <- as.character(file.i$timestamp)
# adding to table in SQL database
dbWriteTable(db, "table_allposts", file.i, append=TRUE)
}
# ...ok our database has 8.7 Gigabytes now.. depending on your RAM too big to work with it in R
# testing that it works
dbListFields(db, "table_allposts")
dbGetQuery(db, 'SELECT * FROM table_allposts LIMIT 5')
# show the tables
dbListTables(db)
# what if we make a mistake and want to remove the table?
# dbRemoveTable(db, "table_stories")
# and we close the connection for now
dbDisconnect(db)
11.4.2 Querying an SQL database
Now that we have our tables in the database, let’s see how we can query them. First we connect using dbConnect
and then we type our query with dbGetQuery
. Generally, we will write the query clauses
in capital letters and the column names in lowercase.
db <- dbConnect(RSQLite::SQLite(), "./www/hacker-news-db.sqlite")
test <- dbGetQuery(db, 'SELECT *
FROM table_stories
LIMIT 5')
str(test)
RMarkdown actually allows us to write the SQL query directly as a chunk of code (code is hidden here, see ‘www’ folder):
This is the list of query clauses that we will use to work with the database table:
SELECT
: determines which columns to include in the query’s result setFROM
: identifies the tables from which to draw data and how tables should be joinedLIMIT
: in combination with SELECT, limits the number of records returnedWHERE
: filter out unwanted dataGROUP BY
: groups rows together by common column valuesORDER BY
: sorts the rows in the final result set by one or more columns
Let’s start with some examples of SELECT
:
# Get column names
dbListFields(db, "table_allposts")
dbListFields(db, "table_stories")
# querying just one column
dbGetQuery(db, "SELECT score
FROM table_stories
LIMIT 10")
# multiple columns
dbGetQuery(db, "SELECT title, score
FROM table_stories
LIMIT 10")
# adding expressions
dbGetQuery(db, "SELECT score, UPPER(title)
FROM table_stories
LIMIT 10")
# adding aliases to the new columns
dbGetQuery(db, "SELECT time_ts, score, text AS story
FROM table_stories
LIMIT 10")
# it's often a good idea to write a query in multiple lines so that it's easier to read it:
dbGetQuery(db, "SELECT score, text AS story
FROM table_stories
LIMIT 10")
# selecting unique values from a column
dbGetQuery(db, "SELECT DISTINCT score
FROM table_stories
LIMIT 10")
Now some examples of how we can use WHERE
to subset the database based on values of specific columns:
# selecting based on values of a column
dbGetQuery(db, "SELECT title, score, time, author
FROM table_stories
WHERE author = 'kogir'
LIMIT 10")
# greater than, working with dates
dbGetQuery(db, "SELECT title, score, time_ts
FROM table_stories
WHERE time_ts > '2012-01-01'
LIMIT 10")
# AND operator
dbGetQuery(db, "SELECT title, score, time_ts, author
FROM table_stories
WHERE time_ts > '2012-01-01' AND author != 'kogir' AND score > 10
LIMIT 10")
# OR operator
dbGetQuery(db, "SELECT title, time_ts, score, author
FROM table_stories
WHERE score>100 AND (author = 'Willdel' OR author = 'catwork')
LIMIT 100")
# membership, IN
dbGetQuery(db, "SELECT title, time_ts, score, author
FROM table_stories
WHERE author IN ('catwork', 'mclightning')
LIMIT 10")
# matching conditions with LIKE:
# _ matches exactly one character
# % matches any number of characters
dbGetQuery(db, "SELECT title, time_ts, score, author
FROM table_stories
WHERE time_ts LIKE '2013-01-__'
LIMIT 10")
dbGetQuery(db, "SELECT title, time_ts, score, author
FROM table_stories
WHERE time_ts LIKE '2013-03%'
LIMIT 10")
# SQLite does not have regular expressions, but we can get creative...
dbGetQuery(db, "SELECT title, time_ts, score, author
FROM table_stories
WHERE title LIKE '%global%'
LIMIT 5")
# Get one variable and calculate the mean
data <- dbGetQuery(db, "SELECT id, score
FROM table_stories")
data$score <- as.numeric(data$score)
mean(data$score, na.rm=TRUE)
# Draw a random sample and calculate the mean
data.sample <- dbGetQuery(db, "SELECT title, time_ts, score, author
FROM table_stories
ORDER BY RANDOM()
LIMIT 10000")
data.sample$score <- as.numeric(data.sample$score)
mean(data.sample$score, na.rm=TRUE) # not too bad
Q: Why is it useful to draw a random sample as opposed to analyzing the full population? (Tip: Statistical inference)
Very useful for us – GROUP BY
allows us to aggregate results based on values of a specific variables. We will come back to this later, so don’t worry about it for now.
dbGetQuery(db,
"SELECT author, COUNT(author) AS author_count
FROM table_stories
GROUP BY score
LIMIT 100")
And we can then order the results using ORDER BY
. The default is alphabetical order for the first column, but we can change that:
# sort by score
dbGetQuery(db,
"SELECT author, title, score
FROM table_stories
GROUP BY score
ORDER BY score
LIMIT 100")
# now in descending orders
dbGetQuery(db,
"SELECT author, title, score
FROM table_stories
GROUP BY score
ORDER BY score
LIMIT 100")
# what was the author with the most posts?
dbGetQuery(db,
"SELECT author, COUNT(author) AS author_count, score
FROM table_stories
GROUP BY author
ORDER BY author_count DESC
LIMIT 100")
# You can also specify the column number instead of the name
dbGetQuery(db,
"SELECT author, title, score
FROM table_stories
GROUP BY score
ORDER BY 3
LIMIT 100")
# and we close the connection for now
dbDisconnect(db)
11.4.3 Querying multiple SQL tables
Now’s where we get to see the magic of SQL. It makes it really easy to combine multiple SQL tables using JOIN
. We just select columns from different tables and then specify what key should be used to match them using ON
.
The example belows shows an inner join
, but there are other types (left vs right outer joins, cross joins…) that we will not cover in this workshop.
library(DBI)
db <- dbConnect(RSQLite::SQLite(), "./www/hacker-news-db.sqlite")
# Check which tables are in there
dbListTables(db)
# We create two new tables in our database that
# we'll join later on
table_stories_subset1 <- dbGetQuery(db,
"SELECT identifier1, time, url, text, score
FROM table_stories
Limit 100000")
dbWriteTable(db, "table_stories_subset1", table_stories_subset1, overwrite = TRUE)
table_stories_subset2 <- dbGetQuery(db,
"SELECT identifier2, author, title
FROM table_stories
Limit 100000")
dbWriteTable(db, "table_stories_subset2", table_stories_subset2, overwrite = TRUE)
# Check which tables are in there
dbListTables(db)
# Get column names
dbListFields(db, "table_stories")
dbListFields(db, "table_stories_subset1")
dbListFields(db, "table_stories_subset2")
# Let's take the stories table and merge it with data from the allposts table
# The identifer variables ideally have different names
dbGetQuery(db, "SELECT identifier1, score, author, title
FROM table_stories_subset1 JOIN table_stories_subset2
ON identifier1 = identifier2
LIMIT 10")
Once we know how to merge tables, we can use columns from both tables. For example, let’s say we want to find the stories that got the highest score (beware we are working with a subset of 100000 here). Here’s how we would do it:
dbGetQuery(db,
"SELECT identifier1, score, author, title
FROM table_stories_subset1 JOIN table_stories_subset2
ON identifier1 = identifier2
WHERE score > 100
ORDER BY score DESC
LIMIT 10")
11.4.4 Grouping and aggregating
SQL makes it really easy to compute aggregate statistics of one column based on values of a different column, with a series of built-in functions (COUNT, AVG, MAX, MIN…). Let’s now see a few examples of this type of queries:
# use COUNT with * to return the total number of rows
dbGetQuery(db, 'SELECT COUNT(*)
FROM table_stories')
Combining everything we’ve learned, we can for example count the total number of posts by author, or by author and score.
dbGetQuery(db,
"SELECT score, author, title, COUNT(*) AS total_stories_by_author
FROM table_stories_subset1 JOIN table_stories_subset2
ON identifier1 = identifier2
GROUP BY author
LIMIT 20")
Here’s another example. Let’s say we want to find out which author wrote the most stories. How would we do this?
dbGetQuery(db,
"SELECT score, author, title, COUNT(*) AS total_stories
FROM table_stories_subset1 JOIN table_stories_subset2
ON identifier1 = identifier2
GROUP BY author
ORDER BY total_stories DESC
LIMIT 20")
There are other functions besides just COUNT
. We can also compute averages, the maximum and minimum of a column, the sum… We can also use functions like SUBSTR
to extract parts of a character string.
# average score (sorted) and number of stories
dbGetQuery(db,
"SELECT author, AVG(score), COUNT(*)
FROM table_stories_subset1 JOIN table_stories_subset2
ON identifier1 = identifier2
GROUP BY author
ORDER BY AVG(score) DESC
LIMIT 20")
# average score (sorted) and number of stories
dbGetQuery(db,
"SELECT author, SUBSTR(title, 1 , 5), score
FROM table_stories_subset1 JOIN table_stories_subset2
ON identifier1 = identifier2
LIMIT 20")