Big data and Social Science

11.4 Lab 9: Working with a SQL database

11.4.1 Creating an SQL database

For this first part of the class, we’ll be working with a SQLite database, which is self-contained in a file within our hard drive, without any need to set up a server. The dataset we will work with is news from the website Hacker News: https://news.ycombinator.com/ provided by Google BigQuery’s collection of public datasets. The data comes with various pieces of information such as authorname, the title of the story, how much points the story got etc.

Make sure you download the respective files into your working directory (see my email).

We’ll start by connecting to the database and adding the first table, with information on the stories.

library(DBI)
library(readr)
library(dplyr)
library(lubridate)

# create database: this will create a file in our hard drive
  db <- dbConnect(RSQLite::SQLite(), "./www/hacker-news-db.sqlite")

# If the files are small we can add them relatively easy.
# There are tables with stories where this is possible. 
# We read them into R. Merge them and then write them to the database.

  stories0 <- read_csv("./www/hacker-news-data/stories/stories%2Fstories-000000000000.csv")
  stories1 <- read_csv("./www/hacker-news-data/stories/stories%2Fstories-000000000001.csv")
  stories2 <- read_csv("./www/hacker-news-data/stories/stories%2Fstories-000000000002.csv")
  stories3 <- read_csv("./www/hacker-news-data/stories/stories%2Fstories-000000000003.csv")
  stories4 <- read_csv("./www/hacker-news-data/stories/stories%2Fstories-000000000004.csv")
  stories <- bind_rows(stories0,
                       stories1,
                       stories2,
                       stories3,
                       stories4)
# Does not work...

  stories <- bind_rows(mutate_all(stories0, as.character), 
                       mutate_all(stories1, as.character),
                       mutate_all(stories2, as.character),
                       mutate_all(stories3, as.character),
                       mutate_all(stories4, as.character))
  # Convert date
  stories$time_ts <- as.Date(stories$time_ts, "%Y-%m-%d")
  stories$time_ts <- as.character(stories$time_ts)
  
  # Add index variable
  stories$identifier <- 1:nrow(stories)
  #stories$identifier <- paste(stories$identifier, replicate(nrow(stories), paste(sample(letters, 10), collapse="")), sep="")
  stories$identifier1 <- stories$identifier
  stories$identifier2 <- stories$identifier
  
# adding the first table
  dbWriteTable(db, "table_stories", stories, overwrite = TRUE)

# testing that it works with a simple query
  dbListFields(db, "table_stories")
  dbGetQuery(db, 'SELECT * FROM table_stories LIMIT 5')
  
# and we close the connection for now
  dbDisconnect(db)

Note that we have used dbWriteTable to add the data frame to our database, and then check that it all worked with dbListFields and dbGetQuery. As we will see in a second, this second function is what we will use to query the database, and it will always return a table.

We now switch to the second table. However, the files we want to have in that table (in the folder ‘full’) are too many and too big to open them all in memory (they contain not only the stories but also the comments). Instead, we will open them one by one, and then append them directly to the table in the database. Let’s see how that works…

  db <- dbConnect(RSQLite::SQLite(), "./www/hacker-news-db.sqlite")

files <- list.files("./www/hacker-news-data/full", full.names=TRUE)

for (i in files){
  
  message(i)

  # read file into memory
  file.i <- read_csv(i)
  # Convert date
  file.i$timestamp <- as.Date(file.i$timestamp, "%Y-%m-%d")
  file.i$timestamp <- as.character(file.i$timestamp)
  
  # adding to table in SQL database
  dbWriteTable(db, "table_allposts", file.i, append=TRUE)
}

# ...ok our database has 8.7 Gigabytes now.. depending on your RAM too big to work with it in R

# testing that it works
  dbListFields(db, "table_allposts")
  dbGetQuery(db, 'SELECT * FROM table_allposts LIMIT 5')

# show the tables
  dbListTables(db)
  
# what if we make a mistake and want to remove the table?
  # dbRemoveTable(db, "table_stories")

# and we close the connection for now
  dbDisconnect(db)

11.4.2 Querying an SQL database

Now that we have our tables in the database, let’s see how we can query them. First we connect using dbConnect and then we type our query with dbGetQuery. Generally, we will write the query clauses in capital letters and the column names in lowercase.

db <- dbConnect(RSQLite::SQLite(), "./www/hacker-news-db.sqlite")
test <- dbGetQuery(db, 'SELECT * 
                   FROM table_stories 
                   LIMIT 5')
str(test)

RMarkdown actually allows us to write the SQL query directly as a chunk of code (code is hidden here, see ‘www’ folder):

This is the list of query clauses that we will use to work with the database table:

SELECT: determines which columns to include in the query’s result set
FROM: identifies the tables from which to draw data and how tables should be joined
LIMIT: in combination with SELECT, limits the number of records returned
WHERE: filter out unwanted data
GROUP BY: groups rows together by common column values
ORDER BY: sorts the rows in the final result set by one or more columns

Let’s start with some examples of SELECT:

# Get column names
dbListFields(db, "table_allposts")
dbListFields(db, "table_stories")

# querying just one column
dbGetQuery(db, "SELECT score 
           FROM table_stories 
           LIMIT 10")

# multiple columns
dbGetQuery(db, "SELECT title, score 
           FROM table_stories 
           LIMIT 10")

# adding expressions
dbGetQuery(db, "SELECT score, UPPER(title) 
           FROM table_stories 
           LIMIT 10")

# adding aliases to the new columns
dbGetQuery(db, "SELECT time_ts, score, text AS story 
           FROM table_stories 
           LIMIT 10")

# it's often a good idea to write a query in multiple lines so that it's easier to read it:
dbGetQuery(db, "SELECT score, text AS story 
           FROM table_stories 
           LIMIT 10")

# selecting unique values from a column
dbGetQuery(db, "SELECT DISTINCT score 
           FROM table_stories 
           LIMIT 10")

Now some examples of how we can use WHERE to subset the database based on values of specific columns:

# selecting based on values of a column
dbGetQuery(db, "SELECT title, score, time, author
           FROM table_stories
           WHERE author = 'kogir'
           LIMIT 10")

# greater than, working with dates
dbGetQuery(db, "SELECT title, score, time_ts 
           FROM table_stories
           WHERE time_ts > '2012-01-01'
           LIMIT 10")

# AND operator
dbGetQuery(db, "SELECT title, score, time_ts, author
           FROM table_stories
           WHERE time_ts > '2012-01-01' AND author != 'kogir' AND score > 10
           LIMIT 10")

# OR operator
dbGetQuery(db, "SELECT title, time_ts, score, author
           FROM table_stories
           WHERE score>100 AND (author = 'Willdel' OR author = 'catwork')
           LIMIT 100")

# membership, IN
dbGetQuery(db, "SELECT  title, time_ts, score, author
           FROM table_stories
           WHERE author IN ('catwork', 'mclightning')
           LIMIT 10")

# matching conditions with LIKE:
# _ matches exactly one character
# % matches any number of characters
dbGetQuery(db, "SELECT  title, time_ts, score, author
           FROM table_stories
           WHERE time_ts LIKE '2013-01-__'
           LIMIT 10")

dbGetQuery(db, "SELECT title, time_ts, score, author
           FROM table_stories
           WHERE time_ts LIKE '2013-03%'
           LIMIT 10")

# SQLite does not have regular expressions, but we can get creative...
dbGetQuery(db, "SELECT  title, time_ts, score, author
           FROM table_stories
           WHERE title LIKE '%global%'
           LIMIT 5")

# Get one variable and calculate the mean
  data <- dbGetQuery(db, "SELECT id, score 
                     FROM table_stories")
  data$score <- as.numeric(data$score)
  mean(data$score, na.rm=TRUE)
  
# Draw a random sample and calculate the mean
  data.sample <- dbGetQuery(db, "SELECT  title, time_ts, score, author
           FROM table_stories
           ORDER BY RANDOM()
           LIMIT 10000")
  data.sample$score <- as.numeric(data.sample$score)
  mean(data.sample$score, na.rm=TRUE) # not too bad

Q: Why is it useful to draw a random sample as opposed to analyzing the full population? (Tip: Statistical inference)

Very useful for us – GROUP BY allows us to aggregate results based on values of a specific variables. We will come back to this later, so don’t worry about it for now.

dbGetQuery(db, 
  "SELECT author, COUNT(author) AS author_count
  FROM table_stories
  GROUP BY score
  LIMIT 100")

And we can then order the results using ORDER BY. The default is alphabetical order for the first column, but we can change that:

# sort by score
dbGetQuery(db, 
           "SELECT author, title, score
           FROM table_stories
           GROUP BY score
           ORDER BY score
           LIMIT 100")

# now in descending orders
dbGetQuery(db, 
           "SELECT author, title, score
           FROM table_stories
           GROUP BY score
           ORDER BY score
           LIMIT 100")

# what was the author with the most posts?
dbGetQuery(db, 
  "SELECT author, COUNT(author) AS author_count, score
  FROM table_stories
  GROUP BY author
  ORDER BY author_count DESC
  LIMIT 100")

# You can also specify the column number instead of the name
dbGetQuery(db, 
           "SELECT author, title, score
           FROM table_stories
           GROUP BY score
           ORDER BY 3
           LIMIT 100")

# and we close the connection for now
  dbDisconnect(db)

11.4.3 Querying multiple SQL tables

Now’s where we get to see the magic of SQL. It makes it really easy to combine multiple SQL tables using JOIN. We just select columns from different tables and then specify what key should be used to match them using ON.

The example belows shows an inner join, but there are other types (left vs right outer joins, cross joins…) that we will not cover in this workshop.

library(DBI)
db <- dbConnect(RSQLite::SQLite(), "./www/hacker-news-db.sqlite")

# Check which tables are in there
  dbListTables(db)

# We create two new tables in our database that
  # we'll join later on
  table_stories_subset1 <- dbGetQuery(db, 
  "SELECT identifier1, time, url, text, score
  FROM table_stories
  Limit 100000")
  dbWriteTable(db, "table_stories_subset1", table_stories_subset1, overwrite = TRUE)  
  
  table_stories_subset2 <- dbGetQuery(db, 
  "SELECT identifier2, author, title
  FROM table_stories
  Limit 100000")
  dbWriteTable(db, "table_stories_subset2", table_stories_subset2, overwrite = TRUE)  
  
# Check which tables are in there
  dbListTables(db)
  
# Get column names
  dbListFields(db, "table_stories")
  dbListFields(db, "table_stories_subset1")
  dbListFields(db, "table_stories_subset2")
 
# Let's take the stories table and merge it with data from the allposts table
# The identifer variables ideally have different names
  
  dbGetQuery(db, "SELECT identifier1, score, author, title
                  FROM table_stories_subset1 JOIN table_stories_subset2
                  ON identifier1 = identifier2
                  LIMIT 10")

Once we know how to merge tables, we can use columns from both tables. For example, let’s say we want to find the stories that got the highest score (beware we are working with a subset of 100000 here). Here’s how we would do it:

dbGetQuery(db, 
  "SELECT identifier1, score, author, title
  FROM table_stories_subset1 JOIN table_stories_subset2
  ON identifier1 = identifier2
  WHERE score > 100
  ORDER BY score DESC
  LIMIT 10")

11.4.4 Grouping and aggregating

SQL makes it really easy to compute aggregate statistics of one column based on values of a different column, with a series of built-in functions (COUNT, AVG, MAX, MIN…). Let’s now see a few examples of this type of queries:

# use COUNT with * to return the total number of rows
dbGetQuery(db, 'SELECT COUNT(*) 
           FROM table_stories')

Combining everything we’ve learned, we can for example count the total number of posts by author, or by author and score.

dbGetQuery(db, 
  "SELECT score, author, title, COUNT(*) AS total_stories_by_author
  FROM table_stories_subset1 JOIN table_stories_subset2
    ON identifier1 = identifier2
  GROUP BY author
  LIMIT 20")

Here’s another example. Let’s say we want to find out which author wrote the most stories. How would we do this?

dbGetQuery(db, 
  "SELECT score, author, title, COUNT(*) AS total_stories
  FROM table_stories_subset1 JOIN table_stories_subset2
    ON identifier1 = identifier2
  GROUP BY author
  ORDER BY total_stories DESC
  LIMIT 20")

There are other functions besides just COUNT. We can also compute averages, the maximum and minimum of a column, the sum… We can also use functions like SUBSTR to extract parts of a character string.

# average score (sorted) and number of stories
dbGetQuery(db, 
  "SELECT author, AVG(score), COUNT(*) 
  FROM table_stories_subset1 JOIN table_stories_subset2
    ON identifier1 = identifier2
  GROUP BY author
  ORDER BY AVG(score) DESC
  LIMIT 20")

# average score (sorted) and number of stories
dbGetQuery(db, 
  "SELECT author, SUBSTR(title, 1 , 5), score
  FROM table_stories_subset1 JOIN table_stories_subset2
    ON identifier1 = identifier2
    LIMIT 20")