3.10 Lab: Working with a SQL database

3.10.1 Creating a SQL database

For this first part of the class, we'll be working with a SQLite database, which is self-contained in a single file on our hard drive, with no need to set up a server. The dataset we will work with comes from Twitter (tweets by politicians). It contains various pieces of information, such as the Twitter handle (screen_name), an id for the person (user_id_str) and for the tweet (id_str), the individual tweets stored in text, the time when the tweet was created (created_at), etc.

I will send you a link (by email) to a folder where you can download two data files, data_tweets_en.csv and data_tweets_de.csv. For simplicity, we'll work with small data files here. Make sure to store them in your working directory. We'll start by connecting to the database and adding the first table, with information on the tweets.

# install.packages("pacman")
pacman::p_load(DBI, 
               readr, 
               dplyr, 
               lubridate, 
               RSQLite)
# p_load replaces library/require
# Set working directory
# setwd("..")
# In my case:
# setwd("C:/Users/Paul/Google Drive/2-Teaching/2022 Computational Social Science/2022_computational_social_science/data_email")

# Load the data into R (inspect the result with str())
# Convert date columns to character: SQLite has no native date/time type
  data_tweets_en <- read_csv("./data/data_tweets_en.csv") %>% # Adapt path!
    mutate(text2 = text,                 # keep a copy of the full text
           text = substr(text, 1, 50),   # shorten text for quick inspection
           tweet_created_at = as.character(tweet_created_at),
           tweet_created_at_date = as.character(tweet_created_at_date))

# Create & connect to database: will create a file in our hard drive
# + connect to the db
  db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
  
# adding the first table
  dbWriteTable(db, "table_tweets_en", data_tweets_en, overwrite = TRUE)

# check that it worked: list the fields of the new table
  dbListFields(db, "table_tweets_en")
  
# a first query
  dbGetQuery(db, 'SELECT * FROM table_tweets_en LIMIT 5')
  
# and we close the connection for now
  dbDisconnect(db)

Note that we have used dbWriteTable to add the data frame to our database, and then checked that everything worked with dbListFields and dbGetQuery. As we will see in a moment, dbGetQuery is the function we will use to query the database, and it always returns a data frame.
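
To illustrate, here is a minimal sketch; the screen_name column comes from the dataset description above, while the handle in the WHERE clause is made up and should be replaced with one that occurs in your data:

# Reconnect to the database
  db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")

# dbGetQuery returns a data frame even for aggregate queries
  res <- dbGetQuery(db, 'SELECT COUNT(*) AS n_tweets FROM table_tweets_en')
  class(res) # "data.frame"

# Parameterized queries keep values out of the SQL string
# ("example_handle" is hypothetical; replace it with a real screen_name)
  dbGetQuery(db,
             'SELECT text FROM table_tweets_en WHERE screen_name = ? LIMIT 5',
             params = list("example_handle"))

  dbDisconnect(db)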

We now switch to the second dataset, data_tweets_de.csv, and create a table for it in our database as well, simply repeating the steps above.

  data_tweets_de <- read_csv("./data/data_tweets_de.csv") %>% # Adapt path!
    mutate(text2 = text,
           text = substr(text, 1, 50),
           tweet_created_at = as.character(tweet_created_at),
           tweet_created_at_date = as.character(tweet_created_at_date))

  db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
  dbWriteTable(db, "table_tweets_de", data_tweets_de, overwrite = TRUE)
  dbListFields(db, "table_tweets_de")
  dbGetQuery(db, 'SELECT * FROM table_tweets_de LIMIT 5')
  dbListTables(db) # list tables
  dbDisconnect(db)

The two files are small enough to load into memory. However, imagine that together they exceeded your RAM (e.g. > 8 GB). We would then need to add them to the database sequentially, one at a time (in pieces that we can still open in R).
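
In that case, readr can stream a large CSV in chunks and append each chunk to the database as it arrives. Below is a minimal sketch, assuming a hypothetical large file data_tweets_big.csv with the same columns as our two files; each chunk is preprocessed as above and written to a separate table table_tweets_big:

# A sketch of chunked loading; the file name is hypothetical
  pacman::p_load(DBI, RSQLite, readr, dplyr)

  db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")

  read_csv_chunked(
    "./data/data_tweets_big.csv",           # hypothetical large file
    SideEffectChunkCallback$new(function(chunk, pos) {
      # same preprocessing as above, then append the chunk to the table
      chunk <- chunk %>%
        mutate(text2 = text,
               text = substr(text, 1, 50),
               tweet_created_at = as.character(tweet_created_at),
               tweet_created_at_date = as.character(tweet_created_at_date))
      dbWriteTable(db, "table_tweets_big", chunk, append = TRUE)
    }),
    chunk_size = 10000                      # rows per chunk; tune to your RAM
  )

  dbDisconnect(db)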

Below we sequentially add our two data files (we still have them in memory) to a single table in the database (if you have many files, use a loop!).

# Connect to db
  db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")

# append the first data frame to a combined table
  dbWriteTable(db, "table_tweets", 
               data_tweets_en, 
               append=TRUE)

# append the second data frame to the same table
  dbWriteTable(db, "table_tweets", 
               data_tweets_de, 
               append=TRUE)  
  

# testing that it works
  dbListFields(db, "table_tweets")
  dbGetQuery(db, 'SELECT * FROM table_tweets LIMIT 5')

# show all tables
  dbListTables(db)
  
# what if we make a mistake and want to remove the table?
  # dbRemoveTable(db, "table_tweets_en")

# and we close the connection for now
  dbDisconnect(db)
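
One caveat about append = TRUE: if you re-run the chunk above, the same rows are appended again and the data is duplicated. A minimal sketch of starting over with a clean combined table:

  db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")

# drop the combined table if it already exists ...
  if ("table_tweets" %in% dbListTables(db)) {
    dbRemoveTable(db, "table_tweets")
  }

# ... and rebuild it exactly once
  dbWriteTable(db, "table_tweets", data_tweets_en, append = TRUE)
  dbWriteTable(db, "table_tweets", data_tweets_de, append = TRUE)
  dbGetQuery(db, 'SELECT COUNT(*) AS n_rows FROM table_tweets')

  dbDisconnect(db)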