3.10 Lab: Working with a SQL database
3.10.1 Creating a SQL database
For this first part of the class, we'll be working with a SQLite database, which is self-contained in a file on our hard drive, with no need to set up a server. The dataset we will work with comes from Twitter (tweets by politicians). It contains various pieces of information, such as the Twitter handle (screen_name), the IDs for the person (user_id_str) and for the tweet (id_str), the individual tweets stored in text, the time when the tweet was created (created_at), etc.
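Once you have downloaded the files (see the next paragraph), you can quickly confirm these columns before loading everything. A minimal sketch, assuming the files live in a ./data/ subfolder:
# Read only the first rows and inspect the structure (the ./data/ path is an assumption)
library(readr)
str(read_csv("./data/data_tweets_en.csv", n_max = 5))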
I will send you a link (by email) to a folder where you can download two data files, data_tweets_en.csv and data_tweets_de.csv. For simplicity we'll work with small data files here. Make sure to store them in your working directory. We'll start by connecting to the database and adding the first table, with information on the tweets.
# install.packages("pacman")
::p_load(DBI,
pacman
readr,
dplyr,
lubridate,
RSQLite)# p_load replaces library/require
# Set working directory
# setwd("..")
# In my case:
# setwd("C:/Users/Paul/Google Drive/2-Teaching/2022 Computational Social Science/2022_computational_social_science/data_email")
# Load the data into R (check with str())
# Convert date columns to chr class
<- read_csv("./data/data_tweets_en.csv") %>% # Adapt path!
data_tweets_en mutate(text2 = text,
text = substr(text, 1, 50),
tweet_created_at = as.character(tweet_created_at),
tweet_created_at_date = as.character(tweet_created_at_date))
# Create & connect to database: will create a file in our hard drive
# + connect to the db
<- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
db
# adding the first table
dbWriteTable(db, "table_tweets_en", data_tweets_en, overwrite = TRUE)
# testing that it works with a simple query
dbListFields(db, "table_tweets_en")
# a first query
dbGetQuery(db, 'SELECT * FROM table_tweets_en LIMIT 5')
# and we close the connection for now
dbDisconnect(db)
Note that we used dbWriteTable to add the data frame to our database, and then checked that everything worked with dbListFields and dbGetQuery. As we will see in a moment, dbGetQuery is the function we will use to query the database, and it always returns its result as a data frame.
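Because dbGetQuery returns an ordinary data frame, its result can be stored and manipulated like any other R object. A minimal sketch (the choice of columns is just an illustration, using fields described above):
db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
res <- dbGetQuery(db, 'SELECT screen_name, text FROM table_tweets_en LIMIT 5')
class(res) # "data.frame": ready for dplyr, plotting, etc.
dbDisconnect(db)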
We now switch to the second dataset, data_tweets_de.csv, and create a table in our database for it by simply repeating the steps above.
<- read_csv("./data/data_tweets_de.csv") %>%
data_tweets_de mutate(text2 = text,
text = substr(text, 1, 50),
tweet_created_at = as.character(tweet_created_at),
tweet_created_at_date = as.character(tweet_created_at_date))
<- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
db dbWriteTable(db, "table_tweets_de", data_tweets_de, overwrite = TRUE)
dbListFields(db, "table_tweets_de")
dbGetQuery(db, 'SELECT * FROM table_tweets_de LIMIT 5')
dbListTables(db) # list tables
dbDisconnect(db)
The two files are small enough to load into memory. However, imagine that together they exceeded your RAM (e.g., > 8 GB). Then we would need to add them to the database sequentially, one piece at a time (in chunks small enough to still open in R).
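One way to do this without ever holding a full file in memory is readr's chunked reader, which passes each chunk to a callback. A minimal sketch, assuming the same columns as above (the table name table_tweets_chunked and the chunk size are illustrative):
db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
read_csv_chunked(
  "./data/data_tweets_en.csv",
  callback = SideEffectChunkCallback$new(function(chunk, pos) {
    chunk <- chunk %>%
      mutate(tweet_created_at = as.character(tweet_created_at),
             tweet_created_at_date = as.character(tweet_created_at_date))
    dbWriteTable(db, "table_tweets_chunked", chunk, append = TRUE)
  }),
  chunk_size = 10000 # rows per chunk: pick a size that fits comfortably in RAM
)
dbDisconnect(db)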
Below we sequentially append our two data files (which are still in memory) to a single table in the database (if you have many files, use a loop; see the sketch at the end of this section!).
# Connect to db
db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")

# append the first data frame to the shared table
dbWriteTable(db, "table_tweets", data_tweets_en, append = TRUE)

# append the second data frame to the same table
dbWriteTable(db, "table_tweets", data_tweets_de, append = TRUE)
# testing that it works
dbListFields(db, "table_tweets")
dbGetQuery(db, 'SELECT * FROM table_tweets LIMIT 5')
# show all tables
dbListTables(db)
# what if we make a mistake and want to remove a table?
# dbRemoveTable(db, "table_tweets_en")
# and we close the connection for now
dbDisconnect(db)
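As noted above, with many files you would wrap the append step in a loop rather than writing one dbWriteTable call per file. A minimal sketch (the table name table_tweets_loop is hypothetical, to avoid duplicating rows in table_tweets; it assumes every file shares the same columns):
files <- c("./data/data_tweets_en.csv", "./data/data_tweets_de.csv")
db <- dbConnect(RSQLite::SQLite(), "data_tweets.sqlite")
for (f in files) {
  piece <- read_csv(f) %>%
    mutate(text2 = text,
           text = substr(text, 1, 50),
           tweet_created_at = as.character(tweet_created_at),
           tweet_created_at_date = as.character(tweet_created_at_date))
  dbWriteTable(db, "table_tweets_loop", piece, append = TRUE) # same schema in every file
}
dbDisconnect(db)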