Chapter 19 Databases
library(tidyverse)
library(readr)
library(RSQLite)
library(DBI) # DataBase Interface Package
library(dbplyr) # dplyr with databases!
There is a YouTube Video Lecture for the chapter.
19.1 Introduction
As our data grows larger and is updated more frequently, we need to stop using static input files and instead learn to interact with databases. There are many reasons for using a database, but these are my favorite:
Data Freshness Because the database holds the definitive copy of the data, there isn’t a problem of using a .csv file that is months (or years) old. That means my results are constantly being updated with new data. When working with a team of individuals or applications that all have read/write access to the data, there must be some program making sure data is not accidentally overwritten or out of date.
Non-Local Storage Often we deal with databases that live on a computer that we access via the internet. Because the data lives on the database and not my computer, I don’t have to occupy gigabytes of space on my laptop to hold an out-of-date copy of the data. Furthermore, this allows us to interact with datasets that are too large to even be held on my wimpy laptop.
Application Reliability Whenever I update the database, the action either happens or it doesn’t and the database should never be left in an inconsistent state. As a result, financial transactions can be reliably processed over unreliable connections and the worst that happens is that the transaction doesn’t happen. For applications where the database is also on the same device, this reliability means that if the application (game!) is abruptly ended, the game state isn’t corrupted and we don’t lose the user’s whole history.
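Under the hood, this reliability comes from transactions: a group of statements that either all succeed or all fail together. Below is a minimal sketch of the pattern using DBI's transaction functions; the Accounts table and its columns are hypothetical, and we assume an open connection con.

```r
# A sketch of an atomic money transfer (hypothetical Accounts table).
# Either both UPDATEs are committed, or neither is.
DBI::dbBegin(con)
result <- try({
  DBI::dbExecute(con, "UPDATE Accounts SET Balance = Balance - 100 WHERE ID = 1")
  DBI::dbExecute(con, "UPDATE Accounts SET Balance = Balance + 100 WHERE ID = 2")
})
if (inherits(result, "try-error")) {
  DBI::dbRollback(con)   # undo the partial work
} else {
  DBI::dbCommit(con)     # make both changes permanent
}
```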
Fortunately, reading data from databases isn't too hard. If all we need to do is read tables and calculate summary statistics, the dbplyr package allows us to almost ignore the fact that the data isn't stored locally. The great people at RStudio have created a great website for reading data from databases using their dbplyr package.
For updating a database, we’ll need to learn a little SQL and learn how to pass R variables into the SQL call and import the SQL result back into our R environment.
19.2 Establishing a Connection
Databases should be run on a server that is ALWAYS on and available via an internet connection. To connect to a database, we’ll need to know the internet address and be able to authenticate with a username/password combination.
To demonstrate how a database works, we unfortunately need to have a live database to connect to. In real situations this would already be done (probably by somebody else) and you would just need to install some DataBase Interface (DBI) package that will negotiate creating a connection between your R-session and the database.
# Normally, a database connection looks something like this:
con <- DBI::dbConnect(RMariaDB::MariaDB(),
  host = "database.rstudio.com",
  user = "hadley",
  password = rstudioapi::askForPassword("Database password")
)

# For a Postgres database, it might look like this:
con <- DBI::dbConnect(dbDriver("PostgreSQL"), dbname = "postgres",
  host = "database.nau.edu", port = 5432,
  user = "dls354",
  password = rstudioapi::askForPassword("Database password"))
However, for this tutorial we need to start up a database before we can start working with it.
# For our little toy example, we'll use a SQLite database file of the
# Credit Card Transactions example we explored previously.
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = 'data-raw/Transactions.db')
The DBI package has several functions that allow us to figure out what tables exist and what the columns are in a particular table.
DBI Command | Description |
---|---|
`dbListTables()` | Returns the tables available in a database. |
`dbListFields()` | What columns are in a given table. |
`dbColumnInfo()` | Returns meta information about columns returned by a `dbGetQuery()` command. Only some databases support this function; SQLite does not. |
DBI::dbListTables(con)   # What tables are present
## [1] "Cards" "Customers" "Retailers" "Transactions"
There are four tables. The Customers table contains personal information about our customers. To see what columns are present, we can use the following:
DBI::dbListFields(con, 'Customers')  # What columns are in Customers table
## [1] "PersonID" "Name" "Street" "City" "State"
19.3 Basic SQL Commands
The traditional way to interact with a database is by using SQL syntax. SQL stands for Structured Query Language, and some understanding of SQL is mandatory for anyone who interacts with databases. There are many good introductions to SQL, but we'll cover a few basics here.
19.3.1 SELECT rows
One way to run SQL commands on the remote database is to use the DBI package commands. There are several functions (DBI::dbXXXXX()) that take a previously established connection and a SQL command stored as a character string.
DBI Command | Description |
---|---|
`dbGetQuery()` | Run a SELECT command and return the result. |
`dbExecute()` | Run a SQL command that does not return a table (e.g. INSERT or UPDATE). The return value is the number of rows that were affected. |
For example, we might want to select all the rows and columns from the Transactions table. To do this, we would run the following SQL command.
sql_cmd <- 'SELECT * FROM Transactions'        # Define a character string with SQL command
transactions <- DBI::dbGetQuery(con, sql_cmd)  # Run the command
transactions
## CardID RetailID DateTime Amount
## 1 9876768717278723 1 2019-10-01 08:31:23 5.68
## 2 9876765498122734 2 2019-10-01 12:45:45 25.67
## 3 9876768717278723 1 2019-10-02 08:26:31 5.68
## 4 9876768717278723 1 2019-10-02 08:30:09 9.23
## 5 9876765798212987 3 2019-10-05 18:58:57 68.54
## 6 9876765498122734 2 2019-10-05 12:39:26 31.84
## 7 9876768965231926 2 2019-10-10 19:02:20 42.83
## 8 9876765798212988 1 2019-10-16 14:30:21 4.98
We can break down this SQL command as follows:
SQL Function | Description |
---|---|
`SELECT` | A keyword that denotes that the following is a query. |
`*` | A placeholder meaning all columns. This could be any column name(s). |
`FROM` | A keyword indicating that whatever follows is the table (or tables) being selected from. Any table joins need to be constrained in the WHERE clause to tell us what columns need to match. |
`WHERE` | A keyword indicating that the following logical statements will be used to filter rows. Boolean operators `AND`, `OR`, and `NOT` can be used to create complex filter statements. |
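As a quick illustration of a compound filter, the sketch below combines conditions with AND and NOT, run from an R chunk against the Customers table we listed earlier; the specific values are just for illustration.

```r
# Flagstaff customers who do NOT live at 231 River Run.
# AND / NOT combine the logical conditions in the WHERE clause.
DBI::dbGetQuery(con, "
  SELECT Name, Street, City FROM Customers
  WHERE City = 'Flagstaff' AND NOT Street = '231 River Run'
  ")
```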
One convenient way to run SQL commands is to just get creative with the formatting inside an R chunk and wrap your SQL command in the dbGetQuery() command.
out <- DBI::dbGetQuery(con, "
  SELECT * FROM Customers
  WHERE Street = '231 River Run'
  ")
out
## PersonID Name Street City State
## 1 1 Derek Sonderegger 231 River Run Flagstaff AZ
## 2 2 Aubrey Sonderegger 231 River Run Flagstaff AZ
Another option is to use a SQL chunk. Because RStudio supports code chunks with different languages (including SQL), you could insert a SQL code chunk and have it appropriately run when knitting the document together. Unfortunately, to run this interactively you are required to have the inline output option set (Preferences -> R Markdown -> Show output inline...). This is enough for me to prefer the previous solution, but SQL chunks do work.
A sql chunk requires the connection to be defined and, optionally, an output variable name (output.var) to store the return value from the SQL call.
```{sql, connection=con, output.var='sql_output'}
/* This is a SQL code chunk! */
SELECT * from Customers
```
# In further R chunks, I can refer to the output.var variable
sql_output
## PersonID Name Street City State
## 1 1 Derek Sonderegger 231 River Run Flagstaff AZ
## 2 2 Aubrey Sonderegger 231 River Run Flagstaff AZ
## 3 3 Robert Buscaglia 754 Forest Heights Flagstaff AZ
## 4 4 Roy St Laurent 845 Elk View Flagstaff AZ
From here on out, I’ll just note when I’m in an SQL chunk with a comment.
SQL SELECT statements often include table joins. Unfortunately, you always have to specify how to do the table joins; SQL won't default to using column names that are common to both tables. In the following code we use Table.Column notation to uniquely specify a table/column pair. Because SQL uses the dot in this manner, you should avoid using a dot in your R column names.
In the following SQL command, the information about how to do the table joins is mixed up with the information about filtering for a particular customer. I find this more difficult to read than our usual dplyr joins.
/* SQL Chunk */
SELECT Customers.Name, Transactions.DateTime, Retailers.Name, Transactions.Amount
FROM Customers, Cards, Transactions, Retailers
WHERE Customers.PersonID = Cards.PersonID AND
      Cards.CardID = Transactions.CardID AND
      Transactions.RetailID = Retailers.RetailID AND
      Customers.Name = 'Derek Sonderegger'
## Name DateTime Name Amount
## 1 Derek Sonderegger 2019-10-01 08:31:23 Kickstand Kafe 5.68
## 2 Derek Sonderegger 2019-10-02 08:26:31 Kickstand Kafe 5.68
## 3 Derek Sonderegger 2019-10-02 08:30:09 Kickstand Kafe 9.23
19.3.2 INSERT a new Row
To insert a row into a table, we need to specify the table to insert into as well as the values to be inserted. It is good practice to specify the column/value pair (as opposed to assuming the column order) using the following syntax.
/* SQL Chunk */
INSERT INTO Customers (PersonID, Name, Street, City, State)
VALUES (5, 'John Smith', '2378 N Main St', 'Flagstaff', 'AZ');
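If you'd rather stay in an R chunk, the same INSERT can be run with DBI::dbExecute(), which returns the number of rows affected (here it would be 1). A sketch; running both versions would insert John twice.

```r
# Run the INSERT from R; dbExecute() returns the number of rows affected.
DBI::dbExecute(con,
  "INSERT INTO Customers (PersonID, Name, Street, City, State)
   VALUES (5, 'John Smith', '2378 N Main St', 'Flagstaff', 'AZ')")
```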
19.3.3 UPDATE a row
Updating a row in the database is similar, but we need to specify which rows to update. This is done by including a WHERE clause, which feels similar to the dplyr::filter() function. The only major difference is that SQL uses the single = sign for both assignment and testing equality.
/* SQL Chunk */
UPDATE Customers
SET Street = '5638 Daisy Ln', City='Mountainaire', State='AZ'
WHERE PersonID = 3 AND Name='Robert Buscaglia'
In the WHERE statement, multiple test conditions must be separated by AND or OR conjunctions.
/* SQL Chunk */
SELECT * from Customers
## PersonID Name Street City State
## 1 1 Derek Sonderegger 231 River Run Flagstaff AZ
## 2 2 Aubrey Sonderegger 231 River Run Flagstaff AZ
## 3 3 Robert Buscaglia 5638 Daisy Ln Mountainaire AZ
## 4 4 Roy St Laurent 845 Elk View Flagstaff AZ
## 5 5 John Smith 2378 N Main St Flagstaff AZ
19.4 Passing R variables into SQL chunks
Inevitably I would like to be able to create a SQL command that depends on some dynamic R variable. For example, perhaps I’d like to write a script that queries the user for a person’s name and then prints out all the transactions associated with the person. To do this, we need a mechanism to pass an R variable into the SQL command.
# In an R chunk, I've defined some object that I'd like to use in a SQL statement.
CustomerName <- 'Derek Sonderegger'
/* Anything with a ? prepended will be *
* replaced with the R object of the same name. */
SELECT * FROM Customers WHERE Name = ?CustomerName
## PersonID Name Street City State
## 1 1 Derek Sonderegger 231 River Run Flagstaff AZ
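The same substitution can be done from an R chunk using DBI's params argument, which binds R values to ? placeholders. A minimal sketch:

```r
# The ? placeholder is filled from the params list, so user input
# never gets pasted directly into the query string.
DBI::dbGetQuery(con,
  "SELECT * FROM Customers WHERE Name = ?",
  params = list(CustomerName))
```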
There are some additional situations where a simple character string substitution doesn’t quite work because of how R interprets character strings. In particular, passing multiple strings won’t work. For example, selecting a person by name requires the character string to be quoted, and some SQL databases have custom quotation syntax.
# R chunk
customers <- c('Derek Sonderegger', 'Aubrey Sonderegger')   # This won't work
customers <- '"Derek Sonderegger","Aubrey Sonderegger"'     # The single string will work
/* SQL Chunk */
SELECT * FROM Customers WHERE Name IN (?customers)
## [1] PersonID Name Street City State
## <0 rows> (or 0-length row.names)
But unbelievably this doesn't return any rows, when it should return two. The solution is to use the glue package, which facilitates squishing together character strings. The glue::glue_sql() function uses the input connection to decide what the appropriate SQL quotation syntax is. The curly brackets tell glue that we want to work with the R variable customers, not the literal string.
# R Chunk
customers <- c('Derek Sonderegger', 'Aubrey Sonderegger')
customer <- glue::glue_sql("{customers*}", .con = con)
/* SQL Chunk */
SELECT * FROM Customers WHERE name IN (?customer)
## PersonID Name Street City State
## 1 1 Derek Sonderegger 231 River Run Flagstaff AZ
## 2 2 Aubrey Sonderegger 231 River Run Flagstaff AZ
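If you'd rather not use a SQL chunk at all, the whole query can be built with glue_sql() and run from R. A sketch of the same query:

```r
# {customers*} collapses the vector into a properly quoted,
# comma-separated list appropriate for this database.
customers <- c('Derek Sonderegger', 'Aubrey Sonderegger')
query <- glue::glue_sql(
  "SELECT * FROM Customers WHERE Name IN ({customers*})",
  .con = con)
DBI::dbGetQuery(con, query)
```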
These SQL statements can be whatever you’d like, and we can happily insert rows into tables as well.
# R Chunk
newCustomer <- c(4, 'Mike Wazowski', '1102 Main St, Apt A113', 'Phoenix', 'AZ')
newCustomer <- glue::glue_sql('{newCustomer*}', .con = con)
/* SQL Chunk */
INSERT INTO Customers (PersonID, Name, Street, City, State)
VALUES(?newCustomer)
And this has happily inserted Mike into our Customers table.
/* SQL Chunk */
SELECT * FROM Customers
## PersonID Name Street City State
## 1 1 Derek Sonderegger 231 River Run Flagstaff AZ
## 2 2 Aubrey Sonderegger 231 River Run Flagstaff AZ
## 3 3 Robert Buscaglia 5638 Daisy Ln Mountainaire AZ
## 4 4 Roy St Laurent 845 Elk View Flagstaff AZ
## 5 5 John Smith 2378 N Main St Flagstaff AZ
## 6 4 Mike Wazowski 1102 Main St, Apt A113 Phoenix AZ
This set of notes does not go into nearly enough detail to explain how the glue package works, and at first blush you might prefer to just paste together a SQL command string and pass it into the DBI::dbExecute() command. One reason not to do that is that SQL syntax can differ slightly between database flavors. This is why the glue_sql() command requires you to also include the database connection: that way it can figure out the correct way to put the command together, and your R code is not dependent on the database used. For reading/writing R questions on forums like StackExchange, it would be obnoxious if the responses varied depending on which database you used.
We won’t utilize this much for the rest of the chapter, but readers should be alerted to the trouble with inserting multiple values into a SQL string and have a reference of where to look for more detailed information and solutions.
19.5 Reading tables using dbplyr
There are a lot of good things about SQL, but for database queries I would really like to pretend that the tables are in memory and use all of my favorite dplyr tools and pipelines. This would mean that I don't have to remember all the weird SQL syntax. However, the database interface dbplyr is ONLY intended for queries and NOT for updating or inserting rows into the tables.
The way this will work is that we will use the previously established database connection con to create a virtual link between the database table and some appropriately named R object.
# connect the database tables to similarly named objects in R
Cards        <- tbl(con, 'Cards')
Customers    <- tbl(con, 'Customers')
Retailers    <- tbl(con, 'Retailers')
Transactions <- tbl(con, 'Transactions')
However, this does NOT download the whole table into R. Instead it grabs only a couple of rows so that we can see the format. Notice that we don’t know how many rows are in the Transactions table.
Transactions %>% head(3)
## # Source: lazy query [?? x 4]
## # Database: sqlite 3.35.2 [/Users/dls354/GitHub/444/data-raw/Transactions.db]
## CardID RetailID DateTime Amount
## <chr> <chr> <chr> <dbl>
## 1 9876768717278723 1 2019-10-01 08:31:23 5.68
## 2 9876765498122734 2 2019-10-01 12:45:45 25.7
## 3 9876768717278723 1 2019-10-02 08:26:31 5.68
# Transactions %>% tail(3) # not supported because we haven't yet downloaded much information.
The guiding principle of dbplyr is to delay as much work as possible, for as long as possible, before actually pulling the data from the database. The rationale is that we spend a great deal of time figuring out what the query should look like, and too often we write a query that accidentally downloads millions of lines of data and slows down our network connection. Instead, dbplyr returns just the first few rows of whatever query we are working on until we finish the pipeline with a collect() command, which causes us to download ALL of the query results and save them as a local data.frame.
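As a small illustration, the grouped total sketched below is translated to SQL and computed entirely by the database; only the handful of summary rows are downloaded when we collect().

```r
# The grouping and summing happen on the database, not in R.
Transactions %>%
  group_by(RetailID) %>%
  summarise(Total = sum(Amount, na.rm = TRUE)) %>%
  collect()
```

Returning to the credit card example, we can lazily build up a whole statement: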
CC_statement <-
  Customers %>%
  filter(Name == 'Derek Sonderegger') %>% select(PersonID) %>%
  left_join(Cards) %>% left_join(Transactions) %>% left_join(Retailers) %>%
  select(DateTime, Name, Amount) %>%
  rename(Retailer = Name)

CC_statement
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.35.2 [/Users/dls354/GitHub/444/data-raw/Transactions.db]
## DateTime Retailer Amount
## <chr> <chr> <dbl>
## 1 2019-10-01 08:31:23 Kickstand Kafe 5.68
## 2 2019-10-02 08:26:31 Kickstand Kafe 5.68
## 3 2019-10-02 08:30:09 Kickstand Kafe 9.23
At this point, we still haven't downloaded all of the rows. Instead, this is still a lazy query. To actually download everything, we'll pipe this into the collect function.
CC_statement %>% collect()
## # A tibble: 3 x 3
## DateTime Retailer Amount
## <chr> <chr> <dbl>
## 1 2019-10-01 08:31:23 Kickstand Kafe 5.68
## 2 2019-10-02 08:26:31 Kickstand Kafe 5.68
## 3 2019-10-02 08:30:09 Kickstand Kafe 9.23
It can be fun to see the SQL code that is being generated.
CC_statement %>% show_query()
## <SQL>
## SELECT `DateTime`, `Name` AS `Retailer`, `Amount`
## FROM (SELECT `PersonID`, `CardID`, `Issue_DateTime`, `Exp_DateTime`, `LHS`.`RetailID` AS `RetailID`, `DateTime`, `Amount`, `Name`, `Street`, `City`, `State`
## FROM (SELECT `PersonID`, `LHS`.`CardID` AS `CardID`, `Issue_DateTime`, `Exp_DateTime`, `RetailID`, `DateTime`, `Amount`
## FROM (SELECT `LHS`.`PersonID` AS `PersonID`, `CardID`, `Issue_DateTime`, `Exp_DateTime`
## FROM (SELECT `PersonID`
## FROM `Customers`
## WHERE (`Name` = 'Derek Sonderegger')) AS `LHS`
## LEFT JOIN `Cards` AS `RHS`
## ON (`LHS`.`PersonID` = `RHS`.`PersonID`)
## ) AS `LHS`
## LEFT JOIN `Transactions` AS `RHS`
## ON (`LHS`.`CardID` = `RHS`.`CardID`)
## ) AS `LHS`
## LEFT JOIN `Retailers` AS `RHS`
## ON (`LHS`.`RetailID` = `RHS`.`RetailID`)
## )
The algorithm used to convert my dplyr statement into a SQL statement doesn't mind nesting SQL statements and isn't the same as what I generated by hand, but it works.
19.6 Programming Best Practices
It is not uncommon for a package to work with one or more databases. There are a number of possible reasons to not hardcode a particular connection. For example, when developing an application, testing of the code should be done on a testing database and NOT the production database. Similarly, we might be developing an application we’ll sell to different customers and you’d want to easily point your code to a different customer database.
- As a result, even if you only expect to connect to a single database, you should always write functions that access a database to have a connection parameter. You could provide a default, though.
Because the database does not always live on the same computer the application is running on, the slowest component will be the transmission time between the application and the database, so minimizing the amount of data being transferred is hugely important.
- When doing queries, always leave as much work as possible on the database. When using dbplyr, that means delaying the collect() command as long as possible. Generally that means doing all the join, filter, and summarise commands first, and then collect() the results. Because mutate commands actually modify the data, which dbplyr won't do, those have to occur after the collect.
#' Get all the transactions for a given customer
#'
#' @param name A string with the customer's first and last name
#' @param conn The database connection to be used
GetTransactions <- function(name, conn){
  Cards        <- tbl(conn, 'Cards')         # Data still lives on the
  Customers    <- tbl(conn, 'Customers')     # database and we haven't
  Retailers    <- tbl(conn, 'Retailers')     # yet transferred any info
  Transactions <- tbl(conn, 'Transactions')

  Customers %>%                                         # Everything is on the database
    filter(Name == name) %>%                            #
    left_join(Cards, by='PersonID') %>%                 #
    select(CardID) %>%                                  #
    left_join(Transactions, by='CardID') %>%            #
    left_join(Retailers, by='RetailID') %>%             #
    select(DateTime, Amount, Name) %>%                  #
    collect() %>%                                       # NOW it is on my application
    mutate(DateTime = lubridate::ymd_hms(DateTime)) %>% # and we can modify
    return()
}
Now you can happily get all the transactions for a given person.
GetTransactions('Derek Sonderegger', con)
## # A tibble: 3 x 3
## DateTime Amount Name
## <dttm> <dbl> <chr>
## 1 2019-10-01 08:31:23 5.68 Kickstand Kafe
## 2 2019-10-02 08:26:31 5.68 Kickstand Kafe
## 3 2019-10-02 08:30:09 9.23 Kickstand Kafe
The last step of a script should be to close the database connection.
# Close our database connection when we are through...
dbDisconnect(con)
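When a connection is opened inside a function, a common pattern is to register the cleanup with on.exit() so the connection is closed even if an error interrupts the work. A sketch; the function name and its dbname argument are hypothetical:

```r
# Open a connection inside the function and guarantee it is closed on exit.
summarise_transactions <- function(dbname){
  conn <- DBI::dbConnect(RSQLite::SQLite(), dbname = dbname)
  on.exit(DBI::dbDisconnect(conn))   # runs even if an error occurs below

  tbl(conn, 'Transactions') %>%
    summarise(Total = sum(Amount, na.rm = TRUE)) %>%
    collect()
}
```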
19.7 Exercises
1. In this exercise, you'll create a database containing the nycflights13 data. Make sure that you've already downloaded the nycflights13 package, but DO NOT load the nycflights13 package.
    a) Create a SQLite database and connect to it using the following code:

        library(dplyr)
        library(dbplyr)
        # Start up a SQL-Lite database with the NYCFlights13 data pre-loaded
        con <- dbplyr::nycflights13_sqlite()
    b) Through the con connection object, create links to the flights and airlines tables using the tbl command. DO NOT import the complete flights table. We want to delay downloading the data from the database as long as possible!
    c) From the flights table, summarize the percent of flights with a departure delayed by more than 10 minutes for each airline. Hint: make a new column of TRUE/FALSE or 1/0 values that indicate if the flight was late. Then take the average of that column. Produce a table that gives the airline name (not the abbreviation) and the percent of flights that are late. Notice you can delay the collect() command till the very end of the calculation, because most databases can create new columns in ad-hoc table views, but these are not saved on the original data table.
    d) Use the dbDisconnect() command to close the connection con.
2. I have created a package that contains information for a hypothetical ski pass database that could be used by AZ Snowbowl. This example is inspired by consulting work that I did with Bridger Bowl just outside of Bozeman, MT. We have 5 tables: Customers, PassTypes, Passes, BlackOutDates, and PatrolIssues. After downloading the package from GitHub, you should read the documentation for each of these tables. Furthermore, there is a function SkiPasses_sqlite() that loads the information into a SQLite database.

        devtools::install_github('dereksonderegger/SkiPasses')
        library(SkiPasses)
        con <- SkiPasses_sqlite()  # You might want to use refresh=TRUE here...
    a) Go to the GitHub repository for this package at https://github.com/dereksonderegger/SkiPasses. Where would you expect the code that creates the data tables to live? Where would the documentation for the data tables be? Where is the documentation and code for the function SkiPasses_sqlite()? Notice that there is a SkiPasses.db file in the inst/extdata/ directory. Poke around the package and check out the code for the SkiPasses_sqlite() function. What does the refresh=TRUE option do?
    b) Run the following code to see where the SQLite database file exists on your computer.

        system.file("extdata", "SkiPasses.db", package = "SkiPasses")

    c) Insert a new row into the Customers data table for yourself. Also insert a row in the Passes table for yourself, getting a ski pass for the 2020-2021 ski season. Be sure you are inserting rows into the database tables and not a local version you saved in your R environment.
    d) Close your database connection and then reopen it. Write a SQL command to verify that you are still in the customer data table.
    e) Create a function that takes a PassID and date and returns either a TRUE or FALSE value indicating if the pass authorizes the customer to ski for the day. Your function should take the database connection, PassID, and Date as input parameters. The function should ONLY access the database tables through the connection! Try to do as much filtering on the database side as possible to reduce the amount of information being transferred to your function. Demonstrate your function working with both authorized and unauthorized cases. Hint: read the documentation of the Passes table to understand when a pass is valid. Your code will need to identify if there is a valid pass (i.e. Date is between Start and Finish) and, assuming there is a valid pass, that Date is not one of the blackout dates for that pass.
    f) Create a graph visualizing the ages of the male and female customers.
3. For this exercise, we'll start a SQLite database and see that the SQLite application stores the data in a very specialized file structure, which usually has a file extension of .db or .sqlite.
    a) Create the SQLite database file in your current working directory using the following:

        con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "TestSQLiteFile.db")
        # Create a table using the iris data
        dbWriteTable(con, 'IRIS', iris)
        dbListTables(con)         # What tables are available...
        dbReadTable(con, "IRIS")  # Spit out the IRIS table
        dbDisconnect(con)         # Close connection

    b) Now check the files in your current working directory: there should now be a TestSQLiteFile.db. The SQLite file structure for data is extremely stable and works across platform types (Unix/Windows, 32/64 bit, big/little endian, etc). As such, it is a good file type choice for storing lots of data in a compact format across different systems (e.g. applications that work on a mobile device vs a computer). While you can open this file using a text editor, you will only see the table declaration of column names and types; the data rows that follow will not be readable.
    c) Close the file connection and then reconnect to the database. Confirm that the IRIS table is still present in the database.
    d) When you knit your Rmarkdown document, you might have trouble if you are initializing the database each time you knit the document. To get around this, you could either run your initialization code once by hand and then comment out the initialization steps, or use the eval=FALSE code chunk option. Comment on the solution you chose.