Practical 5 Databases

5.1 Introduction

5.1.1 1. Overview

This practical is based on Chapter 4 in Comber and Brunsdon (2021). You will need space on your M drive to undertake this practical - one of data files we create is 1GB in size.

Data can be so big that they are difficult to work with on a standard computer: they may not fit on hard drive or disk if they are terabytes in size, even if they are smaller, the computer may struggle to hold them in working memory and in the memory allocated to the program you are using (including R).

For these reasons, we increasingly work with data that is held elsewhere in some kind of database system that we access remotely. Queries are sent to the database and the results are returned.

The advantage of querying databases is that it is only the data we are interested in that is returned to our computer’s working memory. This is usually a (much) smaller subset of the data, which has been manipulated and summarised in some way.

Queries to databases can be used group, summarise, link and manipulate data. Hopefully this statement has made you think that this is similar to piping in dplyr. You are correct: much of the syntax we use in piping can be used to connect to databases and extract data from them.

The key thing is that the data, in its raw unqueried form, is not in the working memory of our computers until we want it to be.

For this practical you will need to install and load the following packages.

packages <- c("sf", "tidyverse", "RSQLite")
# check which packages are not installed
not_installed <- packages[!packages %in% installed.packages()[, "Package"]]
# install missing packages
if (length(not_installed) > 0) {
  install.packages(not_installed, repos = "https://cran.rstudio.com/", dep = T)
}
library(sf)         # for spatial data
library(tidyverse)  # for data wrangling, visualisation and mapping
library(RSQLite)    # for databases
# redefine dplyr select
select <- dplyr::select
# remove unwanted variables
rm(list = c("packages", "not_installed"))

5.1.2 Introduction to Databases

Up until now all of the code snippets in this module have worked with in memory data: the data were loaded into R in some way, either read from an external file or from one of R’s internal datasets. These data have been small enough in size for R to manage them to its internal memory.

However as data get larger, their access times get longer. Additionally, any kind of selection or filtering etc of data rows or columns, requires linear searches of the data, and although these can be quicker if the data are structured in some way (for example by sorting some or all of the columns), there is a trade off between storage and search efficiency. The problem is that sorting columns is expensive as sorted columns (or indexed fields) require additional storage. Thus because of the slowness of reading, writing and searching, much data are held in indexed databases, that are remotely stored (i.e. they are not on your computer or not in your computer’s memory), and are accessed remotely.

The basic idea behind databases (as opposed a data.frame or tibble and other in session data table formats) is that you connect to a database (a local one, one in working memory or a remote one), compile queries that are passed to the database and only the query results are returned.

Databases are a collection of data tables that are accessed through some kind of Database Management Systems or DBMS providing a structure that supports database queries. Databases frequently hold multiple data tables which have some field (attribute) in common supporting relational queries. In these records (observations) in different data tables are related to each other using the field they have in common (e.g. postcode, national insurance number, etc) - see Practical 2 and the section on Two table manipulations in dplyr. This allows data in different tables to be combined, before data are extracted and / or summarised in some way. All of this is done through queries. Queries in this context are specific combinations of instructions to the DBMS in order to retrieve data from the server.

Because we often link or relate data tables in databases they are often referred to as Relational DBMS. An example is shown in Figure \(\ref{fig:f1}\) of the data tables used a in much longer and deeper version of this practical in Chapter 4 of Comber and Brunsdon (2021). Notice how the different tables are linked (or related) by the attributes they have in common.

An example of a relational database, from Chapter 4 in Comber and Brunsdon (2021).

Figure 5.1: An example of a relational database, from Chapter 4 in Comber and Brunsdon (2021).

5.1.3 Databases in R

Relational databases and DBMSs, classically use SQL (Structured Query Language) to retrieve data through queries. SQL has a relatively simple syntax (supported by ISO standards) but complex queries can be difficult to code correctly. For these reasons the team behind dplyr constructed tools and functions that translate to SQL when applied to databases. Workflows combine the various dplyr verbs used for single table and two-table manipulations (Practical 2) and translates them into SQL (Structured Query Language) which is then passed to the DBMS.

You have already done something very similar in the 2nd practical when you wrote some code to join oa_sf and oa2. Remember: do not run this code as we have not loaded any data - it is simply repeated from Practical 2.

inner_join(x = st_drop_geometry(oa_sf), 
           y = oa2, 
           by = c("code" = "OACode")) |> 

In the above code, the join was done by linking variables in the two data tables (oa_sf and oa2) that were joined by the Output Area code attribute that they had in common, although named differently in each data table.

This practical uses SQLite because it is embedded inside an R package (RSQlite) which is automatically loaded with dplyr. This provides a convenient tool for understanding how to manage large datasets because it is completely embedded inside an R package and you do not need to connect to a separate database server. A good introduction to this topic can be found in Horton, Baumer, and Wickham (2015) and the DBI package provides an interface to many different database packages. This allows you to use the same R code, for example with dplyr verbs, to access and connect to a number of back-end database formats including MySQL with the RMySQL package and Postgres with `RPostgreSQL.

In summary:

  • Data are frequently held in databases as big data can need a lot of storage and we may not be able to load it locally in our computers;
  • DBMSs / RDBMSs are used to hold, link and mange collections of databases;
  • We can compile queries using dplyr syntax;
  • Relationships and cardinality (the direction of the relationships) are really important considerations.

Some resources that you may find useful outside of the practical:

5.2 Creating relation databases in R

The basic idea for creating databases is that you define or open a connection from your R / RStudio session to a named database. The database is then populated with data and then you can work with it. The connection should be closed when you finish working with the database.

There a number of different types of databases that can be created:

  • local, in memory databases: these are useful for prototyping and testing. They are temporary and are automatically deleted you disconnect from them.
  • local, on file databases: these are permanent databases but are still held locally (i.e. on your computer).
  • remote, on file databases: these are permanent and held elsewhere (i.e. on another computer!), and require an internet connection to access them. The creation of these is not covered here, but once you are connected to the remote server, the same operations can be undertaken.

5.2.1 Data

You should have downloaded the data file from the VLE for this week. Load the prac5_data.RData file:

load("prac5_data.RData")
ls()
## [1] "ccg"           "ccg_patients"  "prescriptions" "select"

You should see that 3 files are loaded (as well as the redefined select function):

  • a data table called prescriptions. This is month of prescribing data for January 2016 downloaded from here https://data.gov.uk/dataset/prescribing-by-gp-practice-presentation-level and converted to an .RData file;
  • a spatial object in sf format called ccg. This has the Care Commissioning Group (CCG) areas in England - area public health administration areas. These used to be called Primary Care Trusts (PCTs);
  • a data table called ccg_patients with the the number of patients in each CCG/PCT area.

We will use these to populate databases of different kinds

5.2.2 Creating a local, in memory database

Local, in memory databases are useful for illustrating the principles of database creation and of constructing queries. In reality, if the data fit into working memory and can be manipulated and accessed with ease, then it is likely that you do not need to use databases and queries: standard R data formats such the data.frame and tibble and related operations with dplyr will probably suffice.

To work with a database in dplyr, the connection to it needs to be specified. This is done with the DBI::dbConnect() function. This can be interpreted as saying use the Connect() function from the DBI package - i.e. the :: indicates to R in which package the function is (recall that the DBI package is loaded with dbdplyr, which in turn is loaded by dplyr, which is in turn loaded by tidyverse).

The code snippet below defines an in memory database:

db = DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")

The arguments to DBI::dbConnect() varies from database to database, but the first argument is always the database backend. It is RSQLite::SQLite() for RSQLite, RPostgreSQL::PostgreSQL() for RPostgreSQL, etc. The SQLite implementation in R (RSQLite) only needs one other argument: the path to the database. Here we use the special string ":memory:" which causes SQLite to make a temporary in memory database (i.e. in the working memory of R / RStudio).

The database, db has no data in it. To populate the database, the dplyr function copy_to can be used. This uploads data to the database. The code below does this for the prescriptions data table.

copy_to(db, prescriptions, name = "prescripts_db")

In fact copy_to is a wrapper for dbWriteTable in the DBI package and the same operation can be undertaken with the code below (you do not need to run this!):

# do not run!
dbWriteTable(conn = db, name = "prescripts_db", 
             value = prescriptions,
             row.names = FALSE, header = TRUE, overwrite = T)

The tbl function can then be used to make the connection to the data table which can be examined:

tbl(db, "prescripts_db")
## # Source:   table<`prescripts_db`> [?? x 10]
## # Database: sqlite 3.47.1 []
##    SHA   PCT   PRACTICE BNF.CODE   BNF.NAME ITEMS   NIC ACT.COST QUANTITY PERIOD
##    <chr> <chr> <chr>    <chr>      <chr>    <int> <dbl>    <dbl>    <int>  <int>
##  1 Q44   RTV   Y04937   0401010Y0… "Zolpid…     1  2.56     2.38       56 201601
##  2 Q44   RTV   Y04937   0401010Z0… "Zopicl…     1  0.67     0.73       14 201601
##  3 Q44   RTV   Y04937   0401010Z0… "Zopicl…     1  1.41     1.31       28 201601
##  4 Q44   RTV   Y04937   0401020K0… "Diazep…     2  1.72     1.61       56 201601
##  5 Q44   RTV   Y04937   0401020K0… "Diazep…     1  0.68     0.74       21 201601
##  6 Q44   RTV   Y04937   0401020K0… "Diazep…     1  1.52     1.52       42 201601
##  7 Q44   RTV   Y04937   0402010AB… "Quetia…     1  1.74     1.72       84 201601
##  8 Q44   RTV   Y04937   0402010AB… "Quetia…     1 39.6     36.7        21 201601
##  9 Q44   RTV   Y04937   0402010AD… "Aripip…     1 20.2     18.7        28 201601
## 10 Q44   RTV   Y04937   0402010D0… "Chlorp…     1  1.08     1.11       14 201601
## # ℹ more rows

In a similar way the CCG patients data can be added to the database:

copy_to(db, ccg_patients, name = "ccg_pat_db")

And this too can be examined:

tbl(db, "ccg_pat_db")
## # Source:   table<`ccg_pat_db`> [?? x 3]
## # Database: sqlite 3.47.1 []
##    CCGcode CCGnm_s                              ccg.reg.pa
##    <chr>   <chr>                                <chr>     
##  1 00C     Darlington                           107191    
##  2 00D     Durham Dales, Easington & Sedgefield 289529    
##  3 00J     North Durham                         254568    
##  4 00K     Hartlepool & Stockton-On-Tees        294897    
##  5 00L     Northumberland                       322154    
##  6 00M     South Tees                           293801    
##  7 00N     South Tyneside                       155426    
##  8 00P     Sunderland                           283594    
##  9 00Q     Blackburn with Darwen                171623    
## 10 00R     Blackpool                            171758    
## # ℹ more rows

The tbl function is very useful here as it allows us to extract data from the database and to pipe it around (i.e. do things with it). The way that tbl works is to create a reference to the data table in the database (e.g. prescripts_db).

The dplyr piping syntax can be applied in conjunction with tbl to select, filter manipulate summarise and extract data from the database. The example below select records from the prescripts_db data table in db that are in CCG / PCT 00D (the PCT field in the data) and cost more than £100:

tbl(db, "prescripts_db") |>
  filter(PCT == "00D" & ACT.COST > 100) |> 
  select(PCT, BNF.NAME, ACT.COST) 
## # Source:   SQL [?? x 3]
## # Database: sqlite 3.47.1 []
##    PCT   BNF.NAME                                   ACT.COST
##    <chr> <chr>                                         <dbl>
##  1 00D   "Sod Algin/Pot Bicarb_Susp S/F           "     288.
##  2 00D   "Gaviscon Infant_Sach 2g (Dual Pack) S/F "     235.
##  3 00D   "Mebeverine HCl_Tab 135mg                "     253.
##  4 00D   "Lansoprazole_Cap 30mg (E/C Gran)        "     667.
##  5 00D   "Lansoprazole_Cap 15mg (E/C Gran)        "     141.
##  6 00D   "Omeprazole_Cap E/C 20mg                 "     842.
##  7 00D   "Omeprazole_Cap E/C 40mg                 "     114.
##  8 00D   "Mesalazine_Tab E/C 400mg                "     287.
##  9 00D   "Mesalazine_Tab G/R 1.2g M/R             "     173.
## 10 00D   "Asacol_MR Tab E/C 400mg                 "     289.
## # ℹ more rows

A similar syntax can be used work out average prescription costs for different Strategic Health Authorities in sha:

tbl(db, "prescripts_db") |>
  group_by(PCT) |> 
  summarise(mean_cost = mean(ACT.COST, na.rm = T))
## # Source:   SQL [?? x 2]
## # Database: sqlite 3.47.1 []
##    PCT   mean_cost
##    <chr>     <dbl>
##  1 00C        74.7
##  2 00D        78.4
##  3 00J        77.7
##  4 00K        76.4
##  5 00L        77.4
##  6 00M        68.5
##  7 00N        70.8
##  8 00P        66.3
##  9 00Q        62.8
## 10 00R        77.6
## # ℹ more rows

This starts to suggest how queries can be constructed.

When we have finished with any database connection we have to close it using the dbDisconnect function:

dbDisconnect(db)

5.2.3 Creating a local, on file database

Any in memory database ceases to exists when the connection is closed (as above).

By contrast, an on file database is permanent and, after being populated with data, can be connected to in later R sessions. Here the data tables are used create a RSQlite database called prescribing.sqlite. This will contain the same 2 data tables for CCGs and prescriptions. The process is similar to creating an in memory database, but this time a database file is created in your local directory.

In the code below, the first line checks for the existence of prescribing_lite.sqlite and if it exists removes it. The second then creates a connection to the database:

if (file.exists("prescribing.sqlite") == TRUE) 
    file.remove("prescribing.sqlite")
db = DBI::dbConnect(RSQLite::SQLite(), dbname="prescribing.sqlite")

Now if you look in your working folder using Windows Explorer (PC) or Finder (Mac), you will see that an object called prescribing.sqlite has been created. It is empty and has no size because it has not been populated with data yet.

The code below populates the db object with the data tables, using the dbWriteTable function and then closes the connection to the database:

dbWriteTable(conn = db, name = "prescriptions", value = prescriptions, 
             row.names = FALSE, header = TRUE)
dbWriteTable(conn = db, name = "ccg_patients", value = ccg_patients, 
             row.names = FALSE, header = TRUE)

If you check again in your working folder you will see that prescribing_lite.sqlite now has been populated and has a size (~1GB).

You can check what the database contains using different commands form the DBI package:

# tables in the database
dbListTables(db)
## [1] "ccg_patients"  "prescriptions"
# fields in the table
dbListFields(db, "prescriptions")
##  [1] "SHA"      "PCT"      "PRACTICE" "BNF.CODE" "BNF.NAME" "ITEMS"   
##  [7] "NIC"      "ACT.COST" "QUANTITY" "PERIOD"

Again, the dbDisconnect function is used to close the connection to the database:

dbDisconnect(db)

If you enter db at the console you will see that it is disconnected.

To access the data, we need simply to connect to the database, having created it:

db <- DBI::dbConnect(RSQLite::SQLite(), dbname="prescribing.sqlite")

It is possible to query the data tables in the database. For example, the code below summarises the mean cost of prescriptions in each of the CCGs/PCTs, arranges them in descending mean cost order, filters them for at least 100 items and prints the top 10 to the console:

tbl(db, "prescriptions") |> 
  group_by(PCT) |>
  summarise(
    mean_cost = mean(ACT.COST, na.rm = T),
    n = n()
  ) |> 
  ungroup() |>
  arrange(desc(mean_cost)) |>
  filter(n > 100) |> print(n=10)

However, this is working with only one of the data tables in the db database. It is not taking advantage of the layered analysis that is possible by linking data tables, as shown in Figure \(\ref{fig:f1}\).

To link the data tables in the db database we need to specify the links between them, just as we did using the join functions in Practical 2. However, to join in the right way, we need to think about how we wish to construct our queries and what any queries to database tables will return (and therefore the analysis and inference they will support). The DBI package has a number of functions that list data tables and their fields as above. Similarly, tbl in the dplyr package can be used to do these summaries:

tbl(db, "ccg_patients")
colnames(tbl(db, "prescriptions"))

The results are similar to the str, head and summary functions used earlier. However, it is important to restate what the tbl function is doing: it is sending a tbl query to db about the table (e.g. the prescriptions table) in db. As tbl returns the first 10 records this is what is returned from db: i.e. the data returned to the R session is only the result of this query and not the whole of the data table. To emphasise this, the code below compares the size of the query and the size of the prescriptions data table that is loaded into the R session.

format(object.size(tbl(db, "prescriptions")), unit="auto")
## [1] "6 Kb"
format(object.size(prescriptions), unit="auto")
## [1] "463.4 Mb"

Again the database can be closed as follows:

dbDisconnect(db)

5.2.4 Summary

The basic process of creating a database is to first define a connection to a database, then populate it with data, and, when it is populated, to close the connection.

Two kinds of database were illustrated: an in memory one - good for development - and a local, on file database. The procedures for constructing these were essentially the same:

  1. define a connection to a database
  2. populate the database with data
  3. close the database

This sequence can be used to construct much larger databases.

5.3 Database queries

In this section we will explore a number of core database operations using dplyr to construct queries and apply them to the prescribing.sqlite database created above. There are 3 mains groups of operations that are commonly used in database queries, either singly or in combination. Queries specify operations that:

  1. Extract (specifying criteria, logical pattern matching, etc.)
  2. Join (linking different tables)
  3. Summarise (grouping, using summary functions, maintaining fields, creating new fields)

There are many overlaps with dplyr operations, but here these are applied in a database context. In many cases the syntax is exactly the same as those applied to data.frame and tibble formats as in previous practicals. On occasion, however, they need to be adapted for working with databases.

5.3.1 Extracting data from a database

It is possible to extract whole records (rows) and fields (columns), individual rows and columns that match some criteria and individual elements (cells in a data table). The two most commonly used approaches for extracting data are:

  1. By specifying some kind of logical test or conditions that have to be satisfied for the data to be extracted.
  2. By specifying location of the data you wish to extract, for example by using the \(i^th\) row or \(j^th\) column, or variable names.

Logical queries have a TRUE or FALSE answers and use logical operators (e.g. greater than, less than, equals, not equals etc). These have been covered in previous practicals full set of logical operators can be found in the R help (enter ?base::Logic). The other main way of selecting is through some kind of text pattern matching. Both may be used to subset database fields (columns) by filtering and / or database records (rows) by selecting.

A connection has to be made for a database to be queried. Connect to the prescribing.sqlite database you created earlier as before, but this time using a different syntax:

library(RSQLite)
db <- dbConnect(SQLite(), dbname="prescribing.sqlite")

The code below uses filter to extract the prescriptions for a specific flu vaccine via the BNF code (for example see https://openprescribing.net/bnf/) and orders the extracted records by volume (items):

tbl(db, "prescriptions") |> 
  filter(BNF.CODE == "1404000H0AAAFAF") |>
  select(BNF.CODE, BNF.NAME, ITEMS) |>
  distinct()

Here the filter operation was based on a single logical statement using == that returned single exact matches. For multiple exact matches, the %in% function can be used with a vector of names:

tbl(db, "prescriptions") |> 
  filter(BNF.CODE %in% c("1404000H0AAAFAF", "1404000H0AAAJAJ")) |>
  select(BNF.CODE, BNF.NAME) |>
  distinct() 

Now we might want to filter for all the flu vaccine prescriptions, which start with the BNF code 1404000H0. For standard dplyr operations, the str_detect function can be used to filter. However, the stringr pattern matching functions within dplyr, all in the form str_<action>, do not work with databases. Instead %like% can be applied, with carefully positioned wildcards (\%):

# one pattern
tbl(db, "prescriptions") |> 
  filter(BNF.CODE %like% '%1404000H0%')
# multiple patterns
tbl(db, "prescriptions") |> 
  filter(BNF.NAME %like% 'Dermol%' | BNF.NAME %like% 'Influenza_Vac%') |>
  select(BNF.CODE, BNF.NAME) |>
  distinct() 

However, pattern matching in this way needs to be done with extreme care because it undertakes partial matching, whereas filtering using logical approaches with == or %in% will only return exact matches.

Additionally, the wildcards (\%) need to be carefully used. For example, they can be placed at the start and end of the pattern passed to %like%. This means that the filter operation will return any records that have BNF.CODE values containing 1404000H0, not just those that start with 1404000H0.

Notice how the select function was used in the above to return specific fields from a database in a similar way to working data tables loaded into the R session:

tbl(db, "prescriptions") |> 
  filter(BNF.CODE %like% '%1404000H0%') |>
  select(PCT, ACT.COST)

Fields can also be selected using pattern matching, but here a set of tidy matching functions can be applied:

tbl(db, "prescriptions") |> 
  select(starts_with("BNF"))
tbl(db, "prescriptions") |> 
  select(contains("BNF")) 

You should examine the help for these:

?tidyselect::select_helpers

Thus far, all of the results of the code snippets applied to the database have been printed out to the console. No data has been returned to the R session, meaning that all of the analysis has taken place away from the working memory of your computer. This is one of the key advantages of using databases: the queries are all undertaken by outside of your computer’s working memory (as well as the database being separate).

However, if you want to use the results (for example to link to a map, or other data) then the query results need to be collected. The collect() function returns the results. It creates an object (even if it is just printed out). To show this, run the code below which uses the object.size function to evaluate the memory cost to R of running the code snippets. This is a key advantage of working with databases.

# size of the call
tbl(db, "prescriptions") |> object.size()
# size of a longer call
tbl(db, "prescriptions") |> 
  filter(BNF.CODE %like% '%1404%') |>
  arrange(desc(ACT.COST)) |> object.size()
# size of what is returned with collect
tbl(db, "prescriptions") |> 
  filter(BNF.CODE %like% '%1404%') |>
  arrange(desc(ACT.COST)) |> collect() |> object.size()

5.3.2 Joining (linking) database tables

Practical 2 introduced the idea of joins between different tables. Some (but not all) of the two-table dplyr joins can also be applied to tables held in databases.

These can be applied to the data tables in the database in a similar way to working with data tables loaded into an ordinary R session. The code below filters for anti-depressants, summarises the costs over CCGs / PCTs and then joins the result to the patients attribute in the ccg_patients data table, before summarising the costs per 1000 patients in each CCG / PCT area:

# examine the ccg_patients in db
dbListFields(db, "ccg_patients")
## [1] "CCGcode"    "CCGnm_s"    "ccg.reg.pa"
# the query
tbl(db, "prescriptions") |> 
  filter(BNF.CODE %like% '%1404000H0%') |>
  group_by(PCT) |> 
  summarise(PCT_COST = sum(ACT.COST, na.rm = T)) |>
  left_join(tbl(db, "ccg_patients"), by = c("PCT" = "CCGcode")) |>
  mutate(costs_pp = PCT_COST/(ccg.reg.pa/1000))
## # Source:   SQL [?? x 5]
## # Database: sqlite 3.47.1 [/Users/geoaco/Library/CloudStorage/OneDrive-UniversityofLeeds/leeds_work/teaching_24/geog5917/bookdowngeog5917/prescribing.sqlite]
##    PCT   PCT_COST CCGnm_s                              ccg.reg.pa costs_pp
##    <chr>    <dbl> <chr>                                <chr>         <dbl>
##  1 00C      2724. Darlington                           107191         25.4
##  2 00D     11229. Durham Dales, Easington & Sedgefield 289529         38.8
##  3 00J     17615. North Durham                         254568         69.2
##  4 00K     10410. Hartlepool & Stockton-On-Tees        294897         35.3
##  5 00L     15926. Northumberland                       322154         49.4
##  6 00M     20728. South Tees                           293801         70.6
##  7 00N      6069  South Tyneside                       155426         39.0
##  8 00P     19527. Sunderland                           283594         68.9
##  9 00Q     22255. Blackburn with Darwen                171623        130. 
## 10 00R      5143. Blackpool                            171758         29.9
## # ℹ more rows

As well as joining the data the code snippet above shows how dyplr single table verbs can be applied to joined data, after it has been joined! In this case summarising results over a grouping, the PCT / CCG groups.

Note again that without the collect function in the code above, the joined data remains in the database (i.e. is not returned to the console). To map the rates, the output of the query can be collected and linked to the ccg sf spatial layer and then mapped as in Figure \(\ref{fig:f3}\).

# as above but this time collecting the result from the database
tbl(db, "prescriptions") |> 
  filter(BNF.CODE %like% '%1404000H0%') |>
  group_by(PCT) |> 
  summarise(PCT_COST = sum(ACT.COST, na.rm = T)) |>
  left_join(tbl(db, "ccg_patients"), by = c("PCT" = "CCGcode")) |>
  mutate(costs_pp = PCT_COST/(ccg.reg.pa/1000)) |> collect() -> tmp
# pass tmp to ggplot
ccg |> left_join(tmp, by = c("CCGcode"="PCT")) |>   
  ggplot() + 
  geom_sf(aes(fill = costs_pp)) +
  scale_fill_continuous(type = "viridis")
The prescribing of antidepressants for each CCG / PCT.

Figure 5.2: The prescribing of antidepressants for each CCG / PCT.

The summarise() function summarises existing variables by the function that is passed to it. If the data are grouped it will return a summary for each group, otherwise it will summarise over the whole dataset. To illustrate these the code below summarises costs in the prescription table over the whole dataset and then groups these by PCT, and by PCT and month:

# entire dataset 
tbl(db, "prescriptions") |> 
  summarise(total = sum(ACT.COST, na.rm = T))
# grouped by PCT
tbl(db, "prescriptions") |> 
  group_by(PCT) |>
  summarise(total = sum(ACT.COST, na.rm = T)) |>
  arrange(desc(total))
# grouped by PCT and month
tbl(db, "prescriptions") |>
  mutate(month = substr(PERIOD, 5,6)) |>
  group_by(PCT, month) |>
  summarise(total = sum(ACT.COST, na.rm = T)) |>
  arrange(desc(total))

Note a number of things in the code above:

  • the use of substr to extract the month from the PERIOD variable
  • that summarise only returns variables created by its summary operations along with any grouping;
  • that the mutate() function adds new variables for all the rows in the input data table and returns (selects) all variables present when it is applied.

The code below summarises the costs of all prescriptions per practice and uses mutate to calculates the cost per item.

tbl(db, "prescriptions") |> 
  group_by(PRACTICE) |>
  summarise(
    cost = sum(ACT.COST, na.rm = T), 
    n = n()) |>
  mutate(mean_cost = cost/n) |>
  arrange(desc(mean_cost))
## # Source:     SQL [?? x 4]
## # Database:   sqlite 3.47.1 [/Users/geoaco/Library/CloudStorage/OneDrive-UniversityofLeeds/leeds_work/teaching_24/geog5917/bookdowngeog5917/prescribing.sqlite]
## # Ordered by: desc(mean_cost)
##    PRACTICE    cost     n mean_cost
##    <chr>      <dbl> <int>     <dbl>
##  1 Y00215   144237.    25     5769.
##  2 Y01912     5451.     2     2725.
##  3 Y02774    11869.     6     1978.
##  4 Y03873    23257.    23     1011.
##  5 Y04441    77124.    81      952.
##  6 Y05181    23619.    28      844.
##  7 P81791    14467.    18      804.
##  8 Y02901    15532.    20      777.
##  9 Y03641    21463.    28      767.
## 10 Y04878    41367.    54      766.
## # ℹ more rows

This operation can be unpicked by running the query in discrete steps and examining the intermediate outputs.

First the grouping and the summary:

tbl(db, "prescriptions") |> 
  group_by(PRACTICE) |>
  summarise(
    cost = sum(ACT.COST, na.rm = T), 
    n = n())

Then the creation of a new variable with mutate:

tbl(db, "prescriptions") |> 
  group_by(PRACTICE) |>
  summarise(
    cost = sum(ACT.COST, na.rm = T), 
    n = n()) |>
  mutate(mean_cost = cost/n)

Before the final ordering by adding the pipe and arrange(desc(mean_cost)) as above. And the observant amongst you will notice that this is the mean and the function `mean could have been used instead!

tbl(db, "prescriptions") |> 
  group_by(PRACTICE) |>
  summarise(mean_cost = mean(ACT.COST, na.rm = T))|>
  arrange(desc(mean_cost))

5.3.3 Final observations

This section has described the different kinds of operations for constructing database queries using the dplyr syntax. These include:

  • functions for extracting data (rows / columns) from a database based on pattern matching, logical tests and specific positional references;
  • functions for joining or linking database tables together based on some common property or attribute, with consideration of the cardinality of the link, applying the joins first introduced in Practical 2;
  • functions for summarising data, with or without grouping, applying different functions applying the dplyr verbs, also introduced in Practical 2.

These can be used to construct complex queries of filtered records, selected fields and joined database tables. A long and more complex example is in Chapter 4 of Comber and Brunsdon (2021). Using dplyr to do this has a number of advantages, the main one is that dplyr tries to be lazy by never pulling data into R’s memory unless explicitly requested (for example by using the collect function). This is done because dplyr queries create references to the data in the database, and the results that are returned (printed) to the console are just summaries of the query. When collect is used the dplyr query results are retrieved to a local tibble.

However, in some case we may want to pull the data down from the database and perform our database operations. The dbReadTable function does this and returns similar results, but operates in a very different way as indicated by the size of the R objects they create and different times they take to run:

object.size(tbl(db, "prescriptions"))
object.size(dbReadTable(db, "prescriptions"))

One other aspect of dplyr is that it compiles all of the database commands and translates them to SQL before passing them to the database in one step. This can be illustrated by adding show_query() to the code snippet above:

tbl(db, "prescriptions") |> 
  group_by(PRACTICE) |>
  summarise(mean_cost = mean(ACT.COST, na.rm = T))|>
  arrange(desc(mean_cost)) |>
  show_query()
## <SQL>
## SELECT `PRACTICE`, AVG(`ACT.COST`) AS `mean_cost`
## FROM `prescriptions`
## GROUP BY `PRACTICE`
## ORDER BY `mean_cost` DESC

It also possible to pass SQL code directly to the database using the dbGetQuery function in the DBI package:

dbGetQuery(db, 
  "SELECT `PRACTICE`, AVG(`ACT.COST`) AS `mean_cost`
  FROM `prescriptions`
  GROUP BY `PRACTICE`
  ORDER BY `mean_cost` DESC") |> as_tibble()

SQL is a very powerful language with a standard syntax. Further information on constructing SQL queries can be found at https://www.sqlite.org/queryplanner.html

The database can be closed as follows:

dbDisconnect(db)

5.4 Summary

This practical has illustrated how database queries can be constructed using the dplyr verbs and joins to wrangle data held in different database tables. These can be used to construct complex queries that integrate and pull data from different tables in databases. In some instances some of the dplyr functions required workarounds, particularly for pattern matching. A key point is that dplyr tries to be as lazy as possible by never pulling data into R unless explicitly requested. In effect all of the dplyr commands are compiled and translated into SQL before they are passed to the database in one step. In this schema, the results of queries (e.g. tbl) create references to the data in the database and the results that are returned and printed to the console are just summaries of the query - the data remain on the database. The dplyr query results can be returned using collect, which retrieves data to a local tibble. The SQL created by dplyr can be examined with the show_query() function.

The examples here used in memory and local on file databases. Of course they connections to remote databases can be made and the dplyr team provide a hypothetical example of the syntax for doing that (see https://db.rstudio.com/dplyr/) - do not run this code!

# do not run this code
con <- DBI::dbConnect(RMySQL::MySQL(), 
  host = "database.rstudio.com",
  user = "hadley",
  password = rstudioapi::askForPassword("Database password")
)

The dplyr and related packages such as stringr, many of which are loaded with the tidyverse package, contain many functions for manipulating and linking data tables. These wrap SQL and other functions such as pattern matching into a standard format to support data science. However, there are still some gaps in what they can do. In such cases, a bit creative thought can usually overcome the problem.

Comber, Lex, and Chris Brunsdon. 2021. Geographical Data Science and Spatial Data Analysis: An Introduction in r. Sage.
Horton, Nicholas J, Benjamin S Baumer, and Hadley Wickham. 2015. “Setting the Stage for Data Science: Integration of Data Management Skills in Introductory and Second Courses in Statistics.” arXiv Preprint arXiv:1502.00318.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. " O’Reilly Media, Inc.".

References

Comber, Lex, and Chris Brunsdon. 2021. Geographical Data Science and Spatial Data Analysis: An Introduction in r. Sage.
Horton, Nicholas J, Benjamin S Baumer, and Hadley Wickham. 2015. “Setting the Stage for Data Science: Integration of Data Management Skills in Introductory and Second Courses in Statistics.” arXiv Preprint arXiv:1502.00318.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. " O’Reilly Media, Inc.".