2.4 Querying databases in SQL

So far, we have dealt with small data sets that easily fit into your computer’s memory. But what about data sets that are too large for your computer to handle as a whole?

Most data used in class examples are stored in simple flat files (commonly saved in .csv or .xlsx format). Hence, data are organized and stored in a simple file system:

transaction_id customer_id date product_name quantity price amount
001
002

Commonly, academic researchers gain data through observational sampling or experiments, and statistical/government agencies still collect data mostly through off-line means (e.g., survey, reports, or census).

On the other hand, industry and services often get their data on-line from business processes, i.e., from logistical, production, and administrative transactions. For example, sales data may be collected and updated directly from point-of-sales (POS) devices, then stored into some “database” or “datawarehouse”.

Definition 2.5 A database is an organized collection of structured information, or data, typically stored electronically in a computer system.

A database is usually controlled by a database management system (DBMS). Together, the data and the DBMS, along with the applications that are associated with them, are referred to as a database system, often shortened to just database.

When doing analysis from a database, you may only want to extract a part of it, because your computer may not be able to handle if you extract the whole database.

Definition 2.6 Structured Query Language (SQL) is a domain-specific and the standard language used to manage data, especially in a relational database management system.

SQL is used to interact (or “talk”) with relational databases such as SQLite, MySQL, PostgreSQL, Oracle, and Sysbase. This is optimized for certain data arrangements.

When working with large datasets, integrating SQL with R provides an efficient workflow for data manipulation, extraction, and analysis.

2.4.1 Connecting to a Database

There are several ways of running SQL in R. All methods require to connect to a database management system (e.g., MySQL).

However, we do not have access to existing databases. Hence, we shall set-up a simple database in our local machines. That way, we can practice SQL in RStudio without connecting to a formal database.

To set-up an sqlite database in memory, you need to install the RSQLite package.

library(DBI)
library(RSQLite)

To set-up a connection to a database, you need the dbConnect() function from DBI package. In this case, we use dbConnect() to set-up a database in memory.

  • First, download the database bank.sqlite to your working directory.

  • Now, we connect to SQLite database, which is in our local device.

    bank <- dbConnect(RSQLite::SQLite(), "bank.sqlite")
  • This database contains 4 tables:

    dbListTables(bank)
    ## [1] "cards"        "customers"    "menu"         "transactions"
    • customers: customer information

    • cards: a list of card ids, customer id of the owner, and remaining balance in the card. A customer may own multiple brand cards.

    • transactions: a list of the items bought in a certain transaction (order). There’s a unique id for each transaction. Take note, transactions require a brand card (which is a reloadable debit card). There may be multiple rows for a single transaction here.

    • menu: a list of item codes, corresponding item names, general category, and price

In R Markdown, you may insert an SQL code chunk.

In every SQL code chunk, the connection must be specified. Set connection = bank for our examples.

Now, let’s try extracting all columns from the customer table from this database.

SELECT * FROM customers
ABCDEFGHIJ0123456789
CustomerId
<int>
surname
<chr>
middle
<chr>
given
<chr>
sex
<fct>
age
<int>
bday
<chr>
est_income
<dbl>
15565701FerriJSusanFemale391978-06-0763000
15565706AkobunduIDavidMale351982-06-1456000
15565714CattaneoSJosephMale471970-06-0573000
15565779KentNBettyFemale301987-05-18165000
15565796DochertyODonaldMale481969-05-2727000
15565806TooseySDonaldMale381979-06-2635000
15565878BatesGMichaelMaleNANA167000
15565879RileyPLisaFemale281989-06-1736000
15565891DipietroHEdwardMale391978-05-2739000
15565996ArnoldGJohnMale441973-06-09117000


Cleaning up

To end a connection from a database, use the R function dbDisconnect() from the DBI package.

dbDisconnect(bank)

In large companies, many users may access a database stored in a cloud server at a single time. It is important that you disconnect from the database especially if the database has connection limits and to avoid possible data corruption.

2.4.2 Simple SQL Queries

There are a large number of SQL major commands, one of which is an SQL query.

A query is a question or inquiry about a set of data. For example, “Tell me how many books there are on computer programming” or “How many Rolling Stones albums were produced before 1980?”.

Generally, it has the following form:

SELECT columns or computations
    FROM table
    WHERE condition
    GROUP BY columns
    HAVING condition
    ORDER BY column [ASC | DESC]
    LIMIT offset,count;

Some notes:

  • The SELECT statement should have at least the SELECT clause and FROM clause. All other clauses are optional.

  • The order of the clauses within the SELECT query does matter. Do not switch the order of SELECT, FROM, WHERE, GROUP BY, and ORDER BY clauses.

  • The language is not case-sensitive, but let us use upper case for keywords.

In the following examples, make sure that you are still connected to the sqlite database, and you write your codes on SQL code chunks.

bank <- dbConnect(RSQLite::SQLite(), "bank.sqlite")

SELECT and FROM

For many of the modern uses of databases, all you’ll need to do with the database is to select some subset of the variables and/or observations from a table, and let some other program manipulate them.

In SQL the, SELECT statement is the workhorse for these operations.

FROM is used to specify the table to query.

SELECT * FROM customers
ABCDEFGHIJ0123456789
CustomerId
<int>
surname
<chr>
middle
<chr>
given
<chr>
sex
<fct>
age
<int>
bday
<chr>
est_income
<dbl>
15565701FerriJSusanFemale391978-06-0763000
15565706AkobunduIDavidMale351982-06-1456000
15565714CattaneoSJosephMale471970-06-0573000
15565779KentNBettyFemale301987-05-18165000
15565796DochertyODonaldMale481969-05-2727000
15565806TooseySDonaldMale381979-06-2635000
15565878BatesGMichaelMaleNANA167000
15565879RileyPLisaFemale281989-06-1736000
15565891DipietroHEdwardMale391978-05-2739000
15565996ArnoldGJohnMale441973-06-09117000

The asterisk * after SELECT means that you are extracting all columns.

You can also select specific columns only, each separated by comma ,.

SELECT TransID, item
    FROM transactions
ABCDEFGHIJ0123456789
TransId
<chr>
item
<chr>
10000115634602BC
10000115634602FT
10000215634602Cap
10000215634602MS
10000315634602Cap
10000315634602FT
10000415634602Cap
10000515634602GTF
10000515634602CCF
10000515634602DDS

WHERE

Conditional statements can be added via WHERE to filter possible output.

Show all female customers
SELECT * 
    FROM customers
    WHERE Sex = "Female"
ABCDEFGHIJ0123456789
CustomerId
<int>
surname
<chr>
middle
<chr>
given
<chr>
sex
<fct>
age
<int>
bday
<chr>
est_income
<dbl>
15565701FerriJSusanFemale391978-06-0763000
15565779KentNBettyFemale301987-05-18165000
15565879RileyPLisaFemale281989-06-1736000
15566091ThomsenMSharonFemale321985-05-0952000
15566139Ts'uiCSharonFemale371980-06-0234000
15566156FranklinSMargaretFemale441973-06-16119000
15566211HsuKPatriciaFemale411976-05-305000
15566251FerrariRSharonFemale371980-06-1864000
15566295SandersBJenniferFemale331984-05-08105000
15566312JollyGSharonFemale421975-06-08152000

Both AND and OR are valid, along with parentheses to affect order of operations.

Extract female aged 50 above, together with all male customers
SELECT * 
    FROM customers
    WHERE (sex = 'Female' AND age >= 50) 
        OR sex = 'Male'
ABCDEFGHIJ0123456789
CustomerId
<int>
surname
<chr>
middle
<chr>
given
<chr>
sex
<fct>
age
<int>
bday
<chr>
est_income
<dbl>
15565706AkobunduIDavidMale351982-06-1456000
15565714CattaneoSJosephMale471970-06-0573000
15565796DochertyODonaldMale481969-05-2727000
15565806TooseySDonaldMale381979-06-2635000
15565878BatesGMichaelMaleNANA167000
15565891DipietroHEdwardMale391978-05-2739000
15565996ArnoldGJohnMale441973-06-09117000
15566030TuHCharlesMale411976-05-2759000
15566111EstesTRichardMale391978-05-1256000
15566253ManningKRobertMale441973-05-14129000

ORDER BY

To order variables, use the syntax:

ORDER BY var1 {ASC/DESC}, var2 {ASC/DESC}

where the choice of ASC for ascending or DESC for descending is made per variable.

Sort items in menu from most expensive to least expensive
SELECT * FROM menu
    ORDER BY Price DESC
ABCDEFGHIJ0123456789
ItemCode
<chr>
MenuItem
<chr>
Category
<chr>
Price
<dbl>
CarMCaramel MacchiatoHot Drink170
ICAIced Cafe AmericanoCold Drink170
DDSDouble Decker Sandwich (Tuna or Chicken)Bread or Sandwich170
GTFGreen Tea FrappeCold Drink160
CCFChocolate Chip FrappeCold Drink160
CapCappuccinoHot Drink150
DMFDark Mocha FrappeCold Drink150
ICarMIced Caramel MacchiatoCold Drink140
CACafe AmericanoHot Drink130
SHCSignature Hot ChocolateHot Drink130

LIMIT

To control the number of results returned, use LIMIT #.

Show top 10 richest customers”
SELECT * FROM customers
    ORDER BY est_income DESC
    LIMIT 10
ABCDEFGHIJ0123456789
CustomerId
<int>
surname
<chr>
middle
<chr>
given
<chr>
sex
<fct>
age
<int>
bday
<chr>
est_income
<dbl>
15569274PisanoFJosephMale491968-05-28180000
15616454DavidsonQDonnaFemale341983-05-28180000
15641582ChibugoGPaulMale431974-06-10180000
15657957HughesBBarbaraFemale261991-06-06180000
15670172PadovesiBPatriciaFemale301987-06-17180000
15713621MollisonQPaulMale411976-06-15180000
15741719DeRoseHNancyFemale401977-05-23180000
15773487ConwayPMargaretFemale311986-06-10180000
15796413GreenTBrianMale461971-06-12180000
15651983FangNHelenFemale561961-05-25179000

Aggregate Functions

An aggregate function is a function that performs a calculation on a set of values, and returns a single value.

The most commonly used SQL aggregate functions are:

  • MIN() - returns the smallest value within the selected column

  • MAX() - returns the largest value within the selected column

  • COUNT() - returns the number of rows in a set

  • SUM() - returns the total sum of a numerical column

  • AVG() - returns the average value of a numerical column

Aggregate functions ignore null values (except for COUNT()).

SELECT COUNT() returns the number of observations.

SELECT COUNT() AS n_transactions 
    FROM transactions
ABCDEFGHIJ0123456789
n_transactions
<int>
310970
Show number of customers that has age
SELECT COUNT() AS n_users, COUNT(age)
    FROM customers
ABCDEFGHIJ0123456789
n_users
<int>
COUNT(age)
<int>
96859637

Aggregate functions are often used with the GROUP BY clause of the SELECT statement. The GROUP BY clause splits the result-set into groups of values and the aggregate function can be used to return a single value for each group.

Show average income per sex
SELECT sex, AVG(est_income) AS Avg_Income
    FROM customers
    GROUP BY sex
ABCDEFGHIJ0123456789
sex
<fct>
Avg_Income
<dbl>
Female77857.31
Male77236.72

The HAVING clause filters a table after aggregation.

Show total count of transaction per items, but only those having total sales greater than 20k.
SELECT item, SUM(count) AS total_sales
    FROM transactions
    GROUP BY item
    HAVING total_sales > 20000
ABCDEFGHIJ0123456789
item
<chr>
total_sales
<int>
CA21188
ICarM21682
MC21522

This is only the tip of the SQL iceberg. There are far more advanced commands available, such as the following:

  • INSERT: Add new records
  • UPDATE: Modify existing data
  • DELETE: Remove records
  • JOIN: Combine tables

If the database is large enough that you cannot store the entire dataset on your computer, you may need to learn more commands.

2.4.3 Query from Multiple Tables

Recall that the tables in the bank database are connected by some “key variables”.

The transactions table has TransId, CardId, item, and count, but the name of the items cannot be found in this table.

The following is an example query from 2 tables.

SELECT transactions.TransID, transactions.item, menu.MenuItem
    FROM transactions LEFT JOIN menu 
            ON transactions.item = menu.ItemCode
Table 2.1: Displaying records 1 - 10
TransId item MenuItem
10000115634602 BC Brewed Coffee
10000115634602 FT French Toast
10000215634602 Cap Cappuccino
10000215634602 MS Muffin Sandwich (Bacon and Cheese)
10000315634602 Cap Cappuccino
10000315634602 FT French Toast
10000415634602 Cap Cappuccino
10000515634602 GTF Green Tea Frappe
10000515634602 CCF Chocolate Chip Frappe
10000515634602 DDS Double Decker Sandwich (Tuna or Chicken)

We use tableName.columnName to get a column from a specific table.

The LEFT JOIN keyword returns all records from the left table (table1), and the matching records from the right table (table2). The result is 0 records from the right side, if there is no match.

SELECT column_name(s)
    FROM table1
        LEFT JOIN table2
            ON table1.column_name = table2.column_name;

Now, what if we want to find the total number of items sold per item?

Menu items can be found on the menu table, but the number of items sold per transaction can be found on the transactions table.

SELECT  menu.MenuItem, 
        SUM(transactions.count) AS total_sold
    FROM transactions LEFT JOIN menu 
        ON transactions.item = menu.ItemCode
    GROUP BY transactions.item
Table 2.2: Displaying records 1 - 10
MenuItem total_sold
Banana Bread 8275
Brewed Coffee 18880
Blueberry Cheesecake 14941
Banoffee Pie 16166
Cafe Americano 21188
Chocolate Chip Frappe 13251
Classic Sandwich (Tuna or Chicken) 14693
Cappuccino 14628
Caramel Macchiato 12208
Double Decker Sandwich (Tuna or Chicken) 13501

There are other types of JOIN clauses:

  • JOIN: Returns records that have matching values in both tables. Also called an inner join.

  • LEFT JOIN: Returns all records from the left table, and the matched records from the right table. This is also recommended if there is many-to-one correspondence between the left and right tables.

  • RIGHT JOIN: Returns all records from the right table, and the matched records from the left table. This is also recommended if there is one-to-many correspondence between the left and right tables.

  • FULL JOIN: Returns all records when there is a match in either left or right table.

2.4.4 Converting SQL Query to Dataframe

After an SQL Query, they are still not directly useful if we want to perform modelling in R.

The dbGetQuery converts an SQL statement into a dataframe.

customers <- dbGetQuery(conn = bank,
                        "SELECT * 
                        FROM customers")

The return value is a dataframe, where you can now perform any R procedures.

Additional task: convert all tables in bank to dataframes.

Practice Exercises

  1. Easy

    Show a table containing number of items per transaction ID.

  2. Average

    Recall that a customer may own multiple cards. Show a table with name of all card holders, and another column for the total balance of each customer for all their owned cards. Limit to top 5 customers only.

  3. Hard

    What food is commonly paired with “Capuccino”? Show top 3 food items with respective number of transactions.