5 SQL: Retrieve data from a single table
In this chapter, we will introduce basic SQL queries to retrieve data stored in a single table of a database. For our analysis, we will use the schema music_marketplace
that describes a hypothetical marketplace for music. The schema includes the following tables:
songs
: stores information about songs.label
: stores information about music labels.artist
: stores information about artists.artist_songs
: stores which songs are produced by which artists.charts
: stores the maximum rank that a songs achieves in the charts (the hypothetical marketplace has top-selling charts).
The Entity Relationship diagram of this schema is presented below:
Recall that the term
schema
is synonymous to the termdatabasde
.
First, we will connect to the database (Section 4.1):
library(DBI)
library(odbc)
library(RODBC)
library(tidyverse)
= DBI::dbConnect(odbc::odbc(),
con Driver= "MySQL",
Server='mysql-isys3350.bc.edu',
UID='your_username',
PWD= 'your_password',
Port= 3306)
5.1 The SELECT
clause
The most basic function of a database is that it allows us to retrieve data stored in tables.
In particular, to retrieve data from a given table we can use the SELECT
clause. The basic syntax is the following:
SELECT columns_to_select
FROM database_name.table_name
WHERE search_condition] /*optional*/
[ORDER BY column_to_order] /*optional*/
[LIMIT rows_to_return] /*optional*/
[ ;
The clauses in brackets are optional and can be omitted.
In SQL a comment is surrounded by
/*
and*/
.
;
.
Assume that we want to retrieve columns title
, label
, and genre
(see Figure 5.1) of each song in the table songs
of our focal database music_marketplace
. Based on the above syntax, we write:
SELECT title, label, genre FROM music_marketplace.songs;
title | label | genre |
---|---|---|
If I Want You | Tilth Music | Breaks |
Suki | Phonocult Records | Minimal / Deep Tech |
Goofy's Nightmare | Phonocult Records | Minimal / Deep Tech |
Suki | Phonocult Records | Minimal / Deep Tech |
Golden Nightfall | Lip Recordings | House |
Hope For The Right | Lip Recordings | House |
The Night's Already Started | Lip Recordings | House |
Yungle | Malicious Damage | Drum & Bass |
Sail | Malicious Damage | Electronica / Downtempo |
Chromatik [Orbzone] | Malicious Damage | Electronica / Downtempo |
To run SQL commands within an R Mardkown file you will need to create an SQL chunk by typing ```{sql connection=con}
and then close it with three left quotes ```
.
Insert a new code chunk
and choose SQL
.
Note that the above query displays only the first 10 rows (records) of the table and not the complete table. When running SQL queries from within RStudio the results will usually limited by default to 1000 rows. See Section @ref{dbR} on how to overcome this limitation.
5.1.1 The LIMIT
clause
Often, we want to get only a glimpse of a table, so we can use the optional keyword LIMIT
:
SELECT title, label, genre
FROM music_marketplace.songs
LIMIT 5;
title | label | genre |
---|---|---|
If I Want You | Tilth Music | Breaks |
Suki | Phonocult Records | Minimal / Deep Tech |
Goofy's Nightmare | Phonocult Records | Minimal / Deep Tech |
Suki | Phonocult Records | Minimal / Deep Tech |
Golden Nightfall | Lip Recordings | House |
LIMIT
5 above returns only the first 5 rows of the table.
5.1.2 The WHERE
clause
The optional WHERE
clause allows us to filter the results (similar to the tidyverse::filter()
command). For instance, we can select only songs of genre Breaks
:
SELECT title, label, genre
FROM music_marketplace.songs
WHERE genre = 'Breaks'
LIMIT 5;
title | label | genre |
---|---|---|
If I Want You | Tilth Music | Breaks |
Sea | eight:fx | Breaks |
Crazy | Bombtraxx | Breaks |
Going Going Gone | Jalapeno Records | Breaks |
Now I'm Livin' for Me | Jalapeno Records | Breaks |
Note that in SQL, the equality test is identified by a single
=
.
5.1.3 The ORDER BY
clause
If we want the result to be ordered, we can use the optional ORDER BY
clause:
SELECT title, label, genre
FROM music_marketplace.songs
WHERE genre = 'Breaks'
ORDER BY label
LIMIT 10;
title | label | genre |
---|---|---|
Light Shot | 24/7 Records | Breaks |
Hold Me Up | 300 Entertainment | Breaks |
Free Will feat. Neil Kramer | Activated Recordings | Breaks |
Lucy | Activated Recordings | Breaks |
The 13th Dimension | Activated Recordings | Breaks |
Galactic Zipline | Activated Recordings | Breaks |
In The Jam | AKIESOUNDS | Breaks |
Dark Side | AKIESOUNDS | Breaks |
Zone H | AKIESOUNDS | Breaks |
Rocket Fuel | Always Never | Breaks |
And of course we can order in decreasing order with the keyword DESC
:
SELECT title, label, genre
FROM music_marketplace.songs
WHERE genre = 'Breaks'
ORDER BY label DESC
LIMIT 10;
title | label | genre |
---|---|---|
Yin Yang | Zone Records | Breaks |
Numero | Zone Records | Breaks |
Rainmaker | Zone Records | Breaks |
The Zapper | Zone Records | Breaks |
In Ya Trunk | Zone Records | Breaks |
Trekster | Zone Records | Breaks |
Let's Toast | Zone Records | Breaks |
This Road Feat. Marlon Deja | Zone Records | Breaks |
Dem Gals | Zone Records | Breaks |
Gauge | Zone Records | Breaks |
5.2 Manipulation of columns
Often, we want to retrieve all the columns of a table. The special keyword *
allows us to do so as follows:
SELECT *
FROM music_marketplace.songs
WHERE genre = 'Breaks'
ORDER BY label
LIMIT 5;
title | subtitle | trackid | label | genre | released | labelId | price |
---|---|---|---|---|---|---|---|
Light Shot | Original Mix | 9945407 | 24/7 Records | Breaks | 2015-04-06 | 3765 | 1.49 |
Hold Me Up | Mike Delinquent Remix | 7357822 | 300 Entertainment | Breaks | 2015-11-20 | 53217 | 1.49 |
Free Will feat. Neil Kramer | Original Mix | 6548413 | Activated Recordings | Breaks | 2015-04-14 | 12018 | 1.49 |
Lucy | Original Mix | 6548415 | Activated Recordings | Breaks | 2015-04-14 | 12018 | 1.49 |
The 13th Dimension | Original Mix | 6548417 | Activated Recordings | Breaks | 2015-04-14 | 12018 | 1.49 |
In addition, we can rename the query columns with the keyword AS
:
SELECT label AS 'Label', genre AS 'Genre'
FROM music_marketplace.songs
WHERE genre = 'Breaks'
ORDER BY label
LIMIT 10;
Label | Genre |
---|---|
24/7 Records | Breaks |
300 Entertainment | Breaks |
Activated Recordings | Breaks |
Activated Recordings | Breaks |
Activated Recordings | Breaks |
Activated Recordings | Breaks |
AKIESOUNDS | Breaks |
AKIESOUNDS | Breaks |
AKIESOUNDS | Breaks |
Always Never | Breaks |
And we can remove duplicates with the special keyword DISTINCT
:
SELECT DISTINCT label as 'Label', genre as 'Genre'
FROM music_marketplace.songs
WHERE genre = 'Breaks'
ORDER BY label
LIMIT 10;
Label | Genre |
---|---|
24/7 Records | Breaks |
300 Entertainment | Breaks |
Activated Recordings | Breaks |
AKIESOUNDS | Breaks |
Always Never | Breaks |
Animal Language | Breaks |
Arms & Legs | Breaks |
Artist Intelligence Agency | Breaks |
ARVA | Breaks |
Audio Esperanto | Breaks |
5.3 Arithmetic operators
Within a SELECT
clause we can perform arithmetic operations between columns. SQL supports the following:
*
multiplication (same as inR
)/
division (same as inR
)+
addition (same as inR
)-
subtraction (same as inR
)DIV
integer division (inR
is the sequence%/%
)%
module (inR
is the sequence%%
)
To showcase, we will use the table charts
from database music_marketplace
(see Figure 5.1):
SELECT * FROM music_marketplace.charts LIMIT 10;
trackId | maxRank | durationInCharts | firstDateInCharts | lastDateInCharts |
---|---|---|---|---|
887832 | 95 | 0 | 2016-01-17 16:38:32 | 2016-01-17 16:38:32 |
907738 | 42 | 20 | 2015-08-16 12:21:35 | 2015-09-05 15:43:31 |
3711813 | 93 | 3 | 2015-12-27 16:38:57 | 2015-12-30 16:39:06 |
4423896 | 56 | 301 | 2015-06-27 16:23:18 | 2016-04-23 16:39:36 |
4927722 | 57 | 266 | 2015-03-29 16:22:17 | 2015-12-20 16:38:10 |
4927724 | 20 | 276 | 2015-03-08 00:13:31 | 2015-12-09 16:38:08 |
4969569 | 62 | 48 | 2016-02-12 16:40:14 | 2016-03-31 16:39:52 |
4969576 | 60 | 6 | 2016-03-25 16:40:03 | 2016-03-31 16:39:52 |
6065153 | 60 | 7 | 2015-03-14 16:22:47 | 2015-03-21 16:22:52 |
6079233 | 35 | 7 | 2015-04-11 16:26:24 | 2015-04-18 16:26:23 |
Now we can perform some arithmetic operations on its numerical columns:
SELECT maxRank, maxRank * 2 as 'maxRankSQR', maxRank / durationInCharts, maxRank DIV durationInCharts
FROM music_marketplace.charts LIMIT 10;
maxRank | maxRankSQR | maxRank / durationInCharts | maxRank DIV durationInCharts |
---|---|---|---|
95 | 190 | NA | NA |
42 | 84 | 2.1000 | 2 |
93 | 186 | 31.0000 | 31 |
56 | 112 | 0.1860 | 0 |
57 | 114 | 0.2143 | 0 |
20 | 40 | 0.0725 | 0 |
62 | 124 | 1.2917 | 1 |
60 | 120 | 10.0000 | 10 |
60 | 120 | 8.5714 | 8 |
35 | 70 | 5.0000 | 5 |
Note that if we do not assign names to the results of an operation SQL assumes that the column name is the opration itself (e.g.,
maxRank / durationInCharts
andmaxRank DIV durationInCharts
in the above output).
5.4 Concatinatination of columns
We can concatenate columns with the function CONCAT_WS (separtor, column1, column2, ...)
, similar to how we use the function paste()
in R (see Section (customFunctions)):
SELECT CONCAT_WS(' -> ',title, subtitle)
FROM music_marketplace.songs LIMIT 10;
CONCAT_WS(’ -> ’,title, subtitle) |
---|
If I Want You -> Retroid Remix |
Suki -> Original Mix |
Goofy's Nightmare -> Original Mix |
Suki -> Seph Remix |
Golden Nightfall -> Muzzaik Remix |
Hope For The Right -> Original Mix |
The Night's Already Started -> Belocca Dark Vox |
Yungle -> Original Mix |
Sail -> Original Mix |
Chromatik [Orbzone] -> Original Mix |
In the above output we can see that the contents of column title
are followed by the separator ->
and then by the contents of column subtitle
.
5.5 Logic in SQL
The WHERE
clause allows us to perform logical operations (Section 1.7). SQL supports the following operators:
Operator in SQL | Description | Operator in R |
---|---|---|
< | Less than | Same |
<= | Less than or equal to | Same |
> | Greater than | Same |
>= | Greater than or equal to | Same |
= | Exactly equal to | == |
!= | Not equal to | Same |
NOT | Not (negation) | ! |
AND | Logical AND | & |
OR | Logical OR | | |
To see an example, let us retrieve only the songs that are either House
or Techno
:
SELECT title, genre
FROM music_marketplace.songs
WHERE genre = 'House' OR genre = 'Techno'
LIMIT 5;
title | genre |
---|---|
Golden Nightfall | House |
Hope For The Right | House |
The Night's Already Started | House |
No Mo | Techno |
Pillow | Techno |
Instead of having multiple OR
conditions like before, we can alternatively use the special operator IN
as follows:
SELECT title, genre
FROM music_marketplace.songs
WHERE genre IN ('House','Techno')
LIMIT 5;
title | genre |
---|---|
Golden Nightfall | House |
Hope For The Right | House |
The Night's Already Started | House |
No Mo | Techno |
Pillow | Techno |
R also offers similar functionality with the
%in%
operator. See Section 5.6.3 for details.
Now let us retrieve songs that are not of genre House
:
SELECT title, genre
FROM music_marketplace.songs
WHERE NOT genre = 'House'
LIMIT 10;
title | genre |
---|---|
If I Want You | Breaks |
Suki | Minimal / Deep Tech |
Goofy's Nightmare | Minimal / Deep Tech |
Suki | Minimal / Deep Tech |
Yungle | Drum & Bass |
Sail | Electronica / Downtempo |
Chromatik [Orbzone] | Electronica / Downtempo |
Eurofen | Electronica / Downtempo |
Sun Of | Electronica / Downtempo |
73.5 feat. nitraM | Electronica / Downtempo |
Alternatively, we can use the !=
operator:
SELECT title, genre
FROM music_marketplace.songs
WHERE genre != 'House'
LIMIT 10;
title | genre |
---|---|
If I Want You | Breaks |
Suki | Minimal / Deep Tech |
Goofy's Nightmare | Minimal / Deep Tech |
Suki | Minimal / Deep Tech |
Yungle | Drum & Bass |
Sail | Electronica / Downtempo |
Chromatik [Orbzone] | Electronica / Downtempo |
Eurofen | Electronica / Downtempo |
Sun Of | Electronica / Downtempo |
73.5 feat. nitraM | Electronica / Downtempo |
What if we wanted to retrieve the songs that ranked between \(10^{th}\) and \(20^{th}\) place in the charts? We can use the logical and operator:
SELECT *
FROM music_marketplace.charts
WHERE maxRank >= 10 AND maxRank <= 20
LIMIT 10;
trackId | maxRank | durationInCharts | firstDateInCharts | lastDateInCharts |
---|---|---|---|---|
4927724 | 20 | 276 | 2015-03-08 00:13:31 | 2015-12-09 16:38:08 |
6317653 | 20 | 34 | 2015-03-14 16:26:10 | 2015-04-17 16:26:25 |
6318411 | 12 | 28 | 2015-03-17 16:27:18 | 2015-04-14 16:27:04 |
6330179 | 20 | 33 | 2015-03-27 16:22:36 | 2015-04-29 16:22:40 |
6337094 | 15 | 15 | 2015-03-11 16:25:39 | 2015-03-26 16:25:53 |
6346768 | 18 | 409 | 2015-03-11 16:22:39 | 2016-04-23 16:38:21 |
6347234 | 10 | 22 | 2015-03-19 16:25:57 | 2015-04-10 16:29:04 |
6347821 | 15 | 26 | 2015-03-17 16:23:53 | 2015-04-12 16:23:36 |
6357844 | 12 | 38 | 2015-04-03 16:26:27 | 2015-05-11 16:26:33 |
6360014 | 17 | 15 | 2015-03-11 16:26:55 | 2015-03-26 16:26:58 |
Yet, a more natural way to perform the same query is to use the BETWEEN
operator:
SELECT *
FROM music_marketplace.charts
WHERE maxRank BETWEEN 10 AND 20
LIMIT 10;
trackId | maxRank | durationInCharts | firstDateInCharts | lastDateInCharts |
---|---|---|---|---|
4927724 | 20 | 276 | 2015-03-08 00:13:31 | 2015-12-09 16:38:08 |
6317653 | 20 | 34 | 2015-03-14 16:26:10 | 2015-04-17 16:26:25 |
6318411 | 12 | 28 | 2015-03-17 16:27:18 | 2015-04-14 16:27:04 |
6330179 | 20 | 33 | 2015-03-27 16:22:36 | 2015-04-29 16:22:40 |
6337094 | 15 | 15 | 2015-03-11 16:25:39 | 2015-03-26 16:25:53 |
6346768 | 18 | 409 | 2015-03-11 16:22:39 | 2016-04-23 16:38:21 |
6347234 | 10 | 22 | 2015-03-19 16:25:57 | 2015-04-10 16:29:04 |
6347821 | 15 | 26 | 2015-03-17 16:23:53 | 2015-04-12 16:23:36 |
6357844 | 12 | 38 | 2015-04-03 16:26:27 | 2015-05-11 16:26:33 |
6360014 | 17 | 15 | 2015-03-11 16:26:55 | 2015-03-26 16:26:58 |
5.6 Analyzing data from a database with R
The analysis so far runs queries but it does not allow us to manipulate the data in R
. Yet, often, we would rather like to retrieve data from a dataase and explore it in R.
5.6.1 The SQL chunk
way
One way to do so, is to use the option output.var
inside the definition of the SQL markdown chunk and get the result in a dataframe For instance, the SQL Chunk below is
```{sql, connection=con, output.var="d"}
and hence saves the query results into an R dataframe named d
:
SELECT title, label, genre FROM music_marketplace.songs;
We can access this dataframe d
as usual within an R chunk:
%>% head d
## title label genre
## 1 If I Want You Tilth Music Breaks
## 2 Suki Phonocult Records Minimal / Deep Tech
## 3 Goofy's Nightmare Phonocult Records Minimal / Deep Tech
## 4 Suki Phonocult Records Minimal / Deep Tech
## 5 Golden Nightfall Lip Recordings House
## 6 Hope For The Right Lip Recordings House
By default, SQL queries ran within RStudio will return 1000 rows. However, through this methodology, we can retrieve the complete table:
%>% nrow d
## [1] 85987
For consistency, we can also transform the dataframe into a tibble:
= as_tibble(d)
d %>% head d
## # A tibble: 6 × 3
## title label genre
## <chr> <chr> <chr>
## 1 If I Want You Tilth Music Breaks
## 2 Suki Phonocult Records Minimal / Deep Tech
## 3 Goofy's Nightmare Phonocult Records Minimal / Deep Tech
## 4 Suki Phonocult Records Minimal / Deep Tech
## 5 Golden Nightfall Lip Recordings House
## 6 Hope For The Right Lip Recordings House
Recall that a tibble is a more efficient dataframe.
5.6.2 The dbGetQuery
way
Alternatively, we can save data to a dataframe by using the R
function dbGetQuery(con,query)
within an R chunk as follows:
= 'SELECT title, label, genre FROM music_marketplace.songs;'
query = dbGetQuery(con,query)
t %>% as_tibble %>% head t
## # A tibble: 6 × 3
## title label genre
## <chr> <chr> <chr>
## 1 If I Want You Tilth Music Breaks
## 2 Suki Phonocult Records Minimal / Deep Tech
## 3 Goofy's Nightmare Phonocult Records Minimal / Deep Tech
## 4 Suki Phonocult Records Minimal / Deep Tech
## 5 Golden Nightfall Lip Recordings House
## 6 Hope For The Right Lip Recordings House
5.6.3 The %in%
operator
In Section 5.5 we used the IN
operator in SQL. We can perform the same operation in R with the special sequence %in%
.
First, we will retrieve all the songs from table songs
:
= 'SELECT title, genre FROM music_marketplace.songs LIMIT 100;'
query = as_tibble(dbGetQuery(con,query))
t %>% head t
## # A tibble: 6 × 2
## title genre
## <chr> <chr>
## 1 If I Want You Breaks
## 2 Suki Minimal / Deep Tech
## 3 Goofy's Nightmare Minimal / Deep Tech
## 4 Suki Minimal / Deep Tech
## 5 Golden Nightfall House
## 6 Hope For The Right House
Now, we can keep only those songs that are in genre ‘House’ and ‘Techno’ as follows:
%>% filter(genre %in% c('House','Techno')) %>% head t
## # A tibble: 6 × 2
## title genre
## <chr> <chr>
## 1 Golden Nightfall House
## 2 Hope For The Right House
## 3 The Night's Already Started House
## 4 No Mo Techno
## 5 Pillow Techno
## 6 El Snig feat. Rut Techno
For comments, suggestions, errors, and typos, please email me at: kokkodis@bc.edu