5 SQL: Retrieve data from a single table

In this chapter, we will introduce basic SQL queries to retrieve data stored in a single table of a database. For our analysis, we will use the schema music_marketplace that describes a hypothetical marketplace for music. The schema includes the following tables:

  • songs: stores information about songs.
  • label: stores information about music labels.
  • artist: stores information about artists.
  • artist_songs: stores which songs are produced by which artists.
  • charts: stores the maximum rank that a songs achieves in the charts (the hypothetical marketplace has top-selling charts).

The Entity Relationship diagram of this schema is presented below:

The Entity Relationship diagram of schema `music_marketplace`.

Figure 5.1: The Entity Relationship diagram of schema music_marketplace.

 

Recall that the term schema is synonymous to the term databasde.

First, we will connect to the database (Section 4.1):

library(DBI)
library(odbc)
library(RODBC)
library(tidyverse)
con = DBI::dbConnect(odbc::odbc(), 
                    Driver= "MySQL", 
                    Server='mysql-isys3350.bc.edu', 
                    UID='your_username', 
                    PWD= 'your_password',
                    Port= 3306)


5.1 The SELECT clause

The most basic function of a database is that it allows us to retrieve data stored in tables. In particular, to retrieve data from a given table we can use the SELECT clause. The basic syntax is the following:

SELECT columns_to_select 
FROM database_name.table_name 
[WHERE search_condition] /*optional*/
[ORDER BY column_to_order] /*optional*/
[LIMIT rows_to_return] /*optional*/
; 

The clauses in brackets are optional and can be omitted.


In SQL a comment is surrounded by /* and */.


It is a good practie to finish each SQL statement with the semicolon ;.


Assume that we want to retrieve columns title, label, and genre (see Figure 5.1) of each song in the table songs of our focal database music_marketplace. Based on the above syntax, we write:

SELECT title, label, genre FROM music_marketplace.songs;
Table 5.1: Displaying records 1 - 10
title label genre
If I Want You Tilth Music Breaks
Suki Phonocult Records Minimal / Deep Tech
Goofy's Nightmare Phonocult Records Minimal / Deep Tech
Suki Phonocult Records Minimal / Deep Tech
Golden Nightfall Lip Recordings House
Hope For The Right Lip Recordings House
The Night's Already Started Lip Recordings House
Yungle Malicious Damage Drum & Bass
Sail Malicious Damage Electronica / Downtempo
Chromatik [Orbzone] Malicious Damage Electronica / Downtempo


To run SQL commands within an R Mardkown file you will need to create an SQL chunk by typing ```{sql connection=con} and then close it with three left quotes ```.

Alternatively, you can click on the Insert a new code chunk and choose SQL.


Note that the above query displays only the first 10 rows (records) of the table and not the complete table. When running SQL queries from within RStudio the results will usually limited by default to 1000 rows. See Section @ref{dbR} on how to overcome this limitation.



5.1.1 The LIMIT clause

Often, we want to get only a glimpse of a table, so we can use the optional keyword LIMIT:

SELECT title, label, genre 
FROM music_marketplace.songs 
LIMIT 5;
Table 5.2: 5 records
title label genre
If I Want You Tilth Music Breaks
Suki Phonocult Records Minimal / Deep Tech
Goofy's Nightmare Phonocult Records Minimal / Deep Tech
Suki Phonocult Records Minimal / Deep Tech
Golden Nightfall Lip Recordings House

LIMIT 5 above returns only the first 5 rows of the table.

5.1.2 The WHERE clause

The optional WHERE clause allows us to filter the results (similar to the tidyverse::filter() command). For instance, we can select only songs of genre Breaks:

SELECT title, label, genre 
FROM music_marketplace.songs 
WHERE genre = 'Breaks'
LIMIT 5;
Table 5.3: 5 records
title label genre
If I Want You Tilth Music Breaks
Sea eight:fx Breaks
Crazy Bombtraxx Breaks
Going Going Gone Jalapeno Records Breaks
Now I'm Livin' for Me Jalapeno Records Breaks

 

Note that in SQL, the equality test is identified by a single =.

5.1.3 The ORDER BY clause

If we want the result to be ordered, we can use the optional ORDER BY clause:

SELECT title, label, genre 
FROM music_marketplace.songs 
WHERE genre = 'Breaks'
ORDER BY label
LIMIT 10;
Table 5.4: Displaying records 1 - 10
title label genre
Light Shot 24/7 Records Breaks
Hold Me Up 300 Entertainment Breaks
Free Will feat. Neil Kramer Activated Recordings Breaks
Lucy Activated Recordings Breaks
The 13th Dimension Activated Recordings Breaks
Galactic Zipline Activated Recordings Breaks
In The Jam AKIESOUNDS Breaks
Dark Side AKIESOUNDS Breaks
Zone H AKIESOUNDS Breaks
Rocket Fuel Always Never Breaks

And of course we can order in decreasing order with the keyword DESC:

SELECT title, label, genre 
FROM music_marketplace.songs 
WHERE genre = 'Breaks'
ORDER BY label DESC
LIMIT 10;
Table 5.5: Displaying records 1 - 10
title label genre
Yin Yang Zone Records Breaks
Numero Zone Records Breaks
Rainmaker Zone Records Breaks
The Zapper Zone Records Breaks
In Ya Trunk Zone Records Breaks
Trekster Zone Records Breaks
Let's Toast Zone Records Breaks
This Road Feat. Marlon Deja Zone Records Breaks
Dem Gals Zone Records Breaks
Gauge Zone Records Breaks

5.2 Manipulation of columns

Often, we want to retrieve all the columns of a table. The special keyword * allows us to do so as follows:

SELECT *
FROM music_marketplace.songs 
WHERE genre = 'Breaks'
ORDER BY label
LIMIT 5;
Table 5.6: 5 records
title subtitle trackid label genre released labelId price
Light Shot Original Mix 9945407 24/7 Records Breaks 2015-04-06 3765 1.49
Hold Me Up Mike Delinquent Remix 7357822 300 Entertainment Breaks 2015-11-20 53217 1.49
Free Will feat. Neil Kramer Original Mix 6548413 Activated Recordings Breaks 2015-04-14 12018 1.49
Lucy Original Mix 6548415 Activated Recordings Breaks 2015-04-14 12018 1.49
The 13th Dimension Original Mix 6548417 Activated Recordings Breaks 2015-04-14 12018 1.49

In addition, we can rename the query columns with the keyword AS:

SELECT label AS 'Label', genre AS 'Genre'
FROM music_marketplace.songs 
WHERE genre = 'Breaks'
ORDER BY label
LIMIT 10;
Table 5.7: Displaying records 1 - 10
Label Genre
24/7 Records Breaks
300 Entertainment Breaks
Activated Recordings Breaks
Activated Recordings Breaks
Activated Recordings Breaks
Activated Recordings Breaks
AKIESOUNDS Breaks
AKIESOUNDS Breaks
AKIESOUNDS Breaks
Always Never Breaks

And we can remove duplicates with the special keyword DISTINCT:

SELECT DISTINCT label as 'Label', genre as 'Genre'
FROM music_marketplace.songs 
WHERE genre = 'Breaks'
ORDER BY label
LIMIT 10;
Table 5.8: Displaying records 1 - 10
Label Genre
24/7 Records Breaks
300 Entertainment Breaks
Activated Recordings Breaks
AKIESOUNDS Breaks
Always Never Breaks
Animal Language Breaks
Arms & Legs Breaks
Artist Intelligence Agency Breaks
ARVA Breaks
Audio Esperanto Breaks

5.3 Arithmetic operators

Within a SELECT clause we can perform arithmetic operations between columns. SQL supports the following:

  • * multiplication (same as in R)
  • / division (same as in R)
  • + addition (same as in R)
  • - subtraction (same as in R)
  • DIV integer division (in R is the sequence %/%)
  • % module (in R is the sequence %%)

To showcase, we will use the table charts from database music_marketplace (see Figure 5.1):

SELECT * FROM music_marketplace.charts LIMIT 10;
Table 5.9: Displaying records 1 - 10
trackId maxRank durationInCharts firstDateInCharts lastDateInCharts
887832 95 0 2016-01-17 16:38:32 2016-01-17 16:38:32
907738 42 20 2015-08-16 12:21:35 2015-09-05 15:43:31
3711813 93 3 2015-12-27 16:38:57 2015-12-30 16:39:06
4423896 56 301 2015-06-27 16:23:18 2016-04-23 16:39:36
4927722 57 266 2015-03-29 16:22:17 2015-12-20 16:38:10
4927724 20 276 2015-03-08 00:13:31 2015-12-09 16:38:08
4969569 62 48 2016-02-12 16:40:14 2016-03-31 16:39:52
4969576 60 6 2016-03-25 16:40:03 2016-03-31 16:39:52
6065153 60 7 2015-03-14 16:22:47 2015-03-21 16:22:52
6079233 35 7 2015-04-11 16:26:24 2015-04-18 16:26:23

Now we can perform some arithmetic operations on its numerical columns:

SELECT maxRank, maxRank * 2 as 'maxRankSQR', maxRank / durationInCharts,  maxRank DIV durationInCharts
FROM music_marketplace.charts LIMIT 10;
Table 5.10: Displaying records 1 - 10
maxRank maxRankSQR maxRank / durationInCharts maxRank DIV durationInCharts
95 190 NA NA
42 84 2.1000 2
93 186 31.0000 31
56 112 0.1860 0
57 114 0.2143 0
20 40 0.0725 0
62 124 1.2917 1
60 120 10.0000 10
60 120 8.5714 8
35 70 5.0000 5

Note that if we do not assign names to the results of an operation SQL assumes that the column name is the opration itself (e.g.,maxRank / durationInCharts and maxRank DIV durationInCharts in the above output).

5.4 Concatinatination of columns

We can concatenate columns with the function CONCAT_WS (separtor, column1, column2, ...), similar to how we use the function paste() in R (see Section (customFunctions)):

SELECT CONCAT_WS(' -> ',title, subtitle)
FROM music_marketplace.songs LIMIT 10;
Table 5.11: Displaying records 1 - 10
CONCAT_WS(’ -> ’,title, subtitle)
If I Want You -> Retroid Remix
Suki -> Original Mix
Goofy's Nightmare -> Original Mix
Suki -> Seph Remix
Golden Nightfall -> Muzzaik Remix
Hope For The Right -> Original Mix
The Night's Already Started -> Belocca Dark Vox
Yungle -> Original Mix
Sail -> Original Mix
Chromatik [Orbzone] -> Original Mix

In the above output we can see that the contents of column title are followed by the separator -> and then by the contents of column subtitle.

5.5 Logic in SQL

The WHERE clause allows us to perform logical operations (Section 1.7). SQL supports the following operators:

Operator in SQL Description Operator in R
< Less than Same
<= Less than or equal to Same
> Greater than Same
>= Greater than or equal to Same
= Exactly equal to ==
!= Not equal to Same
NOT Not (negation) !
AND Logical AND &
OR Logical OR |

To see an example, let us retrieve only the songs that are either House or Techno:

SELECT title, genre 
FROM music_marketplace.songs
WHERE genre = 'House' OR genre = 'Techno'
LIMIT 5;
Table 5.12: 5 records
title genre
Golden Nightfall House
Hope For The Right House
The Night's Already Started House
No Mo Techno
Pillow Techno

Instead of having multiple OR conditions like before, we can alternatively use the special operator IN as follows:

SELECT title, genre 
FROM music_marketplace.songs
WHERE genre IN ('House','Techno')
LIMIT 5;
Table 5.13: 5 records
title genre
Golden Nightfall House
Hope For The Right House
The Night's Already Started House
No Mo Techno
Pillow Techno


R also offers similar functionality with the %in% operator. See Section 5.6.3 for details.


Now let us retrieve songs that are not of genre House:

SELECT title, genre 
FROM music_marketplace.songs
WHERE NOT genre = 'House'
LIMIT 10;
Table 5.14: Displaying records 1 - 10
title genre
If I Want You Breaks
Suki Minimal / Deep Tech
Goofy's Nightmare Minimal / Deep Tech
Suki Minimal / Deep Tech
Yungle Drum & Bass
Sail Electronica / Downtempo
Chromatik [Orbzone] Electronica / Downtempo
Eurofen Electronica / Downtempo
Sun Of Electronica / Downtempo
73.5 feat. nitraM Electronica / Downtempo

Alternatively, we can use the != operator:

SELECT title, genre 
FROM music_marketplace.songs
WHERE genre != 'House'
LIMIT 10;
Table 5.15: Displaying records 1 - 10
title genre
If I Want You Breaks
Suki Minimal / Deep Tech
Goofy's Nightmare Minimal / Deep Tech
Suki Minimal / Deep Tech
Yungle Drum & Bass
Sail Electronica / Downtempo
Chromatik [Orbzone] Electronica / Downtempo
Eurofen Electronica / Downtempo
Sun Of Electronica / Downtempo
73.5 feat. nitraM Electronica / Downtempo

What if we wanted to retrieve the songs that ranked between \(10^{th}\) and \(20^{th}\) place in the charts? We can use the logical and operator:

SELECT * 
FROM music_marketplace.charts
WHERE maxRank >= 10 AND maxRank <= 20
LIMIT 10;
Table 5.16: Displaying records 1 - 10
trackId maxRank durationInCharts firstDateInCharts lastDateInCharts
4927724 20 276 2015-03-08 00:13:31 2015-12-09 16:38:08
6317653 20 34 2015-03-14 16:26:10 2015-04-17 16:26:25
6318411 12 28 2015-03-17 16:27:18 2015-04-14 16:27:04
6330179 20 33 2015-03-27 16:22:36 2015-04-29 16:22:40
6337094 15 15 2015-03-11 16:25:39 2015-03-26 16:25:53
6346768 18 409 2015-03-11 16:22:39 2016-04-23 16:38:21
6347234 10 22 2015-03-19 16:25:57 2015-04-10 16:29:04
6347821 15 26 2015-03-17 16:23:53 2015-04-12 16:23:36
6357844 12 38 2015-04-03 16:26:27 2015-05-11 16:26:33
6360014 17 15 2015-03-11 16:26:55 2015-03-26 16:26:58

Yet, a more natural way to perform the same query is to use the BETWEEN operator:

SELECT * 
FROM music_marketplace.charts
WHERE maxRank BETWEEN 10 AND 20
LIMIT 10;
Table 5.17: Displaying records 1 - 10
trackId maxRank durationInCharts firstDateInCharts lastDateInCharts
4927724 20 276 2015-03-08 00:13:31 2015-12-09 16:38:08
6317653 20 34 2015-03-14 16:26:10 2015-04-17 16:26:25
6318411 12 28 2015-03-17 16:27:18 2015-04-14 16:27:04
6330179 20 33 2015-03-27 16:22:36 2015-04-29 16:22:40
6337094 15 15 2015-03-11 16:25:39 2015-03-26 16:25:53
6346768 18 409 2015-03-11 16:22:39 2016-04-23 16:38:21
6347234 10 22 2015-03-19 16:25:57 2015-04-10 16:29:04
6347821 15 26 2015-03-17 16:23:53 2015-04-12 16:23:36
6357844 12 38 2015-04-03 16:26:27 2015-05-11 16:26:33
6360014 17 15 2015-03-11 16:26:55 2015-03-26 16:26:58

5.6 Analyzing data from a database with R

The analysis so far runs queries but it does not allow us to manipulate the data in R. Yet, often, we would rather like to retrieve data from a dataase and explore it in R.

5.6.1 The SQL chunk way

One way to do so, is to use the option output.var inside the definition of the SQL markdown chunk and get the result in a dataframe For instance, the SQL Chunk below is

```{sql, connection=con, output.var="d"}

and hence saves the query results into an R dataframe named d:

SELECT title, label, genre FROM music_marketplace.songs;

We can access this dataframe d as usual within an R chunk:

 d %>% head
##                   title             label               genre
## 1         If I Want You       Tilth Music              Breaks
## 2                  Suki Phonocult Records Minimal / Deep Tech
## 3 Goofy&#39;s Nightmare Phonocult Records Minimal / Deep Tech
## 4                  Suki Phonocult Records Minimal / Deep Tech
## 5      Golden Nightfall    Lip Recordings               House
## 6    Hope For The Right    Lip Recordings               House

By default, SQL queries ran within RStudio will return 1000 rows. However, through this methodology, we can retrieve the complete table:

d %>%  nrow
## [1] 85987

For consistency, we can also transform the dataframe into a tibble:

d = as_tibble(d)
d %>% head
## # A tibble: 6 × 3
##   title                 label             genre              
##   <chr>                 <chr>             <chr>              
## 1 If I Want You         Tilth Music       Breaks             
## 2 Suki                  Phonocult Records Minimal / Deep Tech
## 3 Goofy&#39;s Nightmare Phonocult Records Minimal / Deep Tech
## 4 Suki                  Phonocult Records Minimal / Deep Tech
## 5 Golden Nightfall      Lip Recordings    House              
## 6 Hope For The Right    Lip Recordings    House

Recall that a tibble is a more efficient dataframe.

5.6.2 The dbGetQuery way

Alternatively, we can save data to a dataframe by using the R function dbGetQuery(con,query) within an R chunk as follows:

query = 'SELECT title, label, genre FROM music_marketplace.songs;'
t = dbGetQuery(con,query)
t %>% as_tibble %>%  head
## # A tibble: 6 × 3
##   title                 label             genre              
##   <chr>                 <chr>             <chr>              
## 1 If I Want You         Tilth Music       Breaks             
## 2 Suki                  Phonocult Records Minimal / Deep Tech
## 3 Goofy&#39;s Nightmare Phonocult Records Minimal / Deep Tech
## 4 Suki                  Phonocult Records Minimal / Deep Tech
## 5 Golden Nightfall      Lip Recordings    House              
## 6 Hope For The Right    Lip Recordings    House

5.6.3 The %in% operator

In Section 5.5 we used the IN operator in SQL. We can perform the same operation in R with the special sequence %in%. First, we will retrieve all the songs from table songs:

query = 'SELECT title, genre FROM music_marketplace.songs LIMIT 100;'
t = as_tibble(dbGetQuery(con,query))
t %>%  head
## # A tibble: 6 × 2
##   title                 genre              
##   <chr>                 <chr>              
## 1 If I Want You         Breaks             
## 2 Suki                  Minimal / Deep Tech
## 3 Goofy&#39;s Nightmare Minimal / Deep Tech
## 4 Suki                  Minimal / Deep Tech
## 5 Golden Nightfall      House              
## 6 Hope For The Right    House

Now, we can keep only those songs that are in genre ‘House’ and ‘Techno’ as follows:

t %>%  filter(genre %in% c('House','Techno')) %>% head
## # A tibble: 6 × 2
##   title                           genre 
##   <chr>                           <chr> 
## 1 Golden Nightfall                House 
## 2 Hope For The Right              House 
## 3 The Night&#39;s Already Started House 
## 4 No Mo                           Techno
## 5 Pillow                          Techno
## 6 El Snig feat. Rut               Techno

For comments, suggestions, errors, and typos, please email me at: