8 SQL: Aggregating and summarizing results

In this chapter, we will discuss how we can aggregate and summarize results. We will again use the schema music_marketplace that describes a hypothetical marketplace for music. Recall that the schema includes the following tables:

songs: stores information about songs.
label: stores information about music labels.
artist: stores information about artists.
artist_songs: stores which songs are produced by which artists.
charts: stores the maximum rank that a songs achieves in the charts (the hypothetical marketplace has top-selling charts).

Recall that the Entity Relationship diagram of this schema is as follows:

Figure 8.1: The Entity Relationship diagram of schema music_marketplace.

First, we will connect to the database (Section 4.1):

library(DBI)
library(odbc)
library(RODBC)
library(tidyverse)

con = DBI::dbConnect(odbc::odbc(), 
                    Driver= "MySQL", 
                    Server='mysql-isys3350.bc.edu', 
                    UID='your_username', 
                    PWD= 'your_password',
                    Port= 3306)

Aggregation functions perform a calculation on the values of the selected column(s). In this chapter, we will discuss multiple such functions, including:

count, min, max, avg, sum, std
group by, having

We discuss these next.

8.1 Summarization

In SQL, we can estimate simple summary statistics of a column as follows:

SELECT min(maxRank), max(maxRank), avg(maxRank), std(maxRank), sum(maxRank)
FROM music_marketplace.charts;

Table 8.1: 1 records
min(maxRank)	max(maxRank)	avg(maxRank)	std(maxRank)	sum(maxRank)
1	100	45.2291	28.63897	289059

Note that there is not a predefined function to estimate the median of a column!

The function count calculates the number of rows of the result. It is identical to nrow in R:

SELECT count(*)
FROM music_marketplace.charts t1;

Table 8.2: 1 records
count(*)
6391

8.2 Group by

Similar to the group_by function in R, we can group results based on one or more columns. For instance, we can count how many songs per genre exist in the table songs:

SELECT genre, count(*)  
FROM music_marketplace.songs 
GROUP BY genre
LIMIT 10;

Table 8.3: Displaying records 1 - 10
genre	count(*)
-1	2158
Big Room	40
Breaks	948
Dance	1032
Deep House	8289
DJ Tools	503
Drum & Bass	1680
Dubstep	736
Electro House	2939
Electronica / Downtempo	4977

Of course, we can group by multiple columns:

SELECT genre, label, count(*)  
FROM music_marketplace.songs 
WHERE genre != -1
GROUP BY genre, label
LIMIT 10;

Table 8.4: Displaying records 1 - 10
genre	label	count(*)
Big Room	+Mas Label	3
Big Room	AFTERCLUV DANCE LAB	1
Big Room	Anghellic Records	1
Big Room	Armind (Armada)	1
Big Room	ARVA	1
Big Room	Better Than Records	1
Big Room	Columbia (Sony)	2
Big Room	Disruptor Records/Columbia	1
Big Room	Dmn Records	9
Big Room	Doner Music	1

On top of the GROUP BY clause, we can use the HAVING close, which specifies a condition to filter the result. SQL applies the HAVING condition on the result of the GROUP BY (i.e., after filtering in the WHERE clause). For instance, we can extend the previous query as follows:

SELECT genre, label, count(*)  
FROM music_marketplace.songs 
WHERE genre != -1
GROUP BY genre, label
HAVING count(*) > 50
LIMIT 10;

Table 8.5: Displaying records 1 - 10
genre	label	count(*)
Dance	Jalapeno Records	87
Dance	LNG Music	94
Deep House	Anjunadeep	68
Deep House	City Life	94
Deep House	Club Session	69
Deep House	Doppelgaenger	88
Deep House	Flagman	112
Deep House	Get Physical Music	69
Deep House	HiFi Stories	54
Deep House	High Pro-File Recordings	98

Whenever we use a GROUP BY clause, the select clause can only inlcude the columns used to group by and aggregations (e.g., avg, count, sum, etc.).

How does the HAVING clause compare with the WHERE clause? The WHERE clause filters the results before the grouping; the HAVING clause filters the results after the grouping. In addition, a WHERE clause can refer to any columns in the selected tables but cannot filter on aggregate functions. On the other hand, a HAVING clause can only filter on columns that are included in the SELECT clause and can apply aggregate functions.

Here is a different example:

SELECT genre, label, count(*), max(price), avg(price) 
FROM music_marketplace.songs 
WHERE genre != -1 
GROUP BY genre, label
HAVING avg(price) > 0 and count(*) > 100
LIMIT 10;

Table 8.6: Displaying records 1 - 10
genre	label	count(*)	max(price)	avg(price)
Deep House	Flagman	112	1.49	1.490000
Deep House	Variety Music	180	1.49	1.490000
Deep House	Voltaire Music	103	1.49	1.490000
DJ Tools	SuperNova Records	103	1.49	1.490000
Electro House	Club Session	151	1.49	1.490000
Electro House	RH2	136	1.49	1.490000
Electronica / Downtempo	Manifold Records	108	1.49	1.236389
Funk / Soul / Disco	RCA Records Label	129	1.49	1.490000
Hard Dance	Fireball Recordings	101	1.49	1.490000
Hard Dance	Hard Dance Coalition	112	1.49	1.467768

For comments, suggestions, errors, and typos, please email me at: kokkodis@bc.edu