5 SQL

Product data science interviews tend to allocate 45 minutes to an hour for SQL problems. You will need to be proficient in the following:

  • Joins (inner, left, right, cross)
  • Common Table Expressions (CTEs)
  • Window functions (yes, many companies will ask questions that will require knowledge of window functions)



For all questions below, I will use the following hypothetical schema. The schema includes the following tables:

  • songs: stores information about songs.
  • artist: stores information about artists.
  • artist_songs: stores which songs are produced by which artists.
  • charts: stores the maximum (highest) rank that a song achieves in the charts (the hypothetical marketplace has top-selling charts).


Even though none of the below questions have been ``seen’’ in real interviews (since they all rely on the above hypothetical schema) they all are questions that you have seen in a real interview. Typically, a 45-minute SQL interview will start with something very simple (a join) and then move into CTEs and window functions. Memorizing specific SQL questions is pointless, and that’s why I chose to focus on a specific schema so that you can run as many queries as you like instead of just presenting questions that have been asked in real SQL sessions.


5.1 Choose house or techno

EasyLogical ORTry it


Question: Retrieve only the songs that are either House or Techno. Limit your response to 3.

Answer: We can answer this in a few ways. First, let’s use the logical OR coefficient:

SELECT title, genre 
FROM songs
WHERE genre = 'House' OR genre = 'Techno'
LIMIT 3;
title genre
Drum To Drum Techno
Earthquake Techno
Single Minded Techno

Instead of having multiple OR conditions, we can alternatively use the special operator IN as follows:

SELECT title, genre
FROM songs
WHERE genre IN ('House','Techno')
LIMIT 3;
title genre
Drum To Drum Techno
Earthquake Techno
Single Minded Techno


5.2 Songs that ranked 1 to 50

EasyBetweenTry it


Question: Retrieve the trackId of the songs that ranked between \(1^{st}\) and \(50^{th}\) place in the charts. Limit your answer to 4.

Answer: We can do it through a logical AND operator:

SELECT trackId 
FROM charts
WHERE maxRank >= 1 AND maxRank <= 50
LIMIT 4;
trackId
7603998
6607764
6749010
7588479

Yet, a more natural way to perform the same query is to use the BETWEEN operator:

SELECT trackId
FROM charts
WHERE maxRank BETWEEN 1 AND 50
LIMIT 4;
trackId
7603998
6607764
6749010
7588479


The BETWEEN operator is inclusive: begin and end values are included.


5.3 Artist maxranks

EasyJoinTry it


Question: Write a query to retrieve the artist along with maxRank, title, and whether the artist is a remixer. The result should only contain rows with maxRank greater than 50.
Order the result by artist in increasing order, and limit your response to 4 rows. The results should look like the following:

artist maxRank title remixer
1467 92 30 Years At Sea False
4432 68 2soul True
10297 97 Fusion Love True
12649 59 The Clown feat. Pedestrian False

Answer: We can join tables charts and songs and artist_songs to get all the information we need:

SELECT  distinct artist, maxRank, title, remixer
FROM charts t1 INNER JOIN songs t2
ON t1.trackid = t2.trackid AND t1.maxRank > 50 
INNER JOIN artist_songs t3 ON
t2.trackid = t3.trackid 
order by artist
limit 4
artist maxRank title remixer
1467 92 30 Years At Sea False
4432 68 2soul True
10297 97 Fusion Love True
12649 59 The Clown feat. Pedestrian False

5.4 Expensive house songs I

MediumSubquery, CTE, JoinTry it


Question: Return all the songs that have price greater than the average price of the songs of genre House. Limit your response to 5 songs ordered by title. The price should be rounded to 2 decimal places. Your result should look like the following:

title price
#MIAMIBACKAGAIN 1.49
96 1.49
A House Thing 1.99
A Voz 1.49
Abacaxi 1.49

Answer: There are several ways we can solve this problem. First, let’s solve it with a subquery:

SELECT  title,  ROUND(price,2) as price
FROM songs  
WHERE price > 
(SELECT AVG(price) FROM songs WHERE genre = 'House') and genre = 'House'
ORDER BY title
LIMIT 5
title price
#MIAMIBACKAGAIN 1.49
96 1.49
A House Thing 1.99
A Voz 1.49
Abacaxi 1.49

A different way to solve this problem through a CTE and a cross join:

WITH avg_price as
(
SELECT AVG(price) as mean_price FROM songs WHERE genre = 'House'
)
SELECT  title,  ROUND(price,2) as price
FROM songs  
CROSS JOIN avg_price  ON price > mean_price
AND genre = 'House'
ORDER BY title
limit 5
title price
#MIAMIBACKAGAIN 1.49
96 1.49
A House Thing 1.99
A Voz 1.49
Abacaxi 1.49

5.5 Expensive house songs II

HardSubquery, CTE, Join, Window functionsTry it


Question: Return all the songs of genre House that have price greater than the average price of the songs of genre House, along with the average price of genre House rounded to 2 decimal points. Limit your response to 5 songs ordered by title. Your result should look like the following:

title price Avg price
#MIAMIBACKAGAIN 1.49 1.43
96 1.49 1.43
A House Thing 1.99 1.43
A Voz 1.49 1.43
Abacaxi 1.49 1.43


> Note that both price and Avg price are rounded to 2 decimal points.


Answer: Similar to Expensive house songs I, there are several ways to approach this problem. Perhaps the easiest and dirtiest is to include a SELECT clause in the SELECT clause:

SELECT  title, ROUND(price,2) as price, ROUND((SELECT AVG(price) FROM songs WHERE genre = 'House'),2) as 'Avg price'
FROM songs  
WHERE price > 
(SELECT AVG(price) FROM songs WHERE genre = 'House') AND genre = 'House'
ORDER BY title
LIMIT 5
title price Avg price
#MIAMIBACKAGAIN 1.49 1.43
96 1.49 1.43
A House Thing 1.99 1.43
A Voz 1.49 1.43
Abacaxi 1.49 1.43

Another way to do this is through window functions and CTEs, where we can get the average as follows:

WITH avg_price_per_genre AS
(
SELECT  title, ROUND(price,2) as price, genre,  ROUND(AVG(price) OVER(partition by genre),2)  as 'Avg price'
FROM songs  
)
select title, price, `Avg price`
FROM avg_price_per_genre
WHERE price > `Avg price` AND genre = 'House'
ORDER BY title
LIMIT 5
title price Avg price
#MIAMIBACKAGAIN 1.49 1.43
96 1.49 1.43
A House Thing 1.99 1.43
A Voz 1.49 1.43
Abacaxi 1.49 1.43


An interviewer in a good day will be OK with the first solution. However, an interviewer in a bad day might completely reject the first solution (in fact, it is likely that the interviewer might not even believe that you can include a select clause within a select clause – it has happened to me). It is true however that the second solution shows greater experience and understanding of the language, which is probably what your interviewer will be looking for. In addition, always keep in mind that using CTEs increases your code readability exponentially, so always try to work with CTEs if possible.


5.6 Songs that stay in the chars for a while

MediumSubquery, CTE, Join, ALL, Window functionsTry it


Question: Return all the songs that have durationInCharts greater than the all the durationInCharts of songs of genre Breaks. Order the results by their durationInCharts in decreasing order. Your result should look like the following:

trackId durationInCharts
6166032 407
6706800 397
5971424 362
5954022 326
6329956 313

Answer: One way to do this is to use the keyword ALL and a subquery:

SELECT  trackId,durationInCharts
FROM charts
WHERE durationInCharts > ALL
(SELECT durationInCharts FROM charts c
INNER JOIN songs s ON s.trackid = c.trackid WHERE genre = 'Breaks')
ORDER BY durationInCharts DESC
LIMIT 5;
trackId durationInCharts
6166032 407
6706800 397
5971424 362
5954022 326
6329956 313

Of course, we can also use a subquery along with max():

SELECT  trackId,durationInCharts
FROM charts
WHERE durationInCharts > (SELECT max(durationInCharts) FROM charts c
INNER JOIN songs s ON s.trackid = c.trackid WHERE genre = 'Breaks')
ORDER BY durationInCharts DESC
LIMIT 5;
trackId durationInCharts
6166032 407
6706800 397
5971424 362
5954022 326
6329956 313

If we wanted to do this with CTEs and window functions, one way would have been the following:

WITH max_breaks as 
(
SELECT c.trackId, max(IF(genre= 'Breaks',durationInCharts,NULL)) OVER() as max_breaks
FROM charts c INNER JOIN songs s ON c.trackId = s.trackid
)
SELECT  c.trackId,durationInCharts
FROM charts c 
INNER JOIN max_breaks m ON m.trackid = c.trackid AND durationInCharts > max_breaks
ORDER BY durationInCharts DESC
LIMIT 5;
trackId durationInCharts
6166032 407
6706800 397
5971424 362
5954022 326
6329956 313

5.7 Additional Questions

The following SQL questions are included in the complete book that defines the bar for product data science:

Question Topics
Concat columns Concat
Label recent songs Case
Songs with letters Regexp
Songs per genre Group by
Songs that did not enter the charts or entered high Subquery, Join
Songs in charts with greater durations Subquery, CTE, Join, Window functions
Artists with more songs than others Subquery, CTE, Join, Window functions
Median songs per artist CTE, Window functions
Histogram of songs Recursive CTE, Join