5 SQL
Product data science interviews tend to allocate 45 minutes to an hour for SQL problems. You will need to be proficient in the following:
- Joins (inner, left, right, cross)
- Common Table Expressions (CTEs)
- Window functions (yes, many companies will ask questions that will require knowledge of window functions)
For all questions below, I will use the following hypothetical schema. The schema includes the following tables:
songs
: stores information about songs.artist
: stores information about artists.artist_songs
: stores which songs are produced by which artists.charts
: stores the maximum (highest) rank that a song achieves in the charts (the hypothetical marketplace has top-selling charts).
Even though none of the below questions have been ``seen’’ in real interviews (since they all rely on the above hypothetical schema) they all are questions that you have seen in a real interview. Typically, a 45-minute SQL interview will start with something very simple (a join) and then move into CTEs and window functions. Memorizing specific SQL questions is pointless, and that’s why I chose to focus on a specific schema so that you can run as many queries as you like instead of just presenting questions that have been asked in real SQL sessions.
5.1 Choose house or techno
Question: Retrieve only the songs that are either House
or Techno
. Limit your response to 3.
Answer: We can answer this in a few ways. First, let’s use the logical OR coefficient:
title | genre |
---|---|
Drum To Drum | Techno |
Earthquake | Techno |
Single Minded | Techno |
Instead of having multiple OR
conditions, we can alternatively use the special operator IN
as follows:
title | genre |
---|---|
Drum To Drum | Techno |
Earthquake | Techno |
Single Minded | Techno |
5.2 Songs that ranked 1 to 50
Question: Retrieve the trackId
of the songs that ranked between \(1^{st}\) and \(50^{th}\) place in the charts. Limit your answer to 4.
Answer: We can do it through a logical AND
operator:
trackId |
---|
7603998 |
6607764 |
6749010 |
7588479 |
Yet, a more natural way to perform the same query is to use the BETWEEN
operator:
trackId |
---|
7603998 |
6607764 |
6749010 |
7588479 |
BETWEEN
operator is inclusive: begin and end values are included.
5.3 Artist maxranks
Question: Write a query to retrieve the artist
along with maxRank
, title
, and whether the artist is a remixer
. The result should only contain rows with maxRank
greater than 50.
Order the result by artist
in increasing order, and limit your response to 4 rows. The results should look like the following:
artist | maxRank | title | remixer |
---|---|---|---|
1467 | 92 | 30 Years At Sea | False |
4432 | 68 | 2soul | True |
10297 | 97 | Fusion Love | True |
12649 | 59 | The Clown feat. Pedestrian | False |
Answer: We can join tables charts
and songs
and artist_songs
to get all the information we need:
SELECT distinct artist, maxRank, title, remixer
FROM charts t1 INNER JOIN songs t2
ON t1.trackid = t2.trackid AND t1.maxRank > 50
INNER JOIN artist_songs t3 ON
t2.trackid = t3.trackid
order by artist
limit 4
artist | maxRank | title | remixer |
---|---|---|---|
1467 | 92 | 30 Years At Sea | False |
4432 | 68 | 2soul | True |
10297 | 97 | Fusion Love | True |
12649 | 59 | The Clown feat. Pedestrian | False |
5.4 Expensive house songs I
Question: Return all the songs that have price
greater than the average price of the songs of genre House
. Limit your response to 5 songs ordered by title
. The price
should be rounded to 2 decimal places. Your result should look like the following:
title | price |
---|---|
#MIAMIBACKAGAIN | 1.49 |
96 | 1.49 |
A House Thing | 1.99 |
A Voz | 1.49 |
Abacaxi | 1.49 |
Answer: There are several ways we can solve this problem. First, let’s solve it with a subquery:
SELECT title, ROUND(price,2) as price
FROM songs
WHERE price >
(SELECT AVG(price) FROM songs WHERE genre = 'House') and genre = 'House'
ORDER BY title
LIMIT 5
title | price |
---|---|
#MIAMIBACKAGAIN | 1.49 |
96 | 1.49 |
A House Thing | 1.99 |
A Voz | 1.49 |
Abacaxi | 1.49 |
A different way to solve this problem through a CTE and a cross join:
WITH avg_price as
(
SELECT AVG(price) as mean_price FROM songs WHERE genre = 'House'
)
SELECT title, ROUND(price,2) as price
FROM songs
CROSS JOIN avg_price ON price > mean_price
AND genre = 'House'
ORDER BY title
limit 5
title | price |
---|---|
#MIAMIBACKAGAIN | 1.49 |
96 | 1.49 |
A House Thing | 1.99 |
A Voz | 1.49 |
Abacaxi | 1.49 |
5.5 Expensive house songs II
Question: Return all the songs of genre House
that have price
greater than the average price of the songs of genre House
, along with the average price of genre House
rounded to 2 decimal points. Limit your response to 5 songs ordered by title. Your result should look like the following:
title | price | Avg price |
---|---|---|
#MIAMIBACKAGAIN | 1.49 | 1.43 |
96 | 1.49 | 1.43 |
A House Thing | 1.99 | 1.43 |
A Voz | 1.49 | 1.43 |
Abacaxi | 1.49 | 1.43 |
> Note that both price
and Avg price
are rounded to 2 decimal points.
Answer: Similar to Expensive house songs I
, there are several ways to approach this problem. Perhaps the easiest and dirtiest is to include a SELECT
clause in the SELECT
clause:
SELECT title, ROUND(price,2) as price, ROUND((SELECT AVG(price) FROM songs WHERE genre = 'House'),2) as 'Avg price'
FROM songs
WHERE price >
(SELECT AVG(price) FROM songs WHERE genre = 'House') AND genre = 'House'
ORDER BY title
LIMIT 5
title | price | Avg price |
---|---|---|
#MIAMIBACKAGAIN | 1.49 | 1.43 |
96 | 1.49 | 1.43 |
A House Thing | 1.99 | 1.43 |
A Voz | 1.49 | 1.43 |
Abacaxi | 1.49 | 1.43 |
Another way to do this is through window functions and CTEs, where we can get the average as follows:
WITH avg_price_per_genre AS
(
SELECT title, ROUND(price,2) as price, genre, ROUND(AVG(price) OVER(partition by genre),2) as 'Avg price'
FROM songs
)
select title, price, `Avg price`
FROM avg_price_per_genre
WHERE price > `Avg price` AND genre = 'House'
ORDER BY title
LIMIT 5
title | price | Avg price |
---|---|---|
#MIAMIBACKAGAIN | 1.49 | 1.43 |
96 | 1.49 | 1.43 |
A House Thing | 1.99 | 1.43 |
A Voz | 1.49 | 1.43 |
Abacaxi | 1.49 | 1.43 |
5.6 Songs that stay in the chars for a while
Question: Return all the songs that have durationInCharts
greater than the all the durationInCharts
of songs of genre Breaks
. Order the results by their durationInCharts
in decreasing order. Your result should look like the following:
trackId | durationInCharts |
---|---|
6166032 | 407 |
6706800 | 397 |
5971424 | 362 |
5954022 | 326 |
6329956 | 313 |
Answer: One way to do this is to use the keyword ALL
and a subquery:
SELECT trackId,durationInCharts
FROM charts
WHERE durationInCharts > ALL
(SELECT durationInCharts FROM charts c
INNER JOIN songs s ON s.trackid = c.trackid WHERE genre = 'Breaks')
ORDER BY durationInCharts DESC
LIMIT 5;
trackId | durationInCharts |
---|---|
6166032 | 407 |
6706800 | 397 |
5971424 | 362 |
5954022 | 326 |
6329956 | 313 |
Of course, we can also use a subquery along with max()
:
SELECT trackId,durationInCharts
FROM charts
WHERE durationInCharts > (SELECT max(durationInCharts) FROM charts c
INNER JOIN songs s ON s.trackid = c.trackid WHERE genre = 'Breaks')
ORDER BY durationInCharts DESC
LIMIT 5;
trackId | durationInCharts |
---|---|
6166032 | 407 |
6706800 | 397 |
5971424 | 362 |
5954022 | 326 |
6329956 | 313 |
If we wanted to do this with CTEs and window functions, one way would have been the following:
WITH max_breaks as
(
SELECT c.trackId, max(IF(genre= 'Breaks',durationInCharts,NULL)) OVER() as max_breaks
FROM charts c INNER JOIN songs s ON c.trackId = s.trackid
)
SELECT c.trackId,durationInCharts
FROM charts c
INNER JOIN max_breaks m ON m.trackid = c.trackid AND durationInCharts > max_breaks
ORDER BY durationInCharts DESC
LIMIT 5;
trackId | durationInCharts |
---|---|
6166032 | 407 |
6706800 | 397 |
5971424 | 362 |
5954022 | 326 |
6329956 | 313 |
5.7 Additional Questions
The following SQL questions are included in the complete book that defines the bar for product data science:
Question | Topics |
---|---|
Concat columns | Concat |
Label recent songs | Case |
Songs with letters | Regexp |
Songs per genre | Group by |
Songs that did not enter the charts or entered high | Subquery, Join |
Songs in charts with greater durations | Subquery, CTE, Join, Window functions |
Artists with more songs than others | Subquery, CTE, Join, Window functions |
Median songs per artist | CTE, Window functions |
Histogram of songs | Recursive CTE, Join |