A Longitudinal Analysis of Spotify’s Top 100 Tracks
Delaney Beal
November 20, 2020
Introduction
Goal: to determine which factors influence the overall danceability of a song over time
- Data comes from Spotify’s Top 100 Tracks 2001-2019
- Inspired by projects done in previous courses analyzing datasets from Spotify playlists
Statement of Problem
- Previous datasets were found on Kaggle
- Only two years' worth of data was available
- More data needed to be extracted for longitudinal analysis
- Knowledge of Python was minimal so research had to be completed to learn Python
Methods
Spotify for Developers
- Explains the process for retrieving data from Spotify
- Authorization information
- How to work with playlists
- Quick start guide to the Spotify Web API
- Most examples are in Java
Spotify Web API
- Server-side application to access data from user profiles
- Create new playlists
- Find information about songs and artists
Methods
Spotipy
- Python library for the Spotify Web API
- Site included examples of Python code used to extract track information
Previous Uses
- To extract a list of song IDs to pull audio features for the Million Songs Dataset
- To predict, using machine learning, which songs will be paired with more expensive advertisements
Coding Process
- Python Code had to be produced to extract data points from Spotify
- Spotify Web API tutorial was followed
- Produced code where specific tracks were able to be extracted from specific artists
- Complete playlists could not be extracted
- Tamer’s GitHub page became useful
- Her code was able to extract data from entire playlists
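The playlist extraction step described above can be sketched as follows. This is a minimal illustration, not the code from the tutorial or from the GitHub page mentioned above. The function name `playlist_audio_features` and the `FEATURES` list are hypothetical; `playlist_items` and `audio_features` are Spotipy client methods, and the client is passed in as a parameter so the function does not create a connection itself. Whether the audio-features endpoint is available depends on the API access granted to the application.

```python
from typing import Any, Dict, List

# Audio features used as variables in this analysis (field names as returned by the Web API).
FEATURES = ["danceability", "energy", "key", "loudness", "mode", "speechiness",
            "acousticness", "instrumentalness", "liveness", "valence", "tempo",
            "duration_ms", "time_signature"]


def playlist_audio_features(sp: Any, playlist_id: str,
                            page_size: int = 100) -> List[Dict[str, Any]]:
    """Return one row of audio features per track in a playlist.

    `sp` is assumed to be a spotipy.Spotify client (or any object with the
    same playlist_items / audio_features methods).
    """
    rows: List[Dict[str, Any]] = []
    offset = 0
    while True:
        # Playlists are returned one page at a time.
        page = sp.playlist_items(playlist_id, limit=page_size, offset=offset)
        tracks = [item["track"] for item in page["items"]
                  if item.get("track") and item["track"].get("id")]
        if tracks:
            # One request for the audio features of every track on this page.
            features = sp.audio_features([t["id"] for t in tracks])
            for track, feat in zip(tracks, features):
                if feat is None:  # no audio features available for this track
                    continue
                row = {"name": track["name"], "id": track["id"]}
                row.update({k: feat[k] for k in FEATURES})
                rows.append(row)
        if not page.get("next"):  # no further pages
            break
        offset += page_size
    return rows


# Usage (requires the spotipy package and API credentials in the environment):
#   import spotipy
#   from spotipy.oauth2 import SpotifyClientCredentials
#   sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())
#   rows = playlist_audio_features(sp, "<playlist id or URI>")
```

Paging through the playlist is what makes a full Top 100 list retrievable rather than a handful of tracks from one artist.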
Variables
Outcome Variable
- Danceability - how easy it is to dance to a song, from 0 to 1
Predictor Variables
- Energy - intensity of a track, from 0 to 1
- Key - the key the song is in, from 0 to 11 in pitch class notation (0 = C, 1 = C#, ..., 11 = B)
- Loudness - how loud a track is, in decibels
- Mode - indicates whether a song is in a major key or a minor key, 1 or 0
- Speechiness - the portion of the song consisting of spoken word, from 0 to 1
- Acousticness - confidence that a song is acoustic, from 0 to 1
Variables
Predictor Variables
- Instrumentalness - detects how instrumental a song is, from 0 to 1
- Liveness - detects whether or not there is an audience present, from 0 to 1
- Valence - how uplifting a song is, from 0 to 1
- Tempo - the speed of the song, in beats per minute
- Duration - how long the song is, in milliseconds
- Time signature - how many beats are in a measure
Previous Research
- In Biostatistics,
- Top 100 Tracks of 2017
- Found valence and energy to be significant predictors of danceability
- In Regression Analysis,
- Top 100 Tracks of 2018
- Along with valence and energy, speechiness, acousticness, and tempo were also significant predictors of danceability
Statistical Methods
- Multiple Linear Regression
- Used for datasets with more than one predictor variable
\[\hat{Y} = \beta_0 + \beta_1{X_1} + \beta_2{X_2} + \cdots + \beta_k{X_k} \]
- Assumptions: \[\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)\]
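As a rough illustration of multiple linear regression, the sketch below fits the coefficients by least squares with NumPy. The function names `fit_ols` and `predict` and the toy data are invented for this example; they are not the code or data used in this project.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares estimates for y = b0 + b1*x1 + ... + bk*xk.

    Returns the coefficients as [b0, b1, ..., bk].
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    """Fitted values: intercept plus the weighted sum of the predictors."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]


# Toy data (invented): two predictors on a 0-to-1 scale and a noise-free response.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 2))
y = 0.5 - 0.25 * X[:, 0] + 0.3 * X[:, 1]

coef = fit_ols(X, y)
print(coef)  # estimated intercept and slopes
```

Because the toy response has no error term, the fit recovers the coefficients used to build it; with real data the estimates come with the uncertainty summarized by the confidence intervals and p-values.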
Statistical Methods
- Backward Selection
- Begins with full model and strips away predictors based on some criteria
- Must have more observations than variables
From previous research, energy, speechiness, acousticness, valence, and tempo are significant predictors of danceability.
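The backward selection procedure can be sketched as below. The slides do not state the exact removal rule beyond significance, so this example swaps in AIC as the criterion: starting from the full model, it drops the predictor whose removal lowers AIC the most and stops when no removal helps. The names `aic` and `backward_select`, the predictor labels, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def aic(X, y):
    """AIC (up to an additive constant) of a least-squares fit with an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]


def backward_select(X, y, names):
    """Start from the full model and remove predictors while AIC keeps improving."""
    keep = list(range(X.shape[1]))
    best = aic(X[:, keep], y)
    while keep:
        # AIC of every model that has exactly one fewer predictor.
        candidates = [(aic(X[:, [j for j in keep if j != drop]], y), drop)
                      for drop in keep]
        cand_aic, drop = min(candidates)
        if cand_aic >= best:  # no single removal improves the model
            break
        keep.remove(drop)
        best = cand_aic
    return [names[j] for j in keep]


# Toy data (invented): only x0 and x2 actually drive the response.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 4))
y = 0.5 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(0, 0.1, size=200)
names = ["x0", "x1", "x2", "x3"]

selected = backward_select(X, y, names)
print(selected)
```

A p-value rule would follow the same loop, with the least significant predictor removed at each step instead of the one that lowers AIC the most.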
Initial Model
The initial regression model including all variables is: \[\hat{Y} = -7.0358 + 0.0039\mbox{ year} -0.2561\mbox{ energy} + 0.0003\mbox{ key}\] \[+ 0.001\mbox{ loudness} -0.0178\mbox{ mode} + 0.1798\mbox{ speechiness} -0.1227\mbox{ acousticness}\] \[+ 0.119\mbox{ instrumentalness} -0.0928\mbox{ liveness} + 0.3141\mbox{ valence} -0.0009\mbox{ tempo}\] \[ + 0\mbox{ duration} + 0.0185\mbox{ time signature}\]
Characteristic | Beta | 95% CI | p-value |
year | 0.00 | 0.00, 0.00 | <0.001 |
energy | -0.26 | -0.31, -0.20 | <0.001 |
key | 0.00 | 0.00, 0.00 | 0.720 |
loudness | 0.00 | 0.00, 0.00 | 0.610 |
mode | -0.02 | -0.03, -0.01 | 0.001 |
speechiness | 0.18 | 0.12, 0.24 | <0.001 |
acousticness | -0.12 | -0.15, -0.09 | <0.001 |
instrumentalness | 0.12 | 0.06, 0.18 | <0.001 |
liveness | -0.09 | -0.13, -0.05 | <0.001 |
valence | 0.31 | 0.29, 0.34 | <0.001 |
tempo | 0.00 | 0.00, 0.00 | <0.001 |
duration_ms | 0.00 | 0.00, 0.00 | 0.614 |
time_signature | 0.02 | 0.00, 0.04 | 0.106 |
Final Model
After removing the variables that are not significant predictors of danceability, the regression model is: \[\hat{Y} = -16.8717 + 0.0088\mbox{ year} -0.2426\mbox{ energy} -0.017\mbox{ mode} + 23.877\mbox{ speechiness}\] \[ -0.1224\mbox{ acousticness} + 0.1145\mbox{ instrumentalness} -0.0946\mbox{ liveness} + 13.9302\mbox{ valence}\] \[ -0.0008\mbox{ tempo} -0.0118\mbox{ year} \times \mbox{ speechiness} -0.0068\mbox{ year } \times \mbox{ valence}\]
The final model includes two interaction terms: year with speechiness and year with valence.
Characteristic | Beta | 95% CI | p-value |
year | 0.01 | 0.01, 0.01 | <0.001 |
energy | -0.24 | -0.28, -0.20 | <0.001 |
mode | -0.02 | -0.03, -0.01 | 0.002 |
speechiness | 24 | 3.7, 44 | 0.021 |
acousticness | -0.12 | -0.15, -0.09 | <0.001 |
instrumentalness | 0.11 | 0.05, 0.17 | <0.001 |
liveness | -0.09 | -0.14, -0.05 | <0.001 |
valence | 14 | 5.1, 23 | 0.002 |
tempo | 0.00 | 0.00, 0.00 | <0.001 |
year * speechiness | -0.01 | -0.02, 0.00 | 0.022 |
year * valence | -0.01 | -0.01, 0.00 | 0.002 |
Testing Assumptions
Graphical Analysis
Interaction between Year and Speechiness
- As year increases, slope of speechiness has decreased
- Relationship is weakening
Interaction between Year and Valence
- As year increases, slope of valence has decreased
- Relationship is weakening
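The weakening can be read off the final model equation: the slope of a predictor that interacts with year is its own coefficient plus the interaction coefficient times year. The worked example below uses the rounded coefficients printed in the final model, so the individual slopes are only approximate (a rounding error of 0.00005 in an interaction coefficient shifts the slope by about 0.1 near year 2000).

\[\frac{\partial \hat{Y}}{\partial \mbox{ speechiness}} = 23.877 - 0.0118\mbox{ year}: \qquad 2001 \rightarrow 0.2652, \qquad 2019 \rightarrow 0.0528\]
\[\frac{\partial \hat{Y}}{\partial \mbox{ valence}} = 13.9302 - 0.0068\mbox{ year}: \qquad 2001 \rightarrow 0.3234, \qquad 2019 \rightarrow 0.2010\]

The change over the 18 years depends only on the interaction coefficients, so it is less sensitive to rounding in the main effects: about \(-0.0118 \times 18 \approx -0.21\) for speechiness and \(-0.0068 \times 18 \approx -0.12\) for valence.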
Conclusions
- Conducted longitudinal analysis, expanding on prior projects
- Variables used in final model:
- Year, energy, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, year * speechiness, year * valence
- P-values less than 0.05
- Limitation not accounted for:
- The instance of one artist having multiple songs in the Top 100 Tracks
Impact
The datasets created for this project have been uploaded to Kaggle for future analysts to utilize.
Suggestions for Further Study
- Look for any distinct differences of top songs in 2020
- COVID-19, riots, the election
- Generational differences in music choices
- Public user profiles are accessible by the user’s URI through the Spotify Web API
- Compare the types of music listened to by different age groups