A Longitudinal Analysis of Spotify’s Top 100 Tracks

Delaney Beal

November 20, 2020

Introduction

Goal: to determine which factors influence the overall danceability of a song over time

Data comes from Spotify’s Top 100 Tracks 2001-2019
Inspired by projects done in previous courses analyzing datasets from Spotify playlists

Statement of Problem

Previous datasets were found on Kaggle
Only two years’ worth of data were available
- More data had to be extracted for a longitudinal analysis
Prior knowledge of Python was minimal, so the language had to be learned before the data could be extracted

Methods

Spotify for Developers

Explains the process for retrieving data from Spotify
- Authorization information
- How to work with playlists
- Quick start guide to the Spotify Web API
Most examples are in Java

Spotify Web API

Server-side application to access data from user profiles
- Create new playlists
- Find information about songs and artists

Methods

Spotipy

Python library for the Spotify Web API
Site included examples of Python code used to extract track information

Previous Uses

To extract a list of song IDs to pull audio features for the Million Songs Dataset
To assess what songs will play more expensive advertisements using machine learning

Coding Process

Python code had to be written to extract data points from Spotify
Spotify Web API tutorial was followed
- Resulting code could extract specific tracks by specific artists
- Complete playlists could not be extracted
Tamer’s GitHub page proved useful
- Her code could extract data from entire playlists

Data Extraction Process

An app needed to be created through the Spotify Dashboard

Provides the Spotify Client ID and Client Secret
Permits access to various data points in Spotify through Python
Provides a path to extract the data from the playlists

Data Extraction Process

A whitelisted redirect URI had to be registered within the Spotify app for re-authentication when running the Python code
Playlist URIs were inserted into the Python code, which produced a .csv file with all of the data points needed for the analyses
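The extraction steps above can be sketched with Spotipy as follows. This is a minimal illustration, not the code used in the project: the helper names (`make_client`, `playlist_rows`, `write_csv`), the output file name, and the placeholder credentials are hypothetical, and the sketch assumes the app is allowed to call the audio-features endpoint.

```python
import csv

# Audio-feature fields that match the variables used in this analysis.
FEATURES = ["danceability", "energy", "key", "loudness", "mode", "speechiness",
            "acousticness", "instrumentalness", "liveness", "valence", "tempo",
            "duration_ms", "time_signature"]


def make_client(client_id, client_secret, redirect_uri):
    """Build an authenticated Spotipy client (requires `pip install spotipy`)."""
    import spotipy
    from spotipy.oauth2 import SpotifyOAuth
    auth = SpotifyOAuth(client_id=client_id, client_secret=client_secret,
                        redirect_uri=redirect_uri)
    return spotipy.Spotify(auth_manager=auth)


def playlist_rows(sp, playlist_uri, year):
    """Return one dict of audio features per track in the playlist."""
    # Assumption: a Top 100 playlist fits in a single page of 100 items.
    page = sp.playlist_items(playlist_uri, limit=100)
    ids = [item["track"]["id"] for item in page["items"]
           if item.get("track") and item["track"].get("id")]
    rows = []
    if not ids:
        return rows
    for feats in sp.audio_features(ids):
        if feats is None:  # the API can return None for a track without features
            continue
        row = {"year": year}
        row.update({name: feats[name] for name in FEATURES})
        rows.append(row)
    return rows


def write_csv(rows, path):
    """Write the extracted rows to a .csv file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["year"] + FEATURES)
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical usage (placeholders must be replaced with real values):
# sp = make_client("CLIENT_ID", "CLIENT_SECRET", "http://localhost:8888/callback")
# write_csv(playlist_rows(sp, "spotify:playlist:<playlist id>", 2019), "top100_2019.csv")
```

Passing the client in as an argument keeps the extraction logic separate from authentication, so the same function can be reused for each year's playlist.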

Variables

Outcome Variable

Danceability - how easy it is to dance to a song, from 0 to 1

Predictor Variables

Energy - intensity of a track, from 0 to 1
Key - the key the song is in, as a pitch class from 0 to 11 (C to B)
Loudness - how loud a track is, in decibels
Mode - indicates whether a song is in a major key or a minor key, 1 or 0
Speechiness - the portion of the song consisting of spoken word, from 0 to 1
Acousticness - confidence that a song is acoustic, from 0 to 1
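Because key and mode are stored as integers, a small lookup makes them easier to read. The sketch below assumes Spotify's pitch class encoding (0 = C through 11 = B, with -1 when no key was detected); the function name `describe_key` is hypothetical.

```python
# Pitch classes in the order used by the key variable (0 = C, ..., 11 = B).
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def describe_key(key, mode):
    """Translate the integer key and mode into a label such as 'A minor'."""
    if key == -1:  # -1 marks a track where no key was detected
        return "unknown"
    quality = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {quality}"


print(describe_key(0, 1))  # C major
print(describe_key(9, 0))  # A minor
```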

Variables

Predictor Variables

Instrumentalness - likelihood that a song contains no vocals, from 0 to 1
Liveness - likelihood that a song was recorded with an audience present, from 0 to 1
Valence - how uplifting a song is, from 0 to 1
Tempo - the speed of the song, in beats per minute
Duration - how long the song is, in milliseconds
Time signature - how many beats are in a measure

Previous Research

In Biostatistics,
- Top 100 Tracks of 2017
- Found valence and energy to be significant predictors of danceability
In Regression Analysis,
- Top 100 Tracks of 2018
- Along with valence and energy, speechiness, acousticness, and tempo were also significant predictors of danceability

Statistical Methods

Multiple Linear Regression
- Used for datasets with more than one predictor variable

$\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$

Assumptions: $\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$

Statistical Methods

Backward Selection
- Begins with the full model and removes predictors one at a time based on a chosen criterion
- Must have more observations than variables

From previous research, energy, speechiness, acousticness, valence, and tempo are significant predictors of danceability.
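Backward selection as described above can be sketched in plain Python. This is an illustration under stated assumptions, not the procedure used for the models in this presentation: it assumes the removal criterion is a p-value threshold of 0.05, approximates the t distribution with a normal distribution (reasonable only for large samples), and runs on simulated data. The function names and the simulated variables are hypothetical.

```python
import random
from statistics import NormalDist


def invert(a):
    """Invert a square matrix with Gauss-Jordan elimination and partial pivoting."""
    n = len(a)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(columns, y):
    """Fit y on an intercept plus the given columns; return (betas, p_values)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    resid = [y[i] - sum(X[i][a] * beta[a] for a in range(p)) for i in range(n)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance estimate
    pvals = []
    for a in range(p):
        z = beta[a] / (s2 * inv[a][a]) ** 0.5
        pvals.append(2 * (1 - NormalDist().cdf(abs(z))))  # normal approximation
    return beta, pvals


def backward_select(data, y, alpha=0.05):
    """Drop the least significant predictor until every p-value is below alpha."""
    kept = list(data)
    while kept:
        _, pvals = ols([data[name] for name in kept], y)
        worst = max(range(len(kept)), key=lambda j: pvals[j + 1])  # index 0 is the intercept
        if pvals[worst + 1] < alpha:
            break
        kept.pop(worst)
    return kept


# Simulated data: the outcome depends on valence and energy but not on noise.
random.seed(1)
n = 400
data = {
    "valence": [random.random() for _ in range(n)],
    "energy": [random.random() for _ in range(n)],
    "noise": [random.random() for _ in range(n)],
}
y = [0.5 + 0.3 * v - 0.25 * e + random.gauss(0, 0.05)
     for v, e in zip(data["valence"], data["energy"])]
selected = backward_select(data, y)
print(selected)
```

With real data, a statistics package that reports exact t-based p-values would be the more dependable choice; the hand-written solver here only keeps the sketch free of dependencies.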

Initial Model

The initial regression model including all variables is: $\hat{Y} = -7.0358 + 0.0039\mbox{ year} -0.2561\mbox{ energy} + 0.0003\mbox{ key}$ $+ 0.001\mbox{ loudness} -0.0178\mbox{ mode} + 0.1798\mbox{ speechiness} -0.1227\mbox{ acousticness}$ $+ 0.119\mbox{ instrumentalness} -0.0928\mbox{ liveness} + 0.3141\mbox{ valence} -0.0009\mbox{ tempo}$ $+ 0\mbox{ duration} + 0.0185\mbox{ time signature}$

Characteristic	Beta	95% CI¹	p-value
year	0.00	0.00, 0.00	<0.001
energy	-0.26	-0.31, -0.20	<0.001
key	0.00	0.00, 0.00	0.720
loudness	0.00	0.00, 0.00	0.610
mode	-0.02	-0.03, -0.01	0.001
speechiness	0.18	0.12, 0.24	<0.001
acousticness	-0.12	-0.15, -0.09	<0.001
instrumentalness	0.12	0.06, 0.18	<0.001
liveness	-0.09	-0.13, -0.05	<0.001
valence	0.31	0.29, 0.34	<0.001
tempo	0.00	0.00, 0.00	<0.001
duration_ms	0.00	0.00, 0.00	0.614
time_signature	0.02	0.00, 0.04	0.106
¹ CI = Confidence Interval

Final Model

After removing the variables that are not significant predictors of danceability, the regression model is: $\hat{Y} = -16.8717 + 0.0088\mbox{ year} -0.2426\mbox{ energy} -0.017\mbox{ mode} + 23.877\mbox{ speechiness}$ $-0.1224\mbox{ acousticness} + 0.1145\mbox{ instrumentalness} -0.0946\mbox{ liveness} + 13.9302\mbox{ valence}$ $-0.0008\mbox{ tempo} -0.0118\mbox{ year} \times \mbox{ speechiness} -0.0068\mbox{ year } \times \mbox{ valence}$

The final model includes two interactions: year with speechiness and year with valence.

Characteristic	Beta	95% CI¹	p-value
year	0.01	0.01, 0.01	<0.001
energy	-0.24	-0.28, -0.20	<0.001
mode	-0.02	-0.03, -0.01	0.002
speechiness	24	3.7, 44	0.021
acousticness	-0.12	-0.15, -0.09	<0.001
instrumentalness	0.11	0.05, 0.17	<0.001
liveness	-0.09	-0.14, -0.05	<0.001
valence	14	5.1, 23	0.002
tempo	0.00	0.00, 0.00	<0.001
year * speechiness	-0.01	-0.02, 0.00	0.022
year * valence	-0.01	-0.01, 0.00	0.002
¹ CI = Confidence Interval

Testing Assumptions

Graphical Analysis

Interaction between Year and Speechiness

As year increases, the slope of speechiness decreases
- Relationship is weakening

Interaction between Year and Valence

As year increases, the slope of valence decreases
- Relationship is weakening
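The weakening relationships can be checked against the final model: with an interaction term, the slope of a predictor in a given year is its main coefficient plus the interaction coefficient times the year. The sketch below uses the rounded coefficients from the final model equation. Because those coefficients are rounded to four decimals and multiplied by years near 2000, the individual slopes may be off by roughly 0.1; the change between years is the more reliable quantity. The function name `slope` is hypothetical.

```python
# Rounded coefficients copied from the final model equation.
MAIN = {"speechiness": 23.877, "valence": 13.9302}
INTERACTION = {"speechiness": -0.0118, "valence": -0.0068}


def slope(predictor, year):
    """Slope of the predictor on danceability in the given year."""
    return MAIN[predictor] + INTERACTION[predictor] * year


for name in MAIN:
    change = slope(name, 2019) - slope(name, 2001)
    print(f"{name}: slope in 2001 = {slope(name, 2001):.4f}, "
          f"slope in 2019 = {slope(name, 2019):.4f}, change = {change:.4f}")
```

With these rounded values, the speechiness slope falls from about 0.2652 in 2001 to about 0.0528 in 2019, and the valence slope falls from about 0.3234 to about 0.2010, matching the direction described above.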

Conclusions

Conducted longitudinal analysis, expanding on prior projects
Variables used in final model:
- Year, energy, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, year * speechiness, year * valence
- All p-values are less than 0.05
Limitation not accounted for:
- The instance of one artist having multiple songs in the Top 100 Tracks

Impact

The datasets created for this project have been uploaded to Kaggle for future analysts to utilize.

Suggestions for Further Study

Look for distinct differences in the top songs of 2020
- COVID-19, riots, the election
Generational differences in music choices
- Public user profiles are accessible by the user’s URI through the Spotify Web API
- Compare the types of music listened to by different age groups