1 Introduction

1.1 Statement of Problem

Everyone loves a good bop. A great song can get you dancing no matter the time or place. Spotify is a modern music-lover’s go-to listening platform. The music that is trending annually can define an entire year. Top 100 songs can be heard anywhere from grocery stores to sports functions. There are many aspects that contribute to a song reaching the Top 100. The purpose of this research is to assess what factors impact the trendiness and danceability of a song over time.

1.2 Relevance of Problem

The subject matter for this research was inspired by projects in previous courses that analyzed datasets pulled from yearly Spotify playlists and fit linear regression models. This research continues that work, now with longitudinal analysis, which determines how variables have changed over time. The datasets used in previous courses were found on Kaggle [1]; however, only two years' worth of data was available. To continue the analysis of yearly Top 100 Tracks, more data was needed. The datasets used previously were created by Nadin Tamer. After Tamer was contacted, the GitHub page containing the Python code used to extract the data was accessed. At this point, a second problem surfaced: extracting the data needed for this research required a more extensive knowledge of Python, as well as research into Spotipy, the Python library for the Spotify Web API.

After solving these problems, the goal was to create a linear regression model that compares data across years and to determine which factors help get a song into the Top 100 Tracks. Then, using the variables found to be significant, it could be determined how those variables have changed over time.

1.3 Literature Review

Previous Research

In April 2019, through a group project in Biostatistics, data from the Top 100 Tracks of 2017 was analyzed [2]. As an outcome of this project, it was determined that valence and energy were both significant predictors of danceability. In November 2019, another group project was conducted in Regression Analysis where the data from the Top 100 Tracks of 2018 was analyzed [3]. With a more comprehensive knowledge of statistical modeling, this project provided a more extensive model of the variables. It was ascertained that, along with valence and energy, the variables speechiness, acousticness, and tempo are also significant predictors of danceability.
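To make the idea of regressing danceability on valence and energy concrete, a minimal sketch follows. It is not the model from the earlier projects: the function name fit_ols and the five data rows are hypothetical, and the rows are made-up values constructed to lie exactly on a plane so that the fitted coefficients are easy to check. Significance testing, which the earlier projects relied on, would require a statistics package and is not shown.

```python
def fit_ols(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... via the normal equations.

    Hypothetical helper for illustration. Assumes the predictors are not
    collinear; otherwise the elimination below divides by zero.
    """
    rows = [[1.0] + list(r) for r in X]  # prepend an intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Made-up (valence, energy) pairs, NOT Spotify data.
predictors = [(0.2, 0.5), (0.4, 0.9), (0.6, 0.3), (0.8, 0.7), (0.5, 0.6)]
# Danceability built as 0.3 + 0.4*valence + 0.2*energy so the fit is checkable.
danceability = [0.3 + 0.4 * v + 0.2 * e for v, e in predictors]

intercept, b_valence, b_energy = fit_ols(predictors, danceability)
```

Because the illustrative response was generated without noise, the fit should recover the intercept and the two slopes used to build it; with real track data the coefficients would be estimates with associated uncertainty.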

Ample research has been done regarding Top 100 Tracks, mostly from US Billboard Top 100 Hits. Christenson, Roberts, and Bjork conducted a content analysis of drug and alcohol use in the top 100 Billboard songs from 1968, 1978, 1988, 1998, and 2008 [4]. They found an 18% increase in drug and alcohol references in songs in just two decades. Jensen and Hebert used the US Billboard Top 100 Hits from 1940 to 2015 in their implementation of the “Jensen Chroma Complexity” computational strategy [5]. Kawawa-Beaudan and Garza used the Million Songs Dataset to predict whether a song would make it into the Billboard Top 100 hits using logistic regression [6]. They found that their “best classifier based on recall and precision rate was a Gaussian discriminant model on the metadata features [6].” Durr expanded on previous research done by Lafrance, Worcester, and Burns [7] to include race in the analysis of gender trends of the Billboard Top 40 hits from 1997 to 2007 [8]. In her categorical data analysis, she found that race and gender discrimination play a significant part in the artists whose tracks make it into the Top 40 hits.

Many other researchers have used Spotipy to extract data from Spotify. Menten, Ng, O’Rourke, and Holmes used Spotipy to extract a list of song IDs in order to pull the audio features for the Million Songs Dataset [9]. Using temporal analysis, this research concluded that “changes in the energy and acousticness of songs, combined with decreased correlations between tempo and valence, support preconceived notions that in recent years, 4-beat rock-n-roll style songs, which show heavy correlations between tempo and valence, are being increasingly pushed out of the mainstream and replaced by electronically produced songs of varying tempos [9].”

Another use of the Spotipy library was to assess what songs will play more expensive advertisements using machine learning [10]. Machine learning is one use of artificial intelligence. Kavikondala, Muppalla, Krishna Prakasha, and Acharya stated, “ML enables the machine or a system to learn and improve its performance with experience, without being explicitly programmed [10].” The researchers were able to develop a program where new songs could be input and, by machine learning, the program would update itself so as to not become obsolete.

Surana, Goyal, and Alluri used Spotipy to extract datapoints in order to conduct a correlational analysis between the type of music listened to and depression risk [11]. Using the variables valence and energy, four quadrants were established: one for happiness, one for anger, one for sadness, and one for tenderness. They concluded that users at risk of depression generally listened to low energy, low valence songs, and users with high anxiety listened to low energy songs.
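The quadrant scheme described above can be sketched in a few lines. The sketch rests on two assumptions that are not stated in the summary of [11]: that the split point on each axis is 0.5 (Spotify reports valence and energy on a 0 to 1 scale), and that the quadrants follow the common valence-arousal layout, with happiness at high valence and high energy, anger at low valence and high energy, sadness at low valence and low energy, and tenderness at high valence and low energy. The function name mood_quadrant is hypothetical.

```python
def mood_quadrant(valence, energy, midpoint=0.5):
    """Assign a track to one of four mood quadrants.

    The midpoint and the quadrant-to-label mapping are assumptions made
    for illustration, not values taken from the cited study.
    """
    high_valence = valence >= midpoint
    high_energy = energy >= midpoint
    if high_valence and high_energy:
        return "happiness"
    if not high_valence and high_energy:
        return "anger"
    if not high_valence and not high_energy:
        return "sadness"
    return "tenderness"  # high valence, low energy


# A low energy, low valence track falls in the sadness quadrant.
label = mood_quadrant(valence=0.2, energy=0.3)
```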

Misael et al. used Spotipy to extract songs from different genres of music to compare the lyrics used over time [12]. They found reggae and hip hop songs to have the highest danceability of the genres they tested. They also used linear discriminant analysis to find common topics in song lyrics.

Trygve and Gunnarsson researched entertainment systems that are used on planes and “created a system that could give personalized suggestions without the need of creating a login [13].” In this predictive analysis, they used Spotipy to extract data on the songs that passengers had listened to in flight. Then, they were able to recommend songs for passengers to listen to during their plane ride. The Spotipy library has very diverse applications; however, its main purpose is to extract data from Spotify tracks.
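As an illustration of that main purpose, the sketch below shows one way the audio features for the tracks in a playlist might be pulled with Spotipy. It is not the code used by Tamer or by any of the studies cited above. The function name top_track_features is hypothetical, the playlist ID is a placeholder, and the sketch assumes a playlist of at most 100 tracks and an application whose credentials permit access to the audio features endpoint.

```python
def top_track_features(sp, playlist_id):
    """Return one audio-feature dict per track in a playlist.

    `sp` is assumed to behave like a spotipy.Spotify client. Only the
    first page of the playlist (up to 100 items) is read.
    """
    page = sp.playlist_items(playlist_id, limit=100)
    # Skip entries whose track is missing (for example, removed tracks).
    ids = [item["track"]["id"] for item in page["items"] if item.get("track")]
    features = sp.audio_features(ids)  # at most 100 track IDs per call
    # The API returns None for tracks without features; drop those.
    return [f for f in features if f is not None]


# Usage sketch (needs network access and credentials, so left commented out).
# SpotifyClientCredentials reads SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET
# from the environment.
#
# import spotipy
# from spotipy.oauth2 import SpotifyClientCredentials
#
# sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())
# rows = top_track_features(sp, "PLAYLIST_ID_PLACEHOLDER")
# print(rows[0]["danceability"], rows[0]["valence"], rows[0]["energy"])
```

Passing the client in as an argument keeps the function independent of how authentication is set up and allows it to be exercised with a stand-in object when no credentials are available.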