Text as Data

Public Enemy released the album Fear of a Black Planet on April 10, 1990. One song that is perhaps familiar is Fight the Power. We will examine the nature of the lyrics of this song by counting the most popular words. The data is included; it was simple to construct. I searched the web for the lyrics of the song, pasted them into a spreadsheet, and saved this as a .csv file. The extension stands for comma-separated values, a standard plain-text format for tabular data. Most spreadsheet programs can save files as .csv files, and these are easily read by R whether they contain text or numbers.

Reading the Data as Text

First, we take the .csv file, which we’ve saved in the same directory as the .Rmd file or .ipynb notebook we are working with, and read it in using the read.csv() function. In versions of R before 4.0.0, read.csv() converted text columns to what is called a “Factor” by default. This often makes sense: in our earlier example of irises, the species names would be factors, where each name is associated with a numeric value. We avoid this here because we want to work with the lyrics as character strings that can be split into individual words. This is the crucial stringsAsFactors = FALSE argument that I forgot when first trying this! (Since R 4.0.0, FALSE is the default, but stating it explicitly does no harm.)

library(knitr)  # provides kable() for printing tables
ftp <- read.csv("public_enemy.csv", stringsAsFactors = FALSE)
kable(head(ftp))
| song | words |
|:--|:--|
| Fight the Power | 1989 the number another summer (get down) |
| Fight the Power | Sound of the funky drummer |
| Fight the Power | Music hittin’ your heart cause I know you got soul |
| Fight the Power | (Brothers and sisters, hey) |
| Fight the Power | Listen if you’re missin’ y’all |
| Fight the Power | Swingin’ while I’m singin’ |
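
To see what the stringsAsFactors argument changes, here is a minimal sketch that reads the same small piece of text twice. The csv_text string and the names as_factor and as_text are made up for illustration and are not part of the lyrics data; the text argument of read.csv() lets us read from a string instead of a file.

```r
# Hypothetical two-line CSV, supplied as a string rather than a file
csv_text <- "song,words\nExample Song,first line of text\nExample Song,second line of text"

# Read it once with text columns converted to factors, once kept as characters
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_text   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$words)       # "factor"
as.integer(as_factor$words)  # the integer codes stored behind the labels
class(as_text$words)         # "character"
```

A factor stores each distinct line as an integer code with a label, which suits categories such as species names but not text we intend to split into words.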

Now we have a dataframe in which each row contains one line of lyrics from the song. The first and simplest way we will investigate the lyrics is by counting the most popular words. To do this, we will use the tidytext library to put the lyrics into tidy text format. The wonderful book Text Mining with R by Julia Silge and David Robinson, freely available online, describes the key characteristics of tidy data as follows:

  • Each variable is a column

  • Each observation is a row

  • Each type of observational unit is a table

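The unnest_tokens(word, words) command splits the text in the words column into individual words and places each one in its own row, in a new column named word. By default it also converts the words to lowercase and strips punctuation. We store the result in a new dataframe called tidy_ftp.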

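library(dplyr)     # provides the %>% pipe
library(tidytext)  # provides unnest_tokens()
# Split the words column into one token per row, stored in a column named word
tidy_ftp <- ftp %>%
  unnest_tokens(word, words)
kable(head(tidy_ftp))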
|     | song | word |
|:--|:--|:--|
| 1 | Fight the Power | 1989 |
| 1.1 | Fight the Power | the |
| 1.2 | Fight the Power | number |
| 1.3 | Fight the Power | another |
| 1.4 | Fight the Power | summer |
| 1.5 | Fight the Power | get |
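
The same step can be tried on a tiny made-up dataframe to see exactly what unnest_tokens() does to each line. This is only a sketch: the toy dataframe and its contents are hypothetical, and it assumes the tidytext package is installed.

```r
library(tidytext)

# Hypothetical two-line "song", built the same way as ftp: one row per line
toy <- data.frame(
  song  = c("Example Song", "Example Song"),
  words = c("Hello, world (hey)", "Hello again"),
  stringsAsFactors = FALSE
)

# One row per word; the original words column is replaced by word
toy_tidy <- unnest_tokens(toy, word, words)

toy_tidy$word    # "hello" "world" "hey" "hello" "again"
nrow(toy_tidy)   # 5 rows, one per token
```

Note that the comma and parentheses are gone and "Hello" has become "hello", so the two occurrences will be counted as the same word.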