Text as Data
Public Enemy released the album Fear of a Black Planet on April 10, 1990. One song that may be familiar is Fight the Power. We will examine the lyrics of this song by counting its most frequent words. The data is included, and it was simple to construct: I searched the web for the lyrics, pasted them into a spreadsheet, and saved the result as a .csv file. The extension stands for comma-separated values, a standard plain-text format for exchanging tabular data. Most spreadsheet programs can save files as .csv, and R reads them easily whether they contain text or numbers.
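For readers who would rather skip the spreadsheet step, a file of the same shape can be built directly in R. The sketch below is only an illustration: the data frame name lyrics, the two sample rows, and the temporary file path are assumptions made for the example, not the author's actual file.

```r
# Hypothetical stand-in for the spreadsheet: one row per line of lyrics
lyrics <- data.frame(
  song  = "Fight the Power",
  words = c("Sound of the funky drummer", "(Brothers and sisters, hey)"),
  stringsAsFactors = FALSE
)

# Write to a temporary file (assumed path; the tutorial's file is public_enemy.csv)
csv_path <- tempfile(fileext = ".csv")
write.csv(lyrics, csv_path, row.names = FALSE)

# Inspect the raw text: a header row, then one comma-separated record per line
raw_lines <- readLines(csv_path)
print(raw_lines)
```

Each text field is wrapped in double quotes, which is what keeps a comma inside a lyric line from being read as a column separator.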
Reading the Data as Text
First, we take the .csv file, which we have saved in the same directory as the .Rmd file or .ipynb notebook we are working in. Next, we read the file in using the read.csv() function. In versions of R before 4.0.0, text columns in a spreadsheet were converted by default to what R calls a factor. That often makes sense: in our earlier example of irises, the species names would be factors, where each name is associated with an integer code. We avoid this here because we want to work with the lyrics as character strings that can be split into individual words. This is the job of the crucial stringsAsFactors = FALSE argument, which I forgot when first trying this! Since R 4.0.0 it is already the default, but stating it explicitly keeps the code working on older versions.
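library(knitr)  # provides kable() for printing tables
# Read the lyrics, keeping text columns as character strings rather than factors
ftp <- read.csv("public_enemy.csv", stringsAsFactors = FALSE)
kable(head(ftp))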
song | words |
---|---|
Fight the Power | 1989 the number another summer (get down) |
Fight the Power | Sound of the funky drummer |
Fight the Power | Music hittin’ your heart cause I know you got soul |
Fight the Power | (Brothers and sisters, hey) |
Fight the Power | Listen if you’re missin’ y’all |
Fight the Power | Swingin’ while I’m singin’ |
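To see what the stringsAsFactors argument changes, here is a minimal sketch that reads the same small CSV text both ways. The CSV is inlined through the text argument so the example does not depend on a file; the object names csv_text, as_factor, and as_text are assumptions made for illustration.

```r
# Hypothetical two-row CSV, inlined so no file is needed
csv_text <- 'song,words
"Fight the Power","Sound of the funky drummer"
"Fight the Power","(Brothers and sisters, hey)"'

# Read the same text twice, changing only stringsAsFactors
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_text   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$words)  # "factor": labels backed by integer codes
class(as_text$words)    # "character": plain strings, ready to be split

# The integer codes that sit behind a factor column
as.integer(as_factor$song)
```

A factor is a good fit for a small set of repeated categories such as species names, but splitting lyrics into words requires the character version.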
We now have a data frame in which each row holds one line of lyrics from the song. The simplest way to begin investigating the lyrics is to count the most frequent words. To do this, we will use the tidytext library to put the lyrics into tidy text format. In the wonderful book Text Mining with R, which is freely available online, Julia Silge and David Robinson define the tidy text format as a table with one token (here, one word) per row. It builds on the key characteristics of tidy data:
Each variable is a column
Each observation is a row
Each type of observational unit is a table
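The unnest_tokens(word, words) command splits the text in the words column into individual words and places each one on its own row, in a new column named word. By default it also converts the text to lowercase and strips punctuation. We store the result in a new data frame called tidy_ftp.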
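library(dplyr)     # provides the %>% pipe
library(tidytext)  # provides unnest_tokens()
# Split each lyric line into one lowercase word per row
tidy_ftp <- ftp %>%
  unnest_tokens(word, words)
kable(head(tidy_ftp))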
 | song | word |
---|---|---|
1 | Fight the Power | 1989 |
1.1 | Fight the Power | the |
1.2 | Fight the Power | number |
1.3 | Fight the Power | another |
1.4 | Fight the Power | summer |
1.5 | Fight the Power | get |
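The argument order of unnest_tokens() is easy to mix up, so here is a small sketch on a two-row data frame. It assumes the tidytext package is installed; the object names toy and toy_tokens are hypothetical, and the two lyric lines are taken from the table above.

```r
library(tidytext)

# Hypothetical two-row input with the same columns as ftp
toy <- data.frame(
  song  = "Fight the Power",
  words = c("Sound of the funky drummer", "(Brothers and sisters, hey)"),
  stringsAsFactors = FALSE
)

# After the data: first the name of the NEW column (word),
# then the name of the EXISTING text column to split (words)
toy_tokens <- unnest_tokens(toy, word, words)

print(toy_tokens$word)  # lowercase words, punctuation removed
nrow(toy_tokens)        # 9: one row per word
names(toy_tokens)       # "song" "word": the input column is dropped
```

The song column is repeated on every row, so each word keeps track of where it came from, which matters once more than one song is in the table.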