Chapter 2 Text Pre-Processing: Text to Data

First, we need to turn the corpus into a representation that lends itself to quantitative analyses of text. Several R packages are available for text mining, such as quanteda (Benoit et al. 2018), tm (Feinerer, Hornik, and Meyer 2008), and tidytext (Silge and Robinson 2016), the last of which we will rely on in this chapter. A broader overview of relevant packages can be found in the CRAN Task View on Natural Language Processing.
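
If you do not have these packages yet, they can all be installed from CRAN (a one-off setup step; adjust to your own environment as needed):

install.packages(c("tidyverse", "sotu", "tidytext", "stopwords", "SnowballC"))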

As you can probably tell from its name, tidytext follows the tidy data principles: every observation has its own row, every variable its own column, and every value its own cell. “Every observation is a row” translates here to “every token has its own row” – where a “token” does not necessarily have to be a single term, but can also be a so-called n-gram. In the following, we will demonstrate what text mining using tidy principles can look like in R. We will first cover the pre-processing of text using tidy data principles. Thereafter, we will delve into more advanced pre-processing, such as the lemmatization of words and part-of-speech (POS) tagging using spaCy (Honnibal and Montani 2017). Finally, different R packages use different representations of text data; depending on the task at hand, you will therefore have to be able to transform the data into the proper format. This will be covered in the final part.

2.1 Pre-processing with tidytext

The sotu package contains all of the so-called “State of the Union” addresses – which the president delivers to Congress annually – since 1790.

library(tidyverse)
library(sotu)
sotu_raw <- sotu_meta |> # metadata: president, year, party, etc.
  mutate(text = sotu_text) |> # add the addresses themselves as a text column
  distinct(text, .keep_all = TRUE) # drop duplicated texts

sotu_raw |> glimpse()
## Rows: 240
## Columns: 7
## $ X            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ president    <chr> "George Washington", "George Washington", "George Washing…
## $ year         <int> 1790, 1790, 1791, 1792, 1793, 1794, 1795, 1796, 1797, 179…
## $ years_active <chr> "1789-1793", "1789-1793", "1789-1793", "1789-1793", "1793…
## $ party        <chr> "Nonpartisan", "Nonpartisan", "Nonpartisan", "Nonpartisan…
## $ sotu_type    <chr> "speech", "speech", "speech", "speech", "speech", "speech…
## $ text         <chr> "Fellow-Citizens of the Senate and House of Representativ…

Now that the data are read in, I need to put them into the proper format and clean them. To get a sense of what needs to be done, I take a look at the first entry of the tibble.

sotu_raw |> slice(1) |> pull(text) |> str_sub(1, 500)
## [1] "Fellow-Citizens of the Senate and House of Representatives: \n\nI embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs. The recent accession of the important state of North Carolina to the Constitution of the United States (of which official information has been received), the rising credit and respectability of our country, the general and increasing good will toward the government of the Union, an"

2.1.1 unnest_tokens()

I will focus on the 20th-century SOTUs. Here, the dplyr::between() function comes in handy.

sotu_20cent_raw <- sotu_raw |> 
  filter(between(year, 1900, 2000))

In a first step, I bring the data into a form that facilitates manipulation: a tidy tibble. For this, I use tidytext’s unnest_tokens() function. It breaks the corpus up into tokens, i.e., the individual words. Let’s demonstrate this with a brief, intuitive example.

library(tidytext)
toy_example <- tibble(
  text = "Look, this is a brief example for how tokenization works."
)

toy_example |> 
  unnest_tokens(output = token, 
                input = text)
## # A tibble: 10 × 1
##    token       
##    <chr>       
##  1 look        
##  2 this        
##  3 is          
##  4 a           
##  5 brief       
##  6 example     
##  7 for         
##  8 how         
##  9 tokenization
## 10 works

Note that unnest_tokens() already reduces complexity for us by removing the comma and the full stop and by making everything lower-case.
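
As noted above, tokens do not have to be single words. If you pass token = "ngrams" together with an n, unnest_tokens() returns overlapping word sequences instead – a quick sketch using the toy example (the output column name bigram is just my choice):

toy_example |> 
  unnest_tokens(output = bigram, 
                input = text, 
                token = "ngrams", 
                n = 2)
# yields overlapping two-word sequences such as "look this", "this is", ...

Back to the SOTU addresses; tokenizing the whole corpus works just the same: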

sotu_20cent_tokenized <- sotu_20cent_raw |> 
  unnest_tokens(output = token, input = text)
glimpse(sotu_20cent_tokenized)
## Rows: 911,321
## Columns: 7
## $ X            <int> 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 11…
## $ president    <chr> "William McKinley", "William McKinley", "William McKinley…
## $ year         <int> 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 190…
## $ years_active <chr> "1897-1901", "1897-1901", "1897-1901", "1897-1901", "1897…
## $ party        <chr> "Republican", "Republican", "Republican", "Republican", "…
## $ sotu_type    <chr> "written", "written", "written", "written", "written", "w…
## $ token        <chr> "to", "the", "senate", "and", "house", "of", "representat…

The new tibble consists of 911,321 rows. Please note that you usually have to add some sort of id column to your original tibble before tokenizing it, e.g., by giving each case – representing a document, a chapter, or the like – its own id (e.g., using tibble::rowid_to_column(); see the sketch below). This is not necessary here because my original tibble came with a bunch of metadata (president, year, party) that serve as sufficient identifiers.
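
For illustration, adding such an id could look like this (a minimal sketch with a made-up two-document tibble):

tibble(text = c("This is the first document.", "And this is the second one.")) |> 
  rowid_to_column(var = "doc_id") |> 
  unnest_tokens(output = token, input = text)
# doc_id travels along with every token, so each token can be traced back to
# the document it came from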

2.1.2 Removal of unnecessary content

The next step is to remove stop words – they are not necessary for the analyses I want to perform. The stopwords package has a nice list for English.

library(stopwords)
stopwords_vec <- stopwords(language = "en")
# stopwords(language = "de") # the German equivalent
# stopwords_getlanguages(source = "snowball") # languages available for a given source
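
To get a sense of what will be thrown out, you can peek at the vector:

length(stopwords_vec) # how many terms are on the list
head(stopwords_vec) # the first couple of entries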

Removing the stop words now is straightforward:

sotu_20cent_tokenized_nostopwords <- sotu_20cent_tokenized |> 
  filter(!token %in% stopwords_vec)
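
An equivalent route, which the summary pipeline at the end of this section will use, is tidytext's get_stopwords(): it returns the stop words as a tibble with a word column that you can anti_join() against the tokens (a sketch; the result should match the filter() approach above, as both draw on the Snowball list by default):

sotu_20cent_tokenized |> 
  anti_join(get_stopwords(language = "en"), by = c("token" = "word"))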

Another thing I forgot to remove is digits. They do not matter for the analyses either:

sotu_20cent_tokenized_nostopwords_nonumbers <- sotu_20cent_tokenized_nostopwords |> 
  filter(!str_detect(token, "[:digit:]"))

The corpus now contains 19,263 different tokens, the so-called “vocabulary”; 1,848 tokens were removed from the vocabulary. The reduction in corpus size is much more pronounced, though: the new tibble consists of only 464,271 rows, basically a 50 percent reduction.
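
If you want to verify such figures yourself, dplyr's n_distinct() gives you the vocabulary size and nrow() the corpus size (a quick sketch; run the same calls on the earlier objects to see the respective drops):

n_distinct(sotu_20cent_tokenized_nostopwords_nonumbers$token) # vocabulary size
nrow(sotu_20cent_tokenized_nostopwords_nonumbers) # corpus size in tokens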

2.1.3 Stemming

To decrease the complexity of the vocabulary even further, we can reduce the tokens to their stem using the SnowballC package and its function wordStem():

library(SnowballC)
sotu_20cent_tokenized_nostopwords_nonumbers_stemmed <- sotu_20cent_tokenized_nostopwords_nonumbers |> 
  mutate(token_stemmed = wordStem(token, language = "en"))

# SnowballC::getStemLanguages() # if you want to know the abbreviations for other languages as well
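
To get an intuition for what the stemmer does, you can also feed it a few words directly (a quick illustration; the stems follow the Snowball English algorithm):

wordStem(c("government", "governments", "governing"), language = "en")
# all three should be reduced to the stem "govern"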

Maybe I should also remove insignificant words, i.e., ones that make up less than 0.5 percent of all tokens.

n_rows <- nrow(sotu_20cent_tokenized_nostopwords_nonumbers_stemmed)
sotu_20cent_tokenized_nostopwords_nonumbers_stemmed |> 
  group_by(token) |> 
  filter(n() > n_rows/200) # keep tokens that make up more than 0.5 percent of all tokens
## # A tibble: 13,406 × 8
## # Groups:   token [5]
##        X president         year years_active party sotu_type token token_stemmed
##    <int> <chr>            <int> <chr>        <chr> <chr>     <chr> <chr>        
##  1   112 William McKinley  1900 1897-1901    Repu… written   cong… congress     
##  2   112 William McKinley  1900 1897-1901    Repu… written   gove… govern       
##  3   112 William McKinley  1900 1897-1901    Repu… written   cong… congress     
##  4   112 William McKinley  1900 1897-1901    Repu… written   gove… govern       
##  5   112 William McKinley  1900 1897-1901    Repu… written   gove… govern       
##  6   112 William McKinley  1900 1897-1901    Repu… written   year  year         
##  7   112 William McKinley  1900 1897-1901    Repu… written   year  year         
##  8   112 William McKinley  1900 1897-1901    Repu… written   gove… govern       
##  9   112 William McKinley  1900 1897-1901    Repu… written   gove… govern       
## 10   112 William McKinley  1900 1897-1901    Repu… written   gove… govern       
## # ℹ 13,396 more rows

These steps have brought the vocabulary down from 19,263 different tokens to 10,971 different stems.

2.1.4 In a nutshell

Well, all of those steps can also be chained together in one neat cleaning pipeline:

sotu_20cent_clean <- sotu_raw |> 
  filter(between(year, 1900, 2000)) |> # 20th-century addresses only
  unnest_tokens(output = token, input = text) |> # tokenize
  anti_join(get_stopwords(), by = c("token" = "word")) |> # remove stop words
  filter(!str_detect(token, "[:digit:]")) |> # remove digits
  mutate(token = wordStem(token, language = "en")) |> # stem the tokens
  group_by(token) |> 
  filter(n() > n_rows/200) # drop rare tokens; n_rows was computed above

Now I have created a nice tibble containing the SOTU addresses of the 20th century in a tidy format. This is a great point of departure for subsequent analyses.

2.3 Exercises

  1. Use data from the friends R package. Take data from the first season. Pre-process it as we did earlier; each document should be one utterance.
# install.packages("friends")
library(friends)

dplyr::glimpse(friends)
  a. What are the 20 most used words (hint: use count() and slice_max() from the dplyr package)?
  b. Who said the most words?
  c. Remove stop words. How does this change your results from a.?
  d. Are there any other words among your 20 most used words that you would consider stop words because they do not convey much meaning? How does removing them alter your results?
  e. Pre-process the remaining seasons and perform the analysis from above. Can you visualize the results in a neat bar plot comparing the 5 most-used words in the different seasons? (We will learn more about this next week!)
  2. Use spacyr to investigate which countries were mentioned the most in the SOTU addresses over time in the 20th century. Do you find patterns? (Step by step: take the SOTUs; filter them (1900–2000); run spacyr::spacy_extract_entity(); filter geographical units; normalize them using str_replace_all() and a suitable pattern; plot them with ggplot2, date on the x-axis, count on the y-axis, colored by country.)
Solution.
# 1. tokenize season 1 of Friends
season_1_tokens <- friends |> 
  filter(season == 1) |> 
  unnest_tokens(token, text)

# a. the 20 most used words
season_1_tokens |> 
  count(token) |> 
  slice_max(n, n = 20)

# b. who said the most words
season_1_tokens |> 
  count(speaker) |> 
  arrange(-n)

# c. the 20 most used words after removing stop words
season_1_tokens |> 
  filter(!token %in% stopwords_vec) |> 
  count(token) |> 
  slice_max(n, n = 20)

# d. remove additional, corpus-specific stop words as well
stopwords_friends_specific <- c("oh", "just", "know", "like", "yeah", "uh", "well", "okay", "hey", "right", "gonna", "ok")
season_1_tokens |> 
  filter(!token %in% stopwords_vec) |> 
  filter(!token %in% stopwords_friends_specific) |> 
  count(token) |> 
  slice_max(n, n = 20)

# e. the 5 most-used words per season, visualized
friends |> 
  unnest_tokens(token, text) |> 
  count(token, season) |> 
  filter(!token %in% stopwords_vec) |> 
  filter(!token %in% stopwords_friends_specific) |> 
  group_by(season) |> 
  slice_max(n, n = 5) |> 
  ungroup() |> 
  mutate(token = reorder_within(token, n, season)) |> 
  ggplot() +
  geom_col(aes(x = n, y = token)) +
  facet_wrap(vars(season), scales = "free") + 
  scale_y_reordered()

References

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. “Quanteda: An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3 (30): 774. https://doi.org/10.21105/joss.00774.
Feinerer, Ingo, Kurt Hornik, and David Meyer. 2008. “Text Mining Infrastructure in R.” Journal of Statistical Software 25 (5). https://doi.org/10.18637/jss.v025.i05.
Honnibal, Matthew, and Ines Montani. 2017. “spaCy 2: Natural Language Understanding with Bloom Embeddings, Convolutional Neural Networks and Incremental Parsing.”
Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” The Journal of Open Source Software 1 (3): 37. https://doi.org/10.21105/joss.00037.
