4 Text Preprocessing and Featurization
Having learned the basics of string manipulation, we now turn to how you can transform your collection of documents, your corpus, into a representation that lends itself nicely to quantitative analyses of text. There are several packages you can use for text mining, such as quanteda
(Benoit et al. 2018), tm
(Feinerer, Hornik, and Meyer 2008), and tidytext
(Silge and Robinson 2016), the latter probably being the most recent addition. A larger overview of relevant packages can be found on this CRAN Task View.
As you can probably tell from its name, tidytext
obeys the tidy data principles. "Every observation is a row" translates here to "every token has its own row", where "token" does not necessarily refer to a single term but can also denote a so-called n-gram. In the following, we will demonstrate what text mining using tidy principles can look like in R. First, we cover the preprocessing of text using tidy data principles. Thereafter, we delve into more advanced preprocessing such as the lemmatization of words and part-of-speech (POS) tagging using spaCy
(Honnibal and Montani 2017). Finally, different R packages use different representations of text data. Depending on the task at hand, you will therefore have to be able to transform the data into the proper format. This is covered in the final part.
4.1 Pre-processing with tidytext
The sotu
package contains all of the so-called "State of the Union" addresses, which the president delivers to Congress annually, since 1790.
needs(hcandersenr, SnowballC, sotu, spacyr, stopwords, tidyverse, tidytext)
sotu_raw <- sotu_meta |>
  mutate(text = sotu_text) |>
  distinct(text, .keep_all = TRUE)

sotu_raw |> glimpse()
Rows: 240
Columns: 7
$ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ president <chr> "George Washington", "George Washington", "George Washing…
$ year <int> 1790, 1790, 1791, 1792, 1793, 1794, 1795, 1796, 1797, 179…
$ years_active <chr> "1789-1793", "1789-1793", "1789-1793", "1789-1793", "1793…
$ party <chr> "Nonpartisan", "Nonpartisan", "Nonpartisan", "Nonpartisan…
$ sotu_type <chr> "speech", "speech", "speech", "speech", "speech", "speech…
$ text <chr> "Fellow-Citizens of the Senate and House of Representativ…
Now that the data are read in, I need to put them into the proper format and clean them. For this purpose, I take a look at the first entry of the tibble.
sotu_raw |> slice(1) |> pull(text) |> str_sub(1, 500)
[1] "Fellow-Citizens of the Senate and House of Representatives: \n\nI embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs. The recent accession of the important state of North Carolina to the Constitution of the United States (of which official information has been received), the rising credit and respectability of our country, the general and increasing good will toward the government of the Union, an"
4.1.1 unnest_tokens()
I will focus on the 20th-century SOTUs. Here, the dplyr::between()
function comes in handy.
sotu_20cent_raw <- sotu_raw |>
  filter(between(year, 1900, 2000))
glimpse(sotu_20cent_raw)
Rows: 109
Columns: 7
$ X <int> 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 12…
$ president <chr> "William McKinley", "Theodore Roosevelt", "Theodore Roose…
$ year <int> 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 190…
$ years_active <chr> "1897-1901", "1901-1905", "1901-1905", "1901-1905", "1901…
$ party <chr> "Republican", "Republican", "Republican", "Republican", "…
$ sotu_type <chr> "written", "written", "written", "written", "written", "w…
$ text <chr> "\n\n To the Senate and House of Representatives: \n\nAt …
In a first step, I bring the data into a form that facilitates manipulation: a tidy tibble. For this, I use tidytext
's unnest_tokens()
function. It basically breaks the corpus up into tokens, i.e., the respective words. Let's demonstrate that with a brief, intuitive example.
toy_example <- tibble(
  text = "Look, this is a brief example for how tokenization works."
)

toy_example |>
  unnest_tokens(output = token,
                input = text)
# A tibble: 10 × 1
token
<chr>
1 look
2 this
3 is
4 a
5 brief
6 example
7 for
8 how
9 tokenization
10 works
Note that unnest_tokens()
already reduces complexity for us by removing the comma and the full stop and by making everything lowercase.
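As mentioned at the beginning of the chapter, tokens do not have to be single words. If you wanted, say, bigrams instead, you could point unnest_tokens() to a different tokenizer. A minimal sketch, reusing the toy example from above:

# sketch: tokenize into bigrams (two-word sequences) instead of single words
toy_example |>
  unnest_tokens(output = bigram,
                input = text,
                token = "ngrams",
                n = 2)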
sotu_20cent_tokenized <- sotu_20cent_raw |>
  unnest_tokens(output = token, input = text)
glimpse(sotu_20cent_tokenized)
Rows: 911,321
Columns: 7
$ X <int> 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 11…
$ president <chr> "William McKinley", "William McKinley", "William McKinley…
$ year <int> 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 190…
$ years_active <chr> "1897-1901", "1897-1901", "1897-1901", "1897-1901", "1897…
$ party <chr> "Republican", "Republican", "Republican", "Republican", "…
$ sotu_type <chr> "written", "written", "written", "written", "written", "w…
$ token <chr> "to", "the", "senate", "and", "house", "of", "representat…
The new tibble consists of 911,321 rows. Please note that usually, you have to put some sort of id column into your original tibble before tokenizing it, e.g., by giving each case (representing a document, a chapter, or whatever else) a separate id, for instance using tibble::rowid_to_column()
. This does not apply here, because my original tibble came with a bunch of metadata (president, year, party) which serve as sufficient identifiers.
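For completeness, a minimal sketch of what adding such an id column before tokenizing could look like, using the toy example from above (the column name doc_id is my own choice):

# sketch: give each document its own id before tokenization
toy_example |>
  rowid_to_column(var = "doc_id") |>
  unnest_tokens(output = token, input = text)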
4.1.2 Removal of unnecessary content
The next step is to remove stop words – they are not necessary for the analyses I want to perform. The stopwords
package has a nice list for English.
stopwords_vec <- stopwords(language = "en")
stopwords(language = "de") # the German equivalent
[1] "aber" "alle" "allem" "allen" "aller" "alles"
[7] "als" "also" "am" "an" "ander" "andere"
[13] "anderem" "anderen" "anderer" "anderes" "anderm" "andern"
[19] "anderr" "anders" "auch" "auf" "aus" "bei"
[25] "bin" "bis" "bist" "da" "damit" "dann"
[31] "der" "den" "des" "dem" "die" "das"
[37] "daß" "derselbe" "derselben" "denselben" "desselben" "demselben"
[43] "dieselbe" "dieselben" "dasselbe" "dazu" "dein" "deine"
[49] "deinem" "deinen" "deiner" "deines" "denn" "derer"
[55] "dessen" "dich" "dir" "du" "dies" "diese"
[61] "diesem" "diesen" "dieser" "dieses" "doch" "dort"
[67] "durch" "ein" "eine" "einem" "einen" "einer"
[73] "eines" "einig" "einige" "einigem" "einigen" "einiger"
[79] "einiges" "einmal" "er" "ihn" "ihm" "es"
[85] "etwas" "euer" "eure" "eurem" "euren" "eurer"
[91] "eures" "für" "gegen" "gewesen" "hab" "habe"
[97] "haben" "hat" "hatte" "hatten" "hier" "hin"
[103] "hinter" "ich" "mich" "mir" "ihr" "ihre"
[109] "ihrem" "ihren" "ihrer" "ihres" "euch" "im"
[115] "in" "indem" "ins" "ist" "jede" "jedem"
[121] "jeden" "jeder" "jedes" "jene" "jenem" "jenen"
[127] "jener" "jenes" "jetzt" "kann" "kein" "keine"
[133] "keinem" "keinen" "keiner" "keines" "können" "könnte"
[139] "machen" "man" "manche" "manchem" "manchen" "mancher"
[145] "manches" "mein" "meine" "meinem" "meinen" "meiner"
[151] "meines" "mit" "muss" "musste" "nach" "nicht"
[157] "nichts" "noch" "nun" "nur" "ob" "oder"
[163] "ohne" "sehr" "sein" "seine" "seinem" "seinen"
[169] "seiner" "seines" "selbst" "sich" "sie" "ihnen"
[175] "sind" "so" "solche" "solchem" "solchen" "solcher"
[181] "solches" "soll" "sollte" "sondern" "sonst" "über"
[187] "um" "und" "uns" "unse" "unsem" "unsen"
[193] "unser" "unses" "unter" "viel" "vom" "von"
[199] "vor" "während" "war" "waren" "warst" "was"
[205] "weg" "weil" "weiter" "welche" "welchem" "welchen"
[211] "welcher" "welches" "wenn" "werde" "werden" "wie"
[217] "wieder" "will" "wir" "wird" "wirst" "wo"
[223] "wollen" "wollte" "würde" "würden" "zu" "zum"
[229] "zur" "zwar" "zwischen"
#stopwords_getlanguages(source = "snowball") # find the languages that are available
#stopwords_getsources() # find the dictionaries that are available
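The stopwords() function is not limited to the default Snowball lists. As a brief sketch, assuming you prefer a more aggressive list, you could draw on the SMART source for English instead:

# sketch: the SMART stop word list is considerably longer than the Snowball default
stopwords(language = "en", source = "smart") |> length()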
Removing the stop words is now straightforward:
sotu_20cent_tokenized_nostopwords <- sotu_20cent_tokenized |>
  filter(!token %in% stopwords_vec)
Another thing I forgot to remove is digits. They do not matter for the analyses either:
sotu_20cent_tokenized_nostopwords_nonumbers <- sotu_20cent_tokenized_nostopwords |>
  filter(!str_detect(token, "[:digit:]"))
The corpus now contains 19,263 different tokens, the so-called "vocabulary"; only 1,848 tokens were removed from the vocabulary. In terms of corpus size, though, this translates to a significant reduction: the new tibble consists of just 464,271 rows, basically a 50 percent reduction.
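If you want to check such figures yourself, the vocabulary size and the number of remaining rows can be obtained directly; a quick sketch:

# sketch: vocabulary size (distinct tokens) and corpus size (rows) after cleaning
n_distinct(sotu_20cent_tokenized_nostopwords_nonumbers$token)
nrow(sotu_20cent_tokenized_nostopwords_nonumbers)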
4.1.3 Stemming
To decrease the complexity of the vocabulary even further, we can reduce the tokens to their stem using the SnowballC
package and its function wordStem()
:
sotu_20cent_tokenized_nostopwords_nonumbers_stemmed <- sotu_20cent_tokenized_nostopwords_nonumbers |>
  mutate(token_stemmed = wordStem(token, language = "en"))
#SnowballC::getStemLanguages() # if you want to know the abbreviations for other languages as well
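To get a feeling for what the stemmer does, you can also feed wordStem() a small character vector directly; a brief sketch with a few hand-picked words:

# sketch: compare the resulting stems with the original words
wordStem(c("house", "century", "sessions", "representatives"), language = "en")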
Maybe I should also remove insignificant words, i.e., ones that appear less than 0.05 percent of the time.
n_rows <- nrow(sotu_20cent_tokenized_nostopwords_nonumbers_stemmed)

sotu_20cent_tokenized_nostopwords_nonumbers_stemmed |>
  group_by(token_stemmed) |>
  filter(n() > n_rows/2000)
# A tibble: 285,203 × 8
# Groups: token_stemmed [490]
X president year years_active party sotu_type token token_stemmed
<int> <chr> <int> <chr> <chr> <chr> <chr> <chr>
1 112 William McKinley 1900 1897-1901 Repu… written sena… senat
2 112 William McKinley 1900 1897-1901 Repu… written house hous
3 112 William McKinley 1900 1897-1901 Repu… written repr… repres
4 112 William McKinley 1900 1897-1901 Repu… written old old
5 112 William McKinley 1900 1897-1901 Repu… written inco… incom
6 112 William McKinley 1900 1897-1901 Repu… written new new
7 112 William McKinley 1900 1897-1901 Repu… written cent… centuri
8 112 William McKinley 1900 1897-1901 Repu… written begin begin
9 112 William McKinley 1900 1897-1901 Repu… written last last
10 112 William McKinley 1900 1897-1901 Repu… written sess… session
# ℹ 285,193 more rows
These steps have brought down the vocabulary from 19,263 to 10,971 tokens.
4.1.4 In a nutshell
Well, all those things could also be summarized in one nice cleaning pipeline:
sotu_20cent_clean <- sotu_raw |>
  filter(between(year, 1900, 2000)) |>
  unnest_tokens(output = token, input = text) |>
  anti_join(get_stopwords(), by = c("token" = "word")) |>
  filter(!str_detect(token, "[0-9]")) |>
  mutate(token = wordStem(token, language = "en")) |>
  group_by(token) |>
  filter(n() > n_rows/2000)
Now I have created a nice tibble containing the SOTU addresses of the 20th century in a tidy format. This is a great point of departure for subsequent analyses.
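For instance, a first look at the most frequent stemmed tokens per party is now only a few lines away; a brief sketch (note the ungroup(), since the cleaning pipeline left the tibble grouped by token):

# sketch: most frequent stemmed tokens per party
sotu_20cent_clean |>
  ungroup() |>
  count(party, token, sort = TRUE)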
4.1.5 Exercises
- Download Twitter timeline data (timelines <- read_csv("https://www.dropbox.com/s/dpu5m3xqz4u4nv7/tweets_house_rep_party.csv?dl=1") |> filter(!is.na(party))). Let's look at abortion-related tweets and how the language may differ between parties. Filter relevant tweets using a vector of keywords and a regular expression (hint: filter(str_detect(text, str_c(keywords, collapse = "|")))). Preprocess the tweets as follows (one possible approach is sketched after the setup code below):
- Unnest the tokens.
- Remove stop words.
- Perform stemming.
needs(tidyverse, tidytext, stopwords, SnowballC)
timelines <- read_csv("https://www.dropbox.com/s/dpu5m3xqz4u4nv7/tweets_house_rep_party.csv?dl=1") |>
  filter(!is.na(party))

keywords <- c("abortion", "prolife", " roe ", " wade ", "roevswade", "baby", "fetus", "womb", "prochoice", "leak")
4.2 Converting between formats
While the tidytext
format lends itself nicely to "basic" operations and visualizations, you will have to use different representations of text data for other applications such as topic models or word embeddings. Conversely, you might want to harness, for instance, the ggplot2
package for visualization, in which case you will need to project the data back into a tidy format. The former operations are performed using the various cast_.*()
functions, the latter using the tidy()
function from the broom
package, whose purpose is to bring data from foreign structures into tidy representations.
In the following, I will briefly explain common representations and the packages that use them. In doing so, I draw heavily on the chapter in Tidy Text Mining with R that is dedicated to this topic.
4.2.1 Document-term matrix
In a document-term matrix (DTM), each row represents a document and each column a term. The values usually correspond to how often a term appears in the respective document.
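To make the structure concrete, here is a tiny, made-up illustration with two documents and four terms (the document and term names are, of course, invented):

# sketch: a toy document-term matrix as a plain R matrix
toy_dtm <- matrix(
  c(2, 0, 1, 1,  # counts for doc_1
    0, 3, 1, 0), # counts for doc_2
  nrow = 2, byrow = TRUE,
  dimnames = list(Docs = c("doc_1", "doc_2"),
                  Terms = c("economy", "freedom", "nation", "peace"))
)
toy_dtm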
In R, a common implementation of DTMs is the DocumentTermMatrix
class in the tm
package. The topicmodels
package, which we will use for performing LDA, comes with a collection of example data.
needs(topicmodels)
data("AssociatedPress")
class(AssociatedPress)
[1] "DocumentTermMatrix" "simple_triplet_matrix"
AssociatedPress
<<DocumentTermMatrix (documents: 2246, terms: 10473)>>
Non-/sparse entries: 302031/23220327
Sparsity : 99%
Maximal term length: 18
Weighting : term frequency (tf)
This data set contains 2,246 Associated Press articles which consist of 10,473 different terms. Moreover, the matrix is 99% sparse, meaning that roughly 99% of the document-term pairs are zero. The weighting is by term frequency, hence the values correspond to the number of times a word appears in an article.
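As a quick sanity check on the sparsity figure, you can relate the number of non-sparse entries reported above to the total number of document-term pairs:

# sketch: share of non-zero entries, roughly 1.3 percent, i.e., ~99% sparse
302031 / (2246 * 10473)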
AssociatedPress |>
  head(2) |>
  as.matrix() %>%
  .[, 1:10]
Terms
Docs aaron abandon abandoned abandoning abbott abboud abc abcs abctvs abdomen
[1,] 0 0 0 0 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 0 0 0 0
Bringing these data into a tidy format is performed as follows:
associated_press_tidy <- tidy(AssociatedPress)
glimpse(associated_press_tidy)
Rows: 302,031
Columns: 3
$ document <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ term <chr> "adding", "adult", "ago", "alcohol", "allegedly", "allen", "a…
$ count <dbl> 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 4, 4, 1…
Transforming the data set into a DTM, the opposite operation, is achieved using cast_dtm(data, document, term, value)
:
associated_press_dfm <- associated_press_tidy |>
  cast_dtm(document, term, count)

associated_press_dfm |>
  head(2) |>
  as.matrix() %>%
  .[, 1:10]
Terms
Docs adding adult ago alcohol allegedly allen apparently appeared arrested
1 1 2 1 1 1 1 2 1 1
2 0 0 0 0 0 0 0 1 0
Terms
Docs assault
1 1
2 0
4.2.2 Document-feature matrix
The so-called document-feature matrix is the data format used in the quanteda
package. It is basically a document-term matrix, but the authors of the quanteda
package chose the term "feature" over "term" to be more accurate:
“We call them ‘features’ rather than terms, because features are more general than terms: they can be defined as raw terms, stemmed terms, the parts of speech of terms, terms after stopwords have been removed, or a dictionary class to which a term belongs. Features can be entirely general, such as ngrams or syntactic dependencies, and we leave this open-ended.”
data("data_corpus_inaugural", package = "quanteda")
inaug_dfm <- data_corpus_inaugural |>
  quanteda::tokens() |>
  quanteda::dfm(verbose = FALSE)

inaug_dfm
Document-feature matrix of: 59 documents, 9,437 features (91.84% sparse) and 4 docvars.
features
docs fellow-citizens of the senate and house representatives :
1789-Washington 1 71 116 1 48 2 2 1
1793-Washington 0 11 13 0 2 0 0 1
1797-Adams 3 140 163 1 130 0 2 0
1801-Jefferson 2 104 130 0 81 0 0 1
1805-Jefferson 0 101 143 0 93 0 0 0
1809-Madison 1 69 104 0 43 0 0 0
features
docs among vicissitudes
1789-Washington 1 1
1793-Washington 0 0
1797-Adams 4 0
1801-Jefferson 1 0
1805-Jefferson 7 0
1809-Madison 0 0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,427 more features ]
This, again, can just be tidy()ed.
inaug_tidy <- tidy(inaug_dfm)
glimpse(inaug_tidy)
Rows: 45,452
Columns: 3
$ document <chr> "1789-Washington", "1797-Adams", "1801-Jefferson", "1809-Madi…
$ term <chr> "fellow-citizens", "fellow-citizens", "fellow-citizens", "fel…
$ count <dbl> 1, 3, 2, 1, 1, 5, 1, 11, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 71,…
Of course, the resulting tibble can now be cast back into the DFM format using cast_dfm(data, document, term, value)
. Here, the value corresponds to the number of appearances of the term in the respective document.
inaug_tidy |>
  cast_dfm(document, term, count)
Document-feature matrix of: 59 documents, 9,437 features (91.84% sparse) and 0 docvars.
features
docs fellow-citizens of the senate and house representatives :
1789-Washington 1 71 116 1 48 2 2 1
1797-Adams 3 140 163 1 130 0 2 0
1801-Jefferson 2 104 130 0 81 0 0 1
1809-Madison 1 69 104 0 43 0 0 0
1813-Madison 1 65 100 0 44 0 0 0
1817-Monroe 5 164 275 0 122 0 1 0
features
docs among vicissitudes
1789-Washington 1 1
1797-Adams 4 0
1801-Jefferson 1 0
1809-Madison 0 0
1813-Madison 1 0
1817-Monroe 3 0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,427 more features ]
4.2.3 Corpus objects
Another common way of storing data is in so-called corpora. This is usually a collection of raw documents and metadata. An example would be the collection of State of the Union speeches we worked with earlier. The tm
package has a class for corpora.
data("acq", package = "tm")
acq
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 50
#str(acq |> head(1))
It is basically a list containing different elements that refer to the metadata or the content. This is a nice and effective framework for storing documents, yet it does not lend itself nicely to analysis with tidy tools. You can use tidy()
to clean it up a bit:
acq_tbl <- acq |>
  tidy()
This results in a tibble that contains the relevant metadata and a text
column, a good point of departure for subsequent tidy analyses.
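For instance, a brief sketch of how you could carry on from here with the tidy tools covered above, tokenizing the text column and counting terms:

# sketch: continue the tidy workflow on the tidied corpus
acq_tbl |>
  unnest_tokens(output = token, input = text) |>
  anti_join(get_stopwords(), by = c("token" = "word")) |>
  count(token, sort = TRUE)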
4.2.4 Exercises
- Use the data set from exercise #1.
- Preprocess it and transform it into the DTM and DFM formats (a possible route is sketched after the setup code below).
- Vary your preprocessing (no stemming, different stop word lists, etc.). How does this alter the dimensions (dim()) of your DTM/DFM?
needs(tidyverse, tidytext, stopwords, SnowballC)
timelines <- read_csv("https://www.dropbox.com/s/dpu5m3xqz4u4nv7/tweets_house_rep_party.csv?dl=1") |>
  filter(!is.na(party))

keywords <- c("abortion", "prolife", " roe ", " wade ", "roevswade", "baby", "fetus", "womb", "prochoice", "leak")
4.3 Further links
- Tidy text mining with R.
- A more general introduction by Christopher Bail.