7  Lemmatization, Named Entity Recognition, POS-tagging, and Dependency Parsing with spaCyr

Advanced operations for extracting information from text and annotating it (and more!) can be performed with spaCyr (Benoit and Matsuo 2020). spaCyr is an R wrapper around the spaCy Python package and is, therefore, a bit tricky to install at first. You can find instructions here.

The functionalities spaCyr offers you are the following1:

  • parsing texts into tokens or sentences
  • lemmatizing tokens
  • part-of-speech (POS) tagging
  • named entity recognition (NER) and noun phrase extraction
  • dependency parsing

In brief, preprocessing with spaCyr is computationally more expensive than using, for instance, tidytext, but it will give you more accurate lemmatization instead of “stupid,” rule-based stemming. It also allows you to break documents up into smaller entities, sentences, which might be more suitable, e.g., as input for classifiers (since sentences tend to be about one topic, they allow for more fine-grained analyses). Part-of-speech (POS) tagging provides you with the grammatical function of each term within a sentence, which might prove useful for tasks such as sentiment analysis. The final task spaCyr can help you with is Named Entity Recognition (NER), which can be used for tasks such as sampling relevant documents.
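Sentence splitting, for instance, does not require a full parse; spacy_tokenize() can do it on its own. A minimal sketch (it assumes spaCy has already been initialized, which the next section covers):

# tokenize into sentences instead of words
spacy_tokenize(
  "Look, this is a brief example. This is a second sentence.",
  what = "sentence"
)

By default this returns a named list with one character vector of sentences per document; setting output = "data.frame" gives a tidy alternative.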

7.1 Initializing spaCy

Before using spaCyr, it needs to be initialized. During this process, R opens a connection to Python so that it can run the spaCyr functions in Python’s spaCy. Once you have set up everything properly (see instructions), you can initialize it using spacy_initialize(model). Different language models can be specified and an overview can be found here. Note that a spaCy process is started when you call spacy_initialize() and keeps running in the background. Hence, once you don’t need it anymore, or want to load a different model, you should call spacy_finalize().

needs(spacyr, tidyverse, sotu)

spacy_initialize(model = "en_core_web_sm")
successfully initialized (spaCy Version: 3.7.4, language model: en_core_web_sm)
# to download new model -- here: French
#spacy_finalize()
#spacy_download_langmodel(model = "fr_core_news_sm")
#spacy_initialize(model = "fr_core_news_sm") #check that it has worked

spacy_finalize()
#spacy_initialize(model = "de_core_news_sm") # for German

7.2 spacy_parse()

spaCyr’s workhorse function is spacy_parse(). It takes a character vector or a data frame compliant with the Text Interchange Format (TIF). The latter is basically a tibble containing at least two columns: one named doc_id with unique document ids, and one named text containing the respective documents.

tif_toy_example <- tibble(
  doc_id = "doc1",
  text = "Look, this is a brief example for how tokenization works. This second sentence allows me to demonstrate another functionality of spaCy."
)

toy_example_vec <- tif_toy_example$text

spacy_parse(tif_toy_example)
successfully initialized (spaCy Version: 3.7.4, language model: en_core_web_sm)
   doc_id sentence_id token_id         token         lemma   pos    entity
1    doc1           1        1          Look          look  VERB          
2    doc1           1        2             ,             , PUNCT          
3    doc1           1        3          this          this  PRON          
4    doc1           1        4            is            be   AUX          
5    doc1           1        5             a             a   DET          
6    doc1           1        6         brief         brief   ADJ          
7    doc1           1        7       example       example  NOUN          
8    doc1           1        8           for           for   ADP          
9    doc1           1        9           how           how SCONJ          
10   doc1           1       10  tokenization  tokenization  NOUN          
11   doc1           1       11         works          work  VERB          
12   doc1           1       12             .             . PUNCT          
13   doc1           2        1          This          this   DET          
14   doc1           2        2        second        second   ADJ ORDINAL_B
15   doc1           2        3      sentence      sentence  NOUN          
16   doc1           2        4        allows         allow  VERB          
17   doc1           2        5            me             I  PRON          
18   doc1           2        6            to            to  PART          
19   doc1           2        7   demonstrate   demonstrate  VERB          
20   doc1           2        8       another       another   DET          
21   doc1           2        9 functionality functionality  NOUN          
22   doc1           2       10            of            of   ADP          
23   doc1           2       11         spaCy         spaCy PROPN          
24   doc1           2       12             .             . PUNCT          

Applied to the sotu speeches, the output of spacy_parse() looks as follows:

sotu_speeches_tif <- sotu_meta |> 
  mutate(text = sotu_text) |> 
  distinct(text, .keep_all = TRUE) |> 
  filter(between(year, 1990, 2000)) |> 
  group_by(year) |> 
  summarize(text = str_c(text, collapse = " ")) |> 
  select(doc_id = year, text)

glimpse(sotu_speeches_tif)
Rows: 11
Columns: 2
$ doc_id <int> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000
$ text   <chr> "\n\nMr. President, Mr. Speaker, Members of the United States C…
sotu_parsed <- spacy_parse(sotu_speeches_tif,
                           pos = TRUE,
                           tag = TRUE,
                           lemma = TRUE,
                           entity = TRUE,
                           dependency = TRUE,
                           nounphrase = TRUE,
                           multithread = TRUE)

# if you haven't installed spacy yet, uncomment and run the following line
#sotu_parsed <- read_rds("https://github.com/fellennert/sicss-paris-2023/raw/main/code/sotu_parsed.rds")

Note that this output is already fairly similar to that of tidytext’s unnest_tokens() function. The advantages are that the lemmas are more accurate, that we have a new sub-entity, sentences, and that there is now more information on the type and meaning of the words.
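To illustrate what this richer output enables, here is a minimal sketch that counts the most frequent lemmas per speech and tallies the word classes. It assumes the tidytext package is installed, since its stop_words lexicon is used for filtering:

library(tidytext) # only needed for the stop_words lexicon

# most frequent lemmas per speech, with stop words removed via the lemma column
sotu_parsed |> 
  anti_join(stop_words, by = c("lemma" = "word")) |> 
  count(doc_id, lemma, sort = TRUE)

# distribution of word classes across the corpus
sotu_parsed |> 
  count(pos, sort = TRUE)

Filtering on the lemma (rather than the raw token) catches inflected forms such as “is” or “was,” which both map to the stop word “be.”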

7.3 POS tags, NER, and noun phrases

The abbreviations in the pos column follow the format of Universal POS tags. Entities can be extracted by passing the parsed object on to entity_extract().

entity_extract(sotu_parsed, type = "all") |> glimpse()
Rows: 4,269
Columns: 4
$ doc_id      <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990", "1…
$ sentence_id <int> 1, 1, 1, 3, 3, 3, 4, 4, 6, 6, 6, 6, 7, 9, 9, 10, 10, 11, 1…
$ entity      <chr> "Speaker", "Senate", "House", "Tonight", "Government", "th…
$ entity_type <chr> "PERSON", "ORG", "ORG", "TIME", "ORG", "DATE", "NORP", "GP…

The following entities are recognized (overview taken from this article):

  • PERSON: People, including fictional.
  • NORP: Nationalities or religious or political groups.
  • FAC: Buildings, airports, highways, bridges, etc.
  • ORG: Companies, agencies, institutions, etc.
  • GPE: Countries, cities, states.
  • LOC: Non-GPE locations, mountain ranges, bodies of water.
  • PRODUCT: Objects, vehicles, foods, etc. (Not services.)
  • EVENT: Named hurricanes, battles, wars, sports events, etc.
  • WORK_OF_ART: Titles of books, songs, etc.
  • LAW: Named documents made into laws.
  • LANGUAGE: Any named language.
  • DATE: Absolute or relative dates or periods.
  • TIME: Times smaller than a day.
  • PERCENT: Percentage, including “%”.
  • MONEY: Monetary values, including unit.
  • QUANTITY: Measurements, as of weight or distance.
  • ORDINAL: “first,” “second,” etc.
  • CARDINAL: Numerals that do not fall under another type.
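To see which of these types actually occur in the SOTU corpus, and how often, you can count the entity_type column; a quick sketch:

entity_extract(sotu_parsed, type = "all") |> 
  count(entity_type, sort = TRUE)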

To properly represent entities in our corpus, you can use entity_consolidate(). This collapses words that belong to the same entity into single tokens (e.g., “the” “white” “house” becomes “the_white_house”).

entity_consolidate(sotu_parsed) |> glimpse()
Note: removing head_token_id, dep_rel for named entities
Rows: 81,724
Columns: 8
$ doc_id      <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990", "1…
$ sentence_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ token_id    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ token       <chr> "\n\n", "Mr.", "President", ",", "Mr.", "Speaker", ",", "M…
$ lemma       <chr> "\n\n", "Mr.", "President", ",", "Mr.", "Speaker", ",", "m…
$ pos         <chr> "SPACE", "PROPN", "PROPN", "PUNCT", "PROPN", "ENTITY", "PU…
$ tag         <chr> "_SP", "NNP", "NNP", ",", "NNP", "ENTITY", ",", "NNS", "IN…
$ entity_type <chr> "", "", "", "", "", "PERSON", "", "", "", "", "", "", "", …

If you want to extract only nouns, you can simply filter them.

sotu_parsed |> 
  entity_consolidate() |> 
  filter(pos == "NOUN") |> 
  glimpse()
Note: removing head_token_id, dep_rel for named entities
Rows: 14,150
Columns: 8
$ doc_id      <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990", "1…
$ sentence_id <int> 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6…
$ token_id    <dbl> 8, 10, 17, 9, 19, 28, 31, 12, 15, 22, 25, 31, 4, 6, 13, 16…
$ token       <chr> "Members", "privilege", "state", "state", "initiative", "l…
$ lemma       <chr> "member", "privilege", "state", "state", "initiative", "li…
$ pos         <chr> "NOUN", "NOUN", "NOUN", "NOUN", "NOUN", "NOUN", "NOUN", "N…
$ tag         <chr> "NNS", "NN", "NN", "NN", "NN", "NN", "NN", "NNS", "NN", "N…
$ entity_type <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…

However, a better way is to extract the “complete” noun phrases:

nounphrase_extract(sotu_parsed) |> glimpse()
Rows: 21,604
Columns: 3
$ doc_id      <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990", "1…
$ sentence_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3…
$ nounphrase  <chr> "\n\nMr._President", "Mr._Speaker", "Members", "the_United…

Usually, entities and noun phrases can give you a good idea of what texts are about. Therefore, you might want to only extract them without parsing the entire text.

spacy_extract_entity(sotu_speeches_tif |> slice(1:3)) |> glimpse()
Rows: 689
Columns: 5
$ doc_id   <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990", "1990…
$ text     <chr> "Speaker", "Senate", "House", "Tonight", "Government", "the c…
$ ent_type <chr> "PERSON", "ORG", "ORG", "TIME", "ORG", "DATE", "NORP", "GPE",…
$ start_id <dbl> 6, 24, 32, 56, 67, 78, 100, 129, 158, 174, 180, 197, 205, 243…
$ length   <int> 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 2, 1, 4, 2, 5, 4, 1, 6, 1, 1, 1…
spacy_extract_nounphrases(sotu_speeches_tif |> slice(1:3)) |> glimpse()
Rows: 3,887
Columns: 6
$ doc_id    <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990", "199…
$ text      <chr> "\n\nMr. President", "Mr. Speaker", "Members", "the United S…
$ root_text <chr> "President", "Speaker", "Members", "Congress", "I", "Preside…
$ start_id  <dbl> 1, 5, 8, 10, 16, 19, 23, 26, 30, 38, 40, 42, 47, 49, 52, 57,…
$ root_id   <dbl> 3, 6, 8, 13, 16, 21, 24, 28, 32, 38, 40, 43, 47, 50, 53, 57,…
$ length    <int> 3, 2, 1, 4, 1, 3, 2, 3, 3, 1, 1, 2, 1, 2, 2, 1, 2, 2, 3, 1, …
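Since noun phrases hint at what a text is about, a small sketch of a natural follow-up step: tallying the most frequent noun phrases per speech (lower-casing first so that duplicates collapse):

spacy_extract_nounphrases(sotu_speeches_tif |> slice(1:3)) |> 
  mutate(text = str_to_lower(text)) |> 
  count(doc_id, text, sort = TRUE) |> 
  group_by(doc_id) |> 
  slice_max(n, n = 5) |> # top 5 noun phrases per speech
  ungroup()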

7.4 Exercises

  1. Use spacyr to investigate which countries were mentioned the most in the SOTU addresses over time in the 20th century. Do you find patterns? (step-by-step: take the SOTUs; filter them (1900–2000); spacyr::spacy_extract_entity(), filter geographical units, normalize them – str_replace_all + replacement_pattern; plot them in ggplot2 with year on x-axis, count on y-axis, colored by country)
sotu_entities <- sotu_meta |> 
  mutate(text = sotu_text) 

replacement_pattern <- c( "^America$" = "United States",
                          "^States$" = "United States", 
                          "Washington" = "United States", 
                          "U.S." = "United States", 
                          "Viet-Nam" = "Vietnam", 
                          "AMERICA" = "United States", 
                          "^Britain$" = "Great Britain",
                          "^BRITAIN$" = "Great Britain", 
                          "Alaska" = "US State/City", 
                          "District of Columbia" = "US State/City", 
                          "Hawaii" = "US State/City", 
                          "California" = "US State/City", 
                          "New York" = "US State/City", 
                          "Mississippi" = "US State/City", 
                          "Texas" = "US State/City", 
                          "Chicago" = "US State/City", 
                          "United States of America" = "United States", 
                          "Berlin" = "Germany" )

spacy_finalize()

  1. overview copied from the webpage↩︎