7 Lemmatization, Named Entity Recognition, POS-tagging, and Dependency Parsing with spaCyr
Advanced operations for extracting information from and annotating text (and more!) can be performed with spaCyr (Benoit and Matsuo 2020). spaCyr is an R wrapper around the spaCy Python package and is, therefore, a bit tricky to install at first. You can find instructions here. The functionalities spaCyr offers you are the following¹:
- parsing texts into tokens or sentences;
- lemmatizing tokens;
- parsing dependencies (to identify the grammatical structure of the sentence); and
- identifying, extracting, or consolidating token sequences that form named entities or noun phrases.
In brief, preprocessing with spaCyr is computationally more expensive than using, for instance, tidytext, but it will give you more accurate lemmatization instead of "stupid" rule-based stemming. It also allows you to break up documents into smaller entities, sentences, which might be more suitable, e.g., as input for classifiers (since sentences tend to be about one topic, they allow for more fine-grained analyses). Part-of-speech (POS) tagging provides you with the grammatical function of each term within the sentence, which might prove useful for tasks such as sentiment analysis. The final task spaCyr can help you with is Named Entity Recognition (NER), which can be used for tasks such as sampling relevant documents.
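To illustrate the difference between stemming and lemmatization, here is a minimal sketch (not part of the original text; it assumes the SnowballC package, which implements Porter-style stemming, is installed). The stemmer strips suffixes by rule and often produces non-words, whereas a lemmatizer returns dictionary forms (see the lemma column produced by spacy_parse() below).
library(SnowballC)
# rule-based stemming: suffixes are chopped off mechanically
wordStem(c("studies", "studying", "argues", "arguing"))
[1] "studi" "studi" "argu"  "argu" 
# a lemmatizer instead maps these tokens to their dictionary forms:
# "study", "study", "argue", "argue"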
7.1 Initializing spaCy
Before using spaCyr, it needs to be initialized. What happens during this process is that R opens a connection to Python so that it can then run the spaCyr functions in Python's spaCy. Once you have set up everything properly (see instructions), you can initialize it using spacy_initialize(model). Different language models can be specified; an overview can be found here. Note that a spaCy process is started when you call spacy_initialize() and continues running in the background. Hence, once you don't need it anymore, or want to load a different model, you should call spacy_finalize().
needs(spacyr, tidyverse, sotu)
spacy_initialize(model = "en_core_web_sm")
successfully initialized (spaCy Version: 3.7.4, language model: en_core_web_sm)
# to download new model -- here: French
#spacy_finalize()
#spacy_download_langmodel(model = "fr_core_news_sm")
#spacy_initialize(model = "fr_core_news_sm") #check that it has worked
spacy_finalize()
#spacy_initialize(model = "de_core_news_sm") # for German
7.2 spacy_parse()
spaCyr's workhorse function is spacy_parse(). It takes a character vector or a TIF-compliant data frame. The latter is basically a tibble containing at least two columns: one named doc_id with unique document ids, and one named text containing the respective documents.
tif_toy_example <- tibble(
  doc_id = "doc1",
  text = "Look, this is a brief example for how tokenization works. This second sentence allows me to demonstrate another functionality of spaCy."
)

toy_example_vec <- tif_toy_example$text
spacy_parse(tif_toy_example)
successfully initialized (spaCy Version: 3.7.4, language model: en_core_web_sm)
doc_id sentence_id token_id token lemma pos entity
1 doc1 1 1 Look look VERB
2 doc1 1 2 , , PUNCT
3 doc1 1 3 this this PRON
4 doc1 1 4 is be AUX
5 doc1 1 5 a a DET
6 doc1 1 6 brief brief ADJ
7 doc1 1 7 example example NOUN
8 doc1 1 8 for for ADP
9 doc1 1 9 how how SCONJ
10 doc1 1 10 tokenization tokenization NOUN
11 doc1 1 11 works work VERB
12 doc1 1 12 . . PUNCT
13 doc1 2 1 This this DET
14 doc1 2 2 second second ADJ ORDINAL_B
15 doc1 2 3 sentence sentence NOUN
16 doc1 2 4 allows allow VERB
17 doc1 2 5 me I PRON
18 doc1 2 6 to to PART
19 doc1 2 7 demonstrate demonstrate VERB
20 doc1 2 8 another another DET
21 doc1 2 9 functionality functionality NOUN
22 doc1 2 10 of of ADP
23 doc1 2 11 spaCy spaCy PROPN
24 doc1 2 12 . . PUNCT
Applying spacy_parse() to the SOTU speeches looks as follows:
sotu_speeches_tif <- sotu_meta |>
  mutate(text = sotu_text) |>
  distinct(text, .keep_all = TRUE) |>
  filter(between(year, 1990, 2000)) |>
  group_by(year) |>
  summarize(text = str_c(text, collapse = " ")) |>
  select(doc_id = year, text)
glimpse(sotu_speeches_tif)
Rows: 11
Columns: 2
$ doc_id <int> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000
$ text <chr> "\n\nMr. President, Mr. Speaker, Members of the United States C…
sotu_parsed <- spacy_parse(sotu_speeches_tif,
                           pos = TRUE,
                           tag = TRUE,
                           lemma = TRUE,
                           entity = TRUE,
                           dependency = TRUE,
                           nounphrase = TRUE,
                           multithread = TRUE)
# if you haven't installed spacy yet, uncomment and run the following line
#sotu_parsed <- read_rds("https://github.com/fellennert/sicss-paris-2023/raw/main/code/sotu_parsed.rds")
Note that this is already fairly similar to the output of tidytext's unnest_tokens() function. The advantages are that the lemmas are more accurate, that we have a new sub-entity – sentences – and that there is now more information on the types and meanings of the words.
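For instance, the sentence_id column makes it easy to move from tokens back to sentence-level documents. The following is a rough sketch of one way to do this (it assumes the sotu_parsed object from the chunk above and the loaded tidyverse): keep only content-word lemmas and paste them back together per sentence.
# sketch: collapse the lemmas of content words (nouns, proper nouns, verbs,
# adjectives) into one string per sentence -- e.g., as input for a classifier
sotu_sentences <- sotu_parsed |>
  filter(pos %in% c("NOUN", "PROPN", "VERB", "ADJ")) |>
  group_by(doc_id, sentence_id) |>
  summarize(sentence = str_c(lemma, collapse = " "), .groups = "drop")

glimpse(sotu_sentences)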
7.4 Exercises
- Use spacyr to investigate which countries were mentioned the most in the SOTU addresses over time in the 20th century. Do you find patterns? (Step by step: take the SOTUs; filter them (1900–2000); run spacyr::spacy_extract_entity(); filter geographical units; normalize them – str_replace_all() with a replacement pattern; plot them in ggplot2 with year on the x-axis, count on the y-axis, colored by country.) Some starter code is given below; a sketch of one possible continuation follows at the end of the section.
sotu_entities <- sotu_meta |>
  mutate(text = sotu_text)

replacement_pattern <- c("^America$" = "United States",
                         "^States$" = "United States",
                         "Washington" = "United States",
                         "U.S." = "United States",
                         "Viet-Nam" = "Vietnam",
                         "AMERICA" = "United States",
                         "^Britain$" = "Great Britain",
                         "^BRITAIN$" = "Great Britain",
                         "Alaska" = "US State/City",
                         "District of Columbia" = "US State/City",
                         "Hawaii" = "US State/City",
                         "California" = "US State/City",
                         "New York" = "US State/City",
                         "Mississippi" = "US State/City",
                         "Texas" = "US State/City",
                         "Chicago" = "US State/City",
                         "United States of America" = "United States",
                         "Berlin" = "Germany")
spacy_finalize()
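One possible way to continue is sketched below (my own illustration, not necessarily the intended solution). It reuses the sotu_entities and replacement_pattern objects from above and requires an initialized spaCy session, so run it before the spacy_finalize() call above (or initialize spaCy again first). spacy_extract_entity() returns one row per recognized entity together with its type; geopolitical entities are tagged "GPE".
# sketch: extract entities from the 20th-century SOTUs, keep geopolitical
# ones, normalize the names, and plot mentions per year
sotu_entities_tif <- sotu_entities |>
  distinct(text, .keep_all = TRUE) |>
  filter(between(year, 1900, 2000)) |>
  group_by(year) |>
  summarize(text = str_c(text, collapse = " ")) |>
  select(doc_id = year, text)

country_mentions <- spacy_extract_entity(sotu_entities_tif) |>
  filter(ent_type == "GPE") |> # keep geopolitical entities only
  mutate(country = str_replace_all(text, replacement_pattern),
         year = as.integer(doc_id)) |>
  count(year, country)

country_mentions |>
  filter(country %in% c("United States", "Great Britain", "Germany", "Vietnam")) |>
  ggplot(aes(x = year, y = n, color = country)) +
  geom_line()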
¹ Overview copied from the webpage.