Chapter 4 Text Mining

In the final part of this brief introduction to computational techniques for the Social Sciences, I will introduce you to a set of methods that you can use to draw inferences from large text corpora. Specifically, this chapter covers the pre-processing of text, basic (dictionary-based) sentiment analysis, and topic modeling. The analyses are performed following “tidy” data principles.

There are a couple of packages around that you can use for text mining, such as quanteda (Benoit et al. 2018) or tm (Feinerer, Hornik, and Meyer 2008); tidytext (Silge and Robinson 2016) is probably the most recent addition to them. As you can probably tell from its name, tidytext obeys the tidy data principles. “Every observation is a row” translates here to “each token has its own row” – where a “token” is not necessarily a single term, but can also be an n-gram, a sentence, or a paragraph.
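
To make the notion of a “token” more tangible, here is a minimal, hypothetical sketch (the toy sentence and column names are mine) using the unnest_tokens() function that is introduced further below:

library(tidyverse)
library(tidytext)

# a made-up toy text; the token argument controls what ends up in a row
toy <- tibble(text = "This is a short example sentence. And here is a second one.")

toy %>% unnest_tokens(output = word, input = text)                            # one term per row
toy %>% unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2) # one bigram per row
toy %>% unnest_tokens(output = sentence, input = text, token = "sentences")   # one sentence per row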

In the following, I will demonstrate what text mining using tidy principles can look like in R. The sotu package contains all the so-called “State of the Union” addresses – the annual speeches the president delivers to Congress – since 1790.

library(tidyverse)
library(sotu)
# sotu_meta contains the addresses' meta data, sotu_text the raw texts;
# binding the unnamed text vector yields a column named `...6`
sotu_raw <- sotu_meta %>% 
  bind_cols(sotu_text) %>% 
  rename(content = `...6`) %>% 
  distinct(content, .keep_all = TRUE)

4.1 Pre-processing: put it into tidy text format

Now that the data is read in, I need to clean it. To get an idea of what needs to be done, I take a look at the first entry of the tibble.

sotu_raw %>% slice(1) %>% pull(content) %>% str_sub(1, 500)
## [1] "Fellow-Citizens of the Senate and House of Representatives: \n\nI embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs. The recent accession of the important state of North Carolina to the Constitution of the United States (of which official information has been received), the rising credit and respectability of our country, the general and increasing good will toward the government of the Union, an"

4.1.1 Cleaning

Nice, that looks pretty clean already. However, I do not need capital letters, line breaks (\n), or punctuation. str_to_lower(), str_replace_all(), and str_squish() from the stringr package (Wickham 2019a) are the right tools for this job. The first one transforms every letter to lowercase, the second one replaces all occurrences of a certain character class with whatever I want (a white space in my case), and the final one removes redundant white space (i.e., repeated white spaces are reduced to one).

sotu_clean <- sotu_raw %>% 
  mutate(content = str_to_lower(content),
         # replace everything that is neither a letter, a digit, nor a space with a space
         content = str_replace_all(content, "[^[:alnum:] ]", " "),
         content = str_squish(content))

The next step is to remove stop words – they are not necessary for the sentiment analyses I want to perform first. The stopwords package has a nice list for English.

library(stopwords)
stopwords_vec <- stopwords(language = "en")
# stopwords(language = "de") # the German equivalent
# stopwords_getlanguages() # find the languages that are available

However, it might be easier if I first bring the data into tidy format – every token in its own row. Stop words can then be removed with a simple anti_join().

4.1.2 unnest_tokens()

I will focus on the 20th century SOTUs. Here, the dplyr::between() function comes in handy.

sotu_20cent_clean <- sotu_clean %>% 
  filter(between(year, 1900, 2000))

Now I can tokenize them:

library(tidytext)

sotu_20cent_tokenized <- sotu_20cent_clean %>% 
  unnest_tokens(output = token, input = content)
glimpse(sotu_20cent_tokenized)
## Rows: 917,678
## Columns: 6
## $ president    <chr> "William McKinley", "William McKinley", "William McKinley…
## $ year         <int> 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 190…
## $ years_active <chr> "1897-1901", "1897-1901", "1897-1901", "1897-1901", "1897…
## $ party        <chr> "Republican", "Republican", "Republican", "Republican", "…
## $ sotu_type    <chr> "written", "written", "written", "written", "written", "w…
## $ token        <chr> "to", "the", "senate", "and", "house", "of", "representat…

The new tibble consists of 917,678 rows. Please note that you usually have to add some sort of id column to your original tibble before tokenizing it, e.g., by giving each case – representing a document, a chapter, or whatever – its own id (for instance with tibble::rowid_to_column(); see the short sketch below). This does not apply here, because my original tibble already comes with a bunch of meta data (president, year, party) that identifies each address.
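
A minimal, hypothetical illustration of what that would look like (the tibble docs and its contents are made up):

# give each document its own id before tokenizing
docs <- tibble(content = c("this is the first document", "and this is the second one"))
docs %>% 
  rowid_to_column("doc_id") %>% 
  unnest_tokens(output = token, input = content)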

Removing the stop words is now straightforward:

sotu_20cent_tokenized_nostopwords <- sotu_20cent_tokenized %>% 
  filter(!token %in% stopwords_vec)

Another option would have been to anti_join() the tibble returned by the get_stopwords() function. For this to work, the column containing the individual tokens either needs to be called word, or a named vector has to be passed to by that maps my column name to word:

sotu_20cent_tokenized_nostopwords <- sotu_20cent_tokenized %>% 
  anti_join(get_stopwords(language = "en"), by = c("token" = "word"))

Another thing I forgot to remove is digits. They do not matter for the analyses either:

sotu_20cent_tokenized_nostopwords_nonumbers <- sotu_20cent_tokenized_nostopwords %>% 
  filter(!str_detect(token, "[:digit:]"))

Beyond that, I can stem my words using the SnowballC package and its function wordStem():

library(SnowballC)
sotu_20cent_tokenized_nostopwords_nonumbers_stemmed <- sotu_20cent_tokenized_nostopwords_nonumbers %>% 
  mutate(token = wordStem(token, language = "en"))

#SnowballC::getStemLanguages() # if you want to know the abbreviations for other languages as well

Maybe I should also remove insignificant words, i.e., ones that account for less than 0.5 percent of all tokens.

sotu_20cent_tokenized_nostopwords_nonumbers_stemmed %>% 
  group_by(token) %>% 
  filter(n() > nrow(.)/200)
## # A tibble: 30,257 × 6
## # Groups:   token [10]
##    president         year years_active party      sotu_type token   
##    <chr>            <int> <chr>        <chr>      <chr>     <chr>   
##  1 William McKinley  1900 1897-1901    Republican written   congress
##  2 William McKinley  1900 1897-1901    Republican written   nation  
##  3 William McKinley  1900 1897-1901    Republican written   american
##  4 William McKinley  1900 1897-1901    Republican written   peopl   
##  5 William McKinley  1900 1897-1901    Republican written   govern  
##  6 William McKinley  1900 1897-1901    Republican written   year    
##  7 William McKinley  1900 1897-1901    Republican written   nation  
##  8 William McKinley  1900 1897-1901    Republican written   congress
##  9 William McKinley  1900 1897-1901    Republican written   state   
## 10 William McKinley  1900 1897-1901    Republican written   state   
## # … with 30,247 more rows

4.1.3 In a nutshell

Well, all those steps can also be wrapped into one nice cleaning pipeline (here, instead of the 0.5 percent threshold, I simply drop tokens that appear five times or fewer):

sotu_20cent_clean <- sotu_clean %>% 
  filter(between(year, 1900, 2000)) %>% # 20th century only
  unnest_tokens(output = token, input = content) %>% # tokenize
  anti_join(get_stopwords(), by = c("token" = "word")) %>% # remove stop words
  filter(!str_detect(token, "[:digit:]")) %>% # remove digits
  mutate(token = wordStem(token, language = "en")) %>% # stem
  group_by(token) %>% 
  filter(n() > 5) # drop tokens that appear five times or fewer

Now I have created a nice tibble containing the SOTU addresses of the 20th century in a tidy format. This is a great point of departure for subsequent analyses.

4.2 Sentiment Analysis

Sentiment analyses are fairly easy when you have your data in tidy text format. Since they basically consist of matching the words in the corpus to the sentiment values in a dictionary, this can be done with an inner_join(). tidytext’s get_sentiments() function gives access to four dictionaries: bing, loughran, afinn, and nrc:

walk(c("bing", "loughran", "afinn", "nrc"), ~get_sentiments(lexicon = .x) %>% 
       head() %>% 
       print())
## # A tibble: 6 × 2
##   word       sentiment
##   <chr>      <chr>    
## 1 2-faces    negative 
## 2 abnormal   negative 
## 3 abolish    negative 
## 4 abominable negative 
## 5 abominably negative 
## 6 abominate  negative 
## # A tibble: 6 × 2
##   word         sentiment
##   <chr>        <chr>    
## 1 abandon      negative 
## 2 abandoned    negative 
## 3 abandoning   negative 
## 4 abandonment  negative 
## 5 abandonments negative 
## 6 abandons     negative 
## # A tibble: 6 × 2
##   word       value
##   <chr>      <dbl>
## 1 abandon       -2
## 2 abandoned     -2
## 3 abandons      -2
## 4 abducted      -2
## 5 abduction     -2
## 6 abductions    -2
## # A tibble: 6 × 2
##   word      sentiment
##   <chr>     <chr>    
## 1 abacus    trust    
## 2 abandon   fear     
## 3 abandon   negative 
## 4 abandon   sadness  
## 5 abandoned anger    
## 6 abandoned fear

As you can see here, the dictionaries are mere tibbles with two columns: “word” plus either a categorical “sentiment” or, in the case of AFINN, a numeric “value.” For easier joining, I should rename my column “token” to word.

library(magrittr)
sotu_20cent_clean %<>% rename(word = token)

The AFINN dictionary is the only one with numeric values. You might have noticed that its words are not stemmed. Hence, I need to stem them before I can join the dictionary with my tibble. To get a sentiment value per address, I average the values of its words by year.

sotu_20cent_afinn <- get_sentiments("afinn") %>% 
  mutate(word = wordStem(word, language = "en")) %>% 
  inner_join(sotu_20cent_clean) %>% 
  group_by(year) %>% 
  summarize(sentiment = mean(value))

Thereafter, I can just plot it:

sotu_20cent_afinn %>% 
  ggplot() +
  geom_line(aes(x = year, y = sentiment))

That’s a bit hard to interpret. geom_smooth() might help:

sotu_20cent_afinn %>% 
  ggplot() +
  geom_smooth(aes(x = year, y = sentiment))

Interesting. If you think of the tone in the SOTU addresses as a proxy measure for the circumstances, the worst phase appears to be the 1920s and 1930s – which might make sense given the economic circumstances of the era. The maximum is reached around the 1960s, and since then the sentiment has apparently remained fairly stable.

4.2.1 Assessing the results

However, we have no idea whether we are capturing some valid signal or not. Let’s look at what drives those classifications the most:

sotu_20cent_contribution <- get_sentiments("afinn") %>% 
  mutate(word = wordStem(word, language = "en")) %>% 
  inner_join(sotu_20cent_clean)  %>%
  group_by(word) %>%
  summarize(occurrences = n(),
            contribution = sum(value))
sotu_20cent_contribution %>%
  slice_max(contribution, n = 10) %>%
  bind_rows(sotu_20cent_contribution %>% slice_min(contribution, n = 10)) %>% 
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(contribution, word, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  labs(y = NULL)

Let’s split this up per decade:

get_sentiments("afinn") %>% 
  mutate(word = wordStem(word, language = "en")) %>% 
  inner_join(sotu_20cent_clean) %>%
  mutate(decade = ((year - 1900)/10) %>% floor()) %>%
  group_by(decade, word) %>%
  summarize(occurrences = n(),
            contribution = sum(value)) %>% 
  slice_max(contribution, n = 5) %>%
  bind_rows(get_sentiments("afinn") %>% 
              mutate(word = wordStem(word, language = "en")) %>% 
              inner_join(sotu_20cent_clean) %>%
              mutate(decade = ((year - 1900)/10) %>% floor()) %>%
              group_by(decade, word) %>%
              summarize(occurrences = n(),
                        contribution = sum(value)) %>% 
              slice_min(contribution, n = 5)) %>% 
  mutate(word = reorder_within(word, contribution, decade)) %>%
  ggplot(aes(contribution, word, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~decade, ncol = 3, scales = "free") +
  scale_y_reordered()

4.2.2 Assessing the quality of the rating

We need to assess the reliability of our classification (would different raters come to the same conclusion?) and, if we compare it to a gold standard, its validity (how well does the classification live up to that standard?). One measure we can use here is Krippendorff’s Alpha, which is defined as

\[\alpha = 1 - \frac{D_o}{D_e}\]

where \(D_{o}\) is the observed disagreement and \(D_{e}\) is the disagreement expected by chance. The actual calculation of the measure is a good deal more involved, but R can easily take care of that – we just need to feed it properly formatted data. For this example, I use a commonly used benchmark data set containing IMDb movie reviews labeled as positive or negative.
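
Before loading it, a quick toy example gives a feel for the measure – the ratings of the two hypothetical raters below are made up; perfect agreement would yield an alpha of 1:

library(irr)
# hypothetical ratings: raters in rows, documents in columns
toy_ratings <- rbind(
  rater_1 = c("pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"),
  rater_2 = c("pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos")
)
kripp.alpha(toy_ratings, method = "nominal")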

imdb_reviews <- read_csv("data/imdb_reviews.csv")

glimpse(imdb_reviews)
## Rows: 25,000
## Columns: 2
## $ text      <chr> "Once again Mr. Costner has dragged out a movie for far long…
## $ sentiment <chr> "negative", "negative", "negative", "negative", "negative", …
imdb_reviews_afinn <- imdb_reviews %>% 
  rowid_to_column("doc") %>% 
  unnest_tokens(token, text) %>% 
  anti_join(get_stopwords(), by = c("token" = "word")) %>% 
  mutate(stemmed = wordStem(token)) %>% 
  inner_join(get_sentiments("afinn") %>% mutate(stemmed = wordStem(word))) %>% 
  group_by(doc) %>% 
  summarize(sentiment = mean(value)) %>% 
  mutate(sentiment_afinn = case_when(sentiment > 0 ~ "positive",
                                     TRUE ~ "negative") %>% 
           factor(levels = c("positive", "negative")))

Now we have two classifications, one “gold standard” from the data and the one obtained through AFINN.

review_coding <- imdb_reviews %>% 
  mutate(true_sentiment = sentiment %>% 
           factor(levels = c("positive", "negative"))) %>% 
  select(-sentiment) %>% 
  rowid_to_column("doc") %>% 
  left_join(imdb_reviews_afinn %>% select(doc, sentiment_afinn)) 

First, we can check how often AFINN got it right, the accuracy:

sum(review_coding$true_sentiment == review_coding$sentiment_afinn, na.rm = TRUE)/25000
## [1] 0.64712

However, accuracy is not a perfect metric because it does not tell you anything about the details. For instance, your classifier might just predict “positive” all of the time. If your gold standard contains 50 percent “positive” cases, the accuracy would still lie at 0.5 (see the toy illustration below). We can address this shortcoming using the following measures.
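
A quick, made-up illustration of that point:

# a degenerate classifier that always predicts "positive"
gold <- c(rep("positive", 5), rep("negative", 5))
always_positive <- rep("positive", 10)
mean(gold == always_positive) # accuracy of 0.5, although nothing was learned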

For the calculation of Krippendorff’s Alpha, the data must be in a different format: a matrix with the documents as columns and the respective ratings as rows.

library(irr)
mat <- review_coding %>% 
  select(-text) %>% 
  as.matrix() %>% 
  t()

mat[1:3, 1:5]
##                 [,1]       [,2]       [,3]       [,4]       [,5]      
## doc             "    1"    "    2"    "    3"    "    4"    "    5"   
## true_sentiment  "negative" "negative" "negative" "negative" "negative"
## sentiment_afinn "positive" "negative" "negative" "positive" "positive"
colnames(mat) <- mat[1,]

mat <- mat[2:3,]
mat[1:2, 1:5]
##                     1          2          3          4          5     
## true_sentiment  "negative" "negative" "negative" "negative" "negative"
## sentiment_afinn "positive" "negative" "negative" "positive" "positive"
kripp.alpha(mat, method = "nominal")
##  Krippendorff's alpha
## 
##  Subjects = 25000 
##    Raters = 2 
##     alpha = 0.266

Alpha values of around 0.8 and above are considered good – AFINN clearly misses that mark.

Another way to evaluate the quality of classification is through a confusion matrix.

Now we can calculate precision (when the classifier predicts “positive,” how often is it correct?), recall/sensitivity (when a case is actually “positive,” how often is it predicted as such?), and specificity (when a case is actually “negative,” how often is it predicted as such?). The F1-score is the harmonic mean of precision and recall, defined as \(F_1 = \frac{2}{\frac{1}{recall}+\frac{1}{precision}} = 2\times \frac{precision\times recall}{precision + recall}\), and is the most commonly used single measure to assess the quality of a classification. The closer to 1 it is, the better. You can find a more thorough description of the confusion matrix and the different measures in this blog post.
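
Here is a minimal sketch of how these measures could be computed by hand from the two classifications obtained above (treating “positive” as the positive class; documents without an AFINN score are silently dropped by table()):

# cross-tabulate predictions and gold standard
conf_tab <- table(predicted = review_coding$sentiment_afinn,
                  actual = review_coding$true_sentiment)

tp <- conf_tab["positive", "positive"] # true positives
fp <- conf_tab["positive", "negative"] # false positives
fn <- conf_tab["negative", "positive"] # false negatives
tn <- conf_tab["negative", "negative"] # true negatives

precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)
specificity <- tn / (tn + fp)
f1          <- 2 * precision * recall / (precision + recall)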

More conveniently, the caret package computes all of these measures for us.

library(caret)
confusion_matrix <- confusionMatrix(data = review_coding$sentiment_afinn, 
                                    reference = review_coding$true_sentiment,
                                    positive = "positive")
confusion_matrix$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.8439750            0.4504721            0.6056500 
##       Neg Pred Value            Precision               Recall 
##            0.7427441            0.6056500            0.8439750 
##                   F1           Prevalence       Detection Rate 
##            0.7052216            0.5000000            0.4219875 
## Detection Prevalence    Balanced Accuracy 
##            0.6967515            0.6472236

4.3 TF-IDF

But how did the individual decades differ from each other? One answer might lie in the tf-idf (term frequency – inverse document frequency). The idea is to emphasize words that are not used in many documents of the corpus (hence, inverse document frequency) and to weight them by how often they appear in a given document (term frequency). The tf-idf is “the frequency of a term adjusted for how rarely it is used” (Silge and Robinson 2016: 31). If a term is rarely used overall but appears comparatively often in a single document, it is safe to assume that it plays a bigger role in that document.

The tf-idf is calculated as follows:

\[w_{i,j}=tf_{i,j}\times \log\left(\frac{N}{df_{i}}\right)\]

where

  • \(tf_{i,j}\): number of occurrences of term \(i\) in document \(j\)
  • \(df_{i}\): number of documents containing term \(i\)
  • \(N\): total number of documents
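
To see the formula in action, here is a toy computation on a hypothetical two-document corpus (the counts are made up); bind_tf_idf(), which is used below, does the same thing for you and also uses the natural logarithm:

# made-up counts: term "word" appears n times in document "doc"
toy_counts <- tribble(
  ~doc, ~word,      ~n,
  "a",  "union",     2,
  "a",  "nation",    1,
  "b",  "union",     1,
  "b",  "congress",  1
)

n_docs <- n_distinct(toy_counts$doc) # N: total number of documents

toy_counts %>% 
  group_by(doc) %>% 
  mutate(tf = n / sum(n)) %>%   # term frequency within each document
  group_by(word) %>% 
  mutate(df = n_distinct(doc),  # number of documents containing the term
         idf = log(n_docs / df),
         tf_idf = tf * idf) %>% 
  ungroup()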

I will split the data up into decades (serving as documents) and check which terms appear to be more important within them. I use the texts including stop words:

First, decades are introduced:

sotu_20cent_clean_decades <- sotu_clean %>% 
  filter(between(year, 1900, 2000)) %>% 
  unnest_tokens(input = content, output = word) %>% 
  filter(!str_detect(word, "[:digit:]")) %>% 
  mutate(decade = paste0(as.character(floor(year / 10) * 10), "s"),
         decade = as_factor(decade))

Second, the individual terms are counted per decade:

sotu_20cent_termcount <- sotu_20cent_clean_decades %>% 
  count(decade, word)

The tf-idf is now calculated using bind_tf_idf().

sotu_20cent_tf_idf <- sotu_20cent_termcount %>% 
  bind_tf_idf(word, decade, n)

sotu_20cent_tf_idf
## # A tibble: 66,220 × 6
##    decade word            n         tf    idf     tf_idf
##    <fct>  <chr>       <int>      <dbl>  <dbl>      <dbl>
##  1 1900s  a            3035 0.0160     0      0         
##  2 1900s  abandon         9 0.0000473  0.0953 0.00000451
##  3 1900s  abandoned       2 0.0000105  0.318  0.00000335
##  4 1900s  abandoning      3 0.0000158  0.788  0.0000124 
##  5 1900s  abandonment     4 0.0000210  0.606  0.0000127 
##  6 1900s  abate           1 0.00000526 1.30   0.00000683
##  7 1900s  abated          1 0.00000526 1.70   0.00000896
##  8 1900s  abatement       1 0.00000526 1.70   0.00000896
##  9 1900s  abating         1 0.00000526 1.70   0.00000896
## 10 1900s  abdicating      2 0.0000105  2.40   0.0000252 
## # … with 66,210 more rows

Do you see how those common words – “a” in this example – get a TF-IDF score of zero? When you work with TF-IDF weighted terms, there is no need to remove stop words, because they are common across virtually all documents and, hence, get a value that is basically 0.

Let’s look at the important words…

sotu_20cent_tf_idf %>% 
  arrange(-tf_idf)
## # A tibble: 66,220 × 6
##    decade word              n       tf   idf   tf_idf
##    <fct>  <chr>         <int>    <dbl> <dbl>    <dbl>
##  1 2000s  tonight          27 0.00296  0.788 0.00234 
##  2 2000s  ve               23 0.00252  0.788 0.00199 
##  3 1990s  tonight         126 0.00194  0.788 0.00153 
##  4 1960s  vietnam          65 0.00111  1.01  0.00113 
##  5 2000s  internet          6 0.000658 1.70  0.00112 
##  6 1980s  afghanistan      51 0.000455 2.40  0.00109 
##  7 2000s  eitc              4 0.000439 2.40  0.00105 
##  8 2000s  globalization     4 0.000439 2.40  0.00105 
##  9 1990s  ve               84 0.00130  0.788 0.00102 
## 10 2000s  affordable        9 0.000987 1.01  0.000999
## # … with 66,210 more rows

Now I can extract the top-five terms for every decade:

sotu_20cent_tf_idf_top5 <- sotu_20cent_tf_idf %>% 
  group_by(decade) %>% 
  slice_max(tf_idf, n = 5) %>% 
  mutate(word = reorder_within(word, tf_idf, decade))

And plot them:

sotu_20cent_tf_idf_top5 %>% 
  ggplot() +
  geom_col(aes(x = tf_idf, y = word)) +
  scale_y_reordered() +
  facet_wrap(~decade, scales = "free")

This graph already gives you some hints on what sort of topics were prevalent in the 20th century’s SOTU addresses.

4.4 Latent Dirichlet Allocation (LDA)

In the former sections, I first explored how the sentiment in the SOTU addresses evolved over the 20th century and then looked at the decade-specific vocabulary. This, paired with some prior knowledge of what happened throughout the 20th century, sufficed to gain some first insights. However, another approach to infer meaning from text is to search it for topics. This is also possible with the SOTU corpus I have at hand.

The two main assumptions of LDA are as follows:

  • Every document is a mixture of topics.
  • Every topic is a mixture of words.

Hence, individual documents do not need to be distinct in terms of their content. They can be related – if they contain the same topics. This is definitely more in line with how natural language is actually used.

The following graphic depicts a flowchart of text analysis with the tidytext package.

Text analysis flowchart

What becomes evident is that the actual topic modeling does not happen within tidytext. For this, the text needs to be transformed into a document-term-matrix and then passed on to the topicmodels package (Grün et al. 2020), which will take care of the modeling process. Thereafter, the results are turned back into tidy format, so that they can be visualized using ggplot2.

4.4.1 Document-term matrix

In order to search for the topics that are prevalent in the individual addresses, I need to transform the tidy tibble into a document-term matrix. This can be achieved with cast_dtm().

sotu_dtm <- sotu_20cent_clean %>% 
  filter(str_length(word) > 1) %>% 
  count(year, word) %>% 
  filter(between(year, 1950, 2000)) %>% 
  cast_dtm(document = year, term = word, value = n)

A DTM consists of Documents (rows) and Terms (columns) and specifies how often a term appears in a document.

sotu_dtm %>% as.matrix() %>% .[1:5, 1:5]
##       Terms
## Docs   abandon abat abdic abid abil
##   1951       1    0     0    0    2
##   1952       1    0     0    0    0
##   1959       1    0     0    0    1
##   1961       1    0     0    1    6
##   1962       2    0     1    0    3

This DTM can now be used to create an LDA model.

4.4.2 Inferring the number of topics

The thing with LDA models is that I need to tell the model in advance how many topics I assume to be present in the corpus. Since I have neither read the SOTU addresses nor any secondary literature about them, I cannot make an educated guess on how many topics are in there. Furthermore, LDA is often used exactly for this purpose: uncovering the topics within a corpus in the first place. This is totally doable, but you will need to go the extra mile to assess the number of topics and then evaluate your choice.

4.4.2.1 Making guesses

One approach might be to just provide the model with a wild guess on how many topics might be in there and then try to make sense of the results afterwards.

library(topicmodels)
library(broom)
sotu_lda_k10 <- LDA(sotu_dtm, k = 10, control = list(seed = 123))

sotu_lda_k10_tidied <- tidy(sotu_lda_k10)

The tidy() function from the broom package (Robinson 2020) brings the LDA output back into a tidy format. The result consists of three columns: the topic, the term, and beta, the probability of the term being generated from that topic.

sotu_lda_k10_tidied %>% glimpse()
## Rows: 41,920
## Columns: 3
## $ topic <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1,…
## $ term  <chr> "abandon", "abandon", "abandon", "abandon", "abandon", "abandon"…
## $ beta  <dbl> 2.631696e-04, 7.018939e-07, 4.746713e-04, 1.054944e-09, 3.522191…

Now, I can wrangle it a bit, and then visualize it with ggplot2.

top_terms_k10 <- sotu_lda_k10_tidied %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms_k10 %>%
  mutate(topic = factor(topic),
         term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = topic)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~topic, scales = "free", ncol = 2) +
  coord_flip()

Now the hard part begins: making sense of the topics in an inductive manner. But, of course, there is a considerable probability that I just chose the wrong number of topics. Therefore, before scratching my head trying to come to meaningful conclusions, I should first assess what the optimal number of topics is.

4.4.2.2 More elaborate methods

LDA offers a couple of parameters to tune, but the most crucial one probably is k, the number of topics.

library(ldatuning)
determine_k <- FindTopicsNumber(
  sotu_dtm,
  topics = seq(from = 2, to = 30, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 4L,
  verbose = TRUE
)
FindTopicsNumber_plot(determine_k)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

I would actually go with 23 topics here, as this number seems to maximize at least one of the metrics that are to be maximized while doing quite well at minimizing the others.

4.4.3 Sense-making

Now the harder part begins: making sense of the different topics. In LDA, words can belong to multiple topics, which makes the topics not perfectly distinguishable. Also, as the number of topics grows, plotting them does not make much sense anymore.

sotu_lda_k23 <- LDA(sotu_dtm, k = 23, control = list(seed = 77))

sotu_lda_k23_tidied <- tidy(sotu_lda_k23)
topic_list <- sotu_lda_k23_tidied %>% 
  group_by(topic) %>% 
  group_split() %>% 
  map_dfc(~.x %>% slice_max(beta, n = 20) %>% arrange(-beta) %>% slice(1:20)  %>% select(term))
## New names:
## * term -> term...1
## * term -> term...2
## * term -> term...3
## * term -> term...4
## * term -> term...5
## * ...
names(topic_list) <- str_c("topic", 1:23, sep = " ")

One way to assess the topics in a pairwise manner is to compute the log ratio of their term probabilities:

log_top_1_2 <- sotu_lda_k23_tidied %>% 
  filter(topic %in% c(1, 2)) %>%
  mutate(topic = str_c("topic", topic, sep = "_")) %>% 
  pivot_wider(names_from = topic, values_from = beta) %>% 
  filter(topic_1 > .001 | topic_2 > .001) %>%
  mutate(log_ratio = log2(topic_2 / topic_1))
log_top_1_2 %>% 
  slice_max(log_ratio, n = 10) %>% 
  bind_rows(log_top_1_2 %>% slice_min(log_ratio, n = 10)) %>% 
  mutate(term = reorder(term, log_ratio)) %>% 
  ggplot() +
  geom_col(aes(log_ratio, term))

However, this approach has its weaknesses, too, the biggest being that it only works for pairs of topics. For comparisons across more than two topics, you could introduce some sort of cut-off for beta and then calculate a modified TF-IDF:

weighted_topics <- sotu_lda_k23_tidied %>% 
  filter(beta > 0.001) %>%
  bind_tf_idf(term, topic, beta) %>% 
  group_by(topic) %>% 
  group_split() %>% 
  map_dfc(~.x %>% slice_max(tf_idf, n = 20, with_ties = FALSE) %>% arrange(-beta) %>% select(term))
## New names:
## * term -> term...1
## * term -> term...2
## * term -> term...3
## * term -> term...4
## * term -> term...5
## * ...
colnames(weighted_topics) <- str_c("topic", 1:23,  sep = " ")

kableExtra::kable(weighted_topics)
topic 1 topic 2 topic 3 topic 4 topic 5 topic 6 topic 7 topic 8 topic 9 topic 10 topic 11 topic 12 topic 13 topic 14 topic 15 topic 16 topic 17 topic 18 topic 19 topic 20 topic 21 topic 22 topic 23
energi energi administr free shall legisl crime shall ve freedom open ask share centuri energi cut free tonight free legisl children administr communist
initi five energi communist free energi becom object re re centuri tonight transport tonight presid know strength vietnam allianc soviet challeng four advanc
addit rural rural whole prosper nuclear clean farmer preserv tonight man re revenu know oil get fight tri session energi school women project
oil result ensur upon expenditur regul land surplus hold ve door thing substanti ve common re necessari think alon regulatori ask region upon
result indian foundat aggress vigor oil spirit request half see chanc ve influenc hand report say ahead men poverti region thank implement construct
suppli role agenc atom advanc urban decad mind happen never enjoy get older re strateg thing real south faith manag pass rural eight
conserv transport manag ruler self passag air immigr recess dream affair term often think submit think aggress asia short oil tonight refuge abroad
serious environment effici answer enterpris review qualiti detail crimin spirit greatest put indian child term don produc poverti side implement parent coal reserv
low bicentenni implement demand matter reorgan water disast trust god vision ll agenda leadership suppli someth missil fight vote pursu colleg regulatori scienc
strateg structur export question materi environment sens integr offici ll shape laughter manpow idea branch parent kind conflict west afghanistan child sever missil
coordin flexibl amend fear affair consum chanc facil rais faith recoveri almost career histor constitut ll ideal carri berlin youth centuri sector mutual
non crisi regulatori immedi adequ effici seventi highway return revolut center won properti around barrel class threat enough lack refuge teacher africa asset
appropri student water republ januari contribut question extens needi speak doubl line emerg readi per lot korea beauti afford approv bipartisan bank vast
fuel disast congression side particip earli longer subject medicar love declin chamber soon kid proud got side entir mere bank gun rapid rang
relationship receiv enhanc korea respect africa rather employe incent excel function big monetari bless coal parti equip north doubt ii medicar financi inform
took week fy faith adjust treati dream foster januari whose grown recoveri court mission gas renew scientif almost corp resolv rais solar launch
enabl offici solar heart stabl lower victim consider travel nicaragua feel freez emphasi account conserv cold disarma desir ultim highest fellow popul hour
disabl avail relationship understand wide particip realism mention behind veto gone didn pressur iraq crude coverag defend river instrument intellig student afghanistan prepar
sourc privaci antitrust ii mainten grain enter mobil abund courag affect yes strike gulf petroleum coven critic wish destruct pakistan read adopt gold
access district anti collect expans fuel primari thousand found central mankind strategi communic saddam judici everybodi practic skill will quick internet african suffici

4.4.4 Document-topic probabilities

Another thing to assess are the document-topic probabilities gamma: which document belongs to which topic(s)? Based on these, you can pick the documents that have the highest probability of belonging to a topic and then read them specifically. This might give you a better understanding of what the different topics imply.

sotu_lda_k23_document <- tidy(sotu_lda_k23, matrix = "gamma")

This shows you the proportion of words in a document that were drawn from each topic. In 1990, for instance, many words were drawn from the first topic.

sotu_lda_k23_document %>% 
  group_by(document) %>% 
  slice_max(gamma, n = 1)
## # A tibble: 50 × 3
## # Groups:   document [50]
##    document topic gamma
##    <chr>    <int> <dbl>
##  1 1950         5 0.679
##  2 1951        17 0.984
##  3 1952        17 0.970
##  4 1953         4 0.872
##  5 1954         8 0.410
##  6 1955         5 0.907
##  7 1956         8 0.975
##  8 1957         5 1.00 
##  9 1958        17 0.704
## 10 1959        17 0.452
## # … with 40 more rows

An interesting pattern is that the topics show some time-dependency. This intuitively makes sense, as they might represent some sort of deeper underlying issue.
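
One way to inspect this visually – a minimal sketch, reusing the tidied gamma matrix from above – is to plot the document-topic probabilities over the years as a heat map:

# tile shade encodes gamma, i.e., the share of an address's words drawn from a topic
sotu_lda_k23_document %>% 
  mutate(document = as.integer(document)) %>% 
  ggplot(aes(x = document, y = factor(topic), fill = gamma)) +
  geom_tile() +
  labs(x = "year", y = "topic", fill = "gamma")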

4.4.4.1 Further readings

References

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. “Quanteda: An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3 (30): 774. https://doi.org/10.21105/joss.00774.
Feinerer, Ingo, Kurt Hornik, and David Meyer. 2008. “Text Mining Infrastructure in R.” Journal of Statistical Software 25 (5). https://doi.org/10.18637/jss.v025.i05.
Grün, Bettina, Kurt Hornik, David Blei, John Lafferty, Xuan-Hieu Phan, Makoto Matsumoto, Nishimura Takuji, and Shawn Cokus. 2020. “Topicmodels: Topic Models.”
Robinson, David. 2020. “Broom: Convert Statistical Analysis Objects into Tidy Data Frames.”
Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” The Journal of Open Source Software 1 (3): 37. https://doi.org/10.21105/joss.00037.
Wickham, Hadley. 2019a. “Stringr: Simple, Consistent Wrappers for Common String Operations.”