Chapter 19 STM
Hello! Today, we’ll learn about structural topic modeling. For this tutorial, we will use the same data as we did in the LDA topic modeling class. So let’s begin by loading the data!
options(scipen=999)
library(tidyverse)
library(quanteda)
library(tidytext)
library(topicmodels)
library(stm)
tweet_data <- read_csv("data/tweets_academia.csv") %>%
  select(user_id, status_id, created_at, screen_name, text, is_retweet, favorite_count, retweet_count, verified)
For this analysis, we will again focus on the tweet_data$text
column, which contains the tweet message posted by the individual.
19.1 Data Cleaning/Wrangling
This time, in addition to removing URLs, we’re going to also delete duplicates. The main reason for this is to avoid any one retweet or message “overweighing” our model. When this step is not done, you will sometimes get a topic model with a topic that is predominantly one tweet or one account. Removing duplicates (which, in the case of Twitter, are typically retweets) can help with this issue.
tweet_data <- tweet_data[!duplicated(tweet_data$text), ]
tweet_data$text <- str_replace_all(tweet_data$text, " ?(f|ht)tp(s?)://(.*)[.][a-z]+", "")
For the wrangling, there are two possible options. The first is to use textProcessor(), the default processor in stm. It is a “wrapper” around the tm package, which means that it accepts the same arguments as the corresponding tm functions. You can use textProcessor() to remove a variety of things, including stopwords, numbers, and punctuation marks (all of these default to TRUE for removal). There are other things you can remove as well, so I encourage you to check out the ?textProcessor help page.
tweet_processed <- textProcessor(tweet_data$text,
                                 metadata = tweet_data,
                                 lowercase = TRUE,
                                 striphtml = TRUE)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
Alternatively, you can also create a quanteda
document feature matrix (dfm). Refer back to our Week 8 bonus tutorial on quanteda
for more.
19.2 prepDocuments
In addition to this processing step, stm also has one additional function, prepDocuments(). This function is used to “clean up” your document term matrix. One reason this step is especially useful is that you can set an upper or lower threshold for which words to include.
Why is this important? As we have discussed, NLP data can be very sparse. When you construct a dfm (or dtm), you may have noticed that your matrix is reported as very sparse, like 90% or even 99% sparsity. This is pretty typical of text analysis; after all, you will probably have more words than messages. For this reason, it is sometimes helpful to establish a lower threshold. We state this in prepDocuments() using the lower.thresh argument. In our case, lower.thresh = 20 means that any word appearing in fewer than 20 documents is automatically excluded from the analysis. This helps make the data less sparse. If some words appear too frequently in your corpus (this can be common when you do not include custom stop words, or when the terms you searched by appear throughout your corpus), you can also remove them using the upper.thresh argument.
out <- prepDocuments(tweet_processed$documents, tweet_processed$vocab,
                     tweet_processed$meta, lower.thresh = 20)
## Removing 44955 of 46961 terms (82528 of 386882 tokens) due to frequency
## Removing 38 Documents with No Words
## Your corpus now has 25118 documents, 2006 terms and 304354 tokens.
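To make the sparsity and lower-threshold logic concrete, here is a toy base-R sketch (the matrix and threshold are invented for illustration; prepDocuments() handles all of this for you via lower.thresh):

```r
# Toy document-term matrix: 3 documents (rows) x 4 terms (columns)
dtm <- matrix(c(1, 0, 0, 2,
                0, 1, 0, 1,
                0, 0, 0, 3),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("rare1", "rare2", "never", "common")))

# Sparsity: the proportion of cells that are zero
sparsity <- mean(dtm == 0)   # 7 of 12 cells, about 58%

# Document frequency: how many documents does each term appear in?
doc_freq <- colSums(dtm > 0)

# Keep only terms appearing in at least 2 documents -- the same idea as
# lower.thresh = 20 above, scaled down to this toy corpus
dtm_trimmed <- dtm[, doc_freq >= 2, drop = FALSE]
colnames(dtm_trimmed)   # only "common" survives
```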
19.3 Choosing the K
Like LDA (and other clustering strategies in general), determining the k number of topics can be tricky. In stm, this is done using the searchK() function. This is similar to the LDA k search: it works by building a model for each k (in this case, we start with 15) and then iteratively comparing one model to the next, so k = 15 is compared to k = 16, which is then compared to k = 17, and so on. This is a somewhat time-consuming process, which is important to keep in mind if you plan to run the next chunk of code.
tnum <- searchK(out$documents, out$vocab, K = c(15:20),
                prevalence =~ verified,
                data = out$meta)
tnum
19.4 Structural Topic Modeling
Let us now proceed with building our structural topic model! One thing you’ll notice about the stm() function is that it takes many arguments. For our analysis, we use the output of the prepDocuments() function (which returns a documents element, a vocab element, and a meta element). In addition to this, we also have to state the k number of topics we want (we’ll use 10 here), as well as the init.type (this is similar to the sampling-strategy argument in LDA). Finally, there is the prevalence argument, which allows you to specify the meta-data variables you are interested in using as covariates. Covariates complicate your model, so you don’t want to throw in all your possible meta-data. But if you have an especially important meta-data variable, this is one way to account for it.
tweets_stm <- stm(documents = out$documents, vocab = out$vocab,
                  K = 10,
                  prevalence =~ verified,
                  max.em.its = 50,
                  data = out$meta,
                  init.type = "Spectral",
                  seed = 100)
## Beginning Spectral Initialization
## Calculating the gram matrix...
## Finding anchor words...
## ..........
## Recovering initialization...
## ....................
## Initialization complete.
## ....................................................................................................
## Completed E-Step (6 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -6.453)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -6.389, relative change = 1.001e-02)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -6.337, relative change = 8.080e-03)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -6.302, relative change = 5.516e-03)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -6.280, relative change = 3.535e-03)
## Topic 1: grad, school, program, learn, student
## Topic 2: school, like, just, write, read
## Topic 3: get, school, grad, student, job
## Topic 4: school, home, academ, grad, privat
## Topic 5: school, year, grad, dont, know
## Topic 6: grad, academia, love, high, one
## Topic 7: school, amp, went, law, work
## Topic 8: grad, first, just, got, time
## Topic 9: academ, student, year, univers, high
## Topic 10: school, grad, high, work, take
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -6.266, relative change = 2.176e-03)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -6.258, relative change = 1.322e-03)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -6.253, relative change = 7.991e-04)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 9 (approx. per word bound = -6.250, relative change = 4.774e-04)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 10 (approx. per word bound = -6.248, relative change = 2.749e-04)
## Topic 1: school, grad, learn, program, graduat
## Topic 2: school, like, just, realli, read
## Topic 3: get, school, grad, can, job
## Topic 4: school, home, privat, teach, big
## Topic 5: year, school, know, dont, last
## Topic 6: grad, love, academia, one, book
## Topic 7: amp, school, went, law, peopl
## Topic 8: grad, first, got, just, time
## Topic 9: academ, student, year, teacher, univers
## Topic 10: school, work, take, high, colleg
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 11 (approx. per word bound = -6.247, relative change = 1.470e-04)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 12 (approx. per word bound = -6.247, relative change = 6.913e-05)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 13 (approx. per word bound = -6.247, relative change = 2.103e-05)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Model Converged
19.4.1 Results
Let’s see what these topics look like!
labelTopics(tweets_stm, c(1:10))
## Topic 1 Top Words:
## Highest Prob: school, grad, learn, program, graduat, student, research
## FREX: gpa, posit, weight, schedul, fair, sat, scienc
## Lift: bigfacesportss, ncaa, scam, tampa, height, jefferson, weight
## Score: school, scam, grad, program, learn, gpa, weight
## Topic 2 Top Words:
## Highest Prob: school, like, realli, read, just, write, want
## FREX: write, point, realli, word, hate, fuck, thought
## Lift: racism, <U+FFFD><U+FFFD>s, citat, word, nigga, cite, bruh
## Score: school, racism, like, write, realli, read, paper
## Topic 3 Top Words:
## Highest Prob: get, school, grad, can, will, job, don't
## FREX: get, tell, loan, pay, debt, money, job
## Lift: proctor, ban, invas, privaci, surveil, protect, loan
## Score: invas, school, get, grad, tell, job, pay
## Topic 4 Top Words:
## Highest Prob: school, teach, home, state, big, privat, requir
## FREX: privat, assist, compani, academi, comfort, tutor, learner
## Lift: princ, comfort, compani, learner, assist, tutor, academi
## Score: princ, school, home, privat, assist, tutor, academi
## Topic 5 Top Words:
## Highest Prob: year, school, know, dont, last, now, live
## FREX: last, dont, without, thesi, ago, motiv, can't
## Lift: asuustrik, unnecessarili, elong, relax, endasuustrikenow, pain, nigerian
## Score: asuustrik, school, year, last, grad, dont, can't
## Topic 6 Top Words:
## Highest Prob: grad, love, academia, one, book, friend, watch
## FREX: hero, watch, academia, god, dark, favorit, pictur
## Lift: boku, haikyuu, naruto, ouran, slayer, hero, demon
## Score: boku, grad, academia, love, hero, watch, book
## Topic 7 Top Words:
## Highest Prob: amp, school, went, peopl, law, talk, mani
## FREX: law, went, white, women, amp, often, woman
## Lift: red, conserv, male, woman, deni, women, opinion
## Score: school, red, amp, went, law, black, peopl
## Topic 8 Top Words:
## Highest Prob: grad, first, just, got, time, week, start
## FREX: cri, first, final, semest, got, done, lol
## Lift: academiccel, midterm, tear, mail, cri, cmohri, final
## Score: grad, academiccel, first, got, semest, cri, finish
## Topic 9 Top Words:
## Highest Prob: academ, student, year, help, new, teacher, univers
## FREX: fafsa, freez, ako, ang, yung, mga, lang
## Lift: closur, dahil, walang, rin, yung, nga, mga
## Score: academ, closur, student, year, fafsa, freez, educ
## Topic 10 Top Words:
## Highest Prob: work, school, high, take, colleg, better, plan
## FREX: plan, that, better, take, senior, cut, colleg
## Lift: -plus, adultfiction<U+FFFD>, erot, erotica, kelsey, scandals<U+FFFD>, smith
## Score: school, -plus, work, high, take, colleg, better
We can show these words differently:
plot.STM(tweets_stm, type = "labels")
You can also plot the distribution of these topics:
plot.STM(tweets_stm, type = "summary")
Check out ?plot.STM for more information.
19.4.1.1 Correlations
One thing that distinguishes structural topic modeling from LDA topic modeling is the ability to see whether topics are correlated. For this, we will use the topicCorr() function in stm. When we plot the output of topicCorr(), we get an interesting network diagram of the topics: if there is a line between two topics, they are correlated. You can establish a specific cutoff point using the cutoff argument in topicCorr(). If you have data that are not quite normal, you may also want to consider changing the default method argument from "simple" to "huge".
set.seed(381)
mod.out.corr <- topicCorr(tweets_stm)
plot(mod.out.corr)
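Under the hood, the idea is simply to correlate the topic-proportion (theta) columns and keep an edge wherever the correlation exceeds the cutoff. Here is a toy base-R sketch of that logic (the theta matrix is fake; topicCorr() does the real estimation, with more sophisticated options):

```r
set.seed(42)
# Fake theta matrix: 20 documents x 3 topics, rows normalized to sum to 1
theta_toy <- matrix(runif(60), nrow = 20, ncol = 3)
theta_toy <- theta_toy / rowSums(theta_toy)

topic_cor <- cor(theta_toy)                    # topic-by-topic correlations
cutoff    <- 0.1                               # analogous to the cutoff argument
adjacency <- (topic_cor > cutoff) & !diag(3)   # TRUE = draw an edge; no self-loops
```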
19.4.2 Extracting Thetas
In structural topic modeling, gammas (the scores we use to evaluate the proportion of a document that belongs to a topic) are called “thetas.” Don’t be confused by the name change: thetas serve the same function as gamma scores, allowing you to figure out which topic to assign to each document.
theta_scores <- tweets_stm$theta %>% as.data.frame()
theta_scores$status_id <- out$meta$status_id # from the "out" processed file
#View(theta_scores)
If you View(theta_scores)
, you’ll notice that tweets_stm$theta
is already structured in a wide format. To isolate the topics with the highest theta for each document (as we did in the LDA tutorial), we will need to convert this to a “long” format.
topics_long <- theta_scores %>%
  pivot_longer(cols = V1:V10,
               names_to = "topic",
               values_to = "theta")
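If pivot_longer() is new to you, this is all it is doing, sketched in base R on a tiny made-up data frame (two tweets, two topics):

```r
# Hypothetical wide data: one row per tweet, one theta column per topic
wide <- data.frame(status_id = c("a", "b"),
                   V1 = c(0.9, 0.2),
                   V2 = c(0.1, 0.8))

# The "long" version: one row per tweet-topic pair
long <- data.frame(status_id = rep(wide$status_id, times = 2),
                   topic     = rep(c("V1", "V2"), each = 2),
                   theta     = c(wide$V1, wide$V2))
long
```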
Now that we have our long data, we can proceed with extracting the top thetas…
toptopics <- topics_long %>%
  group_by(status_id) %>%
  slice_max(theta)
colnames(toptopics)[1] <- "status_id"
colnames(toptopics)[2] <- "topics"
toptopics$status_id <- as.numeric(toptopics$status_id)
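The slice_max() step just keeps the row with the largest theta for each tweet. The same idea can be sketched in base R with max.col(), which returns the column index of the largest value in each row (toy matrix below):

```r
# Toy thetas: 3 documents (rows) x 3 topics (columns)
theta_toy <- matrix(c(0.7, 0.2, 0.1,
                      0.1, 0.1, 0.8,
                      0.3, 0.5, 0.2),
                    nrow = 3, byrow = TRUE)

top_topic <- max.col(theta_toy)   # winning topic for each document
top_topic
# [1] 1 3 2
```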
And plotting our results…
table(toptopics$topics) %>% as.data.frame() %>%
ggplot(aes(x = Var1, y = Freq)) +
geom_bar(stat = "identity")
Ta da!
Want more practice with Structural Topic Modeling? Check out these tutorials:

* https://blogs.uoregon.edu/rclub/2016/04/05/structural-topic-modeling/
* STM Website
* Julia Silge’s Tutorial (it also has a great video)
* R Bloggers Tutorial