Chapter 19 STM
Hello! Today, we’ll learn about structural topic modeling. For this tutorial, we will use the same data as we did in the LDA topic modeling class. So let’s begin by loading the data!
options(scipen=999)
library(tidyverse)
library(quanteda)
library(tidytext)
library(topicmodels)
library(stm)
tweet_data <- read_csv("data/tweets_academia.csv") %>%
  select(user_id, status_id, created_at, screen_name, text, is_retweet, favorite_count, retweet_count, verified)
For this analysis, we will again focus on the tweet_data$text
column, which contains the tweet message posted by the individual.
19.1 Data Cleaning/Wrangling
This time, in addition to removing URLs, we’re going to also delete duplicates. The main reason for this is to avoid any one retweet or message “overweighing” our model. When this step is not done, you will sometimes get a topic model with a topic that is predominantly one tweet or one account. Removing duplicates (which, in the case of Twitter, are typically retweets) can help with this issue.
tweet_data <- tweet_data[!duplicated(tweet_data$text), ]
tweet_data$text <- str_replace_all(tweet_data$text, " ?(f|ht)tp(s?)://(.*)[.][a-z]+", "")
For the wrangling, there are two possible options. The first is to use textProcessor(), the default processor in stm. It is a “wrapper” around the tm package, which means that it accepts the same arguments as the corresponding tm functions. You can use textProcessor() to remove a variety of things, including stopwords, numbers, and punctuation marks (all of these default to TRUE for removal). There are other things you can remove as well, so I encourage you to check out the ?textProcessor help page.
tweet_processed <- textProcessor(tweet_data$text,
                                 metadata = tweet_data,
                                 lowercase = TRUE,
                                 striphtml = TRUE)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
Alternatively, you can also create a quanteda
document feature matrix (dfm). Refer back to our Week 8 bonus tutorial on quanteda
for more.
19.2 prepDocuments
In addition to this processing step, stm also has one additional function, prepDocuments(). This function is used to “clean up” your document term matrix. One reason this step is especially useful is that you can set an upper or lower threshold for which words to include.
Why is this important? As we have discussed, NLP data can be very sparse. When you construct a dfm (or dtm), you may have noticed that your matrix is reported as very sparse, like 90% or even 99% sparsity. This is pretty typical of text analysis; after all, you will probably have more words than messages. For this reason, it is sometimes helpful to establish a lower threshold. We state this in prepDocuments() using the lower.thresh argument. In our case, lower.thresh = 20 means that any word appearing in fewer than 20 documents is automatically excluded from the analysis. This helps make the data less sparse. If some words appear too frequently in your corpus (this can be common when you do not include custom stop words, or when the terms you searched by appear throughout your corpus), you can also remove them using the upper.thresh argument.
out <- prepDocuments(tweet_processed$documents, tweet_processed$vocab,
                     tweet_processed$meta, lower.thresh = 20)
## Removing 44955 of 46961 terms (82528 of 386882 tokens) due to frequency
## Removing 38 Documents with No Words
## Your corpus now has 25118 documents, 2006 terms and 304354 tokens.
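To make the sparsity and lower-threshold logic concrete, here is a toy base-R sketch (the matrix and threshold are invented for illustration; prepDocuments() handles all of this for you via lower.thresh):

```r
# Toy document-term matrix: 3 documents (rows) x 4 terms (columns)
dtm <- matrix(c(1, 0, 0, 2,
                0, 1, 0, 1,
                0, 0, 0, 3),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("rare1", "rare2", "never", "common")))

# Sparsity: the proportion of cells that are zero
sparsity <- mean(dtm == 0)   # 7 of 12 cells, about 58%

# Document frequency: how many documents does each term appear in?
doc_freq <- colSums(dtm > 0)

# Keep only terms appearing in at least 2 documents -- the same idea as
# lower.thresh = 20 above, scaled down to this toy corpus
dtm_trimmed <- dtm[, doc_freq >= 2, drop = FALSE]
colnames(dtm_trimmed)   # only "common" survives
```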
19.3 Choosing the K
Like LDA (and other clustering strategies in general), determining the k number of topics can be tricky. In stm, this is done using the searchK() function. This is similar to the LDA k search: it works by building a model for each k (in this case, we start with 15) and then iteratively comparing one model to the next, so k = 15 is compared to k = 16, which is then compared to k = 17, and so on. This is a somewhat time-consuming process, which is important to keep in mind if you plan to run the next chunk of code.
tnum <- searchK(out$documents, out$vocab, K = c(15:20),
                prevalence =~ verified,
                data = out$meta)
tnum
19.4 Structural Topic Modeling
Let us now proceed with building our structural topic model! One thing you’ll notice about the stm() function is that it takes many arguments. For our analysis, we use the output of the prepDocuments() function (which returns a documents element, a vocab element, and a meta element). In addition to this, we also have to state the k number of topics we want (we’ll use 10 here), as well as the init.type (this is similar to the sampling-strategy argument in LDA). Finally, there is the prevalence argument, which allows you to specify the meta-data variables you are interested in using as covariates. Covariates complicate your model, so you don’t want to throw in all your possible meta-data. But if you have an especially important meta-data variable, this is one way to account for it.
tweets_stm <- stm(documents = out$documents, vocab = out$vocab,
                  K = 10,
                  prevalence =~ verified,
                  max.em.its = 50,
                  data = out$meta,
                  init.type = "Spectral",
                  seed = 100)
## Beginning Spectral Initialization
## Calculating the gram matrix...
## Finding anchor words...
## ..........
## Recovering initialization...
## ....................
## Initialization complete.
## ....................................................................................................
## Completed E-Step (6 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -6.453)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -6.389, relative change = 1.001e-02)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -6.337, relative change = 8.080e-03)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -6.302, relative change = 5.516e-03)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -6.280, relative change = 3.535e-03)
## Topic 1: grad, school, program, learn, student
## Topic 2: school, like, just, write, read
## Topic 3: get, school, grad, student, job
## Topic 4: school, home, academ, grad, privat
## Topic 5: school, year, grad, dont, know
## Topic 6: grad, academia, love, high, one
## Topic 7: school, amp, went, law, work
## Topic 8: grad, first, just, got, time
## Topic 9: academ, student, year, univers, high
## Topic 10: school, grad, high, work, take
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -6.266, relative change = 2.176e-03)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -6.258, relative change = 1.322e-03)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -6.253, relative change = 7.991e-04)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 9 (approx. per word bound = -6.250, relative change = 4.774e-04)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 10 (approx. per word bound = -6.248, relative change = 2.749e-04)
## Topic 1: school, grad, learn, program, graduat
## Topic 2: school, like, just, realli, read
## Topic 3: get, school, grad, can, job
## Topic 4: school, home, privat, teach, big
## Topic 5: year, school, know, dont, last
## Topic 6: grad, love, academia, one, book
## Topic 7: amp, school, went, law, peopl
## Topic 8: grad, first, got, just, time
## Topic 9: academ, student, year, teacher, univers
## Topic 10: school, work, take, high, colleg
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 11 (approx. per word bound = -6.247, relative change = 1.470e-04)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 12 (approx. per word bound = -6.247, relative change = 6.913e-05)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Completing Iteration 13 (approx. per word bound = -6.247, relative change = 2.103e-05)
## ....................................................................................................
## Completed E-Step (4 seconds).
## Completed M-Step.
## Model Converged
19.4.1 Results
Let’s see what these topics look like!
labelTopics(tweets_stm, c(1:10))
## Topic 1 Top Words:
## Highest Prob: school, grad, learn, program, graduat, student, research
## FREX: gpa, posit, weight, schedul, fair, sat, scienc
## Lift: bigfacesportss, ncaa, scam, tampa, height, jefferson, weight
## Score: school, scam, grad, program, learn, gpa, weight
## Topic 2 Top Words:
## Highest Prob: school, like, realli, read, just, write, want
## FREX: write, point, realli, word, hate, fuck, thought
## Lift: racism, <U+FFFD><U+FFFD>s, citat, word, nigga, cite, bruh
## Score: school, racism, like, write, realli, read, paper
## Topic 3 Top Words:
## Highest Prob: get, school, grad, can, will, job, don't
## FREX: get, tell, loan, pay, debt, money, job
## Lift: proctor, ban, invas, privaci, surveil, protect, loan
## Score: invas, school, get, grad, tell, job, pay
## Topic 4 Top Words:
## Highest Prob: school, teach, home, state, big, privat, requir
## FREX: privat, assist, compani, academi, comfort, tutor, learner
## Lift: princ, comfort, compani, learner, assist, tutor, academi
## Score: princ, school, home, privat, assist, tutor, academi
## Topic 5 Top Words:
## Highest Prob: year, school, know, dont, last, now, live
## FREX: last, dont, without, thesi, ago, motiv, can't
## Lift: asuustrik, unnecessarili, elong, relax, endasuustrikenow, pain, nigerian
## Score: asuustrik, school, year, last, grad, dont, can't
## Topic 6 Top Words:
## Highest Prob: grad, love, academia, one, book, friend, watch
## FREX: hero, watch, academia, god, dark, favorit, pictur
## Lift: boku, haikyuu, naruto, ouran, slayer, hero, demon
## Score: boku, grad, academia, love, hero, watch, book
## Topic 7 Top Words:
## Highest Prob: amp, school, went, peopl, law, talk, mani
## FREX: law, went, white, women, amp, often, woman
## Lift: red, conserv, male, woman, deni, women, opinion
## Score: school, red, amp, went, law, black, peopl
## Topic 8 Top Words:
## Highest Prob: grad, first, just, got, time, week, start
## FREX: cri, first, final, semest, got, done, lol
## Lift: academiccel, midterm, tear, mail, cri, cmohri, final
## Score: grad, academiccel, first, got, semest, cri, finish
## Topic 9 Top Words:
## Highest Prob: academ, student, year, help, new, teacher, univers
## FREX: fafsa, freez, ako, ang, yung, mga, lang
## Lift: closur, dahil, walang, rin, yung, nga, mga
## Score: academ, closur, student, year, fafsa, freez, educ
## Topic 10 Top Words:
## Highest Prob: work, school, high, take, colleg, better, plan
## FREX: plan, that, better, take, senior, cut, colleg
## Lift: -plus, adultfiction<U+FFFD>, erot, erotica, kelsey, scandals<U+FFFD>, smith
## Score: school, -plus, work, high, take, colleg, better
We can show these words differently:
plot.STM(tweets_stm, type = "labels")
You can also plot the distribution of these topics:
plot.STM(tweets_stm, type = "summary")
Check out ?plot.STM for more information.
19.4.1.1 Correlations
One thing that distinguishes structural topic modeling from LDA topic modeling is the ability to see whether topics are correlated. For this, we will use the topicCorr() function in stm. When we plot the output of topicCorr(), we get an interesting network diagram of the topics: if there is a line between two topics, they are correlated. You can establish a specific cutoff point using the cutoff argument in topicCorr(). If you have data that are not quite normal, you may also want to consider changing the default method argument from "simple" to "huge".
set.seed(381)
mod.out.corr <- topicCorr(tweets_stm)
plot(mod.out.corr)
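Under the hood, the idea is simply to correlate the topic-proportion (theta) columns and keep an edge wherever the correlation exceeds the cutoff. Here is a toy base-R sketch of that logic (the theta matrix is fake; topicCorr() does the real estimation, with more sophisticated options):

```r
set.seed(42)
# Fake theta matrix: 20 documents x 3 topics, rows normalized to sum to 1
theta_toy <- matrix(runif(60), nrow = 20, ncol = 3)
theta_toy <- theta_toy / rowSums(theta_toy)

topic_cor <- cor(theta_toy)                    # topic-by-topic correlations
cutoff    <- 0.1                               # analogous to the cutoff argument
adjacency <- (topic_cor > cutoff) & !diag(3)   # TRUE = draw an edge; no self-loops
```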
19.4.2 Extracting Thetas
In structural topic modeling, gammas (the scores we use to evaluate the proportion of a document that belongs to a topic) are called “thetas.” Don’t be confused by the name change: thetas serve the same function as gamma scores, allowing you to figure out which topic to assign to each document.
theta_scores <- tweets_stm$theta %>% as.data.frame()
theta_scores$status_id <- out$meta$status_id # from the "out" processed file
#View(theta_scores)
If you View(theta_scores)
, you’ll notice that tweets_stm$theta
is already structured in a wide format. To isolate the topics with the highest theta for each document (as we did in the LDA tutorial), we will need to convert this to a “long” format.
topics_long <- theta_scores %>%
  pivot_longer(cols = V1:V10,
               names_to = "topic",
               values_to = "theta")
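If pivot_longer() is new to you, this is all it is doing, sketched in base R on a tiny made-up data frame (two tweets, two topics):

```r
# Hypothetical wide data: one row per tweet, one theta column per topic
wide <- data.frame(status_id = c("a", "b"),
                   V1 = c(0.9, 0.2),
                   V2 = c(0.1, 0.8))

# The "long" version: one row per tweet-topic pair
long <- data.frame(status_id = rep(wide$status_id, times = 2),
                   topic     = rep(c("V1", "V2"), each = 2),
                   theta     = c(wide$V1, wide$V2))
long
```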
Now that we have our long data, we can proceed with extracting the top thetas…
toptopics <- topics_long %>%
  group_by(status_id) %>%
  slice_max(theta)
colnames(toptopics)[1] <- "status_id"
colnames(toptopics)[2] <- "topics"
toptopics$status_id <- as.numeric(toptopics$status_id)
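The slice_max() step just keeps the row with the largest theta for each tweet. The same idea can be sketched in base R with max.col(), which returns the column index of the largest value in each row (toy matrix below):

```r
# Toy thetas: 3 documents (rows) x 3 topics (columns)
theta_toy <- matrix(c(0.7, 0.2, 0.1,
                      0.1, 0.1, 0.8,
                      0.3, 0.5, 0.2),
                    nrow = 3, byrow = TRUE)

top_topic <- max.col(theta_toy)   # winning topic for each document
top_topic
# [1] 1 3 2
```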
And plotting our results…
table(toptopics$topics) %>% as.data.frame() %>%
ggplot(aes(x = Var1, y = Freq)) +
geom_bar(stat = "identity")
Ta da!
Want more practice with Structural Topic Modeling? Check out these tutorials:

* https://blogs.uoregon.edu/rclub/2016/04/05/structural-topic-modeling/
* STM Website
* Julia Silge’s Tutorial (it also has a great video)
* R Bloggers Tutorial