Day 5 Machine Learning
In the following script, we will introduce you to the supervised and unsupervised classification of text. Supervised means that we will need to “show” the machine a data set that already contains the value or label we want to predict (the “dependent variable”) as well as all the variables that are used to predict the class/value (the independent variables or, in ML lingo, features). In the examples we will showcase, the features are the tokens that are contained in a document. Dependent variables are in our examples sentiment or party affiliation.
Unsupervised Learning, on the other hand, implies that the machine is fed with a corpus of different documents and asked to find a new organization for them based on their content. In our examples, we deal with so-called topic models, i.e., models that organize documents with regard to their content. Documents which draw on the same set of tokens, i.e., tokens that come from the same topic, are classified as similar. Moreover, the model learns how different tokens are related to one another. Again, the rationale is that if tokens often appear together in documents, they are probably related to one another and, hence, stem from the same topic.
5.1 Supervised Classification
Overall, the process of supervised classification using text in R encompasses the following steps:
1. Split data into training and test set
2. Pre-processing and featurization
3. Training
4. Evaluation and tuning (through cross-validation); repeat steps 2.–4. as often as necessary
5. Applying the model to the held-out test set
6. Final evaluation
This is mirrored in the workflow() function from the workflows (Vaughan 2022) package. There, you define the pre-processing procedure with add_recipe() – created through the recipe() function from the recipes (Kuhn and Wickham 2022) and/or textrecipes (Hvitfeldt 2022) package(s) – and the model specification with add_model(), which takes a model specification as created by the parsnip (Kuhn, Vaughan, and Hvitfeldt 2022) package.
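As a preview, the bare skeleton of such a workflow looks roughly like this (a minimal sketch; imdb_basic_recipe and nb_spec are the recipe and model specification we will create further below):

imdb_nb_wf <- workflow() %>%
  add_recipe(imdb_basic_recipe) %>% # pre-processing, defined with recipes/textrecipes
  add_model(nb_spec) # model specification, defined with parsnip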
In the next part, other approaches such as Support Vector Machines (SVM), penalized logistic regression models (penalized here means, loosely speaking, that predictors which contribute little will be shrunk and, eventually, ignored – since text data contain many tokens that might not contribute much, those models lend themselves nicely to such tasks), random forest models, or XGBoost will be introduced. These approaches will not be explained in depth – third-party articles will be linked instead – but their intuition and the particularities of their implementation will be described. Since we use the tidymodels (Kuhn and Wickham 2020) framework for implementation, trying out different approaches is straightforward. Where the pre-processing differs, recipes and textrecipes facilitate the necessary adjustments. Third, the evaluation of different classifiers will be described. Finally, the entire workflow will be demonstrated using a Twitter data set.
The first example for today’s session is the IMDb data set. First, we load a whole bunch of packages and the data set.
library(tidymodels)
library(textrecipes)
library(workflows)
library(discrim)
library(glmnet)
library(tidytext)
library(tidyverse)
imdb_data <- read_csv("https://www.dropbox.com/s/0cfr4rkthtfryyp/imdb_reviews.csv?dl=1")
5.1.1 Split the data
The first step is to divide the data into training and test sets using initial_split(). You need to make sure that the test and training set are fairly balanced, which is achieved by using strata =. prop = refers to the proportion of rows that make it into the training set.
split <- initial_split(imdb_data, prop = 0.8, strata = sentiment)

imdb_train <- training(split)
imdb_test <- testing(split)
glimpse(imdb_train)
## Rows: 20,000
## Columns: 2
## $ text <chr> "Once again Mr. Costner has dragged out a movie for far long…
## $ sentiment <chr> "negative", "negative", "negative", "negative", "negative", …
imdb_train %>% count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 10000
## 2 positive 10000
5.1.2 Pre-processing and featurization
In the tidymodels framework, pre-processing and featurization are performed through so-called recipes. For text data, so-called textrecipes are available.
5.1.3 textrecipes – basic example
In the initial call, the formula needs to be provided. In our example, we want to predict the sentiment (“positive” or “negative”) using the text in the review. Then, different steps for pre-processing are added. Similar to what you have learned in the prior chapters containing measures based on the bag of words assumption, the first step is usually tokenization, achieved through step_tokenize(). In the end, the features need to be quantified, either through step_tf(), for raw term frequencies, or step_tfidf(), for TF-IDF. In between, various pre-processing steps can be added, such as word normalization (i.e., stemming or lemmatization) and the removal of rare or common words. Hence, a recipe for a very basic model just using the 1,000 most common words, weighted by TF-IDF, would look as follows:
imdb_basic_recipe <- recipe(sentiment ~ text, data = imdb_train) %>%
  step_tokenize(text) %>% # tokenize text
  step_tokenfilter(text, max_tokens = 1000) %>% # only retain the 1,000 most common words
  # additional pre-processing steps can be added, see next chapter
  step_tfidf(text) # final step: add TF-IDF values
In case you want to know what the data set for the classification task looks like, you can prep() and finally bake() the recipe. Note that we need to specify the data set the recipe’s operations should be applied to. In our case, we want to perform the operations on the data already specified in imdb_basic_recipe and, hence, need to set new_data = NULL.
imdb_basic_recipe %>%
  prep() %>%
  bake(new_data = NULL)
## # A tibble: 20,000 × 1,001
## sentiment tfidf_text_1 tfidf_text_10 tfidf_text_2 tfidf_text_20 tfidf_text_3
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 negative 0 0 0 0 0
## 2 negative 0 0.0112 0 0 0
## 3 negative 0.0154 0 0.0136 0 0
## 4 negative 0 0 0 0 0
## 5 negative 0 0 0 0 0
## 6 negative 0 0 0 0 0
## 7 negative 0 0.0167 0.0195 0 0
## 8 negative 0 0 0 0 0
## 9 negative 0.0274 0 0.0242 0 0
## 10 negative 0 0 0 0 0
## # … with 19,990 more rows, and 995 more variables: tfidf_text_30 <dbl>,
## # tfidf_text_4 <dbl>, tfidf_text_5 <dbl>, tfidf_text_7 <dbl>,
## # tfidf_text_8 <dbl>, tfidf_text_80 <dbl>, tfidf_text_9 <dbl>,
## # tfidf_text_a <dbl>, tfidf_text_able <dbl>, tfidf_text_about <dbl>,
## # tfidf_text_above <dbl>, tfidf_text_absolutely <dbl>,
## # tfidf_text_across <dbl>, tfidf_text_act <dbl>, tfidf_text_acted <dbl>,
## # tfidf_text_acting <dbl>, tfidf_text_action <dbl>, tfidf_text_actor <dbl>, …
5.1.4 textrecipes – further preprocessing steps
More steps exist. These always follow the same structure: their first two arguments are the recipe (which in practice does not matter, because they are generally used in a pipeline) and the variable that is affected (in our example text, because it is the one to be modified). The remaining arguments depend on the function. In the following, we briefly list the steps and their most important arguments; a combined example follows after the list. Find the exhaustive list in the textrecipes documentation.
- step_tokenfilter(): filters tokens
  - max_times = upper threshold for how often a term can appear (removes common words)
  - min_times = lower threshold for how often a term can appear (removes rare words)
  - max_tokens = maximum number of tokens to be retained; will only keep the ones that appear the most often
  - you should filter before using step_tf or step_tfidf to limit the number of variables that are created
- step_lemma(): allows you to extract the lemma
  - in case you want to use it, make sure you tokenize via spacyr (by using step_tokenize(text, engine = "spacyr"))
- step_pos_filter(): adds the Part-of-speech tags
- step_stem(): stems tokens
  - custom_stem = specifies the stemming function. Defaults to SnowballC. Custom functions can be provided.
  - options = can be used to provide arguments (stored as named elements of a list) to the stemming function. E.g., step_stem(text, custom_stem = "SnowballC", options = list(language = "russian"))
- step_stopwords(): removes stopwords
  - source = alternative stopword lists can be used; potential values are contained in stopwords::stopwords_getsources()
  - custom_stopword_source = provide your own stopword list
  - language = specify the language of the stop word list; potential values can be found in stopwords::stopwords_getlanguages()
- step_ngram(): takes into account order of terms, provides more context
  - num_tokens = number of tokens in the n-gram – defaults to 3 (trigrams)
  - min_num_tokens = minimal number of tokens in the n-gram – step_ngram(text, num_tokens = 3, min_num_tokens = 1) will return all uni-, bi-, and trigrams
- step_word_embeddings(): use pre-trained embeddings for words
  - embeddings = tibble of pre-trained embeddings
- step_text_normalization(): performs Unicode normalization as a pre-processing step
  - normalization_form = which Unicode normalization to use; see the overview in stringi::stri_trans_nfc()
- themis::step_upsample(): takes care of unbalanced dependent variables (which need to be specified in the call)
  - over_ratio = ratio of desired minority-to-majority frequencies
5.1.5 Model specification
Now that the data is ready, the model can be specified. The parsnip package is used for this. A model specification consists of the model type, the mode (classification or regression), and the computational engine. For Naïve Bayes, this would look like the following (note that you will need to install the relevant packages – here: discrim and naivebayes – before using them):
nb_spec <- naive_Bayes() %>% # the initial function, coming from the parsnip package
  set_mode("classification") %>% # classification for discrete values, regression for continuous ones
  set_engine("naivebayes") # needs to be installed
Other model specifications you might deem relevant:
- Logistic regression
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
- Logistic regression (penalized with Lasso):
lasso_spec <- logistic_reg(mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")
- SVM (here, step_normalize(all_predictors()) needs to be the last step in the recipe)
svm_spec <- svm_linear() %>%
  set_mode("regression") %>% # can also be "classification"
  set_engine("LiblineaR")
- Random Forest (with 1000 decision trees):
rf_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression") # can also be "classification"
- XGBoost (with 20 decision trees):
xg_spec <- boost_tree(trees = 20) %>%
  set_engine("xgboost") %>%
  set_mode("regression") # can also be "classification"
5.1.6 Model training – workflows
A workflow can be defined to train the model. It will contain the recipe, hence taking care of the pre-processing, and the model specification. In the end, it can be used to fit the model.
imdb_nb_wf <- workflow() %>%
  add_recipe(imdb_basic_recipe) %>%
  add_model(nb_spec)
It can then be fit using fit().
imdb_nb_basic <- imdb_nb_wf %>% fit(data = imdb_train)
5.1.7 Model evaluation
Now that a first model has been trained, its performance can be evaluated. In theory, we have a test set for this. However, the test set is precious and should only be used once we are sure that we have found a good model. Hence, for these intermediary tuning steps, we need to come up with another solution. So-called cross-validation lends itself nicely to this task. The rationale behind it is that chunks of the training set are used as test sets. In the case of 10-fold cross-validation, the training set is divided into 10 distinct folds. Then, 10 models are trained, each on the 9/10 of the training set that is not used for evaluation. Finally, each model is evaluated against its respective held-out “test set” and the performance metrics are averaged.
First, the folds need to be determined. We set a seed in the beginning to ensure reproducibility.
library(tune)
set.seed(123)
imdb_folds <- vfold_cv(imdb_train)
fit_resamples() trains models on the respective samples. (Note that for this to work, no model must have been fit to this workflow before. Hence, you either need to define a new workflow first or restart the session and skip the fit-line from before.)
imdb_nb_resampled <- fit_resamples(
  imdb_nb_wf,
  imdb_folds,
  control = control_resamples(save_pred = TRUE),
  metrics = metric_set(accuracy, recall, precision)
)
#imdb_nb_resampled %>% write_rds("imdb_nb_resampled.rds")
collect_metrics() can be used to evaluate the results.
- Accuracy tells you the share of correct predictions overall.
- Precision tells you the share of positive predictions that are actually positive.
- Recall tells you the share of actual positives that are correctly predicted as positive.
In all cases, values close to 1 are better.
collect_predictions() will give you the predicted values.
nb_rs_metrics <- collect_metrics(imdb_nb_resampled)
nb_rs_predictions <- collect_predictions(imdb_nb_resampled)
This can also be used to create the confusion matrix by hand.
confusion_mat <- nb_rs_predictions %>%
  group_by(id) %>%
  mutate(confusion_class = case_when(
    .pred_class == "positive" & sentiment == "positive" ~ "TP",
    .pred_class == "positive" & sentiment == "negative" ~ "FP",
    .pred_class == "negative" & sentiment == "negative" ~ "TN",
    .pred_class == "negative" & sentiment == "positive" ~ "FN")) %>%
  count(confusion_class) %>%
  ungroup() %>%
  pivot_wider(names_from = confusion_class, values_from = n)
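From these per-fold counts, the metrics introduced above can also be computed by hand. A minimal sketch, assuming the TP, FP, TN, and FN columns created in confusion_mat:

confusion_mat %>%
  mutate(accuracy = (TP + TN) / (TP + TN + FP + FN), # share of correct predictions
         precision = TP / (TP + FP), # share of positive predictions that are correct
         recall = TP / (TP + FN)) # share of actual positives that are found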
Now you can go back and adapt the pre-processing recipe, fit a new model, or try a different classifier, and evaluate it against the same set of folds. Once you are satisfied, you can proceed to check the workflow on the held-out test data.
5.1.8 Hyperparameter tuning
Some models also require the tuning of hyperparameters (for instance, lasso regression). If we wanted to tune these values, we could do so using the tune package. There, the parameter that needs to be tuned gets a placeholder in the model specification. Through variation of the placeholder, the optimal solution can be empirically determined.
So, in the first example, we will try to determine a good penalty value for LASSO regression.
lasso_tune_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_mode("classification") %>%
  set_engine("glmnet")
We will also play with the number of tokens to be included:
imdb_tune_basic_recipe <- recipe(sentiment ~ text, data = imdb_train) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = tune()) %>%
  step_tf(text)
The dials (Kuhn and Frick 2022) package provides the handy grid_regular() function which chooses suitable values for certain parameters.
lambda_grid <- grid_regular(
  penalty(range = c(-4, 0)),
  max_tokens(range = c(1e3, 2e3)),
  levels = c(penalty = 3, max_tokens = 2)
)
Then, we need to define a new workflow, too.
lasso_tune_wf <- workflow() %>%
  add_recipe(imdb_tune_basic_recipe) %>%
  add_model(lasso_tune_spec)
For the resampling, we can use tune_grid(), which takes the workflow, a set of folds (we use the ones we created earlier), and a grid containing the different parameters.
set.seed(123)
tune_lasso_rs <- tune_grid(
  lasso_tune_wf,
  imdb_folds,
  grid = lambda_grid,
  metrics = metric_set(accuracy, sensitivity, specificity)
)
Again, we can access the resulting metrics using collect_metrics():
collect_metrics(tune_lasso_rs)
## # A tibble: 18 × 8
## penalty max_tokens .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0001 1000 accuracy binary 0.862 10 0.00241 Preprocessor1_…
## 2 0.0001 1000 sensitivity binary 0.856 10 0.00199 Preprocessor1_…
## 3 0.0001 1000 specificity binary 0.868 10 0.00344 Preprocessor1_…
## 4 0.01 1000 accuracy binary 0.849 10 0.00212 Preprocessor1_…
## 5 0.01 1000 sensitivity binary 0.813 10 0.00384 Preprocessor1_…
## 6 0.01 1000 specificity binary 0.885 10 0.00305 Preprocessor1_…
## 7 1 1000 accuracy binary 0.491 10 0.00240 Preprocessor1_…
## 8 1 1000 sensitivity binary 0.6 10 0.163 Preprocessor1_…
## 9 1 1000 specificity binary 0.4 10 0.163 Preprocessor1_…
## 10 0.0001 2000 accuracy binary 0.865 10 0.00229 Preprocessor2_…
## 11 0.0001 2000 sensitivity binary 0.865 10 0.00276 Preprocessor2_…
## 12 0.0001 2000 specificity binary 0.866 10 0.00311 Preprocessor2_…
## 13 0.01 2000 accuracy binary 0.858 10 0.00194 Preprocessor2_…
## 14 0.01 2000 sensitivity binary 0.823 10 0.00302 Preprocessor2_…
## 15 0.01 2000 specificity binary 0.894 10 0.00199 Preprocessor2_…
## 16 1 2000 accuracy binary 0.491 10 0.00240 Preprocessor2_…
## 17 1 2000 sensitivity binary 0.6 10 0.163 Preprocessor2_…
## 18 1 2000 specificity binary 0.4 10 0.163 Preprocessor2_…
autoplot() can be used to visualize them:
autoplot(tune_lasso_rs) +
  labs(
    title = "Lasso model performance across 3 regularization penalties"
  )
Also, we can use show_best() to look at the best results. Subsequently, select_best() allows us to choose the best-performing candidate. In real life, however, we would rather choose the best trade-off between a model that is as simple and as good as possible. Using select_by_pct_loss(), we choose the candidate that still performs more or less on par with the best option (i.e., within 2 percent loss of accuracy) but is considerably simpler. Finally, once we are satisfied with the outcome, we can finalize_workflow() with the chosen parameters and evaluate the final model on the held-out test data.
show_best(tune_lasso_rs, "accuracy")
## # A tibble: 5 × 8
## penalty max_tokens .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0001 2000 accuracy binary 0.865 10 0.00229 Preprocessor2_Mode…
## 2 0.0001 1000 accuracy binary 0.862 10 0.00241 Preprocessor1_Mode…
## 3 0.01 2000 accuracy binary 0.858 10 0.00194 Preprocessor2_Mode…
## 4 0.01 1000 accuracy binary 0.849 10 0.00212 Preprocessor1_Mode…
## 5 1 1000 accuracy binary 0.491 10 0.00240 Preprocessor1_Mode…
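For comparison, select_best() would simply return the single best parameter combination; a quick sketch:

select_best(tune_lasso_rs, metric = "accuracy")

Since we prefer a considerably simpler model that still performs roughly on par, the finalization below uses select_by_pct_loss() instead.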
final_lasso_imdb <- finalize_workflow(lasso_tune_wf, select_by_pct_loss(tune_lasso_rs, metric = "accuracy", -penalty))
5.1.9 Final fit
Now we can finally fit our model to the training data and predict on the test data. last_fit() is the way to go. It takes the workflow and the split (as defined by initial_split()) as parameters.
final_fitted <- last_fit(final_lasso_imdb, split)
collect_metrics(final_fitted)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.852 Preprocessor1_Model1
## 2 roc_auc binary 0.928 Preprocessor1_Model1
collect_predictions(final_fitted) %>%
  conf_mat(truth = sentiment, estimate = .pred_class) %>%
  autoplot(type = "heatmap")
5.1.10 Supervised ML with tidymodels in a nutshell
Here, we give you the condensed version of how you would train a model on a number of documents and predict on a certain test set. To exemplify this, I use Tweets from British MPs from the two largest parties.
First, you take your data and split it into training and test set:
set.seed(1)
<- read_csv("https://www.dropbox.com/s/1lrv3i655u5d7ps/timelines_gb_2022.csv?dl=1") timelines_gb
## Rows: 59444 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): party, text
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
split_gb <- initial_split(timelines_gb, prop = 0.3, strata = party)
party_tweets_train_gb <- training(split_gb)
party_tweets_test_gb <- testing(split_gb)
Second, define your text pre-processing steps. In this case, we want to predict partisanship based on textual content. We tokenize the text, upsample to have balanced classes in terms of party in the training set, and retain only the 1,000 most commonly appearing tokens.
twitter_recipe <- recipe(party ~ text, data = party_tweets_train_gb) %>%
  step_tokenize(text) %>% # tokenize text
  themis::step_upsample(party) %>%
  step_tokenfilter(text, max_tokens = 1000) %>%
  step_tfidf(text)
Third, the model specification needs to be defined. In this case, we go with a random forest classifier containing 50 decision trees.
rf_spec <- rand_forest(trees = 50) %>%
  set_engine("ranger") %>%
  set_mode("classification")
Finally, the pre-processing “recipe” and the model specification can be summarized in one workflow and the model can be fit to the training data.
twitter_party_rf_workflow <- workflow() %>%
  add_recipe(twitter_recipe) %>%
  add_model(rf_spec)

party_gb <- twitter_party_rf_workflow %>% fit(data = party_tweets_train_gb)
Now we have arrived at a model we can apply to make predictions using augment(). Finally, we can evaluate its accuracy on the test data by taking the share of correctly predicted values.
predictions_gb <- augment(party_gb, party_tweets_test_gb)
mean(predictions_gb$party == predictions_gb$.pred_class)
## [1] 0.6960252
5.2 Unsupervised Learning: Topic Models
The two main assumptions of the most commonly used topic model, Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003), are as follows:
- Every document is a mixture of topics.
- Every topic is a mixture of words.
Hence, individual documents do not necessarily have to be distinct in terms of their content. They can be related if they contain the same topics. This is fairly in line with natural language use.
The following graphic depicts a flowchart of text analysis with the tidytext package.
What becomes evident is that the actual topic modeling will not happen within tidytext. For this, the text needs to be transformed into a document-term matrix and then passed on to the topicmodels package (Grün et al. 2020), which will take care of the modeling process. Thereafter, the results are turned back into a tidy format using broom so that they can be visualized with ggplot2.
5.2.1 Document-term matrix
To search for the topics which are prevalent in the individual addresses through LDA, we first need to transform the tidy tibble into a document-term matrix. This can be achieved with cast_dtm().
library(sotu)
library(tidytext)
library(SnowballC)
sotu_clean <- sotu_meta %>%
  mutate(text = sotu_text %>%
           str_replace_all("[,.]", " ")) %>%
  filter(between(year, 1900, 2000)) %>%
  unnest_tokens(output = token, input = text) %>%
  anti_join(get_stopwords(), by = c("token" = "word")) %>%
  filter(!str_detect(token, "[:digit:]")) %>%
  mutate(token = wordStem(token, language = "en"))
sotu_dtm <- sotu_clean %>%
  filter(str_length(token) > 1) %>%
  count(year, token) %>%
  group_by(token) %>%
  filter(n() < 95) %>% # remove tokens that appear in 95 or more documents (i.e., years)
  cast_dtm(document = year, term = token, value = n)
A DTM contains Documents (rows) and Terms (columns), and specifies how often a term appears in a document.
sotu_dtm %>% as.matrix() %>% .[1:5, 1:5]
## Terms
## Docs abandon abat abettor abey abid
## 1900 1 3 1 2 1
## 1901 2 0 0 0 4
## 1902 3 0 0 0 0
## 1903 3 1 0 0 0
## 1904 1 0 0 1 0
5.2.2 Inferring the number of topics
We need to tell the model in advance how many topics we assume to be present within the documents. Since we have not read all the SOTU addresses (if we had, we would probably not need a topic model in the first place), we cannot make an educated guess on how many topics are in there.
5.2.2.1 ldatuning
LDA offers a couple of parameters to tune, but the most crucial one probably is k, the number of topics. One approach might be to just provide it with wild guesses on how many topics might be in there and then try to make sense of the results afterward. ldatuning offers a more structured approach to finding the optimal k. It trains multiple models with varying values of k and compares them with regard to certain performance metrics.
library(ldatuning)
determine_k <- FindTopicsNumber(
  sotu_dtm,
  topics = seq(from = 2, to = 30, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 16L,
  verbose = TRUE
)
#determine_k %>% write_rds("lda_tuning.rds")
Then we can plot the results and determine which number of topics maximizes/minimizes the respective metrics:
FindTopicsNumber_plot(determine_k)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
We would go with 16 topics here, as this number seems to maximize the metrics that should be maximized and to minimize the other ones quite well.
Then we can learn a model with this parameter k using the LDA() function.
library(topicmodels)
library(broom)
sotu_lda_k16 <- LDA(sotu_dtm, k = 16, control = list(seed = 77))

sotu_lda_k16_tidied <- tidy(sotu_lda_k16)

#write_rds(sotu_lda_k16, "lda_16.rds")
The tidy() function from the broom package (Robinson 2020) brings the LDA output back into a tidy format. It consists of three columns: the topic, the term, and beta, which is the probability that the term stems from this topic.
sotu_lda_k16_tidied %>% glimpse()
## Rows: 171,952
## Columns: 3
## $ topic <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3, …
## $ term <chr> "abandon", "abandon", "abandon", "abandon", "abandon", "abandon"…
## $ beta <dbl> 7.884872e-11, 3.156698e-04, 3.691549e-04, 1.782611e-04, 2.385691…
Now, we can wrangle it a bit, and then visualize it with ggplot2.
top_terms_k16 <- sotu_lda_k16_tidied %>%
  group_by(topic) %>%
  slice_max(beta, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms_k16 %>%
  mutate(topic = factor(topic),
         term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = topic)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~topic, scales = "free", ncol = 4) +
  coord_flip()
Another thing to assess is the document-topic probabilities gamma: which document belongs to which topic. By doing so, you can choose the documents that have the highest probability of belonging to a topic and then read these specifically. This might give you a better understanding of what the different topics might imply.
sotu_lda_k16_document <- tidy(sotu_lda_k16, matrix = "gamma")
This shows you the proportion of words in the document which were drawn from the specific topics. In 1990, for instance, many words were drawn from the first topic.
sotu_lda_k16_document %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  mutate(gamma = round(gamma, 3))
## # A tibble: 99 × 3
## # Groups: document [99]
## document topic gamma
## <chr> <int> <dbl>
## 1 1900 13 1
## 2 1901 2 1
## 3 1902 2 0.997
## 4 1903 8 0.72
## 5 1904 8 0.727
## 6 1905 8 0.998
## 7 1906 8 0.973
## 8 1907 2 0.817
## 9 1908 8 0.613
## 10 1909 11 1
## # … with 89 more rows
An interesting pattern is that the topics show some time-dependency. This intuitively makes sense, as they might represent some sort of deeper underlying issue.
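To eyeball this time dependency, you can plot the topic shares against the year. A minimal sketch using the gamma tibble from above (the plot itself is an illustrative addition, not part of the workflow below):

sotu_lda_k16_document %>%
  mutate(year = as.integer(document)) %>% # document names are the years
  ggplot(aes(x = year, y = gamma)) +
  geom_line() +
  facet_wrap(~topic, ncol = 4)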
5.2.3 Sense-making
Now, the harder part begins: making sense of the different topics. In LDA, words can exist across topics, making them not perfectly distinguishable. Also, as the number of topics becomes greater, plotting them doesn’t make too much sense anymore.
topic_list <- sotu_lda_k16_tidied %>%
  group_by(topic) %>%
  group_split() %>%
  map_dfc(~.x %>%
            slice_max(beta, n = 20, with_ties = FALSE) %>%
            arrange(-beta) %>%
            select(term)) %>%
  set_names(str_c("topic", 1:16, sep = "_"))
## New names:
## • `term` -> `term...1`
## • `term` -> `term...2`
## • `term` -> `term...3`
## • `term` -> `term...4`
## • `term` -> `term...5`
## • `term` -> `term...6`
## • `term` -> `term...7`
## • `term` -> `term...8`
## • `term` -> `term...9`
## • `term` -> `term...10`
## • `term` -> `term...11`
## • `term` -> `term...12`
## • `term` -> `term...13`
## • `term` -> `term...14`
## • `term` -> `term...15`
## • `term` -> `term...16`
5.2.3.1 LDAvis
LDAvis is a handy tool we can use to inspect our model visually. Pre-processing the data is a bit tricky, though, therefore we define a quick function first.
library(LDAvis)
prep_lda_output <- function(dtm, lda_output){
  doc_length <- dtm %>%
    as.matrix() %>%
    as_tibble() %>%
    rowwise() %>%
    summarize(doc_sum = c_across() %>% sum()) %>%
    pull(doc_sum)
  phi <- posterior(lda_output)$terms %>% as.matrix()
  theta <- posterior(lda_output)$topics %>% as.matrix()
  vocab <- colnames(dtm)
  term_sums <- dtm %>%
    as.matrix() %>%
    as_tibble() %>%
    summarize(across(everything(), ~sum(.x))) %>%
    as.matrix()
  svd_tsne <- function(x) tsne::tsne(svd(x)$u)
  LDAvis::createJSON(phi = phi,
                     theta = theta,
                     vocab = vocab,
                     doc.length = doc_length,
                     term.frequency = term_sums[1, ],
                     mds.method = svd_tsne)
}
Thereafter, getting the data into the right format and running the app works as follows.
json_lda <- prep_lda_output(sotu_dtm, sotu_lda_k16)

serVis(json_lda, out.dir = 'vis', open.browser = TRUE)

servr::daemon_stop(1)
5.2.4 Structural Topic Models
Structural Topic Models offer a framework for incorporating metadata into topic models. In particular, you can have these metadata affect the topical prevalence, i.e., the frequency a certain topic is discussed can vary depending on some observed non-textual property of the document. On the other hand, the topical content, i.e., the terms that constitute topics, may vary depending on certain covariates.
Structural Topic Models are implemented in R via a dedicated package. The following overview provides information on the workflow and the functions that facilitate it.
In the following example, I will use the State of the Union addresses to run you through the process of training and evaluating an STM.
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
sotu_stm <- sotu_meta %>%
  mutate(text = sotu_text) %>%
  distinct(text, .keep_all = TRUE) %>%
  filter(between(year, 1900, 2000))
The package requires a particular data structure and includes several functions that help you preprocess your data. textProcessor() takes care of preprocessing the data. It takes as a first argument the text as a character vector as well as the tibble containing the metadata. Its output is a list containing a document list with word indices and counts, a vocabulary vector containing the words associated with these word indices, and a data.frame containing the associated metadata. prepDocuments() finally brings the resulting list into a shape that is appropriate for training an STM. It has certain threshold parameters which are geared towards further reducing the vocabulary: lower.thresh = n removes words that are not present in at least n documents, and upper.thresh = m removes words that are present in more than m documents. The ramifications of these parameter settings can be explored graphically using the plotRemoved() function.
processed <- textProcessor(sotu_stm$text, metadata = sotu_stm %>% select(-text))
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
plotRemoved(processed$documents, lower.thresh = seq(1, 50, by = 2))
prepped_docs <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 3, upper.thresh = 95)
## Removing 9447 of 14346 terms (27061 of 136130 tokens) due to frequency
## Your corpus now has 109 documents, 4899 terms and 109069 tokens.
Now that the data is properly preprocessed and prepared, we can estimate the actual model. As mentioned before, covariates can influence topical prevalence as well as their content. I assume topical prevalence to be influenced by the party of the speaker as well as the year the SOTU was held. The latter is assumed to influence the topical prevalence in a non-linear way (SOTU addresses usually deal with acute topics which do not gradually build over time) and is therefore estimated with a spline through the s() function that comes from the stm package. It defaults to a spline with 10 degrees of freedom. Moreover, I assume the content of topics to be influenced by party affiliation. Both prevalence = and content = take their arguments in formula notation.
As determined before, I assume the presence of K = 16 topics (stm also offers the searchK() function to tune this hyperparameter).
sotu_content_fit <- stm(documents = prepped_docs$documents,
                        vocab = prepped_docs$vocab,
                        K = 16,
                        prevalence = ~party + s(year),
                        content = ~party,
                        max.em.its = 75,
                        data = prepped_docs$meta,
                        init.type = "Spectral",
                        verbose = FALSE)
Let’s look at a summary of the topics and their prevalence. For this, we can use stminsights, a shiny app developed by Carsten Schwemmer.
library(stminsights)
out <- list(documents = prepped_docs$documents,
            vocab = prepped_docs$vocab,
            meta = prepped_docs$meta)

prepped_docs$meta$party <- as.factor(prepped_docs$meta$party)
prep <- estimateEffect(1:16 ~ party + s(year), sotu_content_fit, meta = prepped_docs$meta, uncertainty = "Global")
map(1:16, ~summary(prep, topics = .x))

save(prepped_docs, sotu_content_fit, prep, out, file = "stm_insights.RData")

run_stminsights()
5.3 Further readings
- Check out the SMLTAR book
- More on tidymodels
- Basic descriptions of ML models
- More on prediction with text using tidymodels
- A shiny introduction to STM by Thierry Warin
5.4 Exercises
5.4.1 Supervised Machine Learning
- Measuring polarization of language through a “fake prediction.” Train the same model that we trained on British MPs earlier on timelines_us <- read_csv("https://www.dropbox.com/s/dpu5m3xqz4u4nv7/tweets_house_rep_party.csv?dl=1"). First, split the new data into training and test set (prop = 0.3 should suffice; make sure that you set strata = party). Train the model using the same workflow but new training data that predicts partisanship based on the Tweets’ text. Predict on the test set and compare the models’ accuracy.
set.seed(1)
<- read_csv("https://www.dropbox.com/s/1lrv3i655u5d7ps/timelines_gb_2022.csv?dl=1")
timelines_gb <- read_csv("https://www.dropbox.com/s/iglayccyevgvume/timelines_us.csv?dl=1")
timelines_us
<- initial_split(timelines_gb, prop = 0.3, strata = party)
split_gb <- training(split_gb)
party_tweets_train_gb <- testing(split_gb)
party_tweets_test_gb
<- initial_split(timelines_us, prop = 0.3, strata = party)
split_us <- training(split_us)
party_tweets_train_us <- testing(split_us)
party_tweets_test_us
<- recipe(party ~ text, data = party_tweets_train_gb) %>%
twitter_recipe step_tokenize(text) %>% # tokenize text
::step_upsample(party) %>%
themisstep_tokenfilter(text, max_tokens = 1000) %>%
step_tfidf(text)
<- rand_forest(trees = 50) %>%
rf_spec set_engine("ranger") %>%
set_mode("classification")
<- workflow() %>%
twitter_party_rf_workflow add_recipe(twitter_recipe) %>%
add_model(rf_spec)
<- twitter_party_rf_workflow %>% fit(data = party_tweets_train_gb)
party_gb <- twitter_party_rf_workflow %>% fit(data = party_tweets_train_us)
party_us
<- augment(party_gb, party_tweets_test_gb)
predictions_gb mean(predictions_gb$party == predictions_gb$.pred_class)
<- augment(party_us, party_tweets_test_us)
predictions_us mean(predictions_us$party == predictions_us$.pred_class)
- Extract Tweets from U.S. timelines that are about abortion by using yesterday’s approach (keywords <- c("abortion", "prolife", " roe ", " wade ", "roevswade", "baby", "fetus", "womb", "prochoice", "leak"), timelines_us_abortion <- timelines_us %>% filter(str_detect(text, keywords %>% str_c(collapse = "|")))). Perform the same prediction task (but now with initial_split(prop = 0.8)). How does the accuracy change?
<- c("abortion", "prolife", " roe ", " wade ", "roevswade", "baby", "fetus", "womb", "prochoice", "leak")
keywords <- timelines_us %>% filter(str_detect(text, keywords %>% str_c(collapse = "|")))
timelines_us_abortion
<- initial_split(timelines_us_abortion, prop = 0.8, strata = party)
split_us_abortion <- training(split_us_abortion)
abortion_tweets_train_us <- testing(split_us_abortion)
abortion_tweets_test_us
<- twitter_party_rf_workflow %>% fit(data = abortion_tweets_train_us)
abortion_us
<- augment(abortion_us, abortion_tweets_test_us)
predictions_abortion_us mean(predictions_abortion_us$party == predictions_abortion_us$.pred_class)
5.4.2 Topic Models
- Check out LDAvis and how it orders the topics. Try to make some sense of how they are related etc.
json_lda <- prep_lda_output(sotu_dtm, sotu_lda_k16)

serVis(json_lda, out.dir = 'vis', open.browser = TRUE)

servr::daemon_stop(1)
- Do the same thing for stminsights. The trained stm model can be downloaded via sotu_stm <- read_rds("https://www.dropbox.com/s/65bukmm42byq0dy/sotu_stm_k16.rds?dl=1").
library(sotu)
library(tidytext)
library(SnowballC)
library(stminsights)
sotu_clean <- sotu_meta %>%
  mutate(text = sotu_text %>%
           str_replace_all("[,.]", " ")) %>%
  filter(between(year, 1900, 2000)) %>%
  unnest_tokens(output = token, input = text) %>%
  anti_join(get_stopwords(), by = c("token" = "word")) %>%
  filter(!str_detect(token, "[:digit:]")) %>%
  mutate(token = wordStem(token, language = "en"))

sotu_dtm <- sotu_clean %>%
  filter(str_length(token) > 1) %>%
  count(year, token) %>%
  #filter(between(year, 1900, 2000)) %>%
  group_by(token) %>%
  filter(n() < 95) %>%
  cast_dtm(document = year, term = token, value = n)

sotu_stm <- sotu_meta %>%
  mutate(text = sotu_text) %>%
  distinct(text, .keep_all = TRUE) %>%
  filter(between(year, 1900, 2000))

processed <- textProcessor(sotu_stm$text, metadata = sotu_stm %>% select(-text))
#, custompunctuation = "-")
#?textProcessor() # check out the different arguments
#?prepDocuments()

plotRemoved(processed$documents, lower.thresh = seq(1, 50, by = 2))
prepped_docs <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 3, upper.thresh = 80)

prepped_docs$meta$party <- as.factor(prepped_docs$meta$party)
prep <- estimateEffect(1:16 ~ party + s(year), sotu_content_fit, meta = prepped_docs$meta, uncertainty = "Global")
map(1:16, ~summary(prep, topics = .x))

save(prepped_docs, sotu_stm, prep, file = "stm_insights.RData")

run_stminsights()