Chapter 3 Week 17 - Exploring text with topic models

The purpose of this workshop is to guide you through some essential techniques for processing and exploring text data. These include stopword removal, contraction expansion, lemmatization, word frequency analysis, term frequency-inverse document frequency (TF-IDF) computation, document-term matrices, and topic modelling.

3.1 R commands for NLP

Below are a number of important R libraries, commands, and functions which will help you process and explore text, as well as fit topic models.

3.1.1 Required packages

Below are the required packages for this workshop. If you have not installed them already, install them first and then load them.

# Install required packages
install.packages("tidyverse")
install.packages("tidytext")
install.packages("textclean")
install.packages("textstem")
install.packages("tm")
install.packages("quanteda")
install.packages("psych")
install.packages("topicmodels")

# Load required packages
library(tidyverse)
library(tidytext)
library(textclean)
library(textstem)
library(tm)
library(quanteda)
library(psych)
library(topicmodels)

3.1.2 Case

It can be a good idea to set your text to lower case prior to any analysis. This makes text identification, conversions, and mappings more robust. Many functions include arguments which make them case insensitive, but some do not. It would be unfortunate if, for example, a function you applied correctly identified “let’s go!” but not “Let’s go!”. We can use the tolower() function to do this easily.

# Define a vector with three text strings
text = c("Text analysis in R is interesting.", 
         "We're learning NLP.", 
         "After this workshop, I'll know how to explore text data and fit topic models")

# Convert the texts to lowercase
text_lower <- tolower(text)

# Print the texts to confirm they are now all in lowercase
print(text_lower)

3.1.3 Contractions

Contractions should be expanded prior to further analysis (e.g., “we’re” → “we are”). We can use the replace_contraction() function to handle this. It contains a large number of ‘hard-coded’ mappings between contractions (e.g., “I’ll”) and expansions (“I will”). It effectively finds all contractions in its vocabulary, in either a single string or vector of strings, and replaces them with the corresponding expansion.

# Define a vector with three text strings
text = c("text analysis in r is interesting.", 
         "we're learning nlp.", 
         "after this workshop, i'll know how to explore text data and fit topic models")
 
# Expand contractions
text_expanded <- replace_contraction(text)

# Print the texts to confirm all contractions are now expanded
print(text_expanded)

3.1.4 Lemmatizing

Lemmatizing reduces words to their base forms (e.g., “running” → “run”). We can use the lemmatize_words() function to do this. It contains a large number of ‘hard-coded’ mappings between inflected word forms (e.g., “jumps”) and their lemmas (“jump”). It effectively finds all words in its vocabulary, in a vector of individual words, and replaces each with the corresponding lemma.

# Define a vector with three words
text <- c("running", "jumps", "doggies")

# Lemmatize words
text_lemmatized <- lemmatize_words(text)

# Confirm words are now lemmatized
print(text_lemmatized)

# Define another vector with four more words
text <- c("girlies", "goods", "better", "joyfully")

# Lemmatize words
text_lemmatized <- lemmatize_words(text)

# Confirm words are now lemmatized. Note how some conversions may not be as expected, so lemmatization must be applied with care
# "girlies" is not lemmatized to "girl" because it is not in the lemma vocabulary
# "goods" is lemmatized to "good", despite "goods" potentially denoting merchandise (and not the plural of good)
# "better" is lemmatized to "good"
# "joyfully" is not lemmatized to "joy" (despite being semantically related)
print(text_lemmatized)
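
Note that lemmatize_words() expects a vector of individual words. If you want to lemmatize the words inside whole sentences instead, the textstem package also provides lemmatize_strings(); a minimal sketch:

# Lemmatize the words inside full strings rather than a vector of single words
text <- c("we are running workshops", "she jumps higher than the others")
lemmatize_strings(text)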

3.1.5 Stopwords

Stopwords are common words (e.g., “the”, “is”) that are typically considered to be devoid of substantive meaning. We can remove them using an established stopword list provided by the tidytext package. You can call this list and examine it by running View(stop_words). We can use the anti_join() function to remove any and all stopwords in our data.

# Examine the stop_words list to understand which words are treated as meaningless and flagged for removal.
# Note it combines several lexicons (SMART, snowball, and onix) which are more, and less, comprehensive
View(stop_words)

# Define a dataframe with words
data <- data.frame(
  word = c("after", "this", "workshop", "i", "will", "know", "how", "to", "explore", "text", "data")
)

# Remove rows with stopwords by using the anti_join() function. 
# This function removes rows from the first dataframe which have a match in the second. 
# Note, with by = "word" the matching column must be named "word" in both dataframes
# (differently named columns can be matched with, e.g., by = c("token" = "word")).
data_nostopwords <- anti_join(data,
                              stop_words,
                              by = "word")

# Confirm the stop words have been removed
View(data_nostopwords)

It is important to note that some stopwords may be relevant to your analysis. As an example, psychological scientists have expressed an interest in first-person pronoun use as an indicator of self-focus, perhaps being more prevalent in those who are narcissistic (see e.g., Carey et al., 2015). “I” and “me” are typically included in lists of stopwords and so would be removed when applying them. This would preclude analyses of them.

Carey, A. L., Brucks, M. S., Küfner, A. C., Holtzman, N. S., Back, M. D., Donnellan, M. B., … & Mehl, M. R. (2015). Narcissism and the use of personal pronouns revisited. Journal of Personality and Social Psychology, 109(3), e1.

If you find yourself in the position of being interested in words which feature in stopword lists (you are encouraged to check stop_words), you have two options. The first is to skip the removal of stopwords entirely. The second is to remove those specific words from the stop_words list. This can be achieved as follows.

# Define a dataframe with stop words you wish to retain
meaningful_stopwords <- data.frame(
  word = c("i", "me", "myself", "mine")
  )

# Create a new set of stopwords from the established one (stop_words)
# but without the ones you wish to retain, again using the anti_join() function
my_stop_words <- 
  anti_join(stop_words, 
            meaningful_stopwords, 
            by = "word")

# Confirm your new stopword list does not contain the words you wish to retain
# This can be done by scrolling down to confirm that, for example, "i" is no longer in the list
View(my_stop_words)
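
A customised list like this can then be passed to anti_join() in place of stop_words. A minimal usage sketch, assuming the data dataframe of words from the earlier example is still in your workspace:

# Remove stopwords using the customised list, so that words such as "i" are retained
data_nostopwords <- anti_join(data,
                              my_stop_words,
                              by = "word")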

3.1.6 Unnesting tokens

An early step you will almost always need to take is to split your data into word tokens. This can be done via the unnest_tokens() function. It will identify all tokens, separated by white space or delimiters (e.g., “.”), in a vector of strings, pull them out, and place each one in its own row for further analyses. This step enables the many functions which operate on single tokens, as opposed to longer strings.

# Define a dataframe with three text strings
data <- data.frame(
  text = c("Text analysis in R is interesting.", 
           "We are learning NLP.",
           "After this workshop, I will know how to explore text data and fit topic models")
)

# Unnest the tokens from all the strings so that each word is represented 
data_tokens <- data %>% 
  unnest_tokens(input = text,  # Input column from which tokens are unnested
                output = word) # Output column to be created to store unnested tokens

# Examine the new dataframe where each word is saved on a single row
View(data_tokens)

3.1.7 Bringing it all together to examine word frequencies and TF-IDF

Let’s apply all of the previous operations to a dataset to examine relevant word frequencies and TF-IDF scores.

It is recommended that you explore your dataset by examining how often each unique word occurs. This can be done via functions provided by the tidytext package, supported by the tidyverse package. It first involves calling the count() function to derive the frequency with which each word occurs.

TF-IDF helps highlight important words. To remind you, TF-IDF stands for Term Frequency - Inverse Document Frequency. TF captures the frequency with which terms (or words) occur in each document. IDF captures how frequently terms (or words) occur throughout a corpus. Words which occur in many documents are given a low IDF score, whilst those which occur in very few are given a high IDF score. TF-IDF is the product of TF and IDF. The final score therefore captures how frequent a word is within a document and how unique it is across the corpus.

# Define a dataframe with five text strings, labelled as documents 1-5
data <- data.frame(
  document = c(1, 2, 3, 4, 5),
  text = c("Text analysis in R is interesting.", 
           "We're learning NLP and text analysis.", 
           "After this workshop, I'll know how to explore text data and fit topic models to text.",
           "This will help me get a sense of what is going in my own data.",
           "And get the best possible grade for my project.")
  )

# Convert the text to lower case
data$text <- tolower(data$text)

# Expand contractions
data$text <- replace_contraction(data$text)

# Unnest tokens
data_tokens <- data %>% 
  unnest_tokens(input = text,  # Input column from which tokens are unnested
                output = word) # Output column to be created to store unnested tokens

# Lemmatize words
data_tokens$word <- lemmatize_words(data_tokens$word) 

# Remove stop words
data_tokens_nostopwords <- 
  anti_join(data_tokens, 
            stop_words, 
            by = "word")

# Count the number of times each word occurs in each document
word_counts <- data_tokens_nostopwords %>% 
  count(document,
        word, 
        sort = TRUE)

# View the word counts
View(word_counts)

# Add the TF-IDF scores via the bind_tf_idf() function
tf_idf <- word_counts %>% 
  bind_tf_idf(word,
              document, 
              n)

# View the word counts and TF-IDF scores. 
# Notice how "text" has some of the lowest scores because it is the most common word across documents,
# whilst "grade" has one of the highest because it is the least common.
View(tf_idf)
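
To make the definition of TF-IDF above concrete, here is a minimal hand calculation for a single word in a single document. It assumes, as bind_tf_idf() does, that TF is a word’s count divided by the total count of words in that document (here, the counts in word_counts, after stopword removal) and that IDF is the natural log of the number of documents divided by the number of documents containing the word.

# Hand-calculate TF-IDF for the word "text" in document 1 using the counts in word_counts
n_word_in_doc    <- word_counts %>% filter(document == 1, word == "text") %>% pull(n)
n_words_in_doc   <- word_counts %>% filter(document == 1) %>% summarise(total = sum(n)) %>% pull(total)
n_docs           <- n_distinct(word_counts$document)
n_docs_with_word <- word_counts %>% filter(word == "text") %>% nrow()

tf  <- n_word_in_doc / n_words_in_doc   # term frequency of "text" within document 1
idf <- log(n_docs / n_docs_with_word)   # inverse document frequency (natural log)
tf * idf                                # should match the tf_idf value for "text" in document 1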

3.1.8 Document-Term Matrices (DTM)

We can further explore our data by examining how often each word in the corpus appears in each document (or text). This will allow us to see how words tend to co-occur across documents, and whether there are any systematic correlations between them indicative of latent themes or topics. The first step is to compute a document-term matrix. This can be achieved by applying the cast_dfm() function to a dataframe of word counts.

# Define a dataframe with five text strings, labelled as documents 1-5
data <- data.frame(
  document = c(1, 2, 3, 4, 5),
  text = c("Text analysis in R is interesting.", 
           "We're learning NLP and text analysis.", 
           "After this workshop, I'll know how to explore text data and fit topic models to text.",
           "This will help me get a sense of what is going in my own data.",
           "And get the best possible grade for my project.")
  )

# Unnest tokens
data_tokens <- data %>% 
  unnest_tokens(input = text,  # Input column from which tokens are unnested
                output = word) # Output column to be created to store unnested tokens

# Count the number of times each word occurs in each document
word_counts <- data_tokens %>% 
  count(document, 
        word, 
        sort = TRUE)

# Convert to a document-term matrix via the cast_dfm() function. 
# We then use the convert() function to make the output into a data.frame for easier processing
dtm <- word_counts %>%
  cast_dfm(document, 
           word, 
           n) %>%
  convert(to = "data.frame")

# Examine the final document-term matrix
View(dtm)

3.1.9 Topic modelling

Now we can apply a dimension reduction technique to the DTM to examine if there is systematic variance in the way words tend to co-occur, potentially revealing underlying themes. The fa.parallel() function provides scree plots to help interpret how substantial each component is. The principal() function conducts the principal component analysis on a DTM.

# Define an example DTM dataframe with eight numeric word-count variables and a single doc_id string variable
dtm <- data.frame(
  doc_id = c("1", "2", "3", "4", "5", "6", "7", "8", "9"),
  friendly = c(4, 2, 7, 3, 5, 3, 7, 4, 2),
  sociable = c(3, 2, 5, 1, 6, 2, 3, 4, 1),
  outgoing = c(4, 1, 7, 4, 5, 3, 8, 4, 2),
  talkative = c(3, 2, 7, 1, 5, 3, 7, 4, 2),
  diligent = c(1, 6, 7, 4, 2, 5, 5, 4, 9),
  hard_working = c(2, 6, 4, 4, 4, 5, 5, 4, 9),
  responsible = c(2, 4, 7, 6, 2, 6, 5, 4, 9),
  strict = c(1, 6, 7, 6, 4, 5, 5, 4, 6)
)

# Remove the id variable to prepare the DTM dataframe for further analyses. 
# The functions we will call in the next parts require a DTM that contains only the term frequency counts.
# They will not run if the dataframe contains id or document indicators.
dtm_4_pca <- dtm %>% 
  select(!doc_id) # Adding ! before doc_id translates to selecting all columns except for doc_id

# Get the scree plot to get a sense of how many components (themes) there could be in the DTM
# There is a clear sign of two components with large eigenvalues
fa.parallel(dtm_4_pca, 
            main="Scree Plot", 
            fa = "pc",
            sim = F)

# Conduct a principal component analysis whilst extracting two components
pca <- principal(dtm_4_pca,
                 nfactors = 2,
                 residuals = FALSE,
                 rotate = "varimax", 
                 method = "regression")

# Examine the component loadings to gain a sense of which words are most closely associated with each component.
# You can click on the column headings to sort them by size
pca$loadings %>%                          # get the component loadings
  unclass() %>%                           # drop the "loadings" class so it behaves as a plain matrix
  as.data.frame() %>%                     # convert to a dataframe
  rownames_to_column("word") %>%          # add in the word labels (the loading row names) to enable interpretation
  View()
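
If you prefer to inspect the strongest loadings in the console rather than sorting in the viewer, a short sketch along these lines will do it (assuming the varimax-rotated components are labelled RC1 and RC2, the default labels from principal()):

# Print the words ordered by the size of their absolute loading on the first rotated component (RC1)
pca$loadings %>% 
  unclass() %>% 
  as.data.frame() %>% 
  rownames_to_column("word") %>% 
  arrange(desc(abs(RC1)))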

3.1.10 Final note on topic modelling

There are many additional snags and nuances you will hit upon when fitting topic models to real data. Creating a DTM for large corpora of text can result in matrices that are many thousands of words wide. Many of these words may only be used in a single document. This can make estimating principal components difficult and computationally heavy. It is generally recommended that researchers remove words which do not occur in more than 10% of documents.

Here is a method for calculating the number of non-zero values in each column of a DTM, reflecting the number of documents in which each word is present. From this, we can select all words which occur in more than a certain number of documents (e.g., more than 10). The appropriate threshold will depend on the overall size of your corpus.

words_to_keep <- 
  apply(dtm_4_pca, 2, function(x) sum(x != 0)) %>%
  as.data.frame() %>%
  filter(. > 10) %>%  # change this number to match 10% of your corpus
  row.names()

dtm_4_pca <- dtm %>% 
  select(words_to_keep)
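
If you would rather derive the 10% cut-off from the size of the corpus itself, a sketch along the following lines (using colSums() instead of apply(), and assuming dtm still holds the full document-term matrix with its doc_id column) gives an equivalent result:

# Derive the threshold as 10% of the number of documents
threshold <- ceiling(0.10 * nrow(dtm))

# Drop the id column so only the term counts remain
counts_only <- dtm %>% 
  select(!doc_id)

# Count the number of documents in which each word appears and keep those above the threshold
doc_counts    <- colSums(counts_only != 0)
words_to_keep <- names(doc_counts[doc_counts > threshold])

dtm_4_pca <- counts_only %>% 
  select(all_of(words_to_keep))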

3.2 Test your knowledge

Now it’s time to put your knowledge to the test to see if you can fit a topic model to some real data. Your task is to examine a collection of 2246 news articles from an American news agency, mostly published around 1988. This dataset is provided by the topicmodels package and can be accessed via the code below.

# Load the dataset
data("AssociatedPress")

# Process it into a DTM
dtm <- 
  AssociatedPress %>%
  tidy() %>%
  cast_dfm(document, term, count) %>%
  convert(to = "data.frame")

Once you have the data loaded into a DTM, you are tasked to:

  • Keep only the words which occur in more than 200 documents
  • Apply a PCA to the DTM and interpret the first three components as a function of the words which load most highly on them

3.3 Solutions

# Remove the doc_id column
dtm_4_pca <- dtm %>% 
  select(!doc_id)

# Get all words which feature in more than 200 documents
words_to_keep <- 
  apply(dtm_4_pca, 2, function(x) sum(x != 0)) %>%
  as.data.frame() %>%
  filter(. > 200) %>%
  row.names()

# Keep only the words which occur in more than 200 documents, removing the rest from the DTM
dtm_4_pca <- dtm_4_pca %>% 
  select(words_to_keep)

# Check scree plot to get a sense of how many components there are; seems to be three, potentially five
fa.parallel(dtm_4_pca, 
            main="Scree Plot", 
            fa = "pc",
            sim = F)

# Extract three components via principal component analysis
pca <- principal(dtm_4_pca, 
                 nfactors = 3,  
                 residuals = FALSE, 
                 rotate = "varimax", 
                 method = "regression")

# Examine the word component loadings to interpret the components as themes.
# One seems to be about presidents and government
# One about money and economics
# One not super clear, up for debate
pca$loadings %>% 
  unclass() %>% 
  as.data.frame() %>% 
  rownames_to_column("word") %>% 
  View()