Chapter 4 Week 18 - Dictionary approaches to text classification

The purpose of this workshop is to introduce tools for text classification using predefined dictionaries. You will have the chance to apply dictionary methods with both LIWC and R. Step-by-step guides are provided below.

LIWC is highly user friendly and contains a number of validated and informative dictionaries. However, it is not free to use and, as such, you may not have access to it in the future. For this reason, I think it is valuable to also introduce you to R packages which can apply dictionary methods to texts. These packages mirror LIWC in many ways, but differ in one important respect: they do not include the dictionaries which come with LIWC.

It is unfortunately not permissible to directly import LIWC dictionaries into R, as these are proprietary to the software. There are, however, many other dictionaries out there which are not part of the LIWC ecosystem. There is a decent chance of finding a dictionary which captures a construct of interest without having to rely on LIWC.

It is possible to complete this workshop by using LIWC or R. You are encouraged to explore both approaches.

4.1 LIWC

LIWC can be installed via the following website: https://www.liwc.app/download.

Note, you will need a license key to download and use LIWC. These will be made available in the workshop. If you require access to LIWC outside the workshops, you can inquire about borrowing a laptop from the Psychology Department with LIWC installed and licensed. Alternatively, we may be able to provide you with a license for your personal machine. Please email: s.leach1@lancaster.ac.uk.

A basic tutorial video on how to use LIWC can be found here: https://youtu.be/IGBI8LnYGNs?si=BpvO6e5DXC7UTxpd

And a second video worth watching can be found here: https://youtu.be/iEy4rf0vwUw?si=hm4FxHo8xipcpDjV

4.2 R

Below are a number of important R libraries, commands, and functions which will enable you to apply dictionary methods for text classification.

4.2.1 Required packages

First, install and load the necessary packages for working with dictionaries: quanteda and quanteda.dictionaries. quanteda.dictionaries is not currently available on CRAN and so must be installed from a GitHub repository. This requires the devtools package, which provides the install_github() function.

# Install devtools and quanteda (if not already installed)
install.packages("quanteda")
install.packages("devtools")

# Load devtools and quanteda
library(devtools)
library(quanteda)

# Install quanteda.dictionaries package from github
devtools::install_github("kbenoit/quanteda.dictionaries") 

# Load quanteda.dictionaries
library(quanteda.dictionaries)
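
The install.packages() calls above run every time the script is executed, even when the packages are already present. As a minimal sketch, a small helper can check what is missing first. The helper name find_missing_packages() is hypothetical (it is not part of any package), and the interactive() guard is an assumption added here so that nothing is installed when the script runs unattended.

```r
# Hypothetical helper: return the names of packages that are not yet installed
find_missing_packages <- function(package_names) {
  is_installed <- vapply(package_names,
                         requireNamespace,
                         logical(1),
                         quietly = TRUE)
  package_names[!is_installed]
}

# Check the two CRAN packages used in this workshop
to_install <- find_missing_packages(c("quanteda", "devtools"))

# Only install when working interactively and something is actually missing
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

The same check could be extended to quanteda.dictionaries, with devtools::install_github() in place of install.packages().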

4.2.2 Loading dictionaries

Now that we have the required packages installed, it’s time to load some predefined dictionaries. These are typically saved as .dic text files. Each file begins with a key denoting the dimension captured by the dictionary (delimited by %%). Each dictionary word is then listed on its own line.

A number of non-LIWC dictionary files can be accessed here: https://drive.google.com/drive/folders/1oDkTuzzgcsnt87JpblL8vtidmFo11PMl?usp=sharing

These include .dic files with dictionaries capturing:

first-person singular pronouns includes words denoting the self (e.g., me, my)
first-person plural pronouns includes words denoting collectives one is a part of (e.g., we, us)
prosocial includes words capturing cooperation and helping (e.g., charity, donation)
care includes words capturing the moral foundation of care (e.g., help, harm)
fairness includes words capturing the moral foundation of fairness (e.g., equity, reciprocity)
threat includes words capturing psychological danger and threat (e.g., afraid, risk)
moral-emotional includes words denoting emotional and moral terms (e.g., abandon, kill)
communion includes words capturing interpersonal warmth (e.g., love, friend)
agency includes words capturing goal-striving (e.g., want, achieve)

To read these into R, you can run the read_dict_liwc() function as follows (make sure the .dic files are in your working folder). This is an internal quanteda function, which is why it is accessed with the ::: operator.

# Load the communion.dic file into R
communion_dictionary <- quanteda:::read_dict_liwc("communion.dic")

# Load the threat.dic file into R
threat_dictionary <- quanteda:::read_dict_liwc("threat.dic")

# Load the moral-emotional.dic file into R
moral_emotional_dictionary <- quanteda:::read_dict_liwc("moral-emotional.dic")
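
Because ::: reaches into quanteda's unexported functions, the calls above could break if the package internals change. A possible alternative, sketched below, is the exported dictionary() function with format = "LIWC". The file toy.dic and its contents are made up for illustration, and the layout written here (a category legend enclosed by % lines, followed by one tab-separated entry per line) reflects my understanding of the LIWC-style format rather than a copy of the workshop files.

```r
library(quanteda)

# Write a tiny, made-up LIWC-style dictionary file to a temporary folder
toy_path <- file.path(tempdir(), "toy.dic")
writeLines(c("%",
             "1\tToy",       # category legend: id, then category name
             "%",
             "friend*\t1",   # entries: word, then the id of its category
             "love\t1"),
           toy_path)

# Read the file with the exported dictionary() function
toy_dictionary <- dictionary(file = toy_path, format = "LIWC")

# Inspect the result
toy_dictionary
```

An object read in this way is already of the dictionary class.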

4.2.3 Exploring dictionaries

It is typically a good idea to get a sense of the words in your dictionary. You can do this by simply opening the .dic files in Notepad. You will also want to confirm that they have been correctly read into R.

You can examine your dictionaries in R via the following commands:

# Examine the communion_dictionary
communion_dictionary$Communion

# Examine the threat dictionary
threat_dictionary$Threat

# Examine the moral-emotional dictionary
moral_emotional_dictionary$Moral_Emotional

Notice how some of the words in the moral-emotional dictionary are followed by *. This operator is used as a wildcard character to match any words that start with a given prefix. This allows for flexible pattern matching in dictionaries, enabling you to capture related terms that share the same root or beginning part of the word.

When using * in a dictionary file, it will match any word that begins with the specified prefix. For example, if you have an entry like abandon* in a dictionary, it will match words such as:

abandon
abandoned
abandonment
abandons
abandoning
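
The prefix matching described above can be imitated in base R. The sketch below uses utils::glob2rx() to turn the wildcard entry into a regular expression. It is only an illustration of the matching rule, not the code quanteda itself runs, and the candidate words are made up.

```r
# Made-up candidate words; "band" and "reabandon" should not match
candidate_words <- c("abandon", "abandoned", "abandonment", "band", "reabandon")

# Convert the wildcard entry into a regular expression anchored at the word start
prefix_regex <- utils::glob2rx("abandon*")

# Flag the words that begin with the prefix
matches <- grepl(prefix_regex, candidate_words)

# Keep only the matching words
candidate_words[matches]
```

Only the words that start with abandon are returned, which is why an entry such as abandon* captures a whole family of related terms.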

4.2.4 Binding dictionaries

The individual word lists need to be explicitly converted into an object of the dictionary class. It is also convenient to bind them together into a single object, so that scores for every dictionary can be derived in one step.

This can be achieved in the following manner:

# Define a dictionary object with the three word lists, communion, threat, and moral-emotional
dictionaries <- 
  quanteda::dictionary(
    list(
      communion = communion_dictionary$Communion,
      threat = threat_dictionary$Threat,
      moral_emotional = moral_emotional_dictionary$Moral_Emotional
    )
  )

# View the dictionaries
View(dictionaries)
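
The same dictionary() call also accepts word lists typed directly into R, which can be handy for quick experiments. The sketch below is a minimal example under that assumption. The object toy_dictionaries and its entries are made up and are not the contents of the workshop .dic files.

```r
library(quanteda)

# Define a dictionary object from two short, made-up word lists
toy_dictionaries <- dictionary(
  list(
    communion = c("love", "friend*"),
    threat = c("afraid", "risk*")
  )
)

# List the dimensions in the dictionary object
names(toy_dictionaries)

# Count the number of entries in each dimension
lengths(as.list(toy_dictionaries))
```

Checking the names and entry counts is a quick way to confirm that every word list was bound under the label you intended.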

4.2.5 Applying dictionaries

We are now ready to apply our dictionaries to a set of texts, classifying them by the appearance of keywords. This can be done by calling the liwcalike() function, named in acknowledgment of the LIWC software which inspired the package.

This can be achieved in the following manner:

# Define an example vector of four texts
texts <- c("The company had terrible financial performance.",
           "We felt entirely abandoned by upper management.",
           "It was really a horrible situation with no compassion.",
           "I don't know what I'm going to do, my whole livelihood is under attack.")

# Apply the liwcalike() function to calculate dictionary scores
dictionary_scores <- liwcalike(texts, 
                               dictionaries)

4.2.6 Examining dictionary scores

Accessing your newly created dataframe dictionary_scores reveals the scores provided by liwcalike(). These are modeled very closely on LIWC’s outputs.

# View the outputs of the dictionary analysis
View(dictionary_scores)

You will immediately notice that several new scores (columns) have been added for each text (rows). Many of these provide general information about the texts, for example:

WC indicates the total number of words in each text
Dic indicates the percent of words in each text which are in any of the applied dictionaries
AllPunc indicates the percent of the text which is made up of punctuation
Comma, Punc, etc. indicate the percent of the text which is made up of each type of punctuation

You will also notice columns corresponding to the applied dictionaries - in this example communion, threat, and moral_emotional. Scores in these columns indicate the percent of words in each text which are present in each dictionary.

# Access the communion scores
dictionary_scores$communion

# Access the threat scores
dictionary_scores$threat

# Access the moral-emotional scores
dictionary_scores$moral_emotional
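
To make the meaning of these percentages concrete, the sketch below works one score out by hand in base R. It assumes a made-up two-entry word list (toy_word_list) and a deliberately simple way of splitting the text into words. liwcalike() tokenizes texts in its own way, so its figures may differ slightly from this hand calculation.

```r
# One of the example texts, with a made-up two-entry word list
text <- "It was really a horrible situation with no compassion."
toy_word_list <- c("horribl*", "compassion*")

# Lowercase the text, strip punctuation, and split on whitespace
words <- unlist(strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+"))

# Translate each wildcard entry into a regular expression
regexes <- utils::glob2rx(toy_word_list)

# Flag each word that matches at least one entry
is_match <- vapply(words,
                   function(word) any(vapply(regexes, grepl, logical(1), x = word)),
                   logical(1))

# Percent of words found in the word list
score <- 100 * sum(is_match) / length(words)
score
```

Here two of the nine words (horrible and compassion) match, so the score is 2 / 9 × 100, or roughly 22.2 percent.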

4.2.7 Tidying up the data

As you may have noticed, the liwcalike() function creates an entirely new dataframe when calculating word frequencies. Oftentimes you will apply this function to a dataset which includes your texts and many other indices, such as who produced them and when. If you wish to analyze dictionary scores as a function of such indices, you will need to incorporate the dictionary scores provided by liwcalike() back into your original dataset.

This can be done as follows:

# Define an example dataframe with four texts from two groups recorded on certain dates
text_data <- data.frame(author = c("employee 1", "employee 2", "employee 3", "employee 4"),
                        group = c("HR", "HR", "Sales", "Sales"),
                        date = c("01/04/2022", "02/04/2022", "03/04/2022", "04/04/2022"),
                        text = c("The company had terrible financial performance.",
                                 "We felt entirely abandoned by upper management.",
                                 "It was really a horrible situation with no compassion.",
                                 "I don't know what I'm going to do, my whole livelihood is under attack."))


# Apply the liwcalike() function to calculate dictionary scores
dictionary_scores <- liwcalike(text_data$text, 
                               dictionaries)

# Attach the communion, threat, and moral_emotional dictionary scores to the original dataset
text_data$communion <- dictionary_scores$communion
text_data$threat <- dictionary_scores$threat
text_data$moral_emotional <- dictionary_scores$moral_emotional

# View the original dataset, now with dictionary scores
View(text_data)
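
When there are many dictionaries, attaching columns one at a time becomes tedious. The sketch below shows one possible shortcut using cbind() in base R. The toy_scores dataframe is a stand-in for the output of liwcalike() with made-up values, and the approach relies on the scores being in the same row order as the original texts.

```r
# A small example dataset
toy_text_data <- data.frame(author = c("employee 1", "employee 2"),
                            group = c("HR", "Sales"))

# Stand-in for liwcalike() output; the values are made up for illustration
toy_scores <- data.frame(docname = c("text1", "text2"),
                         communion = c(0, 12.5),
                         threat = c(14.3, 0))

# Confirm there is exactly one row of scores per text before combining
stopifnot(nrow(toy_scores) == nrow(toy_text_data))

# Attach all chosen score columns in a single step
score_columns <- c("communion", "threat")
toy_text_data <- cbind(toy_text_data, toy_scores[, score_columns])

# Inspect the combined dataset
toy_text_data
```

The row-count check guards against silently attaching scores to the wrong texts if rows were dropped along the way.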

4.3 Test your knowledge

Below are two real datasets. You are challenged to extract psychological insights from them by applying the dictionary methods outlined above. Dictionary scores can be extracted either via R or LIWC. Exploration of the data will need to be done in R.

4.3.1 State of the Union Addresses

Here is a dataset of 135 State of the Union addresses: https://drive.google.com/drive/folders/1jSDu1sr_p0feeAlQs83msn3NaV1AK8v-?usp=sharing

The State of the Union Address is an annual speech delivered by the President of the United States to a joint session of Congress. It provides an overview of the current condition of the country, highlights the administration’s accomplishments over the past year, and outlines the president’s legislative agenda and priorities for the year ahead.

You are tasked with analyzing how these communications have changed over the last 100 years or so. Specifically, you should examine whether American presidents have increasingly used first-person plural pronouns (e.g., we, us) over time, as might indicate a greater focus on common and shared group identity (e.g., Teten, 2007).

4.4 Solutions

4.4.1 State of the Union Addresses

The code below documents how to read in the data, define and apply the first-person plural pronoun dictionary, and then plot the scores over time. Looking at the plot, we can see a general increase in the use of first-person plural pronouns, illustrating a potentially important shift in presidential rhetoric.

# Read the data
text_data <- read.csv("state_of_the_union_addresses.csv")

# Load the first-person plural pronoun word list and define a dictionary object
ffpp_dictionary <- quanteda:::read_dict_liwc("first-person_plural_pronouns.dic")

dictionaries <- 
  quanteda::dictionary(
    list(
      ffpp = ffpp_dictionary$First_Person_Plural_Pronouns
    )
  )

# Apply the liwcalike() function to calculate dictionary scores
dictionary_scores <- liwcalike(text_data$text, 
                               dictionaries)

# Attach the first person plural pronoun scores to the original dataset
text_data$ffpp <- dictionary_scores$ffpp

# Load ggplot2 for plotting
library(ggplot2)

# Plot changes in first-person plural pronoun use over time
ggplot(text_data,
       aes(x = year,
           y = ffpp)) +
  geom_line()
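
A plot gives a visual impression of the trend. One possible way to put a number on it is a simple linear regression of the scores on year, sketched below. The values in toy_data are made up purely to show the mechanics and are not taken from the State of the Union dataset. With the real data, text_data would take the place of toy_data.

```r
# Made-up illustrative values; NOT taken from the State of the Union dataset
toy_data <- data.frame(year = c(1920, 1940, 1960, 1980, 2000, 2020),
                       ffpp = c(1.0, 1.5, 2.5, 3.0, 4.0, 4.5))

# Regress pronoun scores on year
trend_model <- lm(ffpp ~ year, data = toy_data)

# The slope estimates the average change in score per year
year_slope <- coef(trend_model)["year"]
year_slope

# Full model summary, including the test of the slope
summary(trend_model)
```

A positive slope would be consistent with an increase in first-person plural pronoun use over time.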

4.4.2 Politicians’ social media posts

The code below documents how to read in the data, define and apply the moral-emotional dictionary, and then plot the relationship between moral-emotional language and reposts (log). Looking at the plot, we can see that reposts increase with the proportion of moral-emotional words in posts, corroborating the idea that moralistic and emotional content tends to get more attention and go viral online.

# Read the data
text_data <- read.csv("politicians_tweets.csv")

# Load the moral-emotional word list and define a dictionary object
moral_emotional_dictionary <- quanteda:::read_dict_liwc("moral-emotional.dic")

dictionaries <- 
  quanteda::dictionary(
    list(
      moral_emotional = moral_emotional_dictionary$Moral_Emotional
    )
  )

# Apply the liwcalike() function to calculate dictionary scores
dictionary_scores <- liwcalike(text_data$text, 
                               dictionaries)

# Attach the moral-emotional scores to the original dataset
text_data$moral_emotional <- dictionary_scores$moral_emotional

# Load ggplot2 for plotting
library(ggplot2)

# Examine the relationship between moral-emotional scores and reposts (log)
ggplot(text_data,
       aes(x = moral_emotional,
           y = reposts_log)) +
  geom_smooth(method = "lm")