Chapter 2 Week 13 - Fundamentals of Natural Language Processing
The purpose of this workshop is to tackle some basic problems of reading and managing text data and, if you have time, to start identifying an appropriate dataset to support your research project.
2.1 Homework - Compiling Potential Sources of Data
To remind you, prior to this workshop session you were tasked with searching for potential sources and contributing them to a collective data bank.
You are expected to contribute at least three sources of data.
2.1.1 Collective Data Bank
The collective data bank can be accessed here: https://docs.google.com/spreadsheets/d/19_Tvwe-eWtIVInTZFouKvYNkt0aq-gcpMaBTyNfCCLk/edit?gid=0#gid=0
Please fill in each cell for your contributions. Note, you must confirm that the raw text is available for each of your contributions. You will often find that sources unfortunately do not share the raw texts, often due to ethical considerations. Your contributions must contain text data.
When adding your contributions, please also take a look at those which already exist in the document and only add your contribution if it does not replicate one which is already present.
2.1.2 Sources of Data
You are expected to conduct an independent search via Google and also to look for relevant scholarly articles which may share their data via public repositories. Below are some initial resources. Please note that these are only meant to be starting-points.
Harvard Dataverse Repository
The Harvard Dataverse Repository is a free resource open to all researchers from any discipline, both inside and outside of the Harvard community, where you can share, archive, cite, access, and explore research data.
The Harvard Dataverse Repository can be accessed here: https://dataverse.harvard.edu/
Papers with Code
The mission of Papers with Code is to create a free and open resource with Machine Learning papers, code, datasets, methods and evaluation tables.
The website can be accessed here: https://paperswithcode.com/
Kaggle
Kaggle is a data science competition platform and online community for data scientists and machine learning practitioners, operated by Google LLC. It hosts a huge repository of community-published datasets.
The website can be accessed here: https://www.kaggle.com/
Scientific articles
It is becoming increasingly common for scientific articles to share their data via public repositories (e.g., osf.io). You may be able to find interesting and relevant datasets by searching Google Scholar for articles on topics of interest which share their data.
2.2 Basic R commands
Below are a number of important R libraries, commands, and functions which you need to be familiar with to effectively process text data. Your first task is to familiarize yourself with these. You are encouraged to copy over the code and run them on your own machines.
2.2.1 File Types
This section provides an overview of common file types (.csv, .xlsx, .rds, .json, .txt) and explains how to open them using conventional Windows programs and R.
.csv (Comma-Separated Values)
Description: A plain text file where data is stored in a tabular format, with each row on a new line and columns separated by commas.
How to open in Windows:
- Microsoft Excel: Right-click the file → Open with Excel. Alternatively, open Excel and import the file through the “Data” tab.
- Notepad/Notepad++: Right-click → Open with Notepad or another text editor.
How to open in R:
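Reading a .csv file requires no additional packages; here is a minimal example using base R (substitute the path to your own file):
# Read .csv file (base R, no extra packages needed)
data <- read.csv("path/to/file.csv")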
.xlsx (Excel Spreadsheet)
Description: A Microsoft Excel file format that stores data in worksheets with support for formatting, formulas, and multiple sheets.
How to open in Windows:
- Microsoft Excel: Double-click the file to open in Excel.
- Google Sheets: Upload the file to Google Drive and open it in Google Sheets (web-based).
How to open in R:
# Install required package
install.packages("openxlsx")
# Load required package
library(openxlsx)
# Read .xlsx file
data <- read.xlsx("path/to/file.xlsx", sheet = 1) # Specify the sheet if needed
.rds (R Data File)
Description: A file format unique to R that stores R objects in a binary format for efficient storage and loading.
How to open in Windows:
- Not Directly Openable: You cannot open .rds files in common Windows programs. They are intended for use within R.
How to open in R:
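A minimal example using base R's readRDS() (substitute the path to your own file):
# Read .rds file (base R, no extra packages needed)
data <- readRDS("path/to/file.rds")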
.json (JavaScript Object Notation)
Description: A lightweight, human-readable format for structuring data, often used in web APIs and configuration files.
How to open in Windows:
- Notepad/Notepad++: Right-click → Open with Notepad or another text editor.
- Web Browser: Drag the file into a browser for a formatted view.
How to open in R:
# Install required package
install.packages("jsonlite")
# Load required package
library(jsonlite)
# Read .json file
data <- fromJSON("path/to/file.json")
.txt (Plain Text File)
Description: A basic text file containing unformatted text, often used for simple data storage or documentation.
How to open in Windows:
- Notepad/Notepad++: Double-click to open in Notepad or another text editor.
- Microsoft Word: Open in Word to add formatting or modify the content.
How to open in R:
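A minimal example using base R's readLines(), which returns a character vector with one element per line of the file (substitute the path to your own file):
# Read .txt file (base R); each line of the file becomes one element of the vector
text <- readLines("path/to/file.txt")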
2.2.2 Accessing nested data
You may find that data formats you encounter will not seamlessly convert into the sorts of R objects you need for further analyses. Often you will find that text data is read into nested lists or dataframes which require further formatting. This is common for .json files.
Let’s emulate such a dataset and see how it operates; and how we can access the nested values in it.
# Install required package
install.packages("tidyverse")
# Load required package
library(tidyverse)
# Create a dataframe which emulates the nested structure of a .json file
data <- tibble(name = "John Doe",
age = "30",
address = data.frame(street = "123 Main St",
city = "New York",
state = "NY"),
blog = data.frame(url = "https://johndoe.com",
description = "John's personal blog about tech and coding."),
contacts = list(data.frame(type = c("email", "phone"),
value = c("johndoe@example.com", "555-1234"))))
# You will notice that not all variables are fully accessible in this data frame - pay particular attention to the "contacts" variable
View(data)
# Accessing top-level data (name, age)
data$name # Output: "John Doe"
data$age # Output: "30" (note: age was entered as a string)
# Accessing nested data (address, blog)
data$address$street # Output: "123 Main St"
data$address$city # Output: "New York"
data$address$state # Output: "NY"
data$blog$url # Output: "https://johndoe.com"
data$blog$description # Output: "John's personal blog about tech and coding."
# Accessing further nested arrays (e.g., first contact email)
data$contacts[[1]]$value[1] # Output: "johndoe@example.com"
data$contacts[[1]]$value[2] # Output: "555-1234"
# Pulling out all relevant details and saving them in a new dataframe
dataframe <- data.frame(name = data$name,
age = data$age,
street_address = data$address$street,
city_address = data$address$city,
state_address = data$address$state,
blog_url = data$blog$url,
blog_description = data$blog$description,
email = data$contacts[[1]]$value[1],
phone = data$contacts[[1]]$value[2])
Explanation
- Reading JSON: The fromJSON() function from the jsonlite package reads a JSON file into an R object; in this example we emulated such an object directly with tibble() and called it data.
- Accessing Top-Level Data: You can access simple fields (like name and age) directly using the $ operator.
- Accessing Nested Data: For the nested address object, we access fields like data$address$street.
- For arrays (like contacts), we use [[ ]] indexing to access elements. For example, data$contacts[[1]] gives us the first contact, and we can then extract the type and value, as well as focus in on the first or second element by using [ ].
2.2.3 Cleaning data
Many types of text data have characters or strings in them which may be irrelevant or undesirable to the quantitative researcher. Corpora can have metadata and tags which you may wish to remove prior to your analyses (e.g., “Start:” to indicate where a speech begins). Similarly, you may wish to remove certain information from social media texts prior to your analyses, like hashtags or URLs. Many of these operations rely on regular expressions–a formal language for specifying text strings. If you are curious, here is a short video: https://www.youtube.com/watch?v=UbIQxT3bApU
Below are a number of common functions in R which can help you clean your data. Try running them to see how they work.
# Sample string
text <- "Start: Hello world! Check out this link: https://example.com #excited"
# Remove the first 6 characters
text_without_first_6 <- substr(text, 7, nchar(text))
print(text_without_first_6) # Output: " Hello world! Check out this link: https://example.com #excited"
# Remove hashtags
text_no_hashtags <- gsub("#\\w+", "", text_without_first_6)
print(text_no_hashtags) # Output: " Hello world! Check out this link: https://example.com "
# Remove URLs
text_no_hashtags_urls <- gsub("https?://\\S+", "", text_no_hashtags)
print(text_no_hashtags_urls) # Output: " Hello world! Check out this link: "
# Remove leading and trailing spaces
text_cleaned <- trimws(text_no_hashtags_urls)
print(text_cleaned) # Output: "Hello world! Check out this link:"
Explanation
- Removing the first 6 characters: The substr() function extracts part of a string. We pass it the text object to tell it which string to operate on, 7 as the position to start keeping characters from (thereby dropping the first 6), and nchar(text), the total number of characters in the string, to tell it to keep everything up to the final position.
- Removing hashtags: The gsub() function matches and replaces strings. We pass the regular expression #\w+ to match a # symbol followed by any word characters, "" to replace each match with nothing (effectively deleting it), and the text_without_first_6 object to tell it which string to operate on.
- Removing URLs: We use the gsub() function again, this time with the regular expression https?://\S+ to match URLs starting with http:// or https://. We again replace matches with nothing (""), this time operating on the text_no_hashtags object.
- Removing leading and trailing spaces: We use trimws() to remove leading and trailing spaces.
2.2.4 Concatenating strings
An additional step you may have to take in some cases is to concatenate strings of text into larger units. For example, texts are sometimes provided word by word, but you may wish to analyze them as full sentences.
Below are two examples using the paste() function to concatenate words.
# Individual words
word1 <- "Data"
word2 <- "Science"
word3 <- "is"
word4 <- "fun!"
# Concatenate words using paste() with a space separator
sentence <- paste(word1, word2, word3, word4)
# Print the sentence
print(sentence)
# Vector of words
words_vector <- c("Data", "Science", "is", "amazing!")
# Concatenate words with spaces using collapse
sentence <- paste(words_vector, collapse = " ")
# Print the sentence
print(sentence)
2.2.5 Counting words
In many cases it will be important for you to know how many words each of your texts contains. As it turns out, it is quite difficult to define exactly what constitutes a word. For example, is "?" a word? Is "never-ending" one word or two? If you are curious to know more, this short video may be illuminating: https://www.youtube.com/watch?v=em0ePorcp48
For the purposes of this workshop, we’ll define a word as a string of characters separated from another string by at least 1 space.
Knowing how many words each string contains has many practical benefits, the most obvious being to filter out data which contain little or no content of interest. For example, it is often reasonable to filter out strings which contain no words at all.
Here is an example which counts the number of words in a string, relying on the wordcount() function from the ngram library.
# Install the required packages
install.packages("tidyverse")
install.packages("ngram")
# Load required packages
library(tidyverse)
library(ngram)
# Define a text string
text <- "Data Science is fun!"
# Apply the wordcount function to the string
wordcount(text, sep = " ") # the sep = " " tells the function to identify strings as words on the basis of whether they are separated from other string by at least 1 space
# Define a dataframe with a vector of text strings and an empty vector for wordcounts
data <- data.frame(texts = c("Data Science is fun!", "Learn R Programming.", ":)!", "Analysis with R"),
wordcount = NA)
# Apply the wordcount function to each string in the dataframe and save the output to the corresponding vector
data$wordcount[1] <- wordcount(data$texts[1], sep = " ")
data$wordcount[2] <- wordcount(data$texts[2], sep = " ")
data$wordcount[3] <- wordcount(data$texts[3], sep = " ")
data$wordcount[4] <- wordcount(data$texts[4], sep = " ")
# Print the data
print(data)
# Filter out strings which have 2 or fewer words
data_filtered <- data %>%
filter(wordcount > 2)
# Print the filtered data
print(data_filtered)
Explanation
- After loading the required packages, we define a text string and apply wordcount() to it, then define a dataframe with four texts and an empty vector to hold their word counts.
- We apply the wordcount() function to each of the texts in data$texts to obtain the word counts, and assign them to the empty vector data$wordcount.
- We use the pipe operator %>% and the filter() function to retain only cases with more than two words, filter(wordcount > 2), which in this case removes the row containing ":)!", a string with no real words.
2.2.6 Identifying specific instances
For many types of analyses you will want to identify specific types of text data either because they are relevant or irrelevant to your analyses. The most basic way of doing this is via key-word searches.
The grepl() function is used to search for patterns or keywords within a string and returns TRUE or FALSE depending on whether the pattern is found.
# Example text data
data <- data.frame(texts = c("Data Science is fun!", "Learn R Programming.", "Python is great.", "R is powerful."))
# Print the data
print(data)
# Search for the word "python"
result <- grepl("python", data$texts, ignore.case = TRUE)
# Print the result
print(result)
# Select only cases with the word "python"
data_python <- data %>%
filter(grepl("python", texts, ignore.case = TRUE))
# Print the data only with cases with the word "python"
print(data_python)
Explanation
- The grepl("python", data$texts, ignore.case = TRUE) call checks each string in data$texts to see if it contains the string python.
- Note that we set ignore.case = TRUE in the grepl() function to allow python to match Python. Try setting ignore.case = FALSE and see how this changes the output.
- We can combine grepl() with filter() to retain only the rows whose texts contain python, removing those which do not.
2.2.7 Using loops
Now that you have learnt how to read and process individual files and examples, it's time to learn how to automate this process for large amounts of data. Loops are indispensable for this. They are used to repeat a block of code multiple times, which is helpful when you need to perform repetitive tasks, like iterating over many data files or instances of text. The most common type of loop in R is the for loop, which iterates through a sequence of specified values.
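Here is a minimal example which prints a message on each pass through the loop:
# A basic for loop which iterates over the sequence 1:5
for (i in 1:5) {
  print(paste("Iteration number:", i))
}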
Explanation
- The for loop will iterate over the sequence 1:5 (numbers 1 through 5).
- For each iteration, it will print “Iteration number:” followed by the current value of i.
Let’s see the value of looping over numerous texts to quickly clean and save them.
# Sample text data with 3 values
text_data <- c(" Hello World! ", " R is great! ", " CLEANING text DATA. ")
# Initialize an empty dataframe with 3 rows to store the cleaned text
cleaned_data <- data.frame(id = 1:3,
cleaned_text = NA)
# For loop from 1:3 to clean the text
for (i in 1:3) {
cleaned_data$cleaned_text[i] <- trimws(text_data[i]) # Remove extra spaces
}
# Print cleaned data
print(cleaned_data)
Explanation
- The for loop iterates over each text in text_data in the sequence 1:3 by substituting the value of i into text_data[i] (e.g., the first iteration pulls " Hello World! " by calling text_data[1]).
- For each iteration, it uses trimws() to remove leading and trailing spaces.
- Each cleaned text is saved in the initialized dataframe cleaned_data by placing it in the $cleaned_text vector in the cell defined by [i] (e.g., the third iteration assigns the cleaned text "CLEANING text DATA." to the third row via cleaned_data$cleaned_text[3]).
2.3 Test your knowledge
Oftentimes text data will be made available in a way that is not conducive to psychological analyses. Typically, we want our data to be in tabular format saved as vectors in R data frames, where each chunk of text we plan to analyze is saved in a single cell. Here are some challenges to test your ability to wrangle text data into this format.
2.3.1 Pick out relevant cases from a .csv file
Download this file: https://drive.google.com/drive/folders/1M8caZ3axqEloZDMPc_U2E3mIwx7hLYnq?usp=sharing
It contains Reddit post titles from r/politics. Pick out all the titles which have any of the following key words regardless of case (“MAGA”, “Trump”, “Republican”) and which are 5 words or longer.
2.3.2 Build a dataframe from multiple .txt files
Download these files: https://drive.google.com/drive/folders/1J11aj_iz60zvNd2kAcp8cbAl9NUNniVu?usp=sharing
They contain descriptions of petitions to https://www.change.org/. Your task is to extract the text data in them and save them to a new dataframe. Your new dataframe should have the following variables.
- id: a basic indicator ranging from 1-6
- text_file: the name of the text file
- text: the full text of the file
2.3.3 Extract nested fields from a .json file
Download this file: https://drive.google.com/drive/folders/1JIb35rVG2p_-NEHgeJN21EgDpZtzvmEO?usp=sharing
It contains Twitter posts from a politician. It has multiple nested variables. Your task is to extract them and save them to a new dataframe. Your new dataframe should have the following variables, populated with the relevant scores from the .json file.
- id: a basic indicator ranging from 1-12
- text: Tweet text
- referenced_tweets: id of the referenced Tweets
- likes: count of how many times the Tweet was liked
- retweets: count of how many times the Tweet was shared
Tip: you will need to loop through each row in data to do this.
2.3.4 Solutions
Pick out relevant cases from a .csv file
Here is a solution for extracting the relevant cases.
# Install the required packages
install.packages("tidyverse")
install.packages("ngram")
# Load required packages
library(tidyverse)
library(ngram)
# Read the data
data <- read.csv("reddit.csv")
# Define a for loop over every row of the data to compute the number of words in the "title" variable
for (i in 1:nrow(data)) {
data$title_wordcount[i] <- wordcount(data$title[i], sep = " ")
}
# Create a new dataset "data_maga" by filtering for titles which have more than 4 words and then which include the string "Maga"
data_maga <- data %>%
filter(title_wordcount > 4) %>%
filter(grepl("Maga", title, ignore.case = TRUE))
# Create a new dataset "data_trump" by filtering for titles which have more than 4 words and then which include the string "Trump"
data_trump <- data %>%
filter(title_wordcount > 4) %>%
filter(grepl("Trump", title, ignore.case = TRUE))
# Create a new dataset "data_republican" by filtering for titles which have more than 4 words and then which include the string "Republican"
data_republican <- data %>%
filter(title_wordcount > 4) %>%
filter(grepl("Republican", title, ignore.case = TRUE))
# Bind the three new datasets into a single object
data_filtered <-
data_maga %>%
rbind(data_trump) %>%
rbind(data_republican)
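An equivalent, more compact approach (a sketch using the same column names) combines the three keywords into a single regular expression; it also avoids duplicating titles that mention more than one keyword:
# Alternative: filter once with a single regular expression covering all three keywords
data_filtered_alt <- data %>%
filter(title_wordcount > 4) %>%
filter(grepl("maga|trump|republican", title, ignore.case = TRUE))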
Build a dataframe from multiple .txt files
Here is a solution for compiling a dataframe from the .txt files.
# Create a vector with the five .txt file names
files <- c("NICKI MINAJ DESERVES A STAR.txt",
"AMANDAS LAW Oklahoma needs a law for prison and county jail officers to wear body cams!.txt",
"Save Alberta Parks from Shutting-down and Privatization.txt",
"Serve The Served Fox Hole Community, will Provide a Community for Homeless Veterans.txt",
"Stop Sadiq Khan expanding the ULEZ to all the London borough 2023.txt")
# Initialize an empty dataframe with 5 rows to store the file names and texts
data_cleaned <- data.frame(id = 1:5,
text_file = NA,
text = NA)
# Define a for loop from 1-5 to open each .txt file and save its name and text to the empty dataframe
for (i in 1:5) {
file_name <- files[i] # get the ith files name
data_cleaned$text_file[i] <- file_name # save the ith file name to the empty dataframe in the $text_file variable in the ith cell
text_data <- read.table(file_name, header = FALSE, sep = "\\", comment.char = "") # read the ith file (whose name is assigned to file_name) and save it as the dataframe text_data; sep = "\\" is a separator that does not occur in the text, so each line is kept whole
data_cleaned$text[i] <- text_data$V1 # extract the text from the text_data in the $V1 vector and save it to the empty dataframe in the $text variable in the ith cell
}
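If read.table() struggles with a file (for example, one spanning several lines or containing stray quote characters), here is a sketch of an alternative using base R's readLines(), assuming each file should be collapsed into a single string:
# Alternative: read each file as raw lines and collapse them into a single string
for (i in 1:5) {
  data_cleaned$text_file[i] <- files[i] # save the ith file name
  raw_lines <- readLines(files[i], warn = FALSE) # read every line of the ith file
  data_cleaned$text[i] <- paste(raw_lines, collapse = " ") # collapse the lines into one text and save it
}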
Tip: It is often unfeasible to manually define all of the files you wish to iterate over, as you may have many thousands of them. A solution to this can be found in the Sys.glob() command, which will define a new vector with the names of all files of a certain sort in your currently-active folder. For example, to define a vector with all .txt files in your working folder you can specify: files <- Sys.glob("*.txt")
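As a minimal sketch, assuming your .txt files sit in the current working directory:
# List all .txt files in the current working directory
files <- Sys.glob("*.txt")
# Loop over however many files were found
for (i in 1:length(files)) {
  print(files[i]) # here you would read and process the ith file
}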
Extract nested fields from a .json file
Here is a solution for extracting the required fields from the json file.
# Install required package
install.packages("jsonlite")
# Load required package
library(jsonlite)
# Read the data
data <- fromJSON("RepRandal.json")
# Initialize an empty dataframe with as many rows as the json file contains, defined by the nrow() function, and vectors to save the text, referenced_tweets, likes, and retweets
data_cleaned <- data.frame(id = 1:nrow(data),
text = NA,
referenced_tweets = NA,
likes = NA,
retweets = NA)
for (i in 1:nrow(data)) {
data_cleaned$text[i] <- data$text[i]
data_cleaned$referenced_tweets[i] <- data$referenced_tweets[[i]]$id
data_cleaned$likes[i] <- data$public_metrics$like_count[i]
data_cleaned$retweets[i] <- data$public_metrics$retweet_count[i]
}
Tip: When looping through dataframes you may hit upon cells which your code cannot deal with (e.g., those which are NA). This will typically cause your code to stop looping and fail to complete the task. One solution to this is to place operations within the try() function. This specifies that R should attempt to conduct the operation on the specified data for the given iteration and, if it fails, continue to the next iteration.
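As a rough sketch, using the fields from the solution above, wrapping the assignment in try() lets the loop continue past rows it cannot handle:
for (i in 1:nrow(data)) {
  # If this assignment fails (e.g., because referenced_tweets is missing or NA for this row),
  # try() reports the error but allows the loop to continue with the next iteration
  try(data_cleaned$referenced_tweets[i] <- data$referenced_tweets[[i]]$id)
}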
2.4 Pick Your Data
Now that you have the knowledge and tools to deal with different sorts of text data, you should identify a dataset which lends itself to your research question.
Ask yourself:
- Is the sample from an appropriate source?
- Are the authors of the text relevant to your investigation?
- If the texts are about a specific topic, can your interest be investigated within that topic?
- Are there any additional variables in the dataset which can inform your interests?
The answer may very well be no to all of these; don’t force it.
Please note, you are not required to use a resource from the Collective Data Bank. If none of them spark your interest or afford the sorts of analyses you wish to do, you are free to continue looking for datasets in your free time.
2.5 Consider Ethical Issues
The lectures in week 11 were intended to convey the importance of considering the potential risks arising from analyzing pre-existing text data. Before moving forward with your research, you must consider the following questions in relation to your dataset of choice:
- Are there potential impacts on well-being, social or legal standing?
- Are you analyzing potentially sensitive or stigmatizing topics?
- Are there risks related to consent?
- Is the data in the public domain?
- Are there risks to privacy?
- Is there any personal data (e.g., names, user profiles, IP addresses)?
- Is there any way that someone could trace anyone’s identity from your analyses?
- Do you foresee any other risks?
- What steps can you take to address the potential risks you have identified?