Chapter 20 Supervised Machine Learning
Today, we’ll be talking about supervised machine learning for text classification. A few weeks ago, we talked about logistic regression, one model that is common in supervised machine learning tasks. Logistic regression is a classification method, meaning that it is used when the outcome variable is categorical (specifically, binary). This week, we’ll be going over some more supervised machine learning (SML) classification algorithms. However, rather than using vote data, we’re going to complicate our process slightly by focusing on text.
As we have discussed in previous classes, text data is extra tricky because it contains so much information. As a result, the data structure is much more complex: if we treat each word as a feature, our matrix (our “document-term matrix” or “document-feature matrix”) becomes very large and very sparse.
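As a quick illustration (using a made-up three-document mini-corpus, not our Twitter data), even a few short texts produce a mostly-zero matrix once every unique word becomes a column:
library(tm)
mini_corpus <- c("count every vote", "the court ruled on the case", "the taxes were not released")
mini_dtm <- DocumentTermMatrix(Corpus(VectorSource(mini_corpus)))
inspect(mini_dtm) # prints the dimensions, the sparsity, and the mostly-zero counts
With real corpora, the number of columns quickly climbs into the tens of thousands, which is why text classification needs special care.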
However, many of the things we want to analyze in mass communication exist in text or language: whether someone is emotionally happy or sad, whether people are using uncivil discourse, whether the sentiment of a message is positive or negative. In these instances, supervised machine learning can be very useful for applying one coding strategy across millions of messages.
For this tutorial, we will be learning about supervised machine learning models that are common in text classification tasks: k-Nearest Neighbors (kNN), Support Vector Machines (SVM), decision trees, and random forests (which are ensembles of decision trees).
A warning: This tutorial uses 130 labeled data points. As we have discussed in class, this is an extremely small labeled set. A typical dataset with binary labels should have between 5,000 and 10,000 labels. However, for the purposes of illustrating the process, our small-n dataset will do.
A second warning: More than any other topic we have discussed in this class, supervised machine learning is far and away the most complex, and the one that requires the most additional learning. Two tutorials cannot teach supervised machine learning, and each of the algorithms I introduce in this tutorial is really worth its own full class. Keep in mind that data scientists and engineers take many, many courses in supervised machine learning; we will only be able to cover a fraction of that knowledge in this tutorial.
We’ll begin by installing some new packages and loading our data.
20.1 Setting Up
This week, we will learn SML using caret, one of the most popular packages in R. caret is short for “Classification And REgression Training,” and it provides a uniform interface for hundreds of supervised machine learning algorithms. Because of this, caret has become a one-stop shop for R data scientists.
The main way caret does this is by tapping into a variety of other packages that contain more specific supervised machine learning algorithms and then standardizing each algorithm’s implementation. For this reason, it is often necessary to install other packages alongside caret (i.e., the packages that actually contain the algorithm). In this tutorial, we will use 3 new packages: LiblineaR, rpart, and ranger. Notice that in this tutorial, I provide lines for installing these packages, but I do not load them in as libraries. While it is possible to do that, it is not necessary: caret will call the library when appropriate.
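To see what this standardization looks like in practice, here is a minimal sketch (it assumes only that caret is installed): caret keeps a registry of every algorithm it can call through train(), and modelLookup() shows the tuning parameters it exposes for a given method.
library(caret)
length(names(getModelInfo())) # how many algorithms caret can call through train()
modelLookup("rpart")          # tuning parameters caret exposes for decision trees
modelLookup("ranger")         # tuning parameters caret exposes for random forests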
options(scipen=999)
set.seed(381)
#install.packages("caret")
#install.packages("LiblineaR") #will be used for svm
#install.packages("rpart") #will be used for decision trees
#install.packages("ranger") #will be used for random forest
library(tidyverse)
library(tokenizers)
library(caret) #but we do load the caret package!
library(tidytext)
library(tm)
Next, we will load the data in. In this tutorial, we have 2 files: tweets_ballotharvesting_v_trumptaxes_v_scotus.csv, which contains the original data, and tweets_ballotharvesting_v_trumptaxes_v_scotus_labels2.csv, which contains the 130 labels. If you are producing labels (from a content analysis, for example), you should have your data structured similarly: one data frame of the raw data and one data frame of the labeled data.
In addition to loading in this data, we will use select() to focus on the specific variables we are interested in. Here, we select the id column, the text of the tweet (used for the about_ballot_harvesting label), and the profile description (used for the conservative label). For our labeled dataset, we obviously also want to include the columns containing the labels, so we will include about_ballot_harvesting and conservative. Make sure th
<- read_csv("data/tweets_ballotharvesting_v_trumptaxes_v_scotus.csv") %>%
tweet_data select(`...1`, text, description)
colnames(tweet_data)[1] <- "id"
<- read_csv("data/tweets_ballotharvesting_v_trumptaxes_v_scotus_labels2.csv") %>%
tweet_labeled_data select(id, text, description, about_ballot_harvesting, conservative)
20.2 Data Cleaning
Let’s move on to the data cleaning!
Importantly, you want to make sure your labels are treated as factors. If they are left as numeric, R will treat them as numbers rather than as categories.
tweet_labeled_data$about_ballot_harvesting <- as.factor(tweet_labeled_data$about_ballot_harvesting)
tweet_labeled_data$conservative <- as.factor(tweet_labeled_data$conservative)
Because R can’t tell what is contained in a URL, it’s often best to exclude URLs from your text data. We do this in the text columns of both datasets (the full dataset and the labeled dataset) using a regular expression.
tweet_labeled_data <- tweet_labeled_data %>%
  dplyr::mutate(text = stringr::str_replace_all(text, " ?(f|ht)tp(s?)://(.*)[.][a-z]+", ""))

tweet_data <- tweet_data %>%
  dplyr::mutate(text = stringr::str_replace_all(text, " ?(f|ht)tp(s?)://(.*)[.][a-z]+", ""))
Next, we want to make sure we exclude any rows with NA.
tweet_data <- na.exclude(tweet_data)

tweet_labeled_data <- na.exclude(tweet_labeled_data)
Now, we have two relatively clean datasets: tweet_data, containing all the tweets, and tweet_labeled_data, containing the labeled tweets. Let’s look at tweet_labeled_data in more detail, since we will be using it for the modeling.
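A quick way to do that (a minimal sketch; the exact output depends on your copy of the files) is to peek at the structure and size of the labeled data frame:
dplyr::glimpse(tweet_labeled_data) # column types and a preview of each variable
nrow(tweet_labeled_data)           # how many labeled tweets survived the cleaning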
20.3 Imbalanced Data
Before proceeding with any supervised machine learning analysis, it is valuable and important to understand your variables further. For example, I often conduct topic modeling or other NLP analyses before proceeding with a text classifier using supervised machine learning. Another thing I do is check the proportions of the dataset. Are more tweets coded as about_ballot_harvesting == 1 or not (about_ballot_harvesting == 0)? We can check this by using table() on the variable and then prop.table() to get the proportions (use ?prop.table to learn more about this function).
table(tweet_labeled_data$about_ballot_harvesting) %>% prop.table()
##
##         0         1
## 0.2076923 0.7923077
table(tweet_labeled_data$conservative) %>% prop.table()
##
##         0         1
## 0.4461538 0.5538462
About 79% of the tweets are coded as “1” for about_ballot_harvesting. This is considered “imbalanced data” (or “unbalanced” data). Imbalanced data is pretty common in supervised machine learning, especially when working with social science datasets: many of the things we are interested in tend to be over-represented or under-represented. For example, in our labeled dataset, 79% of the posts appear to be about ballot harvesting. A model that simply coded every post as being about ballot harvesting would be wrong “only” 21% of the time, so it would look fairly accurate even though it had learned nothing.
In an unbalanced dataset, the label with more observations is called the “majority class” (for us, this is when about_ballot_harvesting == 1). The label with fewer observations is called the “minority class” (for us, this is when about_ballot_harvesting == 0).
There are a couple of different ways we can deal with unbalanced data. One strategy is to over- or under-sample: to shrink the majority class, we randomly remove instances from it; to grow the minority class, we randomly duplicate instances from it. Learn more about different strategies here and learn how to do these things with caret here.
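As an illustration only (we will keep our data unchanged below), caret ships with downSample() and upSample() helpers for exactly this kind of re-sampling. A minimal sketch, assuming the tweet_labeled_data object created above:
# down-sampling: randomly drop majority-class rows until both classes are the same size
balanced_down <- caret::downSample(x = tweet_labeled_data[, c("id", "text", "description")],
                                   y = tweet_labeled_data$about_ballot_harvesting,
                                   yname = "about_ballot_harvesting")
table(balanced_down$about_ballot_harvesting)

# up-sampling: randomly duplicate minority-class rows until both classes are the same size
balanced_up <- caret::upSample(x = tweet_labeled_data[, c("id", "text", "description")],
                               y = tweet_labeled_data$about_ballot_harvesting,
                               yname = "about_ballot_harvesting")
table(balanced_up$about_ballot_harvesting)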
As you become more advanced with R, I encourage you to also check out the package unbalanced for more advanced strategies for dealing with unbalanced data. You can check out the documentation for unbalanced here. As noted in this r-bloggers post, two of the most common strategies for dealing with imbalanced binary variables are ROSE and SMOTE.
For now, we will proceed with our analysis without changing the data. Because our conservative variable is more balanced (compared to about_ballot_harvesting), let’s work with this variable.
conservative_data <- tweet_labeled_data %>%
  select(id, description, conservative)