Chapter 20 Supervised Machine Learning

Today, we’ll be talking about supervised machine learning for text classification. A few weeks ago, we talked about logistic regression, one model that is common in supervised machine learning tasks. Logistic regression is a classification method, meaning it is used when the outcome is categorical (specifically, binary). This week, we’ll go over some more supervised machine learning (SML) classification algorithms. However, rather than using vote data, we’re going to complicate our process slightly by focusing on text.

As we have discussed in previous classes, text data is extra tricky because it contains so much information. As a result, the data structure is much more complex: if we treat each word as a feature, our matrix (our “document-term matrix” or “document-feature matrix”) becomes very large and very sparse.
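To see how quickly this happens, here is a small, self-contained sketch (my own toy example, using tidytext and tm, which we load below): even three short made-up tweets yield a document-term matrix with one column per unique word, and most cells are zero.

library(dplyr)
library(tidytext)

toy_tweets <- tibble::tibble(id = 1:3,
                             text = c("ballot harvesting is in the news",
                                      "the supreme court is in the news again",
                                      "taxes taxes taxes"))

toy_dtm <- toy_tweets %>%
  unnest_tokens(word, text) %>% #one row per word per tweet
  count(id, word) %>%           #word counts per tweet
  cast_dtm(id, word, n)         #tm-style document-term matrix

toy_dtm #printing the DocumentTermMatrix reports its dimensions and sparsity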

However, many of the things we want to analyze in mass communication exist in text or language: whether someone is emotionally happy or sad, whether people are using uncivil discourse, whether the sentiment of a message is positive or negative. In these instances, supervised machine learning can be very useful for applying one coding strategy consistently across millions of messages.

For this tutorial, we will be learning about four supervised machine learning models that are common in text classification tasks: k-Nearest Neighbors (kNN), Support-Vector Machines (SVM), decision trees, and random forests (which are ensembles of decision trees).

A warning: This tutorial uses 130 labeled data points. As we have discussed in class, this is an extremely small labeled set. A typical dataset with binary labels should have between 5,000 and 10,000 labels. However, for the purposes of illustrating the process, our small-n dataset will do.

A second warning: More than any other topic we have discussed in this class, supervised machine learning is far and away the most complex, and the one that requires the most additional learning. Two tutorials cannot teach supervised machine learning, and each of the algorithms introduced in this tutorial is really worth its own full class. Keep in mind that data scientists and engineers take many, many courses in supervised machine learning, and we will only be able to cover a fraction of that knowledge here.

We’ll begin by installing some new packages and loading our data.

20.1 Setting Up

This week, we will learn SML using caret, one of the most popular packages in R. caret is short for “Classification And REgression Training”, and it provides a uniform interface for hundreds of supervised machine learning algorithms. Because of this, caret has become a one-stop shop for R data scientists.

The main way caret does this is by tapping into a variety of other packages that contain more specific supervised machine learning algorithms and then standardizing each algorithm’s implementation. For this reason, it is often necessary to install other packages alongside caret (i.e., the packages that actually contain the algorithms). In this tutorial, we will use three new packages: LiblineaR, rpart, and ranger. Notice that I provide lines for installing these packages but I do not load them as libraries. While it is possible to do that, it is not necessary: caret will load the appropriate package when needed.
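To make the idea of a uniform interface concrete, here is a schematic sketch (not run, and using a hypothetical training_data object with a label column): the only thing that changes between algorithms is the method argument passed to train(). The method strings below are the caret identifiers that, to my knowledge, map onto these packages.

#schematic only: the same train() call, with method switching the algorithm
#train(label ~ ., data = training_data, method = "svmLinear3") #SVM via LiblineaR
#train(label ~ ., data = training_data, method = "rpart")      #decision tree via rpart
#train(label ~ ., data = training_data, method = "ranger")     #random forest via ranger
#train(label ~ ., data = training_data, method = "knn")        #k-nearest neighbors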

options(scipen=999)
set.seed(381)
#install.packages("caret")
#install.packages("LiblineaR") #will be used for svm
#install.packages("rpart") #will be used for decision trees
#install.packages("ranger") #will be used for random forest

library(tidyverse)
library(tokenizers)
library(caret) #but we do load the caret package!
library(tidytext)
library(tm)

Next, we will load the data in. In this tutorial, we have two files: tweets_ballotharvesting_v_trumptaxes_v_scotus.csv, which contains the original data, and tweets_ballotharvesting_v_trumptaxes_v_scotus_labels2.csv, which contains the 130 labels. If you are producing labels (from a content analysis, for example), you should structure your data similarly: one data frame of the raw data and one data frame of the labeled data.

In addition to loading in this data, we will use select() to focus on the specific variables we are interested in. Here, we select the id column, the text of the tweet (used for the about_ballot_harvesting label), and the profile description (used for the conservative label). For our labeled dataset, we obviously also want to include the columns containing the labels, so we will include about_ballot_harvesting and conservative. Make sure the id column is included in both data frames so the labeled tweets can be matched back to the full dataset.

tweet_data <- read_csv("data/tweets_ballotharvesting_v_trumptaxes_v_scotus.csv") %>%
  select(`...1`, text, description) %>%
  rename(id = `...1`) #rename the unnamed first column to id

tweet_labeled_data <- read_csv("data/tweets_ballotharvesting_v_trumptaxes_v_scotus_labels2.csv") %>%
  select(id, text, description, about_ballot_harvesting, conservative)
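
Because the labeled tweets were drawn from the same collection as the full dataset, it is worth a quick check (my addition, assuming the two files share the same id values) that every labeled id also appears in tweet_data:

#this should return TRUE if every labeled tweet is present in the full dataset
all(tweet_labeled_data$id %in% tweet_data$id)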

20.2 Data Cleaning

Let’s move on to the data cleaning!

Importantly, you want to make sure your labels are treated as factors. If they are stored as numeric values, R (and caret) will treat them as numbers rather than categories, and the model will be fit as a regression rather than a classification.

tweet_labeled_data$about_ballot_harvesting <- as.factor(tweet_labeled_data$about_ballot_harvesting)
tweet_labeled_data$conservative <- as.factor(tweet_labeled_data$conservative)
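
A quick way to confirm the conversion worked is to inspect the factor levels; both variables should now report the levels "0" and "1".

#check that both labels are now factors with the expected 0/1 levels
levels(tweet_labeled_data$about_ballot_harvesting)
levels(tweet_labeled_data$conservative)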

Because a URL itself tells us little about the content it links to, it’s often best to exclude URLs from your text data. We do this in the text columns of both datasets (the full dataset and the labeled dataset) using a regular expression.

tweet_labeled_data <- tweet_labeled_data %>% 
  dplyr::mutate(text = stringr::str_replace_all(text, " ?(f|ht)tp(s?)://(.*)[.][a-z]+", ""))

tweet_data <- tweet_data %>% 
  dplyr::mutate(text = stringr::str_replace_all(text, " ?(f|ht)tp(s?)://(.*)[.][a-z]+", ""))
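
If you want to see what this pattern actually strips out before applying it to real tweets, you can test it on a throwaway string (the example string below is invented):

#test the URL pattern on a made-up string and inspect what gets removed
stringr::str_replace_all("Read this https://example.com/story and retweet",
                         " ?(f|ht)tp(s?)://(.*)[.][a-z]+", "")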

Next, we want to make sure we exclude any rows with NA.

tweet_data <- na.exclude(tweet_data)

tweet_labeled_data <- na.exclude(tweet_labeled_data)
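
As a quick sanity check (my addition), we can confirm that no missing values remain in either data frame:

#both calls should return FALSE after na.exclude()
anyNA(tweet_data)
anyNA(tweet_labeled_data)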

Now we have two relatively clean datasets: tweet_data, which contains all the tweets, and tweet_labeled_data, which contains the labeled tweets. Let’s look at tweet_labeled_data in more detail, since we will be using it for the modeling.

20.3 Imbalanced Data

Before proceeding with any supervised machine learning analysis, it is valuable and important to understand your variables further. For example, I often conduct topic modeling or other NLP analyses before building a text classifier with supervised machine learning. Another thing I do is check the class proportions in the dataset. Are more tweets coded as being about ballot harvesting (about_ballot_harvesting == 1) or not (about_ballot_harvesting == 0)? We can check this by using table() on the variable and then prop.table() to get the proportions (use ?prop.table to learn more about this function).

table(tweet_labeled_data$about_ballot_harvesting) %>% prop.table()
## 
##         0         1 
## 0.2076923 0.7923077
table(tweet_labeled_data$conservative) %>% prop.table()
## 
##         0         1 
## 0.4461538 0.5538462

About 79% of the tweets are coded as “1” on about_ballot_harvesting. This is considered “imbalanced data” (or “unbalanced” data). Imbalanced data is pretty common in supervised machine learning, especially when working with social science datasets: many of the things we are interested in tend to be over-represented or under-represented. For example, in our labeled dataset, 79% of the posts are about ballot harvesting, so a model that blindly labels every post as being about ballot harvesting would still appear to be incorrect “only” 21% of the time, despite having learned nothing.

In an imbalanced dataset, the label with more observations is called the “majority class” (for us, this is about_ballot_harvesting == 1). The label with fewer observations is called the “minority class” (for us, this is about_ballot_harvesting == 0).

There are a couple of different ways we can deal with imbalanced data. One strategy is to over-sample or under-sample: to shrink the majority class, we randomly remove instances from it (under-sampling); to grow the minority class, we randomly duplicate instances from it (over-sampling). Learn more about different strategies here and learn how to do these things with caret here; a brief sketch of caret’s helpers follows below.
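
To make this concrete, caret ships with downSample() and upSample() helpers for exactly this purpose. The sketch below shows how they could be applied to our labeled data; we do not actually use the re-balanced data frames in this tutorial.

#a minimal sketch of caret's built-in re-balancing helpers (not used below)
#downSample() randomly drops majority-class rows until the classes are equal
balanced_down <- caret::downSample(x = tweet_labeled_data %>% select(id, text, description),
                                   y = tweet_labeled_data$about_ballot_harvesting,
                                   yname = "about_ballot_harvesting")
table(balanced_down$about_ballot_harvesting)

#upSample() randomly duplicates minority-class rows until the classes are equal
balanced_up <- caret::upSample(x = tweet_labeled_data %>% select(id, text, description),
                               y = tweet_labeled_data$about_ballot_harvesting,
                               yname = "about_ballot_harvesting")
table(balanced_up$about_ballot_harvesting)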

As you become more advanced with R, I encourage you to also check out the package unbalanced for more advanced strategies for dealing with unbalanced data. You can check out the documentation for unbalanced here. As noted in this r-bloggers post, two of the most common strategies for dealing with imbalanced binary data are ROSE and SMOTE.

For now, we will proceed with our analysis without changing the data. Because our conservative variable is more balanced (compared to about_ballot_harvesting), let’s work with this variable.

conservative_data <- tweet_labeled_data %>%
  select(id, description, conservative)