Chapter 24 Supervised ML

Today, we’ll be talking about supervised machine learning for text classification using caret. A few weeks ago, we talked about logistic regression, one model that is common in supervised machine learning tasks. Logistic regression is a classification method, meaning that it is used when dealing with categorical (specifically, binary) outcome variables. This week, we’ll be going over some more classification supervised machine learning (SML) algorithms. However, rather than using vote data, we’re going to complicate our process slightly by focusing on text.

As we have discussed in previous classes, text data is extra tricky because it contains so much information. As a result, the data structure is much more complex: if we treat each word as a feature, our matrix (our “document-term matrix” or “document-feature matrix”) becomes very large and very sparse.

However, many of the things we want to analyze in mass communication exist in text or language: whether someone is emotionally happy or sad, whether people are using uncivil discourse, whether the sentiment of a message is positive or negative. In these instances, supervised machine learning can be very useful for applying one coding strategy across millions of messages.

For this tutorial, we will be learning about four supervised machine learning models that are common in text classification tasks: k-Nearest Neighbors (kNN), Support Vector Machines (SVM), decision trees, and random forests (which build on decision trees).

A warning: This tutorial uses 130 labeled data points. As we have discussed in class, this is an extremely small labeled set. A typical dataset with binary labels should have between 5,000 and 10,000 labels. However, for the purposes of illustrating the process, our small-n dataset will do.

A second warning: Of all the topics we have discussed in this class, supervised machine learning is far and away the most complex, and the one that requires the most additional learning. Two tutorials cannot teach supervised machine learning, and each of the algorithms introduced in this tutorial is really worth its own full class. Keep in mind that data scientists and engineers take many, many courses in supervised machine learning, and we will only be able to cover a fraction of that knowledge in this tutorial.

We’ll begin by installing some new packages and loading our data.

24.1 Setting Up

This week, we will learn SML using caret, one of the most popular packages in R. caret is short for “Classification And REgression Training”, and it provides a uniform interface for hundreds of supervised machine learning algorithms. Because of this, caret has become a one-stop shop for R data scientists.

The main way caret does this is by tapping into a variety of other packages that contain more specific supervised machine learning algorithms and then standardizing each algorithm’s implementation. For this reason, it is often necessary to install other packages alongside caret (i.e., the packages that actually contain the algorithms). In this tutorial, we will use three new packages: LiblineaR, rpart, and ranger. Notice that in this tutorial, I provide lines for installing these packages but I do not load them in as libraries. While it is possible to do that, it is not necessary: caret will load each package when it is needed.

options(scipen=999)
set.seed(381)
#install.packages("caret")
#install.packages("LiblineaR") #will be used for svm
#install.packages("rpart") #will be used for decision trees
#install.packages("ranger") #will be used for random forest

library(tidyverse)
library(tokenizers)
library(caret) #but we do load the caret package!
library(tidytext)
library(tm)

Next, we will load the data in. In this tutorial, we have two files: tweets_ballotharvesting_v_trumptaxes_v_scotus.csv, which contains the original data, and tweets_ballotharvesting_v_trumptaxes_v_scotus_labels2.csv, which contains the 130 labels. If you are producing labels (from a content analysis, for example), you should have your data structured similarly: one data frame of the raw data and one data frame of the labeled data.

In addition to loading in this data, we will use select() to focus on the specific variables we are interested in. Here, we select the id column, the text of the tweet (used for the about_ballot_harvesting label), and the profile description (used for the conservative label). For our labeled dataset, we obviously also want to include the columns containing the labels, so we will include about_ballot_harvesting and conservative. Make sure the identifier column matches across the two data frames; in the code below, we rename the first column of tweet_data to id so it lines up with the labeled data.

tweet_data <- read_csv("data/tweets_ballotharvesting_v_trumptaxes_v_scotus.csv") %>%
  select(`...1`, text, description)
colnames(tweet_data)[1] <- "id"

tweet_labeled_data <- read_csv("data/tweets_ballotharvesting_v_trumptaxes_v_scotus_labels2.csv") %>%
  select(id, text, description, about_ballot_harvesting, conservative)

24.2 Data Cleaning

Let’s move onto the data cleaning!

Importantly, you want to make sure your labels are treated as factors. If they are set as numerics, R will treat them as numbers and not categories.

tweet_labeled_data$about_ballot_harvesting <- as.factor(tweet_labeled_data$about_ballot_harvesting)
tweet_labeled_data$conservative <- as.factor(tweet_labeled_data$conservative)

Because R can’t tell what is contained in a URL, it’s often best to exclude URLs from your text data. We do this in the text columns of both datasets (the full dataset and the labeled dataset) using a regular expression.

tweet_labeled_data <- tweet_labeled_data %>% 
  dplyr::mutate(text = stringr::str_replace_all(text, " ?(f|ht)tp(s?)://(.*)[.][a-z]+", ""))

tweet_data <- tweet_data %>% 
  dplyr::mutate(text = stringr::str_replace_all(text, " ?(f|ht)tp(s?)://(.*)[.][a-z]+", ""))

Next, we want to make sure we exclude any rows with NA.

tweet_data <- na.exclude(tweet_data)

tweet_labeled_data <- na.exclude(tweet_labeled_data)

Now, we have two relatively clean datasets: tweet_data, containing all the tweets, and tweet_labeled_data, containing the labeled tweets. Let’s look at tweet_labeled_data in more detail, since we will be using it for the modeling.

24.2.1 Imbalanced Data

Before proceeding with any supervised machine learning analysis, it is valuable and important to understand your variables further. For example, I often conduct topic modeling or other NLP analyses before proceeding with a text classifier using supervised machine learning. Another thing I do is check the proportions of the dataset. Are more tweets coded as about_ballot_harvesting == 1 or not (about_ballot_harvesting == 0)? We can do this by using table() on the variable and then prop.table() to get the proportion (use ?prop.table to learn more about this function).

table(tweet_labeled_data$about_ballot_harvesting) %>% prop.table()
## 
##         0         1 
## 0.2076923 0.7923077
table(tweet_labeled_data$conservative) %>% prop.table()
## 
##         0         1 
## 0.4461538 0.5538462

About 79% of the tweets are coded as “1” on about_ballot_harvesting. This is considered “imbalanced” (or “unbalanced”) data. Imbalanced data is pretty common in supervised machine learning, especially when working with social science datasets. Many of the things we are interested in tend to be over-represented or under-represented. For example, in our labeled dataset, 79% of the posts appear to be about ballot harvesting. A model that automatically codes every post as being about ballot harvesting would appear to be incorrect “only” 21% of the time.

In an imbalanced dataset, the label with more observations is called the “majority class” (for us, this is when about_ballot_harvesting == 1). The label with fewer observations is called the “minority class” (for us, this is when about_ballot_harvesting == 0).

There are a couple of different ways we can deal with imbalanced data. One strategy is to over- or under-sample. When we want to shrink the majority class, we randomly remove instances from the majority class (under-sampling). When we want to grow the minority class, we randomly duplicate its instances (over-sampling). Learn more about different strategies here and learn how to do these things with caret here.
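
As a quick illustration (we will not use this rebalanced data below), here is a minimal sketch of what under- and over-sampling could look like with caret’s downSample() and upSample() helpers, applied to our about_ballot_harvesting label. The object names balanced_down and balanced_up are just illustrative.

#a sketch of rebalancing with caret (not used in the rest of this tutorial)
#downSample() randomly drops majority-class rows; upSample() randomly duplicates minority-class rows
balanced_down <- caret::downSample(x = tweet_labeled_data[, c("id", "text", "description")],
                                   y = tweet_labeled_data$about_ballot_harvesting, #must be a factor
                                   yname = "about_ballot_harvesting")
balanced_up <- caret::upSample(x = tweet_labeled_data[, c("id", "text", "description")],
                               y = tweet_labeled_data$about_ballot_harvesting,
                               yname = "about_ballot_harvesting")

table(balanced_down$about_ballot_harvesting) #the two classes now have equal counts
table(balanced_up$about_ballot_harvesting)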

As you become more advanced with R, I encourage you to also check out the package unbalanced for more advanced strategies for dealing with unbalanced data. You can check out the documentation for unbalanced here. As noted in this r-bloggers post, two of the most common strategies for dealing with imbalanced binary variables are ROSE and SMOTE.

For now, we will proceed with our analysis without changing the data. Because our conservative variable is more balanced (compared to about_ballot_harvesting), let’s work with this variable.

conservative_data <- tweet_labeled_data %>%
  select(id, description, conservative)

24.3 Data Wrangling

Since we’ll be working with text data in our tutorial, we will need to wrangle the data so it can be used for supervised machine learning. If you are not working with text data, but instead are working with binary, categorical, and continuous variables, you can go straight to your modeling (as we did with the logistic regression).

This is going to get a little complicated, so please bear with me!

The first thing we’ll want to do is construct a dataset of tokens from conservative_data (our labeled dataset). Here, we will use tidytext, which makes this a quick and relatively easy process.

conservative_tokens <- unnest_tokens(conservative_data, word, description) %>% #tokenize
  anti_join(stop_words, by = "word") %>% #remove stopwords
  dplyr::count(id, word) #count the frequency of words used by tweet

24.3.1 tf-idf

Next, we’ll cast the data as a document-term matrix. Before we do this, though, we’ll add some more information to our text data: tf_idf.

tf-idf (term frequency-inverse document frequency) is an NLP measure that indicates how unique a word is to a document in a corpus. If a word’s tf-idf score is high, that word appears frequently in the document but rarely in the other documents. Words that appear in every document score near zero (their inverse document frequency is zero), and words that a document uses only sparingly also score low (their term frequency is small). You can learn more about tf-idf in the tidytext textbook.

We use tf-idf scores in our supervised machine learning model, as opposed to raw counts, because tf-idf contains more information about the importance of a word to a specific document, whereas counts just give you the frequency of the words’ use. To put it another way: tf-idf lets us compare words based on how distinctive they are to a document.
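
To make the formula concrete, here is a tiny, made-up example using the standard definitions (term frequency = the share of a document’s tokens that are the word; inverse document frequency = the natural log of the number of documents divided by the number of documents containing the word). The numbers below are invented purely for illustration.

#a toy tf-idf calculation (invented numbers, for illustration only)
n_docs         <- 100               #documents in a toy corpus
docs_with_word <- 4                 #documents that contain the word
tf  <- 6 / 120                      #the word appears 6 times in a 120-word document
idf <- log(n_docs / docs_with_word) #ln(100/4), about 3.22
tf * idf                            #tf-idf, about 0.16

log(n_docs / n_docs)                #a word in every document has idf = 0, so tf-idf = 0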

con_dtm <- tidytext::bind_tf_idf(conservative_tokens, word, id, n) %>% #calculate tf-idf
  tidytext::cast_dtm(id, word, tf_idf) #construct a document-term matrix with the tf-idf scores

Now we have wrangled our text data into document-term matrix form! Recall that, in natural language processing, this type of wrangling is called a “bag of words” pre-processing strategy. With this data, we can proceed with our supervised machine learning analysis.

24.4 Data Partitioning

As with our logistic regression tutorial, we will begin by partitioning the labeled dataset into a training set and a test set (in this tutorial, we will use a 70/30 split). We will fit each algorithm on the training set and then test the quality of the resulting model on the test set. We will do this four times: once for each algorithm. Then, we will compare how accurately each algorithm performed to select one that we can apply to the whole dataset.

trainIndex <- createDataPartition(y = conservative_data$conservative, p = 0.7,list = FALSE)

In the next few lines, I construct training and test sets by splitting the document-term matrix up. But the dtm doesn’t contain the labels, so I will also need to subset the conservative code in the original dataset (conservative_data) using the same partition.

set.seed(381)
tw_to_train <- con_dtm[trainIndex, ] %>% as.matrix() %>% as.data.frame()
tw_to_test <- con_dtm[-trainIndex, ] %>% as.matrix() %>% as.data.frame()

conservative_code <- conservative_data$conservative[trainIndex]

Now, we have our training set, our test set, and the labels for our supervised machine learning model. Yay!

24.5 Model Construction

One way we can increase the quality of a supervised machine learning model is to use a resampling strategy, which repeatedly draws samples from the training data to construct more stable estimates of model performance. For this tutorial, we will use the same resampling strategy over and over, so we can save this setting using the trainControl() function. trainControl() is especially useful when you are applying a bunch of different models but are using the same arguments, as we will do here.

trctrl <- trainControl(method = "boot")

For this tutorial, we will use the bootstrapping method of resampling, which you can learn about in this Towards Data Science tutorial.
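
Bootstrapping is caret’s default, but it is not the only option. If you preferred, say, k-fold cross-validation, only the trainControl() call would change; the train() calls below would stay the same. A sketch (we will stick with the bootstrap in this tutorial):

#a sketch of alternative resampling strategies (not used below)
trctrl_cv  <- trainControl(method = "cv", number = 5)                      #5-fold cross-validation
trctrl_rcv <- trainControl(method = "repeatedcv", number = 5, repeats = 3) #5 folds, repeated 3 times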

We’re almost ready to start using some SML models! But first, you need to learn about hyperparameters.

24.5.1 (Hyper)parameter Optimizing

In supervised machine learning, algorithms “learn” which parameters (features or combinations of features) are optimal for their classification task. In our standard linear model (y = a + bx), a and b are parameters (y is the outcome, and x is a feature). However, algorithms sometimes require additional parameters that the user (you!) is expected to provide. Parameters that are provided by the user are called “hyperparameters.” In unsupervised machine learning, the most important hyperparameter is k (i.e., the number of topics or clusters you have). Because each SML algorithm is unique (as in, based on a unique body of mathematical logic), each algorithm has its own set of unique hyperparameters.

Still struggling to distinguish hyperparameters and parameters? Check out this explanation.

The process of identifying the right hyperparameters for your model is tedious and time-consuming, just as identifying the right k for unsupervised machine learning is time-consuming. However, it is an important part of ensuring you produce a quality text classifier. Tuning the hyperparameters can greatly improve the quality of your algorithm, but there are no “tried and true” rules for hyper-parameters because they are intentionally meant to be tuned to different types of data. In other words, there is no hyper-parameter value that is “right” all the time; it is only “right” for your specific data and your specific supervised machine learning task.

Because this tutorial is a broad overview of supervised machine learning algorithms, we will not be able to go into hyperparameter optimization in depth. However, the way we compare different algorithms at the end of this tutorial is similar to the way in which we would compare two text classifiers using the same algorithm but different hyperparameters (i.e., using percent agreement and F-scores). I also encourage you to play around with the hyper-parameters in this tutorial so you can see how they change the results of the analysis.

For a full list of the hyperparameters for each model, check out the caret tutorial. If you intend to work with supervised machine learning in R, I encourage you to familiarize yourself with this tutorial, as it is extremely useful for any aspiring data scientist working in R.

Okay, onto the algorithms!

24.6 kNN

The first algorithm we will learn is k-Nearest Neighbors (kNN), which classifies a data point by looking at the labeled data points closest to it (its “neighbors”). The assumption of this model is that similar data points will be close to one another (this is similar to the logic underlying community detection or k-means analysis).

One advantage of kNN is that it is relatively simple (it requires very few hyper-parameters) and it can be especially useful for complex categorical data (as in, cases where you have more than 2 labels in a variable). However, kNN also takes some time to classify new data points, and the results vary greatly by its one hyper-parameter (k).

To construct a kNN model, we will use the train() function in the caret package. This is the workhorse function of the package: any time you are training a new model, you will use the train() function. train() generally requires three types of information: (1) the data (x and y), (2) the algorithm (method), and (3) the hyperparameters that are unique to each supervised machine learning algorithm (tuneGrid). In kNN, there is only one hyperparameter (k), the number of neighbors that the algorithm will look at (for simplicity, we will use 2 here).

knn_model_con <- caret::train(x = tw_to_train, #training data
                 y = as.factor(conservative_code), #labeled data
                 method = "knn", #the algorithm
                 trControl = trctrl, #the resampling strategy we will use
                 tuneGrid = data.frame(k = 2) #the hyperparameter
                 )

print(knn_model_con) #print this model
## k-Nearest Neighbors 
## 
##   92 samples
## 1035 predictors
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 92, 92, 92, 92, 92, 92, ... 
## Resampling results:
## 
##   Accuracy   Kappa      
##   0.5302966  0.003601896
## 
## Tuning parameter 'k' was held constant at a value of 2

Based on the information from knn_model_con, we know the model was trained on 92 tweets (the training set), which had 1035 predictors (features) and 2 classes (0 and 1).

What we don’t know from this information, however, is the quality of the algorithm. To do that, we will have to turn to the test data.

24.6.1 Testing the Model

To apply this algorithm to the test data, let’s use predict() which we learned about in our Advanced Linear Regression tutorial (Week 10).

knn_predict <- predict(knn_model_con, newdata = tw_to_test)

Instead of checking the percent accuracy of this data, however, we will learn to use a function from the caret package: confusionMatrix(). confusionMatrix() is useful because it provides more than just the percent accuracy measure–it will report other measures that account for random chance, as well as the F-score, a common measurement of accuracy in supervised machine learning.

knn_confusion_matrix <- caret::confusionMatrix(knn_predict, conservative_data$conservative[-trainIndex], mode = "prec_recall")
knn_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  0  0
##          1 17 21
##                                          
##                Accuracy : 0.5526         
##                  95% CI : (0.383, 0.7138)
##     No Information Rate : 0.5526         
##     P-Value [Acc > NIR] : 0.5668659      
##                                          
##                   Kappa : 0              
##                                          
##  Mcnemar's Test P-Value : 0.0001042      
##                                          
##               Precision :     NA         
##                  Recall : 0.0000         
##                      F1 :     NA         
##              Prevalence : 0.4474         
##          Detection Rate : 0.0000         
##    Detection Prevalence : 0.0000         
##       Balanced Accuracy : 0.5000         
##                                          
##        'Positive' Class : 0              
## 

As you can see here, the accuracy of this supervised machine learning model is pretty weak (55%), and that figure is exactly the no-information rate: the model simply predicted “1” for every tweet in the test set. Once you account for the imbalance, the balanced accuracy is only 0.50, no better than random guessing, and the F1 score cannot even be computed (NA) because the model never predicted the positive class (“0”).

Remember when I said that kNN’s results vary by its hyper-parameter? Try this out for yourself by changing the value of k in the model above and checking the accuracy score and F1 score; a sketch of how to try several values at once follows below.
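
Here is a minimal sketch of how you could let caret try several values of k at once: pass a multi-row data frame to tuneGrid and caret will report the resampled accuracy for each candidate and keep the best one. The object name knn_tuned_con is just illustrative.

#a sketch of tuning k: caret refits the model for each candidate value
knn_tuned_con <- caret::train(x = tw_to_train,
                 y = as.factor(conservative_code),
                 method = "knn",
                 trControl = trctrl,
                 tuneGrid = data.frame(k = c(2, 5, 10, 20)) #candidate numbers of neighbors
                 )
#print(knn_tuned_con) #shows Accuracy and Kappa for every k tried, and the value caret selected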

Want to learn more about kNN? Check out these tutorials:
* Towards Data Science tutorial
* kNN for dummies explanation
* kNN with non-text data

24.7 SVM

Let us now proceed to our second algorithm, support vector machines (SVM). SVM is an algorithm that is especially good for binary classification. It works by plotting the data points in an n-dimensional space, where n is the number of features (in this case, over a thousand words), and then identifying a “hyper-plane” that divides the observations into two spaces. SVM is considered a “large margin classifier” because it tries to identify the largest margin between the two classes in this space. Learn more about large margin classifiers here.

SVM is very popular in supervised machine learning because it is well-equipped for highly dimensional data (i.e., when you have a lot of features, as in natural language processing) and for handling “close” cases (data points that could plausibly be classified as either 1 or 0). To handle these close cases, we modify the algorithm using two hyperparameters: cost, which is used to account for overfitting, and Loss, which penalizes values that would be misclassified. Learn more about cost here and (hinge) Loss here.

In R, a couple of different packages implement SVMs. We’ll use the implementation from the LiblineaR package.

Like the kNN algorithm, we will use train() in caret to construct our model using (1) the data (x and y), (2) the algorithm (method = "svmLinear3"), and (3) the hyperparameters (for this algorithm, cost and Loss). Let’s apply the algorithm now. Note that not much changes, aside from the method and tuneGrid arguments.

svm_model <- caret::train(x = tw_to_train,
                 y = as.factor(conservative_code),
                 method = "svmLinear3",
                 trControl = trctrl, 
                 tuneGrid = data.frame(cost = 1, #accounts for over-fitting
                                       Loss = 2)) #accounts for misclassifications

print(svm_model)
## L2 Regularized Support Vector Machine (dual) with Linear Kernel 
## 
##   92 samples
## 1035 predictors
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 92, 92, 92, 92, 92, 92, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6523123  0.2000891
## 
## Tuning parameter 'cost' was held constant at a value of 1
## Tuning parameter 'Loss' was held constant at a value of 2

Now let’s apply this to the test data.

24.7.1 Testing the Model

Once again, we’ll use predict() to apply the model.

svm_predict <- predict(svm_model, newdata = tw_to_test)

And next, we’ll use confusionMatrix().

svm_confusion_matrix <- caret::confusionMatrix(svm_predict, conservative_data$conservative[-trainIndex], mode = "prec_recall")
svm_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  1  2
##          1 16 19
##                                           
##                Accuracy : 0.5263          
##                  95% CI : (0.3582, 0.6902)
##     No Information Rate : 0.5526          
##     P-Value [Acc > NIR] : 0.688974        
##                                           
##                   Kappa : -0.0395         
##                                           
##  Mcnemar's Test P-Value : 0.002183        
##                                           
##               Precision : 0.33333         
##                  Recall : 0.05882         
##                      F1 : 0.10000         
##              Prevalence : 0.44737         
##          Detection Rate : 0.02632         
##    Detection Prevalence : 0.07895         
##       Balanced Accuracy : 0.48179         
##                                           
##        'Positive' Class : 0               
## 

Although the accuracy of this algorithm is similar to the kNN’s, the model at least makes some predictions for the positive class (“0”), so precision (0.33), recall (0.06), and an F1 score (0.10) can actually be computed. These numbers are still quite weak: the model correctly identified only 1 of the 17 positive cases in the test set.

Want to learn more about SVM? Check out these guides:
* Towards Data Science analysis of SVM and kNN (among others)
* MonkeyLearn explanation of hyperplanes
* StatQuest video on SVM
* SVM in e1071, another package with an svm() function

24.8 Decision Trees / CART

Let’s go now to our third algorithm, decision trees. Decision trees are the most basic type of “tree-based model” (more complex tree-based models build on the decision tree). In tree-based models, the algorithm decides how to classify data points using a series of decisions. At each “node” of a decision tree, the data are split further and further until each data point is classified. Tree-based algorithms are popular because they can be used for both classification and regression tasks, but simple decision trees can be sensitive to over-fitting. Sometimes, you’ll see the name CART used instead of decision trees (CART stands for “Classification And Regression Trees”).

For this model we will use rpart, which has one tuning hyperparameter, cp (the complexity parameter). We won’t set it ourselves in this tutorial; as you can see in the output below, caret tries a few values automatically and keeps the best one. You can learn more about cp here.

dt_mod <- train(x = tw_to_train,
                y = as.factor(conservative_code),
                method = "rpart",
                trControl = trctrl
                )

print(dt_mod)
## CART 
## 
##   92 samples
## 1035 predictors
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 92, 92, 92, 92, 92, 92, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa     
##   0.02439024  0.6503691  0.35144800
##   0.07317073  0.6477884  0.34820845
##   0.26829268  0.5456244  0.09876157
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02439024.

Now to check with the test data!

24.8.1 Testing the Model

You should be familiar with this process by now! First, we’ll use predict()

dt_predict <- predict(dt_mod, newdata = tw_to_test)

… and then assess with confusionMatrix()!

dt_confusion_matrix <- caret::confusionMatrix(dt_predict, conservative_data$conservative[-trainIndex], mode = "prec_recall")
dt_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 16 13
##          1  1  8
##                                           
##                Accuracy : 0.6316          
##                  95% CI : (0.4599, 0.7819)
##     No Information Rate : 0.5526          
##     P-Value [Acc > NIR] : 0.208107        
##                                           
##                   Kappa : 0.3018          
##                                           
##  Mcnemar's Test P-Value : 0.003283        
##                                           
##               Precision : 0.5517          
##                  Recall : 0.9412          
##                      F1 : 0.6957          
##              Prevalence : 0.4474          
##          Detection Rate : 0.4211          
##    Detection Prevalence : 0.7632          
##       Balanced Accuracy : 0.6611          
##                                           
##        'Positive' Class : 0               
## 

Based on the output shown here, this algorithm actually does better than kNN and SVM with our data (F1 = 0.70, accuracy of about 63%), but there is still plenty of room for improvement. Try setting the cp parameter yourself and see if you can get a better result! A sketch of how to do that follows.
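
Here is a minimal sketch of supplying cp explicitly through tuneGrid; the candidate values are arbitrary starting points, not recommendations, and the object name dt_tuned is just illustrative.

#a sketch of tuning the decision tree's cp hyperparameter
#smaller cp values allow more splits (more complex trees); larger values prune more aggressively
dt_tuned <- train(x = tw_to_train,
                y = as.factor(conservative_code),
                method = "rpart",
                trControl = trctrl,
                tuneGrid = data.frame(cp = c(0.001, 0.01, 0.05))
                )
#print(dt_tuned) #compare the resampled accuracy across the candidate cp values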

Learn more about decision trees:
* Learn about the algorithm behind the decision trees in this medium article.
* Learn about other tree-based models in this explanation (her posts, in general, are quite good).

24.9 Random Forest

Random forests are a more advanced application of decision trees/CART. In fact, they are technically an ensemble algorithm because they combine the results of many decision trees to classify data. Because random forest is a more complex algorithm, it needs a few more hyperparameters to work. We include three here (but there are more): mtry sets the number of features randomly sampled as split candidates at each node, splitrule sets the rule for how to split the data as decisions are made (you can learn more about the math from the original article), and min.node.size controls the depth of your trees (a node keeps branching until it reaches the minimum node size, which we set to 5 data points).

To use random forests, we will change the method (to "ranger") and will add our tuneGrid.

rf_mod <- train(x = tw_to_train,
                y = as.factor(conservative_code),
                method = "ranger",
                trControl = trctrl,
                tuneGrid = data.frame(mtry = floor(sqrt(dim(tw_to_train)[2])),
                                      splitrule = "extratrees",
                                      min.node.size = 5))

print(rf_mod)
## Random Forest 
## 
##   92 samples
## 1035 predictors
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 92, 92, 92, 92, 92, 92, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.5840458  0.2021538
## 
## Tuning parameter 'mtry' was held constant at a value of 32
## Tuning parameter 'splitrule' was held constant at a value of extratrees
## Tuning
##  parameter 'min.node.size' was held constant at a value of 5

Then, we test the model on our test set.

24.9.1 Testing the Model

rf_predict <- predict(rf_mod, newdata = tw_to_test)
rf_confusion_matrix <- caret::confusionMatrix(rf_predict, conservative_data$conservative[-trainIndex], mode = "prec_recall")
rf_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 15  7
##          1  2 14
##                                           
##                Accuracy : 0.7632          
##                  95% CI : (0.5976, 0.8856)
##     No Information Rate : 0.5526          
##     P-Value [Acc > NIR] : 0.006072        
##                                           
##                   Kappa : 0.5341          
##                                           
##  Mcnemar's Test P-Value : 0.182422        
##                                           
##               Precision : 0.6818          
##                  Recall : 0.8824          
##                      F1 : 0.7692          
##              Prevalence : 0.4474          
##          Detection Rate : 0.3947          
##    Detection Prevalence : 0.5789          
##       Balanced Accuracy : 0.7745          
##                                           
##        'Positive' Class : 0               
## 

The F-score (0.77) is better than the decision tree’s; in fact, based on the output shown here, it is the highest of the algorithms we have tested.

Learn more about Random Forests:
* Analytics Vidhya explanation & Python tutorial
* Stack Overflow thread on node sizes, useful for hyperparameters
* This UC Business Analytics tutorial does a great job discussing the hyper-parameters, including the ones not discussed in this tutorial.
* A good tutorial on random forests that includes plots.
* Another good tutorial on random forests, using the rf() function rather than the ranger package.

24.10 Model Fit

Now that we have constructed four different models, let’s compare each of them against each other using the F-scores.

knn_confusion_matrix$byClass[7] #knn
## F1 
## NA
svm_confusion_matrix$byClass[7] #svm
##  F1 
## 0.1
dt_confusion_matrix$byClass[7] #decision tree
##        F1 
## 0.6956522
rf_confusion_matrix$byClass[7] #random forest
##        F1 
## 0.7692308

By comparing the F-scores, we can compare the quality of each model. In the output shown here, the random forest produces the best results, while kNN and SVM struggle with the positive class. But a comparatively good F-score on its own does not mean that an algorithm works well; the full confusion matrix gives a more complete picture. Take another look at the SVM’s:

svm_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  1  2
##          1 16 19
##                                           
##                Accuracy : 0.5263          
##                  95% CI : (0.3582, 0.6902)
##     No Information Rate : 0.5526          
##     P-Value [Acc > NIR] : 0.688974        
##                                           
##                   Kappa : -0.0395         
##                                           
##  Mcnemar's Test P-Value : 0.002183        
##                                           
##               Precision : 0.33333         
##                  Recall : 0.05882         
##                      F1 : 0.10000         
##              Prevalence : 0.44737         
##          Detection Rate : 0.02632         
##    Detection Prevalence : 0.07895         
##       Balanced Accuracy : 0.48179         
##                                           
##        'Positive' Class : 0               
## 

Its accuracy is only about 53%, which is actually below the no-information rate of 55%, and it catches almost none of the positive class: no better than random chance.

For publication, it would be ideal to produce supervised machine learning algorithms with an F-score of at least .8 (you’ll see papers between .6 and .8, but these are not optimal, especially in an increasingly competitive publication environment).

Let’s talk about f-scores a little more since they’re so important to evaluating supervised machine learning algorithms.

24.10.1 F-scores: Precision and Recall

F-scores are based on two other metrics: precision and recall. Precision is calculated by dividing the number of correctly identified positive elements by the total number of elements the model predicted as positive; when your model has a lot of false positives, precision is penalized. Recall is calculated by dividing the number of correctly identified positive elements by the number of elements that are actually positive (i.e., that should have been coded as positive); when your model has a lot of false negatives, recall is penalized. Note that in the confusion matrices above, caret treats the first factor level (“0”) as the positive class.

The F-score (or “F1 score”) is the harmonic mean of precision and recall.
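
To see where the numbers in the SVM output come from, here is the arithmetic by hand. With “0” as the positive class, the SVM confusion matrix above has 1 true positive (predicted 0, actually 0), 2 false positives (predicted 0, actually 1), and 16 false negatives (predicted 1, actually 0).

#by-hand precision, recall, and F1 for the SVM confusion matrix above
tp <- 1   #predicted 0, actually 0
fp <- 2   #predicted 0, actually 1
fn <- 16  #predicted 1, actually 0

precision <- tp / (tp + fp)                    #1/3, about 0.333
recall    <- tp / (tp + fn)                    #1/17, about 0.059
2 * precision * recall / (precision + recall)  #F1 = 0.10, matching confusionMatrix()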

You may have noticed that the confusionMatrix() function returns both the precision and recall. This can be useful to report alongside your F-score.

svm_confusion_matrix$byClass
##          Sensitivity          Specificity       Pos Pred Value       Neg Pred Value            Precision               Recall                   F1 
##           0.05882353           0.90476190           0.33333333           0.54285714           0.33333333           0.05882353           0.10000000 
##           Prevalence       Detection Rate Detection Prevalence    Balanced Accuracy 
##           0.44736842           0.02631579           0.07894737           0.48179272

Okay, now that we have compared the algorithms we trained, let’s walk through applying a fitted model to the whole dataset (tweet_data). For illustration, we will use the SVM model.

24.10.2 Applying the Model

For our application, we need to make sure that the full data is structured in the same way as the labeled data. Therefore, we’ll select the variables of particular interest (the message id, or id, and the profile description, or description).

tweet_data <- select(tweet_data, id, description)

Next, let’s apply the same wrangling steps as we did to the labeled data: we will unnest the tokens, remove the stop words, count the frequency of words by tweets, calculate the tf_idf for each word, and then construct a document-term matrix.

full_data <- unnest_tokens(tweet_data, word, description) %>%
  anti_join(stop_words, by = "word") %>%
  dplyr::count(id, word) %>%
  tidytext::bind_tf_idf(word, id, n) %>%  
  tidytext::cast_dtm(id, word, tf_idf)

full_data2 <- full_data %>% as.matrix() %>% as.data.frame()

Next, let’s apply the svm_model to this full dataset.

fulldata_predict <- predict(svm_model, newdata = full_data2)

Importantly, our wrangled full data may not contain every row of the original full data, because some observations are dropped if they contain no usable information (for example, observations consisting only of a URL are removed from the document-term matrix because they have no words left after we stripped URLs at the beginning of this tutorial). For this reason, we may want to left_join() the old data to just the tweets that are in our document-term matrix.

final_data <- tibble(id = as.numeric(dimnames(full_data)[[1]])) %>%
  left_join(tweet_data[!duplicated(tweet_data$id), ], by = "id")

And finally, let’s bind the predicted conservative values to our slimmed dataset (the one without the url-only tweets).

final_data$conservative <- fulldata_predict

If you View(final_data), you’ll notice that the conservative text classifier is not particularly great–there are lots of tweets that are conservative but that the model did not code as conservative. I highlight this to emphasize that supervised machine learning models are never fully accurate. It is very hard to produce a model with a F-score of .9 or higher, and a model with an F-score of 1 (complete accuracy) would likely run into issues when applied to other data (in other words, a model with an F-score of 1 is likely to be overfit to the dataset).

I also highlight this because supervised machine learning is as much an art as it is a science. The results you get can depend on a variety of factors, including the algorithm you ultimately choose as well as the hyperparameters of that model. Importantly, as no two datasets are the same, it is difficult to know in advance which algorithm or hyperparameters will be best for your text classification task. This is why tuning, training, and testing the algorithm iteratively is crucial. For an algorithm you submit in a paper, you will have to think carefully about how the balance of the labeled dataset, the algorithm you have selected, and the hyperparameters you have tuned impact the results of your analysis (thankfully, if you set a seed, the analysis can still be replicated by another scholar, though its application to a new dataset may be questionable depending on the accuracy of your model).

24.10.3 Bonus Readings

Want more readings on supervised machine learning? Check out these tutorials:

Recall and precision in caret: https://rdrr.io/cran/caret/man/recall.html
Duke University Comp Sci ppt on Random Forest: http://db.cs.duke.edu/courses/compsci371d/current/scribbles/L08_RandomForests.pdf
A tutorial on Machine Learning; the supervised ML section uses caret: https://lgatto.github.io/IntroMachineLearningWithR/supervised-learning.html
R-bloggers tutorial: https://www.r-bloggers.com/2020/06/advanced-modelling-in-r-with-caret-a-focus-on-supervised-machine-learning/
ZevRoss tutorial: http://zevross.com/blog/2017/09/19/predictive-modeling-and-machine-learning-in-r-with-the-caret-package/

Honestly, caret is one of the most popular packages in R overall, so you will find many, many tutorials going over the plethora of algorithms available in caret if you look up the package name in a search engine.

24.11 The End

This is the last tutorial of the class. You’ve made it! Congratulations! Thanks for sticking it through the end, and best of luck with all your future programming lessons and endeavors. Remember, learning how to code is a slow and steady process. Yes, it is a steep learning curve, but learning these skills now will pay off immensely in the long-term!