Chapter 28 Model Fit

Now that we have constructed four different models, let’s compare them against one another using their F-scores.

knn_confusion_matrix$byClass[7] #knn
##        F1 
## 0.2857143
svm_confusion_matrix$byClass[7] #svm
##        F1 
## 0.5454545
dt_confusion_matrix$byClass[7] #decision tree
## F1 
## NA
rf_confusion_matrix$byClass[7] #random forest
##        F1 
## 0.1052632
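
If you prefer to see the four scores side by side, you can collect them into a single named vector (a minimal sketch, assuming the four confusion-matrix objects created above are still in your environment):

f_scores <- c(knn = unname(knn_confusion_matrix$byClass["F1"]),
              svm = unname(svm_confusion_matrix$byClass["F1"]),
              dt  = unname(dt_confusion_matrix$byClass["F1"]),
              rf  = unname(rf_confusion_matrix$byClass["F1"]))
sort(f_scores, decreasing = TRUE) #best-performing model first; the decision tree's NA is dropped by sort()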

By comparing the F-scores, we can compare the quality of each model. As you can see, the SVM appears to produce the best results. But that doesn’t mean the algorithm necessarily works well, as the full confusion matrix shows.

svm_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  9  7
##          1  8 14
##                                           
##                Accuracy : 0.6053          
##                  95% CI : (0.4339, 0.7596)
##     No Information Rate : 0.5526          
##     P-Value [Acc > NIR] : 0.3142          
##                                           
##                   Kappa : 0.1972          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##               Precision : 0.5625          
##                  Recall : 0.5294          
##                      F1 : 0.5455          
##              Prevalence : 0.4474          
##          Detection Rate : 0.2368          
##    Detection Prevalence : 0.4211          
##       Balanced Accuracy : 0.5980          
##                                           
##        'Positive' Class : 0               
## 

Even with the highest F-score of the four models, the SVM is only about 61% accurate, and that is not significantly better than the No Information Rate of 55% (the accuracy you would get by always guessing the more common class), as the p-value of 0.31 indicates.

For publication, it would be ideal to produce a supervised machine learning classifier with an F-score of at least .8 (you’ll see published papers with F-scores between .6 and .8, but these are not optimal, especially in an increasingly competitive publication environment).

Let’s talk about F-scores a little more, since they’re so important for evaluating supervised machine learning algorithms.

28.1 F-scores: Precision and Recall

F-scores are based on two other metrics: precision and recall. Precision is calculated by dividing the number of correctly identified positive elements (in our case, when conservative == 1) by the number of all elements the model coded as positive. When your model has a lot of false positives, precision is penalized. Recall is calculated by dividing the number of correctly identified positive elements by the number of elements that should have been coded as positive (the elements that are actually positive). When your model has a lot of false negatives, recall is penalized.

The F-score (or “F1 score”) is the harmonic mean of precision and recall.
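
To make these definitions concrete, here is a sketch of how you could compute precision, recall, and the F-score by hand, using the counts from the SVM confusion matrix above (positive class 0):

tp <- 9 #predicted 0, actually 0 (true positives)
fp <- 7 #predicted 0, actually 1 (false positives)
fn <- 8 #predicted 1, actually 0 (false negatives)
precision <- tp / (tp + fp) #0.5625
recall <- tp / (tp + fn) #0.5294
2 * (precision * recall) / (precision + recall) #0.5455, the F-score (harmonic mean)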

You may have noticed that the confusionMatrix() function returns both the precision and recall. This can be useful to report alongside your F-score.

svm_confusion_matrix$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.5294118            0.6666667            0.5625000 
##       Neg Pred Value            Precision               Recall 
##            0.6363636            0.5625000            0.5294118 
##                   F1           Prevalence       Detection Rate 
##            0.5454545            0.4473684            0.2368421 
## Detection Prevalence    Balanced Accuracy 
##            0.4210526            0.5980392

Okay, now that we know which of the models we tested performs best, let’s apply that model to the whole dataset (tweet_data).

28.2 Applying the Model

For our application, we need to make sure that the full data is structured in the same way as the labeled data. Therefore, we’ll select the variables of particular interest (the message id, or id, and the profile description, or description).

tweet_data <- select(tweet_data, id, description)

Next, let’s apply the same wrangling steps as we did to the labeled data: we will unnest the tokens, remove the stop words, count the frequency of words by tweet, calculate the tf-idf for each word, and then construct a document-term matrix.

full_data <- unnest_tokens(tweet_data, word, description) %>%
  anti_join(stop_words, by = "word") %>%
  count(id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, id, n) %>%  
  tidytext::cast_dtm(id, word, tf_idf)

full_data2 <- full_data %>% as.matrix() %>% as.data.frame()

Next, let’s apply the svm_model to this full dataset.

fulldata_predict <- predict(svm_model, newdata = full_data2)

Importantly, our wrangled full data may not contain the same observations as the original full data, because observations with no usable information were dropped (for example, since we removed URLs at the beginning of this tutorial, a description containing only a URL has no words left and therefore does not appear in the document-term matrix). For this reason, we may want to left_join() the original data to just the tweets that are in our document-term matrix.

final_data <- tibble(id = as.numeric(dimnames(full_data)[[1]])) %>%
  left_join(tweet_data[!duplicated(tweet_data$id), ], by = "id")
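
As a quick sanity check (a sketch, assuming the objects created above), you can confirm that the number of rows in final_data matches the number of documents that survived into the document-term matrix:

nrow(full_data2) #documents in the document-term matrix
nrow(final_data) #should be the same number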

And finally, let’s bind the predicted conservative values to our slimmed dataset (the one without the URL-only tweets).

final_data$conservative <- fulldata_predict
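
Before looking at individual tweets, it can also help to see how the predicted labels are distributed (a quick base R sketch):

table(final_data$conservative) #how many profiles the model classified as 0 vs. 1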

If you View(final_data), you’ll notice that the conservative text classifier is not particularly great: there are lots of tweets that are conservative but that the model did not code as conservative. I highlight this to emphasize that supervised machine learning models are never fully accurate. It is very hard to produce a model with an F-score of .9 or higher, and a model with an F-score of 1 (complete accuracy) would likely run into issues when applied to other data (in other words, a model with an F-score of 1 is likely to be overfit to the dataset).

I also highlight this because supervised machine learning is as much an art as it is a science. The results you get can depend on a variety of factors, including the algorithm you ultimately choose and the hyperparameters of that model. Importantly, as no two datasets are the same, it is difficult to know in advance which algorithm or hyperparameters will work best for your text classification task. This is why tuning, training, and testing the algorithm iteratively is crucial. For an algorithm you submit in a paper, you will have to think carefully about how the balance of the labeled dataset, the algorithm you have selected, and the hyperparameters you have tuned affect the results of your analysis (thankfully, if you set a seed, the analysis can still be replicated by another scholar, though its application to a new dataset may be questionable depending on the accuracy of your model).

28.3 Bonus Readings

Want more readings on supervised machine learning? Check out these tutorials:

Recall and precision in caret: https://rdrr.io/cran/caret/man/recall.html
Duke University Comp Sci slides on Random Forest: http://db.cs.duke.edu/courses/compsci371d/current/scribbles/L08_RandomForests.pdf
A tutorial on Machine Learning. The supervised ml section uses caret: https://lgatto.github.io/IntroMachineLearningWithR/supervised-learning.html
R-blogger tutorial: https://www.r-bloggers.com/2020/06/advanced-modelling-in-r-with-caret-a-focus-on-supervised-machine-learning/
ZevRoss tutorial: http://zevross.com/blog/2017/09/19/predictive-modeling-and-machine-learning-in-r-with-the-caret-package/

Honestly, caret is one of the most popular packages in R overall, so you will find many, many tutorials going over the plethora of algorithms available in caret if you look up the package name in a search engine.