14 Tutorial 14: Validating automated content analyses

After working through Tutorial 14, you’ll…

  • know how to validate the results of your automated content analysis.

14.1 Validating an automated content analysis

If you remember anything from the seminar, it should be the following:

Do not blindly trust the results of any automated content analysis!

It is often unclear to what extent automated content analysis can measure the latent (theoretical) concepts you are interested in.

Therefore, Grimmer and Stewart (2013, p. 271) recommend: “Validate, Validate, Validate. […] What should be avoided, […] is the blind use of any method without a validation step.”

In particular, ask yourself the following questions when evaluating results:

  • To what extent can and should I even measure theoretical constructs of interest to my study in an automated fashion?
  • How much does the automated analysis overlap with manual coding of the same variable(s), and where do differences from a manual “gold standard” emerge?
  • How can these differences be explained, i.e., to what extent, if any, do manual and automated coding measure different things and why?

Grimmer and Stewart discuss several ways of validating one’s results. Here, we’ll only cover one type of validation:

Comparing automated results to a manual “gold standard.”

This type of validation compares automated and manual coding of the same variables for the same texts. Oftentimes, manual codings are referred to as the “gold standard”. This implies that manual analyses are able to capture the “true” value of latent variables in texts.

The extent to which manual coding (or any form of coding) is able to do so can of course be questioned, as di Maggio (2013, p. 3f.) and Song et al. (2020, p. 553ff.) critically summarize. As a simple example, manual coders often differ in how they code texts; moreover, from an epistemological perspective, it is debatable to what extent “true” values of variables can be measured at all.

Still, validating your results through comparison to a manually coded gold standard may show you to what extent automated and manual coding differ - and why this might be the case. Accordingly, validation at the very least allows you to better understand which (theoretical) constructs you (can) measure with your automated analysis.

Today, we will learn how to validate results with the caret package, which allows you to easily compute common metrics such as Precision, Recall, and the F1 value (we’ve already discussed these in session 7 of the seminar!).

There are other really helpful packages, such as the oolong package; however, for simplicity, we will focus on validation with caret in this seminar.
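If you have not installed the packages used in this tutorial yet, you can install them once from CRAN before loading them (a minimal sketch, assuming a standard CRAN setup):

#One-time installation from CRAN (only needed if the packages are missing)
install.packages("quanteda")  #text processing and dictionary analysis
install.packages("caret")     #validation metrics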

Let’s use the same data as in the previous tutorials. You can find the corresponding R file in OLAT (via: Materials / Data for R) with the name immigration_news.rda.

Source of the data set: Nulty, P. & Poletti, M. (2014). “The Immigration Issue in the UK in the 2014 EU Elections: Text Mining the Public Debate.” Presentation at LSE Text Mining Conference 2014. Accessed via the quanteda corpus package.

load("immigration_news.rda")

Suppose you used R to automatically classify the negativity of texts. As an example, we now use a very simple (and probably not very valid) approach to classify texts as having negative sentiment (1) or not (0).

Relying on the off-the-shelf Lexicoder Sentiment Dictionary (LSD), we automatically classify all articles as negative that, in relative terms, contain more negative than positive features:

library("quanteda")
automated <- data$text %>% 
  tokens() %>% 
  dfm() %>% 
  dfm_weight("prop") %>%
  dfm_lookup(dictionary = data_dictionary_LSD2015[1:2]) %>% 
  convert(to = "data.frame")
#We calculate the difference between the share of positive and negative features
automated$relative <- automated$positive - automated$negative

#Texts with more negative than positive features are classified as having negative sentiment (=1)
automated$negativity[automated$relative<0] <- 1

#Texts with more positive than negative features (or an equal share of negative and positive features)
#are classified as not having negative sentiment (=0)
automated$negativity[automated$relative>=0] <- 0

#How often do we have negative or positive sentiment?
table(automated$negativity)
## 
##    0    1 
## 1402 1431
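
As an aside, the two assignment steps above can also be written as a single recoding with ifelse(); this is merely an equivalent shortcut, not a different method:

#Equivalent one-line recoding: 1 = negative sentiment, 0 = no negative sentiment
automated$negativity <- ifelse(automated$relative < 0, 1, 0)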

We have now assigned a dichotomous value to each text based on our automated content analysis.

Each text has been classified as having…

  • either a negative sentiment (1)
  • or not having a negative sentiment (0).

Next, we want to validate our results by measuring to what extent the automated measurement of negativity matches a manual coding of the same variable, i.e., a manual “gold standard”.

To do this, we first draw a random sample of texts from our corpus that will be coded manually.

Song et al. (2020, p. 564) recommend that a sample used to create a “gold standard” consist of more than 1,300 documents. In addition, manual coding should be performed by more than one coder, and intercoder reliability should amount to at least .7 - obviously, this is not possible in the context of this seminar.

For practical reasons, we will therefore limit ourselves to a smaller gold standard - here, as an example, 30 articles - coded by one coder only.

First, we draw a random sample for manual validation. With the function sample(), we randomly select 30 texts from the corpus.

texts <- sample(1:nrow(data),30)
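
Note that sample() draws a different set of texts every time you run it. If you want your validation sample to be reproducible, you can set a seed before sampling (the seed value 42 below is arbitrary):

#Fix the random number generator so that the same 30 texts are drawn each time
set.seed(42)
texts <- sample(1:nrow(data), 30)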

We now create a new data frame called validation_set, which serves as the basis for manual validation.

It contains the 30 randomly selected texts from our corpus as well as the following information:

  • The variable “ID” contains the IDs of the texts to be manually coded.
  • The variable “text” contains the texts to be coded manually.
  • The variable “manual_coding” contains empty cells where we can enter our coding.
validation_set <- data.frame("ID" = data$doc_id[texts], 
                             "text" = data$text[texts], 
                             "manual_coding" = rep(NA,30))

We now save this data frame as a (still empty) CSV file in our working directory so that we can insert our manual codings there. If you prefer doing the coding in R, you could obviously do that as well.

write.csv2(validation_set, "validation_dictionary.csv")

The next step would now be the actual coding of texts outside of R.

Here, you would have to read each text and then enter your manual coding in the respective row of the manual_coding column:

  • code a 1 if the text contains “negative sentiment”
  • code a 0 if the text contains “no negative sentiment”

For doing so, as with any good manual content analysis, you should follow a clear codebook with coding instructions.

I have very roughly coded such a manual gold standard as an example. You will find the corresponding CSV file with potential codings for each of the 30 texts in OLAT (via: Materials / Data for R) with the name validation_example.csv.

validation_set <- read.csv2("validation_example.csv")
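
Before merging, it can be worth a quick sanity check that the manual codings were read in as expected - for instance, by tabulating the manual_coding column (shown here only as a suggestion; the output depends on your codings):

#How many texts did we manually code as negative (1) or not negative (0)?
table(validation_set$manual_coding)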

After reading in the manual gold standard, we merge the results of our automated and our manual analysis via the IDs of the respective texts, using the merge() command.

By doing so, we create an object called confusion - named this way because it serves as the basis for a confusion matrix, similar to the one I showed you when discussing the validation of results in session 7.

confusion <- merge(automated[,c("doc_id", "negativity")],
                   validation_set[,c("ID", "manual_coding")],
                   by.x="doc_id", by.y="ID")
colnames(confusion) <- c("ID", "automated", "manual")
confusion
##          ID automated manual
## 1  text1154         1      1
## 2  text1234         0      0
## 3  text1291         0      0
## 4  text1296         0      0
## 5  text1306         1      1
## 6  text1307         0      0
## 7  text1523         0      0
## 8  text1561         0      1
## 9  text1630         0      0
## 10 text1937         0      0
## 11 text1942         1      0
## 12 text1956         1      0
## 13 text1999         1      0
## 14 text2028         0      0
## 15 text2062         0      1
## 16 text2124         1      1
## 17 text2170         1      0
## 18  text218         1      1
## 19  text232         0      0
## 20  text234         1      1
## 21 text2604         1      1
## 22 text2691         1      1
## 23  text497         0      0
## 24  text508         1      0
## 25  text549         1      1
## 26  text610         0      0
## 27  text621         1      0
## 28  text630         0      0
## 29  text648         0      0
## 30  text701         0      0

In addition to the ID of the respective text, the object confusion contains the automated classification of the text (column automated) and the manual coding (column manual).

One thing becomes obvious right away: Automated and manual coding do overlap or agree to some extent - but not in all cases.

How can we now obtain a metric indicating how much manual and automated content analysis overlap, i.e., how good our automated content analysis is?

To assess this, we rely on Precision, Recall, and the F1 value as common criteria for evaluating the quality of information retrieval. Again, you have already learned what these mean in session number 7. These metrics are used in many studies - see for example Nelson et al. (2018) or, from my own research, Hase et al. (2021).

Precision:

The precision value indicates how well your automated analysis does at classifying only those items as “negative” that actually contain negative sentiment according to the manual gold standard.

\(Precision = \frac{True Positives}{True Positives + False Positives}\)

The metric ranges from a minimum of 0 to a maximum of 1. It shows how well your method does at not falsely classifying too many items as “negative” (compared to the manual gold standard), i.e., at not generating too many “false positives”. The closer the value is to 1, the better your analysis.

Recall:

The recall value indicates how well your automated analysis does at classifying all items as negative that actually contain negative sentiment according to the manual gold standard.

\(Recall = \frac{True Positives}{True Positives + False Negatives}\)

This metric similarly ranges from a minimum of 0 to a maximum of 1. It shows how well your method does at not falsely classifying too many items as “not negative” (compared to the manual gold standard), i.e., at not generating too many “false negatives”. The closer the value is to 1, the better your analysis.

F1 value:

The F1 value is the harmonic mean of precision and recall. Researchers often rely on the F1 value if they need a single metric that incorporates both precision and recall.

\(F_{1} = 2 * \frac{Precision * Recall}{Precision + Recall}\)
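
To make these formulas concrete before handing the work over to caret, here is a minimal sketch of how you could compute all three metrics by hand from two vectors of codings (the helper function manual_metrics() is made up for illustration):

#Illustrative helper: compute precision, recall, and F1 from two coding vectors
manual_metrics <- function(automated, manual, positive = 1) {
  tp <- sum(automated == positive & manual == positive)  #true positives
  fp <- sum(automated == positive & manual != positive)  #false positives
  fn <- sum(automated != positive & manual == positive)  #false negatives
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- 2 * precision * recall / (precision + recall)
  c(Precision = precision, Recall = recall, F1 = f1)
}

Applied to our confusion object from above, manual_metrics(confusion$automated, confusion$manual) should return the same values that caret reports below.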

To get information about the precision, recall, and F1 value of our analysis, we use the confusionMatrix() command included in the caret package.

Please make sure you have checked the following things before running the command:

  • All variables indicating the classification of either the manual or the automated coding must be in factor format.
  • The data argument tells the function where to find the automated coding.
  • The reference argument tells the function where to find the manual coding.
  • The mode argument tells the function that we want to assess our analysis based on metrics such as precision and recall (since many other metrics exist as well).
  • The positive argument tells the function which value of our classification denotes the occurrence of the variable we are interested in. In this case, we tell R that the occurrence of negative sentiment was coded with a 1.
#transform classifications to factor format
confusion$automated <- as.factor(confusion$automated)
confusion$manual <- as.factor(confusion$manual)

#calculate confusion matrix
library("caret")
result <- confusionMatrix(data = confusion$automated, 
                          reference=confusion$manual, 
                          mode = "prec_recall", 
                          positive = "1")

Let’s take a look at the confusion matrix first:

result$table
##           Reference
## Prediction  0  1
##          0 14  2
##          1  6  8

The confusion matrix shows you how many texts, according to the automated content analysis (“Prediction”), were coded with 0 (“No negative sentiment”) or 1 (“Negative sentiment”). Similarly, it shows you how many texts, according to the manual gold standard (“Reference”), were coded with 0 (“No negative sentiment”) or 1 (“Negative sentiment”).

Thus, the more texts were assigned the same value (0 or 1) by both the automated and the manual approach, the larger the agreement (and the better the validation metrics). In turn, the more texts were coded with a 0 by one of the codings (i.e., either manual or automated) but a 1 by the other, the worse the agreement between automated and manual analysis.

Here, we see that…

  • 8 texts were coded as having “negative sentiment” by both humans and the machine. Thus, the automated content analysis detected 8 true positives (i.e., correctly classified texts as having “negative sentiment”) out of the total 10 texts that should have been coded as having “negative sentiment” according to the manual gold standard (2 + 8 = 10).
  • 14 texts were coded as “no negative sentiment” both by humans and the machine. Thus, the computer detected 14 true negatives (i.e., correctly classified texts as having “no negative sentiment”) out of a total of 20 texts that would have to be coded as “no negative sentiment” according to the manual gold standard (14 + 6 = 20).
  • However, the automated analysis coded 6 texts as “negative sentiment” that, according to the manual gold standard, should have been coded as “no negative sentiment”. These are false positives, which worsen our precision.
  • In addition, the automated analysis coded 2 texts as “no negative sentiment” that, according to the manual gold standard, should have been classified as “negative sentiment”. These are false negatives, which worsen our recall.

Given these numbers, you could calculate the precision, recall, and F1 value of our analysis by hand:

\(Precision = \frac{True Positives}{True Positives + False Positives} = \frac{8}{8+6}= .57\)

\(Recall = \frac{True Positives}{True Positives + False Negatives} = \frac{8}{8+2}= .8\)

\(F_{1} = 2 * \frac{Precision * Recall}{Precision + Recall}= 2 * \frac{.57 * .8}{.57 + .8}= .66\)

However, that is what we have R for: we can retrieve these values, which are already stored in the result object:

result$byClass[5:7]
## Precision    Recall        F1 
## 0.5714286 0.8000000 0.6666667
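
If you are curious about the other metrics caret computes, the result object stores them as well - for example, the full set of class-level metrics and overall measures such as accuracy and Cohen’s kappa:

#All class-level metrics (sensitivity, specificity, precision, recall, F1, ...)
result$byClass

#Overall metrics such as accuracy and Cohen's kappa
result$overall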

There is no general threshold for what constitutes a “good” automated content analysis in terms of precision, recall, and the F1 value. I would recommend roughly following the guidelines for what constitutes “good” intercoder reliability - i.e., assuming that a value of about .8 for precision, recall, or the F1 value indicates that the automated analysis provides valid results.

Overall, based on these results, we would conclude that our dictionary does well at detecting as many negative texts as possible, since recall amounts to .8. However, it clearly misclassifies non-negative texts as negative too often, since our precision amounts to only .57.

This could indicate, for example, that our dictionary is too “broad,” i.e., contains too many words that aren’t good manifest indicators for measuring latent sentiment.
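
A practical follow-up - sketched here under the assumption that you want to inspect misclassifications qualitatively - is to pull out the texts the dictionary coded as negative but the gold standard did not, and read them to see which dictionary words triggered the classification:

#IDs of false positives: automated coding = 1, manual coding = 0
false_positive_ids <- confusion$ID[confusion$automated == "1" & confusion$manual == "0"]

#Read the corresponding texts to understand why the dictionary misfired
data$text[data$doc_id %in% false_positive_ids]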

14.2 Take Aways

Vocabulary:

  • Precision, Recall, F1 value: Metrics used to evaluate the quality of automated content analysis (or information retrieval more generally):
  • Precision: To what extent does the automated classification cover only “true positives”, i.e., only relevant cases?
  • Recall: To what extent does the automated classification include all “true positives”, i.e., all relevant cases?
  • F1 value: The harmonic mean of precision and recall, combining both in a single metric.

Commands:

  • Validation: confusionMatrix()

That’s it - you’ve worked through all tutorials for this seminar. Congrats and good luck with the final paper!