4 Tutorial 4: Validating Automated Content Analysis

After working through Tutorial 4, you’ll…

  • understand the meaning of evaluation metrics such as recall, precision, and F1.
  • know how to validate the results of your automated content analysis.

4.1 Validating an automated content analysis

If you remember anything from this seminar, it should be the following:

Do not blindly trust the results of any automated content analysis!

It is often unclear to what extent automated content analysis can measure the latent (theoretical) concepts you are interested in.

Therefore, Grimmer and Stewart (2013, p. 271) recommend: “Validate, Validate, Validate. […] What should be avoided, […] is the blind use of any method without a validation step.”

In particular, ask yourself the following questions when evaluating results:

  • To what extent can and should I even measure theoretical constructs of interest to my study in an automated fashion?
  • How much does the automated analysis overlap with manual coding of the same variable(s) - and where do differences to a manual “gold standard” emerge?
  • How can these differences be explained, i.e., to what extent, if any, do manual and automated coding measure different things and why?

Grimmer and Stewart explain different ways of validating one’s results. Here, we’ll only discuss one such type of validation:

Comparing automated results to a manual “gold standard.”

This type of validation compares automated and manual coding of the same variables for the same texts. Oftentimes, manual codings are referred to as the “gold standard”. This implies that manual analyses are able to capture the “true” value of latent variables in texts.

The extent to which manual coding (or any form of coding) is able to do so can of course also be questioned, as di Maggio (2013, p. 3f.) and Song et al. (2020, p. 553ff.) critically summarize. As a simple example, manual coders often differ in how they code texts; moreover, from an epistemological perspective, it is debatable to what extent “true” values of latent constructs can be identified at all.

Still, validating your results through comparison to a manually coded gold standard may show you to what extent automated and manual coding differ - and why this might be the case. Accordingly, validation at the very least allows you to better understand which (theoretical) constructs you (can) measure with your automated analysis.

Today, we will learn how to validate results with the caret package, which allows you to easily compute common metrics such as Precision, Recall, and the F1 value (for a discussion of what these metrics mean, please see the slides from the session on June 20th).

There are other really helpful packages such as the oolong package - however, for simplicity, we will focus on validation with caret in this seminar.
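
If you have not worked with these packages before, you may need to install caret (and, if you want to explore it, oolong) once before loading them:

#install the packages once (only needed the first time)
install.packages("caret")
install.packages("oolong")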

Let’s use the same data and preprocessing pipeline as in the previous tutorials.

We add one additional step here: Each search query gets its own unique ID so we can later match automated and manual classifications. Note that the existing variable external_submission_id denotes unique IDs per person, not per search query - which is why we create a new variable called query_id.

library("dplyr")
library("stringr")
library("quanteda")
data <- read.csv2("sample_youtube.csv")
#preprocessing pipeline
data <- data %>% 
  
  #removing URL-related terms
  mutate(search_query = gsub("https://www.youtube.com/results?search_query=",
                             "",
                             search_query,
                             fixed = TRUE),
         search_query = gsub("+",
                             " ",
                             search_query,
                             fixed = TRUE)) %>%
  
  #removing encoding issues
  mutate(
         #Correct encoding for German "Umlaute"
         search_query = gsub("%C3%B6", "ö", search_query),
         search_query = gsub("%C3%A4", "ä", search_query),
         search_query = gsub("%C3%BC", "ü", search_query),
         search_query = gsub("%C3%9", "Ü", search_query),
         
         #Correct encoding for special signs
         search_query = gsub("%C3%9F", "ß", search_query),
         
         #Correct encoding for punctuation
         search_query = gsub("%0A", " ", search_query),
         search_query = gsub("%22", '"', search_query),
         search_query = gsub("%23", "#", search_query),
         search_query = gsub("%26", "&", search_query),
         search_query = gsub("%27|%E2%80%98|%E2%80%99|%E2%80%93|%C2%B4", "'", search_query),
         search_query = gsub("%2B", "+", search_query),
         search_query = gsub("%3D", "=", search_query),
         search_query = gsub("%3F", "?", search_query),
         search_query = gsub("%40", "@", search_query),

         #Correct encoding for letters from other languages
         search_query = gsub("%C3%A7", "ç", search_query),
         search_query = gsub("%C3%A9", "é", search_query),
         search_query = gsub("%C3%B1", "ñ", search_query),
         search_query = gsub("%C3%A5", "å", search_query),
         search_query = gsub("%C3%B8", "ø", search_query),
         search_query = gsub("%C3%BA", "ú", search_query),
         search_query = gsub("%C3%AE", "î", search_query)) %>%
  
  mutate(
         #transform queries to lower case
         search_query = char_tolower(search_query),
         
         #create unique ID per search query
         query_id = paste0("ID", 1:nrow(data)))

#Create a document-feature-matrix
dfm <- data$search_query %>%
  tokens() %>%
  dfm()
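
To get a quick sense of the resulting document-feature matrix, you could, for instance, check how many documents it contains and which features occur most often (a small optional check using quanteda’s ndoc() and topfeatures() functions):

#optional: inspect the document-feature matrix
ndoc(dfm)
topfeatures(dfm, 10)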

Suppose you used R to automatically classify search queries as news-related or not news-related. As an example, we will use a very simple (and probably not very valid) dictionary called dict_news to classify queries as being news-related (1) or not (0).

We automatically classified all queries as news-related that include the terms “news”, “nachrichten”, “doku”, “interview”, or “information”:

#Create dictionary
dict_news <- dictionary(list(news = c("news",
                                      "nachrichten",
                                      "doku",
                                      "interview",
                                      "information")))

#Do dictionary analysis
automated <- dfm %>% 
  
  #look up dictionary
  dfm_lookup(dictionary = dict_news) %>%
  
  #convert to data frame
  convert(., to = "data.frame") %>%
  
 mutate( #transform dictionary count to binary classification of news-related (1) or not (0)
         news = replace(news,
                        news > 0,
                        1),
         
         #add unique search query IDs for matching (done later)
         doc_id = gsub("text", "ID", doc_id))

We have now assigned a dichotomous value to each text based on our automated content analysis.

Please note that we also removed the “text”-string element and replaced it with “ID” to recreate our original ID variables and to later match automated classifications to manual codings via their unique IDs.
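
To double-check this step, you could inspect the first rows of the automated classifications and compare the ID format to the query_id variable we created earlier:

#optional check: inspect automated classifications and ID format
head(automated)
head(data$query_id)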

Each text has been classified as…

  • being news-related (1)
  • not being news-related (0).

According to this dictionary, how many queries are news-related?

#Count news-related queries
automated %>%
  count(news)
##   news    n
## 1    0 5990
## 2    1   42

In total, 42 of our 6,032 search queries (0.7%) were classified as news-related by our automated content analysis.
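
If you want R to compute this share for you, a small sketch using dplyr would be:

#share of news-related vs. not news-related queries
automated %>%
  count(news) %>%
  mutate(share = n / sum(n))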

Next, we want to validate our results by measuring to what extent the automated measurement of news-relatedness matches a manual coding of the same variable, i.e., a manual “gold standard”.

To do this, we first draw a random sample of texts from our corpus that will be coded manually.3

Song et al. (2020, p. 564) recommend that a sample used to create a “gold standard” consist of more than 1,300 documents. In addition, manual coding should be performed by more than one coder, and intercoder reliability should amount to at least .7. Obviously, coding that many documents is not possible in the context of this seminar.

For practical reasons, we will therefore limit ourselves to a smaller gold standard - here, as an example, 30 queries - and to coding by a single coder.

First, we draw a random sample for manual validation. With the function sample(), we randomly select 30 queries from the corpus.

sample_manual <- sample(1:nrow(data),30)
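
Note that sample() draws a different set of queries every time you run it. If you want your validation sample to be reproducible, you could set a seed first (the value 123 is arbitrary):

#make the random draw reproducible
set.seed(123)
sample_manual <- sample(1:nrow(data), 30)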

We now create a new data frame called validation_set, which serves as the basis for manual validation.

It contains 30 randomly selected texts from our corpus (i.e., sample_manual) as well as the following information:

  • The variable “ID” contains the IDs of the queries to be coded manually.
  • The variable “text” contains the queries to be coded manually.
  • The variable “manual_coding” contains empty cells where we can enter our manual codings.
validation_set <- data.frame("ID" = data$query_id[sample_manual], 
                             "text" = data$search_query[sample_manual], 
                             "manual_coding" = rep(NA,30))

We now save this (still empty) data set as a CSV file in our working directory so that we can enter our manual codings via Excel. If you prefer doing the coding in R, you could do that as well.

write.csv2(validation_set, "validation_dictionary.csv", row.names = FALSE)
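
If you would rather enter the codings in R than in Excel, one option is the data editor built into base R - a quick sketch (depending on your setup, Excel may still be more convenient):

#open a spreadsheet-like editor and type the codings into the manual_coding column
validation_set <- edit(validation_set)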

The next step would now be the actual coding of texts outside of R.

Here, you would have to read each text and then enter your manual coding in the respective row of the manual_coding column:

  • code a 1 if the query is news-related
  • code a 0 if the query is not news-related

For doing so, as with any good manual content analysis, you should follow a clear codebook with coding instructions.

I have very roughly coded such a manual gold standard as an example. You will find the corresponding file with potential codings for each of the 30 texts on Moodle in the folder “Data for R” (“validation_example.csv”).

validation_set <- read.csv2("validation_example.csv")

After reading in the manual gold standard, we merge the result of our automated and of our manual analysis by using the ID of the respective texts and the merge() command.

By doing so, we create an object called confusion - which is called this way because it is basically a confusion matrix (it will become clear what I mean by that in a second).

confusion <- merge(automated[,c("doc_id", "news")],
                   validation_set[,c("ID", "manual_coding")],
                   by.x="doc_id", by.y="ID")
colnames(confusion) <- c("ID", "automated", "manual")
head(confusion)
##       ID automated manual
## 1 ID1046         0      1
## 2 ID1069         0      0
## 3 ID1075         0      0
## 4 ID1131         0      0
## 5 ID1449         0      0
## 6 ID1495         0      0
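
If you prefer to stay within dplyr, a join should yield an equivalent object (the row order may differ from merge(), which does not matter for the validation):

#equivalent matching via an inner join
confusion <- automated %>%
  select(doc_id, news) %>%
  inner_join(validation_set, by = c("doc_id" = "ID")) %>%
  select(ID = doc_id, automated = news, manual = manual_coding)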

Next to the ID of the respective text, the object confusion contains the automated classification of the text, here in the column automated, and the manual coding, here in the column manual.

One thing becomes obvious right away: Automated and manual coding do overlap or agree to some extent - but not in all cases.

How can we now obtain a metric indicating how much manual and automated content analysis overlap, i.e., how good our automated content analysis is?

To assess this, we rely on Precision, Recall, and the F1 value as common criteria for evaluating the quality of information retrieval. These metrics are used in many studies - see for example Nelson et al. (2018) or, from my own research, Hase et al. (2021).

Precision:

The precision value indicates how well your automated analysis does at classifying only items as “news-related” that actually are news-related according to the manual gold standard.

\(\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\)

The metric ranges from a minimum of 0 to a maximum of 1. It shows how well your method does in terms of not falsely classifying too many items as “news-related” (compared to the manual gold standard), i.e., at not generating too many “false positives”. The closer the value is to 1, the better your analysis.

Recall:

The recall value indicates how well your automated analysis does at classifying all items as news-related that actually are news-related according to the manual gold standard.

\(\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\)

This metric similarly ranges from a minimum of 0 to a maximum of 1. It shows how well your method does at not falsely classifying too many items as “not news-related” (compared to the manual gold standard), i.e., at not generating too many “false negatives”. The closer the value to 1, the better your analysis.

F1 value:

The F1 value is the harmonic mean of the precision and the recall metric. Researchers often rely on the F1 value if they need a single metric that incorporates both precision and recall.

\(F_{1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)
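
To get a feel for how the harmonic mean behaves, consider a purely hypothetical example: an analysis with high precision but low recall. The F1 value stays much closer to the lower of the two values than a simple average would:

#purely hypothetical values for illustration
precision <- 0.9
recall <- 0.3

#harmonic mean (F1) vs. arithmetic mean
2 * (precision * recall) / (precision + recall) #0.45
(precision + recall) / 2                        #0.6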

To get information about the precision, recall, and F1 value of our analysis, we use the confusionMatrix() command included in the caret package.

Please make sure you have checked the following things before running the command:

  • All variables indicating the classification of either the manual or the automated coding must be in factor format.
  • The data argument tells the function where to find the automated coding.
  • The reference argument tells the function where to find the manual coding.
  • The mode argument tells the function that we want to assess our analysis based on metrics such as precision, recall, etc. (since many other metrics exist as well).
  • The positive argument tells the function which value of our classification denotes the occurrence of the variable we are interested in. In this case, we tell R that the occurrence of “news-related” was coded with a 1.
#transform classifications to factor format
confusion <- confusion %>%
  mutate(automated = as.factor(automated),
         manual = as.factor(manual))

#calculate confusion matrix
library("caret")
result <- confusionMatrix(data = confusion$automated, 
                          reference = confusion$manual, 
                          mode = "prec_recall", 
                          positive = "1")

Let’s take a look at the confusion matrix first:

result$table
##           Reference
## Prediction  0  1
##          0 26  1
##          1  1  2

The confusion matrix shows you how many queries, according to the automated content analysis (“Prediction”), were coded with 0 (“not news-related”) or 1 (“news-related”). Similarly, it shows you how many queries, according to the manual gold standard (“Reference”), were coded with 0 (“not news-related”) or 1 (“news-related”).

Thus, the more texts were coded with a 0 or a 1 both by the automated and the manual approach, the larger the agreement (and the better the validation metrics). In turn, the more texts were coded with a 0 by one of the codings (i.e., either manual or automated) but a 1 by the other, the worse the agreement between automated and manual analysis.

Here, we see that…

  • 2 queries were coded as being “news-related” by both humans and the machine. Thus, the automated content analysis detected 2 true positives (i.e., correctly classified texts as “news-related”) out of the total of 3 queries that should have been coded as “news-related” according to the manual gold standard (2 + 1 = 3).
  • 26 queries were coded as “not news-related” both by humans and the machine. Thus, the computer detected 26 true negatives (i.e., correctly classified texts as “not news-related”) out of a total of 27 queries that should have been coded as “not news-related” according to the manual gold standard (26 + 1 = 27).
  • However, the automated analysis coded 1 query as “news-related” that should have been coded as “not news-related” according to the manual gold standard. This is a false positive, which worsens our precision.
  • In addition, the automated analysis coded 1 query as “not news-related” that, according to the manual gold standard, should have been classified as “news-related”. This is a false negative, which worsens our recall.

Having this data, you could calculate precision, recall and the F1 value of our analysis manually:

\(\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} = \frac{2}{2+1} = .66\)

\(\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} = \frac{2}{2+1} = .66\)

\(F_{1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \times \frac{.66 \times .66}{.66 + .66} = .66\)
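
As a small optional sketch, you could also let R extract these counts from the confusion matrix stored in result$table (rows are the automated prediction, columns the manual reference, as shown above) and do the arithmetic for you:

#extract the cells of the confusion matrix shown above
tp <- result$table["1", "1"] #true positives
fp <- result$table["1", "0"] #false positives
fn <- result$table["0", "1"] #false negatives

#compute the metrics from these counts
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * (precision * recall) / (precision + recall)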

However, we do not even need to do this arithmetic ourselves: these values were already computed and saved in the result object, so we can simply retrieve them:

result$byClass[5:7]
## Precision    Recall        F1 
## 0.6666667 0.6666667 0.6666667
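
The indices 5:7 simply pick the precision, recall, and F1 entries out of the longer result$byClass vector. If you find positional indexing hard to read, you can select the same metrics by name:

#select the metrics by name instead of position
result$byClass[c("Precision", "Recall", "F1")]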

There is no universal threshold for what constitutes a “good” automated content analysis in terms of precision, recall, and the F1 value. I would recommend roughly following the guidelines for what constitutes “good” intercoder reliability - i.e., assuming that a value of around .8 or higher for precision, recall, or the F1 value indicates that the automated analysis provides valid results.

Overall, based on these results, we would conclude that our dictionary does not work well at detecting news-related (and not news-related) queries, since both precision and recall are below .8.

4.2 Take Aways

Vocabulary:

  • Precision, Recall, F1 value: Metrics used to evaluate the quality of automated content analysis (or information retrieval more generally):
  • Precision: To what extent does the automated classification cover only “true positives”, i.e., only relevant cases?
  • Recall: To what extent does the automated classification include all “true positives”, i.e., all relevant cases?

Commands:

  • Validation: confusionMatrix()

Let’s keep going: with Tutorial 5: Matching survey data & data donations


  1. There are cases where we may expect imbalance in our sample and may not want to rely on a random sample. For instance, if we expect many queries to not be news-related (0) and only few to be news-related (1), a random sample may not be the best basis for validation, as it will likely contain hardly any queries that should be coded with 1. We will ignore potential imbalance here, but if you are interested in this issue, I can recommend this text by Anke Stoll: https://doi.org/10.1007/s11616-020-00573-9↩︎