5 Dictionary-based Analysis
Sentiment analyses are fairly easy when you have your data in tidy text format. Since they basically consist of matching each word’s sentiment value to the tokens in the corpus, this can be done with an inner_join(). tidytext comes with four dictionaries: bing, loughran, afinn, and nrc:
#install.packages("textdata") #may be required to access the lexicons
needs(SnowballC, sotu, stopwords, tidyverse, tidytext)
walk(c("bing", "loughran", "afinn", "nrc"), \(x) get_sentiments(lexicon = x) |>
head() |>
print())
# A tibble: 6 × 2
word sentiment
<chr> <chr>
1 2-faces negative
2 abnormal negative
3 abolish negative
4 abominable negative
5 abominably negative
6 abominate negative
# A tibble: 6 × 2
word sentiment
<chr> <chr>
1 abandon negative
2 abandoned negative
3 abandoning negative
4 abandonment negative
5 abandonments negative
6 abandons negative
# A tibble: 6 × 2
word value
<chr> <dbl>
1 abandon -2
2 abandoned -2
3 abandons -2
4 abducted -2
5 abduction -2
6 abductions -2
# A tibble: 6 × 2
word sentiment
<chr> <chr>
1 abacus trust
2 abandon fear
3 abandon negative
4 abandon sadness
5 abandoned anger
6 abandoned fear
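To see the matching logic in isolation, here is a minimal sketch with a made-up set of tokens (toy_tokens is invented purely for illustration): only tokens that also appear in the lexicon survive the inner_join().
# toy example: join a handful of made-up tokens to the bing lexicon
toy_tokens <- tibble(doc = 1, word = c("good", "bad", "table", "great"))

toy_tokens |>
  inner_join(get_sentiments("bing"), by = "word")
# tokens without a dictionary entry (such as "table") are dropped;
# the remaining ones carry their sentiment label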
As the output of the four lexicons shows, the dictionaries are mere tibbles with two columns: “word” and “sentiment” (or “value” in the case of AFINN). For easier joining, I should rename my column “token” to “word”.
needs(magrittr)
sotu_20cent_clean <- sotu_meta |>
  mutate(text = sotu_text) |>
  distinct(text, .keep_all = TRUE) |>
  filter(between(year, 1900, 2000)) |>
  unnest_tokens(output = token, input = text) |>
  anti_join(get_stopwords(), by = c("token" = "word")) |>
  filter(!str_detect(token, "[0-9]")) |>
  mutate(token = wordStem(token, language = "en")) |>
  mutate(n_rows = length(token))

sotu_20cent_clean %<>% rename(word = token)
The AFINN dictionary is the only one with numeric values. You might have noticed that its words are not stemmed, so I need to stem them before I can join them with my tibble. To get one sentiment value per address, I average the values by year.
sotu_20cent_afinn <- get_sentiments("afinn") |>
  mutate(word = wordStem(word, language = "en")) |>
  inner_join(sotu_20cent_clean) |>
  group_by(year) |>
  summarize(sentiment = mean(value))
Thereafter, I can just plot it:
sotu_20cent_afinn |>
  ggplot() +
  geom_line(aes(x = year, y = sentiment))
That’s a bit hard to interpret. geom_smooth() might help:
sotu_20cent_afinn |>
  ggplot() +
  geom_smooth(aes(x = year, y = sentiment))
Interesting. If we take the tone of the SOTU addresses as a proxy for the circumstances of the time, the worst phase appears to be the 1920s and 1930s, which makes sense given the economic situation back then. The maximum was reached around the 1960s, and since then the sentiment has, apparently, remained fairly stable.
5.1 Assessing the results
However, we have no idea whether we are capturing a valid signal or not. Let’s look at which words drive those classifications the most:
sotu_20cent_contribution <- get_sentiments("afinn") |>
  mutate(word = wordStem(word, language = "en")) |>
  inner_join(sotu_20cent_clean) |>
  group_by(word) |>
  summarize(occurrences = n(),
            contribution = sum(value))
sotu_20cent_contribution |>
  slice_max(contribution, n = 10) |>
  bind_rows(sotu_20cent_contribution |> slice_min(contribution, n = 10)) |>
  mutate(word = reorder(word, contribution)) |>
  ggplot(aes(contribution, word, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  labs(y = NULL)
Let’s split this up per decade:
get_sentiments("afinn") |>
  mutate(word = wordStem(word, language = "en")) |>
  inner_join(sotu_20cent_clean) |>
  mutate(decade = ((year - 1900) / 10) |> floor()) |>
  group_by(decade, word) |>
  summarize(occurrences = n(),
            contribution = sum(value)) |>
  slice_max(contribution, n = 5) |>
  bind_rows(get_sentiments("afinn") |>
              mutate(word = wordStem(word, language = "en")) |>
              inner_join(sotu_20cent_clean) |>
              mutate(decade = ((year - 1900) / 10) |> floor()) |>
              group_by(decade, word) |>
              summarize(occurrences = n(),
                        contribution = sum(value)) |>
              slice_min(contribution, n = 5)) |>
  mutate(word = reorder_within(word, contribution, decade)) |>
  ggplot(aes(contribution, word, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~decade, ncol = 4, scales = "free") +
  scale_y_reordered()
5.2 Assessing the quality of the rating
We need to assess the reliability of our classification: would different raters come to the same conclusion, and, if we compare it to a gold standard, how well does the classification live up to it? One measure we can use here is Krippendorff’s Alpha, which is defined as
\[\alpha = 1 - \frac{D_o}{D_e}\]
where \(D_{o}\) is the observed disagreement and \(D_{e}\) is the disagreement expected by chance. The actual calculation is more involved, but R can easily take care of that; we just need to feed it properly formatted data. For this example, I use a commonly used benchmark data set containing IMDb movie reviews labeled as positive or negative.
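Before we turn to the reviews, a minimal sketch with made-up codings (toy_ratings is invented purely for illustration) shows what the irr package’s kripp.alpha() function, which we will use below, expects: a matrix with one row per rater and one column per document.
needs(irr)

# two hypothetical raters coding ten documents (rows = raters, columns = documents)
toy_ratings <- rbind(
  rater_1 = c("pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos"),
  rater_2 = c("pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos")
)

# the more the two raters agree, the closer alpha gets to 1
kripp.alpha(toy_ratings, method = "nominal")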
imdb_reviews <- read_csv("https://www.dropbox.com/scl/fi/psgj6ze6at3zovildm728/imdb_reviews.csv?rlkey=ve2s02ydosbweemalvskyiu4s&dl=1")
glimpse(imdb_reviews)
Rows: 25,000
Columns: 2
$ text <chr> "Once again Mr. Costner has dragged out a movie for far long…
$ sentiment <chr> "negative", "negative", "negative", "negative", "negative", …
imdb_reviews_afinn <- imdb_reviews |>
  rowid_to_column("doc") |>
  unnest_tokens(token, text) |>
  anti_join(get_stopwords(), by = c("token" = "word")) |>
  mutate(stemmed = wordStem(token)) |>
  inner_join(get_sentiments("afinn") |> mutate(stemmed = wordStem(word))) |>
  group_by(doc) |>
  summarize(sentiment = mean(value)) |>
  mutate(sentiment_afinn = case_when(sentiment > 0 ~ "positive",
                                     TRUE ~ "negative") |>
           factor(levels = c("positive", "negative")))
Now we have two classifications, one “gold standard” from the data and the one obtained through AFINN.
review_coding <- imdb_reviews |>
  mutate(true_sentiment = sentiment |>
           factor(levels = c("positive", "negative"))) |>
  select(-sentiment) |>
  rowid_to_column("doc") |>
  left_join(imdb_reviews_afinn |> select(doc, sentiment_afinn))
First, we can check how often AFINN got it right, the accuracy:
sum(review_coding$true_sentiment == review_coding$sentiment_afinn, na.rm = TRUE)/25000
[1] 0.64712
However, accuracy is not a perfect metric because it does not tell you anything about the details. For instance, your classifier might just predict “positive” all of the time. If your gold standard has 50 percent “positive” cases, the accuracy would still lie at 0.5, as the quick check below demonstrates for our data. We can address this using the following measures.
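We can demonstrate this with our data (a quick check, not part of the pipeline above): a degenerate baseline that always predicts “positive” reaches an accuracy equal to the share of truly positive reviews, i.e., 0.5 in this balanced data set.
# a baseline "classifier" that labels every review as positive
always_positive <- factor(rep("positive", nrow(review_coding)),
                          levels = c("positive", "negative"))

# its accuracy equals the prevalence of positive reviews (0.5 here)
mean(always_positive == review_coding$true_sentiment)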
For the calculation of Krippendorff’s Alpha, the data must be in a different format: a matrix with the documents as columns and the respective ratings as rows.
needs(irr)
mat <- review_coding |>
  select(-text) |>
  as.matrix() |>
  t()

mat[1:3, 1:5]
[,1] [,2] [,3] [,4] [,5]
doc " 1" " 2" " 3" " 4" " 5"
true_sentiment "negative" "negative" "negative" "negative" "negative"
sentiment_afinn "positive" "negative" "negative" "positive" "positive"
colnames(mat) <- mat[1, ]
mat <- mat[2:3, ]
mat[1:2, 1:5]
1 2 3 4 5
true_sentiment "negative" "negative" "negative" "negative" "negative"
sentiment_afinn "positive" "negative" "negative" "positive" "positive"
kripp.alpha(mat, method = "nominal")
Krippendorff's alpha
Subjects = 25000
Raters = 2
alpha = 0.266
Alpha values of around 0.8 or higher are considered good; AFINN clearly misses that mark.
Another way to evaluate the quality of classification is through a confusion matrix.
Now we can calculate precision (when the classifier predicts “positive”, how often is it correct?), recall/sensitivity (when a document is actually “positive”, how often is it predicted as such?), and specificity (when a document is actually “negative”, how often is it predicted as negative?). The F1 score is the harmonic mean of precision and recall, defined as \(F_1 = \frac{2}{\frac{1}{recall} + \frac{1}{precision}} = 2\times \frac{precision\times recall}{precision + recall}\), and it is the most commonly used single measure to assess the quality of a classification. The closer it is to 1, the better. You can find a more thorough description of the confusion matrix and the different measures in this blog post.
We can do this in R using the caret package.
needs(caret)
confusion_matrix <- confusionMatrix(data = review_coding$sentiment_afinn,
                                    reference = review_coding$true_sentiment,
                                    positive = "positive")

confusion_matrix$byClass
Sensitivity Specificity Pos Pred Value
0.8439750 0.4504721 0.6056500
Neg Pred Value Precision Recall
0.7427441 0.6056500 0.8439750
F1 Prevalence Detection Rate
0.7052216 0.5000000 0.4219875
Detection Prevalence Balanced Accuracy
0.6967515 0.6472236
5.3 Exercises
Take the data from before (State of the Union Addresses). Have a look at different dictionaries (e.g., Bing or Loughran). Check the words that contributed the most. Do you see any immediate ambiguities or flaws?
sotu_20cent_clean <- sotu_meta |>
  mutate(text = sotu_text) |>
  distinct(text, .keep_all = TRUE) |>
  filter(between(year, 1900, 2000)) |>
  unnest_tokens(output = word, input = text) |>
  mutate(word = wordStem(word, language = "en"))
Preprocess the dictionaries properly so that they match the terms in the text (stemming!).
Join the respective tibbles together using an inner_join().
Determine the sentiments for every year and plot them.
Think about how you can transform the dictionaries into a more unified representation (i.e., so that the labels are "positive" or "negative"); a possible starting point is sketched after this list.
Compare the share of "positive" and "negative" words that the different dictionaries yield over time.
Which terms contribute the most to "positive" and "negative"? Compare.
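A possible starting point for the unification step (a sketch, not the only solution): recode AFINN’s numeric values into "positive"/"negative" labels and keep only those two categories from lexicons that contain additional ones.
# AFINN: numeric values -> binary labels
afinn_binary <- get_sentiments("afinn") |>
  mutate(sentiment = if_else(value > 0, "positive", "negative")) |>
  select(word, sentiment)

# Loughran (and NRC) contain further categories -> keep only positive and negative
loughran_binary <- get_sentiments("loughran") |>
  filter(sentiment %in% c("positive", "negative"))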