13 Sentiment analysis
A surprisingly easy text mining task, once your documents have been turned into a tokenised dataframe, is sentiment analysis. Sentiment analysis is the name for a range of techniques which attempt to measure emotion in a text.
There are lots of ways of doing this, ranging from the simple to the highly sophisticated. One fairly simple but robust method is to take a dataset of words with corresponding sentiment scores (this could be a simple negative or positive label, or a score for each of a range of emotions). Then you join these scores to your tokenised dataframe, and count or sum them.
The tricky bit is working out what it all means. You could argue that it's reductive to treat a text as the sum of the positive and negative scores of its words - this is obviously not the way that language works. Also, if you're summing the scores, you need to think about the unit you're summarising by: can you measure the emotions of a newspaper, or does it have to be per article? And it goes without saying that these lexicons were created by modern readers for use on modern text.
Despite these questions, the technique can throw up some interesting patterns. Used carefully, it might reveal something of the way an event was reported, though perhaps less about the 'sentiment' of an article than about its reporting style or focus. I think sentiment analysis shows some promise when applied specifically to newspaper data, but thinking of the result as sentiment may be a fool's errand: it tells us something about the focus or style of an article, and, over time and in bulk, something of a newspaper's style or how that style changed.
The tidytext library has a few built-in sentiment score datasets (or lexicons). To load them, first install the textdata and tidytext packages, if they're not installed already (using install.packages()).
13.1 Install and load relevant packages
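If any of the packages are missing, they can be installed first - something like this (install only whichever ones you don't already have):
install.packages(c('tidytext', 'textdata', 'tidyverse'))
Then load them: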
library(textdata)
library(tidytext)
library(tidyverse)
13.2 Fetch sentiment data
Next, use a function from the tidytext library called get_sentiments(). All this does is retrieve a dataset of sentiment scores and store it as a dataframe.
There are four to choose from - I’ll quickly explain each one.
13.2.1 Afinn dataset
afinnsentiments = get_sentiments('afinn')
head(afinnsentiments,10)
## # A tibble: 10 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
The Afinn dataset has two columns: words in one column, and a value between -5 and +5 in the other. The value is a numeric score of the word's perceived positivity or negativity. More information is available on the official project GitHub page.
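If you want to check the spread of scores for yourself, a quick (optional) count of the words at each value works - for example:
afinnsentiments %>% count(value)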
13.2.2 Bing dataset
The second, the Bing dataset, was compiled by the researchers Minqing Hu and Bing Liu. It is also a list of words, with each classified as either positive or negative.
bingsentiments = get_sentiments('bing')
head(bingsentiments,10)
## # A tibble: 10 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
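To see how the words split between the two categories, an optional quick count does the job:
bingsentiments %>% count(sentiment)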
13.2.3 Loughran dataset
I've never used it, but it's similar in structure to the Bing dataset: a column of words, each with a sentiment classification (the first rows shown below are all negative). More information and the original files can be found on the creator's website.
loughransentiments = get_sentiments('loughran')
head(loughransentiments,10)
## # A tibble: 10 × 2
## word sentiment
## <chr> <chr>
## 1 abandon negative
## 2 abandoned negative
## 3 abandoning negative
## 4 abandonment negative
## 5 abandonments negative
## 6 abandons negative
## 7 abdicated negative
## 8 abdicates negative
## 9 abdicating negative
## 10 abdication negative
13.2.4 NRC dataset
nrcsentiments = get_sentiments('nrc')
head(nrcsentiments,10)
## # A tibble: 10 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
'The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were manually done by crowdsourcing.' (https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm)
The NRC dataset is a bit different from the others. This time there's a list of words, each paired with an emotion associated with that word. A word can have multiple entries, with a different emotion attached to each.
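You can see this by filtering for a single word - 'abandon', for example, turns up with several emotions attached:
nrcsentiments %>% filter(word == 'abandon')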
13.3 Load the tokenised news sample
load('tokenised_news_sample')
The afinn lexicon has two columns, 'word' and 'value'. inner_join() will merge it with the tokenised dataframe, matching on the shared 'word' column.
tokenised_news_sample %>% inner_join(afinnsentiments)
## Joining, by = "word"
## # A tibble: 1,704,076 × 8
## article_code art title year date full_date word value
## <int> <chr> <chr> <chr> <chr> <date> <chr> <dbl>
## 1 2 0002 "" 1855 0619 1855-06-19 miss -2
## 2 2 0002 "" 1855 0619 1855-06-19 miss -2
## 3 2 0002 "" 1855 0619 1855-06-19 miss -2
## 4 2 0002 "" 1855 0619 1855-06-19 pleasure 3
## 5 2 0002 "" 1855 0619 1855-06-19 positively 2
## 6 2 0002 "" 1855 0619 1855-06-19 attraction 2
## 7 2 0002 "" 1855 0619 1855-06-19 miss -2
## 8 2 0002 "" 1855 0619 1855-06-19 celebrated 3
## 9 2 0002 "" 1855 0619 1855-06-19 farce -1
## 10 2 0002 "" 1855 0619 1855-06-19 benefit 2
## # … with 1,704,066 more rows
Now we have a list of all the words, one per line, which occurred in the afinn list, along with their individual scores. To make this in any way useful, we need to summarise the scores. The article seems by far the most logical unit to start with. We can get the average score for each article, which will tell us whether the article contained more positive or more negative words. For this we use tally() and mean(). I'm also using add_tally() to keep only articles which contain at least 20 words from the lexicon, because I think that makes the score more meaningful.
Let's look at the most 'positive' articles:
tokenised_news_sample %>%
  inner_join(afinnsentiments) %>%
  group_by(article_code) %>%
  add_tally() %>%
  filter(n > 20) %>%
  tally(mean(value)) %>%
  arrange(desc(n))
## Joining, by = "word"
## # A tibble: 20,963 × 2
## article_code n
## <int> <dbl>
## 1 23531 2.8
## 2 23709 2.55
## 3 22835 2.42
## 4 54242 2.4
## 5 58442 2.33
## 6 7528 2.31
## 7 31605 2.25
## 8 2527 2.20
## 9 31669 2.2
## 10 4255 2.14
## # … with 20,953 more rows
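To find the most 'negative' articles instead, the same pipeline can simply be sorted the other way, ascending rather than descending:
tokenised_news_sample %>%
  inner_join(afinnsentiments) %>%
  group_by(article_code) %>%
  add_tally() %>%
  filter(n > 20) %>%
  tally(mean(value)) %>%
  arrange(n)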
13.4 Charting changes in sentiment over time
Sentiment analysis should be used with caution, but it's potentially a useful tool, particularly for looking at changes over time, or at differences between newspapers or authors. We can plot the average of all the average article scores; if we had them, this could also be segmented by title.
Here I've charted all the sentiments found in the sample dataframe - but note that because the sample is so sparse and uneven, I've not spaced the dates temporally, but simply plotted them one after another.
tokenised_news_sample %>%
  inner_join(afinnsentiments) %>%
  group_by(full_date, article_code) %>%
  add_tally() %>%
  filter(n > 20) %>%
  tally(mean(value)) %>%
  group_by(full_date) %>%
  tally(mean(n)) %>%
  arrange(desc(n)) %>%
  ggplot() +
  geom_col(aes(x = as.character(full_date), y = n)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
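If you'd rather space the dates out temporally, one option - just a sketch, assuming full_date remains a proper Date column as in the output above - is to map it to the x axis directly, so ggplot treats it as a continuous date scale:
tokenised_news_sample %>%
  inner_join(afinnsentiments) %>%
  group_by(full_date, article_code) %>%
  add_tally() %>%
  filter(n > 20) %>%
  tally(mean(value)) %>%
  group_by(full_date) %>%
  tally(mean(n)) %>%
  ggplot() +
  geom_col(aes(x = full_date, y = n))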