2.1 The sentiments datasets
There are a variety of methods and dictionaries for evaluating the opinion or emotion in text. The tidytext package provides access to several sentiment lexicons. Three general-purpose lexicons are

- AFINN from Finn Årup Nielsen,
- bing from Bing Liu and collaborators, and
- nrc from Saif Mohammad and Peter Turney.

A fourth is loughran, the Loughran and McDonald dictionary of financial sentiment terms. This dictionary was developed based on analyses of financial reports, and intentionally avoids words like “share” and “fool”, as well as subtler terms like “liability” and “risk” that may not have a negative meaning in a financial context.
All of these lexicons are based on unigrams, i.e., single words. They contain many English words, and the words are assigned scores for positive/negative sentiment, and possibly also emotions like joy, anger, and sadness. The nrc lexicon categorizes words into classes of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. The loughran lexicon divides words into six categories: constraining, litigious, negative, positive, superfluous, and uncertainty.
get_sentiments("nrc")
#> # A tibble: 13,901 x 2
#> word sentiment
#> <chr> <chr>
#> 1 abacus trust
#> 2 abandon fear
#> 3 abandon negative
#> 4 abandon sadness
#> 5 abandoned anger
#> 6 abandoned fear
#> # ... with 13,895 more rows
# install.packages("textdata")
get_sentiments("bing")
#> # A tibble: 6,786 x 2
#> word sentiment
#> <chr> <chr>
#> 1 2-faces negative
#> 2 abnormal negative
#> 3 abolish negative
#> 4 abominable negative
#> 5 abominably negative
#> 6 abominate negative
#> # ... with 6,780 more rows
get_sentiments("afinn")
#> # A tibble: 2,477 x 2
#> word value
#> <chr> <dbl>
#> 1 abandon -2
#> 2 abandoned -2
#> 3 abandons -2
#> 4 abducted -2
#> 5 abduction -2
#> 6 abductions -2
#> # ... with 2,471 more rows
get_sentiments("loughran") %>%
filter(sentiment == "superfluous")
#> # A tibble: 21 x 2
#> word sentiment
#> <chr> <chr>
#> 1 aegis superfluous
#> 2 amorphous superfluous
#> 3 anticipatory superfluous
#> 4 appertaining superfluous
#> 5 assimilate superfluous
#> 6 assimilating superfluous
#> # ... with 15 more rows
Dictionary-based methods like the ones we are discussing find the total sentiment of a piece of text by adding up the individual sentiment scores for each word in the text.
One caveat is that the size of the chunk of text that we use to add up unigram sentiment scores can have an effect on an analysis. In a text the size of many paragraphs, positive and negative sentiment can average out to about zero, while sentence-sized or paragraph-sized chunks often work better.
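A toy illustration of this caveat, again in base R with invented AFINN-style scores: two strongly opposed sentences each carry a clear signal on their own, but summed over the whole text they cancel out.

```r
# Invented AFINN-style scores (integers from -5 to 5); these values are
# for illustration only, not the real AFINN assignments.
scores <- c(love = 3, good = 3, hate = -3, bad = -3)

sentences <- list(
  c("i", "love", "this", "good", "book"),    # clearly positive
  c("i", "hate", "that", "bad", "ending")    # clearly negative
)

# Scoring sentence-sized chunks preserves the sentiment signal; words
# not in the lexicon index as NA and are dropped by na.rm = TRUE.
per_sentence <- sapply(sentences, function(s) sum(scores[s], na.rm = TRUE))
per_sentence
#> [1]  6 -6

# Summing over the whole text averages the signal out to zero.
sum(per_sentence)
#> [1] 0
```

The per-sentence scores show one strongly positive and one strongly negative sentence, information the whole-text total of zero completely hides.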