Chapter 10 Notes

10.1 Categorization of the word “casualty” by the Loughran Lexicon

The AFINN lexicon scores the word “casualty” as negative:

brk_afinn %>%
  filter(word == "casualty") %>%
  select(sentiment) %>%
  unique()
## # A tibble: 1 x 1
##   sentiment
##       <dbl>
## 1        -2

I then looked at how the Loughran lexicon classified it.

brk_lough %>%
  filter(word == "casualty") %>%
  select(sentiment) %>%
  unique()
## # A tibble: 1 x 1
##   sentiment
##   <chr>    
## 1 neutral

It turned out to be neutral, which is the correct reading for the Berkshire analysis, where “casualty” typically refers to property-casualty insurance rather than to harm. I then looked at Bing, which classifies the word as negative.

brk_bing %>%
  filter(word == "casualty") %>%
  select(sentiment) %>%
  unique()
## # A tibble: 1 x 1
##   sentiment
##   <chr>    
## 1 negative

We previously looked at how many times the word “casualty” was used: the maximum was 10 times, in 1984. The AFINN analysis in Chapter 4 showed that the word contributed -20 to the 1984 score (10 occurrences at -2 each) and that adjusting for it would have raised sentiment for that year from 129 to 149. How much sway did the word have on the Bing sentiment?

brk_words %>%
  filter(word == "casualty") %>%   # filters for just the word "casualty"
  group_by(year) %>%            # group by year
  count(word, sort = TRUE) %>%  # count total number and sort by frequency
  ungroup()
## # A tibble: 43 x 3
##    year  word         n
##    <chr> <chr>    <int>
##  1 1984  casualty    10
##  2 1977  casualty     7
##  3 1986  casualty     7
##  4 1980  casualty     6
##  5 1988  casualty     5
##  6 2016  casualty     5
##  7 1976  casualty     4
##  8 1979  casualty     4
##  9 1981  casualty     4
## 10 1987  casualty     4
## # ... with 33 more rows
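
The AFINN figures quoted earlier can be checked with a few lines of arithmetic. This is a minimal sketch in base R: the variable names are illustrative, and the values (a score of -2, 10 occurrences in 1984, and a 1984 total of 129) are copied from the text and output above rather than recomputed from brk_afinn.

```r
# Values copied from the notes above (not recomputed from brk_afinn)
afinn_value   <- -2    # AFINN score for "casualty"
casualty_1984 <- 10    # occurrences of "casualty" in 1984
afinn_1984    <- 129   # reported AFINN sentiment for 1984

# Total contribution of the word to the 1984 score
casualty_influence <- afinn_value * casualty_1984

# Score if "casualty" were treated as neutral (its contribution removed)
afinn_adjusted <- afinn_1984 - casualty_influence

casualty_influence
afinn_adjusted
```

The same pattern (score per word times frequency) applies to any other word whose AFINN classification looks wrong for this corpus.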

If we look at the Bing sentiment for 1984, we see the score is (287 - 286) / 5,000 = 0.0002.

brk_bing_year <- brk_bing %>%  
  count(year, sentiment) %>%
  spread(key = sentiment, value = n) %>%
  mutate(total = (positive + negative + neutral)) %>%   # create "total" column 
  mutate(bing = ((positive - negative) / total)) %>% # create column with calculated score
  select("year", "positive", "negative", "neutral", "total", "bing")  # reorder columns
brk_bing_year %>%
  filter(year == 1984)
## # A tibble: 1 x 6
##   year  positive negative neutral total   bing
##   <chr>    <int>    <int>   <int> <int>  <dbl>
## 1 1984       287      286    4427  5000 0.0002

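If we adjust for the misclassification by moving the 10 occurrences of “casualty” from negative to neutral, the new score would be (287 - 276) / 5,000 = 0.0022. That is an increase of only 0.002 on a scale that runs from -1 to 1, so it is not a big difference.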
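
The adjustment can also be reproduced directly from the 1984 counts. The sketch below uses base R only; bing_score is a hypothetical helper written for this note, not part of any package, and it assumes that all 10 occurrences of “casualty” move from negative to neutral, so the total word count stays at 5,000.

```r
# Hypothetical helper: net sentiment as a share of all words
bing_score <- function(positive, negative, total) {
  (positive - negative) / total
}

# 1984 counts copied from the brk_bing_year output above
positive <- 287
negative <- 286
neutral  <- 4427
total    <- positive + negative + neutral

casualty_1984 <- 10   # occurrences of "casualty" in 1984

original <- bing_score(positive, negative, total)

# Assumption: every "casualty" is reclassified from negative to neutral,
# so only the negative count changes and the total is unchanged
adjusted <- bing_score(positive, negative - casualty_1984, total)

c(original = original, adjusted = adjusted, change = adjusted - original)
```

Keeping the total fixed matters here: dropping the word from the data entirely, instead of relabelling it as neutral, would shrink the denominator and give a slightly different score.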