Chapter 10 Notes
10.1 Categorization of word “casualty” by Loughran Lexicon
Seeing that the AFINN lexicon scored the word “casualty” as negative…
brk_afinn %>%
  filter(word == "casualty") %>%
  select(sentiment) %>%
  unique()
## # A tibble: 1 x 1
##   sentiment
##       <dbl>
## 1        -2
I then looked at how the Loughran lexicon classified it.
brk_lough %>%
  filter(word == "casualty") %>%
  select(sentiment) %>%
  unique()
## # A tibble: 1 x 1
##   sentiment
##   <chr>
## 1 neutral
It turned out to be neutral, which is correct for the Berkshire analysis. Then I looked at Bing, which classifies the word as negative.
brk_bing %>%
  filter(word == "casualty") %>%
  select(sentiment) %>%
  unique()
## # A tibble: 1 x 1
##   sentiment
##   <chr>
## 1 negative
We previously looked at how many times the word “casualty” was used: the maximum was 10 times, in 1984. In the AFINN analysis in Chapter 4 we saw that it had an influence of -20 in 1984 and that adjusting for it would have raised sentiment for that year from 129 to 149. How much sway did the word have on the Bing sentiment?
brk_words %>%
  filter(word == "casualty") %>% # filters for just the word "casualty"
  group_by(year) %>% # group by year
  count(word, sort = TRUE) %>% # count total number and sort by frequency
  ungroup()
## # A tibble: 43 x 3
##    year  word         n
##    <chr> <chr>    <int>
##  1 1984  casualty    10
##  2 1977  casualty     7
##  3 1986  casualty     7
##  4 1980  casualty     6
##  5 1988  casualty     5
##  6 2016  casualty     5
##  7 1976  casualty     4
##  8 1979  casualty     4
##  9 1981  casualty     4
## 10 1987  casualty     4
## # ... with 33 more rows
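As a quick check on the AFINN figures quoted above, the arithmetic can be sketched in base R. The variable names are hypothetical and not part of the original analysis; the values (10 occurrences in 1984, an AFINN score of -2, and a 1984 sentiment total of 129) are taken from the text.

```r
# Hypothetical variable names; values come from the text and output above.
casualty_count <- 10    # occurrences of "casualty" in 1984
afinn_value    <- -2    # AFINN score assigned to "casualty"
afinn_1984     <- 129   # AFINN sentiment total for 1984 (Chapter 4)

# Total contribution of "casualty" to the 1984 AFINN score
casualty_influence <- casualty_count * afinn_value

# Removing that contribution gives the adjusted score
afinn_1984_adjusted <- afinn_1984 - casualty_influence

casualty_influence    # -20
afinn_1984_adjusted   # 149
```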
If we look at the Bing sentiment for 1984, we see the score is (287 - 286) / 5,000 = 0.0002.
brk_bing_year <- brk_bing %>%
  count(year, sentiment) %>%
  spread(key = sentiment, value = n) %>%
  mutate(total = (positive + negative + neutral)) %>% # create "total" column
  mutate(bing = ((positive - negative) / total)) %>% # create column with calculated score
  select("year", "positive", "negative", "neutral", "total", "bing") # reorder columns
brk_bing_year %>%
  filter(year == 1984)
## # A tibble: 1 x 6
##   year  positive negative neutral total   bing
##   <chr>    <int>    <int>   <int> <int>  <dbl>
## 1 1984       287      286    4427  5000 0.0002
If we adjust for the misclassification by removing the 10 occurrences of “casualty” from the negative count, the new score would be (287 - 276) / 5,000 = 0.0022, which is not a big difference in absolute terms.
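The adjustment can be reproduced with a minimal base R sketch. The helper `bing_score()` and the variable names are hypothetical, introduced only for illustration. The sketch assumes that the 10 occurrences of “casualty” move from the negative count to neutral, so the total word count stays at 5,000.

```r
# Hypothetical helper: net Bing-style score, (positive - negative) / total
bing_score <- function(positive, negative, total) {
  (positive - negative) / total
}

# 1984 counts from the tibble above
positive   <- 287
negative   <- 286
total      <- 5000
casualty_n <- 10   # occurrences of "casualty" in 1984

# Score as originally computed
original <- bing_score(positive, negative, total)

# Assumption: reclassifying "casualty" as neutral lowers the negative
# count by 10 and leaves the total unchanged
adjusted <- bing_score(positive, negative - casualty_n, total)

# Absolute change in the score
adjusted - original
```

Under these assumptions the score moves from 1/5,000 to 11/5,000, a change of 10/5,000 = 0.002.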