Chapter 4 Sentiment Analysis

4.1 Definition of Sentiment Analysis

To explain sentiment analysis so a 10-year-old can understand: it is the process of analyzing a piece of text to determine whether the writer's attitude towards the subject matter is positive, negative, or neutral. I am going to analyze the Berkshire letters to infer Buffett's attitude towards the past results of the business and his outlook.

4.2 Setting Expectations

I want to set expectations before moving further. It would be great if we could use changes in the sentiment of Warren Buffett's letters to predict movement in Berkshire's stock and make millions of dollars. I can say with full confidence that this won't happen. Sentiment will likely be a coincident indicator at best. What the sentiment score will represent is an "educated guess" at Buffett's attitude, gleaned from his letters, concerning a mix of the performance over the past year and the outlook going forward. Given that the market is extremely efficient, any positive or negative performance over the past year, as well as expectations for future performance, will already be incorporated into Berkshire's stock price. Therefore it is unlikely that the sentiment of Berkshire's letters will influence its stock price. In fact, one could do a simple event study to see if Berkshire's stock price has a meaningful change in the trading days after the letter is released. But since fourth quarter earnings are released simultaneously with the letter, it would be difficult to disentangle the sentiment of the letter from the performance of the business (the latter of which should already be incorporated into the stock price anyway). Bottom line: my expectations are low and so should yours.

4.3 Sentiment Analysis Roadmap

In essence, natural language processing is really just math and statistics. Figure 4.1 shows the roadmap.


Figure 4.1: Roadmap for Sentiment Analysis

The process starts with academics or other investigators taking a set of words and using some technique (manual annotation, crowdsourcing or machine learning) to categorize them as positive, negative or neutral. The resulting list is the sentiment lexicon. The words in a corpus (in our case the 49 Berkshire letters) are matched against the lexicon and a sentiment score is assigned to each word that appears in the lexicon (note that not all words in the letters are in the lexicons, and different lexicons contain different words). The scores of the individual words in a given year's letter are aggregated, and the result is a sentiment score for that year's letter. The score is observed as it changes over time and inferences are made to account for the change in sentiment.

4.4 Limitations with “Bag of Words” Approach

As mentioned previously, word-counting methods do not capture the relationships between words. They ignore order and context. For example, intensifiers have no effect, so adding "extremely" or "very" will not change the value. Take the two phrases "We had a good year." and "We had a very good year." To a human reader the intensifier makes the second phrase more positive, but the lexicon just counts each instance of "good" the same. Negations also have no effect, so the phrase "It was not a good year." would score as positive.
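
To illustrate the negation problem concretely, here is a minimal sketch that scores the phrase "It was not a good year." against the Bing lexicon using dplyr and tidytext. Word order is discarded, so the negation is lost.

library(dplyr)
library(tidytext)

tibble(text = "It was not a good year.") %>%
  unnest_tokens(word, text) %>%                    # split into lowercase single words
  inner_join(get_sentiments("bing"), by = "word")  # keep only words found in the Bing lexicon
# expect only "good" to match, tagged positive - the negation is lost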

4.5 Toy Example of Calculation of Sentiment

A computer really cannot do anything with words; words cannot be calculated. Take a simple example of a product review and try to add these words: "The pants were very comfortable. Excellent quality but $190 was a horrible price." You can't, because you can't add words. Let's remove numbers, punctuation and stop words from this review. We are left with "pants very comfortable excellent quality horrible price." We would then use a particular lexicon to assign numbers to the words, where positive words = 1, negative words = -1 and neutral words = 0. Now we have a vector of numbers we can sum: 0 + 0 + 1 + 1 + 1 - 1 + 0 = 2. Since 2 is greater than zero, the review can be classified as positive. Another way of calculating sentiment is as a ratio, where the number of negative words is subtracted from the number of positive words and the difference is divided by the total number of words. This can be shown in a rather complicated formula.

Sentiment Score = (number of positive words - number of negative words) / total number of words

Sentiment Score = (3 - 1) / 7 = 0.2857143

In this case, the sentiment score would be about 0.3. Note that the classification of individual words is binary - either positive or negative - so words like "bad" and "catastrophic" each get a score of -1, even though "catastrophic" can be seen as more negative than "bad."
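
Here is a minimal sketch of that toy calculation in R, using a made-up four-word lexicon (the words and scores below are for illustration only, not from any real lexicon).

library(dplyr)

# made-up toy lexicon: positive words = 1, negative words = -1
toy_lexicon <- tibble(
  word  = c("comfortable", "excellent", "quality", "horrible"),
  score = c(1, 1, 1, -1)
)

review <- tibble(word = c("pants", "very", "comfortable", "excellent",
                          "quality", "horrible", "price"))

scored <- review %>%
  left_join(toy_lexicon, by = "word") %>%
  mutate(score = coalesce(score, 0))   # words not in the lexicon are neutral (0)

sum(scored$score)                                               # 2, so the review is classified positive
(sum(scored$score > 0) - sum(scored$score < 0)) / nrow(scored)  # (3 - 1) / 7 = 0.286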

4.6 Overview of Sentiment Lexicons

Not all lexicons calculate sentiment in this way ((positive - negative) / total). Another approach uses a finer level of "resolution" and assigns words a value on a continuous scale, with more extreme words receiving more extreme values. So "catastrophic" might have a score of -0.90 while "bad" might have a score of -0.30. With this type of scoring, the values of all the words in a particular letter are summed to get an overall sentiment score for that year.
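
As a tiny illustration of this kind of scoring (the scores here are made up, not from any particular lexicon):

# made-up continuous scores for a handful of words in a hypothetical letter
scores <- c(catastrophic = -0.90, bad = -0.30, excellent = 0.75, gain = 0.50)
sum(scores)   # overall sentiment score for this hypothetical letter: 0.05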

Sentiment scores can be assigned to words manually, either by experts or through crowd-sourced methods such as Amazon Mechanical Turk, or automatically through machine learning. Manually annotated lexica are usually more precise but tend to be smaller in size due to the time and cost associated with manual coding. Automatically generated lexica are larger, but their precision depends on the accuracy of the algorithm (Gatti, Guerini, and Turchi 2016). For this analysis I use all three types, as I want to see the benefits and limitations of each.

There is also a difference in the intended use of each lexicon. For example, AFINN was developed to analyze tweets, Loughran was specifically designed for financial documents and Syuzhet for literature in the humanities. Figure 4.2 shows the different lexicons, the number of words in each, the resolution, the calculation method and the classification method.


Figure 4.2: Descriptions of Various Lexicons Used in Analysis

4.7 Creating Sentiment Data for Berkshire Letters by Year

I suppose there is a way to combine all the lexicons into one dataframe, similar to the tidytext stop word list. But due to my lack of knowledge and in the interest of time, I am going to work through them one by one, discuss issues I encounter with each, and then analyze and discuss the output. I will go into exhaustive detail with the Bing lexicon to show how sentiment is calculated and will gloss over many of these steps with the other lexicons.

4.7.1 Bing Sentiment Lexicon

4.7.1.1 Description

The Bing lexicon was developed by Minqing Hu and Bing Liu, starting with their 2004 paper whose goal was opinion mining of customer reviews (Hu and Liu 2004). It contains a total of 6,786 words - 2,005 positive and 4,781 negative. Classification is binary: a word is either positive or negative.

4.7.1.2 Combining Bing Sentiment Lexicon With Berkshire Corpus

In the code below we start with the brk_words dataframe and match it against the words in the Bing sentiment lexicon to create a new dataframe, brk_bing. If a word in the Berkshire letters is not in the Bing lexicon, we assign it a value of neutral. The output is a dataframe with 201,507 rows (one for each word) and a new column "sentiment" which contains the sentiment value (positive, negative or neutral) for each word.

brk_bing <- brk_words %>%
  left_join(get_sentiments(lexicon = "bing"), by = "word") %>%           # match each word against the Bing lexicon
  mutate(sentiment = ifelse(is.na(sentiment), "neutral", sentiment)) %>% # words not in the lexicon are marked neutral
  group_by(year) %>%
  ungroup()
brk_bing %>% print(n = 5)
## # A tibble: 201,507 x 3
##   year  word         sentiment
##   <chr> <chr>        <chr>    
## 1 1971  stockholders neutral  
## 2 1971  berkshire    neutral  
## 3 1971  hathaway     neutral  
## 4 1971  pleasure     positive 
## 5 1971  report       neutral  
## # ... with 201,502 more rows

4.7.1.3 Calculating Bing Sentiment

The format we ultimately want is a dataframe with a single value for the sentiment of each year's letter. I will build up the code for this step by step. The next step produces a dataframe brk_bing_year with the year in the first column, the sentiment category in the second column and the total number of words matching that category in the third column. For example, the 1971 letter comprises 37 positive words, 30 negative words and 550 neutral words. The individual words are no longer in the dataframe. We see there are 147 rows (49 years times 3 categories).

brk_bing_year <- brk_bing %>%  
  count(year, sentiment) 
brk_bing_year
## # A tibble: 147 x 3
##    year  sentiment     n
##    <chr> <chr>     <int>
##  1 1971  negative     30
##  2 1971  neutral     550
##  3 1971  positive     37
##  4 1972  negative     20
##  5 1972  neutral     653
##  6 1972  positive     48
##  7 1973  negative     33
##  8 1973  neutral     767
##  9 1973  positive     52
## 10 1974  negative     63
## # ... with 137 more rows

In the next step, we use tidyr::spread(key = sentiment, value = n), which collapses the data to 49 rows - one for each year.

brk_bing_year <- brk_bing %>%  
  count(year, sentiment) %>%
  tidyr::spread(key = sentiment, value = n) %>%
  select("year", "positive", "negative", "neutral")  # reorder columns
brk_bing_year %>% print(n=5)
## # A tibble: 49 x 4
##   year  positive negative neutral
##   <chr>    <int>    <int>   <int>
## 1 1971        37       30     550
## 2 1972        48       20     653
## 3 1973        52       33     767
## 4 1974        38       63     629
## 5 1975        78       51     833
## # ... with 44 more rows

Next we calculate sentiment for each year using the methodology previously discussed:

Bing sentiment Score = (number of positive words - number of negative words) / total number of words

For 1971 the sentiment score is:

Bing sentiment Score = (37 - 30) / 617 = 0.0113

We use the mutate function to create two new columns “total” and “bing.”

#sentiment by year
brk_bing_year <- brk_bing %>%  
  count(year, sentiment) %>%
  spread(key = sentiment, value = n) %>%
  mutate(total = (positive + negative + neutral)) %>%   # create "total" column 
  mutate(bing = ((positive - negative) / total)) %>% # create column with calculated score
  select("year", "positive", "negative", "neutral", "total", "bing")  # reorder columns
brk_bing_year %>% print(n=5)
## # A tibble: 49 x 6
##   year  positive negative neutral total    bing
##   <chr>    <int>    <int>   <int> <int>   <dbl>
## 1 1971        37       30     550   617  0.0113
## 2 1972        48       20     653   721  0.0388
## 3 1973        52       33     767   852  0.0223
## 4 1974        38       63     629   730 -0.0342
## 5 1975        78       51     833   962  0.0281
## # ... with 44 more rows

We have achieved what we set out to do - create a dataframe brk_bing_year with a column for year and a second column with the Bing sentiment score.

brk_bing_year %>%
  select("year", "bing") %>%
  print(n=5)
## # A tibble: 49 x 2
##   year     bing
##   <chr>   <dbl>
## 1 1971   0.0113
## 2 1972   0.0388
## 3 1973   0.0223
## 4 1974  -0.0342
## 5 1975   0.0281
## # ... with 44 more rows

4.7.1.4 Analysis of Bing Sentiment

We can now chart the sentiment scores for each year which is shown below.

g1 <- ggplot(brk_bing_year, aes(year, bing, fill = bing > 0)) +      # fill shows years below zero in red
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("lightcoral", "lightsteelblue3")) +
  labs(x = NULL, y = NULL,
       title = "Bing Sentiment Scores for Berkshire Letters") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
g1

We see negative sentiment in 1974, during the 1973-1974 bear market. We also see slightly negative sentiment in 1987 (likely due to the '87 Crash), 1990 (not sure why - perhaps the Gulf War), 2001 and 2002 (post 9/11) and 2008 (the Financial Crisis).

4.7.1.5 A Deeper Look Into How Bing is Classifying Words

Let's look at how Bing classifies different words in its lexicon. The code below takes all the words in the brk_bing dataframe, filters for the positive words, then counts and sorts them in order of frequency.

brk_bing %>%
  filter(sentiment == "positive") %>%
  count(word, sentiment, sort = TRUE)
## # A tibble: 929 x 3
##    word          sentiment     n
##    <chr>         <chr>     <int>
##  1 worth         positive    369
##  2 gains         positive    367
##  3 gain          positive    352
##  4 significant   positive    261
##  5 outstanding   positive    214
##  6 goodwill      positive    179
##  7 excellent     positive    145
##  8 extraordinary positive    138
##  9 advantage     positive    128
## 10 competitive   positive    125
## # ... with 919 more rows

We see some words that would be generically positive but that in the context of business, and especially with Berkshire, might more appropriately be classified as neutral or even negative. This highlights a core issue with our sentiment analysis - it may be off because certain words are not correctly categorized - and with the "bag of words" method generally, which does not take context into account. For example, "worth" is the most frequent positive word, and it's not apparent to me why that word would be classified as positive.

4.7.1.6 Even Deeper Dive into Misclassification of the word “worth”

Let's dive even deeper and look at the context in which the word "worth" is used. I randomly picked the 2007 letter. To see a count of positive words in the 2007 letter, we simply add filter(year == "2007") %>% to our code. The result shows that "worth" is used 12 times in that year.

brk_bing %>%
  filter(sentiment == "positive") %>%
  filter(year == "2007") %>%
  count(word, sentiment, sort = TRUE) %>%
  print(n=5)
## # A tibble: 161 x 3
##   word          sentiment     n
##   <chr>         <chr>     <int>
## 1 gain          positive     17
## 2 worth         positive     12
## 3 advantage     positive      6
## 4 competitive   positive      6
## 5 extraordinary positive      6
## # ... with 156 more rows

I went back to the 2007 letter and used ctrl + f so I could see how the word "worth" is used in context. Figure 4.3 shows snippets of each sentence where the word "worth" appears. Two instances are part of the bigram "Fort Worth," while the other ten appear to be neutral rather than positive. In fact, I would categorize all of these instances as neutral.


Figure 4.3: Instances of the word worth in 2007 letter

We clearly have an issue with the word "worth." We know it is used a total of 369 times, but how many times is it used in each year? We can see this using the code below, which sorts by frequency. We see that the word "worth" is used most often in 2012 and 1986.

brk_worth <- brk_bing %>%
  filter(word == "worth") %>%   # filters for just the word "worth"
  group_by(year) %>%            # group by year
  count(word, sort = TRUE) %>%  # count total number and sort by frequency
  ungroup()
brk_worth
## # A tibble: 45 x 3
##    year  word      n
##    <chr> <chr> <int>
##  1 2012  worth    19
##  2 1986  worth    18
##  3 1989  worth    17
##  4 1980  worth    14
##  5 1990  worth    14
##  6 1992  worth    12
##  7 1993  worth    12
##  8 2007  worth    12
##  9 1983  worth    11
## 10 1994  worth    11
## # ... with 35 more rows

We need to see if the misclassification of the word “worth” is throwing off our sentiment results and making sentiment more positive than it actually is. To do this we use the following code to show the word “worth” as a percent of positive words in descending order.

scrap <- left_join(brk_bing_year, brk_worth, by = "year") %>%   # combine the two dataframes
  mutate(worth_as_percent = ((n / positive) * 100)) %>%         # new column: "worth" as a percent of positive words
  arrange(desc(worth_as_percent)) %>%                           # sort new column in descending order
  select("year", "n", "positive", "worth_as_percent", "total")  # select columns to display
scrap
## # A tibble: 49 x 5
##    year      n positive worth_as_percent total
##    <chr> <int>    <int>            <dbl> <int>
##  1 1980     14      171             8.19  3161
##  2 2012     19      324             5.86  5492
##  3 2017      9      167             5.39  3568
##  4 1994     11      211             5.21  3648
##  5 1975      4       78             5.13   962
##  6 1979      9      182             4.95  2699
##  7 1981      7      144             4.86  2830
##  8 1993     12      247             4.86  4299
##  9 1986     18      371             4.85  5532
## 10 1992     12      260             4.62  4313
## # ... with 39 more rows

I have also included a column of total words (positive + negative + neutral). We see that the word "worth" comprises a fairly large percentage of total positive words, which indicates that the misclassification of this one word could be skewing our results.

4.7.1.7 Larger Implications of Misclassifications

Going back to our list of positive words by frequency, there are other words that look suspect to me.

brk_bing %>%
  filter(sentiment == "positive") %>%
  count(word, sentiment, sort = TRUE)
## # A tibble: 929 x 3
##    word          sentiment     n
##    <chr>         <chr>     <int>
##  1 worth         positive    369
##  2 gains         positive    367
##  3 gain          positive    352
##  4 significant   positive    261
##  5 outstanding   positive    214
##  6 goodwill      positive    179
##  7 excellent     positive    145
##  8 extraordinary positive    138
##  9 advantage     positive    128
## 10 competitive   positive    125
## # ... with 919 more rows

For example, "significant" could be part of the bigram "significant losses," "outstanding" could be part of the bigram "shares outstanding," and "extraordinary" could be part of the bigram "extraordinary loss," which is an accounting term. The word "competitive" in a business context isn't necessarily positive. "Goodwill" in a generic sense is positive, but in a business context it describes the excess of purchase price over tangible assets. I would need to do the same type of analysis I did with the word "worth" on all of these words to see if they are correctly classified. And while any one misclassified word might not have a large effect, the cumulative effect might be very large, rendering the whole analysis useless.
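
One way to start that check would be to tokenize the raw letter text into bigrams and count the ones that begin with the suspect words. The sketch below assumes a dataframe of raw letter text with columns year and text (here called brk_text, which is not created in this chapter), so it is illustrative only.

library(dplyr)
library(tidytext)

# brk_text is a hypothetical dataframe with one row per letter: columns `year` and `text`
brk_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(stringr::str_detect(bigram, "^(significant|outstanding|extraordinary) ")) %>%
  count(bigram, sort = TRUE)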

Let’s take a look at the most frequently used negative words. At first glance it seems like certain words might be problematic.

brk_bing %>%
  filter(sentiment == "negative") %>%
  count(word, sentiment, sort = TRUE)
## # A tibble: 1,412 x 3
##    word      sentiment     n
##    <chr>     <chr>     <int>
##  1 loss      negative    473
##  2 losses    negative    436
##  3 debt      negative    223
##  4 risk      negative    200
##  5 liability negative    131
##  6 casualty  negative    121
##  7 fall      negative    108
##  8 bad       negative    105
##  9 difficult negative    103
## 10 unusual   negative     98
## # ... with 1,402 more rows

For example, Berkshire has significant insurance operations, and words like "liability" and "casualty" are likely neutral when seen in context. The word "loss" might often be part of the neutral bigram "underwriting loss," another term specific to the insurance industry. And the word "debt," depending on context, might be better classified as neutral. This leads me to believe that misclassification of negative words might also change our results. But for now, we will suspend disbelief and move on with the project.

4.7.2 AFINN Sentiment Lexicon

4.7.2.1 Description

The AFINN lexicon was developed by Finn Årup Nielsen in 2009, primarily to analyze Twitter sentiment (Nielsen 2011). It consists of 2,477 words - 878 positive and 1,598 negative - scored on an integer scale from -5 to +5 with a mean of -0.59.

afinn <- get_sentiments(lexicon = "afinn") #create dataframe with all words in AFINN lexicon
#descriptive statistics
summary(afinn)
##      word               value        
##  Length:2477        Min.   :-5.0000  
##  Class :character   1st Qu.:-2.0000  
##  Mode  :character   Median :-2.0000  
##                     Mean   :-0.5894  
##                     3rd Qu.: 2.0000  
##                     Max.   : 5.0000
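
To check the positive/negative split quoted above, we can count the AFINN entries by the sign of their score (a quick sketch; output not shown):

afinn %>%
  count(polarity = sign(value))   # -1 = negative, +1 = positive (and 0 for any zero-scored entries)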

4.7.2.2 Combining AFINN with Berkshire Corpus

Using the code below, we match up the words that are in the AFINN lexicon with the corresponding words in the Berkshire letters. Using the inner_join command automatically filters out words in the brk_words dataframe which are not found in the AFINN lexicon.

brk_afinn <- brk_words %>%      #creates new dataframe "brk_afinn"
  inner_join(get_sentiments(lexicon = "afinn"), by = "word")%>% #just joins words in AFINN lexicon
  rename(sentiment = value)
brk_afinn %>% print(n=5)
## # A tibble: 19,620 x 3
##   year  word       sentiment
##   <chr> <chr>          <dbl>
## 1 1971  pleasure           3
## 2 1971  gains              2
## 3 1971  inadequate        -2
## 4 1971  benefits           2
## 5 1971  improve            2
## # ... with 19,615 more rows

4.7.2.3 Calculating AFINN Sentiment by Year

As you can see from the previous table, AFINN assigns each word a sentiment score. In order to produce a sentiment score by year, we simply sum up all the values for a particular year using the following code:

#sentiment by year
brk_afinn_year <- brk_afinn %>%
  group_by(year) %>%
  summarise(afinn = sum(sentiment)) %>%
  ungroup()
brk_afinn_year %>% print(n=5)
## # A tibble: 49 x 2
##   year  afinn
##   <chr> <dbl>
## 1 1971     34
## 2 1972     64
## 3 1973     55
## 4 1974     -6
## 5 1975    108
## # ... with 44 more rows

4.7.2.4 Analysis of AFINN Sentiment

The following chart shows AFINN sentiment for each of the 49 letters in chronological order.

g_afinn <- ggplot(brk_afinn_year, aes(year, afinn, fill = afinn > 0)) +  # fill shows years below zero in red
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("lightcoral", "lightsteelblue3")) +
  labs(x = NULL, y = "Sentiment Score",
       title = "AFINN Sentiment Scores for Berkshire Letters") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
g_afinn

The sentiment using AFINN looks very different from the Bing chart, although "obvious" negative years such as 1974, 2001 and 2008 have either low scores or are negative. With AFINN, 1981 has a low score, which makes sense: short-term interest rates were around 21% and long-term rates around 14% as the US was suffering from several years of high inflation. Using Bing sentiment, 1981 also got a low score.

4.7.2.5 A Brief Glimpse of Performance Data and the Definition of “obvious” Years

In the last paragraph, I used the term "obvious negative years," and I want to briefly explain what that means. Figure 4.4 shows the S&P 500 return over the 49-year period of our analysis. While we will compare sentiment to returns later, I wanted to show returns for the S&P 500 (a proxy for the overall performance of the US stock market) to see which years were negative and where we would expect to see negative sentiment. The "obvious" periods include the 1973-1974 bear market; 2000-2002, which includes the dot-com crash and 9/11; and the 2008 financial crisis. These are periods where we would expect more negative sentiment because the markets performed very poorly.


Figure 4.4: S&P 500 Annual Performance

4.7.2.6 A Deeper Look Into How AFINN is Classifying Words

Since the AFINN lexicon scores words from -5 to 5, I decided to multiply the sentiment score for each word by the number of times it is used to get a sense of the overall impact a given word has, using the following code.

brk_afinn %>%
  count(word, sentiment) %>%
#  filter(sentiment ==2) %>%
  mutate(total = sentiment * n) %>%
  arrange(desc(total))
## # A tibble: 1,229 x 4
##    word        sentiment     n total
##    <chr>           <dbl> <int> <dbl>
##  1 outstanding         5   214  1070
##  2 assets              2   373   746
##  3 worth               2   369   738
##  4 gains               2   367   734
##  5 gain                2   352   704
##  6 shares              1   656   656
##  7 share               1   627   627
##  8 growth              2   272   544
##  9 excellent           3   145   435
## 10 terrific            4    96   384
## # ... with 1,219 more rows

We see the word "worth" has the third-largest impact, as it is assigned a positive score of +2. This is problematic, as are words like "shares" and "assets," which in context are both likely neutral. This would help explain why the AFINN-based scores look more positive overall than the results we obtained with Bing.

Let’s look at negative words in the same way to see if there are any issues.

brk_afinn %>%
  count(word, sentiment) %>%
  #filter(sentiment <=-4) %>%
  mutate(total = sentiment * n) %>%
  arrange(total)
## # A tibble: 1,229 x 4
##    word        sentiment     n total
##    <chr>           <dbl> <int> <dbl>
##  1 loss               -3   473 -1419
##  2 debt               -2   223  -446
##  3 risk               -2   200  -400
##  4 bad                -3   105  -315
##  5 pay                -1   314  -314
##  6 charges            -2   130  -260
##  7 casualty           -2   121  -242
##  8 catastrophe        -3    78  -234
##  9 lost               -3    61  -183
## 10 risks              -2    91  -182
## # ... with 1,219 more rows

We see the word "loss" has the greatest influence - since the word can be used in many different contexts by a company with large insurance operations, its classification is suspect. One word that I thought was definitely misclassified is "casualty," which in a generic sense would be negative but in an insurance context is likely neutral.

4.7.2.7 An Even Deeper Dive into Misclassification of the Word “casualty”

AFINN rates the word "casualty" as a -2, and given that it has 121 mentions, it could have a meaningful influence on the overall sentiment score. Let's examine which year's letter uses the word "casualty" the most using this code:

brk_casualty <- brk_afinn %>%
  filter(word == "casualty") %>%   # filter for just the word "casualty"
  group_by(year) %>%               # group by year
  count(word, sort = TRUE) %>%     # count instances and sort by frequency
  ungroup()
brk_casualty %>% print(n=5)
## # A tibble: 43 x 3
##   year  word         n
##   <chr> <chr>    <int>
## 1 1984  casualty    10
## 2 1977  casualty     7
## 3 1986  casualty     7
## 4 1980  casualty     6
## 5 1988  casualty     5
## # ... with 38 more rows

We see that the 1984 letter has 10 mentions, the highest. Let's examine the context in which the word "casualty" is used, as shown in Figure 4.5.


Figure 4.5: Instances of the word casualty in 1984 letter

I was right. The word "casualty" is used 9 out of 10 times as part of the bigram "property/casualty," a term specific to the insurance industry. In fact, if you read the first two instances, the sentences are actually extremely positive. To see how our results are skewed for that year, we first look at the overall AFINN score for the 1984 letter, which is 129. Since "casualty" is used 10 times and has a value of -2, treating it as neutral would raise the score to 149. And this is just one word in one year's letter. I am not sure we can rely on these sentiment numbers. But again, the AFINN lexicon has much lower scores for "obvious" years like 1974, 2001 and 2008, so it's not completely inaccurate.
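
As a quick check on that arithmetic, we can recompute the 1984 AFINN score with "casualty" treated as neutral (a sketch; it simply zeroes out that one word):

brk_afinn %>%
  filter(year == "1984") %>%
  mutate(sentiment = ifelse(word == "casualty", 0, sentiment)) %>%  # treat "casualty" as neutral
  summarise(afinn_adjusted = sum(sentiment))                        # 129 + 10 * 2 = 149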

Next, let's examine a sentiment lexicon that was expressly designed for financial documents - the Loughran-McDonald Sentiment Word Lists.

4.7.3 Loughran-McDonald Sentiment Lexicon

4.7.3.1 Description

As noted in Section 4.6, the Loughran-McDonald lexicon was designed specifically for financial documents. Unlike Bing and AFINN, it classifies words into several categories beyond positive and negative (uncertainty, litigious, constraining and superfluous). Of its 3,917 words, only 354 are positive while 2,355 are negative.

4.7.3.2 Combining Loughran with Berkshire Corpus

The following code creates a new dataframe brk_lough by matching words in the Berkshire letters with words in the Loughran lexicon.

lough <- get_sentiments(lexicon = "loughran")   # keep the full lexicon for later use
brk_lough <- brk_words %>%
  left_join(get_sentiments(lexicon = "loughran"), by = "word") %>%       # match each word against the Loughran lexicon
  mutate(sentiment = ifelse(is.na(sentiment), "neutral", sentiment)) %>% # unmatched words are marked neutral
  group_by(year) %>%
  ungroup()
brk_lough %>% print(n=5)
## # A tibble: 201,855 x 3
##   year  word         sentiment
##   <chr> <chr>        <chr>    
## 1 1971  stockholders neutral  
## 2 1971  berkshire    neutral  
## 3 1971  hathaway     neutral  
## 4 1971  pleasure     positive 
## 5 1971  report       neutral  
## # ... with 201,850 more rows

Unlike the previous lexicons, there are seven different categories (including the "neutral" value we assigned to unmatched words). To see the unique levels we use the following code:

#to see unique levels
brk_lough %>%
  select(sentiment) %>%
  unique()
## # A tibble: 7 x 1
##   sentiment   
##   <chr>       
## 1 neutral     
## 2 positive    
## 3 negative    
## 4 uncertainty 
## 5 litigious   
## 6 constraining
## 7 superfluous

4.7.3.3 Calculating Loughran Sentiment

Since we only want positive and negative sentiment, we use the following code to calculate a score in the same way we calculated the Bing score (the remaining Loughran categories are left out of the total).

#sentiment by year
brk_lough_year <- brk_lough %>%  
  count(year, sentiment) %>%
  spread(key = sentiment, value = n) %>%
  mutate(total = (positive + negative + neutral)) %>%   # create "total" column 
  mutate(lough = ((positive - negative) / total)) %>% # create column with calculated score
  select("year", "positive", "negative", "neutral", "total", "lough")  # reorder columns
brk_lough_year %>% print(n=5)
## # A tibble: 49 x 6
##   year  positive negative neutral total    lough
##   <chr>    <int>    <int>   <int> <int>    <dbl>
## 1 1971        36       22     554   612  0.0229 
## 2 1972        41       22     640   703  0.0270 
## 3 1973        23       32     778   833 -0.0108 
## 4 1974        27       59     628   714 -0.0448 
## 5 1975        44       43     853   940  0.00106
## # ... with 44 more rows

4.7.3.4 Analysis of Loughran Sentiment

Below is a chart similar to the ones above to analyze the Loughran sentiment results.

g_lough <- ggplot(brk_lough_year, aes(year, lough, fill = lough > 0)) +  # fill shows years below zero in red
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("lightcoral", "lightsteelblue3")) +
  labs(x = NULL, y = "Sentiment Score",
       title = "Loughran Sentiment Scores for Berkshire Letters") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
g_lough

The results look strikingly different. While the "obvious" years of 1974, 2001 and 2008 look in order, the overall sentiment is negative for just about the entire period. The reason for this might be that, of the lexicon's 3,917 words, 2,355 are negative while only 354 are positive. Nevertheless, we will take a closer look at how this lexicon classifies words to further understand why the scores are so negative.
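
The class balance can be checked directly from the lough dataframe created above (a quick sketch; output not shown):

lough %>%
  count(sentiment, sort = TRUE)   # how many lexicon words fall in each category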

4.7.3.5 A Deeper Look Into How Loughran is Classifying Words

Let's do a similar analysis and look at the most frequently used negative words. We see that "loss" and "losses" top the list, along with "question" and "questions." I'm not sure why "question" and "questions" would be negative. But we are starting to see a pattern across all three lexicons with the words "loss" and "losses."

brk_lough %>%
  filter(sentiment == "negative") %>%
  count(word, sentiment, sort = TRUE) %>% 
  print(n=5)
## # A tibble: 955 x 3
##   word      sentiment     n
##   <chr>     <chr>     <int>
## 1 loss      negative    473
## 2 losses    negative    436
## 3 questions negative    141
## 4 question  negative    118
## 5 bad       negative    105
## # ... with 950 more rows

The word “loss” appears at the top of the list for all three sentiment lexicons we have used so far. Here is the list again for Bing.

brk_bing %>%
  filter(sentiment == "negative") %>%
  count(word, sentiment, sort = TRUE) %>%
  print(n=3)
## # A tibble: 1,412 x 3
##   word   sentiment     n
##   <chr>  <chr>     <int>
## 1 loss   negative    473
## 2 losses negative    436
## 3 debt   negative    223
## # ... with 1,409 more rows

For AFINN, "loss" is number one, but the word "losses" does not show up. Let's see why.

brk_afinn %>%
  count(word, sentiment) %>%
  #filter(sentiment <=-4) %>%
  mutate(total = sentiment * n) %>%
  arrange(total)%>%
  print(n=3)
## # A tibble: 1,229 x 4
##   word  sentiment     n total
##   <chr>     <dbl> <int> <dbl>
## 1 loss         -3   473 -1419
## 2 debt         -2   223  -446
## 3 risk         -2   200  -400
## # ... with 1,226 more rows

Using the code below, we can filter for “losses” in the brk_afinn dataframe. Interestingly, the word does not appear which means that it is not included in the AFINN lexicon.

brk_afinn %>%
  filter(word == "losses")
## # A tibble: 0 x 3
## # ... with 3 variables: year <chr>, word <chr>, sentiment <dbl>

4.7.3.6 A Deeper Dive Into the Classification of the Words "loss" and "losses" in the Three Sentiment Lexicons We Have Looked at So Far

Let’s do some data wrangling to see which letters use the word “loss” and “losses” the most.

brk_loss_frequency <- brk_words %>%
  filter(word == "losses" | word == "loss") %>%         # filters for words "losses" and "loss"
  count(year, word) %>%                                 # count instances per year
  tidyr::spread(key = word, value = n, fill = 0) %>%    # combines rows and fills nas with 0
  mutate(total = loss + losses) %>%                     # creates new column with total
  arrange(desc(total))                                  # sorts total column in descending order of frequency
brk_loss_frequency %>% print(n=5)
## # A tibble: 49 x 4
##   year   loss losses total
##   <chr> <dbl>  <dbl> <dbl>
## 1 2001     24     30    54
## 2 1984     23     17    40
## 3 1990     16     21    37
## 4 2008     18     17    35
## 5 1985     18     16    34
## # ... with 44 more rows

The years 2001 and 2008 are near the top of the list and are among our "obvious" years. But strangely, 1974 is absent - it ranks 37th of 49 in frequency of the two words - even though the 1973-1974 bear market was one of the worst in history. To explore this a bit further, let's look at the frequency of these two words in chronological order.

brk_loss_frequency$year <- as.numeric(brk_loss_frequency$year) # change "year" column to numeric 
g4 <- ggplot(brk_loss_frequency, aes(x = year, y = total)) +
  geom_col(fill = "dodgerblue3", alpha = 0.65) +
   theme_classic() + ylab("Frequency of Words ''loss'' and ''losses'' ")
g4 + theme(axis.title.x = element_blank()) 

We see a couple of things. The usage fluctuates a lot, which could be due to the markets or could be specific to the insurance operations. Also, we are just looking at raw frequency; it might be better to calculate the percentage of total negative words, as I did with the word "worth" previously.

4.7.3.7 Seeing Impact of the Words “loss” and “losses” in the Loughran Lexicon

With the code below, I do some data wrangling, the result of which is a dataframe with the words “loss” and “losses” as percent of negative words and total words for each year in the Loughran lexicon.

#create dataframe with frequency of "loss" and "losses" in descending order
scrap2 <- brk_lough %>%
  filter(word == "losses" | word == "loss") %>%          # filter for the words "loss" and "losses"
  count(year, word) %>%
  tidyr::spread(key = word, value = n, fill = 0) %>%     # one column per word; fill missing years with 0
  mutate(total_loss = loss + losses) %>%                 # new column with the combined total
  arrange(desc(total_loss))                              # sort in descending order of frequency

#join dataframes and create new columns
brk_loss_lough <- left_join(brk_lough_year, scrap2, by = "year") %>%  # join scrap2 dataframe with brk_lough_year
  mutate(pct_of_neg = (total_loss / negative *100)) %>%               # create new column percent of negative words
  mutate(pct_of_total = (total_loss / total *100)) %>%                # create new column percent of total words
  select("year", "total_loss",  "negative", "pct_of_neg","total", "pct_of_total")

#output of the words "loss" and "losses" as percent of negative words and total words 
brk_loss_lough %>%
    select("year", "pct_of_neg", "pct_of_total") %>%
    arrange(desc(pct_of_total)) %>% print(n=5)
## # A tibble: 49 x 3
##   year  pct_of_neg pct_of_total
##   <chr>      <dbl>        <dbl>
## 1 1975        37.2        1.70 
## 2 1974        20.3        1.68 
## 3 2001        22.1        1.17 
## 4 1984        17.5        0.817
## 5 2008        14.3        0.694
## # ... with 44 more rows

Then to see the frequency of the use of the words and the percent of negative words and total words in a graphical form, I go through the following acrobatics.

# put together new dataframe "scrap3"
scrap3 <- brk_loss_lough %>%
  select("year", "pct_of_neg", "pct_of_total", "total_loss" ) 

# convert to long form and create new dataframe "scrap3_long"
scrap3_long <- scrap3 %>%
  gather("pct_of_neg", "pct_of_total", "total_loss", key = type, value = percent)

# chart parameters and output
scrap3_long %>%
  ggplot(aes(x = year, y = percent, fill = type)) +
  geom_col() +
  facet_wrap(~ type, ncol = 1, scales = "free_y",
             labeller = as_labeller(c(pct_of_neg = "''Loss'' and ''Losses'' as % of Negative Words ", 
                                        pct_of_total = "''Loss'' and ''Losses'' as % of Total Words", 
                                        total_loss = "Frequency of words ''Loss'' and ''Losses''"))) + 
  theme_bw() +
  theme(legend.position = "none") + 
  xlab(NULL) + ylab(NULL) +
  theme(axis.text.x=element_text(angle=90, vjust=.5, hjust=0),panel.grid = element_blank())  

What we see is that "loss" and "losses" account for a very high percentage of negative words in 1974 (20%) and 1975 (37%), so it is surprising that the overall Loughran score for 1975 is positive. And while "loss" and "losses" are used most often in the 2001 letter, they make up only 22% of the negative words used in that year.

We can see this in the following table: because the 1974 and 1975 letters are so much shorter, a lower raw frequency has a bigger relative impact than it does in the much longer 2001 letter.

brk_loss_lough %>%
  filter(year == "1974" | year == "1975" | year == "2001")
## # A tibble: 3 x 6
##   year  total_loss negative pct_of_neg total pct_of_total
##   <chr>      <dbl>    <int>      <dbl> <int>        <dbl>
## 1 1974          12       59       20.3   714         1.68
## 2 1975          16       43       37.2   940         1.70
## 3 2001          54      244       22.1  4622         1.17

Over the entire period, "loss" and "losses" account for 12.9% of all negative words. It strikes me as odd that these words would be used so much in a negative context in Berkshire's letters. They are probably being used in conjunction with the insurance operations in a neutral context, but that's just a guess. At this point, we would have to see how these words are used in context to determine if they are misclassified.
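
That overall percentage can be computed directly from the brk_loss_lough dataframe (a quick sketch of the aggregate calculation):

brk_loss_lough %>%
  summarise(pct_of_neg_overall = sum(total_loss) / sum(negative) * 100)  # "loss" + "losses" as % of all negative words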

I think it would be interesting to look more closely at the 1975 letter, because the words "loss" and "losses" are used a lot yet the overall sentiment for the year is positive. This is shown in the table below, where the last column has the overall sentiment for that year. While "loss" and "losses" appear 16 times in 1975 and comprise 37% of the negative words, the overall score for the year is positive - strange…

brk_loss_lough %>%
  select("year", "total_loss", "pct_of_neg") %>%                      # selects certain columns to join
  left_join(brk_lough_year, brk_loss_lough,  by = "year") %>%         # joins the two dataframes
  filter(year == "1974" | year == "1975" | year == "2001") %>%        # selects certain years
  select("year", "total_loss", "negative", "pct_of_neg",
         "total", "lough")                                            # reorders columns
## # A tibble: 3 x 6
##   year  total_loss negative pct_of_neg total    lough
##   <chr>      <dbl>    <int>      <dbl> <int>    <dbl>
## 1 1974          12       59       20.3   714 -0.0448 
## 2 1975          16       43       37.2   940  0.00106
## 3 2001          54      244       22.1  4622 -0.0307

Reading through the 1975 letter, it appears that of the 16 instances, some refer to losses in prior years, some to losses in the businesses in the current year, and a number relate to the insurance operations. In one case the letter talks about the company having the "lowest loss ratio," which is actually positive; in the banking operations, the term "net loan losses" is an industry term and not necessarily negative. Here again, looking at bigrams or trigrams such as "loss ratio" or "net loan losses" and classifying them separately from "loss" and "losses" might improve the accuracy of the sentiment analysis. This also might explain why the Loughran lexicon shows Berkshire having consistently negative sentiment - it could be driven by terms that in context would be classified as neutral.

Incidentally, I went back to look at how the Loughran lexicon treats the word “casualty” which we saw was misclassified by the other lexicons. It classified it as neutral which in the case of Berkshire, is correct. I talk about this a little more in the notes in Chapter 10.

4.7.3.8 Final Thoughts on Loughran Lexicon

While it was concerning that sentiment appeared to be negative in most of the time periods - in contrast to Bing and AFINN - as long as the apparent misclassifications are consistent over time, they won’t prove to be a problem. We will keep this in mind as we go through the remaining lexicons.

4.7.4 NRC Sentiment Lexicon

4.7.4.1 Description

The NRC Emotion Lexicon is a list of 5,636 English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were done manually through crowdsourcing on Amazon Mechanical Turk, as detailed in a paper by Mohammad and Turney (2013). The unique levels are revealed using the following code.

#to see unique levels
tidytext::get_sentiments("nrc")  %>%
  select(sentiment) %>%
  unique()
## # A tibble: 10 x 1
##    sentiment   
##    <chr>       
##  1 trust       
##  2 fear        
##  3 negative    
##  4 sadness     
##  5 anger       
##  6 surprise    
##  7 positive    
##  8 disgust     
##  9 joy         
## 10 anticipation

4.7.4.2 Combining NRC with Berkshire Corpus

Next we filter the lexicon for just the positive and negative categories, then create a new dataframe brk_nrc by matching it with the words in the Berkshire letters.

# create dataframe with just NRC sentiment
nrc <- get_sentiments("nrc") %>%
  filter(sentiment == "negative" | sentiment == "positive") #filter for just sentiment

# join NRC with "brk_words"
brk_nrc <- brk_words %>%
  left_join(nrc, by = "word") %>%
  mutate(sentiment = ifelse(is.na(sentiment), "neutral", sentiment)) %>%
  group_by(year) %>%
  ungroup()
brk_nrc %>% print(n=5)
## # A tibble: 202,647 x 3
##   year  word         sentiment
##   <chr> <chr>        <chr>    
## 1 1971  stockholders neutral  
## 2 1971  berkshire    neutral  
## 3 1971  hathaway     neutral  
## 4 1971  pleasure     neutral  
## 5 1971  report       neutral  
## # ... with 202,642 more rows

4.7.4.3 Calculating NRC Sentiment

Then we use the following code to calculate sentiment in the same way that we calculated Bing and Loughran sentiment and aggregate by year to get a sentiment score for each year.

#sentiment by year
brk_nrc_year <- brk_nrc %>%  
  count(year, sentiment) %>%
  spread(key = sentiment, value = n) %>%
  mutate(total = (positive + negative + neutral)) %>%   # create "total" column 
  mutate(nrc = ((positive - negative) / total)) %>% # create column with calculated score
  select("year", "positive", "negative", "neutral", "total", "nrc")  # reorder columns
brk_nrc_year %>% print(n=5)
## # A tibble: 49 x 6
##   year  positive negative neutral total    nrc
##   <chr>    <int>    <int>   <int> <int>  <dbl>
## 1 1971        80       27     513   620 0.0855
## 2 1972       107       32     586   725 0.103 
## 3 1973       107       44     709   860 0.0733
## 4 1974        83       60     594   737 0.0312
## 5 1975       159       57     758   974 0.105 
## # ... with 44 more rows

We have achieved what we set out to do - create a dataframe brk_nrc_year with a column for year and a second column of with the NRC sentiment score.

4.7.4.4 Analysis of NRC Sentiment

We can now chart the sentiment scores for each year, shown below. We see that the NRC lexicon scores sentiment consistently higher than zero, which must mean it is classifying words differently than the other lexicons. But our "obvious" years - 1974, 2001 and 2008 - are noticeably lower.

g5 <- ggplot(brk_nrc_year, aes(year, nrc)) +
  geom_col(show.legend = FALSE, fill = "lightsteelblue3") +
  labs(x= NULL,y=NULL,
       title="NRC Sentiment Scores for Berkshire Letters")+
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) 
g5

4.7.4.5 A Deeper Look Into How NRC is Classifying Words

If we look at the frequency of positive words, we see the top 10 list populated with words classified as positive that should probably be classified as neutral, like "share," "money," "worth," "equity" and "management." That explains why the sentiment is so consistently high.

brk_nrc %>%
  filter(sentiment == "positive") %>%
  count(word, sentiment, sort = TRUE)
## # A tibble: 1,236 x 3
##    word       sentiment     n
##    <chr>      <chr>     <int>
##  1 share      positive    627
##  2 major      positive    618
##  3 money      positive    490
##  4 income     positive    488
##  5 cash       positive    416
##  6 assets     positive    373
##  7 worth      positive    369
##  8 equity     positive    364
##  9 gain       positive    352
## 10 management positive    352
## # ... with 1,226 more rows

When I looked at the frequency of negative words, I saw words classified as negative that in the context of Berkshire wouldn't be negative, such as "tax," "bonds," "stocks," "debt" and "casualty." But then I noticed that "outstanding" is classified as negative, as is "income."

brk_nrc %>%
  filter(sentiment == "negative") %>%
  count(word, sentiment, sort = TRUE)
## # A tibble: 1,209 x 3
##    word        sentiment     n
##    <chr>       <chr>     <int>
##  1 tax         negative    748
##  2 income      negative    488
##  3 loss        negative    473
##  4 bonds       negative    303
##  5 stocks      negative    266
##  6 debt        negative    223
##  7 outstanding negative    214
##  8 risk        negative    200
##  9 expenses    negative    157
## 10 inflation   negative    125
## # ... with 1,199 more rows

It seemed odd to me that "income" showed up on both the positive and negative lists and that "outstanding" showed up as negative. I thought there might have been an error in how I collected or coded the data, but I couldn't find any problems. When I filter the NRC lexicon for the words "income" and "outstanding," they show up on both lists.

nrc %>%
  filter(word == "outstanding" | word == "income") %>%
  count(word, sentiment, sort = TRUE)
## # A tibble: 4 x 3
##   word        sentiment     n
##   <chr>       <chr>     <int>
## 1 income      negative      1
## 2 income      positive      1
## 3 outstanding negative      1
## 4 outstanding positive      1

I emailed the author of the lexicon but have yet to hear back. Chances are it is my mistake.

4.7.5 Syuzhet Sentiment Lexicon

4.7.5.1 Description

The Syuzhet lexicon is part of the Syuzhet package developed by Matthew Jockers. The lexicon was developed in the Nebraska Literary Lab and contains 10,748 words which are assigned a score of -1 to 1 with 16 gradients (Jockers and Thalken 2020). As seen in the descriptive statistics below, the mean is -0.213.

syuzhet <- get_sentiment_dictionary("syuzhet") #get lexicon and create dataframe
#descriptive statistics
summary(syuzhet)
##      word               value       
##  Length:10748       Min.   :-1.000  
##  Class :character   1st Qu.:-0.750  
##  Mode  :character   Median :-0.400  
##                     Mean   :-0.213  
##                     3rd Qu.: 0.400  
##                     Max.   : 1.000

4.7.5.2 Combining Syuzhet Lexicon with Berkshire Corpus

Since the Syuzhet lexicon does not contain any neutral words, we can use the inner_join command, which automatically filters out words in the brk_words dataframe that are not found in the Syuzhet lexicon.

brk_syuzhet <- brk_words %>%      #creates new dataframe "brk_syuzhet"
  inner_join(syuzhet, by = "word") #just joins words in syuzhet
brk_syuzhet
## # A tibble: 47,205 x 3
##    year  word       value
##    <chr> <chr>      <dbl>
##  1 1971  pleasure    0.5 
##  2 1971  excluding  -0.8 
##  3 1971  gains       0.5 
##  4 1971  equity      1   
##  5 1971  inadequate -0.75
##  6 1971  benefits    1   
##  7 1971  continue    0.4 
##  8 1971  objective   0.4 
##  9 1971  management  0.4 
## 10 1971  improve     0.75
## # ... with 47,195 more rows

4.7.5.3 Calculating Syuzhet Sentiment by Year

As you can see from the previous table, Syuzhet assigns each word a sentiment score from -1 to +1. In order to produce a sentiment score by year, we simply sum up all the values for a particular year using the following code:

#sentiment by year
brk_syuzhet_year <- brk_syuzhet %>%
  group_by(year) %>%
  summarise(syuzhet = sum(value)) %>%
  ungroup()
brk_syuzhet_year %>% print(n=5)
## # A tibble: 49 x 2
##   year  syuzhet
##   <chr>   <dbl>
## 1 1971     36.0
## 2 1972     56.6
## 3 1973     43.8
## 4 1974      8.6
## 5 1975     69.2
## # ... with 44 more rows

4.7.5.4 Analysis of Syuzhet Sentiment

The following chart shows the Syuzhet sentiment score for each of the 49 letters. The sentiment is positive throughout the entire period, which is a bit odd because, of the 10,748 total words, 7,161 are negative and only 3,587 are positive. But it is encouraging that our "obvious" years of 1974, 2001 and 2008 are the lowest. A deeper dive is needed to investigate why the scores are so positive.

g_syuzhet <- ggplot(brk_syuzhet_year, aes(year, syuzhet)) + 
  geom_col(show.legend = FALSE, fill = "lightsteelblue3") +
  labs(x= NULL,y="Sentiment Score",
       title="Syuzhet Sentiment Scores for Berkshire Letters")+
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) 
g_syuzhet

4.7.5.5 A Deeper Look Into How Syuzhet is Classifying Words

Since the Syuzhet lexicon scores words from -1 to 1, I decided to multiply the sentiment score for each word by the number of times it is used to get a sense of the overall impact a given word has, using the following code (exactly what I did with the AFINN lexicon). Words that are used frequently and/or have high scores result in high impact.

In my opinion, most of the words in the top 10 list below should be classified as neutral rather than positive. This would explain why the sentiment scores are consistently high.

brk_syuzhet %>%
  count(word, value) %>%
  mutate(impact = value * n) %>%
  arrange(desc(impact))
## # A tibble: 3,904 x 4
##    word      value     n impact
##    <chr>     <dbl> <int>  <dbl>
##  1 equity     1      364   364 
##  2 share      0.5    627   314.
##  3 money      0.6    490   294 
##  4 worth      0.75   369   277.
##  5 gain       0.75   352   264 
##  6 major      0.4    618   247.
##  7 assets     0.5    373   186.
##  8 gains      0.5    367   184.
##  9 cash       0.4    416   166.
## 10 ownership  0.6    268   161.
## # ... with 3,894 more rows

When we examine negative words sorted by impact we see some of the same words that were problematic in the other lexicons - “loss,” “losses” and “casualty.” There are other words which have a sizable impact such as “tax” and “debt” which in Berkshire’s case should likely be categorized as neutral.

brk_syuzhet %>%
  count(word, value) %>%
  mutate(impact = value * n) %>%
  arrange(impact)
## # A tibble: 3,904 x 4
##    word      value     n impact
##    <chr>     <dbl> <int>  <dbl>
##  1 loss      -0.75   473 -355. 
##  2 losses    -0.8    436 -349. 
##  3 tax       -0.25   748 -187  
##  4 debt      -0.75   223 -167. 
##  5 risk      -0.75   200 -150  
##  6 casualty  -0.75   121  -90.8
##  7 economic  -0.25   339  -84.8
##  8 bad       -0.75   105  -78.8
##  9 inflation -0.6    125  -75  
## 10 liability -0.5    131  -65.5
## # ... with 3,894 more rows

4.7.5.6 Final Words on Syuzhet Lexicon

This lexicon presents the same issues as the previous lexicons in analyzing the Berkshire letters. Since it wasn't designed for analyzing financial and business texts, we can hardly criticize it for that.

4.7.6 SenticNet4 Sentiment Lexicon

The SenticNet and Sentiword lexicons are both accessed through the lexicon package in R developed by Tyler Rinker (2018).

4.7.6.1 Description

SenticNet was developed at the MIT Media Laboratory in 2009. It replaces bag-of-words models “…with a new model that sees text as a bag of concepts and narratives” (SenticNet n.d.). It contains 23,626 words (11,774 positive and 11,852 negative) scored on a continuous scale between -1 and 1. The mean is -0.062.

While the authors of the lexicon say that SenticNet can be used like any other lexicon, they acknowledge that the "right way" to use it is for the task of polarity detection in conjunction with sentic patterns (Cambria et al. 2016). As I don't yet understand what "polarity detection" or "sentic patterns" are, I will use it like any other lexicon.

#create dataframe with lexicon
senticnet <- hash_sentiment_senticnet
#rename columns
senticnet <- senticnet %>% rename(word = x, value = y)
#descriptive statistics
summary(senticnet)
##      word               value         
##  Length:23626       Min.   :-0.98000  
##  Class :character   1st Qu.:-0.63000  
##  Mode  :character   Median :-0.02000  
##                     Mean   :-0.06175  
##                     3rd Qu.: 0.55900  
##                     Max.   : 0.98200

4.7.6.2 Combining SenticNet Sentiment Lexicon With Berkshire Corpus

Using the code below, we match up the words that are in the SenticNet lexicon with the corresponding words in the Berkshire letters. Using the inner_join command automatically filters out words in the brk_words dataframe which are not found in the SenticNet lexicon.

brk_senticnet <- brk_words %>%
  inner_join(senticnet, by = "word") 

brk_senticnet 
## # A tibble: 114,305 x 3
##    year  word       value
##    <chr> <chr>      <dbl>
##  1 1971  pleasure   0.531
##  2 1971  report    -0.51 
##  3 1971  earnings   0.828
##  4 1971  capital    0.067
##  5 1971  beginning -0.8  
##  6 1971  equity     0.037
##  7 1971  result     0.044
##  8 1971  average    0.071
##  9 1971  american   0.097
## 10 1971  industry   0.123
## # ... with 114,295 more rows

4.7.6.3 Calculating SenticNet Sentiment

#sentiment by year
brk_senticnet_year <- brk_senticnet  %>%
  group_by(year) %>%
  summarise(senticnet  = sum(value)) %>%
  ungroup()
brk_senticnet_year %>% print(n=5)
## # A tibble: 49 x 2
##   year  senticnet
##   <chr>     <dbl>
## 1 1971       60.1
## 2 1972       76.7
## 3 1973       60.5
## 4 1974       33.2
## 5 1975       89.9
## # ... with 44 more rows

4.7.6.4 Analysis of SenticNet Sentiment

g_senticnet  <- ggplot(brk_senticnet_year, aes(year, senticnet)) + 
  geom_col(show.legend = FALSE , fill = "lightsteelblue3") +
  labs(x= NULL,y="Sentiment Score",
       title="Senticnet Sentiment Scores for Berkshire Letters")+
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) 
g_senticnet 

4.7.6.5 A Deeper Look Into How SenticNet is Classifying Words
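
As a starting point for this section, the same impact ranking used for AFINN and Syuzhet (score times frequency) could be applied to the SenticNet matches. A sketch (output not shown):

brk_senticnet %>%
  count(word, value) %>%
  mutate(impact = value * n) %>%   # impact = score times frequency
  arrange(desc(impact))            # use arrange(impact) to see the most negative words instead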

4.7.7 Sentiword Sentiment Lexicon

4.7.7.1 Description

The Sentiword lexicon was developed to blend manually built and automatically derived lexicon classification methods (Baccianella, Esuli, and Sebastiani 2010; Gatti, Guerini, and Turchi 2016). As accessed through the lexicon package, it contains 20,093 words scored on a continuous scale between -1 and 1. The mean is -0.062.

#create dataframe with lexicon
sentiword  <- hash_sentiment_sentiword
#rename columns
sentiword <- sentiword %>% rename(word = x, value = y)
#descriptive statistics
summary(sentiword)
##      word               value         
##  Length:20093       Min.   :-1.00000  
##  Class :character   1st Qu.:-0.25000  
##  Mode  :character   Median :-0.12500  
##                     Mean   :-0.06174  
##                     3rd Qu.: 0.25000  
##                     Max.   : 1.00000
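
Since summary() doesn’t report how many of those words are positive versus negative, a quick tally of the signs fills that in. This is a minimal sketch; I’m assuming any exactly-zero scores, if present, simply get lumped with the negatives in this rough count.

#tally positive vs. negative words in the Sentiword lexicon
sentiword %>%
  mutate(polarity = if_else(value > 0, "positive", "negative")) %>%
  count(polarity)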

4.7.7.2 Combining Sentiword Sentiment Lexicon With Berkshire Corpus

Using the code below, we match the words in the Sentiword lexicon with the corresponding words in the Berkshire letters. The inner_join command automatically filters out words in the brk_words dataframe that are not found in the Sentiword lexicon.

brk_sentiword <- brk_words %>%
  inner_join(sentiword, by = "word") 

brk_sentiword 
## # A tibble: 46,269 x 3
##    year  word             value
##    <chr> <chr>            <dbl>
##  1 1971  pleasure        0.312 
##  2 1971  beginning       0.125 
##  3 1971  result          0.25  
##  4 1971  average        -0.375 
##  5 1971  inadequate     -0.562 
##  6 1971  improve         0.375 
##  7 1971  return          0.0938
##  8 1971  capitalization  0.125 
##  9 1971  return          0.0938
## 10 1971  rate           -0.125 
## # ... with 46,259 more rows

4.7.7.3 Calculating Sentiword Sentiment

#sentiment by year
brk_sentiword_year <- brk_sentiword  %>%
  group_by(year) %>%
  summarise(sentiword  = sum(value)) %>%
  ungroup()
brk_sentiword_year
## # A tibble: 49 x 2
##    year  sentiword
##    <chr>     <dbl>
##  1 1971      14.7 
##  2 1972      20.8 
##  3 1973      -1.61
##  4 1974      -2.54
##  5 1975      28.0 
##  6 1976      25.7 
##  7 1977      22.5 
##  8 1978      15.8 
##  9 1979      20.8 
## 10 1980      39.0 
## # ... with 39 more rows

4.7.7.4 Analysis of Sentiword Sentiment

g_sentiword <- ggplot(brk_sentiword_year, aes(year, sentiword, fill = sentiword > 0)) + #fill = makes values below zero in red
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values=c("lightcoral","lightsteelblue3"))+
  labs(x= NULL,y="Sentiment Score",
       title="Sentiword Sentiment Scores for Berkshire Letters")+
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) 
g_sentiword 

4.7.7.5 A Deeper Look Into How Sentiword is Classifying Words

4.7.8 SO-CAL Lexicon

4.7.8.1 Description

SO-CAL (Semantic Orientation CALculator) began development in 2004. Words in the lexicon are annotated with their semantic orientation or polarity. The calculation of sentiment in SO-CAL rests on two assumptions: that individual words have polarity independent of context and that this semantic orientation can be expressed as a numerical value. The lexicon was classified with Amazon Mechanical Turk (Taboada et al. 2011).

SO-CAL is written in Python and, since I am still a Python novice, I decided to download the text files from GitHub and do the analysis in R. I downloaded several dictionaries (adjectives, adverbs, nouns, and verbs) and combined them. Since a small number of words appear in multiple dictionaries, I simply took the average score. The dataframe I created contains 5,971 words (2,438 positive and 3,530 negative) scored on a continuous scale between -5 and 5. The mean is -0.34.
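
I did the combining and averaging outside of this document, but a rough sketch of how socal_mean.csv could be assembled in R is shown below. The file names are placeholders for the downloaded dictionary files, and I’m assuming they are tab-separated word/score pairs.

#sketch: build a combined SO-CAL dictionary with averaged duplicates
#file names below are placeholders for the downloaded dictionary files
socal_files <- c("adjectives.txt", "adverbs.txt", "nouns.txt", "verbs.txt")

socal_combined <- socal_files %>%
  lapply(read_tsv, col_names = c("word", "value")) %>%
  bind_rows() %>%
  group_by(word) %>%
  #words appearing in more than one dictionary get the average score
  summarise(value = mean(value)) %>%
  ungroup()

write_csv(socal_combined, "socal_mean.csv")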

#Import SOCAL lexicon to dataframe
socal<- read_csv("C:\\Users\\psonk\\Dropbox\\111 TC\\R stuff\\NLP\\berkshire\\socal_mean.csv", 
                  col_names = FALSE)
#Data wrangling: rename the two columns
socal <- socal %>% rename(word = X1, value = X2)
#drop the header row that was read in as data
socal <- socal[-c(1),]
#convert the value column from character to numeric
socal$value <- as.numeric(socal$value)

summary(socal)
##      word               value        
##  Length:5971        Min.   :-5.0000  
##  Class :character   1st Qu.:-2.0000  
##  Mode  :character   Median :-1.0000  
##                     Mean   :-0.3406  
##                     3rd Qu.: 2.0000  
##                     Max.   : 5.0000

4.7.8.2 Combining SOCAL Sentiment Lexicon With Berkshire Corpus

Using the code below, we match up the words that are in the SOCAL lexicon with the corresponding words in the Berkshire letters. Using the inner_join command automatically filters out words in the brk_words dataframe which are not found in the SOCAL lexicon.

brk_socal <- brk_words %>%
  inner_join(socal, by = "word")

brk_socal %>% print(n=5)
## # A tibble: 28,313 x 3
##   year  word       value
##   <chr> <chr>      <dbl>
## 1 1971  pleasure       3
## 2 1971  average        1
## 3 1971  inadequate    -2
## 4 1971  improve        2
## 5 1971  difficult     -2
## # ... with 28,308 more rows

4.7.8.3 Calculating SOCAL Sentiment by Year

#sentiment by year
brk_socal_year <- brk_socal  %>%
  group_by(year) %>%
  summarise(socal  = sum(value)) %>%
  ungroup()
brk_socal_year %>% print(n=5)
## # A tibble: 49 x 2
##   year  socal
##   <chr> <dbl>
## 1 1971   63.5
## 2 1972  118. 
## 3 1973  102. 
## 4 1974   17.5
## 5 1975  136. 
## # ... with 44 more rows

4.7.8.4 Analysis of SOCAL Sentiment

g_socal <- ggplot(brk_socal_year, aes(year, socal, fill = socal > 0)) + #fill colors bars by the sign of the score
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values=c("lightsteelblue3","lightcoral"))+
  labs(x= NULL,y="Sentiment Score",
       title="SOCAL Sentiment Scores for Berkshire Letters")+
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) 
g_socal

4.7.8.5 A Deeper Look Into How SOCAL is Classifying Words

We see that SOCAL has some of the same issues as the other lexica, with positive words such as “economic,” “share” and “worth” all in our top ten list in terms of impact.

brk_socal %>%
  count(word, value) %>%
  mutate(impact = value * n) %>%
  arrange(desc(impact))
## # A tibble: 2,289 x 4
##    word          value     n impact
##    <chr>         <dbl> <int>  <dbl>
##  1 outstanding    5      214  1070 
##  2 economic       3      339  1017 
##  3 profit         2      442   884 
##  4 excellent      5      145   725 
##  5 gain           2      352   704 
##  6 extraordinary  5      138   690 
##  7 share          1      627   627 
##  8 worth          1.67   369   615.
##  9 special        3      194   582 
## 10 terrific       5       96   480 
## # ... with 2,279 more rows

And it also has the same issues with negative words such as “loss,” “catastrophe” and “casualty.”

brk_socal %>%
  count(word, value) %>%
  mutate(impact = value * n) %>%
  arrange(impact)
## # A tibble: 2,289 x 4
##    word        value     n impact
##    <chr>       <dbl> <int>  <dbl>
##  1 loss           -2   473   -946
##  2 cost           -1   718   -718
##  3 liability      -3   131   -393
##  4 catastrophe    -5    78   -390
##  5 bad            -3   105   -315
##  6 expense        -2   132   -264
##  7 casualty       -2   121   -242
##  8 terrible       -5    46   -230
##  9 worse          -4    54   -216
## 10 difficult      -2   103   -206
## # ... with 2,279 more rows

Examining how often the word “outstanding” appears in each year’s letter:

brk_words %>%
  count(year, word) %>% 
  filter(word == "outstanding") %>%
  group_by(year) %>%
  arrange(desc(n)) %>%
  ungroup() %>% print(n=5)
## # A tibble: 47 x 3
##   year  word            n
##   <chr> <chr>       <int>
## 1 1995  outstanding     9
## 2 1996  outstanding     9
## 3 2000  outstanding     9
## 4 2006  outstanding     9
## 5 2012  outstanding     9
## # ... with 42 more rows

4.8 Correlation of Sentiment Measures

Because several of the lexica are on different scales, I computed z-scores for each in order to compare them. I’ll show a before-and-after with the AFINN lexicon as an example. Here is the “before” graph of AFINN sentiment.

The next step is to calculate the z-scores: each observation minus the mean, divided by the standard deviation, \(z = \frac{x- \mu}{\sigma}\). Using the code below, we see the output now has the year plus two columns, one with the raw score and another with the z-score.

brk_afinn_year %>%
  mutate(afinn_z = (afinn - mean(afinn)) / sd(afinn)) %>%
  select(year, afinn, afinn_z)
## # A tibble: 49 x 3
##    year  afinn afinn_z
##    <chr> <dbl>   <dbl>
##  1 1971     34  -1.52 
##  2 1972     64  -1.26 
##  3 1973     55  -1.34 
##  4 1974     -6  -1.87 
##  5 1975    108  -0.867
##  6 1976    105  -0.893
##  7 1977     72  -1.18 
##  8 1978    150  -0.495
##  9 1979    139  -0.593
## 10 1980    155  -0.451
## # ... with 39 more rows

The graphed “after” output looks very different.

Repeating this z-score calculation for all the lexicons results in the graph below.
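
One way to do that repetition in a single step is with dplyr’s across(). The sketch below assumes a hypothetical wide dataframe, brk_sentiment_raw, holding one row per year and one raw-score column per lexicon.

#sketch: z-score every lexicon column in one mutate
#brk_sentiment_raw is a hypothetical wide dataframe of raw yearly scores
brk_sentiment_z <- brk_sentiment_raw %>%
  mutate(across(-year, ~ (.x - mean(.x)) / sd(.x), .names = "{.col}_z"))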

Our “obvious” negative years (1974, 2001 and 2008) are the most negative. Looking back to Figure 4.4, which is shown again below, we see there is some correlation between sentiment and the performance of the S&P 500. But again, this is just foreshadowing; we will go more in-depth into performance and sentiment in Chapter 5.

S&P 500 Annual Performance

Figure 4.6: S&P 500 Annual Performance

4.8.1 Correlation Analysis of Sentiment Lexica

4.8.1.1 Correlation Matrix

The matrix in Figure 4.7 shows the correlations of the various lexicons.

# rename so it looks prettier
brk_all_sentiment <- brk_all_sentiment %>%
  rename(afinn = afinn_z, bing = bing_z, socal = socal_z, lough = lough_z,
         syuzhet = syuzhet_z, senticnet = senticnet_z, sentiword = sentiword_z, nrc = nrc_z)

# ggpairs correlation matrix
corr_for_ggpairs <- ggpairs(brk_all_sentiment, columns = c("afinn", "bing", "socal", "lough", "syuzhet", "senticnet", "sentiword", "nrc", "mean"))

# ggpairs correlation matrix in hi-res
ggsave(corr_for_ggpairs, filename = "corr_for_ggpairs.png", dpi = 300, type = "cairo",
       width = 8, height = 6, units = "in")
Correlation of Sentiment Dictionaries

Figure 4.7: Correlation of Sentiment Dictionaries

I decided to also display the results in a more graphical way using the corrplot package, which helps identify hidden structure and patterns in the matrix. While one could eyeball the matrix and see that certain lexicons are less correlated with others, using the hierarchical clustering order in the corrplot package really makes this relationship jump out.

# dataframe for correlation matrix
m <- brk_all_sentiment %>%
  select(-c(year))

#correlation matrix
corr_matrix <- cor(m)
#as_tibble(corr_matrix) 

#nicer colors
col<- colorRampPalette(c("red", "white", "royalblue"))(20)

#correlation matrix with h clustering
corrplot::corrplot(corr_matrix, method = "number", order = "hclust", addrect = 2, tl.col = "black", tl.srt = 45, col = col, number.cex=.85 )

We see that the NRC, Bing and Loughran lexica are highly correlated with each other and less correlated with the other lexicons. Since these three lexica are binary and the others are more finely graduated, that probably explains the relationship.
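
As a rough check on that explanation, we can compare the average correlation within the binary group to the average correlation between the binary and graduated lexica, reusing corr_matrix from above (this assumes the renamed column names such as afinn, bing and lough carried through):

#rough check: correlation within the binary group vs. against the graduated lexica
binary <- c("nrc", "bing", "lough")
graded <- c("afinn", "socal", "syuzhet", "senticnet", "sentiword")
#average pairwise correlation among the three binary lexica
mean(corr_matrix[binary, binary][upper.tri(corr_matrix[binary, binary])])
#average correlation between binary and graduated lexica
mean(corr_matrix[binary, graded])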

4.8.1.2 Network Plots

These network plots were adapted from an excellent tutorial by Matt Dancho. Using the corrr package shows the hidden structure and patterns in a more visual manner.

g_network_plot <- corr_matrix %>%
  corrr::network_plot(colours = c(palette_light()[[2]], "white", palette_light()[[4]]), legend = TRUE) +
  labs(title = "Correlations of Sentiment Lexica") +
  expand_limits(x = c(-0.5, 0.5), y = c(-0.5, 0.5)) +
  theme_tq() +
  theme(legend.position = "bottom")
g_network_plot

Since there are two distinct groups of lexica, I’m going to redo the network graph without the mean variable included. It’s a slightly cleaner graph than the previous one.

# dataframe for correlation matrix
m2 <- brk_all_sentiment %>%
  select(-c(year, mean))

#correlation matrix
corr_matrix_m2 <- cor(m2)

g_network_plot_m2 <- corr_matrix_m2 %>%
  corrr::network_plot(colours = c(palette_light()[[2]], "white", palette_light()[[4]]), legend = TRUE) +
  labs(title = "Correlations of Sentiment Lexica Without Mean") +
  expand_limits(x = c(-0.5, 0.5), y = c(-0.5, 0.5)) +
  theme_tq() +
  theme(legend.position = "bottom")
g_network_plot_m2

Since I’m exploring different ways to visualize data, I’m going to chart all the lexica over time. In the interest of time, I am going to do this in Excel. First I will write the brk_all_sentiment dataframe to a csv file so I can read it into Excel.

write.csv(brk_all_sentiment,"C:\\Users\\psonk\\Dropbox\\111 TC\\R stuff\\NLP\\berkshire\\brk_all_sentiment.csv", row.names = FALSE)
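
For completeness, here is a sketch of how roughly the same chart could be drawn directly in R instead of Excel. It assumes tidyr is loaded for pivot_longer, and the mean column simply gets plotted as another line.

#sketch: plot all lexica (plus the mean) over time directly in R
g_all_lexica <- brk_all_sentiment %>%
  pivot_longer(-year, names_to = "lexicon", values_to = "z_score") %>%
  ggplot(aes(year, z_score, group = lexicon, color = lexicon)) +
  geom_line() +
  labs(x = NULL, y = "Sentiment Score (z)",
       title = "Sentiment Scores for Berkshire Letters, All Lexica") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
g_all_lexica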

Figure 4.8 shows a graph of all the lexica with the heavy red line being the average. This visualizes the correlation in a slightly different manner.

Graph of All Lexica

Figure 4.8: Graph of All Lexica

Figure 4.9 shows just the “binary” lexica (Bing, Loughran and NRC). We can see these are highly correlated.

Graph of Binary Lexica

Figure 4.9: Graph of Binary Lexica

Figure 4.10 shows just the “continuous” lexica. We can see these are also highly correlated.

Graph of Continuous Lexica

Figure 4.10: Graph of Continuous Lexica

4.8.1.3 Nagging Worry

While the various lexica are fairly highly correlated, I have a nagging worry concerning the misclassification issues I discussed earlier. If they are all misclassifying in the same way, it might throw off my results when I perform regression analysis with Berkshire’s returns.
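
As a rough way to probe this, we can join the lexica on their shared words and see how often they disagree on polarity. Below is a minimal sketch using the senticnet, sentiword and socal dataframes created earlier:

#sketch: polarity agreement on words shared by three of the lexica
shared_signs <- senticnet %>%
  inner_join(sentiword, by = "word", suffix = c("_senticnet", "_sentiword")) %>%
  inner_join(socal, by = "word") %>%
  rename(value_socal = value) %>%
  mutate(all_agree = sign(value_senticnet) == sign(value_sentiword) &
           sign(value_sentiword) == sign(value_socal))
#share of common words where all three lexica agree on polarity
mean(shared_signs$all_agree)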

To do: try upset plots (via the UpSetR package) to show similarities between the lexica: https://cran.r-project.org/web/packages/UpSetR/vignettes/queries.html
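
A minimal sketch of what that might look like, limited to the lexicon dataframes built in this chapter (UpSetR would need to be installed first):

#sketch: upset plot of word overlap across three lexica
library(UpSetR)
lexica_words <- list(
  senticnet = senticnet$word,
  sentiword = sentiword$word,
  socal     = socal$word
)
upset(fromList(lexica_words), order.by = "freq")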

  1. No fancy code used; I just copied and pasted into an MS Word file.↩︎

  2. Update: I emailed Saif Mohammad and he agreed that it seemed strange that words would be classified as both. He gave me a thoughtful explanation of how this can happen due to the way the lexicon was compiled. He suggested I try another lexicon they created that includes intensity of emotions. I will add this to the list of things to follow up on.↩︎