# Chapter 6 Analysis of Sentiment vs. Peformance

## 6.1 Recap

To recap what we have done so far, in previous chapters we have:

- Chapter 2: Obtained and imported the letters into a dataframe.
- Chapter 3: Tokenized and cleaned the data.
- Chapter 4: Reviewed various sentiment lexica and merged with the letters.
- Chapter 5: Obtained, discussed and calculated various performance metrics.

*Finally*, after doing all that prep work we can compare sentiment to returns. To reiterate, it is unlikely that sentiment will be predictive of business or stock performance.

## 6.2 Creating Values for Average Sentiment of Binary and Continuous Lexica

Since some of the lexica are highly correlated and other are not, we will make averages of each group. `avg_binary`

will be an average of the binary lexica (Bing, NRC and Loughran), while `avg_continuous`

will be the average of the continuous lexica (AFINN, SO-CAL, Syuzhet, SenticNet and Sentiword).

```
#create average sentiment for binary and continuous lexica
<- brk_all_sentiment %>%
brk_all_sentiment mutate(avg_binary = rowMeans(select(., bing, nrc, lough))) %>% #average of binary
mutate(avg_continuous = rowMeans(select(.,-c(year, bing, nrc, lough)))) #average of continuous
%>% print(n=5) brk_all_sentiment
```

```
## # A tibble: 49 x 12
## year afinn bing socal lough syuzhet senticnet sentiword nrc mean avg_binary avg_continuous
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1971 -1.52 -0.129 -1.86 2.41 -1.91 -2.04 -1.24 1.23 -0.631 1.17 -1.15
## 2 1972 -1.26 2.03 -1.57 2.74 -1.64 -1.94 -0.935 2.50 -0.00778 2.43 -0.703
## 3 1973 -1.34 0.733 -1.65 -0.245 -1.80 -2.04 -2.07 0.370 -1.01 0.286 -1.37
## 4 1974 -1.87 -3.72 -2.09 -2.93 -2.26 -2.21 -2.12 -2.60 -2.47 -3.08 -2.30
## 5 1975 -0.867 1.19 -1.48 0.691 -1.47 -1.86 -0.572 2.59 -0.222 1.49 -0.712
## # ... with 44 more rows
```

We join these newley created data to the exisiting dataframe `perf_and_sentiment`

.

```
<- perf_data %>%
perf_and_sentiment left_join(brk_all_sentiment, by = "year")
```

## 6.3 Regressions

I’ll run ten pair-wise regressions. The two dependent variables will be the average sentiment measures: `avg_binary`

and `avg_continuous`

. The independent variables will be the five measures of performance (three absolute and two relative) discussed in the previous chapter:

Absolute measures of return:
- *sp500* is the annual return of the Standard & Poor’s 500 index.
- *book* is the percentage change in Berkshire’s book value.
- *brk* is the percentage change Berkshire’s stock price.

Relative measures of return:
- *alpha_book* is *book* versus the *sp500* which is the excess return of book value over the market
- *alpha_brk* is *brk* versus the_sp500_ which is the excess return of stock price over the market

The next four sections will be:

- Binary sentiment lexica and absolute performance
- Continuous sentiment lexica and absolute performance
- Binary sentiment lexica and relative performance
- Continuous sentiment lexica and relative performance

### 6.3.1 Correlation of Binary Sentiment Lexica and Absolute Performance

I’m sure there is an easier way to do this but I’m just learning R. I’ll go through the first one step-by-step^{17} I first created a new dataframe `scrap6`

which contains just the the binary sentiment data and the absolute performance measures.

```
<- perf_and_sentiment %>%
scrap6 select(year, brk, sp500, book, avg_binary, avg_continuous)
tibble(scrap6) %>% print(n=10)
```

```
## # A tibble: 49 x 6
## year brk sp500 book avg_binary avg_continuous
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1971 0.805 0.146 0.164 1.17 -1.15
## 2 1972 0.081 0.189 0.217 2.43 -0.703
## 3 1973 -0.025 -0.148 0.047 0.286 -1.37
## 4 1974 -0.487 -0.264 0.055 -3.08 -2.30
## 5 1975 0.025 0.372 0.219 1.49 -0.712
## 6 1976 1.29 0.236 0.593 1.18 -0.806
## 7 1977 0.468 -0.074 0.319 0.784 -0.874
## 8 1978 0.145 0.064 0.24 1.95 -0.347
## 9 1979 1.02 0.182 0.357 0.439 -0.447
## 10 1980 0.328 0.323 0.193 -0.0111 -0.305
## # ... with 39 more rows
```

But to plot these using `facet_wrap`

I need to convert the dataframe to a “long form” using `dplyr`

. This produces a dataframe with a separate row for each year by return type.

```
<- scrap6 %>%
scrap6_long gather("brk", "sp500", "book", key = type, value = return) %>%
relocate(type, .after = year) #move column
tibble(scrap6_long) %>% print(n=10)
```

```
## # A tibble: 147 x 5
## year type avg_binary avg_continuous return
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1971 brk 1.17 -1.15 0.805
## 2 1972 brk 2.43 -0.703 0.081
## 3 1973 brk 0.286 -1.37 -0.025
## 4 1974 brk -3.08 -2.30 -0.487
## 5 1975 brk 1.49 -0.712 0.025
## 6 1976 brk 1.18 -0.806 1.29
## 7 1977 brk 0.784 -0.874 0.468
## 8 1978 brk 1.95 -0.347 0.145
## 9 1979 brk 0.439 -0.447 1.02
## 10 1980 brk -0.0111 -0.305 0.328
## # ... with 137 more rows
```

Then I feed these data into `ggplot`

to produce the charts using various packages to format and make it look pretty. I was having trouble figuring out how to cross reference `ggplot`

output in `bookdown`

so I converted the plot to a png file using `ggsave`

.

```
#create plot
<- ggplot(scrap6_long, aes(x=avg_binary, y = return, color= type, label = year))+
g8 geom_text(check_overlap = T, size = 3) + #use year as datapoint instead of dot
geom_smooth(method=lm, se = F) + #include regression line
facet_wrap(~ type, nrow = 1, scales = "fixed",
labeller = as_labeller(c(book = "Book Value", sp500 ="S&P 500 Index", brk = "Berkshire Stock" ))) +
theme_bw() +
theme(legend.position = "none") + #get rid of legend
scale_y_continuous(labels = percent) +
#ggpubr::stat_cor(label.y = -.7) + #use this code to include r and p-value in chart
xlab("Binary Sentiment Score") +
ylab("Return") +
stat_cor(aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~")),
label.y = -.7) #use this code to include r and p-value in chart
#ggplot in hi-res as png file
ggsave(g8, filename = "g8.png", dpi = 300, type = "cairo",
width = 8, height = 6, units = "in")
```

```
<- lm(avg_binary ~ sp500, data= perf_and_sentiment)
lm1 summary(lm1)
```

```
##
## Call:
## lm(formula = avg_binary ~ sp500, data = perf_and_sentiment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.05711 -0.50945 -0.04085 0.40432 2.24596
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3230 0.1402 -2.304 0.025684 *
## sp500 2.6604 0.6711 3.964 0.000249 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7985 on 47 degrees of freedom
## Multiple R-squared: 0.2506, Adjusted R-squared: 0.2346
## F-statistic: 15.71 on 1 and 47 DF, p-value: 0.0002494
```

I show the code I used to position the figure and cross reference below. Figure 6.1, shows correlation of

`::include_graphics("g8.png") knitr`

### 6.3.2 Correlation of Continuous Sentiment Lexica and Absolute Performance

In Figure 6.2, we see correlation of…

```
<- ggplot(scrap6_long, aes(x=avg_continuous, y = return, color= type, label = year))+
g9 geom_text(check_overlap = T, size = 3) +
geom_smooth(method=lm, se = F) + #include regression line
facet_wrap(~ type, nrow = 1, scales = "fixed",
labeller = as_labeller(c(book = "Book Value", sp500 ="S&P 500 Index", brk = "Berkshire Stock" ))) +
theme_bw() +
theme(legend.position = "none") + #get rid of legend
scale_y_continuous(labels = percent) +
stat_cor(aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~")),
label.y = -.7) +
xlab("Average Continuous Sentiment Score") +
ylab("Return")
#ggplot in hi-res as png file
ggsave(g9, filename = "g9.png", dpi = 300, type = "cairo",
width = 8, height = 6, units = "in")
```

### 6.3.3 Correlation of Binary Sentiment Lexica and Relative Performance

In Figure 6.3, we see correlation of…

```
#create new dataframe "scrap6"
<- perf_and_sentiment %>%
scrap7 select(year, alpha_brk, alpha_book, avg_binary, avg_continuous)
# convert to long form using "gather" and create new dataframe "scrap6_long"
<- scrap7 %>%
scrap7_long gather("alpha_brk", "alpha_book", key = type, value = return)
<- ggplot(scrap7_long, aes(x=avg_binary, y = return, color= type, label = year))+
g10 geom_text(check_overlap = T, size = 3) +
geom_smooth(method=lm, se = F) + #include regression line
facet_wrap(~ type, nrow = 1, scales = "fixed",
labeller = as_labeller(c(alpha_book = "Berkshire Book Value Excess Return over S&P 500", alpha_brk ="Berkshire Stock Excess Return over S&P 500")))+
theme_bw() +
theme(legend.position = "none") + #get rid of legend
scale_y_continuous(labels = percent) +
::stat_cor(aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~")),
ggpubrlabel.y = -.5)+
xlab(" Binary Sentiment Score") +
ylab("Return")
#ggplot in hi-res as png file
ggsave(g10, filename = "g10.png", dpi = 300, type = "cairo",
width = 8, height = 6, units = "in")
```

### 6.3.4 Correlation of Continuous Sentiment Lexica and Relative Performance

In Figure 6.4, we see correlation of…

```
<- ggplot(scrap7_long, aes(x=avg_continuous, y = return, color= type, label = year))+
g11 geom_text(check_overlap = T, size = 3) +
geom_smooth(method=lm, se = F) + #include regression line
facet_wrap(~ type, nrow = 1, scales = "fixed",
labeller = as_labeller(c(alpha_book = "Berkshire Book Value Excess Return over S&P 500", alpha_brk ="Berkshire Stock Excess Return over S&P 500")))+
theme_bw() +
theme(legend.position = "none") + #get rid of legend
scale_y_continuous(labels = percent) +
::stat_cor(aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~")),
ggpubrlabel.y = -.5)+
xlab("Continuous Sentiment Score") +
ylab("Return")
#ggplot in hi-res as png file
ggsave(g11, filename = "g11.png", dpi = 300, type = "cairo",
width = 8, height = 6, units = "in")
```

Alternative format for charts from the `ggpubr`

package

```
#cool charts from ggpubr package
<- ggscatter(
g12 x = "avg_binary", y = "return",
scrap6_long, color = "type",
add = "reg.line") +
scale_y_continuous(labels = percent) +
facet_wrap(~ type, nrow = 1, scales = "fixed",
labeller = as_labeller(c(book = "Book Value", sp500 ="S&P 500 Index", brk = "Berkshire Stock" ))) +
::stat_cor(aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~")),
ggpubrlabel.y = -.6)+
theme(legend.position = "none") +
xlab("Average Binary Sentiment Score") +
ylab("Return")
#ggplot in hi-res as png file
ggsave(g12, filename = "g12.png", dpi = 300, type = "cairo",
width = 8, height = 6, units = "in")
```

## 6.4 Discussion of Results

```
<- perf_data %>%
scrap8 arrange(desc(alpha_book))
#gt(scrap8[1:10,])%>%
# fmt_percent(
# columns = 2:6,
# decimals = 1) %>%
# tab_header(title = "Berkshire Performance and Sentiment")
<- perf_and_sentiment %>%
scrap9 select(year, brk, sp500, book, alpha_brk, alpha_book, avg_binary, avg_continuous)
::datatable(scrap9) %>%
DTformatPercentage(columns = 2:6, digits = 1) %>%
formatRound(columns = 7:8, digits = 1)
```

```
#DT::datatable(perf_data) %>%
# formatPercentage(columns = 2:6, digits = 1)
```

## 6.5 Examination of Residuals and Assumptions for Linear Regression

Basically so I can remember what I did when I refer back to this months from now.↩︎