Chapter 2 Data Acquisition (Text extraction and pre-processing)

To do any type of NLP analysis, one needs data to analyze. Therefore, the first step is extracting the text from its native form (in this case pdf and html files) into text files[5]. These text files were then pre-processed before being imported into an R dataframe. This process is shown in Figure 2.1.

Figure 2.1: Data Acquisition Roadmap

2.1 Extracting text

2.1.1 A Note on Copyrights

I apologize in advance for not making the text files available to download so that you can replicate my analysis. Because the Berkshire letters are copyrighted, I could get into trouble by providing them. If you go to the website where Berkshire’s letters are located, you will see the notice in Figure 2.2, where I put a red box around the copyright notice at the bottom. Since I do not have permission, I cannot reproduce them.

Figure 2.2: Disclosure on Berkshire’s Website

This highlights an interesting nit. Berkshire’s annual reports are not filed with the SEC; only its Form 10-K is. Since a Chairman’s letter is not required in a 10-K filing, it is not included. If it were, it would be ok to reproduce.

2.1.2 Special Note on 2014 Letter

Originally, when doing this analysis, the 2014 letter was an outlier in several ways. It was much longer, at over 23,000 words versus an average of about 12,000 words, and its sentiment scored as extremely positive. Since it was skewing the results, I went back and skimmed the letter. It turns out that 2014 was a special year for Berkshire’s 50th anniversary, as shown in Figure 2.3. This likely explains why the letter is so long and the sentiment so positive. All of the analysis going forward will exclude this special section of the 2014 letter.

Figure 2.3: Note from 2014 letter

2.1.3 Extracting text from pdf files

Letters from 1971-1976 and 2002-2019 are pdfs. To extract the text from these letters, I simply uploaded them to the website pdf to text, converted them to text files and then downloaded them. Please note that while the 2002-2019 letters are available on Berkshire’s website, letters from 1971-1976 are not, but they can be found elsewhere on the internet. For example, to obtain the 1971 letter, simply google “1971 berkshire hathaway letter to shareholders.”

2.1.4 Extracting text from html files

Letters from 1977-2001 are available on Berkshire’s website as html files. I will use the 1997 letter to give an example of my workflow. I went to the webpage for the 1997 letter, pressed ctrl + a to select all, then ctrl + c to copy, as shown in Figure 2.4.

Figure 2.4: text selected

Then I opened notepad and hit ctrl + v to paste. The result is in Figure 2.5.

Figure 2.5: text pasted to notepad

While my preference would have been to automate these steps, given my limited knowledge, doing so was simply too time consuming.
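
For readers who want to try automating the html step, here is a minimal sketch of one possible approach. It is not what I did for this analysis. It uses only base R and crude regular expressions, and the helper name html_to_text is hypothetical. A real html parser (for example the rvest package) would likely be more robust than this.

```r
# Hypothetical helper (not part of the original workflow):
# crude HTML-to-text conversion using base R only.
html_to_text <- function(html) {
  # Drop <script> and <style> blocks together with their contents
  txt <- gsub("(?is)<(script|style)\\b.*?</\\1>", " ", html, perl = TRUE)
  # Remove all remaining tags
  txt <- gsub("<[^>]+>", " ", txt)
  # Decode two common entities (not an exhaustive list)
  txt <- gsub("&nbsp;", " ", txt, fixed = TRUE)
  txt <- gsub("&amp;", "&", txt, fixed = TRUE)
  # Collapse runs of spaces and tabs, then trim both ends
  trimws(gsub("[ \t]+", " ", txt))
}

# Made-up snippet for illustration, not copied from a letter
sample_html <- "<html><body><p>To the Shareholders of <b>Berkshire Hathaway</b> Inc.:</p></body></html>"
html_to_text(sample_html)
```

To apply it to a saved page, the file could first be read into a single string with paste(readLines(path, warn = FALSE), collapse = "\n"), where path points to a local copy of the html file.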

2.2 Need for pre-processing (cleanup)

Another thing I would have liked to do in an automated fashion is cut out some tables and footnotes. A concern is that there might be extraneous text which will later be mischaracterized during the sentiment analysis. For example, a sentiment dictionary might classify the word “profit” as positive. But if we look at the 1997 letter (Figure 2.6), we see a table where Buffett is talking about the results of Berkshire’s insurance operations, and it has a column “underwriting loss” where the word “profit” appears numerous times.

Figure 2.6: Table in 1997 Letter

While not having an underwriting loss is a positive, for the purposes of evaluating sentiment, it would skew the results by scoring the letter higher in positive sentiment than if the table were removed. I want to highlight this because my analysis might be flawed due to situations like this. That being said, it would have been too time consuming to go through all 49 letters manually removing numerous tables and footnotes. The hope[6] is that when I become more proficient in data wrangling, I’ll be able to automatically remove tables like this.
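
As a rough idea of what such automation might look like, here is a sketch of a simple heuristic. It was not used in this analysis. The function names and the 0.3 threshold are assumptions chosen purely for illustration. A line is dropped if digits make up a large share of its characters or if it contains a run of leader dots.

```r
# Hypothetical heuristic (not part of the original workflow):
# flag lines that look like table rows.
is_table_line <- function(lines, max_digit_share = 0.3) {
  compact <- gsub("\\s", "", lines)               # ignore whitespace
  n_chars <- nchar(compact)
  n_digits <- nchar(gsub("[^0-9]", "", compact))  # keep digits only
  digit_share <- ifelse(n_chars > 0, n_digits / n_chars, 0)
  # Tabular if digit-heavy or if it contains four or more dots in a row
  digit_share > max_digit_share | grepl("\\.{4,}", lines)
}

drop_table_lines <- function(text) {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  paste(lines[!is_table_line(lines)], collapse = "\n")
}

# Toy input (made up for illustration, not taken from a letter)
toy <- "Results were good this year.\n1995 ........ 19.9  49.2\n1996 ........ 23.5  41.5\nWe expect more of the same."
cat(drop_table_lines(toy))
```

A heuristic like this will miss table headers that contain only words (such as a column titled “underwriting loss”), so the results would still need to be spot checked by hand.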

At the end of this process, we are left with a folder which I named “txt_for_upload” which contains a separate text file for each of the 49 letters. Now we can import these text files into a dataframe.

2.3 Importing text files to a dataframe

The next step is to point to where the files are located. In this case the folder on my computer is named “txt_for_upload.”

#point to directory where files are located
input_dir <- "C:\\Users\\psonk\\Dropbox\\111 TC\\R stuff\\NLP\\berkshire\\txt_for_upload"

Then I create a list of individual filenames for each text file in the folder:

#create a list of full filenames
library(dplyr) #provides tibble() and the %>% pipe
files_v <- dir(input_dir, "\\.txt$", full.names = TRUE)
tibble(files_v) %>% print(n=5)
## # A tibble: 49 x 1
##   files_v                                                                              
##   <chr>                                                                                
## 1 "C:\\Users\\psonk\\Dropbox\\111 TC\\R stuff\\NLP\\berkshire\\txt_for_upload/1971.txt"
## 2 "C:\\Users\\psonk\\Dropbox\\111 TC\\R stuff\\NLP\\berkshire\\txt_for_upload/1972.txt"
## 3 "C:\\Users\\psonk\\Dropbox\\111 TC\\R stuff\\NLP\\berkshire\\txt_for_upload/1973.txt"
## 4 "C:\\Users\\psonk\\Dropbox\\111 TC\\R stuff\\NLP\\berkshire\\txt_for_upload/1974.txt"
## 5 "C:\\Users\\psonk\\Dropbox\\111 TC\\R stuff\\NLP\\berkshire\\txt_for_upload/1975.txt"
## # ... with 44 more rows
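
Before reading the files in, it can be worth confirming that no year is missing from the folder. The following is a small sketch of such a check in base R. The function name missing_years and the example paths are hypothetical, and it assumes each file is named after its year, as above.

```r
# Hypothetical check: compare the years found in the folder with the years expected.
missing_years <- function(files, expected = 1971:2019) {
  # ".../1971.txt" -> "1971"
  stems <- tools::file_path_sans_ext(basename(files))
  # Non-numeric file names become NA instead of raising a warning
  found <- suppressWarnings(as.integer(stems))
  setdiff(expected, found)
}

# Toy example with made-up paths; files_v would be passed in practice
example_files <- c("txt_for_upload/1971.txt", "txt_for_upload/1973.txt")
missing_years(example_files, expected = 1971:1973)
```

In the toy example the function reports 1972 as missing. With all 49 letters in place it should return an empty vector.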

Then I use the readtext package to read in the text files to the dataframe which I named “brk_letters.”

#read in the text files
library(readtext)
brk_letters <- readtext(files_v, encoding = "UTF-8")
#here is the output. As you can see there are escape sequences
#such as \n which are line breaks
tibble(brk_letters) %>% print(n=5)
## # A tibble: 49 x 2
##   doc_id   text                                                                 
##   <chr>    <chr>                                                                
## 1 1971.txt "To the Stockholders of Berkshire Hathaway Inc.:\nIt is a pleasure t~
## 2 1972.txt "\nTo the Stockholders of Berkshire Hathaway Inc.:\nOperating earnin~
## 3 1973.txt "\nTo the Stockholders of Berkshire Hathaway Inc.:\nOur financial re~
## 4 1974.txt "\nTo the Stockholders of Berkshire Hathaway Inc.:\nOperating result~
## 5 1975.txt "\nTo the Stockholders of Berkshire Hathaway Inc.:\nLast year, when ~
## # ... with 44 more rows

This dataframe contains two columns: “doc_id” which contains the filename and “text” which includes the entire text of the letter for each year in a single cell. In order to convert the year to a number, I first need to get rid of the .txt suffix.

#Since "doc_id" is not particularly descriptive, it is renamed to "year":
names(brk_letters)[1] <- "year"
#Then the suffixes are removed from the year column. The dot is escaped and the
#pattern is anchored with $ so that only a literal ".txt" at the end is removed:
library(stringr)
brk_letters$year <- str_remove_all(brk_letters$year, "\\.txt$")
brk_letters$year <- str_remove_all(brk_letters$year, "_raw")
#The year column is changed from character to numeric. It cannot be changed to a time series
#because the only number is the year; for that it needs to be in a format of "1971-12-31" rather than "1971".
brk_letters$year <- as.numeric(brk_letters$year)

Here is the output after making these changes. It is exactly what we wanted - a dataframe with one column with the year and one column with the text of an entire letter in each row:

tibble(brk_letters) %>% print(n=5)
## # A tibble: 49 x 2
##    year text                                                                                                
##   <dbl> <chr>                                                                                               
## 1  1971 "To the Stockholders of Berkshire Hathaway Inc.:\nIt is a pleasure to report that operating earning~
## 2  1972 "\nTo the Stockholders of Berkshire Hathaway Inc.:\nOperating earnings of Berkshire Hathaway during~
## 3  1973 "\nTo the Stockholders of Berkshire Hathaway Inc.:\nOur financial results for 1973 were satisfactor~
## 4  1974 "\nTo the Stockholders of Berkshire Hathaway Inc.:\nOperating results for 1974 overall were unsatis~
## 5  1975 "\nTo the Stockholders of Berkshire Hathaway Inc.:\nLast year, when discussing the prospects for 19~
## # ... with 44 more rows
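
If a date column is needed later for time series work, one option is to build it from the numeric year. This is only a sketch. It uses a toy vector in place of brk_letters$year, and it assumes that each letter can be dated at December 31 of its year, which is my simplification rather than the date the letter was actually published.

```r
# Toy stand-in for the year column (real values would come from brk_letters$year)
year <- c(1971, 1972, 1973)

# Assumption: date each letter at the calendar year-end, December 31
year_end <- as.Date(paste0(year, "-12-31"))
year_end

# The plain year can be recovered as a character string when needed
format(year_end, "%Y")
```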

If we look at the content of one particular row in the “text” column, we see that it contains the entire text of the letter.

#Contents of the second row of the "text" column which contains the text of the 1972 letter.
brk_letters$text[2]
## [1] "\nTo the Stockholders of Berkshire Hathaway Inc.:\nOperating earnings of Berkshire Hathaway during 1972 amounted to a highly satisfactory 19.8%\nof beginning shareholders’ equity. Significant improvement was recorded in all of our major\nlines of business, but the most dramatic gains were in insurance underwriting profit. Due to an\nunusual convergence of favorable factors—diminishing auto accident frequency, moderating\naccident severity, and an absence of major catastrophes—underwriting profit margins achieved a\nlevel far above averages of the past or expectations of the future.\nWhile we anticipate a modest decrease in operating earnings during 1973, it seems clear that our\ndiversification moves of recent years have established a significantly higher base of normal\nearning power. Your present management assumed policy control of the company in May, 1965.\nEight years later, our 1972 operating earnings of $11,116,256 represent a return many-fold\nhigher than would have been produced had we continued to devote our resources entirely to the\ntextile business. At the end of the 1964 fiscal year, shareholders’ equity totaled $22,138,753.\nSince that time, no additional equity capital has been introduced into the business, either through\ncash sale or through merger. On the contrary, some stock has been reacquired, reducing\noutstanding shares by 14%. The increase in book value per share from $19.46 at fiscal year-end\n1964 to $69.72 at 1972 year-end amounts to about 16.5% compounded annually.\nOur three major acquisitions of recent years have all worked out exceptionally well—from both\nthe financial and human standpoints. 
In all three cases, the founders were major sellers and\nreceived significant proceeds in cash—and, in all three cases, the same individuals, Jack\nRingwalt, Gene Abegg and Vic Raab, have continued to run the businesses with undiminished\nenergy and imagination which have resulted in further improvement of the fine records\npreviously established.\nWe will continue to search for logical extensions of our present operations, and also for new\noperations which will allow us to continue to employ our capital effectively.\n\nTextile Operations\nAs predicted in last year’s annual report, the textile industry experienced a pickup in 1972. In\nrecent years, Ken Chace and Ralph Rigby have developed an outstanding sales organization\nenjoying a growing reputation for service and reliability. Manufacturing capabilities have been\nrestructured to complement our sales strengths.\nHelped by the industry recovery, we experienced some payoff from these efforts in 1972.\nInventories were controlled, minimizing close-out losses in addition to minimizing capital\nrequirements; product mix was greatly improved. While the general level of profitability of the\nindustry will always be the primary factor in determining the level of our textile earnings, we\nbelieve that our relative position within the industry has noticeably improved. The outlook for\n1973 is good.\n\n\fInsurance Underwriting\nOur exceptional underwriting profits during 1972 in the large traditional area of our insurance\nbusiness at National Indemnity present a paradox. They served to swell substantially total\ncorporate profits for 1972, but the factors which produced such profits induced exceptional\namounts of new competition at what we believe to be a non-compensatory level of rates. 
Overall, we probably would have retained better prospects for the next five years if profits had not\nrisen so dramatically this year.\nSubstantial new competition was forecast in our annual report for last year and we experienced\nin 1972 the decline in premium volume that we stated such competition implied. Our belief is\nthat industry underwriting profit margins will narrow substantially in 1973 or 1974 and, in time,\nthis may produce an environment in which our historical growth can be resumed. Unfortunately,\nthere is a lag between deterioration of underwriting results and tempering of competition. During\nthis period we expect to continue to have negative volume comparisons in our traditional\noperation. Our seasoned management, headed by Jack Ringwalt and Phil Liesche, will continue\nto underwrite to produce a profit, although not at the level of 1972, and base our rates on longterm expectations rather than short-term hopes. Although this approach has meant dips in volume\nfrom time to time in the past, it has produced excellent long-term results.\nAlso as predicted in last year’s report, our reinsurance division experienced many of the same\ncompetitive factors in 1972. A multitude of new organizations entered what has historically been\na rather small field, and rates were often cut substantially, and we believe unsoundly, particularly\nin the catastrophe area. The past year turned out to be unusually free of catastrophes and our\nunderwriting experience was good.\nGeorge Young has built a substantial and profitable reinsurance operation in just a few years. In\nthe longer term we plan to be a very major factor in the reinsurance field, but an immediate\nexpansion of volume is not sensible against a background of deteriorating rates. In our view,\nunderwriting exposures are greater than ever. 
When the loss potential inherent in such exposures\nbecomes an actuality, repricing will take place which should give us a chance to expand\nsignificantly.\nIn the “home state” operation, our oldest and largest such company, Cornhusker Casualty\nCompany, operating in Nebraska only, achieved good underwriting results. In the second full\nyear, the home state marketing appeal has been proven with the attainment of volume on the\norder of one-third of that achieved by “old line” giants who have operated in the state for many\ndecades.\nOur two smaller companies, in Minnesota and Texas, had unsatisfactory loss ratios on very small\nvolume. The home state managements understand that underwriting profitably is the yardstick of\nsuccess and that operations can only be expanded significantly when it is clear that we are doing\nthe right job in the underwriting area. Expense ratios at the new companies are also high, but that\nis to be expected when they are in the development stage.\n\n\fJohn Ringwalt has done an excellent job of launching this operation, and plans to expand into at\nleast one additional state during 1973. While there is much work yet to be done, the home state\noperation appears to have major long-range potential.\nLast year it was reported that we had acquired Home and Automobile Insurance Company of\nChicago. We felt good about the acquisition at the time, and we feel even better now. Led by Vic\nRaab, this company continued its excellent record in 1972. During 1973 we expect to enter the\nFlorida (Dade County) and California (Los Angeles) markets with the same sort of specialized\nurban auto coverage which Home and Auto has practiced so successfully in Cook County. Vic\nhas the managerial capacity to run a much larger operation. 
Our expectation is that Home and\nAuto will expand significantly within a few years.\n\nInsurance Investment Results\nWe were most fortunate to experience dramatic gains in premium volume from 1969 to 1971\ncoincidental with virtually record-high interest rates. Large amounts of investable funds were\nthus received at a time when they could be put to highly advantageous use. Most of these funds\nwere placed in tax-exempt bonds and our investment income, which has increased from\n$2,025,201 in 1969 to $6,755,242 in 1972, is subject to a low effective tax rate.\nOur bond portfolio possesses unusually good call protection, and we will benefit for many years\nto come from the high average yield of the present portfolio. The lack of current premium\ngrowth, however, will moderate substantially the growth in investment income during the next\nseveral years.\n\nBanking Operations\nOur banking subsidiary, The Illinois Bank and Trust Co. of Rockford, maintained its position of\nindustry leadership in profitability. After-tax earnings of 2.2% on average deposits in 1972 are\nthe more remarkable when evaluated against such moderating factors as: (1) a mix of 50% time\ndeposits heavily weighted toward consumer savings instruments, all paying the maximum rates\npermitted by law; (2) an unvaryingly strong liquid position and avoidance of money-market\nborrowings; (3) a loan policy which has produced a net charge-off ratio in the last two years of\nabout 5% of that of the average commercial bank. This record is a direct tribute to the leadership\nof Gene Abegg and Bob Kline who run a bank where the owners and the depositors can both eat\nwell and sleep well.\nDuring 1972, interest paid to depositors was double the amount paid in 1969. 
We have\naggressively sought consumer time deposits, but have not pushed for large “money market”\ncertificates of deposit although, during the past several years, they have generally been a less\ncostly source of time funds.\nDuring the past year, loans to our customers expanded approximately 38%. This is considerably\nmore than indicated by the enclosed balance sheet which includes $10.9 million in short-term\ncommercial paper in the 1971 loan total, but which has no such paper included at the end of\n1972.\n\n\fOur position as “Rockford’s Leading Bank” was enhanced during 1972. Present rate structures, a\ndecrease in investable funds due to new Federal Reserve collection procedures, and a probable\nincrease in already substantial non-federal taxes make it unlikely that Illinois National will be\nable to increase its earnings during 1973.\n\nF inancial\nOn March 15, 1973, Berkshire Hathaway borrowed $20 million at 8% from twenty institutional\nlenders. This loan is due March 1, 1993, with principal repayments beginning March 1, 1979.\nFrom the proceeds, $9 million was used to repay our bank loan and the balance is being invested\nin insurance subsidiaries. Periodically, we expect that there will be opportunities to achieve\nsignificant expansion in our insurance business and we intend to have the financial resources\navailable to maximize such opportunities.\nOur subsidiaries in banking and insurance have major fiduciary responsibilities to their\ncustomers. In these operations we maintain capital strength far above industry norms, but still\nachieve a good level of profitability on such capital. We will continue to adhere to the former\nobjective and make every effort to continue to maintain the latter.\nWarren E. Buffett\nChairman of the Board\nMarch 16, 1973\n\n\f"
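
The output above still contains line breaks (\n) and form feeds (\f, the page breaks left over from the pdf). If these get in the way of later steps, they could be flattened into single spaces. Here is a minimal base R sketch. The helper name squash_whitespace is hypothetical and this step is not part of the workflow shown above.

```r
# Hypothetical cleanup helper: flatten layout characters into single spaces.
squash_whitespace <- function(text) {
  # \f is a form feed (page break); \n and \r are line breaks; \t is a tab
  text <- gsub("[\f\r\n\t]+", " ", text)
  # Collapse repeated spaces and trim both ends
  trimws(gsub(" {2,}", " ", text))
}

# Made-up fragment in the same shape as the output above
sample_text <- "\nTo the Stockholders of Berkshire Hathaway Inc.:\nOperating earnings\n\n\fInsurance Underwriting\n"
squash_whitespace(sample_text)
```

Note that this also removes paragraph boundaries, so it would not be appropriate if paragraphs or section headings need to be identified later.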

2.4 Analysis of data thus far

At this stage there are several simple analyses that could be done such as a basic word count (shown in Figure 2.7) or the number of sentences per letter.

Figure 2.7: Word Count by Year
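
The word counts behind a chart like Figure 2.7 can be computed in several ways. Below is one minimal base R sketch that counts whitespace-separated tokens. The function name count_words and the toy data frame are made up for illustration, and a tokenizer that handles punctuation differently would give somewhat different counts.

```r
# Count whitespace-separated tokens in each element of a character vector
count_words <- function(text) {
  tokens <- strsplit(trimws(text), "\\s+")
  # Count only non-empty tokens so that an empty string scores zero
  vapply(tokens, function(tok) sum(nzchar(tok)), integer(1))
}

# Toy data frame with made-up text, shaped like brk_letters
toy_letters <- data.frame(
  year = c(1971, 1972),
  text = c("To the Stockholders:\nIt is a pleasure to report.",
           "\nOperating earnings were\nhighly satisfactory."),
  stringsAsFactors = FALSE
)
toy_letters$word_count <- count_words(toy_letters$text)
toy_letters[, c("year", "word_count")]
```

With the real data, an average over a range of years could then be taken with something like mean(word_count[year >= 1983]) on the corresponding columns.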

As Figure 2.7 shows, Buffett wrote relatively short letters until 1983, when the word count jumped to 11,542 words. The red dotted line shows the average from 1983-2019, which is 12,435 words per letter. As mentioned, Berkshire’s website lists letters from 1977 onward; I’m not sure why they start at that year and do not list earlier letters. Keep in mind that Buffett didn’t become well known[7] until the early 1990s. It would be interesting to get some measure of Buffett’s popularity and see if there is any correlation between it and his use of language.

Measures of popularity could include the trading volume of Berkshire stock or attendance at Berkshire’s annual meetings (which went from 20 people in the late 1980s to 40,000 people more recently). Other measures could be mentions in newspapers such as the Wall Street Journal. For example, Figure 2.8 shows the mentions of “Berkshire Hathaway” and “Warren Buffett” or “Warren E. Buffett” in the print edition of the Wall Street Journal from 1982[8] to present.

A well-known phenomenon in social psychology is the “Hawthorne Effect,” whereby an individual alters their behavior as a result of being observed. It would be interesting to see if Buffett’s language changed as he became more popular.

Figure 2.8: Mentions in Print Edition of WSJ (source: Factiva)

We can add this to the list of potential facets to analyze, but now we will move on to the next step: tokenization.


  5. This step was actually an ordeal, which I will expand upon in a section at some later point. It was basically a nightmare. While my goal was to have an automated process to convert pdf and html files to text files, whenever I got close, I would hit an insurmountable obstacle, usually involving regular expressions and text encoding.

  6. As I am fond of saying, whenever you use the word “hope,” you’re f*cked.

  7. I am not sure if it should be “well known” or “well-known.” I should have paid more attention in English class!

  8. For some reason, Factiva does not have data before 1982.