Chapter 3 Tokenization, Text Cleaning and Normalization

The previous chapter ended with a dataframe brk_letters containing 49 rows and 2 columns. For each row, one column contains the year and a second column contains a string of the full text of the letter for that year. For example, row 44 (the first row in the table below) has “2014” in the “year” column and a string of 23,466 words of text in the “text” column.

tibble(brk_letters) %>% tail()
## # A tibble: 6 x 2
##    year text                                                                            
##   <dbl> <chr>                                                                           
## 1  2014 "A note to readers: Fifty years ago, today’s management took charge at Berkshir~
## 2  2015 "BERKSHIRE HATHAWAY INC.\nTo the Shareholders of Berkshire Hathaway Inc.:\nBerk~
## 3  2016 "BERKSHIRE HATHAWAY INC.\n\nTo the Shareholders of Berkshire Hathaway Inc.:\nBe~
## 4  2017 "BERKSHIRE HATHAWAY INC.\n\nTo the Shareholders of Berkshire Hathaway Inc.:\nBe~
## 5  2018 "BERKSHIRE HATHAWAY INC.\n\nTo the Shareholders of Berkshire Hathaway Inc.:\nBe~
## 6  2019 "\nBERKSHIRE HATHAWAY INC.\n\n\nTo the Shareholders of Berkshire Hathaway Inc.:~

As I mentioned, there are limited types of analyses that can be performed with the data in this format. In order to analyze the data, it must be broken down into smaller pieces, or tokens, by a process called tokenization. A token is a meaningful unit of text. The goal for this chapter is to restructure the dataframe so there is one token per row.

3.1 Roadmap for Tokenization, Text Cleaning and Normalization

A raw string of text must be tokenized before it can be analyzed, but there are other adjustments that may also be needed to prepare the data for analysis. Figure 3.1 shows the process of preparing the text for further analysis.


Figure 3.1: Roadmap for Tokenization and Text Cleaning and Normalization

3.2 Tokenization

The first step is using the unnest_tokens function in the tidytext package to put each word in a separate row. As you can see, the dimensions are now 501,124 rows and 2 columns. The unnest_tokens function also performed some text cleaning by converting all upper case letters to lower case and stripping punctuation and special characters.

brk_words <- brk_letters %>%
  unnest_tokens(word, text)      #splits text into words
tibble(brk_words)
## # A tibble: 501,124 x 2
##     year word        
##    <dbl> <chr>       
##  1  1971 to          
##  2  1971 the         
##  3  1971 stockholders
##  4  1971 of          
##  5  1971 berkshire   
##  6  1971 hathaway    
##  7  1971 inc         
##  8  1971 it          
##  9  1971 is          
## 10  1971 a           
## # ... with 501,114 more rows
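
Although I tokenize into single words here, unnest_tokens can also split the text into other units, such as n-grams or sentences, via its token argument. A quick sketch, not used further in this analysis:

brk_bigrams <- brk_letters %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)   # pairs of adjacent words

brk_sentences <- brk_letters %>%
  unnest_tokens(sentence, text, token = "sentences")     # one sentence per row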

3.3 Removing Numbers

Numbers will not provide us any insight into sentiment, so we will remove them using the following code. Rows are reduced from 501,124 to 478,243.

brk_words <- brk_letters %>%
  unnest_tokens(word, text) %>%     # splits text into words
  filter(!grepl('[0-9]', word))     # remove numbers
tibble(brk_words) %>% print(n=5)
## # A tibble: 478,243 x 2
##    year word        
##   <dbl> <chr>       
## 1  1971 to          
## 2  1971 the         
## 3  1971 stockholders
## 4  1971 of          
## 5  1971 berkshire   
## # ... with 478,238 more rows

3.4 Stop Word Removal

“Stop words” are common or frequently used words that are often not useful for analysis and are usually removed. The tidytext package includes a list of 1,149 stop words.

stop_words
## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # ... with 1,139 more rows

When we look at the data, we see a column “lexicon,” which indicates that there is likely more than one lexicon in this dataset. To obtain the unique values we use the following code and see that there are three different lexicons in the dataframe: “SMART,” “snowball” and “onix.”

stop_words %>%
  select(lexicon) %>%   #select "lexicon" column
  unique()              #show unique values in column
## # A tibble: 3 x 1
##   lexicon 
##   <chr>   
## 1 SMART   
## 2 snowball
## 3 onix
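
To see how many entries each lexicon contributes, we could simply count by lexicon (a quick sketch; I did not run this as part of the analysis):

stop_words %>%
  count(lexicon, sort = TRUE)   # number of stop words per lexicon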

To see whether certain words are duplicated across the different lexicons, we use the following code and see that there are only 728 unique words in the list.

stop_words %>%
  select(word) %>%   #select "word" column
  unique()           #show unique values in column
## # A tibble: 728 x 1
##    word       
##    <chr>      
##  1 a          
##  2 a's        
##  3 able       
##  4 about      
##  5 above      
##  6 according  
##  7 accordingly
##  8 across     
##  9 actually   
## 10 after      
## # ... with 718 more rows
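
Since the three lexicons together contain 1,149 entries but only 728 unique words, many words must appear in more than one lexicon. A quick sketch of how one could count them (not run as part of the original analysis):

stop_words %>%
  distinct(word, lexicon) %>%            # one row per word/lexicon pair
  count(word, name = "n_lexicons") %>%   # in how many lexicons does each word appear?
  filter(n_lexicons > 1) %>%             # keep only words shared by two or more lexicons
  nrow()                                 # number of shared words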

Before removing stop words from our corpus, let’s look at the top 10 most frequently used words over the entire 49-year period.

tibble(brk_words) %>%
  count(word, sort = T) 
## # A tibble: 15,626 x 2
##    word      n
##    <chr> <int>
##  1 the   21840
##  2 of    14356
##  3 to    12067
##  4 in    11478
##  5 and   10632
##  6 a     10054
##  7 that   7476
##  8 we     7077
##  9 our    5922
## 10 is     4677
## # ... with 15,616 more rows

The top 10 list comprises exactly what the stop word removal process is trying to eliminate: frequently used words that do not contribute to our sentiment analysis. We also see that there are 15,626 unique words used throughout the corpus.

Removing the stop words9 reduces the list from 478,243 to 201,504 rows.

brk_words1 <- brk_letters %>%
  unnest_tokens(word, text) %>%       # splits text into words
  filter(!grepl('[0-9]', word)) %>%   # remove numbers
  anti_join(stop_words)               # remove stop words
tibble(brk_words1) %>% print(n=5)
## # A tibble: 201,504 x 2
##    year word        
##   <dbl> <chr>       
## 1  1971 stockholders
## 2  1971 berkshire   
## 3  1971 hathaway    
## 4  1971 pleasure    
## 5  1971 report      
## # ... with 201,499 more rows

The top 10 list of frequently used words changes drastically after removing the stop words, but the lexical variety decreases only slightly, from 15,626 words to 14,963 words.

tibble(brk_words1) %>%
  count(word, sort = T) 
## # A tibble: 14,963 x 2
##    word           n
##    <chr>      <int>
##  1 business    2201
##  2 earnings    1981
##  3 berkshire   1925
##  4 company     1298
##  5 insurance   1272
##  6 million     1262
##  7 businesses  1013
##  8 billion      881
##  9 companies    858
## 10 stock        796
## # ... with 14,953 more rows

Think about this: 663 individual words accounted for 276,739 of the total words used, or almost 58%!10

When we examined the tidytext stop word list, we noticed that there were three lexicons. Keep in mind that those are not the only stop word lists; there are many others. To see what differences there might be, I decided to try the generic stop word list curated by Tim Loughran and Bill McDonald, two finance professors at the University of Notre Dame who have done seminal work on text analysis in finance. The list is available on their website and is shown below.

loughran_stop_long <- readtext(from_loughran_website)   # from_loughran_website holds the URL of the stop word file

loughran_stop_long <- loughran_stop_long %>%
  unnest_tokens(word, text) %>%
  select(word)
tibble(loughran_stop_long) %>% print(n=5)
## # A tibble: 571 x 1
##   word 
##   <chr>
## 1 a    
## 2 a's  
## 3 able 
## 4 about
## 5 above
## # ... with 566 more rows

It is simple to re-run the analysis substituting the Loughran stop list for the tidytext stop list. Using the Loughran list leaves 218,982 rows versus 201,504 for the tidytext stop list.

brk_words_2 <- brk_letters %>%
  unnest_tokens(word, text) %>%       # splits text into words
  filter(!grepl('[0-9]', word)) %>%   # remove numbers
  anti_join(loughran_stop_long)       # remove stop words
tibble(brk_words_2) %>% print(n=5)
## # A tibble: 218,982 x 2
##    year word        
##   <dbl> <chr>       
## 1  1971 stockholders
## 2  1971 berkshire   
## 3  1971 hathaway    
## 4  1971 pleasure    
## 5  1971 report      
## # ... with 218,977 more rows

I do not know if using different stop word lists would make a difference in the final sentiment analysis; I did this simply to show that other stop lists can easily be substituted. The other thing I could have done is add the Loughran stop list to the tidytext stop list, resulting in four lexicons instead of three.
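
A minimal sketch of how that combined list could be built, assuming the loughran_stop_long dataframe created above (I did not run this):

combined_stop <- bind_rows(
  stop_words,                                          # word and lexicon columns
  loughran_stop_long %>% mutate(lexicon = "loughran")  # add a lexicon label to the Loughran list
)

The combined tibble could then be passed to anti_join() exactly as before.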

Looking at the most frequently used words after stop word removal using the tidytext lexicon, we see that several other words appear very frequently, like “business,” “berkshire” and “company,” which are unlikely to add value to our sentiment analysis since they are neutral words. For this reason, I might want to consider developing my own list of custom stop words (but in the interests of time, I didn’t); a sketch of what such a list might look like follows the output below.

tibble(brk_words1) %>%
  count(word, sort = T) 
## # A tibble: 14,963 x 2
##    word           n
##    <chr>      <int>
##  1 business    2201
##  2 earnings    1981
##  3 berkshire   1925
##  4 company     1298
##  5 insurance   1272
##  6 million     1262
##  7 businesses  1013
##  8 billion      881
##  9 companies    858
## 10 stock        796
## # ... with 14,953 more rows
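
For illustration, a hypothetical custom stop word list might look like the sketch below. The words chosen are simply examples drawn from the top 10 list; I did not actually apply such a list in this analysis.

custom_stop <- tibble(
  word    = c("business", "businesses", "berkshire", "company", "companies"),
  lexicon = "custom"                      # label so it could sit alongside the other lexicons
)

brk_words_custom <- brk_words1 %>%
  anti_join(custom_stop, by = "word")     # remove the custom stop words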

3.5 Text Cleaning

There are numerous aspects of text cleaning (and while I mention them here, I did not perform all of them); a small sketch of a few of these operations follows the list.

  • Converting upper case letters to lower case.

  • Spell checking

  • Substitution of contractions such as converting “I’m” to “I am.”

  • Transforming numbers into word numerals, such as converting “29” to “twenty-nine.”

  • Removing punctuation and special characters.

  • Expanding acronyms to their full forms, such as converting “USA” to “United States of America.”
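
As an illustration, here is a minimal sketch of a few of these operations using stringr; the example sentence and the replacement pairs are purely illustrative.

library(stringr)

x <- "I'm pleased to report that the USA operations earned $29 million."

str_to_lower(x)                        # convert upper case letters to lower case
str_remove_all(x, "[[:punct:]]")       # remove punctuation and special characters
str_replace_all(x, c("I'm" = "I am",   # expand a contraction
                     "USA" = "United States of America"))   # expand an acronym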

We observed that the tokenization process with tidytext removed punctuation and special characters and converted upper case letters to lower case. For this analysis, I did not perform the other text cleaning operations mentioned.

3.5.1 Text Normalization - Stemming and Lemmatization

Language contains words in different tenses, plural forms and forms derived from other words. Text normalization converts or strips such words down to a base form. Stemming crudely chops off word endings to arrive at a common stem, which is often not itself a recognizable word; for example, “operate,” “operating,” “operative” and “operation” would all be reduced to “oper,” which is not a word. Lemmatization, on the other hand, uses a vocabulary and morphological analysis to return a word’s dictionary form, or lemma, so “plays,” “playing” and “played” would be reduced to “play,” and “am,” “are” and “is” to “be.”11
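
As a quick illustration using the Porter stemmer from the SnowballC package (the same stemmer used below):

library(SnowballC)
wordStem(c("operate", "operating", "operative", "operation"))
# all four words are reduced to the common stem "oper"
# a lemmatizer (e.g., the textstem package) would instead return dictionary forms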

For example, if we go back to our list of 10 most frequently used words after stop word removal, we see that “business” and “businesses” and “company” and “companies” are considered to be separate words each with its own count.

tibble(brk_words1) %>%
  count(word, sort = T) 
## # A tibble: 14,963 x 2
##    word           n
##    <chr>      <int>
##  1 business    2201
##  2 earnings    1981
##  3 berkshire   1925
##  4 company     1298
##  5 insurance   1272
##  6 million     1262
##  7 businesses  1013
##  8 billion      881
##  9 companies    858
## 10 stock        796
## # ... with 14,953 more rows

Rather than stemming as a separate step, I will use the SnowballC package, in conjunction with the other packages we already have loaded, to stem while performing the other operations detailed above, using the following code to produce a new dataframe called brk_stemmed.

library(SnowballC)
brk_stemmed <- tibble(brk_letters) %>%
  unnest_tokens(word, text) %>%       # splits text into words
  mutate(stem = wordStem(word)) %>%   # stems words and creates column
  filter(!grepl('[0-9]', word)) %>%   # remove numbers
  anti_join(stop_words) %>%           # remove stop words
  group_by(year) %>%                  # group by year ...
  ungroup()                           # ... then remove the grouping (no net effect on the rows)
brk_stemmed 
## # A tibble: 201,504 x 3
##    year  word         stem     
##    <chr> <chr>        <chr>    
##  1 1971  stockholders stockhold
##  2 1971  berkshire    berkshir 
##  3 1971  hathaway     hathawai 
##  4 1971  pleasure     pleasur  
##  5 1971  report       report   
##  6 1971  operating    oper     
##  7 1971  earnings     earn     
##  8 1971  excluding    exclud   
##  9 1971  capital      capit    
## 10 1971  gains        gain     
## # ... with 201,494 more rows

The result is a dataframe with three columns, “year,” “word” and “stem,” where “stem,” as you can see, is a stemmed version of the word. Some stemmings, like “hathaway” being converted to “hathawai,” do not seem to make much sense. Let us examine our new top 10 list.

brk_stemmed %>%
  count(stem, sort = TRUE)
## # A tibble: 9,495 x 2
##    stem         n
##    <chr>    <int>
##  1 busi      3223
##  2 earn      2476
##  3 compani   2156
##  4 berkshir  1926
##  5 insur     1775
##  6 oper      1661
##  7 million   1442
##  8 manag     1417
##  9 share     1306
## 10 invest    1285
## # ... with 9,485 more rows

One thing we notice is that the number of unique words has decreased from 14,963 to 9,495. We also see that words like “business” and “businesses” have both been converted to “busi” and combined. In the previous word list, “business” was used 2,201 times and “businesses” 1,013 times. Now the stem “busi” appears 3,223 times. But 2,201 + 1,013 = 3,214, which means there was some other word, appearing 9 times, that was stemmed to “busi” and included. Let’s find what other words were stemmed to “busi” by running some code.

brk_stemmed %>% 
  filter(stem == "busi") %>%  # filters "stem" column for values = "busi"
  select(word) %>%            # selects "word" column
  unique()                    # shows unique values in "word" column
## # A tibble: 3 x 1
##   word      
##   <chr>     
## 1 business  
## 2 businesses
## 3 busy

As we see, the word “busy” was also stemmed to “busi,” which doesn’t make a lot of sense. If we instead filter for all values in the “stem” column that contain the string “busi” and look at the corresponding words, we get a different list.

brk_stemmed %>% 
  filter(stringr::str_detect(stem,"busi")) %>%
  select(word) %>%
  unique()
## # A tibble: 10 x 1
##    word         
##    <chr>        
##  1 business     
##  2 businesses   
##  3 busy         
##  4 businessman’s
##  5 businesslike 
##  6 businessman  
##  7 business's   
##  8 busily       
##  9 businessmen  
## 10 busiest

We see that words like “businesslike,” “businessman” and “businessmen” contain “busi” but have not been stemmed down to “busi.” To see whether this is a widespread issue, we can count the number of times each word was used.

brk_stemmed %>% 
  filter(stringr::str_detect(stem,"busi")) %>%
  count(word, sort = TRUE)
## # A tibble: 10 x 2
##    word              n
##    <chr>         <int>
##  1 business       2201
##  2 businesses     1013
##  3 businessman       9
##  4 busy              9
##  5 businessmen       3
##  6 businesslike      2
##  7 busiest           1
##  8 busily            1
##  9 business's        1
## 10 businessman’s     1

Given the low number of times words other than “business” and “businesses” were used, this is likely not going to be an issue. In any case, none of this will make a difference for the sentiment analysis, but these types of issues might be important in natural language processing analyses beyond sentiment, so it is good to pull on these threads.

3.6 Resulting dataframe for Further Analysis

For the analysis going forward, I am not going to stem the words, as I do not know how the sentiment lexicons will handle stemmed words. Combining unnesting, number removal and stop word removal provides the output we were looking for at the beginning of this chapter: a dataframe with one token per row.

brk_words <- brk_letters %>%
  unnest_tokens(word, text) %>%       # splits text into words
  filter(!grepl('[0-9]', word)) %>%   # remove numbers
  anti_join(stop_words) %>%           # remove stop words
  group_by(year) %>%                  # group by year ...
  ungroup()                           # ... then remove the grouping (no net effect on the rows)
tibble(brk_words) %>% print(n=5)
## # A tibble: 201,504 x 2
##   year  word        
##   <chr> <chr>       
## 1 1971  stockholders
## 2 1971  berkshire   
## 3 1971  hathaway    
## 4 1971  pleasure    
## 5 1971  report      
## # ... with 201,499 more rows
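
Since the next chapter builds on this dataframe, it could be written to disk for reuse; a minimal sketch, where the file name is just an example:

saveRDS(brk_words, "brk_words.rds")     # save the tokenized, cleaned dataframe
brk_words <- readRDS("brk_words.rds")   # reload it later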

In the next chapter, we will review different sentiment dictionaries before combining them with our dataframe and analyzing sentiment.


  9. To see if the stop word removal makes a difference, I will perform analyses with and without stop words. Again, the purpose of this exercise is to explore and learn, and knowing the advantages and disadvantages of various techniques is an important aspect of learning.↩︎

  10. 478,243 total words after removing numbers, minus 201,504 after stop word removal, equals 276,739; 276,739 / 478,243 = 57.87%.↩︎

  11. https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html↩︎