Chapter 3 Tokenization, Text Cleaning and Normalization
The previous chapter ended with a dataframe brk_letters containing 49 rows and 2 columns. Each row has one column containing the year and a second column containing the full text of that year’s letter as a single string. For example, row 44 (the first row in the table below) has “2014” in the “year” column and a string of 23,466 words of text in the “text” column.
tibble(brk_letters) %>% tail()
## # A tibble: 6 x 2
## year text
## <dbl> <chr>
## 1 2014 "A note to readers: Fifty years ago, today’s management took charge at Berkshir~
## 2 2015 "BERKSHIRE HATHAWAY INC.\nTo the Shareholders of Berkshire Hathaway Inc.:\nBerk~
## 3 2016 "BERKSHIRE HATHAWAY INC.\n\nTo the Shareholders of Berkshire Hathaway Inc.:\nBe~
## 4 2017 "BERKSHIRE HATHAWAY INC.\n\nTo the Shareholders of Berkshire Hathaway Inc.:\nBe~
## 5 2018 "BERKSHIRE HATHAWAY INC.\n\nTo the Shareholders of Berkshire Hathaway Inc.:\nBe~
## 6 2019 "\nBERKSHIRE HATHAWAY INC.\n\n\nTo the Shareholders of Berkshire Hathaway Inc.:~
As I mentioned, there are only limited types of analyses that can be performed with the data in this format. To analyze the data, it must be broken down into smaller pieces, or tokens, by a process called tokenization. A token is a meaningful unit of text. The goal for this chapter is to restructure the dataframe so there is one token per row.
3.1 Roadmap for Tokenization, Text Cleaning and Normalization
A raw string of text must be tokenized before it can be analyzed, and there are other adjustments that may need to be made to prepare the data for analysis. Figure 3.1 shows the process of preparing the text for further analysis.
3.2 Tokenization
The first step is using the unnest_tokens function in the tidytext package to put each word in a separate row. As you can see, the dimensions are now 501,124 rows and 2 columns. The unnest_tokens function also performs some text cleaning by converting all upper case letters to lower case and removing punctuation and special characters.
brk_words <- brk_letters %>%
  unnest_tokens(word, text) # splits text into words
tibble(brk_words)
## # A tibble: 501,124 x 2
## year word
## <dbl> <chr>
## 1 1971 to
## 2 1971 the
## 3 1971 stockholders
## 4 1971 of
## 5 1971 berkshire
## 6 1971 hathaway
## 7 1971 inc
## 8 1971 it
## 9 1971 is
## 10 1971 a
## # ... with 501,114 more rows
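As an aside, unnest_tokens is not limited to single words; it can also produce other units such as bigrams (two-word phrases). A minimal sketch follows; the name brk_bigrams is just for illustration, and bigrams are not used in the rest of this chapter.
# Tokenize into bigrams instead of single words (illustrative only)
brk_bigrams <- brk_letters %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)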
3.3 Removing Numbers
Numbers will not provide any insight into sentiment, so we remove them using the following code. The rows are reduced from 501,124 to 478,243.
brk_words <- brk_letters %>%
  unnest_tokens(word, text) %>% # splits text into words
  filter(!grepl('[0-9]', word)) # remove numbers
tibble(brk_words) %>% print(n=5)
## # A tibble: 478,243 x 2
## year word
## <dbl> <chr>
## 1 1971 to
## 2 1971 the
## 3 1971 stockholders
## 4 1971 of
## 5 1971 berkshire
## # ... with 478,238 more rows
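An equivalent number filter could be written with stringr, which is used later in this chapter. A minimal alternative sketch, where brk_words_alt is just an illustrative name:
brk_words_alt <- brk_letters %>%
  unnest_tokens(word, text) %>% # splits text into words
  filter(!stringr::str_detect(word, "[0-9]")) # remove tokens containing digits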
3.4 Stop Word Removal
“Stop words” are common, frequently used words that are often not useful for analysis and are usually removed. The tidytext package includes a list of 1,149 stop words.
stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
When we look at the data, we see a “lexicon” column, which suggests that there is more than one lexicon in this dataset. To obtain the unique values we use the following code and see that there are three different lexicons in the dataframe: “SMART,” “snowball” and “onix.”
stop_words %>%
  select(lexicon) %>% # select "lexicon" column
  unique() # show unique values in column
## # A tibble: 3 x 1
## lexicon
## <chr>
## 1 SMART
## 2 snowball
## 3 onix
To see whether words are duplicated across the different lexicons, we use the following code and find that there are only 728 unique words among the 1,149 entries, so many words appear in more than one lexicon.
stop_words %>%
  select(word) %>% # select "word" column
  unique() # show unique values in column
## # A tibble: 728 x 1
## word
## <chr>
## 1 a
## 2 a's
## 3 able
## 4 about
## 5 above
## 6 according
## 7 accordingly
## 8 across
## 9 actually
## 10 after
## # ... with 718 more rows
Before removing stop words from our corpus, let’s look at the 10 most frequently used words over the entire 49-year period.
tibble(brk_words) %>%
count(word, sort = T)
## # A tibble: 15,626 x 2
## word n
## <chr> <int>
## 1 the 21840
## 2 of 14356
## 3 to 12067
## 4 in 11478
## 5 and 10632
## 6 a 10054
## 7 that 7476
## 8 we 7077
## 9 our 5922
## 10 is 4677
## # ... with 15,616 more rows
The top 10 list consists of exactly what the stop word removal process is trying to eliminate: frequently used words that do not contribute to our sentiment analysis. We also see that there are 15,626 unique words used throughout the corpus.
Removing the stop words9 reduces the list from 478,243 to 201,504 rows.
brk_words1 <- brk_letters %>%
  unnest_tokens(word, text) %>% # splits text into words
  filter(!grepl('[0-9]', word)) %>% # remove numbers
  anti_join(stop_words) # remove stop words
tibble(brk_words1) %>% print(n=5)
## # A tibble: 201,504 x 2
## year word
## <dbl> <chr>
## 1 1971 stockholders
## 2 1971 berkshire
## 3 1971 hathaway
## 4 1971 pleasure
## 5 1971 report
## # ... with 201,499 more rows
The top 10 list of frequently used words changes drastically after removing the stop words, but the lexical variety decreases only slightly, from 15,626 words to 14,963 words.
tibble(brk_words1) %>%
count(word, sort = T)
## # A tibble: 14,963 x 2
## word n
## <chr> <int>
## 1 business 2201
## 2 earnings 1981
## 3 berkshire 1925
## 4 company 1298
## 5 insurance 1272
## 6 million 1262
## 7 businesses 1013
## 8 billion 881
## 9 companies 858
## 10 stock 796
## # ... with 14,953 more rows
Think about this: 663 individual words accounted for 276,739 of the total words used, or almost 58%!10
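As a quick sanity check on that percentage, using the row counts shown in the outputs above:
# Share of tokens removed as stop words (row counts taken from the outputs above)
(478243 - 201504) / 478243 # roughly 0.579, or about 58%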
When we examined the tidytext stop word list, we noticed that there were three lexicons. Keep in mind that those are not the only stop word lists; there are many others. To see what differences there might be, I decided to try a generic stop word list curated by Tim Loughran and Bill McDonald, two finance professors at the University of Notre Dame who have done seminal work on text analysis in finance. The list is available at their website and is shown below.
loughran_stop_long <- readtext(from_loughran_website)

loughran_stop_long <- loughran_stop_long %>%
  unnest_tokens(word, text) %>%
  select(word)
tibble(loughran_stop_long) %>% print(n=5)
## # A tibble: 571 x 1
## word
## <chr>
## 1 a
## 2 a's
## 3 able
## 4 about
## 5 above
## # ... with 566 more rows
It is simple to re-run the analysis substituting the Loughran stop list for the tidytext stop list. Using the Loughran list reduces the rows to 218,982, versus 201,504 for the tidytext stop list.
brk_words_2 <- brk_letters %>%
  unnest_tokens(word, text) %>% # splits text into words
  filter(!grepl('[0-9]', word)) %>% # remove numbers
  anti_join(loughran_stop_long) # remove stop words
tibble(brk_words_2) %>% print(n=5)
## # A tibble: 218,982 x 2
## year word
## <dbl> <chr>
## 1 1971 stockholders
## 2 1971 berkshire
## 3 1971 hathaway
## 4 1971 pleasure
## 5 1971 report
## # ... with 218,977 more rows
I do not know whether using different stop word lists would make a difference in the final sentiment analysis; I did this only to show that other stop lists can easily be substituted. I could also have added the Loughran stop list to the tidytext stop list, resulting in 4 lexicons instead of 3.
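If I had wanted to combine the lists, something along these lines would work; combined_stop_words and the "loughran" lexicon label are just illustrative names, not something I actually ran.
# Append the Loughran list to the tidytext stop words as a fourth lexicon (illustrative)
combined_stop_words <- stop_words %>%
  bind_rows(loughran_stop_long %>% mutate(lexicon = "loughran"))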
Looking at the most frequently used words after stop word removal with the tidytext lexicon, we see several other words that appear very frequently, like “business,” “Berkshire” and “company,” which are unlikely to add value to our sentiment analysis since they are neutral words. For this reason, I might want to consider developing my own list of custom stop words (but in the interests of time, I didn’t). A brief sketch of what a custom list might look like follows the output below.
tibble(brk_words1) %>%
count(word, sort = T)
## # A tibble: 14,963 x 2
## word n
## <chr> <int>
## 1 business 2201
## 2 earnings 1981
## 3 berkshire 1925
## 4 company 1298
## 5 insurance 1272
## 6 million 1262
## 7 businesses 1013
## 8 billion 881
## 9 companies 858
## 10 stock 796
## # ... with 14,953 more rows
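As an illustration only, a few such neutral, domain-specific words could be collected in a custom lexicon and removed with the same anti_join pattern used above; the words chosen and the names custom_stop_words and brk_words1_custom are hypothetical, not something I actually did.
# A small custom stop word list (illustrative only)
custom_stop_words <- tibble(
  word = c("business", "businesses", "berkshire", "company", "companies"),
  lexicon = "custom"
)
brk_words1_custom <- brk_words1 %>%
  anti_join(custom_stop_words, by = "word") # remove the custom stop words as well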
3.5 Text Cleaning
There are numerous aspects of text cleaning (and while I mention them here, I did not perform all of them):
Converting upper case letters to lower case.
Spell checking.
Substituting contractions, such as converting “I’m” to “I am.”
Transforming numbers into word numerals, such as converting “29” to “twenty-nine.”
Removing punctuation and special characters.
Expanding acronyms to their full form, such as “USA” to “United States of America.”
We observed that the tokenization process with tidytext removed punctuation and special characters and converted upper case letters to lower case. For this analysis, I did not perform the other text cleaning operations mentioned.
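For completeness, here is a rough sketch of how a few of the other operations could be performed on the raw text before tokenizing, using base R; clean_text and the specific replacements are only examples and were not applied in this analysis.
# Illustrative text cleaning steps applied to a raw text string (not used in this analysis)
clean_text <- function(x) {
  x <- tolower(x)                                       # upper case to lower case
  x <- gsub("\\bi'm\\b", "i am", x)                     # expand a contraction (example)
  x <- gsub("\\busa\\b", "united states of america", x) # expand an acronym (example)
  x <- gsub("[[:punct:]]", " ", x)                      # remove punctuation and special characters
  x
}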
3.5.1 Text Normalization - Stemming and Lemmatization
Language contains words in different tenses, plurals and forms derived from other words. Text normalization converts or strips such words down to a base form. Stemming chops off word endings, and the resulting root may not be a recognizable word. For example, “operate,” “operating,” “operative” and “operation” would all be reduced to “oper,” which is not a word11. Lemmatization, on the other hand, reduces words to their dictionary form, or lemma, which is still a recognizable word; for example, “plays,” “playing” and “played” become “play,” and “am,” “are” and “is” become “be.”
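To make the distinction concrete, here is a small comparison. The stemming call uses SnowballC, which is also used below; the lemmatization line assumes the textstem package, which I did not use in this analysis.
library(SnowballC)
# Stemming chops off endings and can produce non-words
wordStem(c("operate", "operating", "operative", "operation")) # all reduce to "oper"
# Lemmatization maps words to their dictionary form (assumes the textstem package)
# textstem::lemmatize_words(c("am", "are", "is")) # would return "be" "be" "be"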
For example, if we go back to our list of 10 most frequently used words after stop word removal, we see that “business” and “businesses” and “company” and “companies” are considered to be separate words each with its own count.
tibble(brk_words1) %>%
count(word, sort = T)
## # A tibble: 14,963 x 2
## word n
## <chr> <int>
## 1 business 2201
## 2 earnings 1981
## 3 berkshire 1925
## 4 company 1298
## 5 insurance 1272
## 6 million 1262
## 7 businesses 1013
## 8 billion 881
## 9 companies 858
## 10 stock 796
## # ... with 14,953 more rows
Rather than stemming as a separate step, I will use the SnowballC package, in conjunction with the other packages we already have loaded, to stem the words and perform the other operations detailed above. The following code produces a new dataframe called brk_stemmed.
library(SnowballC)
brk_stemmed <- tibble(brk_letters) %>%
  unnest_tokens(word, text) %>% # splits text into words
  mutate(stem = wordStem(word)) %>% # stems words and creates a "stem" column
  filter(!grepl('[0-9]', word)) %>% # remove numbers
  anti_join(stop_words) %>% # remove stop words
  group_by(year) %>%
  ungroup()
brk_stemmed
## # A tibble: 201,504 x 3
## year word stem
## <chr> <chr> <chr>
## 1 1971 stockholders stockhold
## 2 1971 berkshire berkshir
## 3 1971 hathaway hathawai
## 4 1971 pleasure pleasur
## 5 1971 report report
## 6 1971 operating oper
## 7 1971 earnings earn
## 8 1971 excluding exclud
## 9 1971 capital capit
## 10 1971 gains gain
## # ... with 201,494 more rows
The result is a dataframe with three columns, “year,” “word” and “stem,” where “stem” is, as you can see, a stemmed version of the word. Some stemmings, like “hathaway” being converted to “hathawai,” do not seem to make much sense. Let us examine our new top 10 list.
brk_stemmed %>%
  count(stem, sort = TRUE)
## # A tibble: 9,495 x 2
## stem n
## <chr> <int>
## 1 busi 3223
## 2 earn 2476
## 3 compani 2156
## 4 berkshir 1926
## 5 insur 1775
## 6 oper 1661
## 7 million 1442
## 8 manag 1417
## 9 share 1306
## 10 invest 1285
## # ... with 9,485 more rows
One thing we notice is that the number of unique words has decreased from 14,963 to 9,495. We also see that words like “business” and “businesses” have both been converted to “busi” and combined. In the previous word list, “business” was used 2,201 times and “businesses” was used 1,013 times. Now the stem “busi” appears 3,223 times. But 2,201 + 1,013 = 3,214, which means some other word that appeared 9 times was also stemmed to “busi” and included. Let’s find out which other words were stemmed to “busi” by running some code.
brk_stemmed %>%
  filter(stem == "busi") %>% # filters "stem" column for values = "busi"
  select(word) %>% # selects "word" column
  unique() # shows unique values in "word" column
## # A tibble: 3 x 1
## word
## <chr>
## 1 business
## 2 businesses
## 3 busy
As we see, the word “busy” was stemmed to “busi,” which doesn’t make a lot of sense. If we instead filter for all stems that contain the string “busi,” we get a longer list.
brk_stemmed %>%
  filter(stringr::str_detect(stem, "busi")) %>%
  select(word) %>%
  unique()
## # A tibble: 10 x 1
## word
## <chr>
## 1 business
## 2 businesses
## 3 busy
## 4 businessman’s
## 5 businesslike
## 6 businessman
## 7 business's
## 8 busily
## 9 businessmen
## 10 busiest
We see that words like “businesslike,” “businessman” and “busily” have not been stemmed down to “busi.” To see whether this is a widespread issue, we can count the number of times each of these words was used.
brk_stemmed %>%
  filter(stringr::str_detect(stem, "busi")) %>%
  count(word, sort = TRUE)
## # A tibble: 10 x 2
## word n
## <chr> <int>
## 1 business 2201
## 2 businesses 1013
## 3 businessman 9
## 4 busy 9
## 5 businessmen 3
## 6 businesslike 2
## 7 busiest 1
## 8 busily 1
## 9 business's 1
## 10 businessman’s 1
Given the low number of times words other than “business” and “businesses” were used, this is likely not going to be an issue, and none of it will make a difference for the sentiment analysis. But these types of issues might be important in natural language processing analyses beyond sentiment, so it is good to pull on these threads.
3.6 Resulting dataframe for Further Analysis
For the analysis going forward, I am not going to stem the words, as I do not know how the sentiment lexicons will handle stemmed words. Combining unnesting, number removal and stop word removal gives us the output we were looking for at the beginning of this chapter: a dataframe with one token per row.
brk_words <- brk_letters %>%
  unnest_tokens(word, text) %>% # splits text into words
  filter(!grepl('[0-9]', word)) %>% # remove numbers
  anti_join(stop_words) %>% # remove stop words
  group_by(year) %>%
  ungroup()
tibble(brk_words) %>% print(n=5)
## # A tibble: 201,504 x 2
## year word
## <chr> <chr>
## 1 1971 stockholders
## 2 1971 berkshire
## 3 1971 hathaway
## 4 1971 pleasure
## 5 1971 report
## # ... with 201,499 more rows
In the next chapter, we will review different sentiment dictionaries before combining them with our dataframe and analyzing sentiment.
To see if stop word removal makes a difference, I will perform analyses with and without stop words. Again, the purpose of this exercise is to explore and learn, and knowing the advantages and disadvantages of various techniques is an important part of that learning.↩︎
478,243 total words after removing numbers, minus 201,504 after stop words are removed, equals 276,739; 276,739 / 478,243 = 57.87%↩︎
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html↩︎