2.6 Units other than words
Some sentiment analysis algorithms look beyond only unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole.
We may want to tokenize text into sentences, and it makes sense to use a new name for the output column in such a case.
PandP_sentences <- tibble(text = prideprejudice) %>%
  unnest_tokens(sentence, text, token = "sentences")
PandP_sentences
#> # A tibble: 7,066 x 1
#> sentence
#> <chr>
#> 1 "pride and prejudice by jane austen chapter 1 it is a truth universally~
#> 2 "however little known the feelings or views of such a man may be on his first~
#> 3 "\"my dear mr."
#> 4 "bennet,\" said his lady to him one day, \"have you heard that netherfield pa~
#> 5 "mr."
#> 6 "bennet replied that he had not."
#> # ... with 7,060 more rows
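With the text in a one-sentence-per-row shape, simple sentence-level summaries become easy. As a quick sketch of our own (not part of the original analysis), we can count the words in each sentence with stringr to find Austen's longest sentences:

library(dplyr)
library(stringr)

# Count the words in each sentence, then sort longest-first
PandP_sentences %>%
  mutate(n_words = str_count(sentence, boundary("word"))) %>%
  arrange(desc(n_words))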
The sentence tokenizing does seem to have a bit of trouble with UTF-8 encoded text, especially with sections of dialogue; it does much better with punctuation in ASCII. One possibility, if this is important, is to try using iconv() in a mutate() statement before unnesting, with something like iconv(text, to = 'latin1'), or converting all the way to ASCII as below.
tibble(text = prideprejudice) %>%
  mutate(text = iconv(text, to = "ASCII")) %>%
  unnest_tokens(sentence, text, token = "sentences")
#> # A tibble: 7,066 x 1
#> sentence
#> <chr>
#> 1 "pride and prejudice by jane austen chapter 1 it is a truth universally~
#> 2 "however little known the feelings or views of such a man may be on his first~
#> 3 "\"my dear mr."
#> 4 "bennet,\" said his lady to him one day, \"have you heard that netherfield pa~
#> 5 "mr."
#> 6 "bennet replied that he had not."
#> # ... with 7,060 more rows
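If you are unsure whether the conversion is needed at all, one possibility (a sketch of ours, reusing iconv() with its sub argument) is to look at which lines actually change under ASCII conversion:

library(dplyr)

# sub = "?" substitutes unconvertible characters instead of returning NA,
# so each line can be compared against its ASCII-converted version
tibble(text = prideprejudice) %>%
  mutate(converted = iconv(text, to = "ASCII", sub = "?")) %>%
  filter(text != converted)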
Another option in unnest_tokens() is to split into tokens using a regex pattern. We could use this, for example, to split the text of Jane Austen’s novels into a data frame by chapter.
austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex",
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()
austen_chapters
#> # A tibble: 275 x 2
#> book chapter
#> <fct> <chr>
#> 1 Sense & Sensibi~ "sense and sensibility\n\nby jane austen\n\n(1811)\n\n\n\n\n"
#> 2 Sense & Sensibi~ "\n\n\nthe family of dashwood had long been settled in susse~
#> 3 Sense & Sensibi~ "\n\n\nmrs. john dashwood now installed herself mistress of ~
#> 4 Sense & Sensibi~ "\n\n\nmrs. dashwood remained at norland several months; not~
#> 5 Sense & Sensibi~ "\n\n\n\"what a pity it is, elinor,\" said marianne, \"that ~
#> 6 Sense & Sensibi~ "\n\n\nno sooner was her answer dispatched, than mrs. dashwo~
#> # ... with 269 more rows
These 275 rows match the number of distinct book-and-chapter combinations in the tidy_books data frame we created in Section 2.2:

tidy_books %>%
  distinct(book, chapter)
#> # A tibble: 275 x 2
#> book chapter
#> <fct> <int>
#> 1 Sense & Sensibility 0
#> 2 Sense & Sensibility 1
#> 3 Sense & Sensibility 2
#> 4 Sense & Sensibility 3
#> 5 Sense & Sensibility 4
#> 6 Sense & Sensibility 5
#> # ... with 269 more rows
In the austen_chapters data frame, each row corresponds to one chapter. Near the beginning of this chapter, we used a similar regex to find where all the chapters were in Austen’s novels for a tidy data frame organized by one-word-per-row (Section 2.2). Using a regex as the token is somewhat similar to collapsing that one-word-per-row data frame back into one row per chapter:
tidy_books %>%
  group_by(book, chapter) %>%
  summarize(str_c(word, collapse = " "))
#> # A tibble: 275 x 3
#> # Groups: book [6]
#> book chapter `str_c(word, collapse = " ")`
#> <fct> <int> <chr>
#> 1 Sense & Sensi~ 0 sense and sensibility by jane austen 1811
#> 2 Sense & Sensi~ 1 chapter 1 the family of dashwood had long been settled~
#> 3 Sense & Sensi~ 2 chapter 2 mrs john dashwood now installed herself mist~
#> 4 Sense & Sensi~ 3 chapter 3 mrs dashwood remained at norland several mon~
#> 5 Sense & Sensi~ 4 chapter 4 what a pity it is elinor said marianne that ~
#> 6 Sense & Sensi~ 5 chapter 5 no sooner was her answer dispatched than mrs~
#> # ... with 269 more rows
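As a side note (our own suggestion, not shown in the output above), naming the collapsed column explicitly avoids the unwieldy header and drops the grouping in one step:

library(dplyr)
library(stringr)

# Same collapse, but with a named text column
tidy_books %>%
  group_by(book, chapter) %>%
  summarize(text = str_c(word, collapse = " "), .groups = "drop")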
We can use tidy text analysis to ask questions such as what are the most negative chapters in each of Jane Austen’s novels? First, let’s get the list of negative words from the Bing lexicon. Second, let’s make a data frame of how many words are in each chapter so we can normalize for the length of chapters. Then, let’s find the number of negative words in each chapter and divide by the total words in each chapter. For each book, which chapter has the highest proportion of negative words?
bing_negative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")

chapter_words <- tidy_books %>%
  count(book, chapter)

tidy_books %>%
  semi_join(bing_negative, by = "word") %>%
  count(book, chapter, name = "negative_words") %>%
  left_join(chapter_words, by = c("book", "chapter")) %>%
  mutate(ratio = negative_words / n) %>%
  filter(chapter != 0) %>%  # chapter 0 is the front matter before chapter 1
  group_by(book) %>%
  slice_max(ratio, n = 1)
#> # A tibble: 6 x 5
#> # Groups: book [6]
#> book chapter negative_words n ratio
#> <fct> <int> <int> <int> <dbl>
#> 1 Sense & Sensibility 43 161 3405 0.0473
#> 2 Pride & Prejudice 34 111 2104 0.0528
#> 3 Mansfield Park 46 173 3685 0.0469
#> 4 Emma 15 151 3340 0.0452
#> 5 Northanger Abbey 21 149 2982 0.0500
#> 6 Persuasion 4 62 1807 0.0343
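These are the chapters with the most negative words in each book, normalized for chapter length. To go a step further (a sketch of our own, assuming ggplot2 is loaded), we could plot the negative-word ratio across all the chapters of a single novel rather than keeping only the most negative one:

library(ggplot2)

# Negative-word ratio by chapter for one novel
tidy_books %>%
  semi_join(bing_negative, by = "word") %>%
  count(book, chapter, name = "negative_words") %>%
  left_join(chapter_words, by = c("book", "chapter")) %>%
  mutate(ratio = negative_words / n) %>%
  filter(book == "Pride & Prejudice", chapter != 0) %>%
  ggplot(aes(chapter, ratio)) +
  geom_line() +
  labs(x = "Chapter", y = "Proportion of negative words")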