4.1 Tokenizing by n-gram

unnest_tokens() has been used to tokenize the text by word, or sometimes by sentence, which is useful for the kinds of sentiment and frequency analyses we have done so far. But we can also use the function to tokenize the text into consecutive sequences of words of length n, called n-grams.

We do this by adding the token = "ngrams" option to unnest_tokens(), and setting n to the number of words we wish to capture in each n-gram. When we set n to 2, we are examining pairs of two consecutive words, often called “bigrams”:
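A minimal sketch of this step, using the Jane Austen novels from the janeaustenr package as in the rest of the chapter:

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

# Tokenize into bigrams: consecutive pairs of words
austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))  # blank or one-word lines produce NA bigrams

austen_bigrams
```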

4.1.2 Analyzing bigrams

The result of separating bigrams is helpful for exploratory analyses of the text. As a simple example, we might be interested in the most common “streets” mentioned in each book:
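A sketch of that query, assuming the bigrams are first separated into word1 and word2 columns with tidyr::separate():

```r
library(tidyr)

bigrams_separated <- austen_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# Most common words appearing immediately before "street", per book
bigrams_separated %>%
  filter(word2 == "street") %>%
  count(book, word1, sort = TRUE)
```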

A bigram can also be treated as a term in a document in the same way that we treated individual words. For example, we can look at the weighted log odds (Section 3.2) of bigrams across Austen novels.
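One way to compute this, assuming the tidylo package's bind_log_odds() is the weighted log odds approach referred to in Section 3.2:

```r
library(tidylo)

# Count bigrams within each novel, then compute weighted log odds
austen_bigrams %>%
  count(book, bigram, sort = TRUE) %>%
  bind_log_odds(book, bigram, n) %>%
  arrange(desc(log_odds_weighted))
```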

4.1.3 Using bigrams to provide context in sentiment analysis

Context matters in sentiment analysis. For example, the words “happy” and “like” will be counted as positive, even in a sentence like

“I’m not happy and I don’t like it!”

Now that we have the data organized into bigrams, it’s easy to tell how often words are preceded by a word like “not”:
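For example, reusing the separated bigrams from above:

```r
# Words most often preceded by "not"
bigrams_separated %>%
  filter(word1 == "not") %>%
  count(word1, word2, sort = TRUE)
```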

Let’s use the AFINN lexicon for sentiment analysis, which you may recall gives a numeric sentiment value for each word, with positive or negative numbers indicating the direction of the sentiment.
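A sketch of joining the AFINN values onto the words that follow “not”:

```r
AFINN <- get_sentiments("afinn")

# Sentiment values of words preceded by "not", with their frequencies
not_words <- bigrams_separated %>%
  filter(word1 == "not") %>%
  inner_join(AFINN, by = c("word2" = "word")) %>%
  count(word2, value, sort = TRUE)

not_words
```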

It’s worth asking which words contributed the most in the “wrong” direction. To compute that, we can multiply each word’s sentiment value by the number of times it appears (so that a word with a value of +3 occurring 10 times has as much impact as a word with a sentiment value of +1 occurring 30 times). We visualize the result with a bar plot.
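A sketch of that computation and plot, using the not_words frame from the previous step:

```r
library(ggplot2)

not_words %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(contribution, word2, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Sentiment value * number of occurrences",
       y = "Words preceded by \"not\"")
```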

The bigrams “not like” and “not help” were overwhelmingly the largest causes of misidentification, making the text seem much more positive than it is. But we can also see that phrases like “not afraid” and “not fail” sometimes make the text seem more negative than it is.

“Not” isn’t the only term that provides context for the following word. We could pick four common negation words (“not”, “no”, “never”, and “without”) that negate the subsequent term, and use the same joining and counting approach to examine all of them at once, as sketched below.
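A sketch of that generalization:

```r
negation_words <- c("not", "no", "never", "without")

# Sentiment-bearing words preceded by any of the four negation words
negated_words <- bigrams_separated %>%
  filter(word1 %in% negation_words) %>%
  inner_join(AFINN, by = c("word2" = "word")) %>%
  count(word1, word2, value, sort = TRUE)

negated_words
```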

4.1.4 Visualizing a network of bigrams with ggraph

Note how tidygraph handles network data: the main tbl_graph object splits a network into two data frames, node data and edge data.
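A sketch of building the graph; dropping stop words and keeping only bigrams that occur more than 20 times are choices made here to keep the network readable:

```r
library(tidygraph)

bigram_counts <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

# as_tbl_graph() treats the first two columns as an edge list;
# printing the result shows the node data and edge data separately
bigram_graph <- bigram_counts %>%
  filter(n > 20) %>%
  as_tbl_graph()

bigram_graph
```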

We see that salutations such as “miss”, “lady”, “sir”, and “colonel” are common centers of nodes, which are often followed by names. We also see pairs or triplets along the outside that form common short phrases (“half hour”, “thousand pounds”, or “short time/pause”).

Note that this is a visualization of a Markov chain, a common model in text processing, in which the choice of each word depends only on the previous word. In this case, a random generator following this model might spit out “dear”, then “sir”, then “william/walter/thomas/thomas’s”, by following each word to the most common words that follow it.

A polished graph:
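A sketch of a polished version, with directed edges whose transparency reflects how common the bigram is (the layout and styling choices here are my own):

```r
library(ggraph)
library(grid)

set.seed(2020)
a <- arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, "inches")) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
```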

4.1.5 Visualizing “Friends”

Here I deviate from the original text, where Julia and David analyzed the King James Version of the Bible. Instead, I have collected the transcripts of the famous TV series Friends (season 1). Let’s start with a simple analysis by first loading the data:
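Since the collection step isn't shown here, this is a minimal loading sketch that assumes the season-1 transcripts live in a hypothetical friends_s1.csv with one row per line of dialogue:

```r
library(readr)

# Hypothetical file: one row per line of dialogue,
# with a `scene` identifier and a `text` column
friends_raw <- read_csv("friends_s1.csv")

friends_raw
```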

Retrieve a clean data frame with word counts (bigrams did not work very well for this text):
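A sketch of the cleaning step, tokenizing to single words and dropping stop words before counting:

```r
friends_words <- friends_raw %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

friends_words %>%
  count(word, sort = TRUE)
```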

Draw a network
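One way to sketch it: since we are working with single words rather than bigrams, I assume co-occurrence within the same scene as the edge definition, via widyr::pairwise_count(); the cutoff of 10 shared scenes is arbitrary:

```r
library(widyr)

# Count how often two words appear in the same scene
word_pairs <- friends_words %>%
  pairwise_count(word, scene, sort = TRUE)

word_pairs %>%
  filter(n >= 10) %>%
  as_tbl_graph() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 4) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
```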