Chapter 16 n-gram analysis

The last thing we’ll cover in this tutorial is an n-gram analysis. An n-gram analysis allows you to study a sequence of words. A sequence of two words, like “Republican Party,” is a bigram. A sequence of three words, like “Department of Justice,” is a trigram. A 5-gram would contain five consecutive words (e.g., “President of the United States”).

To do a bigram analysis, we’ll use the unnest_tokens() function. Previously, we wanted word tokens. But if we want n-grams, we can set the token argument to “ngrams”. We can also add the n argument, which tells R how long the n-gram should be (if n = 3, we would be looking for trigrams).

# split each tweet's text into overlapping two-word sequences
tweet_bigram <- tw_data %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
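
If we wanted trigrams instead, the only change is the n argument. A quick sketch (tweet_trigram is just an illustrative name; we won’t use it again in this tutorial):

# same tokenizer, but asking for three-word sequences
tweet_trigram <- tw_data %>% 
  unnest_tokens(trigram, text, token = "ngrams", n = 3)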

What does this data frame look like?

head(tweet_bigram)
## # A tibble: 6 x 6
##   status_id created_at          screen_name sentiment phd   bigram             
##       <dbl> <dttm>              <chr>           <dbl> <lgl> <chr>              
## 1   1.35e18 2021-01-15 17:33:32 stevebagley         2 FALSE please always      
## 2   1.35e18 2021-01-15 17:33:32 stevebagley         2 FALSE always use         
## 3   1.35e18 2021-01-15 17:33:32 stevebagley         2 FALSE use a              
## 4   1.35e18 2021-01-15 17:33:32 stevebagley         2 FALSE a colorblind       
## 5   1.35e18 2021-01-15 17:33:32 stevebagley         2 FALSE colorblind friendly
## 6   1.35e18 2021-01-15 17:33:32 stevebagley         2 FALSE friendly palette

Now, we have a bigram column. Yay! What are the most common bigrams?

tweet_bigram %>%
  dplyr::count(bigram, sort = TRUE) %>%
  head(20)
## # A tibble: 20 x 2
##    bigram                              n
##    <chr>                           <int>
##  1 academictwitter academicchatter  3379
##  2 academicchatter academictwitter  3003
##  3 in the                           1458
##  4 u fe0f                           1410
##  5 academictwitter phdchat          1387
##  6 phdchat academictwitter          1231
##  7 if you                           1158
##  8 u 0001f48c                       1158
##  9 u 0001f469                       1018
## 10 a phd                             987
## 11 us a                              972
## 12 send us                           957
## 13 a dm                              952
## 14 s your                            952
## 15 no more                           947
## 16 academictwitter commissionsopen   944
## 17 commissionsopen commission        944
## 18 we got                            943
## 19 u u                               937
## 20 worries about                     937

While most of the top bigrams appear to be hashtags (entries like “u fe0f” and “u 0001f48c” are likely leftover Unicode code points from emoji), you may notice that we still have our stop words (like “if” and “a”). How do we remove bigrams containing these words? Well, we can use the separate() function in tidyr, which splits a character column based on a “delimiter” (in this case, the delimiter is a space). Learn more about delimiters here. We can then filter() out stop words. (filter() is a function in dplyr.)

# split each bigram into its two component words
bigrams_separated <- tweet_bigram %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# drop bigrams where either word is a stop word, then paste the rest back together
tweet_bigram_filtered <- bigrams_separated %>%
  filter(!word1 %in% final_stop$word) %>%
  filter(!word2 %in% final_stop$word) %>%
  unite(bigram, word1, word2, sep = " ")

tweet_bigram_filtered %>%
  dplyr::count(bigram, sort = TRUE) %>%
  head(10)
## # A tibble: 10 x 2
##    bigram                              n
##    <chr>                           <int>
##  1 commissionsopen commission        944
##  2 accept rush                       932
##  3 commisions writingcommnunity      932
##  4 commission commisions             932
##  5 im taylor                         932
##  6 writingcommnunity academic        932
##  7 academicchatter academicchatter   777
##  8 colorblind friendly               708
##  9 drawing figures                   702
## 10 figures academicchatter           702

Interesting!
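
One more trick before we wrap up: if you count the separated word columns directly (instead of uniting them first), you get a word-pair table that comes in handy for the network visualization mentioned below. A quick sketch, reusing the bigrams_separated object from above:

# count word pairs while keeping the two word columns separate
bigram_counts <- bigrams_separated %>%
  filter(!word1 %in% final_stop$word) %>%
  filter(!word2 %in% final_stop$word) %>%
  dplyr::count(word1, word2, sort = TRUE)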

Want to visualize this bigram data as a network? Check out this tutorial (and see the first sketch below).
Want to make an n-gram word cloud? Check out this tutorial (and see the second sketch below).
A great tidytext run-through by maintainer Julia Silge. Find it here.
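
As a starting point for the network idea, here is a minimal sketch, assuming the igraph and ggraph packages are installed (neither is loaded elsewhere in this tutorial). It uses the bigram_counts object from the sketch above:

library(igraph)
library(ggraph)

# build a directed graph from the word-pair counts, keeping only frequent pairs
bigram_graph <- bigram_counts %>%
  filter(n > 500) %>%
  graph_from_data_frame()

# plot the graph with a force-directed ("fr") layout
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)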
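
And for the word cloud idea, a rough sketch using the wordcloud package (also an assumption; any word cloud package would do):

library(wordcloud)

# size each bigram by how often it appears, plotting the 50 most common
bigram_freq <- tweet_bigram_filtered %>%
  dplyr::count(bigram, sort = TRUE)

wordcloud(words = bigram_freq$bigram,
          freq = bigram_freq$n,
          max.words = 50)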