Chapter 16 n-gram analysis
The last thing we’ll cover in this tutorial is an n-gram analysis, which allows you to study sequences of consecutive words. A sequence of two words, like “Republican Party”, is a bigram. A sequence of three words, like “Department of Justice”, is a trigram. A 5-gram (quintgram) has 5 consecutive words (e.g., “President of the United States”).
To do a bigram analysis, we’ll use the unnest_tokens() function. Previously, we wanted word tokens. To get n-grams instead, we can set the token argument to "ngrams". We can also add the n argument, which tells R how long each n-gram should be (if n = 3, you would be looking for trigrams).
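As a minimal sketch of that call, the example below builds a tiny stand-in data frame and tokenizes it into bigrams. The names toy_tweets and toy_bigram and the two example sentences are assumptions made for illustration; the tutorial's own tweet_bigram was presumably built with a similar call on the full tweet data.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the tutorial's tweet data (not the real data set)
toy_tweets <- tibble(
  status_id = c(1, 2),
  text = c("Please always use a colorblind friendly palette",
           "Send us a DM")
)

# token = "ngrams" with n = 2 yields one row per pair of consecutive words;
# unnest_tokens() lowercases the text and drops the input column by default
toy_bigram <- toy_tweets %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

head(toy_bigram)
```

Each tweet with k words contributes k - 1 bigrams, so the other columns (here, status_id) are repeated once per bigram.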
What does this data frame look like?
## # A tibble: 6 x 6
## status_id created_at screen_name sentiment phd bigram
## <dbl> <dttm> <chr> <dbl> <lgl> <chr>
## 1 1.35e18 2021-01-15 17:33:32 stevebagley 2 FALSE please always
## 2 1.35e18 2021-01-15 17:33:32 stevebagley 2 FALSE always use
## 3 1.35e18 2021-01-15 17:33:32 stevebagley 2 FALSE use a
## 4 1.35e18 2021-01-15 17:33:32 stevebagley 2 FALSE a colorblind
## 5 1.35e18 2021-01-15 17:33:32 stevebagley 2 FALSE colorblind friendly
## 6 1.35e18 2021-01-15 17:33:32 stevebagley 2 FALSE friendly palette
Now, we have a bigram column. Yay! What are the most common bigrams?
## # A tibble: 20 x 2
## bigram n
## <chr> <int>
## 1 academictwitter academicchatter 3379
## 2 academicchatter academictwitter 3003
## 3 in the 1458
## 4 u fe0f 1410
## 5 academictwitter phdchat 1387
## 6 phdchat academictwitter 1231
## 7 if you 1158
## 8 u 0001f48c 1158
## 9 u 0001f469 1018
## 10 a phd 987
## 11 us a 972
## 12 send us 957
## 13 a dm 952
## 14 s your 952
## 15 no more 947
## 16 academictwitter commissionsopen 944
## 17 commissionsopen commission 944
## 18 we got 943
## 19 u u 937
## 20 worries about 937
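A frequency table like the one above can be produced by counting the bigram column. The sketch below uses a small made-up data frame (toy_bigram is a hypothetical name, not the tutorial's data) so that it runs on its own; the same count() and head() calls should work on tweet_bigram.

```r
library(dplyr)

# Hypothetical bigram column standing in for tweet_bigram$bigram
toy_bigram <- tibble(
  bigram = c("in the", "a phd", "in the", "if you", "in the", "a phd")
)

# Count each distinct bigram, sort by frequency, and keep the top 20
top_bigrams <- toy_bigram %>%
  count(bigram, sort = TRUE) %>%
  head(20)

top_bigrams
```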
While most of the bigrams appear to be hashtags, you may notice that we still have our stop words (like “if” and “a”). How do we remove bigrams that contain stop words? We can use the separate() function in tidyr, which splits a character column into several columns based on a “delimiter” (in this case, the delimiter is a space). Learn more about delimiters here. We can then use filter(), a function in dplyr, to drop every row where either word is a stop word, and unite(), also in tidyr, to paste the two words back into a single bigram column.
# Split each bigram into its two words
bigrams_separated <- tweet_bigram %>%
  separate(bigram, c("word1", "word2"), sep = " ")
# Drop bigrams where either word is a stop word, then rejoin the words
tweet_bigram_filtered <- bigrams_separated %>%
  filter(!word1 %in% final_stop$word) %>%
  filter(!word2 %in% final_stop$word) %>%
  unite(bigram, word1, word2, sep = " ")
# Show the 10 most common remaining bigrams
tweet_bigram_filtered %>%
  dplyr::count(bigram, sort = TRUE) %>%
  head(10)
## # A tibble: 10 x 2
## bigram n
## <chr> <int>
## 1 commissionsopen commission 944
## 2 accept rush 932
## 3 commisions writingcommnunity 932
## 4 commission commisions 932
## 5 im taylor 932
## 6 writingcommnunity academic 932
## 7 academicchatter academicchatter 777
## 8 colorblind friendly 708
## 9 drawing figures 702
## 10 figures academicchatter 702
Interesting!
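To try the separate, filter, and unite steps without the tutorial's data, here is a self-contained sketch. The objects toy_stop and toy_bigram are hypothetical stand-ins for final_stop and tweet_bigram, and the short stop word list is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical stop word list standing in for final_stop
toy_stop <- tibble(word = c("a", "if", "in", "the", "you"))

# Hypothetical bigrams standing in for tweet_bigram
toy_bigram <- tibble(
  bigram = c("if you", "a phd", "colorblind friendly",
             "friendly palette", "in the")
)

toy_filtered <- toy_bigram %>%
  # Split on the space so each word can be checked separately
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Keep a row only if neither word is in the stop list
  filter(!word1 %in% toy_stop$word, !word2 %in% toy_stop$word) %>%
  # Paste the two words back into one bigram column
  unite(bigram, word1, word2, sep = " ")

toy_filtered
```

Note that a bigram such as "a phd" is dropped even though only one of its words is a stop word, because the filter requires both words to pass.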
Want to visualize these bigrams as a network? Check out this tutorial.
Want to make an n-gram word cloud? Check out this tutorial.
Julia Silge, who maintains tidytext, has written a great run-through of the package. Find it here.