• Notes for Text Mining with R
  • Preface
  • I Text Mining with R
  • 1 Tidy text format
    • 1.1 The unnest_tokens() function
    • 1.2 The gutenbergr package
    • 1.3 Compare word frequency
    • 1.4 Other tokenization methods
  • 2 Sentiment analysis with tidy data
    • 2.1 The sentiments dataset
    • 2.2 Sentiment analysis with inner join
    • 2.3 Comparing 3 different dictionaries
    • 2.4 Most common positive and negative words
    • 2.5 Wordclouds
    • 2.6 Units other than words
  • 3 Analyzing word and document frequency
    • 3.1 tf-idf
      • 3.1.1 Term frequency in Jane Austen’s novels
      • 3.1.2 Zipf’s law
      • 3.1.3 Word rank slope chart
      • 3.1.4 The bind_tf_idf() function
    • 3.2 Weighted log odds ratio
      • 3.2.1 Log odds ratio
      • 3.2.2 Model-based approach: Weighted log odds ratio
      • 3.2.3 Discussions
      • 3.2.4 bind_log_odds()
    • 3.3 A corpus of physics texts
  • 4 Relationships between words: n-grams and correlations
    • 4.1 Tokenizing by n-gram
      • 4.1.1 Filtering n-grams
      • 4.1.2 Analyzing bigrams
      • 4.1.3 Using bigrams to provide context in sentiment analysis
      • 4.1.4 Visualizing a network of bigrams with ggraph
      • 4.1.5 Visualizing “friends”
    • 4.2 Counting and correlating pairs of words with widyr
      • 4.2.1 Counting and correlating among sections
      • 4.2.2 Pairwise correlation
  • 5 Converting to and from non-tidy formats
    • 5.1 Tidying a document-term matrix
    • 5.2 Casting tidy text data into a matrix
    • 5.3 Tidying corpus objects with metadata
  • 6 Topic modeling
    • 6.1 Latent Dirichlet Allocation
      • 6.1.1 Example: Associated Press
    • 6.2 Example: the great library heist
      • 6.2.1 LDA on chapters
      • 6.2.2 Per-document classification
      • 6.2.3 By word assignments: augment()
    • 6.3 Tuning number of topics
  • 7 Text classification
  • References
  • Appendix
  • A Reviews on regular expressions
    • A.1 POSIX Character Classes
    • A.2 Greedy and lazy quantifiers
    • A.3 Looking ahead and back
    • A.4 Backreferences
  • B Text processing examples in R
    • B.1 Replacing and removing
    • B.2 Combining and splitting
    • B.3 Extracting text from pdf and other files
      • B.3.1 Office documents
      • B.3.2 Images

Notes for “Text Mining with R: A Tidy Approach”

Qiushi Yan

2020-05-11

Preface

This notebook contains my notes on Text Mining with R: A Tidy Approach (Silge and Robinson 2017).

tidyverse and tidytext are automatically loaded before each chapter:

library(tidyverse)
library(tidytext)

I have defined a simple function, facet_bar(), to meet the frequent need in this book for a faceted bar plot in which the y variable is reordered by x within each facet:

facet_bar <- function(df, y, x, by, nrow = 2, ncol = 2, scales = "free") {
  # reorder y by x separately within each level of `by`
  mapping <- aes(y = reorder_within({{ y }}, {{ x }}, {{ by }}), 
                 x = {{ x }}, 
                 fill = {{ by }})
  
  # one panel per level of `by`
  facet <- facet_wrap(vars({{ by }}), 
                      nrow = nrow, 
                      ncol = ncol,
                      scales = scales) 
  
  ggplot(df, mapping = mapping) + 
    geom_col(show.legend = FALSE) + 
    scale_y_reordered() +  # strip the suffix added by reorder_within()
    facet + 
    ylab("")
}
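To see what reorder_within() and scale_y_reordered() contribute inside facet_bar(), here is a minimal sketch on a made-up data frame (the `toy` data and its values are purely illustrative, not from the book). reorder_within() pastes the facet label onto each value with a "___" separator and orders the resulting factor by the counts, which is why the same word can sit at a different position in each panel; scale_y_reordered() then removes that suffix from the axis labels.

```r
library(tidytext)

# Hypothetical toy data: two "books" sharing the same two words
toy <- data.frame(
  book = c("A", "A", "B", "B"),
  word = c("love", "time", "love", "time"),
  n    = c(5, 9, 8, 2)
)

# Each word is tagged with its book, then the factor levels are
# sorted by n, so the ordering is independent for each book
ordered <- reorder_within(toy$word, toy$n, toy$book)

levels(ordered)
#> [1] "time___B" "love___A" "love___B" "time___A"
```

Within book A, "love" (5) comes before "time" (9), while within book B the order is reversed, which a single global reorder() on `word` could not express.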

As a quick demonstration of this function, we can plot the 10 most common words in each of Jane Austen’s six books:

austen_common <- janeaustenr::austen_books() %>% 
  unnest_tokens(word, text) %>%          # one row per word
  anti_join(stop_words, by = "word") %>% # drop stop words
  count(book, word) %>% 
  group_by(book) %>% 
  top_n(10, n) %>%                       # 10 most frequent words per book
  ungroup()

austen_common
#> # A tibble: 60 x 3
#>   book                word         n
#>   <fct>               <chr>    <int>
#> 1 Sense & Sensibility dashwood   231
#> 2 Sense & Sensibility edward     220
#> 3 Sense & Sensibility elinor     623
#> 4 Sense & Sensibility jennings   199
#> 5 Sense & Sensibility marianne   492
#> 6 Sense & Sensibility miss       210
#> # ... with 54 more rows

# make a bar plot 
facet_bar(austen_common,
          y = word,
          x = n,
          by = book,
          nrow = 3) + 
  labs(title = "Top 10 common words in Jane Austen's novels",
       x = "")
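The top-10 step above uses top_n(). In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), so the same selection could be sketched as follows. This is an alternative under that version assumption, not the code these notes were built with, and the name `austen_common_alt` is introduced here only for illustration. Unlike top_n(), slice_max() returns the rows sorted by descending count within each group; like top_n(), it keeps ties by default.

```r
library(dplyr)
library(tidytext)

austen_common_alt <- janeaustenr::austen_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(book, word) %>%
  group_by(book) %>%
  # keep the 10 highest counts per book (ties are kept)
  slice_max(n, n = 10) %>%
  ungroup()
```

The result can be passed to facet_bar() in the same way as austen_common, since reorder_within() recomputes the ordering from the counts.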