Notes for Text Mining with R

Preface

Part I: Text Mining with R

1 Tidy text format
1.1 The unnest_tokens() function
1.2 The gutenbergr package
1.3 Comparing word frequencies
1.4 Other tokenization methods

2 Sentiment analysis with tidy data
2.1 The sentiments dataset
2.2 Sentiment analysis with inner join
2.3 Comparing three different dictionaries
2.4 Most common positive and negative words
2.5 Wordclouds
2.6 Units other than words

3 Analyzing word and document frequency
3.1 tf-idf
3.1.1 Term frequency in Jane Austen’s novels
3.1.2 Zipf’s law
3.1.3 Word rank slope chart
3.1.4 The bind_tf_idf() function
3.2 Weighted log odds ratio
3.2.1 Log odds ratio
3.2.2 Model-based approach: weighted log odds ratio
3.2.3 Discussion
3.2.4 The bind_log_odds() function
3.3 A corpus of physics texts

4 Relationships between words: n-grams and correlations
4.1 Tokenizing by n-gram
4.1.1 Filtering n-grams
4.1.2 Analyzing bigrams
4.1.3 Using bigrams to provide context in sentiment analysis
4.1.4 Visualizing a network of bigrams with ggraph
4.1.5 Visualizing “friends”
4.2 Counting and correlating pairs of words with widyr
4.2.1 Counting and correlating among sections
4.2.2 Pairwise correlation

5 Converting to and from non-tidy formats
5.1 Tidying a document-term matrix
5.2 Casting tidy text data into a matrix
5.3 Tidying corpus objects with metadata

6 Topic modeling
6.1 Latent Dirichlet allocation
6.1.1 Example: Associated Press
6.2 Example: the great library heist
6.2.1 LDA on chapters
6.2.2 Per-document classification
6.2.3 By word assignments: augment()
6.3 Tuning the number of topics

7 Text classification

References

Appendix

A Review of regular expressions
A.1 POSIX character classes
A.2 Greedy and lazy quantifiers
A.3 Looking ahead and behind
A.4 Backreferences

B Text processing examples in R
B.1 Replacing and removing
B.2 Combining and splitting
B.3 Extracting text from PDF and other files
B.3.1 Office documents
B.3.2 Images