Visualizing text data

Learning outcomes/objective: Learn…

Sources: Original material; Camille Landesvatter’s topic model lecture; Silge (2017)

1 Text as Data

  • Many sources of text data for social scientists
    • open-ended survey responses, social media data, interview transcripts, news articles, official documents (public records, etc.), research publications, etc.
  • even if the data of interest is not in textual form (yet), it can often be converted
    • speech recognition, text recognition (OCR), machine translation, etc.
  • “Past”: text data often ignored (by quants), selectively read, anecdotally used or manually labeled by researchers
  • Today: wide variety of text-analytic methods (supervised + unsupervised) and increasing adoption of these methods by social scientists (Wilkerson and Casas 2017)

2 Language in NLP

  • corpus: a collection of documents
  • documents: single tweets, single statements, single text files, etc.
  • tokenization: “the process of splitting text into tokens” (Silge 2017)
    • tokens = single words, sequences of words or entire sentences
    • Often defines the unit of analysis
  • bag of words (method): approach where all tokens are put together in a “bag” without considering their order1
    • possible issues with a simple bag-of-words: “I’m not happy and I don’t like it!” (see the sketch after this list)
  • stop words: very common but uninformative terms such as “the”, “and”, “they”, etc.
    • Q: Are stop words really uninformative?
  • document-term/feature matrix (DTM/DFM): common format to store text data (examples later)
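
A minimal sketch of tokenization and the bag-of-words view in R, assuming the {tidytext} package used later in the lab (the sentence is the problematic example from above):

library(tidytext)
library(tibble)
library(dplyr)

# A one-document "corpus": the problematic example sentence
corpus_df <- tibble(doc_id = 1, text = "I'm not happy and I don't like it!")

# Tokenization into single words (lowercased, punctuation dropped)
tokens <- corpus_df %>% unnest_tokens(word, text)

# Counting tokens discards word order: the bag-of-words view,
# in which "not" is no longer attached to "happy"
tokens %>% count(word, sort = TRUE)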

3 (R-)Workflow for Text Analysis

  1. Data collection (“Obtaining Text”*)
  2. Data manipulation
    • Corpus pre-processing (“From Text to Data”*)
    • Vectorization: Turning Text into a Matrix (DTM/DFM2) (“From Text to Data”*)
  3. Analysis (“Quantitative Analysis of Text”*)
  4. Validation and Model Selection (“Evaluating Performance”3)
  5. Visualization and Model Interpretation
  • Visualization at descriptive and/or modelling stage

4 Data collection

  • use existing corpora
    • R: {gutenbergr}: provides access to the full texts of more than 60k public-domain books from Project Gutenberg
    • R: {quanteda.corpora}: provides multiple corpora; see the package documentation for an overview
    • R: {topicmodels}: contains Associated Press data (2246 news articles mostly from around 1988)
    • search for datasets, see e.g. this list
  • collect new corpora
    • electronic sources: application programming interfaces (APIs, e.g., Facebook, Twitter), web scraping, Wikipedia, transcripts of all German electoral programs
    • undigitized text, e.g. scans of documents
    • data from interviews, surveys and/or experiments (speech → text)
  • consider relevant tools to turn your data into text format (speech-to-text recognition, PDF-to-text conversion, OCR, human transcription via Mechanical Turk and CrowdFlower)




5 Data manipulation

5.1 Data manipulation: Basics (1)

  • Text data is different from “structured” data (e.g., a set of rows and columns)
  • Most often not “clean” but rather messy
    • shortcuts, dialect, incorrect grammar, missing words, spelling issues, ambiguous language, humor
    • web context: emojis, # (twitter), etc.
  • Preprocessing
    • often underestimated, but a crucial determinant of successful text analysis!

5.2 Data manipulation: Basics (2)

Common steps in pre-processing text data:

  • stemming (removal of word suffixes), e.g., computation, computational, computer \(\rightarrow\) comput (see the sketch after this list)

  • lemmatisation (reduce a term to its lemma, i.e., its base form), e.g., “better” \(\rightarrow\) “good”

  • transformation to lower cases

  • removal of punctuation (e.g., ,;.-) / numbers / white spaces / URLs / stopwords / very infrequent words

  • \(\rightarrow\) Always choose your preprocessing steps carefully!

    • e.g., removing punctuation: “I enjoy: eating, my cat and leaving out commas” vs. “I enjoy eating my cat and leaving out commas”
  • Also: choose your unit of analysis carefully (sentence vs. unigram vs. bigram, etc.)
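
A minimal sketch of stemming, assuming the {SnowballC} package (its Porter stemmer also underlies tm’s stemDocument()):

library(SnowballC)

# The Porter stemmer reduces all three terms to the same stem ("comput")
wordStem(c("computation", "computational", "computer"))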

5.3 Data manipulation: Basics (3)

  • In principle, all those transformations can be achieved by using base R

  • Other packages, however, provide ready-to-apply functions, such as {tidytext}, {tm}, or {quanteda}

  • Important

    • transform data to corpus object or tidy text object (examples on the next slides)




6 Lab: Basic data manipulation & visualization using tidytext

6.1 Functions & packages

  • unnest_tokens(): split a column into tokens, flattening the table into one-token-per-row.
    • By default unnest_tokens() removes punctuation and makes all terms lowercase automatically
    • ?unnest_tokens
      • to_lower = TRUE: Specify whether to convert tokens to lowercase
      • drop = TRUE: Specify whether original input column should get dropped
      • token = "words": Specify unit for tokenizing, or a custom tokenizing function
  • anti_join(): a filtering join; it returns the rows of x that have no match in y
    • e.g., anti_join(stop_words): filter out stopwords
  • cast_dtm(): Turns a “tidy” one-term-per-document-per-row data frame into a DocumentTermMatrix from the tm package
    • See also cast_tdm() and cast_dfm()
    • Usage:
data_tidy %>%                  # Use tidy text format data
  count(text_id, word) %>%     # Count words per document
  cast_dtm(document = text_id, # Spread into matrix
           term = word,
           value = n) %>%
  as.matrix()                  # Store as matrix
  • Tidytext format lends itself to using dplyr functions
  • Filter out particular tokens:
data_tidy %>% 
    filter(!word %in% c("t.co", "https", "rt", "http"))
  • Filter out 5000 rarest tokens:
tokens_rare <- data_tidy %>% 
  count(word) %>%    # Count frequency of tokens
  arrange(n) %>%     # Order dataframe ascending
  slice(1:5000) %>%  # Take first 5000 rows (rarest tokens)
  pull(word)         # Extract tokens

# Filter out the tokens defined above
data_tidy %>% 
  filter(!word %in% tokens_rare)

6.2 Importing data & tidy text format & stop words

  • Pre-processing with tidytext requires your data to be stored in a tidy text object
  • Main characteristics of a tidy text dataset
    • one-token-per-row
    • “long format” (Row: Document \(\times\) token)

First, we have to retrieve some data. We’ll use tweet data from Russian trolls (Roeder 2018) (these are not real people anyway). The data below has been edited (variables & observations subsampled; language == English; account_category == LeftTroll or RightTroll) based on the file IRAhandle_tweets_1.csv. The variables are explained here: https://github.com/fivethirtyeight/russian-troll-tweets/.



Below we start by installing/loading the necessary packages:

# Load packages (and install them first if necessary)
# install.packages("pacman")
pacman::p_load(tidyverse,
               rvest,
               xml2,
               tidytext,
               tm,
               ggwordcloud,
               knitr)

Then we load the data into R (adjust the file path or use the download link on your computer):

# Load the data
# data_IRAhandle_tweets_1_sampled.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1GNrZfF3itKxUtbngQhP6lCyOM3ApbgTD"))


data <- read_csv("data/data_IRAhandle_tweets_1_sampled.csv",
                 col_types = cols())
dim(data) # What dimensions do we have?
[1] 4000    9
# View(data)



We start by adding an identifier text_id to the documents/tweets.

# Add id
data <- data %>%
  mutate(text_id = row_number()) %>%
  select(text_id, everything()) # What happens here?

# head(data)
kable(head(data))
text_id author text region language publish_date following followers account_type account_category
1 AMELIEBALDWIN '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' United States English 10/9/2016 4:45 1944 2472 Right RightTroll
2 ACEJINEV Trans-Siberian Orchestra Tickets On Sale. Buy #TSO Christmas Concert #Tickets @eCityTickets https://t.co/FWMaOWxEh7 https://t.co/J2ZzUe87Fl United States English 11/7/2016 14:54 803 908 Left LeftTroll
3 ANDEERLWR #anderr GRAPHIC VIDEO : Did ANTIFA Violence Cause the Tragedy at #Charlottesville? https://t.co/y2QeK4AC0J https://t.co/OEXcWQdiL4 United States English 8/13/2017 16:47 21 6 Right RightTroll
4 AMELIEBALDWIN Remember all the fuss when the US wrongly accused #Russia of destroying hospitals in #Syria? Well this is for real https://t.co/U5LHrZLgnB United States English 3/18/2017 20:22 2303 2744 Right RightTroll
5 ALBERTMORENMORE RT @2AFight: More people die from alcohol than guns. READ> https://t.co/7H6OlJJtEC #2A #NRA #tcot #tgdn #PJNET #ccot #teaparty https://t.c… United States English 2/18/2016 9:02 1136 729 Right RightTroll
6 AMELIEBALDWIN '@RealAlexJones @BarackObama @POTUS @theDemocrats Obama is the Divider-In-Chief. His mission from Soros was destroy America from within.' United States English 11/30/2016 8:25 2366 2578 Right RightTroll
dim(data)
[1] 4000   10




Then, by using the unnest_tokens() function from tidytext we transform this data to a tidy text format, where the words (tokens) of each text/document are written into their own rows.

# Create tidy text format and remove stopwords
data_tidy <- data %>%
  unnest_tokens(word, text, drop = FALSE) %>% # unnest & keep orig. documents
  anti_join(stop_words) %>%
  select(text_id, author, text, word, everything())
    
# head(data_tidy)
kable(head(data_tidy))
text_id author text word region language publish_date following followers account_type account_category
1 AMELIEBALDWIN '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' genflynn United States English 10/9/2016 4:45 1944 2472 Right RightTroll
1 AMELIEBALDWIN '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' realdonaldtrump United States English 10/9/2016 4:45 1944 2472 Right RightTroll
1 AMELIEBALDWIN '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' mike_pence United States English 10/9/2016 4:45 1944 2472 Right RightTroll
1 AMELIEBALDWIN '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' witnessed United States English 10/9/2016 4:45 1944 2472 Right RightTroll
1 AMELIEBALDWIN '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' wis United States English 10/9/2016 4:45 1944 2472 Right RightTroll
1 AMELIEBALDWIN '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' anti United States English 10/9/2016 4:45 1944 2472 Right RightTroll
dim(data_tidy)
[1] 43998    11
  • Questions:
    • How does our dataset change after tokenization and removing stopwords? How many observations do we now have? And what do the variable text_id and word identify/store?
    • Also, have a closer look at the single words. Do you notice anything else that has changed, e.g., is something missing from the original text?



We can use str() and data_tidy$word[1:15] to inspect the resulting data/tokens.

str(data_tidy,10)
tibble [43,998 × 11] (S3: tbl_df/tbl/data.frame)
 $ text_id         : int [1:43998] 1 1 1 1 1 1 1 1 1 1 ...
 $ author          : chr [1:43998] "AMELIEBALDWIN" "AMELIEBALDWIN" "AMELIEBALDWIN" "AMELIEBALDWIN" ...
 $ text            : chr [1:43998] "'@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker ant"| __truncated__ "'@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker ant"| __truncated__ "'@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker ant"| __truncated__ "'@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker ant"| __truncated__ ...
 $ word            : chr [1:43998] "genflynn" "realdonaldtrump" "mike_pence" "witnessed" ...
 $ region          : chr [1:43998] "United States" "United States" "United States" "United States" ...
 $ language        : chr [1:43998] "English" "English" "English" "English" ...
 $ publish_date    : chr [1:43998] "10/9/2016 4:45" "10/9/2016 4:45" "10/9/2016 4:45" "10/9/2016 4:45" ...
 $ following       : num [1:43998] 1944 1944 1944 1944 1944 ...
 $ followers       : num [1:43998] 2472 2472 2472 2472 2472 ...
 $ account_type    : chr [1:43998] "Right" "Right" "Right" "Right" ...
 $ account_category: chr [1:43998] "RightTroll" "RightTroll" "RightTroll" "RightTroll" ...
data_tidy$word[1:15]
 [1] "genflynn"        "realdonaldtrump" "mike_pence"      "witnessed"      
 [5] "wis"             "anti"            "worker"          "anti"           
 [9] "liberty"         "agenda"          "elite"           "b.s"            
[13] "https"           "t.co"            "url3frnfqt"     

6.3 Data manipulation: Tidytext Example (2)

  • Other transformations may require regular expressions
    • e.g., removing whitespace with base R’s gsub() (the regex \\s+ matches one or more whitespace characters):
example_white_space <- gsub("\\s+","",data$text)
example_white_space[1:5]
[1] "'@GenFlynn@realDonaldTrump@mike_penceafterwhatIwitnessedtoday...wisneedsserioushelp!Antiworkerantilibertyagenda.EliteB.S.https://t.co/URL3FrNfqT'"
[2] "Trans-SiberianOrchestraTicketsOnSale.Buy#TSOChristmasConcert#Tickets@eCityTicketshttps://t.co/FWMaOWxEh7https://t.co/J2ZzUe87Fl"                  
[3] "#anderrGRAPHICVIDEO:DidANTIFAViolenceCausetheTragedyat#Charlottesville?https://t.co/y2QeK4AC0Jhttps://t.co/OEXcWQdiL4"                            
[4] "RememberallthefusswhentheUSwronglyaccused#Russiaofdestroyinghospitalsin#Syria?Wellthisisforrealhttps://t.co/U5LHrZLgnB"                           
[5] "RT@2AFight:Morepeoplediefromalcoholthanguns.READ>https://t.co/7H6OlJJtEC#2A#NRA#tcot#tgdn#PJNET#ccot#teapartyhttps://t.c…"                        
  • Advantage: tidy text format → regular R functions can be used
    • …instead of functions specialized to analyze a corpus object
  • e.g., use dplyr workflow to count the most popular words in your text data:
data_tidy %>% 
    count(word) %>% 
    arrange(desc(n))
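
As a side note, count() has a sort argument that combines the counting and ordering steps above into a single call:

data_tidy %>% 
  count(word, sort = TRUE)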



Below is an example where we first identify the rarest tokens and then filter them out:

# Identify rare tokens
tokens_rare <- data_tidy %>% 
  count(word) %>%    # Count frequency of tokens
  arrange(n) %>%     # Order dataframe ascending
  slice(1:5000) %>%  # Take first 5000 rows (rarest tokens)
  pull(word)         # Extract tokens

# Filter out the tokens defined above
data_tidy_filtered <- data_tidy %>% 
  filter(!word %in% tokens_rare)

dim(data_tidy_filtered)
[1] 38998    11



  • Tidytext is a good starting point (in my opinion) because we can carry out these steps individually
    • other packages combine many steps into one single function (e.g., quanteda combines pre-processing and DFM casting in one step)
  • R (as usual) offers many ways to achieve similar or identical results
    • e.g., you could import, filter, and pre-process using dplyr and tidytext, further pre-process and vectorize with tm or quanteda (tm has a simpler grammar but slightly fewer features), run machine learning applications, and eventually re-convert to a tidy format for interpretation and visualization (ggplot2)




7 Vectorization: Basics

  • Text analytical models (e.g., topic models) often require the input data to be stored in a certain format
  • Typically: document-term matrix (DTM), sometimes also called document-feature matrix (DFM)
    • turn raw text into a vector-space representation
    • matrix where each row represents a document and each column represents a word
      • term-frequency (tf): the number within each cell describes the number of times the word appears in the document
      • term frequency–inverse document frequency (tf-idf): weights term counts by how document-specific a term is, e.g., lowering the weight of the word “education” in a corpus of articles on educational inequality (see the sketch after this list)
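
A minimal sketch of tf-idf weighting with tidytext’s bind_tf_idf(), reusing the data_tidy object from the lab above (tidytext computes \(\mathrm{idf}(t) = \ln(N/n_t)\), where \(N\) is the number of documents and \(n_t\) the number of documents containing term \(t\)):

# Count terms per document, then add tf, idf and tf_idf columns
data_tidy %>%
  count(text_id, word) %>%
  bind_tf_idf(term = word, document = text_id, n = n) %>%
  arrange(desc(tf_idf)) # most document-specific terms first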

8 Lab: Vectorization with Tidytext

Repeat all the steps from above…

# Load packages (and install them first if necessary)
# install.packages("pacman")
pacman::p_load(tidyverse,
               rvest,
               xml2,
               tidytext,
               tm,
               ggwordcloud,
               knitr)



# Load the data
# data_IRAhandle_tweets_1_sampled.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1GNrZfF3itKxUtbngQhP6lCyOM3ApbgTD"))


data <- read_csv("data/data_IRAhandle_tweets_1_sampled.csv",
                 col_types = cols())
dim(data) # What dimensions do we have?
[1] 4000    9
# Add id
data <- data %>%
  mutate(text_id = row_number()) %>%
  select(text_id, everything()) # What happens here?


# Create tidy text format and remove stopwords
data_tidy <- data %>%
  unnest_tokens(word, text, drop = FALSE) %>% 
  anti_join(stop_words) %>%
  select(text_id, author, text, word, everything())

kable(head(data_tidy))
text_id author text word region language publish_date following followers account_type account_category
1 AMELIEBALDWIN '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' genflynn United States English 10/9/2016 4:45 1944 2472 Right RightTroll
1 AMELIEBALDWIN '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' realdonaldtrump United States English 10/9/2016 4:45 1944 2472 Right RightTroll
1 AMELIEBALDWIN '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' mike_pence United States English 10/9/2016 4:45 1944 2472 Right RightTroll
1 AMELIEBALDWIN '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' witnessed United States English 10/9/2016 4:45 1944 2472 Right RightTroll
1 AMELIEBALDWIN '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' wis United States English 10/9/2016 4:45 1944 2472 Right RightTroll
1 AMELIEBALDWIN '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' anti United States English 10/9/2016 4:45 1944 2472 Right RightTroll




With the cast_dtm function from the tidytext package, we can now transform it to a DTM.

# Cast tidy text data into DTM format
dtm <- data_tidy %>% 
  count(text_id, word) %>%
  cast_dtm(document = text_id,
           term = word,
           value = n) %>%
  as.matrix()

# Check the dimensions and a subset of the DTM
dim(dtm)
[1]  4000 15153
print(dtm[1:6,1:6]) # important: this is only a snippet of the DTM (first 6 docs, 6 terms)
    Terms
Docs agenda anti b.s elite genflynn https
   1      1    2   1     1        1     1
   2      0    0   0     0        0     2
   3      0    0   0     0        0     2
   4      0    0   0     0        0     1
   5      0    0   0     0        0     2
   6      0    0   0     0        0     0
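
The reverse direction also works: tidytext provides a tidy() method for tm’s DocumentTermMatrix class, which is handy for bringing model input back into a ggplot2-friendly format. Note that it must be applied to the DTM object, i.e., before the as.matrix() step:

# Cast without as.matrix() to keep the sparse DocumentTermMatrix object
dtm_sparse <- data_tidy %>% 
  count(text_id, word) %>%
  cast_dtm(document = text_id, term = word, value = n)

# Back to a tidy (document, term, count) tibble; zero entries are dropped
tidy(dtm_sparse)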




9 Lab: Text visualization

Repeat all the steps from above…

# Load packages (and install them first if necessary)
# install.packages("pacman")
pacman::p_load(tidyverse,
               rvest,
               xml2,
               tidytext,
               tm,
               ggwordcloud)

# Load the data
# data_IRAhandle_tweets_1_sampled.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1GNrZfF3itKxUtbngQhP6lCyOM3ApbgTD"))


data <- read_csv("data/data_IRAhandle_tweets_1_sampled.csv",
                 col_types = cols())


# Add id
data <- data %>%
  mutate(text_id = row_number()) %>%
  select(text_id, everything()) # What happens here?


# Create tidy text format and remove stopwords
data_tidy <- data %>%
  unnest_tokens(word, text, drop = FALSE) %>% 
  anti_join(stop_words) %>%
  select(text_id, author, text, word, everything())

9.1 Wordclouds

set.seed(42)


# Aggregate by word
data_plot <- 
    data_tidy %>% 
    filter(!word %in% c("t.co", "https", "rt", "http")) %>% # ?
    count(word) %>% # ?
    arrange(desc(n)) %>%
    slice(1:20)
    

ggplot(data_plot, aes(label = word, size = n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal()

9.2 Wordclouds across subsets (grouped)

set.seed(42)


# Aggregate by word
data_plot <- data_tidy %>% 
    filter(!word %in% c("t.co", "https", "rt", "http")) %>%
    count(word, account_category) %>% 
    group_by(account_category) %>%
    arrange(desc(n)) %>%
    slice(1:20) %>%
    ungroup()
    
# Wordcloud: Coloring different groups
ggplot(data_plot, aes(label = word, size = n, color = account_category)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal()

# Wordcloud: Faceting different groups
ggplot(data_plot, aes(label = word, size = n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal() +
  facet_wrap(~account_category)

9.3 Barplots (Frequency)

set.seed(42)


# Data for vertical barplot
data_plot <- data_tidy %>% 
  filter(!word %in% c("t.co", "https", "rt", "http")) %>%
  group_by(word) %>% 
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  slice(1:10) %>%
  mutate(word = factor(word, # Convert to factor for ordering
                       levels = as.character(.$word),
                       ordered = TRUE))
    
    
# Create barplot
ggplot(data_plot, aes(x = word, y = n)) +
  geom_bar(stat="identity") +
  theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Data for horizontal barplot
data_plot <- data_plot %>%
  arrange(n) %>%
  mutate(word = factor(word,
                       levels = as.character(.$word),
                       ordered = TRUE))

# Create horizontal barplot
ggplot(data_plot, aes(x = n, y = word)) +
  geom_bar(stat="identity") +
  theme_minimal() +
    theme(axis.text.y = element_text(angle = 45, hjust = 1))

# Data for faceted horizontal barplot
data_plot <- data_tidy %>% 
  filter(!word %in% c("t.co", "https", "rt", "http")) %>%
  group_by(word, account_category) %>% 
  summarize(n = n()) %>% # account_category is retained as a grouping variable
  group_by(account_category) %>%
  arrange(desc(n)) %>%
  slice(1:10) %>%
  ungroup()


# Create horizontal barplot
ggplot(data_plot, aes(x = n, y = word)) +
  geom_bar(stat="identity") +
  theme_minimal() +
    theme(axis.text.y = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~account_category)




10 TM example: Text pre-processing and vectorization

  • \(\rightarrow\) Besides tidytext, consider alternative packages (e.g., tm, quanteda)
  • Example: tm package
    • input: a corpus object rather than a tidy text object
    • What is a corpus in R? \(\rightarrow\) a collection of documents with associated metadata
# Load the data
# data_IRAhandle_tweets_1_sampled.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1GNrZfF3itKxUtbngQhP6lCyOM3ApbgTD"))


data <- read_csv("data/data_IRAhandle_tweets_1_sampled.csv",
                 col_types = cols()) 

data <- data %>%
  mutate(text_id = row_number()) %>%
  select(text_id, everything())
dim(data)
# Clean corpus
corpus_clean <- VCorpus(VectorSource(data$text)) %>%
  tm_map(removePunctuation, preserve_intra_word_dashes = TRUE) %>%
  tm_map(removeNumbers) %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeWords, words = c(stopwords("en"))) %>% 
  tm_map(stripWhitespace) %>% 
  tm_map(stemDocument)

# Check exemplary document
corpus_clean[["1"]][["content"]]
[1] "genflynn realdonaldtrump mikep wit today wis need serious help anti worker anti liberti agenda elit bs httpstcourlfrnfqt"



  • If you pre-processed your data with the tm package, remember that we ended up with a pre-processed corpus object
  • Now, simply apply the DocumentTermMatrix function to this corpus object
# Pass your "clean" corpus object to the DocumentTermMatrix function;
# the control argument includes only words that are at least two characters long
dtm_tm <- DocumentTermMatrix(corpus_clean, control = list(wordLengths = c(2, Inf)))

# Check a subset of the DTM
inspect(dtm_tm[,1:6])
<<DocumentTermMatrix (documents: 4000, terms: 6)>>
Non-/sparse entries: 20/23980
Sparsity           : 100%
Maximal term length: 12
Weighting          : term frequency (tf)
Sample             :
      Terms
Docs   ��� ����� ������ ꭶꮎ� ������������ ꮮꭺꮖꭼ
  1514   1     0      0   0            0    0
  1640   0     1      0   0            0    0
  1953   1     0      0   0            0    0
  2230   1     0      0   0            0    0
  2345   1     0      0   0            0    0
  3509   0     0      0   1            0    1
  3940   2     0      0   0            0    0
  556    1     0      0   0            0    0
  640    0     1      0   0            0    0
  884    0     0      1   0            0    0

Q: How do the terms between the DTM we created with tidytext and the one created with tm differ? Why?
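
As an aside (a sketch that goes beyond the workflow above): tm’s removeSparseTerms() offers a one-step alternative to the manual rare-token filtering we did with tidytext earlier:

# Drop terms that are missing from more than 99% of documents,
# i.e., keep only terms appearing in at least roughly 1% of the tweets
dtm_tm_small <- removeSparseTerms(dtm_tm, sparse = 0.99)
dim(dtm_tm_small)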

References

Roeder, Oliver. 2018. “Why We’re Sharing 3 Million Russian Troll Tweets.” https://fivethirtyeight.com/features/why-were-sharing-3-million-russian-troll-tweets/.
Silge, Julia. 2017. Text Mining with R: A Tidy Approach. First edition. Beijing: O’Reilly Media.
Wilkerson, John, and Andreu Casas. 2017. “Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges.” Annual Review of Political Science 20 (1): 529–44.

Footnotes

  1. Alternatives: bigrams/word pairs, word embeddings↩︎

  2. A document-term matrix (DTM) is a mathematical representation of text data in which rows correspond to documents in the corpus and columns correspond to terms (words or phrases). A document-feature matrix (DFM) is essentially the same; “feature” (the term used by quanteda) is simply a more general label, since columns may also hold n-grams or other features rather than single words.↩︎

  3. Wilkerson and Casas (2017)↩︎