Chapter 21 Data Wrangling
Since we’ll be working with text data in our tutorial, we will need to wrangle the data so it can be used for supervised machine learning. If you are not working with text data, but instead with binary, categorical, and continuous variables, you can go straight to your modeling (as we did with the logistic regression).
This is going to get a little complicated, so please bear with me!
The first thing we’ll want to do is construct a dataset of tokens from conservative_data (our labeled dataset). Here, we will use tidytext, which makes this a quick and relatively easy process:
library(dplyr)
library(tidytext)

conservative_tokens <- unnest_tokens(conservative_data, word, description) %>% #tokenize
  anti_join(stop_words, by = "word") %>% #remove stopwords
  count(id, word, sort = TRUE) #count the frequency of words used by tweet
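If you want to see what this pipeline produces before running it on the full dataset, here is a minimal sketch on a made-up two-tweet tibble (the id and description values below are invented for illustration):

library(dplyr)
library(tidytext)

toy <- tibble::tibble(
  id = c(1, 2),
  description = c("Lower taxes and a strong economy",
                  "The economy needs lower taxes now")
)

toy %>%
  unnest_tokens(word, description) %>%   #one row per token, lowercased
  anti_join(stop_words, by = "word") %>% #drop common words like "and", "a", "the"
  count(id, word, sort = TRUE)           #how often each remaining word appears per id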
21.1 tf-idf
Next, we’ll cast the data as a document-term matrix. Before we do this, though, we’ll add some more information to our text data: tf-idf.
tf-idf (term frequency-inverse document frequency) is an NLP measure that indicates how unique a word is to a document in a corpus. If the tf-idf score is high, the word appears frequently in that document but not in the other documents. Words that appear frequently in all documents get a tf-idf score of zero (and words that occur only sparingly score low as well, since their term frequency is small). You can learn more about tf-idf in the tidytext textbook.
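If it helps to see the formula, a common formulation (and, as far as I know, the default that tidytext’s bind_tf_idf() computes, using a natural-log idf) is:

$$\mathrm{tf\text{-}idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln\!\left(\frac{N}{N_t}\right)$$

where $n_{t,d}$ is the number of times term $t$ appears in document $d$, $N$ is the total number of documents, and $N_t$ is the number of documents containing $t$. A term appearing in every document has $\ln(N/N_t) = \ln 1 = 0$, which is why such words end up with a tf-idf of zero.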
We use this information in our supervised machine learning model, as opposed to raw counts, because tf-idf carries more information about the importance of a word to a specific document, whereas counts just give you the frequency of a word’s use. To put it another way: using tf-idf, we can actually compare words based on how unique they are to a document.
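A quick sketch with made-up counts illustrates the difference (the words and frequencies below are invented):

library(tidytext)

toy_counts <- tibble::tribble(
  ~id, ~word,         ~n,
  1,   "economy",      5,
  1,   "taxes",        3,
  2,   "economy",      4,
  2,   "immigration",  6
)

bind_tf_idf(toy_counts, word, id, n)
#"economy" appears in both documents, so its idf (and tf_idf) is 0;
#"immigration" scores highest because it is frequent in, and unique to, document 2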
con_dtm <- tidytext::bind_tf_idf(conservative_tokens, word, id, n) %>% #calculate tf-idf
  tidytext::cast_dtm(id, word, tf_idf) #construct a document-term matrix with the tf-idf scores
Now we have wrangled our text data into document-term matrix form! Recall that in natural language processing, this type of representation is called a “bag of words” pre-processing strategy: word order is discarded, and each document is reduced to its (weighted) word counts. With this data, we can proceed with our supervised machine learning analysis.
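Before moving on, it can be worth sanity-checking the result. Since cast_dtm() returns a DocumentTermMatrix from the tm package, the usual tm tools apply (a sketch, assuming con_dtm has at least five documents and five terms):

dim(con_dtm)                   #number of documents x number of terms
tm::inspect(con_dtm[1:5, 1:5]) #peek at a 5 x 5 corner of the sparse matrix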