Chapter 21 Data Wrangling
Since we’ll be working with text data in this tutorial, we will need to wrangle the data so it can be used for supervised machine learning. If you are not working with text data, but instead with binary, categorical, and continuous variables, you can go straight to your modeling (as we did with the logistic regression).
This is going to get a little complicated, so please bear with me!
The first thing we’ll want to do is construct a dataset of tokens from conservative_data (our labeled dataset). Here, we will use tidytext, which makes this a quick and relatively easy process.
conservative_tokens <- unnest_tokens(conservative_data, word, description) %>% #tokenize
anti_join(stop_words, by = "word") %>% #remove stopwords
count(id, word, sort = TRUE) #count the frequency of words used by tweet

21.1 tf-idf
Next, we’ll cast the data as a document-term matrix. Before we do, though, we’ll add some more information to our text data: the tf-idf score.
tf-idf is an NLP measure that indicates how unique a word is to a document in a corpus. If the tf-idf score is high, that word appears frequently in a given document but not in the other documents. Words that appear frequently in all documents have a tf-idf score at or near zero, as do words that occur only rarely. You can learn more about tf-idf in the tidytext textbook.
We use this information in a supervised machine learning model, as opposed to raw counts, because tf-idf contains more information about the importance of a word to a specific document, whereas counts just give you the frequency of a word’s use. To put it another way: tf-idf lets us compare words based on how unique they are to a document.
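To see this behavior concretely, here is a small sketch on a hypothetical three-tweet corpus (the toy_tweets data below is invented for illustration; it is not part of the chapter’s dataset). bind_tf_idf() takes a tidy table of term counts plus the term, document, and count columns:

```r
library(dplyr)
library(tidytext)

# hypothetical per-tweet word counts: (document id, word, count)
toy_tweets <- tibble(
  id   = c(1, 1, 2, 2, 3, 3),
  word = c("freedom", "taxes", "freedom", "economy", "freedom", "border"),
  n    = c(2, 1, 1, 1, 1, 3)
)

bind_tf_idf(toy_tweets, word, id, n)
# "freedom" appears in all three documents, so its idf is log(3/3) = 0
# and its tf-idf is 0; "border" appears only in tweet 3, so it gets the
# highest tf-idf score in the toy corpus
```

Note how a word that shows up everywhere is zeroed out even though its raw count is high, which is exactly why tf-idf is more informative than counts here.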
con_dtm <- tidytext::bind_tf_idf(conservative_tokens, word, id, n) %>% #calculate tf-idf
tidytext::cast_dtm(id, word, tf_idf) #construct a document-term matrix with the tf-idf scores

Now we have wrangled our text data into document-term matrix form! Recall that in natural language processing, this type of representation is called a “bag of words” pre-processing strategy. With this data, we can proceed with our supervised machine learning analysis.
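Before modeling, it can be worth a quick sanity check on the matrix you just built. A minimal sketch, assuming the con_dtm object from the pipeline above (cast_dtm() returns a DocumentTermMatrix from the tm package, so tm must be installed; your exact dimensions will differ):

```r
# rows are documents (tweets), columns are unique words
dim(con_dtm)

# peek at a small corner of the matrix; most cells will be zero,
# which is typical of a sparse bag-of-words representation
tm::inspect(con_dtm[1:5, 1:5])
```

If the number of rows does not match the number of labeled tweets you expect, that is usually a sign that some tweets lost all their tokens during stopword removal.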