Chapter 21 Data Wrangling

Since we’ll be working with text data in this tutorial, we will need to wrangle the data so it can be used for supervised machine learning. If you are not working with text data, but instead with binary, categorical, and continuous variables, you can go straight to your modeling (as we did with the logistic regression).

This is going to get a little complicated, so please bear with me!

The first thing we’ll want to do is construct a dataset of tokens from conservative_data (our labeled dataset). Here, we will use tidytext, which makes this a quick and relatively easy process.

library(dplyr)    #for anti_join() and count()
library(tidytext) #for unnest_tokens() and the stop_words lexicon

conservative_tokens <- unnest_tokens(conservative_data, word, description) %>% #tokenize
  anti_join(stop_words, by = "word") %>% #remove stopwords
  count(id, word, sort = TRUE) #count the frequency of words used by tweet
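
At this point, conservative_tokens should be a tidy data frame with one row per tweet-word pair. A minimal sketch of how you might check this (the id, word, and n columns come from the pipeline above):

head(conservative_tokens) #expect one row per id-word pair, with n as the word's count in that tweet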

21.1 tf-idf

Next, we’ll cast the data as a document-term matrix. Before we do this, though, we’ll add some more information to our text data: tf-idf scores.

tf-idf (term frequency-inverse document frequency) is an NLP measure that indicates how unique a word is to a document in a corpus. If a word’s tf-idf score is high, it means that the word appears frequently in one document, but not in the other documents. Words that appear in every document receive a tf-idf score of zero, and words that occur only sparingly in a document score low as well. You can learn more about tf-idf in the tidytext textbook.
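
To make the measure concrete, here is a minimal sketch of the underlying arithmetic on a made-up three-document corpus (the toy data and words are hypothetical; bind_tf_idf(), which we use below, does this calculation for you):

library(dplyr)
library(tidytext)

#a hypothetical corpus: word counts for three tiny "documents"
toy <- tibble::tribble(
  ~doc, ~word,    ~n,
  1,    "tax",     3, #frequent in doc 1, absent elsewhere
  1,    "the",     5,
  2,    "the",     4, #"the" appears in every document
  3,    "the",     2,
  3,    "budget",  1
)

toy %>% bind_tf_idf(word, doc, n)
#"the" appears in all 3 documents, so idf = ln(3/3) = 0 and its tf-idf is 0;
#"tax" appears in only 1 of 3, so idf = ln(3/1) ~ 1.10, giving it a high tf-idf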

We use this information in a supervised machine learning model, as opposed to raw counts, because tf-idf contains more information about the importance of a word to a specific document, whereas counts just give you the frequency of a word’s use. To put it another way: with tf-idf, we can actually compare words based on how unique they are to a document.

con_dtm <- tidytext::bind_tf_idf(conservative_tokens, word, id, n) %>% #calculate tf-idf
  tidytext::cast_dtm(id, word, tf_idf) #construct a document-term matrix with the tf-idf scores
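
As a quick sanity check, you can inspect the resulting object; this is a sketch assuming con_dtm was built as above (the exact dimensions will depend on your data):

con_dtm      #printing a DocumentTermMatrix reports its dimensions and sparsity
dim(con_dtm) #rows are documents (tweets), columns are unique words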

Now we have wrangled our text data into document-term matrix form! In natural language processing, recall that this type of wrangling is called a “bag of words” pre-processing strategy: we keep track of which words appear in each document, but discard word order. With this data, we can proceed with our supervised machine learning analysis.