5.1 Tidying a document-term matrix

A document-term matrix (DTM) is a matrix that describes the frequency of terms occurring in a collection of documents; a term-document matrix is its transpose. In a DTM:

  • each row represents one document

  • each column represents one term (word)

  • each value (typically) contains the number of appearances of that term in that document

Because most documents use only a small share of the vocabulary, most entries are zero, so document-term matrices are usually stored as sparse matrix objects. These objects can be treated as though they were ordinary matrices (for example, accessing particular rows and columns), but are stored in a more efficient format.
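As a minimal sketch of the structure described above, the following builds a small document-term matrix by hand and stores it as a sparse matrix. The three toy documents and all object names are made up for illustration; only the Matrix package is assumed.

```r
library(Matrix)

# Three toy documents (hypothetical data for illustration)
docs <- c(d1 = "the cat sat", d2 = "the dog sat down", d3 = "cat and dog")

# Split each document into words and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Row index = document, column index = term, one entry per token
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(unlist(tokens), vocab)

# sparseMatrix() adds up duplicate (i, j) pairs, which yields term counts
dtm <- sparseMatrix(i = i, j = j, x = 1,
                    dims = c(length(docs), length(vocab)),
                    dimnames = list(names(docs), vocab))

# Index it like an ordinary matrix
dtm["d2", "dog"]

# Share of cells that are zero (the sparsity)
1 - nnzero(dtm) / prod(dim(dtm))
```

Only the non-zero cells are stored, which is what makes the format efficient for large vocabularies.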

tidytext provides ways of converting between a tidy data frame and a document-term matrix:

  • tidy() turns a document-term matrix into a tidy data frame (one-token-per-row)

  • the cast_ family of functions turns a tidy data frame into a matrix. There are three variations of this verb, corresponding to different classes of matrices: cast_sparse() (converting to a sparse matrix from the Matrix package), cast_dtm() (converting to a DocumentTermMatrix object from tm), and cast_dfm() (converting to a dfm object from quanteda)
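The two directions of conversion can be sketched as a round trip. The tidy_counts table below is hypothetical data made up for illustration, and cast_dtm() assumes the tm package is installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical tidy counts: one row per document-term pair
tidy_counts <- tibble(
  document = c("d1", "d1", "d2", "d2", "d3"),
  term     = c("cat", "sat", "dog", "sat", "cat"),
  count    = c(1, 2, 1, 1, 3)
)

# Tidy data frame -> sparse matrix from the Matrix package
m <- cast_sparse(tidy_counts, document, term, count)
dim(m)

# Tidy data frame -> DocumentTermMatrix (needs tm)
dtm <- cast_dtm(tidy_counts, document, term, count)

# DocumentTermMatrix -> tidy data frame again
round_trip <- tidy(dtm)
round_trip
```

cast_dfm() takes the same arguments and would produce a quanteda dfm instead.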

The DocumentTermMatrix class is provided by the tm package. Printing such an object reports its dimensions and sparsity; the DTM examined here is 99% sparse, meaning that 99% of document-word pairs are zero.

Terms() is an accessor function that extracts the full vector of distinct terms.

Calling tidy() on the DTM returns a tidy data frame with one row per document-term pair; only the non-zero values are included.
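The steps above can be sketched as follows. The notes do not name the dataset, so this assumes the AssociatedPress DocumentTermMatrix that ships with the topicmodels package; the names terms and ap_td are chosen for illustration.

```r
library(tm)
library(tidytext)

# Assumption: the AssociatedPress DTM bundled with the topicmodels package
data("AssociatedPress", package = "topicmodels")

# Printing a DocumentTermMatrix reports documents, terms, and sparsity
AssociatedPress

# Terms() returns the character vector of distinct terms
terms <- Terms(AssociatedPress)
head(terms)

# tidy() gives one row per non-zero document-term pair
ap_td <- tidy(AssociatedPress)
ap_td
```

The tidied result has the columns document, term, and count, so it can be passed straight to dplyr or ggplot2.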

quanteda uses the dfm (document-feature matrix) as its common data structure for text data. For example, the quanteda package comes with a corpus of presidential inauguration speeches, which can be converted to a dfm with the dfm() function (recent versions of quanteda expect the corpus to be tokenized with tokens() first).

As with a DocumentTermMatrix, tidy() converts the dfm into a tidy data frame.
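A short sketch of both steps, assuming a recent quanteda in which the corpus is tokenized before dfm() is called. The names inaug_dfm and inaug_td are chosen for illustration.

```r
library(quanteda)
library(tidytext)

# data_corpus_inaugural ships with quanteda;
# tokenize first, then build the document-feature matrix
inaug_dfm <- dfm(tokens(data_corpus_inaugural))
inaug_dfm

# One row per non-zero document-term pair
inaug_td <- tidy(inaug_dfm)
inaug_td
```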

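Suppose we would like to see how the usage of some user-specified words changes over time. We start by applying tidyr's complete() to the data frame, so that words absent from a speech appear with a count of zero, and then compute the total number of words per speech.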
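One way this could look is sketched below. It assumes the document names in data_corpus_inaugural begin with the year (for example "1789-Washington"), so that each year identifies one speech; the four words at the end are an arbitrary illustrative choice.

```r
library(dplyr)
library(tidyr)
library(quanteda)
library(tidytext)

# Rebuild the tidy inaugural data so this sketch runs on its own
inaug_td <- tidy(dfm(tokens(data_corpus_inaugural)))

year_term_counts <- inaug_td %>%
  # Assumption: document names start with the year, e.g. "1789-Washington"
  tidyr::extract(document, "year", "(\\d+)", convert = TRUE) %>%
  # Add explicit zero counts for words absent from a speech
  complete(year, term, fill = list(count = 0)) %>%
  # Total words per speech (one speech per year in this corpus)
  group_by(year) %>%
  mutate(year_total = sum(count)) %>%
  ungroup()

# Hypothetical choice of words to follow over time
year_term_counts %>%
  filter(term %in% c("god", "america", "union", "freedom")) %>%
  mutate(share = count / year_total)
```

Without complete(), a word that never occurs in a given speech would simply have no row for that year, and a plot over time would skip those years instead of showing zero.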