6.2 Example: the great library heist

To evaluate our topic model, we first divided 4 books into chapters, treating each chapter as a separate document. If a topic model with \(K = 4\) performs well, the chapters should segregate into 4 topics corresponding to the 4 different books.
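As a rough sketch of this setup (assuming the books are fetched with gutenbergr and preprocessed with the usual tidytext workflow; the titles, chapter regex, and seed below are illustrative assumptions, not necessarily the exact choices used here):

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)
library(topicmodels)
library(gutenbergr)

# download the 4 books (titles assumed for illustration)
titles <- c("Twenty Thousand Leagues under the Sea",
            "The War of the Worlds",
            "Pride and Prejudice",
            "Great Expectations")
books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title")

# treat each chapter as a separate "document", named like "Great Expectations_57"
by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)

# tokenize, drop stop words, and count word frequencies per chapter
word_counts <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(document, word, sort = TRUE)

# cast to a document-term matrix and fit a 4-topic LDA model
chapters_dtm <- word_counts %>%
  cast_dtm(document, word, n)
chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
```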

6.2.2 Per-document classification

We may want to know which topics are associated with each document. In particular, the majority of chapters in the same book should belong to the same topic, if we assign chapter \(m\) to topic \(k\) whenever the \(k\)th entry of \(\hat{\theta}_m\) is markedly higher than the rest.
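A minimal sketch of extracting \(\hat{\theta}_m\), assuming the chapters_lda model from the setup above; tidytext's tidiers expose the per-document topic proportions as the gamma matrix:

```r
# one row per (document, topic) pair; gamma is the estimated proportion
chapters_gamma <- tidy(chapters_lda, matrix = "gamma") %>%
  # recover the book title and chapter number from the "title_chapter" name
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)
```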

Ideally, in every book's panel one boxplot would be concentrated near 1 with the other 3 boxes near 0, since chapters from the same book should be categorized into the same topic.
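A sketch of those boxplots, assuming the chapters_gamma table from above:

```r
library(ggplot2)

# one panel per book, one box per topic
chapters_gamma %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ title) +
  labs(x = "topic", y = "chapter-topic proportion")
```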

Another way of visualizing this is to plot a histogram of chapter-topic proportions for each topic. We would expect to see two extremes: most proportions piled up near 0 or near 1.
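Under the same assumptions, the histograms could be drawn like this:

```r
# distribution of chapter-topic proportions, one panel per topic
chapters_gamma %>%
  ggplot(aes(gamma)) +
  geom_histogram(bins = 50) +
  facet_wrap(~ topic)
```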

It does look like some chapters from Twenty Thousand Leagues under the Sea were somewhat associated with topic 3, whereas most of its chapters are assigned to topic 2. Let's investigate.

Which chapters have a relatively high proportion of topic 3?
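One way to answer this, assuming chapters_gamma as above:

```r
# chapters of this book, sorted by their topic-3 proportion
chapters_gamma %>%
  filter(title == "Twenty Thousand Leagues under the Sea", topic == 3) %>%
  arrange(desc(gamma))
```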

As we see here, topic modeling can be viewed as text classification to some degree. We can find the topic that is most associated with each chapter using top_n(), which is essentially the "classification" of that chapter. For example, the 57th chapter of Great Expectations is assigned to topic 1.
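A sketch of that step; the object name classification follows how the results are referred to later in this section:

```r
# keep, for each chapter, the single topic with the highest gamma
classification <- chapters_gamma %>%
  group_by(title, chapter) %>%
  top_n(1, gamma) %>%
  ungroup()
```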

We can then compare each chapter's assigned topic to the "consensus" topic of its book (the most common topic among its chapters) and see whether any chapter is misidentified.
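A sketch of that comparison, assuming the classification table from above:

```r
# the "consensus" topic of a book: the most common topic among its chapters
book_topics <- classification %>%
  count(title, topic) %>%
  group_by(title) %>%
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = title, topic)

# chapters whose assigned topic disagrees with their book's consensus
classification %>%
  inner_join(book_topics, by = "topic") %>%
  filter(title != consensus)
```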

Across all 4 books, not a single chapter is misassigned to another topic!

For future use, the classification results are stored in classification.

6.2.3 By word assignments: augment()

One step of the LDA algorithm is assigning each word \(n\) in each document \(m\) to a topic \(z_{m, n}\). Generally, the more words in a document that are assigned to a given topic, the more weight \(\hat{\theta}_m\) places on that topic for the document.

We may want to take the original document-word pairs and find which words in each document were assigned to which topic. This is the job of the augment() function, which adds information to each observation in the original data.
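A minimal sketch, assuming the chapters_lda model and the chapters_dtm document-term matrix from the setup:

```r
# one row per (document, term); .topic is the topic each word was assigned to
assignments <- augment(chapters_lda, data = chapters_dtm) %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)
```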

To get a sense of how our model works, we can draw a bar plot of the topics assigned to words in each book.
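For example, counting assigned words by topic within each book (a sketch, assuming the assignments table above):

```r
# number of words assigned to each topic, faceted by book
assignments %>%
  count(title, .topic, wt = count) %>%
  ggplot(aes(factor(.topic), n)) +
  geom_col() +
  facet_wrap(~ title) +
  labs(x = "topic", y = "words assigned")
```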

We can combine this assignments table with the consensus classification to find which words were incorrectly classified, using a confusion matrix.
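A sketch of that confusion matrix, joining each assigned topic to its consensus book (book_topics from earlier):

```r
# attach the consensus book implied by each word's assigned topic
assignments_checked <- assignments %>%
  inner_join(book_topics, by = c(".topic" = "topic"))

# confusion matrix: true book (rows) vs. book of the assigned topic (columns)
assignments_checked %>%
  count(title, consensus, wt = count) %>%
  spread(consensus, n, fill = 0)
```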

What were the most commonly mistaken words?
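Assuming assignments_checked from above, the misassigned words can be pulled out and tallied:

```r
# words whose assigned topic's book differs from the book they came from
wrong_words <- assignments_checked %>%
  filter(title != consensus)

wrong_words %>%
  count(title, consensus, term, wt = count) %>%
  arrange(desc(n))
```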

We can see that a number of words were often assigned to the Pride and Prejudice or The War of the Worlds clusters even when they appeared in Great Expectations or Twenty Thousand Leagues under the Sea. For some of these words, such as "Jane", it comes as no surprise that they are assigned to Pride and Prejudice.

It is possible for a word to be assigned to a book even though it never appears in that book.
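One way to check such a case ("flopson" is only an illustrative word here; substitute any term from wrong_words):

```r
# see which chapters, and hence which books, a word actually occurs in
word_counts %>%
  filter(word == "flopson")
```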