3.3 A corpus of physics texts

Let’s work with another corpus of documents, to see what terms are important in a different set of works.

Count words as usual
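A minimal sketch of the counting step, assuming the corpus is held in a data frame with one row per line of text. The `physics_toy` data frame and its sentences are made-up stand-ins, not the actual books; `unnest_tokens()` comes from tidytext and `count()` from dplyr.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-in for the physics corpus (not the real texts)
physics_toy <- tibble(
  author = c("Einstein", "Einstein", "Huygens"),
  text = c(
    "The system of co-ordinates moves relative to the body.",
    "The body is at rest in the system.",
    "The ray AB meets the circle and the ray is refracted."
  )
)

# Tokenize into lowercase words, then count occurrences per author
physics_words <- physics_toy %>%
  unnest_tokens(word, text) %>%
  count(author, word, sort = TRUE)

physics_words
```

Because `unnest_tokens()` lowercases by default, "The" and "the" are counted together.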

Visualize the words with the highest log odds (log odds is particularly useful for comparing writing styles)

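Why are there _k and _x in Einstein’s text? A likely explanation is that the Project Gutenberg plain-text source marks italics with underscores, so italicized symbols such as K and x keep an underscore when tokenized.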

Some cleaning up of the text may be in order. Also notice that there are separate “co” and “ordinate” items in the high tf-idf words for the Einstein text; the unnest_tokens() function splits on punctuation such as hyphens by default. Notice that the tf-idf scores for “co” and “ordinate” are nearly the same, as expected for two halves of one word that almost always occur together.
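The hyphen behaviour can be reproduced on a toy example. This is a sketch under stated assumptions: the two sentences are invented for illustration, and `bind_tf_idf()` is the tidytext helper that adds `tf`, `idf`, and `tf_idf` columns to a table of counts.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Invented two-document corpus; only the first contains "co-ordinate"
toy <- tibble(
  author = c("Einstein", "Huygens"),
  text = c(
    "a co-ordinate system and a second co-ordinate system",
    "a ray of light and a wave of light"
  )
)

# Tokenize, count per author, then compute tf-idf per (author, word)
toy_tf_idf <- toy %>%
  unnest_tokens(word, text) %>%
  count(author, word) %>%
  bind_tf_idf(word, author, n)

# "co" and "ordinate" appear as separate tokens with matching scores
split_rows <- toy_tf_idf %>%
  filter(word %in% c("co", "ordinate"))

split_rows
```

In this toy corpus the two halves have identical counts, so their tf-idf values match exactly; in the real text the scores differ slightly whenever one half also occurs on its own.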

“AB”, “RC”, and similar tokens are names of rays, circles, angles, and so forth in Huygens’s text.
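One possible way to clean up such tokens is a custom stop word list removed with `anti_join()`. The sketch below assumes a table of word counts with a `word` column; the counts in `physics_words` and the entries of `my_stopwords` are hypothetical and would need to be tuned by inspecting the real output.

```r
library(dplyr)
library(tibble)

# Hypothetical word counts standing in for the real corpus
physics_words <- tibble(
  author = c("Einstein", "Einstein", "Einstein", "Huygens", "Huygens", "Huygens"),
  word   = c("_k", "co", "relativity", "ab", "rc", "refraction"),
  n      = c(3, 5, 4, 6, 2, 7)
)

# Illustrative custom stop words: markup leftovers, a hyphen fragment,
# and geometric labels (all lowercase, to match the tokenized text)
my_stopwords <- tibble(word = c("_k", "_x", "co", "ab", "rc"))

# Keep only the rows whose word is not in the custom list
physics_clean <- physics_words %>%
  anti_join(my_stopwords, by = "word")

physics_clean
```

Dropping “co” this way leaves “ordinate” in place; an alternative would be to rewrite “co-ordinate” as one word before tokenizing.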