4.2 Counting and correlating pairs of words with widyr

Tokenizing by n-gram is a useful way to explore pairs of adjacent words. However, we may also be interested in words that tend to co-occur within particular documents or particular chapters, even if they don’t occur next to each other.

For this reason, it is sometimes necessary to “cast” a tidy dataset into a wide matrix (such as a co-occurrence matrix), perform an operation such as a correlation on it, then re-tidy the result. This is where widyr comes to the rescue: it casts the data into a wide matrix, performs the operation, and re-tidies the result internally, so we work with tidy input and tidy output throughout (the workflow is illustrated in the book).

4.2.1 Counting and correlating among sections

We divide the book “Pride and Prejudice” into 10-line sections, as we did with 80-line sections in Section 2.2. We may be interested in which words tend to appear within the same section.
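A sketch of building these sections, following the tokenizing approach from earlier chapters (assuming the janeaustenr package as the text source and the stop_words lexicon from tidytext):

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

austen_section_words <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  mutate(section = row_number() %/% 10) %>%   # assign each line to a 10-line section
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word)
```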

widyr::pairwise_count() counts the number of times each pair of items appears together within a group defined by “feature”. In this case, it counts the number of times each pair of words appears together within a section. Note that it still returns a tidy data frame, although the underlying computation took place in matrix form.
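A minimal example, continuing from the austen_section_words sketch above:

```r
library(widyr)

# count pairs of words co-occurring within the same section
word_pairs <- austen_section_words %>%
  pairwise_count(word, section, sort = TRUE)

word_pairs
```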

We can easily find the words that most often occur with “Darcy”. Since pairwise_count() records the counts of both (word_A, word_B) and (word_B, word_A), it does not matter whether we filter on item1 or item2.
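For example:

```r
word_pairs %>%
  filter(item1 == "darcy")
```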

4.2.2 Pairwise correlation

Pairs like “Elizabeth” and “Darcy” are the most common co-occurring words, but that’s not particularly meaningful since they’re also the most common individual words. We may instead want to examine correlation among words, which indicates how often they appear together relative to how often they appear separately.

In particular, we compute the \(\phi\) coefficient. Introduced by Karl Pearson, this measure is similar to the Pearson correlation coefficient in its interpretation. In fact, a Pearson correlation coefficient estimated for two binary variables returns the \(\phi\) coefficient. The \(\phi\) coefficient is related to the chi-squared statistic for a 2 × 2 contingency table:

\[ \phi = \sqrt{\frac{\chi^2}{n}} \]

where \(n\) denotes sample size. In the case of pairwise counts, \(\phi\) is calculated by

\[ \phi = \frac{n_{11}n_{00} - n_{10}n_{01}}{\sqrt{n_{1·}n_{0·}n_{·1}n_{·0}}} \]

Here \(n_{11}\) is the number of sections in which both words appear, \(n_{00}\) the number in which neither appears, and \(n_{10}\) and \(n_{01}\) the numbers in which one word appears without the other; \(n_{1\cdot}\), \(n_{0\cdot}\), \(n_{\cdot 1}\), and \(n_{\cdot 0}\) are the corresponding row and column totals. We see from this equation that \(\phi\) is “standardized” by the individual counts, so word pairs with different individual frequencies can be compared to each other.
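As a quick sanity check (a toy example, not data from the novel), \(\phi\) computed from the 2 × 2 counts matches cor() applied to binary presence indicators:

```r
# toy check: phi equals the Pearson correlation of binary indicators
x <- c(1, 1, 0, 0, 1, 0, 1, 0, 0, 0)  # is word X present in each section?
y <- c(1, 0, 0, 0, 1, 0, 1, 1, 0, 0)  # is word Y present?

n11 <- sum(x == 1 & y == 1); n00 <- sum(x == 0 & y == 0)
n10 <- sum(x == 1 & y == 0); n01 <- sum(x == 0 & y == 1)

phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

all.equal(phi, cor(x, y))  # TRUE
```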

The computation of \(\phi\) is done simply by pairwise_cor() (other correlation coefficients can be chosen via its method argument). The procedure can be somewhat computationally expensive, so we first filter out uncommon words.
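A sketch of this step (the minimum count of 20 is an illustrative cutoff):

```r
# keep words that appear at least 20 times, then compute pairwise phi
word_cors <- austen_section_words %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE)

word_cors
```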

Which word is most correlated with “lady”?
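For example:

```r
word_cors %>%
  filter(item1 == "lady")
```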

This lets us pick particular words of interest and find the other words most associated with them.
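One way to visualize this for a handful of words (the four words chosen here are purely illustrative):

```r
library(ggplot2)

word_cors %>%
  filter(item1 %in% c("elizabeth", "pounds", "married", "pride")) %>%
  group_by(item1) %>%
  slice_max(correlation, n = 6) %>%   # top 6 correlated words per term
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_col() +
  facet_wrap(~ item1, scales = "free") +
  coord_flip()
```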

How about a network visualization to see the overall correlation pattern?
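A sketch using the igraph and ggraph packages, with an arbitrary correlation cutoff of 0.15 to keep the graph readable:

```r
library(igraph)
library(ggraph)

set.seed(2016)

word_cors %>%
  filter(correlation > .15) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
```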

Note that unlike the bigram analysis, the relationships here are symmetrical rather than directional (there are no arrows). Also, while pairings of names and titles that dominated bigram pairings are still common, such as “colonel/fitzwilliam”, we can also see pairings of words that appear close to each other, such as “walk” and “park”, or “dance” and “ball”.