12 Calculating tf-idf Scores with tidytext

Another common analysis of text uses a metric known as ‘tf-idf’, which stands for term frequency-inverse document frequency. Take a corpus containing a number of documents (here we’ll treat each newspaper issue as a separate document). tf-idf scores the words in each document, weighting each word’s frequency within a document against how widely it occurs across the other documents. It’s a measure of the ‘importance’ of a particular word to a given document, and it’s often used in combination with other analyses, for example as an input to a text classifier, or to help rank search results. Though it should be used with caution, it’s also a way of understanding how language was used in newspapers, and how that changed over time.
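In its most common form (and the default in the tidytext package, which we’ll use below), the score for a word in a document is:

tf = (count of the word in the document) ÷ (total words in the document)
idf = ln(total number of documents ÷ number of documents containing the word)
tf-idf = tf × idf

A word which occurs in every document has an idf of ln(1) = 0, so its tf-idf is zero however frequent it is; only words concentrated in a few documents score highly.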

The function bind_tf_idf in the tidytext library takes care of the calculation. To use it, we need a word frequency count for each issue in the dataframe. We’ll make a unique issue code by pasting the title and the full date together into one string, using the function paste0, and save the result as a dataframe named issue_words.
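As a quick illustration of how paste0() behaves (using a made-up title and date), it simply concatenates its arguments with no separator:

paste0("The Morning Post", "1855-09-27")
## [1] "The Morning Post1855-09-27"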

First, load the necessary libraries and the tokenised data we created in the last notebook:

library(tidytext)
library(tidyverse)
library(rmarkdown)
load('tokenised_news_sample')
issue_words = tokenised_news_sample %>% 
  mutate(issue_code = paste0(title, full_date)) %>% # a unique code for each issue
  group_by(issue_code, word) %>% # count each word within each issue
  tally() %>% 
  arrange(desc(n))

Next, use bind_tf_idf() to calculate the tf-idf score for each word in the dataframe:

issue_words %>% bind_tf_idf(word, issue_code, n)
## # A tibble: 6,455,306 × 6
## # Groups:   issue_code [313]
##    issue_code word         n      tf    idf   tf_idf
##    <chr>      <chr>    <int>   <dbl>  <dbl>    <dbl>
##  1 1855-09-27 severely  1676 0.0243  0.0525 0.00127 
##  2 1855-09-27 slightly  1614 0.0234  0.0729 0.00170 
##  3 1855-02-28 feb       1340 0.0155  0.241  0.00373 
##  4 1855-07-10 severely  1329 0.00830 0.0525 0.000435
##  5 1855-07-10 slightly  1100 0.00687 0.0729 0.000501
##  6 1855-10-09 street    1069 0.00707 0      0       
##  7 1855-11-20 street    1004 0.00656 0      0       
##  8 1855-02-08 jan        943 0.0158  0      0       
##  9 1855-02-26 feb        925 0.0149  0.241  0.00359 
## 10 1855-07-17 lord       917 0.00590 0      0       
## # … with 6,455,296 more rows
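Notice that very common words such as ‘street’ get an idf, and therefore a tf-idf, of 0: a zero idf means the word occurs in every one of the 313 issues, since ln(313 ÷ 313) = 0. We can verify this with a quick check (a sketch; it counts the issues in which ‘street’ appears, which should come to 313):

issue_words %>% 
  filter(word == 'street') %>% # one row per issue containing 'street'
  nrow() # the number of issues containing the word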

Now we can sort it in descending order of the tf_idf column, to find the most ‘unusual’ words:

issue_words %>% 
  bind_tf_idf(word, issue_code, n) %>% 
  arrange(desc(tf_idf))
## # A tibble: 6,455,306 × 6
## # Groups:   issue_code [313]
##    issue_code word            n      tf   idf  tf_idf
##    <chr>      <chr>       <int>   <dbl> <dbl>   <dbl>
##  1 1855-12-12 breeder       207 0.00341 3.44  0.0118 
##  2 1855-08-03 garraway's    187 0.00218 3.55  0.00774
##  3 1855-02-26 regt          783 0.0126  0.570 0.00719
##  4 1855-04-06 hopwood       184 0.00307 2.19  0.00672
##  5 1855-02-14 aguado         82 0.00136 4.36  0.00593
##  6 1855-08-03 farebrother   224 0.00261 2.25  0.00587
##  7 1855-12-12 exhibitor     153 0.00252 2.28  0.00575
##  8 1855-02-28 diarr         197 0.00227 2.53  0.00575
##  9 1855-04-05 hopwood       145 0.00254 2.19  0.00557
## 10 1855-06-23 hatfield's     90 0.00100 5.05  0.00507
## # … with 6,455,296 more rows

What does this tell us? Well, unfortunately, most of the ‘unusual’ words by this measure are OCR errors or spelling mistakes. One way to correct for this is to keep only words found in an English-language dictionary. Load the lexicon package and then use the command data(grady_augmented) to load a dictionary of English words and common proper nouns, stored as a character vector:

library(lexicon)

data(grady_augmented)
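grady_augmented is simply a long character vector of words, so it can be inspected with ordinary R functions (exact counts will depend on your version of the lexicon package):

length(grady_augmented) # the number of entries in the dictionary
head(grady_augmented) # the first few words
'tetanus' %in% grady_augmented # is a given word included?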

Get the tf-idf scores again, this time filtering the dataset first to include only words found in grady_augmented:

issue_words %>% 
  filter(word %in% grady_augmented) %>%
  bind_tf_idf(word, issue_code, n) %>% 
  arrange(desc(tf_idf))
## # A tibble: 3,161,323 × 6
## # Groups:   issue_code [313]
##    issue_code word            n       tf   idf  tf_idf
##    <chr>      <chr>       <int>    <dbl> <dbl>   <dbl>
##  1 1855-12-12 breeder       207 0.00461  3.44  0.0159 
##  2 1855-12-12 exhibitor     153 0.00341  2.28  0.00778
##  3 1855-06-04 gardener      171 0.00392  1.51  0.00593
##  4 1855-10-19 nitrogen       43 0.000916 5.05  0.00463
##  5 1855-09-27 dangerously   373 0.00708  0.616 0.00436
##  6 1855-05-07 etty           91 0.00210  1.99  0.00417
##  7 1855-06-13 coinage       114 0.00266  1.53  0.00406
##  8 1855-12-26 outdoor        72 0.00156  2.57  0.00400
##  9 1855-02-08 dysentery     170 0.00374  1.02  0.00381
## 10 1855-09-27 serjeants     178 0.00338  1.07  0.00362
## # … with 3,161,313 more rows

The highest tf-idf score is now for the word ‘breeder’ on 12 December 1855. This means that the word occurred many times in that issue, and rarely in the others. High scores like this might point to particular topics, and in particular to topics which had a very short or specific lifespan.

If we had a bigger dataset, or one arranged in another way, these words might point to linguistic differences between regions, publishers, or writers.

Let’s track down the articles behind a particular word: ‘tetanus’. We can use the function str_detect() (from stringr, loaded as part of the tidyverse) inside filter() to keep only the articles containing a given word. For this we’ll go back to the untokenised dataframe.

load('news_sample_dataframe')
news_sample_dataframe %>% filter(str_detect(text, "tetanus")) 
## # A tibble: 20 × 7
##    article_code art   text                          title year  date  full_date 
##           <int> <chr> <chr>                         <chr> <chr> <chr> <date>    
##  1         5937 0164  "SCIENCE  AND  ART.  IMPROVE… ""    1855  1120  1855-11-20
##  2         6612 0160  "SCIENCE  AND  ART.  THE  BE… ""    1855  1204  1855-12-04
##  3         7466 0097  ",  w  L  ECESItEg  24,  INi… ""    1855  1225  1855-12-25
##  4         7657 0288  "[DECEMBER  WHOLESALE  - STY… ""    1855  1225  1855-12-25
##  5         8889 0059  "THE  SUN,  LONDON,  MONDAY … ""    1855  0108  1855-01-08
##  6         8999 0169  "THE  SUN,  LONDON,  MONDAY … ""    1855  0108  1855-01-08
##  7        10160 0207  "COUNTRY  MARKETS.  1  HADDI… ""    1855  0115  1855-01-15
##  8        21433 0005  "(Flom  the  Journal  des  D… ""    1855  0405  1855-04-05
##  9        21508 0080  "THE  PANAMA  FAIL  WAY.  (F… ""    1855  0405  1855-04-05
## 10        47834 0019  "BREVET.  Lieutenant-General… ""    1855  1003  1855-10-03
## 11        47935 0120  "m  u  c.  Piancforte  C4ria… ""    1855  1003  1855-10-03
## 12        51650 0014  "DR.  KANE'S  ARCTIC  EXPEDI… ""    1855  1027  1855-10-27
## 13        51733 0097  "DR.  KANE'S  ARCTIC  EXPEDI… ""    1855  1027  1855-10-27
## 14        59165 0025  "THE  PACIFIC  MAILS.  SOUTH… ""    1855  1218  1855-12-18
## 15        59238 0098  "SUPPOSED  MURDER  OF  A  SP… ""    1855  1218  1855-12-18
## 16        62602 0039  "DEATHS  AT  SCUTARL  Nomina… ""    1855  0110  1855-01-10
## 17        66839 0016  "THE  WAR.  The  Military  G… ""    1855  0419  1855-04-19
## 18        70060 0039  "If  is  were  postponed  to… ""    1855  0627  1855-06-27
## 19        79825 0019  "Volama  anb  ftntral.  THE … ""    1855  0113  1855-01-13
## 20        83533 0020  "Sad  ant  Volta.  THE  RUGE… ""    1855  1222  1855-12-22
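To read one of these articles in full, we can pull out its text column (a sketch using dplyr’s slice() and pull(); expect the OCR to be messy):

news_sample_dataframe %>% 
  filter(str_detect(text, 'tetanus')) %>% 
  slice(1) %>% # take the first matching article
  pull(text) # extract its full text as a character string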

These clusters of mentions of the word tetanus seem to be related to the trial of William Palmer (https://en.wikipedia.org/wiki/William_Palmer_(murderer)), who was convicted of murdering his friend with strychnine, a poison which produces tetanus-like convulsions.