12 Calculating tf-idf Scores with Tidytext
Another common analysis of text uses a metric known as tf-idf, which stands for term frequency-inverse document frequency. Take a corpus containing a number of documents (here we'll treat each newspaper issue as a document). Tf-idf scores the words in each document, normalised by how often they are found in the other documents. It is a measure of the 'importance' of a particular word to a given document, and it is often used in combination with other analyses, for example as an input to a text classifier, or to help rank search results. Though it should be used with caution, it is also a way of understanding how language was used in newspapers, and how that changed over time.
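Before using the tidytext implementation, it can help to see the arithmetic. The sketch below computes tf-idf by hand on three toy documents, using the same definitions tidytext's bind_tf_idf() uses (term frequency as a proportion of the document, and idf as the natural log of the number of documents divided by the number containing the term); the document names and words are made up for illustration:

```r
# Three toy documents; 'peace' appears in every one, 'war' in only two
docs <- list(
  doc1 = c("war", "war", "peace"),
  doc2 = c("war", "trade", "peace"),
  doc3 = c("trade", "trade", "peace")
)

tf_idf <- function(term, doc_id, docs) {
  # term frequency: share of the document made up of this term
  tf  <- sum(docs[[doc_id]] == term) / length(docs[[doc_id]])
  # inverse document frequency: ln(total docs / docs containing the term)
  idf <- log(length(docs) / sum(sapply(docs, function(d) term %in% d)))
  tf * idf
}

tf_idf("war", "doc1", docs)    # (2/3) * ln(3/2), roughly 0.27
tf_idf("peace", "doc1", docs)  # 0: 'peace' is in every document, so idf = ln(1) = 0
```

A word found in every document scores zero however often it occurs, which is exactly why tf-idf surfaces 'unusual' rather than merely frequent words.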
The bind_tf_idf() function in the tidytext library takes care of all this. First we need a frequency count for each issue in the dataframe. We'll make a unique issue code by pasting the title and the date together into one string, using the function paste0(), and save the result as a dataframe named issue_words.
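paste0() simply concatenates its arguments with no separator, which is all we need for a unique code; a quick illustration with a made-up title and date:

```r
# Hypothetical title and date, just to show what an issue code looks like
paste0("The Express", "1855-01-08")
# "The Express1855-01-08"
```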
First load the necessary libraries and tokenised data we created in the last notebook:
library(tidytext)
library(tidyverse)
library(rmarkdown)
##
## Attaching package: 'rmarkdown'
## The following object is masked from 'package:future':
##
## run
load('tokenised_news_sample')
issue_words = tokenised_news_sample %>%
  mutate(issue_code = paste0(title, full_date)) %>%
  group_by(issue_code, word) %>%
  tally() %>%
  arrange(desc(n))
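The group_by() plus tally() step just counts rows per issue-and-word pair; a minimal sketch with toy data standing in for the tokenised dataframe:

```r
library(dplyr)

# Toy stand-in for the tokenised dataframe: one row per word occurrence
toy <- tibble::tibble(
  issue_code = c("a", "a", "a", "b"),
  word       = c("war", "war", "peace", "war")
)

toy %>%
  group_by(issue_code, word) %>%
  tally() %>%        # counts rows per issue_code/word pair, in a column named n
  arrange(desc(n))   # most frequent pair first: issue "a", word "war", n = 2
```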
Next, use bind_tf_idf() to calculate the measurement for each word in the dataframe:
issue_words %>% bind_tf_idf(word, issue_code, n)
## # A tibble: 6,455,306 × 6
## # Groups: issue_code [313]
## issue_code word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 1855-09-27 severely 1676 0.0243 0.0525 0.00127
## 2 1855-09-27 slightly 1614 0.0234 0.0729 0.00170
## 3 1855-02-28 feb 1340 0.0155 0.241 0.00373
## 4 1855-07-10 severely 1329 0.00830 0.0525 0.000435
## 5 1855-07-10 slightly 1100 0.00687 0.0729 0.000501
## 6 1855-10-09 street 1069 0.00707 0 0
## 7 1855-11-20 street 1004 0.00656 0 0
## 8 1855-02-08 jan 943 0.0158 0 0
## 9 1855-02-26 feb 925 0.0149 0.241 0.00359
## 10 1855-07-17 lord 917 0.00590 0 0
## # … with 6,455,296 more rows
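Notice the zeros in the idf column: words like 'street' and 'lord' appear in every one of the 313 issues, so their idf is ln(313/313) = 0, and their tf-idf is zero no matter how frequent they are. You can check how many issues a word appears in with n_distinct(); a sketch using toy data in place of issue_words:

```r
library(dplyr)

# Toy stand-in for issue_words: 'street' appears in both issues, 'breeder' in one
toy <- tibble::tribble(
  ~issue_code,  ~word,     ~n,
  "1855-01-01", "street",  10,
  "1855-01-02", "street",   8,
  "1855-01-01", "breeder",  5
)

toy %>%
  group_by(word) %>%
  summarise(n_issues = n_distinct(issue_code))
# 'street' appears in 2 issues, 'breeder' in 1
```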
Now we can sort it in descending order of tf-idf, to find the most 'unusual' words:
issue_words %>%
  bind_tf_idf(word, issue_code, n) %>%
  arrange(desc(tf_idf))
## # A tibble: 6,455,306 × 6
## # Groups: issue_code [313]
## issue_code word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 1855-12-12 breeder 207 0.00341 3.44 0.0118
## 2 1855-08-03 garraway's 187 0.00218 3.55 0.00774
## 3 1855-02-26 regt 783 0.0126 0.570 0.00719
## 4 1855-04-06 hopwood 184 0.00307 2.19 0.00672
## 5 1855-02-14 aguado 82 0.00136 4.36 0.00593
## 6 1855-08-03 farebrother 224 0.00261 2.25 0.00587
## 7 1855-12-12 exhibitor 153 0.00252 2.28 0.00575
## 8 1855-02-28 diarr 197 0.00227 2.53 0.00575
## 9 1855-04-05 hopwood 145 0.00254 2.19 0.00557
## 10 1855-06-23 hatfield's 90 0.00100 5.05 0.00507
## # … with 6,455,296 more rows
What does this tell us? Well, unfortunately, most of the 'unusual' words by this measure are OCR errors or spelling mistakes. One way to correct for this is to include only words found in an English-language dictionary. Load the lexicon package and then run the command data(grady_augmented) to get a dictionary of English-language words and common proper nouns, as a character vector:
library(lexicon)
data(grady_augmented)
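The dictionary filter in the next step relies on the vectorised %in% operator; a minimal sketch with a made-up mini-dictionary and some imitation OCR errors:

```r
dictionary <- c("street", "market", "tetanus")       # made-up mini-dictionary
words <- c("street", "stret", "tetanus", "tetanns")  # 'stret' and 'tetanns' imitate OCR errors

# %in% returns TRUE for each word found in the dictionary,
# so subsetting with it drops the garbled forms
words[words %in% dictionary]
# "street" "tetanus"
```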
Get the tf-idf scores again, this time filtering the dataset to include only words found in grady_augmented:
issue_words %>%
  filter(word %in% grady_augmented) %>%
  bind_tf_idf(word, issue_code, n) %>%
  arrange(desc(tf_idf))
## # A tibble: 3,161,323 × 6
## # Groups: issue_code [313]
## issue_code word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 1855-12-12 breeder 207 0.00461 3.44 0.0159
## 2 1855-12-12 exhibitor 153 0.00341 2.28 0.00778
## 3 1855-06-04 gardener 171 0.00392 1.51 0.00593
## 4 1855-10-19 nitrogen 43 0.000916 5.05 0.00463
## 5 1855-09-27 dangerously 373 0.00708 0.616 0.00436
## 6 1855-05-07 etty 91 0.00210 1.99 0.00417
## 7 1855-06-13 coinage 114 0.00266 1.53 0.00406
## 8 1855-12-26 outdoor 72 0.00156 2.57 0.00400
## 9 1855-02-08 dysentery 170 0.00374 1.02 0.00381
## 10 1855-09-27 serjeants 178 0.00338 1.07 0.00362
## # … with 3,161,313 more rows
A word with a high tf-idf score, such as 'tetanus' in this dataset, occurred lots of times in one issue and not very often in the others. This might point to particular topics, and it might, in particular, point to topics which had a very short or specific lifespan.
If we had a bigger dataset, or one arranged in another way, these words might point to linguistic differences between regions, publishers, or writers.
Let's find the tetanus articles. We can use the function str_detect() inside filter() to keep just the articles containing a given word. For this we'll go back to the untokenised dataframe.
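A minimal sketch of this kind of filter, using a made-up miniature of the articles dataframe (note that str_detect() matches substrings, so "tetanus" would also match any longer word containing it):

```r
library(dplyr)
library(stringr)

# Made-up miniature of the articles dataframe
articles <- tibble::tibble(
  article_code = 1:3,
  text = c("a case of tetanus was reported",
           "the corn market was steady",
           "lockjaw, or tetanus, in horses")
)

# keep only the rows whose text contains the search term
articles %>% filter(str_detect(text, "tetanus"))
# keeps rows 1 and 3
```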
load('news_sample_dataframe')
news_sample_dataframe %>% filter(str_detect(text, "tetanus"))
## # A tibble: 20 × 7
## article_code art text title year date full_date
## <int> <chr> <chr> <chr> <chr> <chr> <date>
## 1 5937 0164 "SCIENCE AND ART. IMPROVE… "" 1855 1120 1855-11-20
## 2 6612 0160 "SCIENCE AND ART. THE BE… "" 1855 1204 1855-12-04
## 3 7466 0097 ", w L ECESItEg 24, INi… "" 1855 1225 1855-12-25
## 4 7657 0288 "[DECEMBER WHOLESALE - STY… "" 1855 1225 1855-12-25
## 5 8889 0059 "THE SUN, LONDON, MONDAY … "" 1855 0108 1855-01-08
## 6 8999 0169 "THE SUN, LONDON, MONDAY … "" 1855 0108 1855-01-08
## 7 10160 0207 "COUNTRY MARKETS. 1 HADDI… "" 1855 0115 1855-01-15
## 8 21433 0005 "(Flom the Journal des D… "" 1855 0405 1855-04-05
## 9 21508 0080 "THE PANAMA FAIL WAY. (F… "" 1855 0405 1855-04-05
## 10 47834 0019 "BREVET. Lieutenant-General… "" 1855 1003 1855-10-03
## 11 47935 0120 "m u c. Piancforte C4ria… "" 1855 1003 1855-10-03
## 12 51650 0014 "DR. KANE'S ARCTIC EXPEDI… "" 1855 1027 1855-10-27
## 13 51733 0097 "DR. KANE'S ARCTIC EXPEDI… "" 1855 1027 1855-10-27
## 14 59165 0025 "THE PACIFIC MAILS. SOUTH… "" 1855 1218 1855-12-18
## 15 59238 0098 "SUPPOSED MURDER OF A SP… "" 1855 1218 1855-12-18
## 16 62602 0039 "DEATHS AT SCUTARL Nomina… "" 1855 0110 1855-01-10
## 17 66839 0016 "THE WAR. The Military G… "" 1855 0419 1855-04-19
## 18 70060 0039 "If is were postponed to… "" 1855 0627 1855-06-27
## 19 79825 0019 "Volama anb ftntral. THE … "" 1855 0113 1855-01-13
## 20 83533 0020 "Sad ant Volta. THE RUGE… "" 1855 1222 1855-12-22
These disproportionately high mentions of the word tetanus seem to be related to the trial of William Palmer (https://en.wikipedia.org/wiki/William_Palmer_(murderer)), who was convicted of murdering his friend with strychnine, a poison which apparently produced tetanus-like symptoms.