1.2 The gutenbergr package
The gutenbergr package provides access to the public domain works from the Project Gutenberg collection. The package includes both tools for downloading books (stripping out the unhelpful header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest. In this book, we will mostly use the function gutenberg_download(), which downloads one or more works from Project Gutenberg by ID.
The dataset gutenberg_metadata contains information about each work, pairing each Gutenberg ID with its title, author, language, and so on:
gutenberg_metadata
#> # A tibble: 51,997 x 8
#> gutenberg_id title author gutenberg_autho~ language gutenberg_books~ rights
#> <int> <chr> <chr> <int> <chr> <chr> <chr>
#> 1 0 <NA> <NA> NA en <NA> Publi~
#> 2 1 "The~ Jeffe~ 1638 en United States L~ Publi~
#> 3 2 "The~ Unite~ 1 en American Revolu~ Publi~
#> 4 3 "Joh~ Kenne~ 1666 en <NA> Publi~
#> 5 4 "Lin~ Linco~ 3 en US Civil War Publi~
#> 6 5 "The~ Unite~ 1 en American Revolu~ Publi~
#> # ... with 5.199e+04 more rows, and 1 more variable: has_text <lgl>
For example, you could look up the Gutenberg ID of Wuthering Heights and then download the text by that ID:
gutenberg_metadata %>%
filter(title == "Wuthering Heights")
#> # A tibble: 1 x 8
#> gutenberg_id title author gutenberg_autho~ language gutenberg_books~ rights
#> <int> <chr> <chr> <int> <chr> <chr> <chr>
#> 1 768 Wuth~ Bront~ 405 en Gothic Fiction/~ Publi~
#> # ... with 1 more variable: has_text <lgl>
gutenberg_download(768)
#> # A tibble: 12,085 x 2
#> gutenberg_id text
#> <int> <chr>
#> 1 768 "WUTHERING HEIGHTS"
#> 2 768 ""
#> 3 768 ""
#> 4 768 "CHAPTER I"
#> 5 768 ""
#> 6 768 ""
#> # ... with 1.208e+04 more rows
In many analyses, you may want to keep only English works, avoid duplicates, and include only books whose text can be downloaded. The gutenberg_works() function does this pre-filtering. It also accepts filtering conditions as arguments:
gutenberg_works(author == "Austen, Jane")
#> # A tibble: 10 x 8
#> gutenberg_id title author gutenberg_autho~ language gutenberg_books~ rights
#> <int> <chr> <chr> <int> <chr> <chr> <chr>
#> 1 105 Pers~ Auste~ 68 en <NA> Publi~
#> 2 121 Nort~ Auste~ 68 en Gothic Fiction Publi~
#> 3 141 Mans~ Auste~ 68 en <NA> Publi~
#> 4 158 Emma Auste~ 68 en <NA> Publi~
#> 5 161 Sens~ Auste~ 68 en <NA> Publi~
#> 6 946 Lady~ Auste~ 68 en <NA> Publi~
#> # ... with 4 more rows, and 1 more variable: has_text <lgl>
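To make the pre-filtering concrete, here is a minimal sketch of roughly what it amounts to, written with dplyr on a small made-up table. The data frame toy_metadata and its rows are hypothetical stand-ins for gutenberg_metadata, and the three steps shown (English only, has text, one row per title and author) are an approximation. The real gutenberg_works() may apply further conditions, for example on rights, and its exact rules can differ between package versions.

```r
library(dplyr)

# Hypothetical stand-in for gutenberg_metadata (made-up rows, not real catalogue data)
toy_metadata <- data.frame(
  gutenberg_id        = c(1L, 2L, 3L, 4L, 5L),
  title               = c("Book A", "Book A", "Livre B", "Book C", "Book D"),
  gutenberg_author_id = c(10L, 10L, 11L, 12L, 13L),
  language            = c("en", "en", "fr", "en", "en"),
  has_text            = c(TRUE, TRUE, TRUE, FALSE, TRUE)
)

# Approximate the pre-filtering by hand:
#   1. keep English works that have downloadable text
#   2. keep one row per title/author combination (the first one encountered)
toy_works <- toy_metadata %>%
  filter(language == "en", has_text) %>%
  distinct(title, gutenberg_author_id, .keep_all = TRUE)

toy_works$gutenberg_id  # ids 1 and 5 remain
```

Row 2 is dropped as a duplicate of row 1, row 3 because it is not in English, and row 4 because it has no text. Doing this by hand on the full metadata works too, but calling gutenberg_works() keeps the analysis code shorter and follows whatever rules the package maintainers apply.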