1.2 The gutenbergr package

The gutenbergr package provides access to the public domain works in the Project Gutenberg collection. It includes both tools for downloading books (stripping out the unhelpful header and footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest. In this book, we will mostly use the function gutenberg_download(), which downloads one or more works from Project Gutenberg by ID.

The dataset gutenberg_metadata contains information about each work, pairing each Gutenberg ID with its title, author, language, and other fields.
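
As a quick way to see what the metadata looks like, the sketch below loads the package and inspects the dataset. It assumes gutenbergr is installed; the variable name metadata_columns is only illustrative, and the exact set of columns may differ between package versions.

```r
library(gutenbergr)

# gutenberg_metadata is a data frame that ships with the package,
# so no download is needed to inspect it.
gutenberg_metadata

# List the available columns (gutenberg_id, title, author, language, ...).
metadata_columns <- names(gutenberg_metadata)
metadata_columns
```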

For example, you could find the Gutenberg ID of Wuthering Heights by filtering gutenberg_metadata on its title.
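
One way to do this lookup is sketched below. It assumes the dplyr package is available for filter() and the pipe, and that the title is stored exactly as "Wuthering Heights"; the name wuthering is only illustrative.

```r
library(dplyr)
library(gutenbergr)

# Keep only the metadata rows whose title matches exactly;
# the gutenberg_id column of the result holds the ID to download.
wuthering <- gutenberg_metadata %>%
  filter(title == "Wuthering Heights")

wuthering
```

The gutenberg_id value in the result is what you would pass to gutenberg_download().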

In many analyses, you may want to keep only English works, avoid duplicates, and include only books whose text can be downloaded. The gutenberg_works() function does this pre-filtering. It also accepts filtering conditions as arguments, so you can narrow the results in the same call.
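
A minimal sketch of passing a filter condition to gutenberg_works() follows. It assumes the author is recorded in "Last, First" form, as in "Austen, Jane"; the name austen_works is only illustrative.

```r
library(gutenbergr)

# gutenberg_works() applies its default pre-filtering (English,
# de-duplicated, downloadable text) and then the condition given here.
austen_works <- gutenberg_works(author == "Austen, Jane")

austen_works
```

The condition is written the same way as in a dplyr filter() call, so it can refer to any column of the metadata.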