This section provides some pointers to resources on character data and text processing in R.
Figure 9.8: Text and string manipulation with stringr and regular expressions from Posit cheatsheets.
Unicode characters and encodings
An inquisitive mind can easily spend a few weeks on Unicode characters and encodings:
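As a small taste of why this topic runs deep, the following base R sketch contrasts characters, bytes, and code points. The example string is made up for illustration and is written with `\u` escapes, so the code does not depend on the encoding of the source file.

```r
# A string with two non-ASCII characters ("Zürich €"), written with
# \u escapes so that the script itself stays plain ASCII:
x <- "Z\u00fcrich \u20ac"

nchar(x, type = "chars")   # 8 characters
nchar(x, type = "bytes")   # 11 bytes (in UTF-8, "ü" takes 2 and "€" takes 3)

# Converting between characters and Unicode code points:
utf8ToInt("\u20ac")        # 8364 (i.e., U+20AC)
intToUtf8(c(82, 33))       # "R!"

# Inspecting the declared encoding and the underlying bytes:
Encoding(x)                # "UTF-8"
charToRaw("\u00fc")        # c3 bc
```

The gap between the character count and the byte count is where many encoding problems originate: functions that count or cut bytes rather than characters can split a multi-byte character in the middle.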
Regular expressions are yet another topic on which it is easy to spend more time than anyone strictly needs to:
Figure 9.9: Basic regular expressions in R (by Ian Kopacka), available at Posit cheatsheets.
See Section E.4 of Appendix E for additional resources on regular expressions.
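For readers who want a quick reminder before consulting these resources, here is a minimal sketch of regular expressions with base R functions. The vector `fruits` is a made-up example, and the stringr package offers counterparts to most of these functions.

```r
fruits <- c("apple", "banana", "cherry", "kiwi 42")

# Detecting matches (returns a logical vector):
grepl("an", fruits)          # FALSE  TRUE FALSE FALSE
grepl("^[a-c]", fruits)      # TRUE  TRUE  TRUE FALSE (starts with a, b, or c)

# Replacing the first match vs. all matches:
sub("a", "A", fruits)        # "Apple" "bAnana" "cherry" "kiwi 42"
gsub("[aeiou]", "", fruits)  # "ppl" "bnn" "chrry" "kw 42"

# Extracting the matched text (here: the first run of digits):
digits <- regmatches(fruits, regexpr("[0-9]+", fruits))
digits                       # "42"
```

Note that `regmatches()` combined with `regexpr()` returns only the elements that actually matched, so its result can be shorter than the input vector.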
Books on text data
The topic of handling text data in R is big enough to receive book-length treatments:
The book Handling Strings with R (by Gaston Sanchez) provides a comprehensive overview of string manipulation in R.
The book Text Mining with R (by Julia Silge and David Robinson) provides a guide to text analysis within a tidy data framework. It uses the tidytext package to format text into tables with one token of text per row and manipulate them to perform advanced tasks.
Not a book on processing text, but a wonderful source of public-domain texts to process is https://www.gutenberg.org. The R package gutenbergr (Robinson, 2020) allows users to search for, download, and process books from this collection.
Natural language processing
Venturing from manipulating character symbols to the challenges of organizing text corpora and analyzing their semantics and pragmatics involves many additional techniques and transformation steps.
The CRAN Task View: Natural Language Processing provides an overview of many related R packages.
Here are some pointers to R packages that may be helpful along the way:
The lsa package provides routines for performing latent semantic analysis (LSA, see Wikipedia) in R.
The quanteda package provides a framework for quantitative text analysis in R.
The tokenizers package provides a consistent interface for converting natural language text into tokens (e.g., paragraphs, sentences, words).
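To make the notion of tokens more concrete, the following base R sketch splits two made-up documents into words and stores them in a table with one token per row (the format that tidytext builds on). This is only a rough approximation under simplifying assumptions (lowercase ASCII letters, no handling of apostrophes or accented characters). It is not how the packages above are implemented; dedicated functions such as `tokenize_words()` from tokenizers handle many more cases.

```r
# Two tiny example documents (hypothetical data):
text <- c(doc1 = "R is fun. Text is data!",
          doc2 = "Tokens are units of text.")

# Lowercase the text, then split on runs of non-letter characters:
words <- strsplit(tolower(text), "[^a-z]+")
names(words) <- names(text)

# Arrange the result as a table with one token per row:
tokens <- data.frame(
  doc  = rep(names(words), lengths(words)),
  word = unlist(words, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop empty strings (these arise if a text starts with a non-letter):
tokens <- tokens[tokens$word != "", ]

# Count how often each word occurs across documents:
word_freq <- sort(table(tokens$word), decreasing = TRUE)
word_freq
```

Once the text is in this shape, counting, filtering, and joining tokens become ordinary data frame operations.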