Chapter 23 Text Analysis and Text Mining

23.1 Introduction

Most “data analysis” has focussed on quantitative methods, but in recent years text analysis and text mining methods have developed in concert with natural language processing (NLP).

23.2 Theory and methods

Shawn Graham, Ian Milligan, and Scott Weingart (2015) “Topic Modeling: A Hands-On Adventure in Big Data”, chapter four of (Graham, Milligan, and Weingart 2015)

23.3 R

The definitive guide

Julia Silge and David Robinson (2016) Tidy Text Mining with R {most recent version dated 2016-12-19}

Other general resources

Super User, 2019-02-04, An overview of the NLP ecosystem in R (#nlproc #textasdata)

23.3.2 Packages

CRAN Task View: NLP

CRAN Task View: Natural Language Processing

  • “This CRAN task view collects relevant R packages that support computational linguists in conducting analysis of speech and language on a variety of levels - setting focus on words, syntax, semantics, and pragmatics.”

{quanteda}


CRAN page: quanteda: Quantitative Analysis of Textual Data


“Getting Started with quanteda” (package vignette)

{tidytext}


CRAN page: tidytext: Text Mining using ‘dplyr’, ‘ggplot2’, and Other Tidy Tools

GitHub repo: tidytext on GitHub


Julia Silge, “Term Frequency and tf-idf Using Tidy Data Principles”, 2016-06-27

Julia Silge and David Robinson (2016-10-27) “Introduction to tidytext” (package vignette)

{tm}


CRAN page: tm: Text Mining Package


Ingo Feinerer (2015) “Introduction to the tm Package: Text Mining in R” (package vignette)

Ingo Feinerer, Kurt Hornik, and David Meyer (2008) “Text Mining Infrastructure in R”, Journal of Statistical Software, 25 (5).


Shawn Graham, Ian Milligan, and Scott Weingart. 2015. Exploring Big Historical Data: The Historian’s Macroscope. Imperial College Press.