Chapter 24 Text Analysis and Text Mining

24.1 Introduction

Most “data analysis” has focussed on quantitative methods, but in recent years methods for text analysis and text mining have developed in concert with natural language processing (NLP).

24.2 Theory and methods

Shawn Graham, Ian Milligan, and Scott Weingart (2015) “Topic Modeling: A Hands-On Adventure in Big Data”, chapter four of Exploring Big Historical Data: The Historian’s Macroscope (Graham, Milligan, and Weingart 2015)

24.3 R

The definitive guide

Julia Silge and David Robinson (2016) Tidy Text Mining with R (most recent version dated 2016-12-19)

Other general resources

Super User, 2019-02-04, An overview of the NLP ecosystem in R (#nlproc #textasdata)

24.3.2 Packages

24.3.2.1 CRAN Task View: NLP

CRAN Task View: Natural Language Processing

  • “This CRAN task view collects relevant R packages that support computational linguists in conducting analysis of speech and language on a variety of levels - setting focus on words, syntax, semantics, and pragmatics.”

24.3.2.3 {quanteda}

quanteda.io

package

CRAN page: quanteda: Quantitative Analysis of Textual Data

articles

“Getting Started with quanteda” (package vignette)

24.3.2.4 {tidytext}

package

CRAN page: tidytext: Text Mining using ‘dplyr’, ‘ggplot2’, and Other Tidy Tools

github repo: tidytext on github

articles

Julia Silge, “Term Frequency and tf-idf Using Tidy Data Principles”, 2016-06-27

Julia Silge and David Robinson (2016-10-27) “Introduction to tidytext” (package vignette)

24.3.2.5 {tm}

package

CRAN page: tm: Text Mining Package

articles

Ingo Feinerer (2015) “Introduction to the tm Package: Text Mining in R” (package vignette)

Ingo Feinerer, Kurt Hornik, and David Meyer (2008) “Text Mining Infrastructure in R”, Journal of Statistical Software, 25 (5).


References

Shawn Graham, Ian Milligan, and Scott Weingart. 2015. Exploring Big Historical Data: The Historian’s Macroscope. Imperial College Press. http://www.themacroscope.org/2.0/.