Text Mining for Social Sciences (Summer 2024)
1 Preface
Dear student,
if you are reading this script, you are either participating in one of my courses on digital methods for the social sciences or at least interested in the topic. If you have any questions or remarks regarding this script, hit me up at felix.lennert@ensae.fr.
This script will introduce you to two techniques I regard as elementary for any aspiring (computational) social scientist: the collection of digital trace data, either by scraping the web or by acquiring data from application programming interfaces (APIs), and the automated analysis of text (text mining).
The following chapters draw heavily on packages from the tidyverse (Wickham et al. 2019) and related packages. If you have not acquired sufficient familiarity yet, you can have a look at the excellent book R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023).
I have added short videos to each section. In these, I walk through the code of the respective section and show a bit of what is going on in there. I sometimes spontaneously elaborate on the examples at hand or show things in the data, so the videos may add some value. However, the script should be enough to understand the concepts I introduce. Each video contains a codeword at some point (so that I can check whether my students have really watched it).
1.1 Outline
This book will unfold as follows:
Chapter 2, “Brief R Recap,” briefly introduces RStudio Projects, Quarto, tidy data and tidyr, dplyr, ggplot, functions, loops, and purrr. These techniques are vital for the things that come next.
Chapter 3, “stringr and RegExes,” deals with string manipulation using, you guessed it, stringr and RegExes.
Chapter 4, “Text Preprocessing and Featurization,” touches upon the basics of bringing text into a numeric format that lends itself to quantitative analyses.
Chapter 5, “Dictionary-based Analysis,” deals with counting words that are stored in a dictionary.
Chapter 6, “Weighing Terms,” introduces feature weighting (i.e., determining which tokens matter more than others).
Chapter 7, “Tagging,” shows you how to tag different parts of speech, recognize different named entities, and parse dependencies.
Chapter 8, “Unsupervised Classification,” deals with the classification of text in an unsupervised manner using “classic” Latent Dirichlet Allocation, Structural Topic Models, and Seeded Topic Models.
Chapter 9, “Supervised Classification,” deals with the classification of text in a supervised manner using tidymodels.
Chapter 10, “Word Embeddings,” finally introduces new text analysis techniques that are based on distributional representations of words, commonly referred to as word embeddings.
All chapters try to deal with social scientific examples. Data sets will be provided via Dropbox, so the script should run more or less out of the box. Exercises are included; the respective solutions will be added as the course unfolds (except for the R recap; please contact me if you are interested in those).