Toolbox Computational Social Science

Author

Felix Lennert

Published

January 4, 2024

1 Preface

Dear student,

if you read this script, you are either participating in one of my courses on digital methods for the social sciences, or at least interested in this topic. If you have any questions or remarks regarding this script, hit me up at felix.lennert@ensae.fr.

This script will introduce you to two techniques I regard as elementary for any aspiring (computational) social scientist: the collection of digital trace data via either scraping the web or acquiring data from application programming interfaces (APIs) and the analysis of text in an automated fashion (text mining).

The following chapters draw heavily on packages from the tidyverse (Wickham et al. 2019) and related packages. If you have not acquired sufficient familiarity yet, you can have a look at the excellent book R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023).

I have added brief videos to each section. In these, I will briefly go through the code of the respective section and show a bit of what’s going on in there. I sometimes spontaneously elaborate a bit more on the examples at hand or show things in the data, so they may add some value. However, the script should be enough to understand the concepts I introduce. The videos contain a codeword at some point (so that I can check whether my students have really watched them).

1.1 Outline

This book will unfold as follows:

Chapter 2, “Brief R Recap,” briefly introduces RStudio Projects, Quarto, tidy data and tidyr, dplyr, ggplot, functions, loops, and purrr. These techniques are vital for the things that come next.

Chapter 3, “stringr and RegExes,” deals with string manipulation using, you guessed it, stringr and Regexes.

Chapters 4 and 5, “Crawling the Web” and “Scraping the Web – Extracting Data,” introduce the reader to the basics of rvest, HTML, and CSS selectors and how these can be used to acquire data from the web. Moreover, I introduce the httr package and explain how you can use it to make requests to APIs.

Chapter 6, “Text Preprocessing and Featurization,” touches upon the basics of bringing text into a numeric format that lends itself to quantitative analyses. Moreover, it introduces feature weighting (i.e., determining which tokens matter more than others) as well as dictionary-based analyses.

Chapter 7, “Supervised Classification,” deals with the classification of text in a supervised manner using tidymodels.

Chapter 8, “Unsupervised Classification,” deals with the classification of text in an unsupervised manner using “classic” Laten Dirichlet Allocation, Structural Topic Models, and Seeded Topic Models.

Chapter 9, “Word Embeddings,” finally introduces new text analysis techniques that are based on distributional representations of words, commonly referred to as word embeddings.

All chapters try to deal with social scientific examples. Data sets will be provided via Dropbox, therefore the script shall run more or less out of the box. Exercises are included, the respective solutions will be added as the course unfolds (except for the R recap, please contact me in case you are interested).