Toolbox Computational Social Science


Felix Lennert


October 13, 2023

1 Preface

Dear student,

If you are reading this script, you are either participating in one of my courses on digital methods for the social sciences or at least interested in the topic. If you have any questions or remarks regarding this script, hit me up at .

This script will introduce you to two techniques I regard as elementary for any aspiring (computational) social scientist: the collection of digital trace data via either scraping the web or acquiring data from application programming interfaces (APIs) and the analysis of text in an automated fashion (text mining).

The following chapters draw heavily on packages from the tidyverse (Wickham et al. 2019) and related packages. If you are not yet sufficiently familiar with them, you can have a look at the excellent book R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023).

I have added brief videos to each section. In these, I go through the code of the respective section and show what is going on in it. I sometimes spontaneously elaborate on the examples at hand or show things in the data, so the videos may add some value. However, the script should be enough to understand the concepts I introduce. Each video contains a codeword at some point (so that I can check whether my students have really watched it).

1.1 Outline

This book will unfold as follows:

Chapter 1, “Brief R Recap,” introduces RStudio Projects, Quarto, tidy data and tidyr, dplyr, ggplot2, functions, loops, and purrr. These techniques are elementary for what comes next.

Chapter 2, “stringr and RegExes,” deals with string manipulation using, you guessed it, stringr and regular expressions (regexes).

Chapter 3, “Obtaining Data from the Web,” introduces the reader to the basics of rvest, HTML, and CSS selectors and how these can be used to acquire data from the web. Moreover, I introduce the httr package and explain how you can use it to make requests to APIs.

Chapter 4, “Text Preprocessing and Basic Analyses,” touches upon the basics of bringing text into a numeric format that lends itself to quantitative analysis. Moreover, it introduces feature weighting (i.e., determining which tokens matter more than others) as well as dictionary-based analyses.

Chapter 5, “Text Classification with the Bag of Words,” deals with the classification of text in both supervised and unsupervised ways.

Chapter 6, “Beyond BoW,” finally introduces new text analysis techniques that are based on distributional representations of words, commonly referred to as word embeddings.

All chapters try to use examples from the social sciences. Data sets are provided via Dropbox, so the script should run more or less out of the box. Exercises are included; the solutions are available in the corresponding GitHub repository.