Day 1 Introduction

Dear student,

If you are reading this script, you are either participating in the SICSS itself or came across it while browsing for resources for your studies. In either case, if you find inconsistencies or mistakes, please do not hesitate to point them out by sending an email to felix.lennert@ensae.fr.

1.1 Outline

This script will introduce you to the automated acquisition and subsequent quantitative analysis of text data using R. Over the last decades, more and more text has become readily available. Think, for example, of social networking platforms, online fora, Google Books, newspaper articles, the automatically generated subtitles on YouTube, or administrative records, which are increasingly digitized. Social scientists of course have decades of experience analyzing text, yet they used to be constrained by data availability; hence, their data sets used to be much smaller and could be processed by humans. To make the most of these newly available data sets and to repurpose them for social scientific research, we need tools from the information and computer sciences. Some are fairly old, such as basic dictionary-based sentiment analysis, whose precursors were introduced in the 1960s; others are as recent as the early 2000s (LDA) or even 2013 (word2vec).

This script is split into six chapters. Each chapter can be seen as representing a day of the summer school. Each chapter contains a “further links” section and exercises. Data are provided through Dropbox links that can be loaded directly from within the script. The raw Rmd files are stored in dedicated GitHub repositories (2022, 2023)4.

This introductory chapter is intended to help you set up your RStudio by installing all the required packages. Some of them (spacyr!) rely on a working Python interpreter and are therefore a bit more finicky to set up. We will provide links to step-by-step guides to help you with the process but not cover it in the tutorial. The script will heavily rely on packages from the tidyverse with which we assume familiarity, in particular with the dplyr and purrr packages. At the end of chapter 1, we include links to introductions we deem useful.

Day 2 introduces web scraping. You will become familiar with making calls to different structured web pages through rvest. APIs, and how you can tap them through httr, will be introduced, too.

Day 3 gives you insight into techniques for scraping structured web pages. This is usually achieved using CSS selectors.

Day 4 is dedicated to the extraction of content from unstructured text using regular expressions (RegExes).

Day 5 introduces the featurization and cleaning of text. Different – less and more elaborate – techniques will be presented.

Day 6 introduces supervised machine learning approaches for the classification and analysis of text. Moreover, we show (a) how you can train your own word embeddings from scratch using word2vec, and how to perform arithmetic calculations and dimensionality reduction with these embeddings, and (b) how you can load pre-trained embeddings and use them for prediction tasks.

The following chapters draw heavily on packages from the tidyverse (Wickham et al. 2019), tidytext (Silge and Robinson 2016), and tidymodels (Kuhn and Wickham 2020), as well as the two excellent books “Text Mining with R: A Tidy Approach” (Silge and Robinson 2017) and “Supervised Machine Learning for Text Analysis in R” (Hvitfeldt and Silge 2022). Examples and inspiration are often drawn from blog posts and/or other tutorials, and we will give credit wherever credit is due. Moreover, you will find further readings at the end of each section as well as exercises at the end of the respective chapter.

1.2 Setup procedure

For now, the only package you will need for running the script is needs. It will take care of installing and loading the necessary packages at the beginning of each page.
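To illustrate the idea behind this convenience, here is a minimal base-R sketch of the check such a helper has to perform first: finding out which packages are not yet available. The function name `missing_packages()` is hypothetical and this is not how needs is implemented; it only relies on `requireNamespace()` from base R.

```r
# Hypothetical helper (not part of the needs package): return the names of
# packages that cannot be loaded from the current library paths.
missing_packages <- function(pkgs) {
  available <- vapply(
    pkgs,
    function(pkg) requireNamespace(pkg, quietly = TRUE),
    logical(1)
  )
  pkgs[!available]
}

# stats and utils ship with R, so nothing should be reported here
to_install <- missing_packages(c("stats", "utils"))
to_install
```

Any package names reported this way could then be passed to `install.packages()`.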

1.2.1 Registration for API usage

In the section on APIs, we will play with the New York Times API. If you want to follow the script on your machine, you need to sign up for access and acquire an API key. Find instructions for registering here.
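Once you have a key, one common way to keep it out of your script is an environment variable. The sketch below assumes the variable name `NYT_API_KEY`, which is chosen here purely for illustration and is not prescribed by the New York Times, and it uses a placeholder value instead of a real key.

```r
# Assumption: the variable name NYT_API_KEY is chosen for illustration only.
# Set it once per session (or, more permanently, in your ~/.Renviron file).
Sys.setenv(NYT_API_KEY = "your-key-goes-here")

# Later, read the key instead of hard-coding it in the script.
nyt_key <- Sys.getenv("NYT_API_KEY")

# Sys.getenv() returns "" for unset variables, so a simple check is possible.
if (nyt_key == "") {
  stop("No API key found. Please set NYT_API_KEY first.")
}
```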

1.2.2 Docker for RSelenium

When you work with RSelenium, you launch a browser which you then control through R. For multiple reasons, the preferred procedure is to run the headless browser in a Docker container, an isolated environment on your laptop that behaves much like a lightweight virtual machine. Hence, if you are planning on using RSelenium, you should install Docker first and follow this tutorial to set it up properly. (Please note that if you’re on a non-Intel Mac, like one of the authors of this script, you will have to use a different container; check this link.)

1.2.3 Some useful functions

We assume that you are familiar with R. However, we are fully aware that coding styles (or “dialects”) differ. Therefore, here is a quick demonstration of some of the building blocks of the tidyverse.

We use the new native pipe, |>, which has been part of base R since version 4.1. The pipe takes the value on its left and forwards it to the function on its right, inserting it there as the first argument unless a _ placeholder is provided somewhere. Note that the _ placeholder (available since R 4.2) only works with named arguments.

# the native pipe |> needs no package; library(magrittr) is only required for %>%
mean(c(2, 3)) == c(2, 3) |> mean()
## [1] TRUE
mtcars |> lm(mpg ~ cyl, data = _)
## 
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl  
##      37.885       -2.876
# … is the same as…
lm(mpg ~ cyl, data = mtcars)
## 
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl  
##      37.885       -2.876
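The restriction to named arguments can be checked without interrupting your session. The following sketch assumes R 4.2 or newer and uses a small hypothetical helper, `parses_ok()`, which merely asks R whether a piece of code can be parsed.

```r
# Hypothetical helper: TRUE if `code` is syntactically valid R, FALSE otherwise.
parses_ok <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

# Placeholder passed to a named argument: accepted (in R >= 4.2)
named <- parses_ok("mtcars |> lm(mpg ~ cyl, data = _)")

# Placeholder passed by position: rejected by the parser
unnamed <- parses_ok("mtcars |> lm(mpg ~ cyl, _)")

c(named = named, unnamed = unnamed)
```

Because the pipe is resolved while R parses the code, the positional variant fails before anything is evaluated.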

The most important verbs in the dplyr package are mutate(), select(), filter(), summarize() (often used with group_by()), and arrange(). pull() can be used to extract a column as a vector.

needs(tidyverse)

mtcars |>
  rownames_to_column("model") |> # add rownames as a column
  select(model, mpg, cyl, hp) |> # select 4 columns
  arrange(cyl) |> # arrange them according to number of cylinders
  # filter(cyl %in% c(4, 6)) |> # (commented out) would only retain rows where the condition is TRUE
  mutate(model_lowercase = str_to_lower(model)) |> # change modelnames to lowercase
  group_by(cyl) |> # change scope, effectively split up tibbles according to group_variable
  summarize(mean_mpg = mean(mpg)) |> # drop all other columns, collapse rows
  pull(cyl) # pull vector
## [1] 4 6 8
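If you are more used to base R, it may help to see the grouped summary from the pipeline above written without dplyr. This is only a rough equivalent for orientation, using `aggregate()` from the stats package, which ships with R.

```r
# Mean miles per gallon for each number of cylinders, base-R style
mean_mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Mirror the column name used with summarize() above
names(mean_mpg_by_cyl)[2] <- "mean_mpg"

# Extract one column as a vector, comparable to pull(cyl)
mean_mpg_by_cyl$cyl
```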

We will also work with lists. Lesser-known functions for this come from the purrr package: the map() family, which applies a function to each element of a list, and pluck(), which extracts elements from a list.

raw_list <- list(1:4, 4:6, 10:42)
str(raw_list)
## List of 3
##  $ : int [1:4] 1 2 3 4
##  $ : int [1:3] 4 5 6
##  $ : int [1:33] 10 11 12 13 14 15 16 17 18 19 ...
map(raw_list, mean)
## [[1]]
## [1] 2.5
## 
## [[2]]
## [1] 5
## 
## [[3]]
## [1] 26
map(raw_list, \(x) mean(x) |> sqrt())
## [[1]]
## [1] 1.581139
## 
## [[2]]
## [1] 2.236068
## 
## [[3]]
## [1] 5.09902
map_dbl(raw_list, mean) # by specifying the type of output, you can reduce the list
## [1]  2.5  5.0 26.0
raw_list |> pluck(1)
## [1] 1 2 3 4
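For comparison, here is a sketch of how the same operations look in base R. It is meant for orientation only and does not replace the purrr functions used throughout the script.

```r
raw_list <- list(1:4, 4:6, 10:42)

# lapply() always returns a list, like map()
means_list <- lapply(raw_list, mean)

# vapply() returns a vector and checks that every result is a single number,
# which is comparable to map_dbl()
means_dbl <- vapply(raw_list, mean, numeric(1))

# [[ ]] extracts one element, comparable to pluck(1)
first_element <- raw_list[[1]]
```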

The same can be achieved using a loop. Here, you use an index to loop over the list and do something to each of its elements. Note that the loop below overwrites raw_list with the results.

for (p in seq_along(raw_list)){
  raw_list[[p]] <- mean(raw_list[[p]])
}
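To keep the original list intact, a common alternative is to pre-allocate a result vector and fill it inside the loop. A minimal sketch, with `list_means` as an arbitrary name:

```r
raw_list <- list(1:4, 4:6, 10:42)

# Pre-allocate a numeric vector with one slot per list element
list_means <- numeric(length(raw_list))

for (p in seq_along(raw_list)) {
  list_means[p] <- mean(raw_list[[p]]) # fill slot p; raw_list stays untouched
}

list_means
```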

Another building block of R is functions. They usually take arguments and do something with them. In the end, they return the value of the last evaluated expression; if that expression is an assignment, the value is returned invisibly, i.e., it is not printed. Alternatively, an object can be returned explicitly using return(), although this is usually unnecessary.

a_plus_b <- function(a, b){
  a + b
}

a_plus_b(1, 2)
## [1] 3
a_plus_b <- function(a, b){
  result <- a + b # store the sum (we avoid the name c, which is also a base function)
  return(result)
}

a_plus_b(1, 2)
## [1] 3
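Two details are worth seeing in action: default values for arguments, and what happens when the last expression of a function is an assignment. The function names below are made up for illustration.

```r
# Argument b has a default value, so it can be omitted in the call
a_plus_b_default <- function(a, b = 10) {
  a + b
}
a_plus_b_default(1)        # uses b = 10
a_plus_b_default(1, b = 2) # overrides the default

# If the last expression is an assignment, the value is still returned,
# but invisibly: nothing is printed unless you store or print() it
a_plus_b_silent <- function(a, b) {
  result <- a + b
}
a_plus_b_silent(1, 2)           # prints nothing at the console
stored <- a_plus_b_silent(1, 2) # the value is there nonetheless
stored
```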

1.4 Last but not least

Learning R – and programming in general – is tough. More often than not, things will not go the way you want them to go. Mostly, this is due to minor typos or the fact that R is case-sensitive. However, don’t fret. Only practice makes perfect. It is perfectly normal to not comprehend error messages. The following video illustrates this:

If questions arise that a Google search cannot answer, we are always only one email away – and will probably just hit Google right away, too, to figure something out for you.

References

Hvitfeldt, Emil, and Julia Silge. 2022. Supervised Machine Learning for Text Analysis in R. First edition. Data Science Series. Boca Raton London New York: CRC Press, Taylor & Francis Group.
Kuhn, Max, and Hadley Wickham. 2020. “Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles.”
Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” The Journal of Open Source Software 1 (3): 37. https://doi.org/10.21105/joss.00037.
———. 2017. Text Mining with R: A Tidy Approach. First edition. Beijing ; Boston: O’Reilly.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

  1. Lille University

  2. CREST, École Polytechnique; to whom correspondence should be addressed, felix.lennert@ensae.fr

  3. CREST, École Polytechnique

  4. Hence, you can also file a pull request if you have found a flaw.