Day 1 Introduction

Dear student,

If you are reading this script, you are either participating in the SICSS itself or came across it while browsing for resources for your studies. In any case, if you find inconsistencies or mistakes, please do not hesitate to point them out by shooting an email to felix.lennert@ensae.fr.

1.1 Outline

This script will introduce you to the automated acquisition and subsequent quantitative analysis of text data using R. Over the last decades, more and more text has become readily available. Think, for example, of social networking platforms, online fora, Google Books, newspaper articles, the fact that YouTube can generate subtitles for all of its videos, or the fact that administrative records are increasingly digitized. Social scientists have, of course, decades of experience analyzing text, yet they used to be constrained by data availability; hence, their data sets tended to be much smaller and could be processed by humans. To make the most of the newly available data sources mentioned above and to repurpose them for social scientific research, we need tools from the information and computer sciences. Some are fairly old, such as basic dictionary-based sentiment analysis, whose precursors were introduced in the 1960s; others are as recent as the early 2000s (LDA) or even 2014 (word2vec).

This script is split into six chapters, each of which represents a day of the summer school. Each chapter contains a “further links” section and exercises. Data are provided through Dropbox links that can be accessed directly from within the script. The raw RMD files are stored in a dedicated GitHub repository.⁴

This introductory chapter is intended to help you set up RStudio by installing all the required packages. Some of them (spacyr!) rely on a working Python interpreter and are therefore a bit more finicky to set up. We provide links to step-by-step guides to help you with the process but will not cover it in the tutorial. The script relies heavily on packages from the tidyverse, with which we assume familiarity, in particular the dplyr and purrr packages. At the end of chapter 1, we include links to introductions we deem useful.

Day 2 introduces web scraping. You will become familiar with making calls to different structured web pages through rvest and RSelenium. APIs and how you can tap them will be introduced, too.

Day 3 gives you insight into techniques for scraping unstructured web pages, which is usually achieved using CSS selectors.

Day 4 is dedicated to the featurization and descriptive analysis of text.

Day 5 introduces both unsupervised and supervised machine learning approaches for the classification and analysis of text.

Day 6 provides a glimpse of how word embedding techniques can be used. It showcases how word2vec can be used to learn embeddings from text from scratch, as well as some arithmetic operations that these embeddings enable.

The following chapters draw heavily on packages from the tidyverse (Wickham et al. 2019), tidytext (Silge and Robinson 2016), and tidymodels (Kuhn and Wickham 2020), as well as the two excellent books “Text Mining with R: A Tidy Approach” (Silge and Robinson 2017) and “Supervised Machine Learning for Text Analysis in R” (Hvitfeldt and Silge 2022). Examples and inspiration are often drawn from blog posts and/or other tutorials, and we will give credit wherever credit is due. Moreover, you will find further readings at the end of each section as well as exercises at the end of the respective chapter.

1.2 Setup procedure

The next chunk prepares your machine for the days to come by installing all the necessary packages.

if (!"tidyverse" %in% installed.packages()[, 1]) install.packages("tidyverse")

packages <- c(
  "broom",
  "devtools",
  "discrim",
  "forcats",
  "glmnet",
  "hcandersenr",
  "httr",
  "irlba",
  "janitor",
  "jsonlite",
  "ldatuning",
  "LDAvis",
  "lubridate", 
  "magrittr",
  "naivebayes",
  "polite",
  "ranger",
  "RSelenium",
  "rtweet",
  "rvest",
  "SnowballC",
  "sotu",
  "spacyr",
  "stm", 
  "stopwords",
  "stminsights",
  "textdata",
  "textrecipes",
  "tidymodels",
  "tidytext", 
  "tidyverse", 
  "topicmodels", 
  "tsne",
  "tune",
  "word2vec",
  "wordcloud",
  "workflows", 
  "vembedr",
  "yardstick"
  )

# install every package in the vector that is not yet present
purrr::walk(packages, ~{
  if (!.x %in% installed.packages()[, 1]) install.packages(.x)
})

# tif is installed from GitHub rather than CRAN
if (!"tif" %in% installed.packages()[, 1]) devtools::install_github("ropensci/tif")

While we would strongly advise you to integrate RStudio projects into your workflow, this is not required for SICSS-Paris. We will work with RMarkdown (RMD) documents, which simplify working with file paths considerably insofar as they treat the folder they are stored in as the current working directory. In our case, this is not strictly necessary, though, since everything can be downloaded directly from Dropbox.

1.2.1 Registration for API usage

In the section on APIs, we will play with the New York Times API. If you want to follow the script on your machine, you need to sign up for access and acquire an API key. Find instructions for registering here.
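
To give you an idea of what tapping an API looks like, here is a minimal sketch of a call to the Article Search endpoint using httr. It assumes that you have stored your key in an environment variable called NYT_KEY (the variable name is our choice).

library(httr)
library(jsonlite)

nyt_key <- Sys.getenv("NYT_KEY") # your personal API key

response <- GET(
  "https://api.nytimes.com/svc/search/v2/articlesearch.json",
  query = list(q = "text mining", `api-key` = nyt_key)
)

parsed <- fromJSON(content(response, as = "text"))
str(parsed, max.level = 2) # peek at the structure of the response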

Also, if you want to play with the rtweet package, you need a Twitter account.
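
Once you have an account, a first call could look like the following sketch; rtweet will prompt you to authenticate via your browser on first use.

library(rtweet)
tweets <- search_tweets("#SICSS", n = 100) # fetch up to 100 recent tweets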

1.2.2 Docker for RSelenium

When you work with RSelenium, you simulate a browser which you then control through R. For multiple reasons, the preferred procedure is to run the headless browser in a Docker container, essentially a light-weight virtual machine inside your laptop. Hence, if you are planning on using RSelenium, you should install Docker first and follow this tutorial to set it up properly. (Please note that if you’re on a non-Intel Mac, like one of the authors of this script, you are screwed and the Selenium browser images for Docker will not run. We have not found a functioning workaround yet. So no scraping with Selenium for you.)
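
For everyone else: once the container is up and running, connecting to it from R looks roughly like this. This is a sketch that assumes you started the container as in the linked tutorial, i.e., with docker run -d -p 4445:4444 selenium/standalone-firefox.

library(RSelenium)

remote_driver <- remoteDriver(
  remoteServerAddr = "localhost", # the container runs on your own machine
  port = 4445L,                   # the port mapped in the docker run call
  browserName = "firefox"
)
remote_driver$open()                                # start the browser session
remote_driver$navigate("https://www.r-project.org") # send the browser to a page
remote_driver$getTitle()                            # check where we ended up
remote_driver$close()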

1.2.3 Some useful functions

We assume that you are familiar with R. However, we are fully aware that coding styles (or “dialects”) differ. Therefore, here is a quick demonstration of some of the building blocks of the tidyverse.

We use the “old” pipe – %>%. The pipe takes the object on its left and forwards it to the function that follows, where it enters as the first argument unless a . placeholder is provided somewhere. %<>% additionally assigns the result back to the object on the left, modifying it in place.

library(magrittr)
mean(c(2, 3)) == c(2, 3) %>% mean()
## [1] TRUE
mtcars %>% lm(mpg ~ cyl, data = .)
## 
## Call:
## lm(formula = mpg ~ cyl, data = .)
## 
## Coefficients:
## (Intercept)          cyl  
##      37.885       -2.876
# … is the same as…
lm(mpg ~ cyl, data = mtcars)
## 
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl  
##      37.885       -2.876
cars <- mtcars
cars %<>% .[[1]] %>% mean()
# … is the same as…
cars <- cars %>% .[[1]] %>% mean()

The most important verbs in the dplyr package are mutate(), select(), filter(), summarize() (used with group_by()), and arrange(). pull() can be used to extract a column as a vector.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract()   masks magrittr::extract()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ purrr::set_names() masks magrittr::set_names()
mtcars %>%
  rownames_to_column("model") %>% # add rownames as a column
  select(model, mpg, cyl, hp) %>% # select 4 columns
  arrange(cyl) %>% # arrange them according to number of cylinders
 # filter(cyl %in% c(4, 6)) %>% # only retain values where condition is TRUE
  mutate(model_lowercase = str_to_lower(model)) %>% # change model names to lowercase
  group_by(cyl) %>% # change scope: the tibble is effectively split up by the grouping variable
  summarize(mean_mpg = mean(mpg)) %>% # drop all other columns, collapse rows
  pull(cyl) # pull vector
## [1] 4 6 8

We will also work with lists. The lesser-known functions here come from the purrr package: the map() family, which applies functions over lists, and pluck(), which extracts elements from a list.

raw_list <- list(1:4, 4:6, 10:42)
str(raw_list)
## List of 3
##  $ : int [1:4] 1 2 3 4
##  $ : int [1:3] 4 5 6
##  $ : int [1:33] 10 11 12 13 14 15 16 17 18 19 ...
map(raw_list, mean)
## [[1]]
## [1] 2.5
## 
## [[2]]
## [1] 5
## 
## [[3]]
## [1] 26
map(raw_list, ~{mean(.x) %>% sqrt()})
## [[1]]
## [1] 1.581139
## 
## [[2]]
## [1] 2.236068
## 
## [[3]]
## [1] 5.09902
map_dbl(raw_list, mean) # by specifying the type of output, you can reduce the list
## [1]  2.5  5.0 26.0
raw_list %>% pluck(1)
## [1] 1 2 3 4

The same can also be achieved using a loop. Here, you use an index to iterate over the list and overwrite its elements one by one.

for (p in seq_along(raw_list)){
  raw_list[[p]] <- mean(raw_list[[p]])
}

Functions are another building block of R. They take arguments, do something to these arguments, and return the value of the last expression (if it is not stored in an object). Alternatively, an object can be returned explicitly using return() – this is usually unnecessary, though.

a_plus_b <- function(a, b){
  a + b
}

a_plus_b(1, 2)
## [1] 3
a_plus_b <- function(a, b){
 c <- a + b
 return(c)
}

a_plus_b(1, 2)
## [1] 3

1.4 Last but not least

Learning R – and programming in general – is tough. More often than not, things will not go the way you want them to. Mostly, this is due to minor typos or the fact that R is case-sensitive. However, don’t fret: only practice makes perfect, and it is perfectly normal not to comprehend error messages at first glance. The following video illustrates this:

If questions arise that a Google search cannot answer, we are always only one email away – and will probably just hit Google right away, too, to figure something out for you.

References

Hvitfeldt, Emil, and Julia Silge. 2022. Supervised Machine Learning for Text Analysis in R. First edition. Data Science Series. Boca Raton London New York: CRC Press, Taylor & Francis Group.
Kuhn, Max, and Hadley Wickham. 2020. “Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles.”
Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” The Journal of Open Source Software 1 (3): 37. https://doi.org/10.21105/joss.00037.
———. 2017. Text Mining with R: A Tidy Approach. First edition. Beijing; Boston: O’Reilly.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.


  4. Hence, you can also file a pull request if you have found a flaw.