Course script for SICSS Paris
2023-06-26
Day 1 Introduction
Dear student,
if you are reading this script, you are either participating in the SICSS itself or came across it while browsing for resources for your studies. In any case, if you find inconsistencies or mistakes, please do not hesitate to point them out by shooting an email to felix.lennert@ensae.fr.
1.1 Outline
This script will introduce you to the automated acquisition and subsequent quantitative analysis of text data using R. Over the last decades, more and more text has become readily available. Think, for example, of social networking platforms, online fora, Google Books, newspaper articles, the fact that YouTube can generate subtitles for all of its videos, or the fact that administrative records are increasingly digitized. Social scientists of course have decades of experience analyzing text, yet they used to be constrained by data availability. Hence, their data sets used to be much smaller and could be processed by humans. To make the most out of the newly available data sets mentioned above and to repurpose them for social scientific research, we need to use tools from the information and computer sciences. Some are fairly old, such as basic dictionary-based sentiment analysis, whose precursors were introduced in the 1960s. Others are as recent as the early 2000s (LDA) or even 2013 (word2vec).
This script is split into 6 chapters. Each chapter can be seen as representing a day of the summer school. Each chapter contains a “further links” section and exercises. Data are provided through Dropbox links that are directly executable from the script. The raw RMD files are stored in dedicated GitHub repositories (2022, 2023).
This introductory chapter is intended to help you set up your RStudio by installing all the required packages. Some of them (spacyr!) rely on a working Python interpreter and are therefore a bit more finicky to set up. We will provide links to step-by-step guides to help you with the process but will not cover it in the tutorial. The script relies heavily on packages from the tidyverse, with which we assume familiarity, in particular with the dplyr and purrr packages. At the end of chapter 1, we include links to introductions we deem useful.
Day 2 introduces web scraping. You will become familiar with making calls to different structured web pages through rvest. APIs and how you can tap them through httr will be introduced, too.
Day 3 gives you insight into techniques for scraping structured web pages. This is usually achieved using CSS selectors.
Day 4 is dedicated to the extraction of patterns from unstructured textual content using regular expressions (RegExes).
Day 5 introduces the featurization and cleaning of text. Different – less and more elaborate – techniques will be presented.
Day 6 introduces supervised machine learning approaches for the classification and analysis of text. Moreover, we show how you can (a) train your own word embeddings from scratch using word2vec and perform some arithmetic calculations and dimensionality reduction with these embeddings, and (b) load pre-trained embeddings and use them for prediction tasks.
The following chapters draw heavily on packages from the tidyverse (Wickham et al. 2019), tidytext (Silge and Robinson 2016), and tidymodels (Kuhn and Wickham 2020), as well as the two excellent books “Text Mining with R: A Tidy Approach” (Silge and Robinson 2017) and “Supervised Machine Learning for Text Analysis in R” (Hvitfeldt and Silge 2022). Examples and inspiration are often drawn from blog posts and/or other tutorials, and we will give credit wherever credit is due. Moreover, you will find further readings at the end of each section as well as exercises at the end of the respective chapter.
1.2 Setup procedure
For now, the only package you will need for running the script is needs. It will take care of the installation of the necessary packages at the beginning of each page.
1.2.1 Registration for API usage
In the section on APIs, we will play with the New York Times API. If you want to follow the script on your machine, you need to sign up for access and acquire an API key. Find instructions for registering here.
1.2.2 Docker for RSelenium
When you work with RSelenium, you launch a browser which you then control through R. For multiple reasons, the preferred procedure is to run the headless browser in a Docker container, a lightweight, isolated environment on your laptop that behaves much like a small virtual machine. Hence, if you are planning on using RSelenium, you should install Docker first and follow this tutorial to set it up properly. (Please note that if you’re on a non-Intel Mac, like one of the authors of this script, you will have to use a different container – check this link.)
1.2.3 Some useful functions
We assume your familiarity with R. However, we are fully aware that coding styles (or “dialects”) differ. Therefore, here is a quick demonstration of some of the building blocks of the tidyverse.
We use the new native pipe, |> (available since R 4.1). The pipe takes the expression on its left and forwards it to the function on its right, including it there as the first argument unless a _ placeholder is provided somewhere (available since R 4.2). Note that the _ placeholder only works with named arguments.
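The chunk that produced the output below is not reproduced in this text. The following is a minimal sketch that yields the same kind of output, assuming R 4.2 or later; the object names same_result, model_piped, and model_plain are our own and purely illustrative.

```r
# Calling mean() directly and via the pipe gives the same result:
# the pipe inserts its left-hand side as the first argument.
same_result <- mean(c(2, 3)) == (c(2, 3) |> mean())
same_result

# The `_` placeholder forwards the data to a named argument
# that is not the first one (here: `data`).
model_piped <- mtcars |> lm(mpg ~ cyl, data = _)
model_piped

# Equivalent call without the pipe.
model_plain <- lm(mpg ~ cyl, data = mtcars)
model_plain
```

Both calls to lm() fit the same model, which is why the two printed model summaries are identical.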
## [1] TRUE
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Coefficients:
## (Intercept) cyl
## 37.885 -2.876
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Coefficients:
## (Intercept) cyl
## 37.885 -2.876
The important verbs in the dplyr package are mutate(), select(), filter(), summarize() (used with group_by()), and arrange(). pull() can be used to extract a vector.
needs(tidyverse)
mtcars |>
  rownames_to_column("model") |> # add rownames as a column
  select(model, mpg, cyl, hp) |> # select 4 columns
  arrange(cyl) |> # arrange them according to number of cylinders
  # filter(cyl %in% c(4, 6)) |> # only retain values where condition is TRUE
  mutate(model_lowercase = str_to_lower(model)) |> # change modelnames to lowercase
  group_by(cyl) |> # change scope, effectively split up tibbles according to group_variable
  summarize(mean_mpg = mean(mpg)) |> # drop all other columns, collapse rows
  pull(cyl) # pull vector
## [1] 4 6 8
We will also work with lists. Lesser-known functions here come from the purrr package. On the one hand, we have the map() family, which applies functions over lists; on the other hand, there is pluck(), which extracts elements from a list.
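The code behind the output below is not reproduced in this text. As a minimal sketch, assuming purrr is installed, calls along the following lines produce output of that shape; the name raw_list and the exact list contents are our own illustrative choice.

```r
library(purrr)

# An illustrative list holding three integer vectors of different lengths.
raw_list <- list(1:4, 4:6, 10:42)
str(raw_list)

# map() applies a function to every element and always returns a list.
map(raw_list, mean)

# An anonymous function lets you chain several steps per element.
map(raw_list, \(x) sqrt(mean(x)))

# map_dbl() returns a numeric vector instead of a list.
map_dbl(raw_list, mean)

# pluck() extracts a single element, here the first one.
pluck(raw_list, 1)
```

The suffix of a map_*() variant states the type of vector you expect back, and the call fails if the results do not fit that type.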
## List of 3
## $ : int [1:4] 1 2 3 4
## $ : int [1:3] 4 5 6
## $ : int [1:33] 10 11 12 13 14 15 16 17 18 19 ...
## [[1]]
## [1] 2.5
##
## [[2]]
## [1] 5
##
## [[3]]
## [1] 26
## [[1]]
## [1] 1.581139
##
## [[2]]
## [1] 2.236068
##
## [[3]]
## [1] 5.09902
## [1] 2.5 5.0 26.0
## [1] 1 2 3 4
This can also be achieved using a loop. Here, you use an index to loop over objects and do something to their elements.
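A minimal sketch of such a loop, using base R only; the names raw_list and means are illustrative assumptions rather than objects defined elsewhere in this script.

```r
# Illustrative input: a list of three integer vectors.
raw_list <- list(1:4, 4:6, 10:42)

# Pre-allocate a list of the right length to hold the results.
means <- vector("list", length(raw_list))

# seq_along() yields the indices 1, 2, 3; each iteration handles one element.
for (i in seq_along(raw_list)) {
  means[[i]] <- mean(raw_list[[i]])
}

means
```

Pre-allocating the result object avoids growing it inside the loop, which becomes slow for long lists.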
Another part of R is functions. They take arguments and do something with them. In the end, they return the value of the last expression that was evaluated in their body. Alternatively, an object can be returned explicitly using return() – this is usually unnecessary though.
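A minimal sketch of both variants; the function names a_plus_b and a_plus_b_explicit are hypothetical and chosen only for illustration.

```r
# Implicit return: the value of the last evaluated expression is returned.
a_plus_b <- function(a, b) {
  a + b
}
a_plus_b(1, 2)

# Explicit return: the result is stored in an object first
# and then handed back with return().
a_plus_b_explicit <- function(a, b) {
  result <- a + b
  return(result)
}
a_plus_b_explicit(1, 2)
```

Both functions give the same result when called with the same arguments.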
## [1] 3
## [1] 3
1.3 Further links
Each chapter will contain a Further links section, where we include useful online resources which you can consume to delve deeper into the matters discussed in the respective chapter.
- Further material for learning covering each section of this script can be found on the RStudio website.
- A more accessible guide to individual tidyverse packages can be found in the introverse R package. Find instructions for how to install and use it online.
- The SICSS bootcamp gets you up and started promptly; wondering if you require a recap? – take the quizzes before going through the material.
- The R4DS book is a good mix of approachable introduction, technical description, real-world examples, and interesting exercises. You can read it in a superficial as well as in an in-depth manner. Solutions for the exercises are available as well. The following chapters are relevant (ordered from most to least relevant): 2-4-6-5-3-7-11-27-14-15-16-19-21.
1.4 Last but not least
Learning R – and programming in general – is tough. More often than not, things will not go the way you want them to go. Mostly, this is due to minor typos or the fact that R is case-sensitive. However, don’t fret. Only practice makes perfect. It is perfectly normal to not comprehend error messages. The following video illustrates this:
If questions arise that a Google search cannot answer, we are always only one email away – and will probably just hit Google right away, too, to figure something out for you.