Chapter 2 Digital trace data

One elementary skill for computational social scientists is the analysis of “found” data from the internet. There are two basic approaches to acquiring those data: scraping web pages and sending API requests.

Figure 1 shows an overview of different technologies that are used to store and disseminate data (Munzert et al. 2014: 10).

Figure 1: technologies for putting data online and how to extract and store them

HTML is usually used for content on more basic, static websites; AJAX powers fancier ones whose appearance changes dynamically via JavaScript. We use rvest (Wickham 2019a) for web scraping. AJAX, however, requires more elaborate software (e.g., RSelenium); I will link an extensive tutorial for that later. JSON is the format many application programming interfaces (APIs) provide data in, and I will therefore dwell a bit more on it in the chapter on APIs.

A problem with data from the web is their messiness: there might be special characters you want to get rid of or unnecessary text you need to remove. The most common way to do this is by using regular expressions (regexes), which are introduced in the first part of this chapter. Then, I will introduce the actual scraping. I will also give a brief introduction to HTML, as it will enable you to pre-select the relevant parts of a particular web page. rvest has some handy functions to extract certain kinds of content; those will be introduced, too. One big advantage of scraping with R is, of course, that we can automate the process. For instance, we can tell the machine to first scrape a list of links and then follow those links and extract information from there (e.g., if you want to collect data on housing prices). However, many web pages do not want you to extract their entire content, or only at a certain rate limit, and we definitely need to respect that. This is what the polite (Perepolkin 2019) package is for, and it works well in conjunction with rvest. In the final part, you will learn more about APIs and how you can communicate with them. For this, I will also introduce you to JSON, the data format most APIs work with.

2.1 String manipulation

When working with data, a significant number of variables will be in some sort of text format. When you want to manipulate those variables, an easy approach would be to export the data to MS Excel and then perform the manipulations by hand. This is very time-consuming, though, so I rather recommend the R way, which scales well and works fast for data sets of varying sizes.

Quick reminder: a string is an element of a character vector and can be created by simply wrapping some text in quotation marks:

string <- "Hi, how are you doing?"
vector_of_strings <- c("Hi, how are you doing?", "I'm doing well, HBY?", "Me too, thanks for asking.")

The stringr package (Wickham 2019b) contains a multitude of commands (49 in total) which can be used to achieve a couple of things: manipulating character vectors; operations which are sensitive to different locales; matching patterns. Basically, those goals can also be achieved with base R functions, but stringr’s advantage is its consistency. The makers of stringr describe it as

A consistent, simple and easy to use set of wrappers around the fantastic ‘stringi’ package. All function and argument names (and positions) are consistent, all functions deal with “NA”’s and zero length vectors in the same way, and the output from one function is easy to feed into the input of another.

Every stringr function starts with str_ – which facilitates finding the proper command: just type str_ and RStudio’s auto-suggest function should take care of the rest (if it doesn’t pop up by itself, you can trigger it by hitting the tab-key). Also, they take a vector of strings as their first argument, which facilitates using them in a %>%-pipeline and adding them to a mutate()-call.
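For instance, once the tidyverse is attached (as in the next subsection), a stringr call slots right into a pipeline – a quick sketch:

tibble(text = vector_of_strings) %>% 
  mutate(text_lower = str_to_lower(text))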

One important component of stringr functions is regular expressions which will be introduced later as well.

2.1.1 Basic manipulations

In the following, I will introduce you to a number of different operations that can be performed on strings.

2.1.1.1 Changing the case of the words

A basic operation is changing words’ case.

library(tidyverse) #stringr is part of the core tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
str_to_lower(vector_of_strings)
## [1] "hi, how are you doing?"     "i'm doing well, hby?"      
## [3] "me too, thanks for asking."
str_to_upper(vector_of_strings)
## [1] "HI, HOW ARE YOU DOING?"     "I'M DOING WELL, HBY?"      
## [3] "ME TOO, THANKS FOR ASKING."
str_to_title(vector_of_strings)
## [1] "Hi, How Are You Doing?"     "I'm Doing Well, Hby?"      
## [3] "Me Too, Thanks For Asking."
str_to_sentence(vector_of_strings)
## [1] "Hi, how are you doing?"     "I'm doing well, hby?"      
## [3] "Me too, thanks for asking."

2.1.1.2 Determining a string’s length

Determining the string’s number of characters goes as follows:

str_length(vector_of_strings)
## [1] 22 20 26

2.1.1.3 Extracting particular characters

Characters can be extracted (by position) using str_sub():

str_sub(vector_of_strings, start = 1, end = 5) # extracting first to fifth character
## [1] "Hi, h" "I'm d" "Me to"
str_sub(vector_of_strings, start = -5, end = -1) # extracting fifth-to-last to last character
## [1] "oing?" " HBY?" "king."

You can also use str_sub() to replace strings. E.g., to replace the last character by a full stop, you can do the following:

str_sub(vector_of_strings, start = -1) <- "."
vector_of_strings
## [1] "Hi, how are you doing."     "I'm doing well, HBY."      
## [3] "Me too, thanks for asking."

However, in everyday use you would probably go with str_replace() and regular expressions.

2.1.1.4 Concatenating strings

Similar to how c() puts together different elements (or vectors of length 1) into a single vector, str_c() can be used to concatenate several strings into a single string. This can, for instance, be used to write some birthday invitations.

names <- c("Inger", "Peter", "Kalle", "Ingrid")

str_c("Hi", names, "I hope you're doing well. As per this letter, I invite you to my birthday party.")
## [1] "HiIngerI hope you're doing well. As per this letter, I invite you to my birthday party." 
## [2] "HiPeterI hope you're doing well. As per this letter, I invite you to my birthday party." 
## [3] "HiKalleI hope you're doing well. As per this letter, I invite you to my birthday party." 
## [4] "HiIngridI hope you're doing well. As per this letter, I invite you to my birthday party."

Well, this looks kind of ugly: there are no spaces, and commas are lacking as well. You can fix that by specifying a separator with the sep argument.

str_c("Hi", names, "I hope you're doing well. As per this letter, I invite you to my birthday party.", sep = ", ")
## [1] "Hi, Inger, I hope you're doing well. As per this letter, I invite you to my birthday party." 
## [2] "Hi, Peter, I hope you're doing well. As per this letter, I invite you to my birthday party." 
## [3] "Hi, Kalle, I hope you're doing well. As per this letter, I invite you to my birthday party." 
## [4] "Hi, Ingrid, I hope you're doing well. As per this letter, I invite you to my birthday party."

You could also collapse the strings contained in a vector together into one single string using the collapse argument.

str_c(names, collapse = ", ")
## [1] "Inger, Peter, Kalle, Ingrid"

This can also be achieved using the str_flatten() function.

str_flatten(names, collapse = ", ")
## [1] "Inger, Peter, Kalle, Ingrid"

2.1.1.5 Repetition

Repeating (or duplicating) strings is performed using str_dup(). The function takes two arguments: the string to be duplicated and the number of times it is to be repeated.

str_dup("felix", 2)
## [1] "felixfelix"
str_dup("felix", 1:3)
## [1] "felix"           "felixfelix"      "felixfelixfelix"
str_dup(names, 2)
## [1] "IngerInger"   "PeterPeter"   "KalleKalle"   "IngridIngrid"
str_dup(names, 1:4)
## [1] "Inger"                    "PeterPeter"              
## [3] "KalleKalleKalle"          "IngridIngridIngridIngrid"

2.1.1.6 Removing unnecessary whitespaces

Often text contains unnecessary whitespaces.

unnecessary_whitespaces <- c("    on the left", "on the right    ", "    on both sides   ", "   literally    everywhere  ")

Removing the ones at the beginning or the end of a string can be accomplished using str_trim().

str_trim(unnecessary_whitespaces, side = "left")
## [1] "on the left"               "on the right    "         
## [3] "on both sides   "          "literally    everywhere  "
str_trim(unnecessary_whitespaces, side = "right")
## [1] "    on the left"            "on the right"              
## [3] "    on both sides"          "   literally    everywhere"
str_trim(unnecessary_whitespaces, side = "both") # the default option
## [1] "on the left"             "on the right"           
## [3] "on both sides"           "literally    everywhere"

str_trim() could not fix the last string though, where unnecessary whitespaces were also present between words. Here, str_squish() is more appropriate: it removes leading and trailing whitespaces as well as duplicated ones between words.

str_squish(unnecessary_whitespaces)
## [1] "on the left"          "on the right"         "on both sides"       
## [4] "literally everywhere"

2.1.2 Regular expressions

Up to now, you have been introduced to the more basic functions of the stringr package. Those are useful, for sure, yet limited. To make use of stringr’s full potential, you will first have to get acquainted with regular expressions (often abbreviated as “regex,” plural “regexes”).

Regular expressions are patterns that describe certain strings. Hence, if you want to replace certain words with others, you can write the proper regex, it will identify the strings you want to replace, and the stringr function (i.e., str_replace()) will take care of the rest. Exemplary use cases of regexes are the identification of phone numbers or email addresses, or checking whether a password you choose on a web page contains enough characters, an upper-case character, and at least one special character.

Before you dive into regexes, beware that they are quite complicated in the beginning (honestly, I was quite overwhelmed when I encountered them first). Yet, mastering them is very rewarding and will definitely pay off in the future.

2.1.2.1 Literal characters

The most basic regex patterns consist of literal characters only. str_view() shows you which parts of a string match the pattern (if any).

five_largest_cities <- c("Stockholm", "Göteborg", "Malmö", "Uppsala", "Västerås")

Note that regexes are case-sensitive.

str_view(five_largest_cities, "stockholm")
str_view(five_largest_cities, "Stockholm")

They also match parts of words:

str_view(five_largest_cities, "borg")

Moreover, they are “greedy”: they only match the first occurrence (here, the first “o” in “Stockholm”):

str_view(five_largest_cities, "o")

This can be addressed in the stringr package by using the str_._all() functions – but more on that later.

If you want to match multiple literal characters (or words, for that sake), you can connect them using the | meta character (more on meta characters later).

str_view(five_largest_cities, "Stockholm|Göteborg")

Every letter of the English alphabet, every digit, and any combination of those can serve as literal characters. Literal characters match themselves. This is, however, not the case for the other sort of characters, the so-called meta characters.

2.1.2.2 Metacharacters

When using regexes, the following characters are considered meta characters and have a special meaning:

. \ | ( ) { } [ ] ^ $ - * + ?

2.1.2.2.1 The wildcard

Did you notice how I used the dot to refer to the entirety of the str_._all() functions? This is basically what the . meta character does: it matches every character except for a new line. The first call extracts all function names from the stringr package, the second one checks for each name whether the pattern is present (i.e., returns TRUE for the elements of the vector where it can find the pattern).

stringr_functions <- ls("package:stringr")

str_detect(stringr_functions, "str_._all")
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE

Well, as you can see, there are no matches. This is due to the fact that the . stands in for exactly one character. We need some sort of multiplier to find them. The ones available are:

  • ? – zero or one
  • * – zero or more
  • + – one or more
  • {n} – exactly n
  • {n,} – n or more
  • {n,m} – between n and m
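A quick illustration of the multipliers, using some made-up strings (a sketch; the + case is applied to our real problem right below):

str_detect(c("a", "ab", "abb", "abbb"), "ab{2,3}")
# should return FALSE FALSE TRUE TRUE: only strings with at least two "b"s match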

In our case, the appropriate one is +:

str_detect(stringr_functions, "str_.+_all")
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [25]  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [49] FALSE FALSE FALSE FALSE

But what if you want to match a literal dot – for instance, when searching for clock times? A naive regex might look like this:

vectors_with_time <- c("13500", "13M00", "13.00")

str_detect(vectors_with_time, "13.00")
## [1] TRUE TRUE TRUE

Yet, it matches everything. We need some sort of literal dot. Here, the meta character \ comes in handy. By putting it in front of a meta character, the latter no longer has its special meaning and is interpreted as a literal character. This procedure is referred to as “escaping”; hence, \ is also referred to as the “escape character.” Note that in R you will need to escape the \ itself, so the pattern for a literal dot looks like this: \\.

str_detect(vectors_with_time, "13\\.00")
## [1] FALSE FALSE  TRUE

2.1.2.3 Sets of characters

You can also define sets of multiple characters using the [ ] meta characters. This can be used to define multiple possible characters that can appear in the same place.

sp_ce <- c("spice", "space")

str_view(sp_ce, "sp[ai]ce")

You can also define certain ranges of characters using the - meta character:
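For instance (a small sketch reusing sp_ce from above), [a-z] matches any lowercase letter, so both words match:

str_view(sp_ce, "sp[a-z]ce")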

Same holds for numbers:

american_phone_number <- "(555) 555-1234"

str_view(american_phone_number, "\\([:digit:]{3}\\) [0-9]{3}-[0-9]{4}")

There are also predefined sets of characters, for instance digits or letters, which are called character classes. You can find them on the stringr cheatsheet.
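For example, [:digit:] matches any digit and [:alpha:] any letter – a quick sketch using the vectors defined above:

str_view(american_phone_number, "[:digit:]+")
str_view(five_largest_cities, "[:alpha:]+")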

Furthermore, you can put almost every meta character inside the square brackets without escaping it. This does not apply to the caret (^) in first position, the dash -, the closing square bracket ], and the backslash \.

str_view(vector_of_strings, "[.]")

2.1.2.3.1 Negating sets of characters

Sometimes you will also want to exclude certain sets of characters or words. In order to achieve this, you can use the ^ meta character at the beginning of the range or set you are defining.

str_view(sp_ce, "sp[^i]ce")

2.1.2.4 Anchors

There is also a way to define whether you want the pattern to be present at the beginning (^) or at the end ($) of a string. sentences is a vector of 720 predefined example sentences that ships with stringr. If I were now interested in the sentences that begin with a “The,” I could write the following regex:

shortened_sentences <- sentences[1:10]

str_view(shortened_sentences, "^The") 

If I wanted to know how many start with a “The” and end with a full stop, I could do this one:

str_view(shortened_sentences, "^The.+\\.$") 

2.1.2.4.1 Boundaries

Note that right now, the regex also matches the sentence which starts with a “These.” To address this, I need to tell the machine that it should only accept a “The” if a new word starts thereafter. In regex syntax, this is done using so-called boundaries: \b denotes a word boundary, \B no word boundary. (Note that you will need an additional escape character, as you have to escape the escape character itself.)

In my example, I would include the former if I were to search for sentences that begin with a single “The” and the latter if I were to search for sentences that begin with a word that starts with a “The” but are not “The” – such as “These.”

str_view(shortened_sentences, "^The\\b.+\\.$") 
str_view(shortened_sentences, "^The\\B.+\\.$") 

2.1.2.4.2 Lookarounds

A final common task is to extract certain words or values based on what comes before or after them. Look at the following example:

heights <- c("1m30cm", "2m01cm", "3m10cm")

Here, in order to identify the height in meters, the first task is to identify all the numbers that are followed by an “m.” The regex syntax for this looks like this: A(?=pattern) with A being the entity that is supposed to be found (hence, in this case, [0-9]+).

str_view(heights, "[0-9]+(?=m)")

The second step now is to identify the centimeters. This could of course be achieved using the same regex and simply replacing m by cm. However, we can also harness a so-called negative look ahead, A(?!pattern), or a so-called look behind, (?<=pattern)A. Their counterpart, the negative look behind (?<!pattern)A, could be used to extract the meters.

The negative look ahead basically returns everything that is not followed by the defined pattern. The look behind returns everything that is preceded by the pattern, the negative look behind returns everything that is not preceded by the pattern.

In the following, I demonstrate how you could extract the centimeters using negative look ahead and look behind.

str_view(heights, "[0-9]+(?!m)") # negative look ahead
str_view(heights, "(?<=m)[0-9]+") # look behind

2.1.3 More advanced string manipulation

Now that you have learned about regexes, you can unleash the full power of stringr.

The basic syntax of a stringr function looks as follows: str_.*(string, regex("")). Some stringr functions also have the suffix _all which implies that they perform the operation not only on the first match (“greedy”) but on every match.

In order to demonstrate the different functions, I will again rely on the subset of example sentences.

2.1.3.1 Detect matches

str_detect() can be used to determine whether a certain pattern is present in a string.

str_detect(shortened_sentences, "The\\b")
##  [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE

This also works very well in a dplyr::filter() call. Finding all action movies in the IMDb data set, for instance, can be achieved like this:

imdb_raw <- read_csv("data/imdb2006-2016.csv")
## Rows: 1000 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Title, Genre, Description, Director, Actors
## dbl (7): Rank, Year, Runtime (Minutes), Rating, Votes, Revenue (Millions), M...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
imdb_raw %>% 
  filter(str_detect(Genre, "Action"))
## # A tibble: 303 × 12
##     Rank Title  Genre  Description Director Actors  Year `Runtime (Minut… Rating
##    <dbl> <chr>  <chr>  <chr>       <chr>    <chr>  <dbl>            <dbl>  <dbl>
##  1     1 Guard… Actio… A group of… James G… Chris…  2014              121    8.1
##  2     5 Suici… Actio… A secret g… David A… Will …  2016              123    6.2
##  3     6 The G… Actio… European m… Yimou Z… Matt …  2016              103    6.1
##  4     9 The L… Actio… A true-lif… James G… Charl…  2016              141    7.1
##  5    13 Rogue… Actio… The Rebel … Gareth … Felic…  2016              133    7.9
##  6    15 Colos… Actio… Gloria is … Nacho V… Anne …  2016              109    6.4
##  7    18 Jason… Actio… The CIA's … Paul Gr… Matt …  2016              123    6.7
##  8    25 Indep… Actio… Two decade… Roland … Liam …  2016              120    5.3
##  9    27 Bahub… Actio… In ancient… S.S. Ra… Prabh…  2015              159    8.3
## 10    30 Assas… Actio… When Callu… Justin … Micha…  2016              115    5.9
## # … with 293 more rows, and 3 more variables: Votes <dbl>,
## #   Revenue (Millions) <dbl>, Metascore <dbl>

If you want to know how many matches are present in each string, you can use str_count(). Here, it might be advisable to set the ignore_case option to TRUE:

str_count(shortened_sentences, regex("the\\b", ignore_case = TRUE))
##  [1] 2 2 1 0 0 1 2 1 0 0

If you want to locate the match in the string, use str_locate(). This returns a matrix, which is basically a vector with two dimensions.

str_locate(shortened_sentences, regex("The\\b", ignore_case = TRUE))
##       start end
##  [1,]     1   3
##  [2,]     6   8
##  [3,]    19  21
##  [4,]    NA  NA
##  [5,]    NA  NA
##  [6,]     1   3
##  [7,]     1   3
##  [8,]     1   3
##  [9,]    NA  NA
## [10,]    NA  NA

Moreover, this is a good example of the greediness of plain stringr functions: only the first match is located. Hence, it is advisable to use str_locate_all(), which returns a list with one matrix for each element of the original vector:

str_locate_all(shortened_sentences, regex("The\\b", ignore_case = TRUE))
## [[1]]
##      start end
## [1,]     1   3
## [2,]    25  27
## 
## [[2]]
##      start end
## [1,]     6   8
## [2,]    19  21
## 
## [[3]]
##      start end
## [1,]    19  21
## 
## [[4]]
##      start end
## 
## [[5]]
##      start end
## 
## [[6]]
##      start end
## [1,]     1   3
## 
## [[7]]
##      start end
## [1,]     1   3
## [2,]    27  29
## 
## [[8]]
##      start end
## [1,]     1   3
## 
## [[9]]
##      start end
## 
## [[10]]
##      start end

2.1.3.2 Mutating strings

Mutating strings usually implies the replacement of certain elements (e.g., words) with other elements (or removing them, which is basically a special case of replacing them). In stringr this is performed using str_replace(string, pattern, replacement) and str_replace_all(string, pattern, replacement).

If I wanted, for instance, to replace the first occurrence of “m” with “meters,” I would go about it the following way:

str_replace(heights, "m", "meters")
## [1] "1meters30cm" "2meters01cm" "3meters10cm"

Note that str_replace_all() would have led to the following outcome:

str_replace_all(heights, "m", "meters")
## [1] "1meters30cmeters" "2meters01cmeters" "3meters10cmeters"

However, I also want to replace the “cm” with “centimeters.” Hence, I can harness another feature of str_replace_all(): it accepts a named vector of the form c(pattern = replacement):

str_replace_all(heights, c("m" = "meters", "cm" = "centimeters"))
## [1] "1meters30centimeterseters" "2meters01centimeterseters"
## [3] "3meters10centimeterseters"

What becomes obvious is that a “simple” regex containing just literal characters more often than not does not suffice. It will be your task to fix this. And while you are at it, you can also address the meter/meters problem – a “1” needs “meter” instead of “meters.” Another feature is that the replacements are performed in order; you can harness this for solving the problem.

Solution:

str_replace_all(heights, c("(?<=[2-9]{1})m" = "meters", "(?<=[0-9]{2})m" = "meters", "(?<=1)m" = "meter", "(?<=01)cm$" = "centimeter", "cm$" = "centimeters"))
## [1] "1meter30centimeters"  "2meters01centimeter"  "3meters10centimeters"

2.1.3.3 Extracting text

str_extract(_all)() can be used to extract matching strings. In the mtcars data set, the first word of the row name describes the car brand. Here, I harness another regex feature: \\w, which stands for any word character. Its counterpart is \\W for any non-word character.

mtcars %>% 
  rownames_to_column(var = "car_model") %>% 
  transmute(manufacturer = str_extract(car_model, "^\\w+\\b"))
##    manufacturer
## 1         Mazda
## 2         Mazda
## 3        Datsun
## 4        Hornet
## 5        Hornet
## 6       Valiant
## 7        Duster
## 8          Merc
## 9          Merc
## 10         Merc
## 11         Merc
## 12         Merc
## 13         Merc
## 14         Merc
## 15     Cadillac
## 16      Lincoln
## 17     Chrysler
## 18         Fiat
## 19        Honda
## 20       Toyota
## 21       Toyota
## 22        Dodge
## 23          AMC
## 24       Camaro
## 25      Pontiac
## 26         Fiat
## 27      Porsche
## 28        Lotus
## 29         Ford
## 30      Ferrari
## 31     Maserati
## 32        Volvo

2.1.3.4 Split vectors

Another use case here would have been to split it into two columns: manufacturer and model. One approach would be to use str_split(). This function splits the string at every occurrence of the predefined pattern. In this example, I use a word boundary as the pattern:

manufacturer_model <- rownames(mtcars)
str_split(manufacturer_model, "\\b") %>% 
  head()
## [[1]]
## [1] ""      "Mazda" " "     "RX4"   ""     
## 
## [[2]]
## [1] ""      "Mazda" " "     "RX4"   " "     "Wag"   ""     
## 
## [[3]]
## [1] ""       "Datsun" " "      "710"    ""      
## 
## [[4]]
## [1] ""       "Hornet" " "      "4"      " "      "Drive"  ""      
## 
## [[5]]
## [1] ""           "Hornet"     " "          "Sportabout" ""          
## 
## [[6]]
## [1] ""        "Valiant" ""

This outputs a list containing the individual words and separators, which does not make much sense in this case. Here, however, the structure of the string is always roughly the same: “[manufacturer][ ][model description].” Moreover, the manufacturer is only one word. Hence, the task can be accomplished by splitting the string after the first word, which should indicate the manufacturer. This is what str_split_fixed() does – “fixed” means that the number of splits is predefined. It returns a matrix that can easily become a tibble.

str_split_fixed(manufacturer_model, "(?<=\\w)\\b", n = 2) %>% 
  as_tibble() %>% 
  rename(manufacturer = V1,
         model = V2) %>% 
  mutate(model = str_squish(model))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## # A tibble: 32 × 2
##    manufacturer model       
##    <chr>        <chr>       
##  1 Mazda        "RX4"       
##  2 Mazda        "RX4 Wag"   
##  3 Datsun       "710"       
##  4 Hornet       "4 Drive"   
##  5 Hornet       "Sportabout"
##  6 Valiant      ""          
##  7 Duster       "360"       
##  8 Merc         "240D"      
##  9 Merc         "230"       
## 10 Merc         "280"       
## # … with 22 more rows

2.2 Web scraping

As it will help you to identify the data you want to extract from a web page, I will first provide you with a brief introduction to HTML. Then, I will show you how to scrape comparably simple web pages. Before you try to extract content from a web page, though, you will have to ensure that doing so is actually allowed. Most web sites have a robots.txt file which tells scrapers (or robots, such as Google’s crawler) what they are allowed to do and what not. You will find the document by typing [URL]/robots.txt into your browser’s address field.

The two directives that matter most for you are User-agent, which states whom the rules apply to, and Disallow/Allow, which state which paths may (not) be visited. A Crawl-delay: 20 entry tells you to wait 20 seconds between requests. Some sites allow scraping of almost everything; others disallow it altogether (Disallow: / for all user agents).
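An illustrative (made-up) robots.txt excerpt could look like this:

User-agent: *        # the following rules apply to all scrapers
Disallow: /admin/    # do not visit anything under /admin/
Allow: /             # everything else may be visited
Crawl-delay: 20      # wait 20 seconds between requests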

2.2.1 HTML 101

Web content is usually written in HTML (Hypertext Markup Language). An HTML document is composed of elements that determine how its content appears.

Figure: the tree-like structure of an HTML document

The way elements look is defined by so-called tags.

The opening tag is the name of the element (p in this case) in angle brackets; the closing tag is the same with a forward slash before the name. p stands for a paragraph element and would basically look like this:

<p>My cat is very grumpy</p>

The <p> tag makes sure that the text stands by itself and that a line break is included thereafter. Hence,

<p>My cat is very grumpy</p>. And so is my dog.

would be rendered like this:

My cat is very grumpy

. And so is my dog.

There exist many types of tags indicating different kinds of elements (about 100). Every page must be wrapped in an <html> element with two children, <head> and <body>. The former contains the page title and some meta data, the latter the contents you actually see in your browser. So-called block tags, e.g., <h1> (heading 1), <p> (paragraph), or <ol> (ordered list), structure the page. Inline tags (<b> – bold, <a> – link) format text inside block tags.
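Schematically, a minimal page might look like this (a made-up sketch):

<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1>A heading</h1>
    <p>Some text with a <b>bold</b> part and a <a href="https://example.com">link</a>.</p>
  </body>
</html>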

You can nest elements, e.g., if you want to make certain words bold, you can wrap them in <b>:

<p>My cat is <b>very</b> grumpy</p>

Then, the <b> element is considered the child of the <p> element.

Elements can also bear attributes:

<p class="editor-note">My cat is very grumpy</p>

Those attributes will not appear in the rendered content. Moreover, they are super-handy for us as scrapers. Here, class is the attribute name and "editor-note" the value. Another important attribute is id. Combined with CSS, they control the appearance of the element on the actual page. A class can be used by multiple HTML elements whereas an id is unique.

2.2.2 Selecting relevant content

To scrape a web page, the first step is to simply read it in. rvest then stores it in the XML format – just another format for storing information. For this, we use rvest’s read_html() function. Here, for instance, I download the Wikipedia page listing the current United States senators.

library(tidyverse)
library(rvest)
page <- read_html("https://en.wikipedia.org/wiki/List_of_current_United_States_senators")

To demonstrate the usage of CSS selectors, I create my own, basic web page using the rvest function minimal_html():

library(rvest)
basic_html <- minimal_html('
  <html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1 id="first">A heading</h1>
    <p class="paragraph">Some text &amp; <b>some bold text.</b></p>
    <a> Some more <i> italicized text which is not in a paragraph. </i> </a>
    <a class="paragraph">even more text &amp; <i>some italicized text.</i></p>
    <a id="link" href="www.nyt.com"> The New York Times </a>
  </body>
')

basic_html
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <h1 id="first">A heading</h1>\n    <p class="paragraph">Some  ...

CSS is the abbreviation for cascading style sheets and used to define the visual styling of HTML documents. CSS selectors are used to map elements in the HTML code to the relevant styles in the CSS. Hence, they define patterns that allow us to easily select certain elements on the page. CSS selectors can be used in conjunction with the rvest function html_elements() which takes as arguments the read-in page and a CSS selector. Alternatively you can also provide an XPath which is usually a bit more complicated and will not be covered in this tutorial.

  • p selects all <p> elements.
basic_html %>% html_elements(css = "p")
## {xml_nodeset (1)}
## [1] <p class="paragraph">Some text &amp; <b>some bold text.</b></p>
  • .title selects all elements that are of class “title”
basic_html %>% html_elements(css = ".title")
## {xml_nodeset (0)}

There are no elements of class “title.” But some of class “paragraph.”

basic_html %>% html_elements(css = ".paragraph")
## {xml_nodeset (2)}
## [1] <p class="paragraph">Some text &amp; <b>some bold text.</b></p>
## [2] <a class="paragraph">even more text &amp; <i>some italicized text.</i>\n  ...
  • p.paragraph analogously takes every <p> element which is of class “paragraph”; here, I use a.paragraph to select the <a> element of that class.
basic_html %>% html_elements(css = "a.paragraph")
## {xml_nodeset (1)}
## [1] <a class="paragraph">even more text &amp; <i>some italicized text.</i>\n  ...
  • #link scrapes elements that are of id “link”
basic_html %>% html_elements(css = "#link")
## {xml_nodeset (1)}
## [1] <a id="link" href="www.nyt.com"> The New York Times </a>

You can also connect children with their parents by using combinators. For instance, to extract the italicized text from “a.paragraph,” I can use the descendant combinator (a space): “a.paragraph i.”

basic_html %>% html_elements(css = "a.paragraph i")
## {xml_nodeset (1)}
## [1] <i>some italicized text.</i>

You can also look at the children by using html_children():

basic_html %>% html_elements(css = "a.paragraph") %>% html_children()
## {xml_nodeset (1)}
## [1] <i>some italicized text.</i>

Unfortunately, web pages in the wild are usually not as easily readable as the small example I came up with. Hence, I would recommend that you use the SelectorGadget – just drag it into your bookmarks list.

2.2.3 Scraping HTML pages with rvest

So far, I have shown you how HTML is written and how to select elements. However, what we want to achieve is extracting the data in a proper format and storing it in some sort of tibble. Therefore, we need functions that allow us to actually grab the data.

The following overview taken from the web scraping cheatsheet shows you the basic “flow” of scraping web pages plus the corresponding functions. In this tutorial, I will limit myself to rvest functions. This will enable you to scrape many web pages but not all of them. Some require more advanced packages such as RSelenium or httr.

In the first part, I will introduce you to scraping singular pages and extracting their contents. rvest also allows for proper sessions where you navigate on the web pages and fill out forms. This is to be introduced in the second part.

2.2.3.1 html_text() and html_text2()

Extracting text from HTML is easy. You use html_text() or html_text2(). The former is faster but gives you messier results; the latter returns the text as it would be rendered in a web browser.

The following example is taken from the documentation:

# To understand the difference between html_text() and html_text2()
# take the following html:

html <- minimal_html(
  "<p>This is a paragraph.
    This another sentence.<br>This should start on a new line"
)
# html_text() returns the raw underlying text, which includes white space
# that would be ignored by a browser, and ignores the <br>
html %>% html_element("p") %>% html_text() %>% writeLines()
## This is a paragraph.
##     This another sentence.This should start on a new line
# html_text2() simulates what a browser would display. Non-significant
# white space is collapsed, and <br> is turned into a line break
html %>% html_element("p") %>% html_text2() %>% writeLines()
## This is a paragraph. This another sentence.
## This should start on a new line

A “real example” would then look like this:

us_senators <- read_html("https://en.wikipedia.org/wiki/List_of_current_United_States_senators")
text <- us_senators %>%
  html_element(css = "p:nth-child(6)") %>% 
  html_text2()

2.2.3.2 Extracting attributes

You can also extract attributes such as links using html_attrs(). An example would be to extract the headlines and their corresponding links from r-bloggers.com.

rbloggers <- read_html("https://www.r-bloggers.com")

A quick check with the SelectorGadget told me that the element I am looking for is of class “.loop-title” and that its child is an “a” element – an anchor tag containing the link. With html_attrs() I can extract the attributes. This gives me a list of named vectors containing the names of the attributes and their values:

r_blogger_postings <- rbloggers %>% html_elements(css = ".loop-title a")

r_blogger_postings %>% html_attrs() 
## [[1]]
##                                                                     href 
## "https://www.r-bloggers.com/2022/01/introducing-scale-model-in-greybox/" 
##                                                                      rel 
##                                                               "bookmark" 
## 
## [[2]]
##                                                                                                                       href 
## "https://www.r-bloggers.com/2022/01/plotting-bee-colony-observations-and-distributions-using-ggbeeswarm-and-geomtextpath/" 
##                                                                                                                        rel 
##                                                                                                                 "bookmark" 
## 
## [[3]]
##                                                                                 href 
## "https://www.r-bloggers.com/2022/01/non-linear-model-of-serial-dilutions-with-stan/" 
##                                                                                  rel 
##                                                                           "bookmark" 
## 
## [[4]]
##                                                               href 
## "https://www.r-bloggers.com/2022/01/predicting-future-recessions/" 
##                                                                rel 
##                                                         "bookmark" 
## 
## [[5]]
##                                                                                          href 
## "https://www.r-bloggers.com/2022/01/detecting-multicollinearity-its-not-that-easy-sometimes/" 
##                                                                                           rel 
##                                                                                    "bookmark" 
## 
## [[6]]
##                                                                                 href 
## "https://www.r-bloggers.com/2022/01/using-the-local-dialect-to-teach-r-programming/" 
##                                                                                  rel 
##                                                                           "bookmark" 
## 
## [[7]]
##                                                              href 
## "https://www.r-bloggers.com/2022/01/emayili-message-templates-2/" 
##                                                               rel 
##                                                        "bookmark" 
## 
## [[8]]
##                                                                    href 
## "https://www.r-bloggers.com/2022/01/ropensci-news-digest-january-2022/" 
##                                                                     rel 
##                                                              "bookmark" 
## 
## [[9]]
##                                                                                   href 
## "https://www.r-bloggers.com/2022/01/reduce-dependency-hell-from-testthat-to-tinytest/" 
##                                                                                    rel 
##                                                                             "bookmark" 
## 
## [[10]]
##                                                                   href 
## "https://www.r-bloggers.com/2022/01/emayili-sending-email-from-shiny/" 
##                                                                    rel 
##                                                             "bookmark" 
## 
## [[11]]
##                                                             href 
## "https://www.r-bloggers.com/2022/01/the-basics-of-r-in-spanish/" 
##                                                              rel 
##                                                       "bookmark" 
## 
## [[12]]
##                                                                                     href 
## "https://www.r-bloggers.com/2022/01/the-robustness-of-food-webs-to-species-extinctions/" 
##                                                                                      rel 
##                                                                               "bookmark" 
## 
## [[13]]
##                                                            href 
## "https://www.r-bloggers.com/2022/01/funny-3d-voronoi-diagrams/" 
##                                                             rel 
##                                                      "bookmark" 
## 
## [[14]]
##                                                                                            href 
## "https://www.r-bloggers.com/2022/01/announcing-the-appsilon-shiny-conference-27-29-april-2022/" 
##                                                                                             rel 
##                                                                                      "bookmark" 
## 
## [[15]]
##                                                                                    href 
## "https://www.r-bloggers.com/2022/01/one-of-the-first-steps-to-become-a-data-scientist/" 
##                                                                                     rel 
##                                                                              "bookmark" 
## 
## [[16]]
##                                                      href 
## "https://www.r-bloggers.com/2022/01/playing-wordle-in-r/" 
##                                                       rel 
##                                                "bookmark" 
## 
## [[17]]
##                                                                           href 
## "https://www.r-bloggers.com/2022/01/building-r-4-2-for-windows-with-openblas/" 
##                                                                            rel 
##                                                                     "bookmark" 
## 
## [[18]]
##                                                                                        href 
## "https://www.r-bloggers.com/2022/01/identifying-r-functions-packages-used-in-github-repos/" 
##                                                                                         rel 
##                                                                                  "bookmark" 
## 
## [[19]]
##                                                                                                         href 
## "https://www.r-bloggers.com/2022/01/analysing-seed-germination-and-emergence-data-with-r-a-tutorial-part-6/" 
##                                                                                                          rel 
##                                                                                                   "bookmark" 
## 
## [[20]]
##                                                                  href 
## "https://www.r-bloggers.com/2022/01/understanding-the-native-r-pipe/" 
##                                                                   rel 
##                                                            "bookmark"

Links are stored in the attribute “href” – hyperlink reference. html_attr() allows me to extract an attribute’s value. Hence, building a tibble with the articles’ titles and their corresponding hyperlinks is straightforward now:

tibble(
  title = r_blogger_postings %>% html_text2(),
  link = r_blogger_postings %>% html_attr(name = "href")
) %>% 
  glimpse()
## Rows: 20
## Columns: 2
## $ title <chr> "Introducing scale model in greybox", "Plotting Bee Colony Obser…
## $ link  <chr> "https://www.r-bloggers.com/2022/01/introducing-scale-model-in-g…

Another approach for this would be using the polite package and its function html_attrs_dfr(), which binds the attributes together column-wise and the different elements row-wise.

library(polite)

rbloggers %>% 
  html_elements(css = ".loop-title a") %>% 
  html_attrs_dfr() %>% 
  select(title = 3, 
         link = 1) %>% 
  glimpse()
## Rows: 20
## Columns: 2
## $ title <chr> "Introducing scale model in greybox", "Plotting Bee Colony Obser…
## $ link  <chr> "https://www.r-bloggers.com/2022/01/introducing-scale-model-in-g…

2.2.3.3 Extracting tables

The general output format we strive for is a tibble. Oftentimes, data are already stored online in a table format, basically ready for us to analyze. In the next example, I want to get a table from the Wikipedia page on U.S. senators that I have used before. For this first, basic example, I do not use selectors to extract the right table. You can use rvest::html_table(); it will give you a list containing all tables on the particular page. We can inspect it using str(), which returns an overview of the list and the tibbles it contains.

tables <- us_senators %>% 
  html_table()

# str(tables)

Here, the table I want is the sixth one. We can grab it by either using double square brackets – [[6]] – or purrr’s pluck(6).

senators <- tables %>% 
  pluck(6)

glimpse(senators)
## Rows: 100
## Columns: 12
## $ State                        <chr> "Alabama", "Alabama", "Alaska", "Alaska",…
## $ Portrait                     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Senator                      <chr> "Richard Shelby", "Tommy Tuberville", "Li…
## $ Party                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Party                        <chr> "Republican[d]", "Republican", "Republica…
## $ Born                         <chr> "(1934-05-06) May 6, 1934 (age 87)", "(19…
## $ `Occupation(s)`              <chr> "Lawyer", "College football coachPartner,…
## $ `Previous electiveoffice(s)` <chr> "U.S. HouseAlabama Senate", "None", "Alas…
## $ Education                    <chr> "University of Alabama (BA, LLB)\nBirming…
## $ `Assumed office`             <chr> "January 3, 1987", "January 3, 2021", "De…
## $ `Term up`                    <int> 2022, 2026, 2022, 2026, 2024, 2022, 2022,…
## $ `Residence[2]`               <chr> "Tuscaloosa", "Auburn[3]", "Girdwood", "A…

## alternative approach using css
senators <- us_senators %>% 
  html_elements("#senators") %>% 
  html_table() %>% 
  pluck(1)

You can see that the tibble contains “dirty” names and that the party column appears twice – which will make it impossible to work with the tibble later on. Hence, I use clean_names() from the janitor package to fix that. Also, in the variable that matters most to me, party_2, there are some footnotes which will appear as, for instance, “[a].” Hence, I remove them using a regex and the stringr function str_remove().

library(janitor)
senators_clean_names <- senators %>% 
  clean_names() %>% 
  select(-party) %>% 
  mutate(party = party_2 %>% 
           str_remove("\\[.\\]") %>% 
           as_factor()) 

Now, we have the table in a nice tibble and can go on with whatever we want to do with it (e.g., exercise 2).

2.2.4 Automating scraping

Well, grabbing singular points of data from web sites is nice. However, if you want to do things such as collecting large amounts of data or multiple pages, you will not be able to do this without some automation.

The example page we scrape today is https://wg-gesucht.de. Looking at its robots.txt tells us that we are allowed to scrape most of its pages. Unfortunately, we cannot fill out the form on the first page, as it is written in JavaScript – which goes beyond rvest’s capabilities. However, we can still fill it out by hand and, thereafter, start scraping the search results for the last month. Overall, the process will look as follows:

  1. Determine search parameters manually, copy URL of results list
  2. Read in results list
  3. Get links and names of all listed apartments
  4. Go to next page of results

–> repeat steps 2–4 until satisfied (here: listings are ordered chronologically, so stop once they were posted more than one month ago)

  5. Read the pages containing the individual apartments
  6. Optional: scrape the listings (take-home exercise)

It is probably easiest to perform those things in a while loop; hence, here is a quick revision:
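A minimal, self-contained reminder of how a while loop works (nothing to do with scraping yet):

i <- 1
while (i <= 3) {
  print(str_c("iteration ", i))
  i <- i + 1
}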

For the loop to run efficiently, space for every object should be pre-allocated (i.e., you create a list beforehand; its length can be determined by an educated guess). Hence, our while loop in pseudo-code will look like this:

output_list <- vector(mode = "list", length = 10000L)

while (date > today-30days) {
  read in wg-gesucht results list
  get links and store them in list
  get names and store them in list
  get date and store them in list
  go to next page
}

For moving to the next page, there are basically three approaches. The most basic one is pre-determining how the pages are numbered in the URL (e.g., webpage.com/page_1, webpage.com/page_2, webpage.com/page_3, etc.). Fancier is finding the link to the next page on the web page itself through html_elements(). The fanciest approach – and the one compatible with most web pages – is scraping within a session (session()). Then you can make R hit the next page using session_follow_link(). I will show you how you can scrape all “1-Zimmer-Wohnungen-in-Leipzig” using the aforementioned approaches.

2.2.4.1 Looping over URLs

First, we need to determine how the URLs for pages 1 and 2 differ. Usually, there is a number in the URL that changes from page to page and that we can manipulate to navigate through the pages. I just went to the web page and copied the URLs of the first and the second page.

url_1 <- "https://www.wg-gesucht.de/1-zimmer-wohnungen-in-Leipzig.77.1.1.0.html?category=1&city_id=77&rent_type=0&img=1&rent_types%5B0%5D=0"
url_2 <- "https://www.wg-gesucht.de/1-zimmer-wohnungen-in-Leipzig.77.1.1.1.html?category=1&city_id=77&rent_type=0&img=1&rent_types%5B0%5D=0"

Well, that’s a mess. Let’s find the difference in an R way (code stolen and adapted from StackOverflow).

initial_dist <- adist(url_1, url_2, counts = TRUE) %>% 
  attr("trafos") %>% 
  diag() %>% 
  str_locate_all("[^M]")

  
str_sub(url_1, start = initial_dist[[1]][1]-5, end = initial_dist[[1]][1]+5)
## [1] ".1.1.0.html"
str_sub(url_2, start = initial_dist[[1]][1]-5, end = initial_dist[[1]][1]+5)
## [1] ".1.1.1.html"

Now we can build our list of links to loop over:

links <- str_c(
  "https://www.wg-gesucht.de/1-zimmer-wohnungen-in-Leipzig.77.1.1.", 
  0:5, 
  ".html?category=1&city_id=77&rent_type=0&img=1&rent_types%5B0%5D=0"
  )

Looping over the pages and extracting the relevant content is then straightforward. This method also has the advantage that we know the maximum length of the output list upfront:

library(lubridate)

fix_date <- function(date_vec){
  # entries that contain a full date (dd.mm.yyyy) are parsed directly
  proper_dates <- str_extract(date_vec, "[0-9]{2}.[0-1][0-9].[2][0][0-3][0-9]") %>% 
    parse_date(format = "%d.%m.%Y") %>% 
    .[!is.na(.)]
  # entries posted minutes or hours ago ("Minuten"/"Stunden") get today's date
  today <- date_vec[str_detect(date_vec, "Minuten|Stunde")] %>% 
    str_replace(".+", today() %>% as.character()) %>% 
    ymd()
  # entries posted "x Tage" ago get today's date minus x days
  days_ago <- date_vec[str_detect(date_vec, "Tag")] %>% 
    str_replace(., 
                ".+", 
                (today()-(days(str_extract(., "[1-4](?= Tag)") %>% 
                                 as.numeric()))) %>% 
                as.character()) %>% 
    ymd()
  c(today, days_ago, proper_dates)
}

output_list <- vector(mode = "list", length = length(links))

i <- 0
date <- today()
end_date <- today() - months(1)

while (date >= end_date) {
  i <- i + 1
  page <- read_html(links[[i]])

  output_list[[i]] <- page %>% 
    html_elements(".truncate_title a") %>% 
    html_attrs_dfr() %>% 
    filter(class == "detailansicht") %>% 
    select(link = href, title = .text) %>% 
    mutate(title = title %>% str_squish(),
           link = url_absolute(link, base = "https://www.wg-gesucht.de/"))

  output_list[[i]]$date <- page %>% 
    html_elements("span:nth-child(2)") %>% 
    html_text2() %>% 
    .[str_detect(., "^Online")] %>% 
    fix_date()
  
  date <- output_list[[i]]$date %>% tail(1)
  
  Sys.sleep(2)
}

output_list %>% bind_rows()

Extracting the next-page link on the fly is basically the same thing, but within the loop you read the page from the link you extracted at the end of the previous iteration:

output_list <- vector(mode = "list", length = length(links))

i <- 0
date <- today()
end_date <- today() - months(1)

link <- links[[1]]

while (date >= end_date) {
  i <- i + 1
  page <- read_html(link)

  output_list[[i]] <- page %>% 
    html_elements(".truncate_title a") %>% 
    html_attrs_dfr() %>% 
    filter(class == "detailansicht") %>% 
    select(link = href, title = .text) %>% 
    mutate(title = title %>% str_squish())

  output_list[[i]]$date <- page %>% 
    html_elements("span:nth-child(2)") %>% 
    html_text2() %>% 
    .[str_detect(., "^Online")] %>% 
    fix_date()
  
  date <- output_list[[i]]$date %>% tail(1)
  
  link <- page %>% 
    html_elements("#main_column li:nth-child(15) a") %>% 
    html_attr("href") %>% 
    url_absolute(base = "https://www.wg-gesucht.de/")
}

output_list %>% bind_rows()

However, the slickest way to do this is by using a session. In a session, R behaves like a normal browser: it stores cookies and allows you to navigate to pages by going session_forward() or session_back(), to session_follow_link()s on the page itself or session_jump_to() a different URL, or to submit forms with session_submit().

First, you start the session by simply calling session().

wg_session <- session("https://www.wg-gesucht.de/1-zimmer-wohnungen-in-Leipzig.77.1.1.0.html") 

When you want to save a page from the session, do so using read_html().

page <- read_html(wg_session)

If you want to follow a link, use session_follow_link():

library(magrittr)
wg_session %<>% session_follow_link(css = "#main_column li:nth-child(15) a")

Wanna go back? session_back(); thereafter, you can go session_forward(), too.

wg_session %<>% 
  session_back()

You can look at what your scraper has done with session_history().

wg_session %>% session_history()

Feel free to create a while loop with a session as an exercise (Exercise #3).

2.2.5 Forms

Sometimes we also want to provide certain input, e.g., login credentials, or we want to scrape a web site in a more systematic manner. This information is usually provided using so-called forms. A <form> element can contain different other elements such as text fields or check boxes. Basically, we use html_form() to extract the form, html_form_set() to define what we want to submit, and html_form_submit() to finally submit it. For a basic example, I search for something on Google.

google <- read_html("http://www.google.com")
search <- html_form(google)[[1]]

search_something <- search %>% html_form_set(q = "something")

vals <- list(q = "web scraping", hl = "en")

search <- search %>% html_form_set(!!!vals)

resp <- html_form_submit(search)
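If the request goes through, the response can be parsed like any other page – a sketch (Google may block or redirect automated requests, and the "h3" selector for result titles is only an assumption):

read_html(resp) %>% 
  html_elements("h3") %>% 
  html_text2()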

2.2.6 Scraping hacks

Some web pages are a bit fancier than the ones we have looked at so far (i.e., they use JavaScript). rvest works nicely for static web pages, but for more advanced ones you need different tools such as RSelenium. This, however, goes beyond the scope of this tutorial.

Some web pages might block you right away, as they can tell from the user agent that you are not a “real” human being:

library(rvest)
my_session <- session("https://scrapethissite.com/")

my_session$response$request$options$useragent

Not very human. We can set it to a common one using the httr package (which actually powers rvest).

user_a <- httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")
session_with_ua <- session("https://scrapethissite.com/", user_a)
session_with_ua$response$request$options$useragent

A web page may sometimes give you time-outs (i.e., it does not respond within a given time). This can break your loop. Wrapping your code in safely() or insistently() from the purrr package might help. The former moves on and notes down what has gone wrong, the latter keeps sending requests until it has been successful. Both work easiest if you put your scraping code into functions and wrap those with either insistently() or safely().
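A minimal sketch of the safely() pattern (the wrapper name is made up):

read_html_safely <- safely(read_html)

result <- read_html_safely("https://www.wg-gesucht.de/")
result$result # the parsed page, or NULL if the request failed
result$error  # NULL, or the error that occurred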

Sometimes a web page keeps blocking you. Consider using a proxy server.

my_proxy <- httr::use_proxy(url = "http://example.com",
                            user_name = "myusername",
                            password = "mypassword",
                            auth = "one of basic, digest, digest_ie, gssnegotiate, ntlm, any")

my_session <- session("https://scrapethissite.com/", my_proxy)

Find more useful information – including the stuff I just described – and links on this GitHub page.

2.3 Application Programming Interfaces (APIs)

While web scraping (or screen scraping, as you extract the stuff that appears on your screen) is certainly fun, it should be seen as a last resort. More and more web platforms provide so-called Application Programming Interfaces (APIs).

“An application programming interface (API) is a connection between computers or between computer programs.” (Wikipedia)

There are a bunch of different sorts of APIs, but the most common one is the REST API. REST stands for “Representational State Transfer” and describes a set of rules API designers are supposed to obey when developing their particular interface. You can make different kinds of requests, such as GET to retrieve content, POST to send a file to a server (PUT is similar), or DELETE to remove a file. We will only focus on the GET part.
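To give a first impression of the two verbs we will actually encounter, here is a minimal sketch using the httr package (introduced properly below) against the httpbin.org testing service – an assumption for illustration only, not part of our later examples:

library(httr)

GET("https://httpbin.org/get")                                # retrieve content from the server
POST("https://httpbin.org/post", body = list(note = "hello")) # send content to the server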

APIs basically offer you a structured way to communicate with the platform via your machine. In our use case, this means that you can get the data you want in a usually well-structured format and without all the “dirt” that you need to scrape off tediously (enough web scraping metaphors for today). With APIs you can generally quite clearly define what you want and how you want it. In R, we achieve this by using the httr (Wickham 2020) package. Moreover, using APIs does not bear the risk of acquiring information you are not supposed to access, and you also do not need to worry about the server not being able to handle the load of your requests (usually, there are rate limits in place to address this particular issue). However, it’s not all fun and games with APIs: they might give you their data in a special format; both XML and JSON are common. The former is the one rvest uses as well; the latter can be tamed using jsonlite (Ooms, Temple Lang, and Hilaiel 2020), which will be introduced as well. Moreover, you usually have to ask the platform for permission and perhaps pay to get access. Once you have received the keys you need, you can tell R to fill them in automatically, similar to how your browser remembers your Amazon password etc.; usethis (Wickham et al. 2021) can help you with such tasks. An overview of currently existing APIs can be found on The Programmable Web.

The best thing that can happen with APIs: some of them are so popular that people have already written specific R packages for working with them – an overview can be found on the rOpenSci web site. One example of this is Twitter and the rtweet package (Kearney 2019), which will be introduced at the end. Less work for us, great.

2.3.1 Obtaining their data

API requests are performed using URLs. Those start with the basic address of the API (e.g., https://api.nytimes.com), followed by the endpoint that you want to use (e.g., /lists). They also contain parameters which are provided as key-value pairs (in the example below, appended to the URL as a query string). Those parameters can contain, for instance, authentication tokens or different search parameters. A request to the New York Times API to obtain articles for January 2019 would then look like this: https://api.nytimes.com/svc/archive/v1/2019/1.json?api-key=yourkey.
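To make the anatomy of such a request explicit, we can take the example URL apart with httr::parse_url() – just a small illustrative sketch:

library(httr)

parts <- parse_url("https://api.nytimes.com/svc/archive/v1/2019/1.json?api-key=yourkey")
parts$hostname  # the basic address of the API
parts$path      # the endpoint
parts$query     # the key-value pairs, here list(`api-key` = "yourkey")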

For most APIs, you will have to register first. As we will play with the New York Times API, do this here.

2.3.1.1 Making queries

A basic query is performed using the GET() function. However, first you need to define the call you want to make. The different keys and values they can take can be found in the API documentation. Of course, there is also a neater way to deal with the key problem. I will show it later.

library(httr)
key <- "qekEhoGTXqjsZnXpqHns0Vfa2U6T7ABf"
nyt_headlines <- modify_url(
  url = "https://api.nytimes.com/",
  path = "svc/news/v3/content/nyt/business.json",
  query = list(`api-key` = key))

response <- GET(nyt_headlines)

When it comes to the NYT news API, there is the problem that the section is specified not in the query but in the endpoint path itself. Hence, if we wanted to query different sections, we would have to change the path itself, e.g., through str_c().
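A minimal sketch of how that could look, reusing the key object from above – the section names are mere assumptions for illustration:

library(httr)
library(stringr)
library(purrr)

sections <- c("business", "science", "technology")

section_urls <- map_chr(sections, ~ modify_url(
  url = "https://api.nytimes.com/",
  path = str_c("svc/news/v3/content/nyt/", .x, ".json"),
  query = list(`api-key` = key)
))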

The status code you want to see here is 200, which stands for success. If you want to put the call inside a function, you might want to break the function once you get a non-successful query. http_error() or http_status() are your friends here.

response %>% http_error()
response %>% http_status()
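For instance, a small helper along these lines – just a sketch with httr already loaded – would break as soon as a request fails:

get_nyt <- function(url) {
  response <- GET(url)
  if (http_error(response)) {
    stop(http_status(response)$message)  # abort with the status message, e.g., on a 401 or 429
  }
  response
}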

content() will give you the content of the request.

response %>% content()

What you see is the content of the call – which is what we want. It is in a format we cannot work with right away, though: JSON.

2.3.1.2 JSON

The following unordered list is stolen from this blog entry:

  • The data are in name/value pairs
  • Data objects are separated by commas
  • Curly braces {} hold objects
  • Square brackets [] hold arrays
  • Each data element is enclosed with quotes "" if it is a character, or without quotes if it is a numeric value

To look at the raw JSON the API returned:

writeLines(rawToChar(response$content))

jsonlite helps us to bring this output into a data frame.

library(jsonlite)
response %>% 
  content(as = "text") %>%
  fromJSON()
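To see how the building blocks listed above map onto R objects, here is a tiny, made-up JSON string parsed with fromJSON():

json_string <- '[
  {"name": "Alice", "age": 30},
  {"name": "Bob",   "age": 25}
]'

fromJSON(json_string)  # an array of objects becomes a data frame with columns name and age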

2.3.1.3 Dealing with authentication

Well, as we saw before, I basically made my official NYT API key publicly visible in this script. This is bad practice and should be avoided, especially if you work on a joint project (where everybody uses their own key) or if you put your scripts in public places (such as GitHub). The usethis package can help you here.

#usethis::edit_r_environ() # opens your .Renviron file; add a line such as nyt_api_key=yourkey, save it, and restart R
Sys.getenv("nyt_api_key")

Hence, if we now search for articles – you can find the proper parameters here – we provide the key by using the Sys.getenv() function.

modify_url(
  url = "http://api.nytimes.com/svc/search/v2/articlesearch.json",
  query = list(q = "Trump",
               begin_date = "20161101",
               end_date = "20161110",
               `api-key` = Sys.getenv("nyt_api_key"))
) %>% 
  GET()

2.3.2 rtweet

Twitter is quite popular among social scientists. The main reason for this is arguably its data accessibility. For R users, the package you want to use is rtweet by Michael Kearney (find an overview of the different packages and their capabilities here). There is a great and quite complete presentation demonstrating its capabilities, although it is a bit outdated by now: the main difference is that you no longer need to register an app upfront. All you need is a Twitter account. When you make your first request, a browser window will open where you log on to Twitter, authorize the app, and then you can just go for it. There are certain rate limits, too, which you will need to be aware of when you try to acquire data. Rate limits and the parameters you need to specify can be found in the extensive documentation.
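As a minimal sketch to get you started (assuming you have a Twitter account; the first call should open a browser window for authorization):

library(rtweet)

# search recent tweets containing the #rstats hashtag, excluding retweets
rstats_tweets <- search_tweets("#rstats", n = 100, include_rts = FALSE)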

In the following, I will just link to the respective vignettes. Please feel free to play around with the functions yourself. As a starting point, I provide you with a list of all German politicians and some tasks in exercise 3. This vignette gives a first introduction.

library(rtweet)
lists_members(
  list_id = "1050737868606455810"
)

Conclusion

To sum it up: when you have a good research idea that relies on digital trace data that you need to collect, ask yourself the following questions:

  1. Is there an R package for it?
  2. If 1. == FALSE: Is there an API where I can get the data? (If yes, use it.)
  3. If 1. == FALSE & 2. == FALSE: Is screen scraping an option?

References

Kearney, Michael. 2019. “Rtweet: Collecting and Analyzing Twitter Data.” Journal of Open Source Software 4 (42): 1829. https://doi.org/10.21105/joss.01829.
Munzert, Simon, Christian Rubba, Peter Meißner, and Dominik Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Chichester, West Sussex, United Kingdom: Wiley.
Ooms, Jeroen, Duncan Temple Lang, and Lloyd Hilaiel. 2020. “Jsonlite: A Simple and Robust JSON Parser and Generator for R.”
Perepolkin, Dmytro. 2019. “Polite: Be Nice on the Web.”
Wickham, Hadley. 2019a. “Rvest: Easily Harvest (Scrape) Web Pages.”
———. 2019b. “Stringr: Simple, Consistent Wrappers for Common String Operations.”
———. 2020. “Httr: Tools for Working with URLs and HTTP.”
Wickham, Hadley, Jennifer Bryan, Malcolm Barrett, and RStudio. 2021. “Usethis: Automate Package and Project Setup.”