Chapter 3 Advanced rvest
3.1 session()s
However, the slickest way to do this is by using a session(). In a session, R behaves like a normal browser: it stores cookies and lets you navigate between pages with session_forward() and session_back(), follow links on the page itself with session_follow_link(), jump to a different URL with session_jump_to(), or submit forms with session_submit().
First, you start the session by simply calling session().
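For example (the object name my_session is arbitrary):
library(rvest)
my_session <- session("https://scrapethissite.com/")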
Some servers may not want robots to make requests and block you for this reason. To circumvent this, we can set a “user agent” in a session. The user agent contains data that the server receives from us when we make the request. Hence, by adapting it we can trick the server into thinking that we are a human browsing instead of a robot. Let’s check the current user agent first:
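One way to peek at httr’s default user agent is its internal helper (not exported, hence the :::):
httr:::default_ua()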
## [1] "libcurl/7.88.1 r-curl/5.0.0 httr/1.4.5"
Not very human. We can set it to a common one using the httr package (which powers rvest).
user_a <- user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")
session_with_ua <- session("https://scrapethissite.com/", user_a)
session_with_ua$response$request$options$useragent
## [1] "Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
You can check the response using session_with_ua$response$status_code – 200 means the request was successful.
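session_with_ua$response$status_code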
## [1] 200
When you want to extract the current page from the session, do so using read_html().
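For instance, to parse the session’s current page and pull out its title:
page <- read_html(session_with_ua) # parse the page the session is currently on
page |> html_element("title") |> html_text2()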
If you want to open a new URL, hit session_jump_to().
session_with_ua <- session_with_ua |>
  session_jump_to("https://www.scrapethissite.com/pages/")
session_with_ua
## <session> https://www.scrapethissite.com/pages/
## Status: 200
## Type: text/html; charset=utf-8
## Size: 10603
You can also “click” buttons on the page (i.e., follow links) by selecting them with CSS selectors:
session_with_ua <- session_with_ua |>
  session_jump_to("https://www.scrapethissite.com/") |>
  session_follow_link(css = ".btn-primary")
## Navigating to /lessons/
## <session> http://www.scrapethissite.com/lessons/sign-up/
## Status: 200
## Type: text/html; charset=utf-8
## Size: 24168
To go back, use session_back(); afterwards, session_forward() takes you forward again:
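session_with_ua <- session_with_ua |> session_back()
session_with_ua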
## <session> https://www.scrapethissite.com/
## Status: 200
## Type: text/html; charset=utf-8
## Size: 8117
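And forward again:
session_with_ua <- session_with_ua |> session_forward()
session_with_ua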
## <session> http://www.scrapethissite.com/lessons/sign-up/
## Status: 200
## Type: text/html; charset=utf-8
## Size: 24168
You can look at what your scraper has done with session_history().
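session_with_ua |> session_history()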
## https://www.scrapethissite.com/
## https://www.scrapethissite.com/pages/
## https://www.scrapethissite.com/
## - http://www.scrapethissite.com/lessons/sign-up/
3.2 Forms
Sometimes we also want to provide certain input, e.g., login credentials, or want to scrape a website more systematically. That information is usually provided using so-called forms. A <form> element can contain various other elements such as text fields and check boxes. Basically, we use html_form() to extract the form, html_form_set() to define what we want to submit, and html_form_submit() to finally submit it. For a basic example, we search for something on Google.
google <- read_html("http://www.google.com")
search <- html_form(google) |> pluck(1) # pluck() comes from purrr
search |> str()
## List of 5
## $ name : chr "f"
## $ method : chr "GET"
## $ action : chr "http://www.google.com/search"
## $ enctype: chr "form"
## $ fields :List of 10
## ..$ ie :List of 4
## .. ..$ type : chr "hidden"
## .. ..$ name : chr "ie"
## .. ..$ value: chr "ISO-8859-1"
## .. ..$ attr :List of 3
## .. .. ..$ name : chr "ie"
## .. .. ..$ value: chr "ISO-8859-1"
## .. .. ..$ type : chr "hidden"
## .. ..- attr(*, "class")= chr "rvest_field"
## ..$ hl :List of 4
## .. ..$ type : chr "hidden"
## .. ..$ name : chr "hl"
## .. ..$ value: chr "en"
## .. ..$ attr :List of 3
## .. .. ..$ value: chr "en"
## .. .. ..$ name : chr "hl"
## .. .. ..$ type : chr "hidden"
## .. ..- attr(*, "class")= chr "rvest_field"
## ..$ source:List of 4
## .. ..$ type : chr "hidden"
## .. ..$ name : chr "source"
## .. ..$ value: chr "hp"
## .. ..$ attr :List of 3
## .. .. ..$ name : chr "source"
## .. .. ..$ type : chr "hidden"
## .. .. ..$ value: chr "hp"
## .. ..- attr(*, "class")= chr "rvest_field"
## ..$ biw :List of 4
## .. ..$ type : chr "hidden"
## .. ..$ name : chr "biw"
## .. ..$ value: NULL
## .. ..$ attr :List of 2
## .. .. ..$ name: chr "biw"
## .. .. ..$ type: chr "hidden"
## .. ..- attr(*, "class")= chr "rvest_field"
## ..$ bih :List of 4
## .. ..$ type : chr "hidden"
## .. ..$ name : chr "bih"
## .. ..$ value: NULL
## .. ..$ attr :List of 2
## .. .. ..$ name: chr "bih"
## .. .. ..$ type: chr "hidden"
## .. ..- attr(*, "class")= chr "rvest_field"
## ..$ q :List of 4
## .. ..$ type : chr "text"
## .. ..$ name : chr "q"
## .. ..$ value: chr ""
## .. ..$ attr :List of 8
## .. .. ..$ class : chr "lst"
## .. .. ..$ style : chr "margin:0;padding:5px 8px 0 6px;vertical-align:top;color:#000"
## .. .. ..$ autocomplete: chr "off"
## .. .. ..$ value : chr ""
## .. .. ..$ title : chr "Google Search"
## .. .. ..$ maxlength : chr "2048"
## .. .. ..$ name : chr "q"
## .. .. ..$ size : chr "57"
## .. ..- attr(*, "class")= chr "rvest_field"
## ..$ btnG :List of 4
## .. ..$ type : chr "submit"
## .. ..$ name : chr "btnG"
## .. ..$ value: chr "Google Search"
## .. ..$ attr :List of 4
## .. .. ..$ class: chr "lsb"
## .. .. ..$ value: chr "Google Search"
## .. .. ..$ name : chr "btnG"
## .. .. ..$ type : chr "submit"
## .. ..- attr(*, "class")= chr "rvest_field"
## ..$ btnI :List of 4
## .. ..$ type : chr "submit"
## .. ..$ name : chr "btnI"
## .. ..$ value: chr "I'm Feeling Lucky"
## .. ..$ attr :List of 5
## .. .. ..$ class: chr "lsb"
## .. .. ..$ id : chr "tsuid_1"
## .. .. ..$ value: chr "I'm Feeling Lucky"
## .. .. ..$ name : chr "btnI"
## .. .. ..$ type : chr "submit"
## .. ..- attr(*, "class")= chr "rvest_field"
## ..$ iflsig:List of 4
## .. ..$ type : chr "hidden"
## .. ..$ name : chr "iflsig"
## .. ..$ value: chr "AOEireoAAAAAZIsp4Z0ygNGfSM_CyJ9e46YcKtV2SPIs"
## .. ..$ attr :List of 3
## .. .. ..$ value: chr "AOEireoAAAAAZIsp4Z0ygNGfSM_CyJ9e46YcKtV2SPIs"
## .. .. ..$ name : chr "iflsig"
## .. .. ..$ type : chr "hidden"
## .. ..- attr(*, "class")= chr "rvest_field"
## ..$ gbv :List of 4
## .. ..$ type : chr "hidden"
## .. ..$ name : chr "gbv"
## .. ..$ value: chr "1"
## .. ..$ attr :List of 4
## .. .. ..$ id : chr "gbv"
## .. .. ..$ name : chr "gbv"
## .. .. ..$ type : chr "hidden"
## .. .. ..$ value: chr "1"
## .. ..- attr(*, "class")= chr "rvest_field"
## - attr(*, "class")= chr "rvest_form"
Now we can set the search field and submit the form:
search_something <- search |> html_form_set(q = "something")
resp <- html_form_submit(search_something, submit = "btnG")
read_html(resp)
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body jsmodel="hspDDf">\n<header id="hdr"><script nonce="dJJkfiyhHv142Njk ...
You can also set multiple fields at once by splicing in a named list with !!!:
vals <- list(q = "web scraping", hl = "en")
search <- search |> html_form_set(!!!vals)
resp <- html_form_submit(search)
read_html(resp)
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body jsmodel="hspDDf">\n<header id="hdr"><script nonce="z8WGPiQLcNVJ3QUC ...
If you are working with a session, the workflow is as follows:
- Extract the form.
- Set it.
- Start your session on the page with the form.
- Submit the form using session_submit().
google_form <- read_html("http://www.google.com") |>
  html_form() |>
  pluck(1) # another way to do [[1]]
search_something <- google_form |> html_form_set(q = "something")
google_session <- session("http://www.google.com") |>
  session_submit(search_something, submit = "btnG")
google_session |>
  read_html()
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body jsmodel="hspDDf">\n<header id="hdr"><script nonce="Je3R55fYxgBQXBw_ ...
3.3 Scraping hacks
Some web pages are a bit fancier than the ones we have looked at so far (i.e., they use JavaScript). rvest works nicely for static web pages, but for more advanced ones you need different tools such as RSelenium. This, however, goes beyond the scope of this tutorial.
A web page may sometimes give you time-outs (i.e., it does not respond within a given time), which can break your loop. Wrapping your code in safely() or insistently() from the purrr package might help. The former moves on and notes down what has gone wrong; the latter keeps sending requests until it has been successful. Both work easiest if you put your scraping code into functions and wrap those with either insistently() or safely().
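For instance, a minimal sketch of both wrappers around read_html() (the rate settings are illustrative):
library(purrr)
library(rvest)

safe_read_html <- safely(read_html) # returns list(result, error) instead of failing
insistent_read_html <- insistently(
  read_html,
  rate = rate_backoff(pause_base = 1, max_times = 5) # retry up to 5 times, with backoff
)

safe_result <- safe_read_html("https://www.r-bloggers.com/")
safe_result$error # NULL if the request succeeded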
Sometimes a web page keeps blocking you. Consider using a proxy server.
my_proxy <- httr::use_proxy(url = "http://example.com",
                            user_name = "myusername",
                            password = "mypassword",
                            auth = "basic") # or: digest, digest_ie, gssnegotiate, ntlm, any
my_session <- session("https://scrapethissite.com/", my_proxy)
Find more useful information – including the stuff we just described – and links on this GitHub page.
3.4 Automating scraping
Grabbing individual pieces of data from websites is nice. However, if you want to collect larger amounts of data or scrape multiple pages, you will not get far without some automation.
An example here would again be the R-bloggers page, which provides plenty of R-related content. If you were eager to scrape all the articles, you would first need to acquire the links leading to the individual blog posts. Hence, you would need to navigate through the site’s pages first to collect those links.
In general, there are two ways to go about this. The first is to manually create a list of URLs the scraper will visit and take the content you need, so it never has to figure out where to go next. The other is to automatically acquire the next destination from the page itself (i.e., identify the “go on” button). Both strategies can also be nicely combined with some sort of session().
3.5 Looping over pages
For the first approach, we need to check the URLs first. How do they change as we navigate through the pages?
url_1 <- "https://www.r-bloggers.com/page/2/"
url_2 <- "https://www.r-bloggers.com/page/3/"
initial_dist <- adist(url_1, url_2, counts = TRUE) |>
  attr("trafos") |>
  diag() |>
  str_locate_all("[^M]")

str_sub(url_1, start = initial_dist[[1]][1]-5, end = initial_dist[[1]][1]+5) # makes sense for longer urls
str_sub(url_2, start = initial_dist[[1]][1]-5, end = initial_dist[[1]][1]+5)
## [1] "page/2/"
## [1] "page/3/"
There is some sort of underlying pattern, and we can harness it: url_1 refers to the second page, url_2 to the third. Hence, if we just combine the base URL and, say, the numbers from 1 to 10, we can visit all the pages (exercise 3a) and extract the content we want.
urls <- str_c("https://www.r-bloggers.com/page/", 1:10, "/") # str_c() is the stringr equivalent of paste0()
urls
## [1] "https://www.r-bloggers.com/page/1/" "https://www.r-bloggers.com/page/2/"
## [3] "https://www.r-bloggers.com/page/3/" "https://www.r-bloggers.com/page/4/"
## [5] "https://www.r-bloggers.com/page/5/" "https://www.r-bloggers.com/page/6/"
## [7] "https://www.r-bloggers.com/page/7/" "https://www.r-bloggers.com/page/8/"
## [9] "https://www.r-bloggers.com/page/9/" "https://www.r-bloggers.com/page/10/"
You can run this in a for loop. For the loop to run efficiently, space should be pre-allocated for every object (i.e., you create a list of the required length beforehand; here that is simply the number of URLs).
result_list <- vector(mode = "list", length = length(urls)) # pre-allocate space!

for (i in seq_along(urls)) {
  result_list[[i]] <- read_html(urls[[i]])
}
You can of course also just purrr::map() over it:
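result_list <- purrr::map(urls, read_html) # same result as the loop above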
3.7 Exercises
- Start a session with the tidyverse Wikipedia page. Adapt your user agent to some different value. Proceed to Hadley Wickham’s page. Go back. Go forward. Check the session_history() to see whether it has worked.
tidyverse_wiki <- "https://en.wikipedia.org/wiki/Tidyverse"
hadley_wiki <- "https://en.wikipedia.org/wiki/Hadley_Wickham"
user_agent <- user_agent("Hi, I'm Felix and I'm trying to steal your data.")

wiki_session <- session(tidyverse_wiki, user_agent)

library(magrittr) # needed for the %<>% assignment pipe
wiki_session %<>% session_jump_to(hadley_wiki) %>%
  session_back() %>%
  session_forward()

wiki_session %>% session_history()
- Start a session on “https://www.scrapethissite.com/pages/advanced/?gotcha=login”, fill out, and submit the form. Any value for login and password will do. (Disclaimer: you have to add the URL as an “action” attribute after creating the form, see this tutorial – login_form$action <- url.)
url <- "https://www.scrapethissite.com/pages/advanced/?gotcha=login"

# extract and set login form here

login_form$action <- url # add url as action attribute

# submit form
base_session <- session("https://www.scrapethissite.com/pages/advanced/?gotcha=login") %>%
  session_submit(login_form)
url <- "https://www.scrapethissite.com/pages/advanced/?gotcha=login"

# extract and set login form
login_form <- read_html(url) %>%
  html_form() %>%
  pluck(1) %>%
  html_form_set(user = "123",
                pass = "456")

login_form$action <- url # add url as action attribute

# submit form
base_session <- session("https://www.scrapethissite.com/pages/advanced/?gotcha=login") %>%
  session_submit(login_form)
base_session %>% read_html() %>% html_text()