Intermediate Web Scraping and API Harvesting
In today’s digital age, the internet serves as a vast repository of information, encompassing websites that offer valuable data for various purposes. While some of this data can be retrieved automatically via suitable application programming interfaces (APIs), considerable amounts of information on the web are designed primarily for human consumption and, therefore, not readily accessible for automated analysis or integration into data-driven projects. This is where web scraping comes into play. Web scraping is the process of automatically extracting data from websites, allowing us to transform unstructured web content into structured data that can be used for analysis, research, and decision-making. It is sometimes also referred to as “screen scraping,” because web pages are processed in the same form in which they appear on your screen.
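To make this concrete, a minimal scraping call in rvest might look as follows. This is only an illustrative sketch: the URL and the CSS selector are placeholders, not part of the tutorial material.

```r
library(rvest)

# Download and parse a page, then pull out one element via a CSS selector
page <- read_html("https://example.com")          # placeholder URL
heading <- page |> html_element("h1") |> html_text2()
heading
```

The result is plain, structured text extracted from an unstructured page, which can then feed into any downstream analysis.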
This tutorial covers what I call “intermediate web scraping.” In my view, “basic web scraping” encompasses understanding the basics of HTML and harnessing them to extract the desired content. Some web pages, however, pose additional challenges such as login forms or automated bot detection. “Intermediate scraping” refers to techniques for circumventing these challenges. It is distinct from “advanced web scraping,” which – again, in my very own view – is characterized by the use of a so-called headless browser that can be controlled through code (e.g., via RSelenium). Since this amounts to a near-perfect emulation of a human-controlled web browser, essentially every web page can be scraped via this approach.
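For orientation only – this approach is beyond the scope of the tutorial – driving such a browser with RSelenium could be sketched roughly as follows, assuming a working local Selenium/driver setup and using a placeholder URL:

```r
library(RSelenium)

# Start a Selenium-driven browser (requires a local browser driver installation)
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remote <- driver$client

# Navigate exactly as a human user would, then grab the fully rendered HTML
remote$navigate("https://example.com")            # placeholder URL
html <- remote$getPageSource()[[1]]

# Shut the browser and the Selenium server down again
remote$close()
driver$server$stop()
```

Because the page is rendered by a real browser engine, even JavaScript-heavy sites become accessible this way.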
Web developers sometimes go to great lengths to prevent users from automatically accessing their content, and for good reason: repeated, automated requests consume significant bandwidth on their servers’ end and may eventually shut them down entirely. Hence, developers may provide access for bots through a dedicated interface where well-defined requests can be made, keeping traffic to a minimum. For instance, to acquire a particular piece of information from a web page via scraping, one must first load the full page – including all the redundant information – and only then extract the relevant bits through CSS selectors. Through API requests, on the other hand, one can define upfront exactly what one wants to see and, hence, make very targeted requests.
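The contrast can be illustrated with httr: instead of downloading a full page, one asks the API only for the fields of interest. The endpoint and query parameters below are made up for illustration and are not a real service:

```r
library(httr)

# A targeted API request: only the data we explicitly ask for is transferred
resp <- GET(
  "https://api.example.com/v1/items",   # hypothetical endpoint
  query = list(fields = "name,price", limit = 10)
)

# Parse the (typically JSON) response body into an R list
result <- content(resp, as = "parsed")
```

Compared to loading and parsing a whole HTML page, such a request transfers a fraction of the data and returns it in a structure that needs no further CSS-selector work.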
In this tutorial, students will be shown both flavors of automated data acquisition. I assume students to be familiar with R and the tidyverse as well as basic HTML.
In the beginning, I will provide a brief recap of the logic of HTML and how it can be harnessed in rvest to extract content. Then, I will cover techniques to simulate user behavior, such as setting a user agent (httr::user_agent()), using a proxy (httr::use_proxy()), handling required cookies (rvest::session()), filling out (login) forms (rvest::html_form()), and following links and buttons (rvest::session_follow_link()). Subsequently, I will present two techniques to automate extraction via rvest. Finally, I will turn to the usage of APIs for making requests and extracting information.
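As a preview of these user-simulation techniques, a hedged sketch combining a custom user agent, a persistent session, and a login form might look like this. The URL, form field names, and link text are placeholders and will differ on any real site:

```r
library(rvest)
library(httr)

# Open a session with a custom user agent so requests identify themselves;
# the session keeps cookies across requests
sess <- session("https://example.com/login",     # placeholder URL
                user_agent("my-tutorial-scraper/0.1"))

# Extract the first form on the page and fill in credentials
login_form <- html_form(sess)[[1]]
filled <- html_form_set(login_form,
                        username = "jane",       # hypothetical field names
                        password = "secret")

# Submit the form; the session retains the resulting login cookies
sess <- session_submit(sess, filled)

# Follow a link by its text, as a user clicking a button would
sess <- session_follow_link(sess, "Profile")
```

Because the session object carries cookies and headers forward, subsequent requests behave as if they came from the same logged-in browser.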