19 Collecting web data

In this chapter, we introduce how to collect web data, using Yahoo Finance data and Wikipedia data as examples.

We touch upon two situations: using an R wrapper to connect to a web service to access data directly via an API call, and scraping web pages with the help of an R package.

19.1 CRAN Task View: Web Technologies and Services

The CRAN Task View on Web Technologies and Services summarizes packages and strategies for efficiently interacting with resources over the internet with R. This includes:

  • Direct data download and ingestion
  • Online services
  • Frameworks for building web-based R applications

Of particular interest to us are the Tools for Working with the Web, especially the Core Tools for HTTP requests, which we use in web scraping. Once we have acquired data from the web, we need tools that parse structured web data such as HTML, XML, and JSON; these are described under Parsing Structured Web Data. Other topics that could be of interest are the packages that connect to web services, including social media and web analytics, and the packages that scrape data from publications.

19.2 R API wrappers

A common way to access data on the web is through the APIs provided by web services. API stands for Application Programming Interface: a set of rules that allows two pieces of software to interact with each other.

When using an API to retrieve data from an online service, there are three primary elements to consider: access, request, and response.

  1. Access refers to who is authorized to request data from the API. Some APIs are open to everyone, while others require users to create an authenticated account or apply for a developer account. For example, the Yahoo Finance API is free and open to all users, whereas the Twitter API requires developers to create an account before accessing it.

  2. Request involves specifying the data we want to retrieve from the API, such as the time period or ticker symbol of a stock.

  3. Response is the data returned by the API in response to our request, such as the historical stock prices for the specified time period or ticker symbol.

APIs often come with usage limits, which can be in the form of daily limits or a cap on the number of requests within a specific time frame. These limits ensure that the service provider’s servers are not overloaded and can operate under predictable loads. If the usage limit is exceeded, we must wait until the next window opens before making additional requests. If more extensive access to the API is required, it is worth checking whether a paid subscription plan provides increased usage limits or additional features.
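To make the three elements above concrete, the sketch below issues a raw HTTP request with the httr package and parses the response with jsonlite. The endpoint, query parameters, and token are hypothetical placeholders; a real service documents its own URL, parameters, and authentication scheme.

library(httr)
library(jsonlite)

# Hypothetical endpoint and parameters, for illustration only
resp <- GET(
  "https://api.example.com/v1/prices",                # request: which resource we want
  query = list(symbol = "AAPL", range = "1mo"),       # request: which data we want
  add_headers(Authorization = "Bearer YOUR_TOKEN")    # access: authenticate ourselves
)

status_code(resp)                                     # response: HTTP status (200 means success)
prices <- fromJSON(content(resp, as = "text"))        # response: parse the returned JSON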

An API wrapper is a language-specific package that bundles multiple API calls into easy-to-use functions. An R API wrapper is such a package written in R: it simplifies the use of an API by providing a convenient interface for developers or users, without requiring them to worry about the underlying technical details of the API implementation.
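As a rough illustration, a home-made wrapper hides the request-building and parsing steps behind a single function. The function name and endpoint below are hypothetical; packages such as quantmod and tidyquant offer far more polished versions of this idea.

# A minimal sketch of an API wrapper; the endpoint and function name are hypothetical
get_prices <- function(symbol, from, to) {
  resp <- httr::GET(
    "https://api.example.com/v1/prices",              # hypothetical endpoint
    query = list(symbol = symbol, from = from, to = to)
  )
  httr::stop_for_status(resp)                         # fail loudly on HTTP errors
  jsonlite::fromJSON(httr::content(resp, as = "text"))
}

# Users of the wrapper only see a simple function call:
# prices <- get_prices("AAPL", "2021-01-01", "2021-01-31")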

Not every online service offers APIs. In cases where an API is unavailable, we may have to manually scrape web data by writing scripts. However, before proceeding with web scraping, we must ensure that we are not breaching any terms and policies or local laws. It is crucial to maintain ethical standards throughout the process.

19.3 Yahoo Finance data

R packages developed to collect data from Yahoo Finance include quantmod, tidyquant, and yfR.

We introduce quantmod and tidyquant below.

quantmod

In quantmod, getSymbols() retrieves data, with Yahoo Finance as its default source (src = "yahoo"). The first argument specifies the ticker symbols to download.

library(quantmod)

tickers <- c("AFL","AAPL", "MMM")
stock_env <- new.env()
getSymbols(tickers, from = "2021-01-01", to = "2021-01-31", env = stock_env)
stock_list <- eapply(stock_env, cbind)

We choose to create a new environment, stock_env, to collect the data: every xts object downloaded by getSymbols() is placed in that environment. Otherwise, all the objects would be assigned directly into our current workspace.
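As a quick check, we can list what landed in the environment; one xts object per ticker is expected, named after the ticker symbols.

ls(stock_env)    # expected to list one object per ticker: "AAPL" "AFL" "MMM"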

We are not yet ready to work with the downloaded objects until we bind them together. We use eapply() to collect the xts objects from the environment into a list, and then do.call(cbind, ...) to bind them by columns into a single wide-format object.
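A quick look at the merged object, assuming the download above succeeded, shows one row per trading day and one block of columns per ticker (Open, High, Low, Close, Volume, Adjusted), with names such as AAPL.Open and AAPL.Close.

head(stock_data)    # first trading days of January 2021
dim(stock_data)     # rows = trading days, columns = six fields per ticker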

tidyquant

tidyquant’s tq_get() is the counterpart of getSymbols() from quantmod. Its get argument specifies the type of data to retrieve (here, "stock.prices"), playing a role comparable to quantmod’s src argument.

library(tidyquant)

tickers <- c("AFL","AAPL", "MMM", "FB", "AMZN")
tq <- tq_get(tickers, get = "stock.prices", from = "2021-01-01", to = "2021-01-31") 

The returned object is a tibble in long format: one row per ticker per trading day, with a symbol column identifying the ticker.
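If a wide table is preferred, the long tibble can be reshaped, for instance with tidyr. The sketch below keeps only the adjusted closing prices and spreads the tickers into columns; the object name tq_wide is ours, not part of tidyquant.

library(dplyr)
library(tidyr)

# one column of adjusted closing prices per ticker, one row per trading day
tq_wide <- tq %>%
  select(symbol, date, adjusted) %>%
  pivot_wider(names_from = symbol, values_from = adjusted)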

19.4 Web scraping

To collect the information on S&P 500 companies that is stored in a table on a Wikipedia web page, we use the package rvest, which is part of the tidyverse. rvest allows us to scrape data from web pages more easily.

Scraping is defined as programmatically collecting human-readable content from web pages.

We will follow this tutorial to scrape the content stored in the S&P 500 table.

library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tickers <- url %>%
  read_html() %>%
  html_element(xpath = '//*[@id="constituents"]') %>% 
  html_table()

First, note that the result of the whole pipeline is assigned to tickers, the data frame that will store the data we collect from the web page.

We use the function read_html() to read the HTML page. Its first argument is the URL pointing to that web page.

Then, we want to select the correct HTML node in order to extract the table element from that page. Among all the elements on that page, such as headers, images, and tables, we locate this particular table by supplying an XPath that identifies which HTML node to select.

The question is then how to find the XPath to this table. If we use Google Chrome, we can use its Inspect feature to locate the HTML node corresponding to the table, right-click that node in the Elements panel, and copy its XPath.

Finally, we use html_table(), which parses the selected HTML table into a data frame.
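As a variation on the same idea, the node can also be selected with a CSS selector instead of an XPath: the Wikipedia table carries id="constituents", so the selector "#constituents" points at the same node. The column names of the resulting data frame follow the Wikipedia table headers and may change over time.

# Alternative sketch: select the same table by its CSS selector rather than its XPath
tickers_css <- url %>%
  read_html() %>%
  html_element("#constituents") %>%
  html_table()

head(tickers)    # first rows of the constituents table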