3 Collecting structured and unstructured data from the Web
In this chapter we will use four new packages:
* httr, which handles API calls.
* rtweet, an R package for efficient use of the Twitter API.
* tidyquant, which focuses on financial analysis.
* rvest, which allows us to scrape HTML web pages.
We will first install these packages (only once):
install.packages(c("httr","rtweet","tidyquant","rvest"))
Then we will load them into our environment as follows:
library(tidyverse)
library(ggformula)
library(ggthemes)
library(httr)
library(jsonlite)
library(rtweet)
library(tidyquant)
library(lubridate)
library(rvest)
3.1 API calls
We will first collect structured output from Application Programming Interfaces (APIs).
3.1.1 Simple API calls: dog-facts
We will start by using the package httr to make simple API calls. httr allows us to retrieve and post information from specific URLs (AKA endpoints). It has two basic functions: the GET() function, which retrieves information from a given URL, and the POST() function, which transfers information to a given URL.
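As a minimal sketch of these two functions (httpbin.org is a free request-echo service, used here purely for illustration):

# Retrieve information from an endpoint; status_code() extracts the HTTP status
resp = GET("https://httpbin.org/get")
status_code(resp)  # 200 means the request succeeded
# Transfer information to an endpoint as a JSON body
resp = POST("https://httpbin.org/post", body = list(x = 1), encode = "json")
status_code(resp)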
As an example, we will use the dog-facts API (https://dukengn.github.io/Dog-facts-API/):
r = GET("https://dog-facts-api.herokuapp.com/api/v1/resources/dogs/all")
r
## Response [https://dog-facts-api.herokuapp.com/api/v1/resources/dogs/all]
## Date: 2021-11-06 20:41
## Status: 200
## Content-Type: application/json
## Size: 57.7 kB
## [
## {
## "fact": "All dogs can be traced back 40 million years ago to a weasel-lik...
## },
## {
## "fact": "Ancient Egyptians revered their dogs. When a pet dog would die, ...
## },
## {
## "fact": "Small quantities of grapes and raisins can cause renal failure i...
## },
## ...
One way to access the information in any R object is to call the function names():
names(r)
## [1] "url" "status_code" "headers" "all_headers" "cookies"
## [6] "content" "date" "times" "request" "handle"
By running names(r)
we observe that there is a field content
. This field stores the response of the API call in binary format. We can access any field of an R
object with the dollar sign $
as follows:
# Run r$content in your machine to see the results.
# I omit the output here for presentation purposes.
r$content
Recall that the dollar sign $ can also be used to access any column c from a tibble t (t$c).
As humans, we can’t really work with binary code; to decode this binary response we can use the rawToChar() function and get the result as a JSON object:
j = rawToChar(r$content)
The results are in JSON format because this is how the specific API formats its responses. Most APIs return JSON objects, but some might occasionally return different formats (e.g., CSV or XML).
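As an aside, httr also offers the helper content(), which can perform this decoding for us. A sketch, equivalent to the rawToChar() call above:

# Decode the binary response directly into a character string
j_alt = content(r, as = "text", encoding = "UTF-8")
identical(j, j_alt)  # should be TRUE for this response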
Now we can call the function fromJSON()
from package jsonlite
(see Section 2.2.2) that immediately transforms a JSON object into a data frame:
t = fromJSON(rawToChar(r$content))
t %>% head
## fact
## 1 All dogs can be traced back 40 million years ago to a weasel-like animal called the Miacis which dwelled in trees and dens. The Miacis later evolved into the Tomarctus, a direct forbear of the genus Canis, which includes the wolf and jackal as well as the dog.
## 2 Ancient Egyptians revered their dogs. When a pet dog would die, the owners shaved off their eyebrows, smeared mud in their hair, and mourned aloud for days.
## 3 Small quantities of grapes and raisins can cause renal failure in dogs. Chocolate, macadamia nuts, cooked onions, or anything with caffeine can also be harmful.
## 4 Apple and pear seeds contain arsenic, which may be deadly to dogs.
## 5 Rock star Ozzy Osborne saved his wife Sharon’s Pomeranian from a coyote by tackling ad wresting the coyote until it released the dog.
## 6 Dogs have sweat glands in between their paws.
Recall that a data frame behaves almost identically to a tibble. Yet, for consistency, we can use the function as_tibble
to transform its internal representation into a tibble. Combining with the code from the previous chunk:
t = as_tibble(fromJSON(rawToChar(r$content)))
t %>% head
## # A tibble: 6 × 1
## fact
## <chr>
## 1 All dogs can be traced back 40 million years ago to a weasel-like animal call…
## 2 Ancient Egyptians revered their dogs. When a pet dog would die, the owners sh…
## 3 Small quantities of grapes and raisins can cause renal failure in dogs. Choco…
## 4 Apple and pear seeds contain arsenic, which may be deadly to dogs.
## 5 Rock star Ozzy Osborne saved his wife Sharon’s Pomeranian from a coyote by ta…
## 6 Dogs have sweat glands in between their paws.
Now we have a final clean tibble that we can explore and manipulate as we discussed earlier in Section 2.
3.1.2 API calls with parameters
Besides getting all available information from an API endpoint, we often need to specify the chunks of information that we are interested in.
To show an example we will use the COVID-19 API that allows us to get live stats on COVID infections, recoveries, and deaths, per country. The API’s documentation can be found here: https://documenter.getpostman.com/view/10808728/SzS8rjbc (Main website: https://covid19api.com)
Assume that we want to get the confirmed cases in Greece between June 1st and August 26th.
To do so, we add after the endpoint a question mark ?
, followed by the parameters that define the period we are looking for:
r = GET("https://api.covid19api.com/country/greece/status/confirmed?from=2021-06-01T00:00:00Z&to=2021-08-26T00:00:00Z")
r
## Response [https://api.covid19api.com/country/greece/status/confirmed?from=2021-06-01T00:00:00Z&to=2021-08-26T00:00:00Z]
## Date: 2021-11-06 20:41
## Status: 200
## Content-Type: application/json; charset=UTF-8
## Size: 15 kB
## [{"Country":"Greece","CountryCode":"GR","Province":"","City":"","CityCode":""...
Similar to before, we transform the result to a tibble:
r1 = as_tibble(fromJSON(rawToChar(r$content)))
r1 %>% head
## # A tibble: 6 × 10
## Country CountryCode Province City CityCode Lat Lon Cases Status Date
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
## 1 Greece GR "" "" "" 39.07 21.82 404163 confir… 2021-0…
## 2 Greece GR "" "" "" 39.07 21.82 405542 confir… 2021-0…
## 3 Greece GR "" "" "" 39.07 21.82 406751 confir… 2021-0…
## 4 Greece GR "" "" "" 39.07 21.82 407857 confir… 2021-0…
## 5 Greece GR "" "" "" 39.07 21.82 408789 confir… 2021-0…
## 6 Greece GR "" "" "" 39.07 21.82 409368 confir… 2021-0…
3.1.3 Handling dates
If we examine the resulting tibble, we can see that there is a column named Date. However, R thinks that this is a character (<chr>) column.
Fortunately, we can transform this column to a date type with the function as_date() from the package lubridate:
r1$Date = as_date(r1$Date)
r1 %>% head
## # A tibble: 6 × 10
## Country CountryCode Province City CityCode Lat Lon Cases Status
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr>
## 1 Greece GR "" "" "" 39.07 21.82 404163 confirmed
## 2 Greece GR "" "" "" 39.07 21.82 405542 confirmed
## 3 Greece GR "" "" "" 39.07 21.82 406751 confirmed
## 4 Greece GR "" "" "" 39.07 21.82 407857 confirmed
## 5 Greece GR "" "" "" 39.07 21.82 408789 confirmed
## 6 Greece GR "" "" "" 39.07 21.82 409368 confirmed
## # … with 1 more variable: Date <date>
Note that we have already loaded the package lubridate in our working environment at the beginning of this chapter.
Once we transform the column into a date type, we can use date-specific functions from the package lubridate
that can manipulate dates. For instance, we can use the function wday()
which will return the day of the week for each date:
wday(r1$Date, label = T) %>% head
## [1] Tue Wed Thu Fri Sat Sun
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
In the previous call of wday we set label = T. This argument returns the results in labels (such as Fri, Sat, Sun, Mon, etc.). Experiment with running wday() with label = F to see the difference.
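For instance, with label = F the same call returns each day of the week as a number (with lubridate’s default, 1 corresponds to Sunday):

# Given the labels above (Tue Wed Thu Fri Sat Sun), this should print 3 4 5 6 7 1
wday(r1$Date, label = F) %>% head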
3.1.4 Data analysis of data collected from an API
Let us assume now that we want to estimate the number of confirmed cases per weekday. First, we need to create a weekday
column that stores the day of the week:
r2 = r1 %>% mutate(weekday = wday(Date, label=T))
r2 %>% head
## # A tibble: 6 × 11
## Country CountryCode Province City CityCode Lat Lon Cases Status
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr>
## 1 Greece GR "" "" "" 39.07 21.82 404163 confirmed
## 2 Greece GR "" "" "" 39.07 21.82 405542 confirmed
## 3 Greece GR "" "" "" 39.07 21.82 406751 confirmed
## 4 Greece GR "" "" "" 39.07 21.82 407857 confirmed
## 5 Greece GR "" "" "" 39.07 21.82 408789 confirmed
## 6 Greece GR "" "" "" 39.07 21.82 409368 confirmed
## # … with 2 more variables: Date <date>, weekday <ord>
Now if we look at the data, the API provides cumulative numbers of confirmed cases in each area. So in order to get the number of new confirmed cases per day, we will need to subtract from each day's count the count of the previous day.
R has a function lag() that takes as input a variable and shifts it one time period, so that each observation stores the value of the previous period. This is particularly useful in time series data. For lag() to work properly, the tibble must be ordered. In this example, it is already ordered by the API response. Read more about the lag() function here: https://dplyr.tidyverse.org/reference/lead-lag.html
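To build intuition, here is a minimal sketch of lag() on a toy vector (the values are arbitrary):

x = c(10, 15, 23)
lag(x)      # NA 10 15: each value shifted one position forward
x - lag(x)  # NA  5  8: the period-over-period change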
As an example, check out the first six values of the original Cases column:
r2 = r1
r2$Cases %>% head
## [1] 404163 405542 406751 407857 408789 409368
Now see how the lag()
function shifts those values:
lag(r2$Cases) %>% head
## [1] NA 404163 405542 406751 407857 408789
With lag()
we can estimate the numberOfNewCases
for each day by subtracting the number of previous cases stored in the lagged variable. Then, we can group by the new column weekday to estimate the average number of new cases per weekday:
r1 %>% mutate(weekday = wday(Date, label=T), numberOfNewCases = Cases - lag(Cases)) %>%
  select(Cases, weekday, numberOfNewCases) %>% group_by(weekday) %>%
  summarize(averageCases = mean(numberOfNewCases, na.rm = T))
## # A tibble: 7 × 2
## weekday averageCases
## <ord> <dbl>
## 1 Sun 1153.
## 2 Mon 1531.
## 3 Tue 2975.
## 4 Wed 2121.
## 5 Thu 2077.
## 6 Fri 2008.
## 7 Sat 1905.
Alternatively, we can plot the number of new cases over time:
r1 %>% mutate(weekday = wday(Date, label=T), numberOfNewCases = Cases - lag(Cases)) %>%
  gf_line(numberOfNewCases ~ Date) %>% gf_smooth(se=T)
Note that gf_line knows how to use the Date column as a date: it automatically transforms the date values into Jun, Jul, and Aug.
3.1.5 Authentication with the Twitter API wrapper rtweet
Not every API is open for everyone to query. Some APIs require authentication. Take for instance the Twitter API. In order to gain access to Twitter data, we first need to have a Twitter account, and submit an application that explains the reasons that we want to use the Twitter API. You can find more info here: https://developer.twitter.com/en/docs
Instead of using the GET()
function to access the Twitter API, we will use the package rtweet
which is an API wrapper.
API wrappers are language-specific packages that wrap API calls into easy-to-use functions. So instead of calling the function GET with a specific endpoint every time, wrappers include functions that encapsulate these endpoints and streamline the communication between our program and the API.
Read more about the rtweet wrapper here: https://github.com/ropensci/rtweet
3.1.5.1 create_token()
The first thing we will do in order to use the wrapper rtweet is to create a unique signature that allows us to completely control and manipulate our Twitter account. We will need the key, secret key, access token, secret access token, and the application name. We can get this info from our Twitter developer account (keys and tokens section).
Then, we can store these keys into variables as follows:
= "key"
key = "secret"
secret = "access"
access = "access_secret"
access_secret = "appname" app_name
Once we have the necessary keys, we can use the function create_token()
to generate our unique signature.
myToken = create_token(app = app_name,
                       consumer_key = key,
                       consumer_secret = secret,
                       access_token = access,
                       access_secret = access_secret)
3.1.5.2 post_tweet()
Now we can use our signature along with the function post_tweet
to post a new tweet:
post_tweet("The students of ISYS3350 are the best! Period.", token = myToken)
3.1.5.3 search_tweets()
We can use the function search_tweets()
to look for specific tweets, for instance, tweets that include the hashtag #analytics:
analytics = search_tweets("#analytics", n=50, token = myToken,
                          include_rts = F)
analytics %>% head
## # A tibble: 6 × 90
## user_id status_id created_at screen_name text source
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 939462156 145708561… 2021-11-06 20:40:55 tweetgirlmem "I need… Twitt…
## 2 900369386715979776 145708537… 2021-11-06 20:40:00 DataVault_UK "Should… Semru…
## 3 3219670842 145708517… 2021-11-06 20:39:11 njoyflyfish… "Total … smcap…
## 4 3219670842 145708453… 2021-11-06 20:36:38 njoyflyfish… "Total … smcap…
## 5 3219670842 145707754… 2021-11-06 20:08:52 njoyflyfish… "Total … smcap…
## 6 3219670842 145708067… 2021-11-06 20:21:18 njoyflyfish… "Total … smcap…
## # … with 84 more variables: display_text_width <dbl>, reply_to_status_id <lgl>,
## # reply_to_user_id <lgl>, reply_to_screen_name <lgl>, is_quote <lgl>,
## # is_retweet <lgl>, favorite_count <int>, retweet_count <int>,
## # quote_count <int>, reply_count <int>, hashtags <list>, symbols <list>,
## # urls_url <list>, urls_t.co <list>, urls_expanded_url <list>,
## # media_url <list>, media_t.co <list>, media_expanded_url <list>,
## # media_type <list>, ext_media_url <list>, ext_media_t.co <list>, …
Note that you can increase the argument n=50 to fetch more tweets. The argument include_rts=F tells rtweet to not return retweets.
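As a quick illustration of what we can do with the fetched tweets, rtweet ships with the helper ts_plot(), which plots tweet frequency over time. A sketch (the interval "hours" is one of several options):

# Plot the number of fetched #analytics tweets per hour
analytics %>% ts_plot("hours")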
3.1.6 tidyquant
A different wrapper, one that focuses on financial markets, is tidyquant. This package facilitates financial analysis by providing the tools to perform stock portfolio analysis at scale.
For our example, we want to plot the stock prices for "AAPL", "TSLA", and "ZM" for 9 months, between January 2021 and September 2021. The function tq_get()
returns quantitative data in tibble format:
= c("AAPL",'TSLA','ZM') %>% tq_get(get = "stock.prices",
r from = "2021-01-01",
to = "2021-09-01")
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## Warning: `type_convert()` only converts columns of type 'character'.
## - `df` has no columns of type 'character'
## Warning: `type_convert()` only converts columns of type 'character'.
## - `df` has no columns of type 'character'
## Warning: `type_convert()` only converts columns of type 'character'.
## - `df` has no columns of type 'character'
r %>% head
## # A tibble: 6 × 8
## symbol date open high low close volume adjusted
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 2021-01-04 134. 134. 127. 129. 143301900 129.
## 2 AAPL 2021-01-05 129. 132. 128. 131. 97664900 130.
## 3 AAPL 2021-01-06 128. 131. 126. 127. 155088000 126.
## 4 AAPL 2021-01-07 128. 132. 128. 131. 109578200 130.
## 5 AAPL 2021-01-08 132. 133. 130. 132. 105158200 131.
## 6 AAPL 2021-01-11 129. 130. 128. 129. 100384500 128.
Now we can plot the results:
r %>% gf_line(adjusted ~ date, color = ~symbol) %>%
  gf_smooth(se=T, fill = ~symbol)
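Beyond prices, tidyquant can also compute derived series. Here is a sketch that estimates monthly returns from the adjusted prices (tq_transmute() and periodReturn() are part of the tidyquant/quantmod toolchain; the column name monthly_return is our choice):

# Monthly returns per symbol, computed from the adjusted closing price
r %>% group_by(symbol) %>%
  tq_transmute(select = adjusted,
               mutate_fun = periodReturn,
               period = "monthly",
               col_rename = "monthly_return")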
3.2 Web scraping
Web pages are written in various coding languages that web browsers can read and understand. When scraping web pages, we deal with their code. This code is often written in three languages: Hypertext Markup Language (HTML), Cascading Style Sheets (CSS), and Javascript.
- HTML code defines the structure and the content of a web page.
- CSS code customizes the style and look of a page.
- Javascript makes a page dynamic and interactive.
In this Section we’ll focus on how to use R to read the static parts of a web page that are written in HTML and CSS.
At the end of this Section you can find a brief optional introduction on how to scrape dynamic web pages.
Unlike R, HTML is not a programming language. Instead, it is called a markup language — it describes the content and structure of a web page. HTML is organized using tags, which are surrounded by <>
symbols. Different tags perform different functions. Together, many tags form the content of a web page.
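For example, a toy HTML fragment looks like this (made up purely for illustration):

<html>
<body>
<h1>A heading</h1>
<p class="intro">A paragraph with <a href="https://example.com">a link</a>.</p>
</body>
</html>

Here the <h1> tag marks a heading, <p> a paragraph, and <a> a hyperlink. The class attribute ("intro") is what CSS uses to style an element, and it is exactly the kind of hook we will target when scraping.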
When we scrape a web page, we are downloading its HTML, CSS and Javascript code. Hence, in order to extract any useful information from a web page,
we will need to know its HTML structure and target specific HTML and CSS tags that we care about.
3.2.1 Scraping random quotes
Let’s start with something very simple. Let’s try to scrape some quotes from http://quotes.toscrape.com/
First, we need to download the contents of the page that we are interested in.
The function read_html()
from package rvest
does that by reading any webpage from a given URL.
= read_html("http://quotes.toscrape.com/")
s s
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <div class="container">\n <div class="row header-box"> ...
Instead of manually exploring HTML tags and trying to identify how to parse valuable information, we can use the Chrome extension SelectorGadget.
In our example, by clicking on the quote text of the http://quotes.toscrape.com/ web page, we can identify that the quotes are surrounded by the tag “.text”.
Once we have the relevant tag keyword that we are interested in, we can use the function html_nodes()
from the package rvest
to extract all the information stored within the tags that we selected. (Note that the function html_nodes()
works particularly well with the SelectorGadget
extension.)
%>% html_nodes(".text") s
## {xml_nodeset (10)}
## [1] <span class="text" itemprop="text">“The world as we have created it is a ...
## [2] <span class="text" itemprop="text">“It is our choices, Harry, that show ...
## [3] <span class="text" itemprop="text">“There are only two ways to live your ...
## [4] <span class="text" itemprop="text">“The person, be it gentleman or lady, ...
## [5] <span class="text" itemprop="text">“Imperfection is beauty, madness is g ...
## [6] <span class="text" itemprop="text">“Try not to become a man of success. ...
## [7] <span class="text" itemprop="text">“It is better to be hated for what yo ...
## [8] <span class="text" itemprop="text">“I have not failed. I've just found 1 ...
## [9] <span class="text" itemprop="text">“A woman is like a tea bag; you never ...
## [10] <span class="text" itemprop="text">“A day without sunshine is like, you ...
This is nice, as we now have all the quotes that we were interested in. However, we only care about the actual quote text and not about the accompanying HTML tags.
Thankfully, the rvest
package offers the function html_text()
, which extracts the text out of the HTML tags:
s1 = s %>% html_nodes(".text") %>% html_text()
s1
## [1] "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"
## [2] "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"
## [3] "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"
## [4] "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"
## [5] "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"
## [6] "“Try not to become a man of success. Rather become a man of value.”"
## [7] "“It is better to be hated for what you are than to be loved for what you are not.”"
## [8] "“I have not failed. I've just found 10,000 ways that won't work.”"
## [9] "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"
## [10] "“A day without sunshine is like, you know, night.”"
The result of this process is not a tibble, but instead, a vector of characters:
class(s1)
## [1] "character"
We can transform a vector into a single-column tibble with the function as_tibble_col:
s1 = s %>% html_nodes(".text") %>% html_text() %>% as_tibble_col("quote")
s1
## # A tibble: 10 × 1
## quote
## <chr>
## 1 “The world as we have created it is a process of our thinking. It cannot be …
## 2 “It is our choices, Harry, that show what we truly are, far more than our ab…
## 3 “There are only two ways to live your life. One is as though nothing is a mi…
## 4 “The person, be it gentleman or lady, who has not pleasure in a good novel, …
## 5 “Imperfection is beauty, madness is genius and it's better to be absolutely …
## 6 “Try not to become a man of success. Rather become a man of value.”
## 7 “It is better to be hated for what you are than to be loved for what you are…
## 8 “I have not failed. I've just found 10,000 ways that won't work.”
## 9 “A woman is like a tea bag; you never know how strong it is until it's in ho…
## 10 “A day without sunshine is like, you know, night.”
Inside the function as_tibble_col
, we can provide the column name that we want our single-column tibble to have.
The function as_tibble_col is a member of the larger family of functions as_tibble. Run ?as_tibble in the console for more information.
3.2.2 Binding columns
Let’s assume that besides the quote, we also care about the quote’s author. By using the SelectorGadget extension we identify that the tag “.author” encapsulates the author information:
s2 = s %>% html_nodes(".author") %>% html_text() %>% as_tibble_col("author")
s2
## # A tibble: 10 × 1
## author
## <chr>
## 1 Albert Einstein
## 2 J.K. Rowling
## 3 Albert Einstein
## 4 Jane Austen
## 5 Marilyn Monroe
## 6 Albert Einstein
## 7 André Gide
## 8 Thomas A. Edison
## 9 Eleanor Roosevelt
## 10 Steve Martin
Now we have two tibbles, s1 and s2, but we want to combine them so that we get one tibble with the columns quote and author. The function bind_cols allows us to place the two tibbles side by side:
t = bind_cols(s1, s2)
t
## # A tibble: 10 × 2
## quote author
## <chr> <chr>
## 1 “The world as we have created it is a process of our thinking… Albert Einste…
## 2 “It is our choices, Harry, that show what we truly are, far m… J.K. Rowling
## 3 “There are only two ways to live your life. One is as though … Albert Einste…
## 4 “The person, be it gentleman or lady, who has not pleasure in… Jane Austen
## 5 “Imperfection is beauty, madness is genius and it's better to… Marilyn Monroe
## 6 “Try not to become a man of success. Rather become a man of v… Albert Einste…
## 7 “It is better to be hated for what you are than to be loved f… André Gide
## 8 “I have not failed. I've just found 10,000 ways that won't wo… Thomas A. Edi…
## 9 “A woman is like a tea bag; you never know how strong it is u… Eleanor Roose…
## 10 “A day without sunshine is like, you know, night.” Steve Martin
3.2.3 Scraping Yahoo! finance comments and reactions
Next, we will try to do something a little bit more substantial. Assume that we want to create a unique dataset about a set of securities that we care about. Perhaps we can find some unique information in the everyday comments and reactions of people who post on the Yahoo! finance board. For instance, assume that we care about the AAPL stock:
= read_html("https://finance.yahoo.com/quote/AAPL/community?p=AAPL")
j j
## {html_document}
## <html id="atomic" class="NoJs desktop" lang="en-US">
## [1] <head prefix="og: http://ogp.me/ns#">\n<meta http-equiv="Content-Type" co ...
## [2] <body>\n<div id="app"><div class="" data-reactroot="" data-reactid="1" da ...
By using the SelectorGadget Chrome extension, we identify the tag “.Pend\(8px\)”, which encapsulates users’ comments/responses:
j1 = j %>% html_nodes(".Pend\\(8px\\)") %>% html_text()
head(j1)
## [1] "Apple (AAPL) recently released the latest version of iOS, the operating system that powers millions of iPhones. And with it comes a feature called SharePlay, which Apple has been previewing since its big developers conference back in June. I love This Platform https://perfectroi.today/x8gjgn ! This is the most amazing piece of software I have ever used. Totally customizable and incredibly powerful. There is very little I want that it doesn't already have."
## [2] "apple just hired Tesla's autopilot software director. fabulous news."
## [3] "WOW!! Who bought 7 400 000 AAPL shares for $1,117,992,000 at close! ??Similar transaction yesterday..."
## [4] "Every other fang stock moves up and down so fast except Apple, another tight range on Friday! Just take a look at fb, Amazon, msft, tesla, goog and more!"
## [5] "How can qcomm beat like that and have no supply issues in this environment. Sounds like Tim and Luca just being super conservative as always with next quarter. I’m thinking supply chain getting better and Apple going to have a blowout holiday quarter."
## [6] "WE posted \"A move stays on the board.\" This week the profits on the Nov 5 algorithim is just one example.WE posted AAPL is undervalued at $150.Our target is $160.AAPL has printed $157 twice. AAPL trades in trends of three.WE have no time table on this."
Let’s now add an additional column in the tibble that identifies the symbol of the stock:
d = j %>% html_nodes(".Pend\\(8px\\)") %>%
  html_text() %>% as_tibble_col("comment") %>%
  mutate(stockCode = "AAPL")
head(d)
## # A tibble: 6 × 2
## comment stockCode
## <chr> <chr>
## 1 "Apple (AAPL) recently released the latest version of iOS, the oper… AAPL
## 2 "apple just hired Tesla's autopilot software director. fabulous ne… AAPL
## 3 "WOW!! Who bought 7 400 000 AAPL shares for $1,117,992,000 at clo… AAPL
## 4 "Every other fang stock moves up and down so fast except Apple, ano… AAPL
## 5 "How can qcomm beat like that and have no supply issues in this env… AAPL
## 6 "WE posted \"A move stays on the board.\" This week the profits on … AAPL
3.2.4 Repetitive operations with purrr::map
Often we are not interested only in a single stock, but instead, we want to collect information on multiple stocks. One way to do this would be to manually update the URL each time for the stock that we care about. However, this is not particularly efficient or sustainable, especially when we are dealing with hundreds of stocks.
Luckily, the tidyverse package purrr and its family of map() functions allow us to perform such repetitive operations efficiently. Specifically, the map() function transforms the input object by applying a given function to each element of the input. For instance, we can apply the function nchar to each comment we scraped by calling map():
d %>% select(comment) %>% map(nchar)
## $comment
## [1] 529 69 104 153 251 252 94 104 318 490 146 110 378 295 198 187 259 58 121
## [20] 177
Function nchar(x) calculates the number of characters of x.
Function map() returns an object of type list.
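A minimal illustration of this behavior on a toy vector (the ~ .x^2 shorthand is purrr's formula syntax for an anonymous function):

# map() applies the function to each element and always returns a list
map(1:3, ~ .x^2)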
However, we often would like to get different results. For instance, we might want to estimate the average number of characters per comment. The function map_dbl allows us to get this value directly as a double, instead of as a list:
d %>% select(comment) %>% map(nchar) %>% map_dbl(mean)
## comment
## 214.65
3.2.5 Custom R functions
Now back to our original problem. Based on what we have seen so far, it would be nice to use the map function and apply the same process repeatedly to different stocks to extract their comments and reactions into one unified tibble. But how can we do this? What function goes on Yahoo! finance and extracts the information we want for an arbitrary set of stocks?
This is a rhetorical question: there isn’t a function that does that. But luckily, we can create our own unique function to do so:
getYahooFinanceComments = function(stockSymbol){
  j = read_html(paste("https://finance.yahoo.com/quote/",stockSymbol,"/community?p=", stockSymbol, sep=""))
  j = j %>% html_nodes(".Pend\\(8px\\)") %>%
    html_text() %>% as_tibble_col("comment") %>%
    mutate(stockCode = stockSymbol)
  return(j)
}
There are some things to point out here:
- The keyword function(){} identifies that this will be a custom R function.
- The name getYahooFinanceComments is arbitrary. You can give your function any name you like.
- Inside the parentheses of function(), a function can have any number of parameters.
- Inside the brackets {} we identify the functionality of the function. If a function has parameters, these parameters are accessed and used inside the brackets {}.
- The special keyword return() returns the result of the function. This result can be a tibble, a vector, a number, a string, or any other R object we would like it to be.
- To define the function you will have to run it. If you update the code of the function, you will need to re-run it in order for the updates to take effect (see the toy example after this list).
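To see these points in action, here is a toy custom function (the name addTwo and its parameters are made up for illustration):

# A custom function with two parameters that returns their sum
addTwo = function(x, y){
  result = x + y
  return(result)
}
addTwo(3, 4)  # returns 7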
Inside the getYahooFinanceComments custom function we use, for the first time in this book, the function paste(). This function concatenates vectors after converting them to characters. Run paste("one","apple") to see the result. Then re-run it with the option sep="_" as follows: paste("one","apple", sep="_").
Function getYahooFinanceComments
consolidates the steps we discussed above and returns a final tibble with two columns for a given stock symbol.
For clarity, this is what the paste() function does inside read_html:
= "ZM"
stockSymbol paste("https://finance.yahoo.com/quote/",stockSymbol,"/community?p=", stockSymbol, sep="")
## [1] "https://finance.yahoo.com/quote/ZM/community?p=ZM"
Once a function is defined in our environment, we can call it, similarly to how we call any other R function such as filter, summarize, mean, and so on. For instance, I can call the new function getYahooFinanceComments on the TSLA stock:
resultTibble = getYahooFinanceComments("TSLA")
resultTibble
## # A tibble: 20 × 2
## comment stockCode
## <chr> <chr>
## 1 "I don’t care where the stock price goes today or tomorrow. I’m wi… TSLA
## 2 "Ahahaha Elon Musik did twitter for $GME TY Elon, will buy more $T… TSLA
## 3 "THIS WEEK Tesla will be reporting the number of deliveries for th… TSLA
## 4 "Circuits in all markets will break today. Historic. This is wonde… TSLA
## 5 "bBeing evasive and coy has worked for Tesla/Musk for almost 10 ye… TSLA
## 6 "CEO from $TSLA new vendor $OZSC, \"Right now companies like Tesla… TSLA
## 7 "May 27, 2020 - Tesla cuts prices on ALL Models July 2, 2020 - Te… TSLA
## 8 "Recently, Goldman Sachs upgraded $TSLA to a buy, and gave a new p… TSLA
## 9 "#MARA Montana Business Digest: Marathon Digital \"Flips the Switc… TSLA
## 10 "$TSLA beats Q1 earnings:$0.93 non-GAAP EPS vs consensus $0.80EPS … TSLA
## 11 "Tesla paradox alert: Tesla claims they will grow 50% yoy for mul… TSLA
## 12 "Tesla's market cap is now higher than Berkshire Hathaway. Sales (… TSLA
## 13 "Join the NIO opportunity... TSLA shipped 90K cars per Q1NIO shipp… TSLA
## 14 "A short squeeze is a rapid increase in the price of a stock owing… TSLA
## 15 "Goldman Sachs Upgrades $TSLA to Buy, price target to $780 on Impr… TSLA
## 16 "As a former short of this stock I learned the hard way. That is t… TSLA
## 17 "Morgan Stanley upgrades $TSLA to Overweight, raises price target … TSLA
## 18 "Elon says Tesla is \"open\" to selling software and drive train c… TSLA
## 19 "If you bought a TESLA when the stock IPO’d you would have paid ju… TSLA
## 20 "I have a great feeling we’re going to see some nice gains tomorro… TSLA
3.2.6 Combining custom functions with map
Now, we can use the function map
along with our custom function getYahooFinanceComments
to scrape multiple stocks with one simple command, without manually adjusting the code.
Specifically, we will use the function map_dfr
that will return the result as a single tibble (data frame):
= c("AAPL",'TSLA','ZM') %>% map_dfr(getYahooFinanceComments)
d %>% head d
## # A tibble: 6 × 2
## comment stockCode
## <chr> <chr>
## 1 "Apple (AAPL) recently released the latest version of iOS, the oper… AAPL
## 2 "apple just hired Tesla's autopilot software director. fabulous ne… AAPL
## 3 "WOW!! Who bought 7 400 000 AAPL shares for $1,117,992,000 at clo… AAPL
## 4 "Every other fang stock moves up and down so fast except Apple, ano… AAPL
## 5 "How can qcomm beat like that and have no supply issues in this env… AAPL
## 6 "WE posted \"A move stays on the board.\" This week the profits on … AAPL
3.3 Optional and Advanced: Scraping dynamic pages with RSelenium
So far we have talked about scraping static web pages. Often, many of the pages that we care about are dynamic (interactive). In those cases, we need more advanced techniques to scrape them.
The package RSelenium automates a web browser’s actions and facilitates the scraping of dynamic web pages.
First we need to install and load the package:
install.packages("RSelenium")
library(RSelenium)
For RSelenium
to work we need to use a browser driver (code that mimics a browser).
Hence, we will install a Firefox webdriver (this is different from your browser).
Download it from here (make sure you also have a Firefox on your machine): https://github.com/mozilla/geckodriver/releases/tag/v0.27.0
Store the driver in your Rmd folder.
If you do not have
Java
installed on your machine, you will also need to download a recent JDK from here: https://www.oracle.com/java/technologies/javase-jdk15-downloads.html
Once you complete the setup, you can run the following, and you should observe a Firefox browser opening up:
driver <- rsDriver(browser=c("firefox"), port=4449L)
remDr <- driver[["client"]]
Next we can use the command navigate
to visit a URL. Let us visit the Yahoo! finance main web page (by running this code, you should be able to see the webpage in your Firefox browser):
$navigate("https://finance.yahoo.com") remDr
Assume that our goal is to be able to search for anything on Yahoo! finance. We first need to identify the tags of the search box.
Similar to the SelectorGadget Chrome extension, you can install the Selenium IDE extension on Chrome. This extension will allow you to record your actions and extract the necessary tags.
Function findElement allows us to identify the element that we are looking for. Here, the next line identifies the position of the Yahoo! finance search box in the main page:
search_box <- remDr$findElement(using = 'id', value = 'yfin-usr-qry')
Once we have identified the location of the search box, we can send to it any information we want to search for with the function sendKeysToElement()
. For instance, the ETF VTI:
search_box$sendKeysToElement(list("VTI"))
With the previous command, you should be able to see the letters VTI appearing in the search box. Mind-blowing right? :)
Now we will identify the search button tag, and then click on it with the function clickElement:
clickSearch <- remDr$findElement(using = 'css', value = "#header-desktop-search-button > .Cur\\(p\\)")
clickSearch$clickElement()
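Once the results page has loaded, we can hand the rendered HTML back to rvest and reuse everything we learned about static scraping. A sketch (getPageSource() is an RSelenium method that returns the current page's HTML in a list):

# Parse the browser's rendered page with rvest
pageSource = remDr$getPageSource()[[1]]
page = read_html(pageSource)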
Read more about RSelenium here: https://cran.r-project.org/web/packages/RSelenium/RSelenium.pdf
For comments, suggestions, errors, and typos, please email me at: kokkodis@bc.edu