Chapter 5 Web Scraping and Functions
Previous sections looked at how to obtain data that was already loaded into R or how to use pre-made functions to retrieve data from websites like FanGraphs and Baseball Savant. In this section, we will explore how to scrape data from other websites you find (using the rvest package) and how to write your own functions to improve this process.
5.1 Basic Web Scraping
5.1.1 Example: Scraping Baseball-Reference Draft Data
Here is code that allows us to scrape data for the first round of the 2004 draft from Baseball Reference. The url refers to this webpage.
library(rvest)

url <- "https://www.baseball-reference.com/draft/?year_ID=2004&draft_round=1&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0"

html <- read_html(url)

first_2004 <- html %>%
  html_element("table") %>%
  html_table()
The rvest package includes the functions we are using here. This package is designed with functions that make data scraping easier to do in R.
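One small note: if rvest has never been installed on your machine, the library(rvest) call above will fail. A quick sketch of the one-time installation step (the only assumption here is that the package is not yet installed):

# One-time installation of the rvest package from CRAN
install.packages("rvest")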
In the code above, we are doing the following:
- Assign the link of the url to an object named url. This allows us to refer to the url in code without always needing to copy and paste it.
- Use the read_html() function to store the html code from the url. This function requires an internet connection to grab the html code from the website.
- The last three lines of code process the html code to get the data we want.
  - We name this data first_2004 and start with our html object.
  - The html code is piped into the html_element() function, which is told to find the desired output (in our case, a table).
  - This is then piped into the html_table() function, which converts the html code for our table into a data frame in R.
After running this code, you will have a data frame in your environment containing data from the entire first round of the 2004 MLB draft.
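Before moving on, it can help to confirm the scrape worked as expected. A quick sketch using base R functions; the exact row and column counts you see will depend on the page:

# Check how many rows (picks) and columns were scraped
dim(first_2004)

# Peek at the first few rows of the data frame
head(first_2004)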
Here is what the first few rows of the dataset should look like:
Year | Rnd | DT | OvPck | FrRnd | RdPck | Tm | Signed | Bonus | Name | Pos | WAR | G_Hitter | AB | HR | BA | OPS | G_Pitcher | W | L | ERA | WHIP | SV | Type | Drafted.Out.of |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2004 | 1 | NA | 1 | FrRnd | 1 | Padres | Y | $3,150,000 | Matt Bush (minors) | SS | 1.7 | 8 | 0 | 0 | NA | NA | 217 | 12 | 11 | 3.75 | 1.20 | 15 | HS | Mission Bay HS (San Diego, CA) |
2004 | 1 | NA | 2 | FrRnd | 2 | Tigers | Y | $3,120,000 | Justin Verlander (minors) | RHP | 80.9 | 24 | 50 | 0 | 0.100 | 0.200 | 509 | 257 | 141 | 3.24 | 1.12 | 0 | 4Yr | Old Dominion University (Norfolk, VA) |
2004 | 1 | NA | 3 | FrRnd | 3 | Mets | Y | $3,000,000 | Philip Humber (minors) | RHP | 0.9 | 9 | 11 | 0 | 0.091 | 0.182 | 97 | 16 | 23 | 5.31 | 1.42 | 0 | 4Yr | Rice University (Houston, TX) |
2004 | 1 | NA | 4 | FrRnd | 4 | Devil Rays | Y | $3,200,000 | Jeff Niemann (minors) | RHP | 4.3 | 9 | 13 | 0 | 0.077 | 0.154 | 97 | 40 | 26 | 4.08 | 1.29 | 0 | 4Yr | Rice University (Houston, TX) |
2004 | 1 | NA | 5 | FrRnd | 5 | Brewers | Y | $2,200,000 | Mark Rogers (minors) | RHP | 1.1 | 12 | 16 | 0 | 0.250 | 0.625 | 11 | 3 | 1 | 3.49 | 1.12 | 0 | HS | Mount Ararat School (Topsham, ME) |
2004 | 1 | NA | 6 | FrRnd | 6 | Indians | Y | $2,475,000 | Jeremy Sowers (minors) | LHP | 1.6 | 4 | 4 | 0 | 0.250 | 0.750 | 72 | 18 | 30 | 5.18 | 1.44 | 0 | 4Yr | Vanderbilt University (Nashville, TN) |
2004 | 1 | NA | 7 | FrRnd | 7 | Reds | Y | $2,300,000 | Homer Bailey (minors) | RHP | 6.2 | 208 | 373 | 0 | 0.164 | 0.375 | 245 | 81 | 86 | 4.56 | 1.37 | 0 | HS | La Grange HS (La Grange, TX) |
2004 | 1 | NA | 8 | FrRnd | 8 | Orioles | N | | Wade Townsend (minors) | RHP | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4Yr | Rice University (Houston, TX) |
2004 | 1 | NA | 9 | FrRnd | 9 | Rockies | Y | $2,150,000 | Chris Nelson (minors) | SS | -2.6 | 282 | 834 | 16 | 0.265 | 0.699 | NA | NA | NA | NA | NA | NA | HS | Redan HS (Stone Mountain, GA) |
2004 | 1 | NA | 10 | FrRnd | 10 | Rangers | Y | $2,025,000 | Thomas Diamond (minors) | RHP | -0.5 | 16 | 7 | 0 | 0.000 | 0.125 | 16 | 1 | 3 | 6.83 | 1.76 | 0 | 4Yr | University of New Orleans (New Orleans, LA) |
5.2 Writing Functions
The code above allowed us to scrape data for a single round from a single year's draft. If we wanted first-round data from multiple years, or multiple rounds from a single year, we would end up with very repetitive code.
One option is to write our own function to eliminate some of that repetition and make the process quicker. In the url from before, there were two specific parts that control which round and which year we gather our data from.
https://www.baseball-reference.com/draft/?year_ID=2004&draft_round=1&draft_type=junreg
To verify this, you could try replacing those two highlighted parts (the year and the round number) and visiting the resulting webpage in your browser. (Note: The end of the url is cropped out above in order to fit it on the page.)
To get data from any draft year/round we want, we can write a function that replaces those two parts with user-supplied values.
Here is some code that would do just that:
scrape_draft <- function(year, round) {
  require(rvest)
  
  url <- paste0("https://www.baseball-reference.com/draft/?year_ID=",
                year, "&draft_round=",
                round, "&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0")
  
  data <- url %>%
    read_html()
  
  draft_data <- data %>%
    html_element("table") %>%
    html_table()
  
  draft_data
}
Inside function(), we specify that our function will have two arguments (year and round). These correspond to the highlighted parts of the url from before. The paste0() function inserts these values into the url in the right places and stores the result as our url object. From here, we can do the same thing we did before to scrape data for a chosen year/round.
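To see what paste0() is producing, here is a quick sketch that builds the url for an arbitrary example year and round (2010 and 2, chosen purely for illustration) and prints it:

# The same url-building step with the values 2010 and 2 filled in
paste0("https://www.baseball-reference.com/draft/?year_ID=", 2010,
       "&draft_round=", 2,
       "&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0")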
The code below is an example of how we can use the function we wrote. Remember that our new scrape_draft() function has two arguments: year and round. Therefore, the code below uses the function to scrape Baseball Reference for the first round of the 2006 draft.
first_2006 <- scrape_draft(year = 2006, round = 1)
Below are the first few rows to show you that everything worked properly.
Year | Rnd | DT | OvPck | FrRnd | RdPck | Tm | Signed | Bonus | Name | Pos | WAR | G_Hitter | AB | HR | BA | OPS | G_Pitcher | W | L | ERA | WHIP | SV | Type | Drafted.Out.of |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2006 | 1 | NA | 1 | FrRnd | 1 | Royals | Y | $3,500,000 | Luke Hochevar (minors) | RHP | 3.7 | 18 | 16 | 0 | 0.063 | 0.125 | 279 | 46 | 65 | 4.98 | 1.34 | 3 | | |
2006 | 1 | NA | 2 | FrRnd | 2 | Rockies | Y | $3,250,000 | Greg Reynolds (minors) | RHP | -1.5 | 31 | 30 | 0 | 0.167 | 0.460 | 33 | 6 | 11 | 7.01 | 1.65 | 0 | 4Yr | Stanford University (Palo Alto, CA) |
2006 | 1 | NA | 3 | FrRnd | 3 | Devil Rays | Y | $3,000,000 | Evan Longoria (minors) | 3B | 58.6 | 1986 | 7306 | 342 | 0.264 | 0.804 | NA | NA | NA | NA | NA | NA | 4Yr | California State University, Long Beach (Long Beach, CA) |
2006 | 1 | NA | 4 | FrRnd | 4 | Pirates | Y | $2,750,000 | Brad Lincoln (minors) | RHP | 0.4 | 53 | 38 | 0 | 0.237 | 0.520 | 99 | 9 | 11 | 4.74 | 1.39 | 1 | 4Yr | University of Houston (Houston, TX) |
2006 | 1 | NA | 5 | FrRnd | 5 | Mariners | Y | $2,450,000 | Brandon Morrow (minors) | RHP | 11.1 | 115 | 24 | 0 | 0.000 | 0.040 | 334 | 51 | 43 | 3.96 | 1.31 | 40 | 4Yr | University of California, Berkeley (Berkeley, CA) |
2006 | 1 | NA | 6 | FrRnd | 6 | Tigers | Y | $3,550,000 | Andrew Miller (minors) | LHP | 7.8 | 185 | 74 | 0 | 0.054 | 0.108 | 612 | 55 | 55 | 4.03 | 1.34 | 63 | 4Yr | University of North Carolina at Chapel Hill (Chapel Hill, NC) |
2006 | 1 | NA | 7 | FrRnd | 7 | Dodgers | Y | $2,300,000 | Clayton Kershaw (minors) | LHP | 79.9 | 357 | 698 | 1 | 0.162 | 0.390 | 425 | 210 | 92 | 2.48 | 1.00 | 0 | HS | Highland Park HS (Dallas, TX) |
2006 | 1 | NA | 8 | FrRnd | 8 | Reds | Y | $2,000,000 | Drew Stubbs (minors) | OF | 7.9 | 911 | 2834 | 92 | 0.242 | 0.704 | NA | NA | NA | NA | NA | NA | 4Yr | University of Texas at Austin (Austin, TX) |
2006 | 1 | NA | 9 | FrRnd | 9 | Orioles | Y | $2,100,000 | Billy Rowell (minors) | 3B | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | HS | Bishop Eustace Preparatory School (Pennsauken, NJ) |
2006 | 1 | NA | 10 | FrRnd | 10 | Giants | Y | $2,025,000 | Tim Lincecum (minors) | RHP | 19.5 | 262 | 474 | 0 | 0.112 | 0.300 | 278 | 110 | 89 | 3.74 | 1.29 | 1 | 4Yr | University of Washington (Seattle, WA) |
Just like the code we made at the beginning of this section, we are able to obtain a dataset containing all of the players drafted in the first round of the 2006 draft.
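If you plan to reuse scrape_draft() often, you may also want a light check on the inputs so that a typo fails loudly instead of quietly returning an odd table. Here is one possible sketch; the wrapper name scrape_draft_checked() and the specific checks are illustrative choices, not part of the original function:

# A hypothetical wrapper that validates inputs before scraping
scrape_draft_checked <- function(year, round) {
  # Both arguments should be single numbers, and the round should be at least 1
  stopifnot(is.numeric(year), length(year) == 1,
            is.numeric(round), length(round) == 1, round >= 1)
  scrape_draft(year = year, round = round)
}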
5.2.1 Example: Completing the rest of the Draft Dataset
Now, let's finalize the draft dataset we used in the first section of this chapter. We want every first-round pick from the years 2004-2013. To do this, we can use the function we created to scrape each year of interest, and then use the rbind() function to put it all together.
first_2004 <- scrape_draft(year = 2004, round = 1)
first_2005 <- scrape_draft(year = 2005, round = 1)
first_2006 <- scrape_draft(year = 2006, round = 1)
first_2007 <- scrape_draft(year = 2007, round = 1)
first_2008 <- scrape_draft(year = 2008, round = 1)
first_2009 <- scrape_draft(year = 2009, round = 1)
first_2010 <- scrape_draft(year = 2010, round = 1)
first_2011 <- scrape_draft(year = 2011, round = 1)
first_2012 <- scrape_draft(year = 2012, round = 1)
first_2013 <- scrape_draft(year = 2013, round = 1)

all_draft <- rbind(first_2004, first_2005, first_2006, first_2007, first_2008,
                   first_2009, first_2010, first_2011, first_2012, first_2013)
As you can see, every first-round pick from 2004-2013 has been combined into a single dataset. In a later section, we will talk about how to create loops, which will make this process even faster.
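As a small preview of that idea, here is one possible sketch that uses lapply() instead of ten separate assignments; the Sys.sleep() call pauses between requests so we are not hitting the site too quickly (the three-second pause is an arbitrary choice):

# Scrape the first round for every year from 2004 to 2013
draft_list <- lapply(2004:2013, function(yr) {
  Sys.sleep(3)  # short pause between requests
  scrape_draft(year = yr, round = 1)
})

# Combine the list of data frames into one dataset, as rbind() did above
all_draft <- do.call(rbind, draft_list)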
Note: This data is currently stored at the GitHub link here.
5.3 More Difficult Web Scraping
In the first example about MLB drafts, finding the table was easy because there was only one table on the web page. However, you may find websites that have multiple tables on a page which will cause troubles when trying to get the correct one into R.
Let's explore the Korean Baseball Organization (KBO) Wikipedia page to learn more.
As you can see on this page, there are plenty of tables throughout the page. Let’s say that we want to use the table with each team’s stadium, capacity and year founded.
This Stanford resource is very informative and gives a more detailed example of using CSS selectors. We will work through the example on the KBO web page here.
Like the previous example, we need to store the url as an object in R. However, we will also need to find a CSS selector that corresponds to our specific table. To find the CSS selector, we have to examine the html code behind the web page. Below are the steps to do this:
- First, right-click on the table we decided to work with.
- Navigate to the Inspect option and click it.
- We will then see the code that created the webpage; this is where we will find what we need.
- Hover over each line of code in the new window until the line you are hovering over highlights the entire table. Note: It is likely that this line starts with the word "table".
- Then, right-click this line of code and choose the "Copy selector" option.
- Paste the copied selector into an object as shown below:
url <- "https://en.wikipedia.org/wiki/KBO_League"

css_selector <- "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(75)"
So far, we have put our url and our copied selector into objects above. At this point, our code will look much like the draft data scraping example shown earlier; the functions again come from the rvest package. The main change from the prior example is what is being put inside the html_element() function. Instead of "table", we will pass in our CSS selector, as shown below.
library(rvest)

KBO_data <- url %>%
  read_html() %>%
  html_element(css = css_selector) %>%
  html_table()
Using this process, we were able to choose a specific table on the webpage instead of defaulting to the first table. Below you can see the data we scraped from the Wikipedia page:
Team | City | Stadium | Capacity | Founded | Joined |
---|---|---|---|---|---|
Doosan Bears | Seoul | Jamsil Baseball Stadium | 25,000 | 1982 | 1982 |
Hanwha Eagles | Daejeon | Hanwha Life Eagles Park | 13,000 | 1985 | 1986 |
Kia Tigers | Gwangju | Gwangju-Kia Champions Field | 20,500 | 1982 | 1982 |
Kiwoom Heroes | Seoul | Gocheok Sky Dome | 16,744 | 2008 | 2008 |
KT Wiz | Suwon | Suwon kt wiz Park | 20,000 | 2013 | 2015 |
LG Twins | Seoul | Jamsil Baseball Stadium | 25,000 | 1982 | 1982 |
Lotte Giants | Busan | Sajik Baseball Stadium | 24,500 | 1975 | 1982 |
NC Dinos | Changwon | Changwon NC Park | 22,112 | 2011 | 2013 |
Samsung Lions | Daegu | Daegu Samsung Lions Park | 24,000 | 1982 | 1982 |
SSG Landers | Incheon | Incheon SSG Landers Field | 23,000 | 2000 | 2000 |
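One caution with this approach: selectors copied from Wikipedia, such as the table:nth-child(75) piece above, can break when the page is edited. An alternative sketch is to pull every table on the page with html_elements() and then pick the one you want by position after inspecting the list; the index used below is only a placeholder that you would need to verify yourself:

library(rvest)

# Grab every table on the KBO page as a list of data frames
all_tables <- url %>%
  read_html() %>%
  html_elements("table") %>%
  html_table()

# See how many tables there are, then keep the one you want
length(all_tables)
kbo_teams <- all_tables[[6]]  # placeholder index; check the list first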
5.4 Data Scraping Ethics
Now that we've finished the material on web scraping, it is important to discuss the ethics involved. For a more comprehensive look at the ethics of web scraping, R for Data Science is a great resource.
The legality and ethics behind web scraping are quite complicated. However, a good rule of thumb is to make sure that the data you are scraping is:
- Public
- Non-personal
- Accurate
When accessing a website, Terms and Conditions often pop up. These are a way for sites to assert some legal claim to the data on their pages. We should respect those terms, and if they prohibit scraping, we should not proceed.
Additionally, websites with personal data on their pages should not be scraped. Even when it is not strictly illegal, the ethics surrounding personal data are hazy, and scraping it should be avoided.
Finally, we must also be careful not to overload a website's servers. Scraping often means requesting a page many times in a short period, and most servers were not built for that kind of traffic. We should remain courteous by being mindful of how much, and how often, we access a webpage.
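In R, one practical way to follow these guidelines is to check a site's robots.txt file before scraping and to pause between repeated requests (as in the lapply() sketch earlier). The example below assumes the robotstxt package, available on CRAN, is installed:

library(robotstxt)

# Returns TRUE or FALSE depending on whether the site's robots.txt
# permits automated access to this path
paths_allowed("https://www.baseball-reference.com/draft/")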