Chapter 5 Web Scraping and Functions
Previous sections looked at how to obtain data that was already loaded into R or how to use pre-made functions to retrieve data from websites like FanGraphs and Baseball Savant. In this section, we will explore how to scrape data from other websites you find (using the rvest package) and how to write your own functions to improve this process.
5.1 Basic Web Scraping
5.1.1 Example: Scraping Baseball-Reference Draft Data
Here is code that allows us to scrape data for the first round of the 2004 draft from Baseball Reference. The url refers to this webpage.
library(rvest)

url <- "https://www.baseball-reference.com/draft/?year_ID=2004&draft_round=1&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0"

html <- read_html(url)

first_2004 <- html %>%
  html_element("table") %>%
  html_table()
The rvest package includes the functions we are using here. This package is designed with functions that make data scraping easier to do in R.
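One small note: if rvest has never been installed on your machine, the library(rvest) call above will fail. A quick sketch of the one-time installation step (the only assumption here is that the package is not yet installed):

# One-time installation of the rvest package from CRAN
install.packages("rvest")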
In the code above, we are doing the following:
- Assign the link of the url to an object named url. This allows us to refer to the url in code without always needing to copy and paste it.
- Use the read_html() function to store the html code from the url. This function requires an internet connection to grab the html code from the website.
- The last three lines of code process the html code to get the data we want.
  - We name this data first_2004 and start with our html object.
  - The html code is piped into the html_element() function, which is told to find the desired output (in our case, a table).
  - This is then piped into the html_table() function, which converts the html code for our table into a data frame in R.
After running this code, you will have a data frame in your environment containing data from the entire first round of the 2004 MLB draft.
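Before moving on, it can help to confirm the scrape worked as expected. A quick sketch using base R functions; the exact row and column counts you see will depend on the page:

# Check how many rows (picks) and columns were scraped
dim(first_2004)

# Peek at the first few rows of the data frame
head(first_2004)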
Here is what the first few rows of the dataset should look like:
Year | Rnd | DT | OvPck | FrRnd | RdPck | Tm | Signed | Bonus | Name | Pos | WAR | G_Hitter | AB | HR | BA | OPS | G_Pitcher | W | L | ERA | WHIP | SV | Type | Drafted.Out.of |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2004 | 1 | NA | 1 | FrRnd | 1 | Padres | Y | $3,150,000 | Matt Bush (minors) | SS | 1.7 | 8 | 0 | 0 | NA | NA | 217 | 12 | 11 | 3.75 | 1.20 | 15 | HS | Mission Bay HS (San Diego, CA) |
2004 | 1 | NA | 2 | FrRnd | 2 | Tigers | Y | $3,120,000 | Justin Verlander (minors) | RHP | 80.9 | 24 | 50 | 0 | 0.100 | 0.200 | 509 | 257 | 141 | 3.24 | 1.12 | 0 | 4Yr | Old Dominion University (Norfolk, VA) |
2004 | 1 | NA | 3 | FrRnd | 3 | Mets | Y | $3,000,000 | Philip Humber (minors) | RHP | 0.9 | 9 | 11 | 0 | 0.091 | 0.182 | 97 | 16 | 23 | 5.31 | 1.42 | 0 | 4Yr | Rice University (Houston, TX) |
2004 | 1 | NA | 4 | FrRnd | 4 | Devil Rays | Y | $3,200,000 | Jeff Niemann (minors) | RHP | 4.3 | 9 | 13 | 0 | 0.077 | 0.154 | 97 | 40 | 26 | 4.08 | 1.29 | 0 | 4Yr | Rice University (Houston, TX) |
2004 | 1 | NA | 5 | FrRnd | 5 | Brewers | Y | $2,200,000 | Mark Rogers (minors) | RHP | 1.1 | 12 | 16 | 0 | 0.250 | 0.625 | 11 | 3 | 1 | 3.49 | 1.12 | 0 | HS | Mount Ararat School (Topsham, ME) |
2004 | 1 | NA | 6 | FrRnd | 6 | Indians | Y | $2,475,000 | Jeremy Sowers (minors) | LHP | 1.6 | 4 | 4 | 0 | 0.250 | 0.750 | 72 | 18 | 30 | 5.18 | 1.44 | 0 | 4Yr | Vanderbilt University (Nashville, TN) |
2004 | 1 | NA | 7 | FrRnd | 7 | Reds | Y | $2,300,000 | Homer Bailey (minors) | RHP | 6.2 | 208 | 373 | 0 | 0.164 | 0.375 | 245 | 81 | 86 | 4.56 | 1.37 | 0 | HS | La Grange HS (La Grange, TX) |
2004 | 1 | NA | 8 | FrRnd | 8 | Orioles | N | | Wade Townsend (minors) | RHP | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4Yr | Rice University (Houston, TX) |
2004 | 1 | NA | 9 | FrRnd | 9 | Rockies | Y | $2,150,000 | Chris Nelson (minors) | SS | -2.6 | 282 | 834 | 16 | 0.265 | 0.699 | NA | NA | NA | NA | NA | NA | HS | Redan HS (Stone Mountain, GA) |
2004 | 1 | NA | 10 | FrRnd | 10 | Rangers | Y | $2,025,000 | Thomas Diamond (minors) | RHP | -0.5 | 16 | 7 | 0 | 0.000 | 0.125 | 16 | 1 | 3 | 6.83 | 1.76 | 0 | 4Yr | University of New Orleans (New Orleans, LA) |
5.2 Writing Functions
The code above allowed us to scrape data for a single round from a single year's draft. If we wanted first-round data from multiple years, or multiple rounds from a single year, we would end up with very repetitive code.
One option is to write our own function to eliminate some of that repetition and make the process quicker. In the url from before, there were two specific parts that control which round and which year we gather our data from.
https://www.baseball-reference.com/draft/?year_ID=2004&draft_round=1&draft_type=junreg
To verify this, you could try replacing those two highlighted parts (the year and the round number) and visiting the resulting webpage in your browser. (Note: The end of the url is cropped out above in order to fit it on the page.)
To get data from any draft year/round we want, we can write a function that replaces those two parts with user-supplied values.
Here is some code that would do just that:
scrape_draft <- function(year, round) {
  require(rvest)
  
  url <- paste0("https://www.baseball-reference.com/draft/?year_ID=",
                year, "&draft_round=",
                round, "&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0")
  
  data <- url %>%
    read_html()
  
  draft_data <- data %>%
    html_element("table") %>%
    html_table()
  
  draft_data
}
Inside function(), we specify that our function will have two arguments (year and round). These correspond to the highlighted parts of the url from before. The paste0() function inserts these values into the url in the right places and stores the result as our url object. From here, we can do the same thing we did before to scrape data for a chosen year/round.
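To see what paste0() is producing, here is a quick sketch that builds the url for an arbitrary example year and round (2010 and 2, chosen purely for illustration) and prints it:

# The same url-building step with the values 2010 and 2 filled in
paste0("https://www.baseball-reference.com/draft/?year_ID=", 2010,
       "&draft_round=", 2,
       "&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0")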
The code below is an example of how we can use the function we wrote. Remember that our new scrape_draft() function has two arguments: year and round. Therefore, the code below uses the function to scrape Baseball Reference for the first round of the 2006 draft.
first_2006 <- scrape_draft(year = 2006, round = 1)
Below are the first few rows to show you that everything worked properly.
Year | Rnd | DT | OvPck | FrRnd | RdPck | Tm | Signed | Bonus | Name | Pos | WAR | G_Hitter | AB | HR | BA | OPS | G_Pitcher | W | L | ERA | WHIP | SV | Type | Drafted.Out.of |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2006 | 1 | NA | 1 | FrRnd | 1 | Royals | Y | $3,500,000 | Luke Hochevar (minors) | RHP | 3.7 | 18 | 16 | 0 | 0.063 | 0.125 | 279 | 46 | 65 | 4.98 | 1.34 | 3 | | |
2006 | 1 | NA | 2 | FrRnd | 2 | Rockies | Y | $3,250,000 | Greg Reynolds (minors) | RHP | -1.5 | 31 | 30 | 0 | 0.167 | 0.460 | 33 | 6 | 11 | 7.01 | 1.65 | 0 | 4Yr | Stanford University (Palo Alto, CA) |
2006 | 1 | NA | 3 | FrRnd | 3 | Devil Rays | Y | $3,000,000 | Evan Longoria (minors) | 3B | 58.6 | 1986 | 7306 | 342 | 0.264 | 0.804 | NA | NA | NA | NA | NA | NA | 4Yr | California State University, Long Beach (Long Beach, CA) |
2006 | 1 | NA | 4 | FrRnd | 4 | Pirates | Y | $2,750,000 | Brad Lincoln (minors) | RHP | 0.4 | 53 | 38 | 0 | 0.237 | 0.520 | 99 | 9 | 11 | 4.74 | 1.39 | 1 | 4Yr | University of Houston (Houston, TX) |
2006 | 1 | NA | 5 | FrRnd | 5 | Mariners | Y | $2,450,000 | Brandon Morrow (minors) | RHP | 11.1 | 115 | 24 | 0 | 0.000 | 0.040 | 334 | 51 | 43 | 3.96 | 1.31 | 40 | 4Yr | University of California, Berkeley (Berkeley, CA) |
2006 | 1 | NA | 6 | FrRnd | 6 | Tigers | Y | $3,550,000 | Andrew Miller (minors) | LHP | 7.8 | 185 | 74 | 0 | 0.054 | 0.108 | 612 | 55 | 55 | 4.03 | 1.34 | 63 | 4Yr | University of North Carolina at Chapel Hill (Chapel Hill, NC) |
2006 | 1 | NA | 7 | FrRnd | 7 | Dodgers | Y | $2,300,000 | Clayton Kershaw (minors) | LHP | 79.9 | 357 | 698 | 1 | 0.162 | 0.390 | 425 | 210 | 92 | 2.48 | 1.00 | 0 | HS | Highland Park HS (Dallas, TX) |
2006 | 1 | NA | 8 | FrRnd | 8 | Reds | Y | $2,000,000 | Drew Stubbs (minors) | OF | 7.9 | 911 | 2834 | 92 | 0.242 | 0.704 | NA | NA | NA | NA | NA | NA | 4Yr | University of Texas at Austin (Austin, TX) |
2006 | 1 | NA | 9 | FrRnd | 9 | Orioles | Y | $2,100,000 | Billy Rowell (minors) | 3B | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | HS | Bishop Eustace Preparatory School (Pennsauken, NJ) |
2006 | 1 | NA | 10 | FrRnd | 10 | Giants | Y | $2,025,000 | Tim Lincecum (minors) | RHP | 19.5 | 262 | 474 | 0 | 0.112 | 0.300 | 278 | 110 | 89 | 3.74 | 1.29 | 1 | 4Yr | University of Washington (Seattle, WA) |
Just like the code we made at the beginning of this section, we are able to obtain a dataset containing all of the players drafted in the first round of the 2006 draft.
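If you plan to reuse scrape_draft() often, you may also want a light check on the inputs so that a typo fails loudly instead of quietly returning an odd table. Here is one possible sketch; the wrapper name scrape_draft_checked() and the specific checks are illustrative choices, not part of the original function:

# A hypothetical wrapper that validates inputs before scraping
scrape_draft_checked <- function(year, round) {
  # Both arguments should be single numbers, and the round should be at least 1
  stopifnot(is.numeric(year), length(year) == 1,
            is.numeric(round), length(round) == 1, round >= 1)
  scrape_draft(year = year, round = round)
}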
5.2.1 Example: Completing the rest of the Draft Dataset
Now, let's finalize the draft dataset we used in the first section of this chapter. We want every first-round pick from the years 2004-2013. To do this, we can use the function we created to scrape each year of interest, and then use the rbind() function to put it all together.
first_2004 <- scrape_draft(year = 2004, round = 1)
first_2005 <- scrape_draft(year = 2005, round = 1)
first_2006 <- scrape_draft(year = 2006, round = 1)
first_2007 <- scrape_draft(year = 2007, round = 1)
first_2008 <- scrape_draft(year = 2008, round = 1)
first_2009 <- scrape_draft(year = 2009, round = 1)
first_2010 <- scrape_draft(year = 2010, round = 1)
first_2011 <- scrape_draft(year = 2011, round = 1)
first_2012 <- scrape_draft(year = 2012, round = 1)
first_2013 <- scrape_draft(year = 2013, round = 1)

all_draft <- rbind(first_2004, first_2005, first_2006, first_2007, first_2008,
                   first_2009, first_2010, first_2011, first_2012, first_2013)
As you can see, every first-round pick from 2004-2013 has been combined into a single dataset. In a later section, we will talk about how to create loops, which will make this process even faster.
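As a small preview of that idea, here is one possible sketch that uses lapply() instead of ten separate assignments; the Sys.sleep() call pauses between requests so we are not hitting the site too quickly (the three-second pause is an arbitrary choice):

# Scrape the first round for every year from 2004 to 2013
draft_list <- lapply(2004:2013, function(yr) {
  Sys.sleep(3)  # short pause between requests
  scrape_draft(year = yr, round = 1)
})

# Combine the list of data frames into one dataset, as rbind() did above
all_draft <- do.call(rbind, draft_list)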
Note: This data is currently stored at the GitHub link here.
5.3 More Difficult Web Scraping
In the first example about MLB drafts, finding the table was easy because there was only one table on the web page. However, you may find websites that have multiple tables on a page which will cause troubles when trying to get the correct one into R.
Let's explore the Korean Baseball Organization (KBO) Wikipedia page to learn more.
As you can see on this page, there are plenty of tables throughout the page. Let’s say that we want to use the table with each team’s stadium, capacity and year founded.
This Stanford resource is very informative and gives a more detailed example of using CSS selectors. We will work through the example on the KBO web page here.
Like the previous example, we need to store the url as an object in R. However, we will also need to find a CSS selector that corresponds to our specific table. To find the CSS selector, we have to examine the html code behind the web page. Below are the steps to do this:
- First, right-click on the table we decided to work with.
- Navigate to the Inspect option and click it.
- We will then see the code that created the webpage; this is where we will find what we need.
- Hover over each line of code in the new window until the line you are hovering over highlights the entire table. Note: It is likely that this line starts with the word "table".
- Then, right-click this line of code and choose the "Copy selector" option.
- Paste the copied selector into an object as shown below:
url <- "https://en.wikipedia.org/wiki/KBO_League"

css_selector <- "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(75)"
So far, we have put our url and our copied selector into objects above. At this point, our code will look much like the draft data scraping example shown earlier; the functions again come from the rvest package. The main change from the prior example is what is being put inside the html_element() function. Instead of "table", we will pass in our CSS selector, as shown below.
library(rvest)

KBO_data <- url %>%
  read_html() %>%
  html_element(css = css_selector) %>%
  html_table()
Using this process, we were able to choose a specific table on the webpage instead of defaulting to the first table. Below you can see the data we scraped from the Wikipedia page:
Team | City | Stadium | Capacity | Founded | Joined |
---|---|---|---|---|---|
Doosan Bears | Seoul | Jamsil Baseball Stadium | 25,000 | 1982 | 1982 |
Hanwha Eagles | Daejeon | Hanwha Life Eagles Park | 13,000 | 1985 | 1986 |
Kia Tigers | Gwangju | Gwangju-Kia Champions Field | 20,500 | 1982 | 1982 |
Kiwoom Heroes | Seoul | Gocheok Sky Dome | 16,744 | 2008 | 2008 |
KT Wiz | Suwon | Suwon kt wiz Park | 20,000 | 2013 | 2015 |
LG Twins | Seoul | Jamsil Baseball Stadium | 25,000 | 1982 | 1982 |
Lotte Giants | Busan | Sajik Baseball Stadium | 24,500 | 1975 | 1982 |
NC Dinos | Changwon | Changwon NC Park | 22,112 | 2011 | 2013 |
Samsung Lions | Daegu | Daegu Samsung Lions Park | 24,000 | 1982 | 1982 |
SSG Landers | Incheon | Incheon SSG Landers Field | 23,000 | 2000 | 2000 |
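One caution with this approach: selectors copied from Wikipedia, such as the table:nth-child(75) piece above, can break when the page is edited. An alternative sketch is to pull every table on the page with html_elements() and then pick the one you want by position after inspecting the list; the index used below is only a placeholder that you would need to verify yourself:

library(rvest)

# Grab every table on the KBO page as a list of data frames
all_tables <- url %>%
  read_html() %>%
  html_elements("table") %>%
  html_table()

# See how many tables there are, then keep the one you want
length(all_tables)
kbo_teams <- all_tables[[6]]  # placeholder index; check the list first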
5.4 Data Scraping Ethics
Now that we've finished the material on web scraping, it is important to discuss the ethics involved. For a more comprehensive look at the ethics of web scraping, R for Data Science is a great resource.
The legality and ethics behind web scraping are quite complicated. However, a good rule of thumb is to make sure that the data you are scraping is:
- Public
- Non-personal
- Accurate
When accessing a website, Terms and Conditions often pop up. These are a way for sites to assert some legal claim to the data on their pages. We should respect those terms, and if they prohibit scraping, we should not proceed.
Additionally, websites with personal data on their pages should not be scraped. Even when it is not strictly illegal, the ethics surrounding personal data are hazy, and scraping it should be avoided.
Finally, we must also be careful not to overload a website's servers. Scraping often means requesting a page many times in a short period, and most servers were not built for that kind of traffic. We should remain courteous by being mindful of how much, and how often, we access a webpage.
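In R, one practical way to follow these guidelines is to check a site's robots.txt file before scraping and to pause between repeated requests (as in the lapply() sketch earlier). The example below assumes the robotstxt package, available on CRAN, is installed:

library(robotstxt)

# Returns TRUE or FALSE depending on whether the site's robots.txt
# permits automated access to this path
paths_allowed("https://www.baseball-reference.com/draft/")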