12.2 Scraping from the Web

We will use the rvest package to scrape data directly from the web. Before scraping, it helps to identify which page elements to extract, and a small helper tool makes this easier. We will use SelectorGadget, an extension for the Chrome browser.

Search for "SelectorGadget" with any internet search engine, then download and install the extension. The tool is easy to use: the first click selects an area of the page, and each subsequent click includes or excludes elements.
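What SelectorGadget returns is a CSS selector, and the same selector can be passed to rvest. The sketch below shows how a narrow and a broad selector differ. The HTML fragment, its class names and its contents are invented for illustration and do not come from any real page.

```r
library(rvest)

# Hypothetical page fragment, invented for illustration only.
page <- read_html(paste0(
  '<div class="staff">',
  '<p class="name">Alice</p>',
  '<p class="name">Bob</p>',
  '<p class="note">On leave</p>',
  '</div>'))

# ".name" matches the two name paragraphs but not the note.
staff_names <- html_text(html_nodes(page, ".name"))

# "p" is broader and also picks up the note, the kind of unwanted
# match that a further click in SelectorGadget would exclude.
all_p <- html_text(html_nodes(page, "p"))
```

Comparing staff_names with all_p shows why it is worth refining the selector before extracting anything.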

To install and load the rvest package, we use the following code:

install.packages("rvest")
library(rvest)

12.2.1 Wikipedia Table

We will do two scraping exercises:

  1. scrape a table from Wikipedia, and
  2. scrape from an unfriendly website.

The following code extracts the Student's t-distribution table from Wikipedia. Using SelectorGadget, we can see that the table is matched by the selector .wikitable. We extract the matching nodes with html_nodes() and then parse the HTML of the first one into a data frame with html_table().

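# Build the URL in two pieces to keep the lines short.
link <- paste0("https://en.wikipedia.org/wiki/",
               "Student%27s_t-distribution")
webpage <- read_html(link)
# Select every node with the class "wikitable".
data <- html_nodes(webpage, ".wikitable")
# Parse the first match; header = FALSE keeps the header row as data.
table <- html_table(data[[1]], header = FALSE)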
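Because header = FALSE keeps the header cells as an ordinary first row, it can be useful to check how many nodes matched and to promote that row to column names afterwards. The following is a minimal sketch of that step on a small inline table. The table contents and the column names df and t_95 are made up for illustration and are not the Wikipedia data.

```r
library(rvest)

# Hypothetical two-column table, invented for illustration.
html <- paste0(
  '<table class="wikitable">',
  '<tr><th>df</th><th>t_95</th></tr>',
  '<tr><td>1</td><td>6.314</td></tr>',
  '<tr><td>2</td><td>2.920</td></tr>',
  '</table>')
page <- read_html(html)

# Check how many nodes the selector matched before indexing with [[1]].
nodes <- html_nodes(page, ".wikitable")
n_tables <- length(nodes)

# With header = FALSE the header cells arrive as the first data row.
raw <- as.data.frame(html_table(nodes[[1]], header = FALSE))

# Promote that row to column names and drop it from the data.
tidy <- raw[-1, ]
names(tidy) <- as.character(unlist(raw[1, ]))
```

On a real page the selector may match several tables, so inspecting n_tables first helps to pick the right index.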

12.2.2 Other Websites

To scrape unstructured data, we first need to find the right selector with SelectorGadget. We can then read the matched elements as text.

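link <- paste0("http://www.fas.nus.edu.sg/ecs/",
               "people/staff.html")
webpage <- read_html(link)
# Select the table cells found with SelectorGadget.
data <- html_nodes(webpage, "br+ table td")
# Extract the text of each cell as a character vector.
content <- html_text(data)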
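Text read this way can carry stray whitespace from the page source, such as carriage returns and tabs at the end of a cell. A minimal sketch of one way to remove it is to pass trim = TRUE to html_text(). The cell contents below are invented for illustration.

```r
library(rvest)

# Hypothetical cells; the first one ends with a carriage return,
# a newline and two tabs.
page <- read_html(paste0(
  "<table><tr>",
  "<td>Ms Example NAME\r\n\t\t</td>",
  "<td>1234</td>",
  "</tr></table>"))
cells <- html_nodes(page, "td")

# Default extraction keeps the whitespace inside each cell.
raw_text <- html_text(cells)

# trim = TRUE strips leading and trailing whitespace from each element.
clean_text <- html_text(cells, trim = TRUE)
```

Applying trimws() to an already extracted character vector is an alternative when the nodes are no longer at hand.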

Then we can transform the extracted text into a data frame.

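# Fill a five-column matrix row by row, then use the first row as column names.
df <- data.frame(matrix(content, ncol = 5, byrow = TRUE),
                 stringsAsFactors = FALSE)
colnames(df) <- df[1, ]
# Print the table without the header row (df itself is not modified).
df[-1, ]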
##                           Title                       Name       Tel
## 2                       Manager    Ms PAK Ming Foon, Ginny 6516 3956
## 3                       Manager              Ms Nicky KHEH 6516 4878
## 4                       Manager                Ms WEI Qing 6516 8909
## 5             Assistant Manager          Ms WOON Swee Yoke 6516 6027
## 6             Assistant Manager            Ms NEO Seok Min 6516 3941
## 7                     Executive            Ms TAN Pei Ying 6601 3508
## 8  Management Assistant Officer           Ms CHEE Lee Kuen 6516 3942
## 9  Management Assistant Officer Ms Fatimah AHMAD\r\n\t\t\t\t   6516 3950
## 10 Management Assistant Officer           Ms Salinah ZUBER 6516 3958
## 11 Management Assistant Officer            Ms Diana ISMAIL 6516 6013
## 12 Management Assistant Officer          Mdm TAN Leng Choo 6516 1304
##       Email                        Main Area
## 2   ecspmfg                    Undergraduate
## 3    ecsklc                         Graduate
## 4      weiq   Graduate (Master of Economics)
## 5    ecswsy                    Undergraduate
## 6    ecssec        Head's Personal Assistant
## 7  pei.ying              Department seminars
## 8    ecsclk                      Timetabling
## 9     ecsfa Undergraduate (levels 1000-2000)
## 10    ecssz Undergraduate (levels 3000-4000)
## 11    ecsdi            Graduate (Coursework)
## 12   ecstlc              Graduate (Research)
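# Drop the header row, which is still stored as the first data row,
# and reset the row names.
df <- df[-1, ]
row.names(df) <- NULL
head(df[2:3], n = 3)
##                      Name       Tel
## 1 Ms PAK Ming Foon, Ginny 6516 3956
## 2           Ms Nicky KHEH 6516 4878
## 3             Ms WEI Qing 6516 8909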
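The reshaping step assumes that the number of scraped cells is an exact multiple of the number of columns. A minimal sketch of a helper that checks this before building the data frame is given below. The function name cells_to_df, its arguments and the sample cells are hypothetical and only serve as illustration.

```r
# Hypothetical helper: turn a flat vector of cell texts into a data
# frame, using the first n_col cells as the column names.
cells_to_df <- function(cells, n_col) {
  # Without this check, matrix() would recycle values with only a warning.
  stopifnot(length(cells) %% n_col == 0)
  m <- matrix(cells, ncol = n_col, byrow = TRUE)
  # Drop the header row from the data and keep it as column names.
  df <- data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
  colnames(df) <- m[1, ]
  df
}

# Made-up cells standing in for scraped text.
cells <- c("Name", "Tel",
           "A. Example", "0000 0001",
           "B. Example", "0000 0002")
staff <- cells_to_df(cells, n_col = 2)
```

If the page layout changes and a cell goes missing, the helper stops with an error instead of returning a silently misaligned table.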