8.3 Lab 1: Scraping tables
We will start by loading the rvest package, which will help us scrape data from the web.
library(rvest)
The goal of this exercise is to scrape income inequality figures for a set of countries, and then clean them so that we can generate a plot showing how this variable evolves over time.
The first step is to read the html code from the website we want to scrape, using the read_html() function. If we want to see the html in text format, we can then use html_text().
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_income_equality"
html <- read_html(url) # reading the html code into memory
html # not very informative
substr(html_text(html), 1, 1000) # first 1000 characters
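To see what these two functions do without a network connection, a minimal sketch on a small inline page may help. The HTML string below is made up for illustration, and the sketch assumes rvest 1.0 or later (for html_element()); read_html() accepts a literal HTML string as well as a URL.

```r
library(rvest)

# Made-up page, used only for illustration
page <- read_html("<html><body><h1>Income inequality</h1><p>Gini by country.</p></body></html>")

class(page)                           # a parsed document object, not plain text
html_text(page)                       # text of the whole page as one string
html_text(html_element(page, "h1"))   # text of a single element
```

Printing the parsed object shows the node structure, while html_text() strips the tags and keeps only the text.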
To extract all the tables in the html code automatically, we use html_table(). Note that it returns a list of data frames, so in order to work with this dataset, we will have to subset the element of this list that holds the table we want (the ninth one in the code below; the position depends on the page's current layout).
tab <- html_table(html, fill=TRUE)
# str(tab)
data <- tab[[9]]
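Because the position of a table in the list can change whenever the page is edited, it may be safer to inspect the list before hard-coding an index. The following is a sketch on a made-up page with two tables, not the Wikipedia page itself; the idea of picking the table by the presence of a "Country" column is an assumption for illustration.

```r
library(rvest)

# Made-up page: the first table is clutter, the second has the shape we want
page <- read_html(paste0(
  "<table><tr><th>Note</th></tr><tr><td>Navigation box</td></tr></table>",
  "<table>",
  "<tr><th>Country</th><th>2000</th><th>2005</th></tr>",
  "<tr><td>A</td><td>30.1</td><td>31.4</td></tr>",
  "<tr><td>B</td><td>45.0</td><td>44.2</td></tr>",
  "</table>"
))

tabs <- html_table(page)
length(tabs)          # number of tables found
sapply(tabs, ncol)    # number of columns in each table

# Keep the first table that has a "Country" column
has_country <- sapply(tabs, function(tb) "Country" %in% names(tb))
toy <- tabs[[which(has_country)[1]]]
toy
```

On the real page, uncommenting str(tab) serves the same purpose: it shows the dimensions and column names of every table so the right one can be identified.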
Now let’s clean the data so that we can use it for our analysis. We will also change the variable names so that they are easier to work with.
names(data)
names(data) <- c("Country", "1975", "1985", "1990", "1995", "2000", "2005", "2009")
data <- data[-1,] # delete first row
We can also store this table as a .csv file.
write.csv(data, "www/data_inquality.csv")
# Where is this file stored?
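One way to check where a file ends up is sketched below. Relative paths such as "www/..." are resolved against the current working directory, and write.csv() does not create missing folders. The folder and file names here are hypothetical, and the sketch writes to a temporary directory so that nothing in the project is overwritten.

```r
# Relative paths are resolved against the current working directory
getwd()

# Hypothetical output folder inside a temporary directory
out_dir <- file.path(tempdir(), "www")
dir.create(out_dir, showWarnings = FALSE)   # create the folder if it is missing

# Made-up data, for illustration only
toy <- data.frame(Country = c("A", "B"), `2000` = c(30.1, 45.0), check.names = FALSE)

out_file <- file.path(out_dir, "toy_inequality.csv")
write.csv(toy, out_file, row.names = FALSE)

normalizePath(out_file)   # full path of the file just written
file.exists(out_file)     # confirms that the file is there
```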
And now we can plot this data to see how income inequality has changed over time across countries. It’s easier to plot data over time once it’s converted to long format.
Q: What is long format as opposed to wide format?
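As a small illustration of the difference, the sketch below reshapes a made-up two-country table with gather() from tidyr. The values are invented and only the shape matters.

```r
library(tidyr)

# Made-up values, for illustration only
wide <- data.frame(Country = c("A", "B"),
                   `2000` = c(30.1, 45.0),
                   `2005` = c(31.4, 44.2),
                   check.names = FALSE)

# Wide format: one row per country, one column per year
wide

# Long format: one row per country-year pair
long <- gather(wide, time, inequality, -Country)
long
```

In the long version the years become values of a single time column, which is the layout that plotting functions usually expect when mapping one variable to the x axis.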
library(plotly)
library(ggplot2)
library(tidyr)
library(dplyr) # mutate() and filter() used below
data.plot <- data %>%
  gather(time, inequality, -Country) %>% # Conversion to long format
  mutate(time = as.numeric(time)) %>%
  filter(complete.cases(.)) # Delete missings
plot_ly(data = data.plot,
        x = ~time,
        y = ~inequality,
        type = 'scatter',
        mode = 'lines',
        color = ~Country,
        visible = "legendonly") # traces start hidden; click a legend entry to show it
Q: Is that a good plot?