Big data and Social Science

8.3 Lab 1: Scraping tables

We will start by loading the rvest package, which will help us scrape data from the web.

library(rvest)

The goal of this exercise is to scrape income inequality figures for a set of countries, and then clean them so that we can plot how this variable has evolved over time.

The first step is to read the html code from the website we want to scrape, using the read_html() function. If we want to see the html in text format, we can then use html_text().

url <- "https://en.wikipedia.org/wiki/List_of_countries_by_income_equality"
html <- read_html(url) # reading the html code into memory
html # not very informative
substr(html_text(html), 1, 1000) # first 1000 characters

To extract all the tables in the html code automatically, we use html_table(). Note that it returns a list of data frames, one per table on the page, so in order to work with this dataset we have to extract the relevant element of this list (the ninth one in the code below).

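tab <- html_table(html, fill=TRUE) # list with one data frame per table
# str(tab) # uncomment to inspect the structure of every table
data <- tab[[9]] # keep the ninth table in the list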
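
Which element of the list holds the table of interest depends on how the page is laid out at the time of scraping, so it is worth checking rather than relying on a fixed index. A minimal sketch of one way to do this, using a small made-up page built with rvest's minimal_html(). The page contents and the names toy_html, toy_tabs, idx and toy_data are illustrative assumptions, not the Wikipedia data:

```r
library(rvest)

# A toy page with two tables (made-up content for illustration)
toy_html <- minimal_html('
  <table>
    <tr><th>x</th></tr>
    <tr><td>1</td></tr>
  </table>
  <table>
    <tr><th>Country</th><th>1990</th><th>2000</th></tr>
    <tr><td>A</td><td>30.1</td><td>31.5</td></tr>
    <tr><td>B</td><td>45.2</td><td>44.0</td></tr>
  </table>')

toy_tabs <- html_table(toy_html)  # a list with one data frame per <table>
length(toy_tabs)                  # number of tables found
sapply(toy_tabs, ncol)            # columns per table, to spot the one we want

# Pick the first table that has a "Country" column
idx <- which(sapply(toy_tabs, function(tb) "Country" %in% names(tb)))[1]
toy_data <- toy_tabs[[idx]]
```

Selecting by a column name is only one possible rule; inspecting str(tab) by hand, as suggested above, works just as well for a one-off scrape.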

Now let’s clean the data so that we can use it for our analysis. We will also change the variable names so that they are easier to work with.

names(data)
names(data) <- c("Country", "1975", "1985", "1990", "1995", "2000", "2005", "2009")
data <- data[-1,] # delete first row
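
When a table carries a stray header row like the one deleted above, its columns are often read in as text rather than numbers. A minimal sketch of the same cleaning steps on a made-up data frame, with an added conversion to numeric. The object toy and its values are hypothetical, and whether the scraped table needs this conversion is an assumption worth checking with str(data):

```r
# Toy table whose first row repeats the header (made-up values)
toy <- data.frame(V1 = c("Country", "A", "B"),
                  V2 = c("1990", "30.1", "n/a"),
                  stringsAsFactors = FALSE)

names(toy) <- c("Country", "1990")  # rename the columns
toy <- toy[-1, ]                    # delete the stray header row

# Convert the year column from text to numbers; entries that are
# not numbers (such as "n/a") become NA
toy[["1990"]] <- suppressWarnings(as.numeric(toy[["1990"]]))
str(toy)
```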

We can also store this table as a .csv file.

write.csv(data, "www/data_inquality.csv")
# Where is this file stored?
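
As a hint for the question above: a relative path such as "www/..." is resolved from the current working directory, which getwd() reports, and write.csv() does not create missing folders. A minimal sketch under those assumptions, writing a made-up data frame to a temporary folder. The names toy, out_dir and out_file are illustrative:

```r
# Toy data frame standing in for the scraped table (made-up values)
toy <- data.frame(Country = c("A", "B"), `1990` = c(30.1, 45.2),
                  check.names = FALSE)

out_dir <- file.path(tempdir(), "www")     # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE)  # the folder must exist before writing
out_file <- file.path(out_dir, "toy_inequality.csv")

write.csv(toy, out_file, row.names = FALSE)  # skip the row-number column
getwd()                                      # relative paths are resolved from here

# Read the file back to confirm what was written
toy_back <- read.csv(out_file, check.names = FALSE)
```

Setting row.names = FALSE is a design choice: by default write.csv() adds the row names as an extra first column, which then shows up as an unnamed variable when the file is read back.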

And now we can plot this data to see how income inequality has changed over time across countries. Data over time is easier to plot once it has been converted to long format.

Q: What is long-format as opposed to wide-format?
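
To make the distinction concrete, here is a minimal sketch on made-up numbers. It uses tidyr's pivot_longer(), the newer counterpart of the gather() call used in this lab. The objects wide and long are illustrative:

```r
library(tidyr)

# Wide format: one row per country, one column per year (made-up values)
wide <- data.frame(Country = c("A", "B"),
                   `1990` = c(30.1, 45.2),
                   `2000` = c(31.5, 44.0),
                   check.names = FALSE)

# Long format: one row per country-year pair
long <- pivot_longer(wide, cols = -Country,
                     names_to = "time", values_to = "inequality")
long$time <- as.numeric(long$time)  # year labels arrive as text

dim(wide)  # 2 rows, 3 columns
dim(long)  # 4 rows, 3 columns
```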

library(plotly)
library(ggplot2)
library(tidyr)
library(dplyr) # provides mutate() and filter() used below

data.plot <- data %>% 
    gather(time, inequality, -Country) %>% # Conversion to long format
    mutate(time = as.numeric(time)) %>%
    filter(complete.cases(.)) # Delete missings

plot_ly(data = data.plot, 
       x = ~time, 
       y = ~inequality,
       type = 'scatter', 
       mode = 'lines',
       color = ~Country,
       visible = "legendonly") # lines start hidden; click a legend entry to show it

Q: Is that a good plot?
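
One way to think about this question is to compare the interactive plot with a static alternative. The sketch below draws a ggplot2 line chart on made-up long-format data. The object toy_long, its values and the axis labels are assumptions for illustration, not the scraped figures:

```r
library(ggplot2)

# Made-up long-format data: one row per country-year pair
toy_long <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  time = rep(c(1990, 2000, 2009), times = 2),
  inequality = c(30.1, 31.5, 32.0, 45.2, 44.0, 43.1)
)

# One semi-transparent line per country, without a colour legend
p <- ggplot(toy_long, aes(x = time, y = inequality, group = Country)) +
  geom_line(alpha = 0.5) +
  labs(x = "Year", y = "Inequality measure")
p
```

Dropping the per-country colours trades the ability to identify individual countries for a clearer view of the overall trend, which may or may not suit the question being asked.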