5.3 Lab: Scraping tables
The goal of this lab is to scrape income inequality figures across countries and then clean them, so that we can generate a plot showing how this variable evolves over time.
We will start by loading the rvest package, which will help us scrape data from the web.
library(rvest)
The first step is to read the html code from the website we want to scrape, using the read_html() function. If we want to see the html in text format, we can then use html_text().
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_income_equality"
html <- read_html(url) # reading the html code into memory
html # not very informative
## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
## [1] "List of countries by income equality - Wikipediadocument.documentElement.className=\"client-js\";RLCONF={\"wgBreakFrames\":!1,\"wgSeparatorTransformTable\":[\"\",\"\"],\"wgDigitTransformTable\":[\"\",\"\"],\"wgDefaultDateFormat\":\"dmy\",\"wgMonthNames\":[\"\",\"January\",\"February\",\"March\",\"April\",\"May\",\"June\",\"July\",\"August\",\"September\",\"October\",\"November\",\"December\"],\"wgRequestId\":\"0ab9a48d-11e2-4f55-b00e-c61e6f3a89d3\",\"wgCSPNonce\":!1,\"wgCanonicalNamespace\":\"\",\"wgCanonicalSpecialPageName\":!1,\"wgNamespaceNumber\":0,\"wgPageName\":\"List_of_countries_by_income_equality\",\"wgTitle\":\"List of countries by income equality\",\"wgCurRevisionId\":1018722763,\"wgRevisionId\":1018722763,\"wgArticleId\":2249026,\"wgIsArticle\":!0,\"wgIsRedirect\":!1,\"wgAction\":\"view\",\"wgUserName\":null,\"wgUserGroups\":[\"*\"],\"wgCategories\":[\"Webarchive template wayback links\",\"CS1 errors: missing periodical\",\"Articles with short description\",\"Short description is different from Wikidata\",\"Wikipedia articles in need of updating from December 2020\",\n\"All W"
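To see what html_text() does without wading through a full Wikipedia page, here is a minimal sketch on a tiny inline document. The HTML string and the names toy_html and toy_text are made up for illustration and are not part of the lab data.

```r
library(rvest)

# Toy page, invented for illustration; not the Wikipedia page used in the lab.
# read_html() treats a string containing "<" as literal HTML rather than a URL.
toy_html <- read_html(
  "<html><body><h1>Income equality</h1><p>Hello, table scraping.</p></body></html>"
)

# html_text() drops the tags and returns the text content as a character string
toy_text <- html_text(toy_html)
toy_text
```

On a real page the same call also returns the contents of script and style elements, which is why the output above is so hard to read.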
To extract all the tables in the html code automatically, we use html_table(). Note that it returns a list of data frames, so in order to work with this dataset, we will have to subset the third element of this list.
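The list-then-subset pattern can be sketched on a small inline page with two tables. Everything here (toy_page, toy_tables, toy_data, the country labels and the numbers) is hypothetical and only meant to show the shape of the result.

```r
library(rvest)

# Toy page with two tables; all values are invented for illustration
toy_page <- read_html("
<html><body>
<table>
  <tr><th>Item</th><th>Value</th></tr>
  <tr><td>x</td><td>1</td></tr>
</table>
<table>
  <tr><th>Country</th><th>mid-70s</th><th>mid-80s</th></tr>
  <tr><td>Country A</td><td>0.30</td><td>0.31</td></tr>
  <tr><td>Country B</td><td>0.25</td><td>0.27</td></tr>
</table>
</body></html>")

# Called on a whole document, html_table() returns one data frame per <table>
toy_tables <- html_table(toy_page)
length(toy_tables)

# Pick the table we want by its position in the list
toy_data <- toy_tables[[2]]
names(toy_data)
```

In the lab itself the analogous step would presumably look like data <- html_table(html)[[3]], with the index 3 taken from the text above. The position of the table may change if the Wikipedia page is edited, so it is worth inspecting the list before subsetting.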
Now let’s clean the data so that we can use it for our analysis. We will also change the variable names so that they are easier to work with.
## [1] "Country" "mid-70s" "mid-80s" "~ 1990" "mid-90s"
## [6] "~ 2000" "mid-2000s" "Late 2000s"
names(data) <- c("Country", "1975", "1985", "1990", "1995", "2000", "2005", "2009")
data <- data[-1,] # delete first row
We can also store this table as a .csv file.
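A minimal sketch of that step, using a toy stand-in for the cleaned table so the snippet runs on its own. The data frame toy, its values, and the file name inequality.csv are assumptions for illustration; in the lab one would pass data instead.

```r
# Toy stand-in for the cleaned table; the values are invented for illustration
toy <- data.frame(Country = c("Country A", "Country B"),
                  `1975` = c(0.30, 0.25),
                  `1985` = c(0.31, 0.27),
                  check.names = FALSE)

# Hypothetical file name, written to a temporary directory
csv_path <- file.path(tempdir(), "inequality.csv")

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(toy, file = csv_path, row.names = FALSE)

# check.names = FALSE keeps "1975" from being renamed to "X1975" on the way back in
toy_back <- read.csv(csv_path, check.names = FALSE)
names(toy_back)
```

The check.names argument matters here because the renamed columns are years, which are not syntactically valid R names.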
Now we can plot this data to see how income inequality has changed over time across countries. Data over time is easier to plot once it has been converted to long format.
Q: What is long-format as opposed to wide-format?
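As a hint, the difference can be seen on a toy table. The sketch below uses tidyr's pivot_longer(), which is swapped in here for illustration and does the same job as gather(). The data frame toy_wide and its values are invented.

```r
library(tidyr)

# Wide format: one row per country, one column per time point
toy_wide <- data.frame(Country = c("Country A", "Country B"),
                       `1975` = c(0.30, 0.25),
                       `1985` = c(0.31, 0.27),
                       check.names = FALSE)

# Long format: one row per country-time pair
toy_long <- pivot_longer(toy_wide,
                         cols = -Country,
                         names_to = "time",
                         values_to = "inequality")
toy_long$time <- as.numeric(toy_long$time)  # column names arrive as strings
toy_long
```

The two-row wide table becomes a four-row long table with the columns Country, time and inequality, which is the shape most plotting functions expect when mapping a variable to an axis or a colour.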
library(plotly)
library(ggplot2)
library(tidyr)
library(dplyr) # provides %>%, mutate() and filter()
data.plot <- data %>%
  gather(time, inequality, -Country) %>% # Conversion to long format
  mutate(time = as.numeric(time)) %>% # Years were column names, so they arrive as strings
  filter(complete.cases(.)) # Delete missings
plot_ly(data = data.plot,
        x = ~time,
        y = ~inequality,
        type = 'scatter',
        mode = 'lines',
        color = ~Country,
        visible = "legendonly") # traces start hidden; click a legend entry to show it
Q: Is that a good plot?