5.5 Lab: Scraping (more) tables
The goal of this lab is to scrape population numbers across cities in the U.S. This lab also contains a little bit of data cleaning.
When there are many tables on a website, scraping them becomes a bit more complicated. Let’s work through a common scenario: scraping a table from Wikipedia with a list of the most populated cities in the United States.
library(rvest)
library(xml2)
url <- 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population'
html <- read_html(url)
tables <- html_table(html, fill=TRUE) # What does "fill" do? ?html_table
length(tables)
## [1] 12
The function now returns a list of 12 different tables. We had to use the option fill=TRUE because some of the tables appear to have incomplete rows.
In this case, identifying the part of the html code that contains the table is a better approach. To do so, let’s take a look at the source code of the website. In Google Chrome, go to View > Developer > View Source. All browsers should have similar options to view the source code of a website.
In the source code, search for text that appears in the table (e.g. 2019 rank). Right above it you will see: <table class="wikitable sortable" ...>. These are the CSS classes of the element that contains the table, and the wikitable class is what we will use to select it. (You can also find this information by right-clicking anywhere on the table and choosing Inspect.)
Now that we know what we’re looking for, let’s use html_nodes() to identify all the elements of the page that have that CSS class. (Note that we need to add a dot before the name of the class to indicate it is a CSS class, not a tag name.)
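For example, assuming we store the matching nodes in an object called wiki (the object name and the '.wikitable' selector are choices, not requirements):
wiki <- html_nodes(html, '.wikitable') # select every element with the CSS class "wikitable"
length(wiki) # how many matching elements did we find?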
## [1] 7
There are 7 tables in total, and we will extract the first one.
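One way to do this, reusing the wiki object from above (older versions of rvest return a plain data frame, which is what the output below shows; newer versions return a tibble):
data <- html_table(wiki[[1]], fill=TRUE) # parse the first matching table
str(data) # inspect the variables we obtained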
## 'data.frame': 314 obs. of 11 variables:
## $ 2019rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ City : chr "New York City[d]" "Los Angeles" "Chicago" "Houston[3]" ...
## $ State[c] : chr "New York" "California" "Illinois" "Texas" ...
## $ 2019estimate : chr "8,336,817" "3,979,576" "2,693,976" "2,320,268" ...
## $ 2010Census : chr "8,175,133" "3,792,621" "2,695,598" "2,100,263" ...
## $ Change : chr "+1.98%" "+4.93%" "-0.06%" "+10.48%" ...
## $ 2016 land area : chr "301.5 sq mi" "468.7 sq mi" "227.3 sq mi" "637.5 sq mi" ...
## $ 2016 land area : chr "780.9 km2" "1,213.9 km2" "588.7 km2" "1,651.1 km2" ...
## $ 2016 population density: chr "28,317/sq mi" "8,484/sq mi" "11,900/sq mi" "3,613/sq mi" ...
## $ 2016 population density: chr "10,933/km2" "3,276/km2" "4,600/km2" "1,395/km2" ...
## $ Location : chr ".mw-parser-output .geo-default,.mw-parser-output .geo-dms,.mw-parser-output .geo-dec{display:inline}.mw-parser-"| __truncated__ "34°01'N 118°25'W<U+FEFF> / <U+FEFF>34.01°N 118.41°W<U+FEFF> / 34.01; -118.41<U+FEFF> (2 Los Angeles)" "41°50'N 87°41'W<U+FEFF> / <U+FEFF>41.83°N 87.68°W<U+FEFF> / 41.83; -87.68<U+FEFF> (3 Chicago)" "29°47'N 95°23'W<U+FEFF> / <U+FEFF>29.78°N 95.39°W<U+FEFF> / 29.78; -95.39<U+FEFF> (4 Houston)" ...
As in the previous case, we still need to clean the data before we can use it.
We’ll use regular expressions to remove endnotes and commas in the population numbers, and clean the variable names. (We’ll come back to this later in the course.)
data$city_name <- gsub('\\[.*\\]', '', data$City) # What does gsub do? It replaces every match of the pattern; here it strips endnotes such as "[d]"
data$population <- data[,"2019estimate"] # keep the 2019 population estimate
data$population <- as.numeric(gsub(",", "", data$population)) # remove commas and convert to numeric
data$rank <- data[,"2019rank"]
data <- data[,c("rank", "population", "city_name")] # keep only the variables we need
# Equivalent with dplyr: data <- data %>% select(rank, population, city_name)
# Or by column position: data <- data[,c(14,13,12)]
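As an optional sanity check, we can confirm the cleaning worked before plotting:
str(data) # should now contain only rank, population, and city_name
head(data)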
Now we’re ready to generate the figure:
library(ggplot2)
library(plotly)
plot_ly(data = data,
        x = ~rank,
        y = ~population,
        text = ~city_name,
        type = 'scatter',
        mode = 'text+markers',
        textposition = 'middle right')
Let’s take a subset of the larger cities (population>1000000).
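A sketch of one way to do this with base R subsetting (the object name big_cities is just a choice), reusing the same plot_ly() call as above:
big_cities <- data[which(data$population > 1000000), ] # keep cities with more than one million residents
plot_ly(data = big_cities,
        x = ~rank,
        y = ~population,
        text = ~city_name,
        type = 'scatter',
        mode = 'text+markers',
        textposition = 'middle right')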