5.5 Lab: Scraping (more) tables

The goal of this lab is to scrape population numbers across cities in the U.S. This lab also contains a little bit of data cleaning.

When there are many tables on a website, scraping them becomes a bit more complicated. Let’s work through a common case scenario: scraping a table from Wikipedia with a list of the most populated cities in the United States.

## [1] 12

The function now returns 12 different tables. We had to use the option fill=TRUE because some of the tables appear to have incomplete rows.

In this case, identifying the part of the html code that contains the table is a better approach. To do so, let’s take a look at the source code of the website. In Google Chrome, go to View > Developer > View Source. All browsers should have similar options to view the source code of a website.

In the source code, search for the text of the page (e.g. 2017 rank). Right above it you will see: <table class="wikitable sortable"...">. This is the CSS selector that contains the table. (You can also find this information by right-clicking anywhere on the table, and choosing Inspect).

Now that we now what we’re looking for, let’s use html_nodes() to identify all the elements of the page that have that CSS class. (Note that we need to use a dot before the name of the class to indicate it’s CSS.)

## [1] 7

There are 7 tables in total, and we will extract the first one.

## 'data.frame':    314 obs. of  11 variables:
##  $ 2019rank               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ City                   : chr  "New York City[d]" "Los Angeles" "Chicago" "Houston[3]" ...
##  $ State[c]               : chr  "New York" "California" "Illinois" "Texas" ...
##  $ 2019estimate           : chr  "8,336,817" "3,979,576" "2,693,976" "2,320,268" ...
##  $ 2010Census             : chr  "8,175,133" "3,792,621" "2,695,598" "2,100,263" ...
##  $ Change                 : chr  "+1.98%" "+4.93%" "-0.06%" "+10.48%" ...
##  $ 2016 land area         : chr  "301.5 sq mi" "468.7 sq mi" "227.3 sq mi" "637.5 sq mi" ...
##  $ 2016 land area         : chr  "780.9 km2" "1,213.9 km2" "588.7 km2" "1,651.1 km2" ...
##  $ 2016 population density: chr  "28,317/sq mi" "8,484/sq mi" "11,900/sq mi" "3,613/sq mi" ...
##  $ 2016 population density: chr  "10,933/km2" "3,276/km2" "4,600/km2" "1,395/km2" ...
##  $ Location               : chr  ".mw-parser-output .geo-default,.mw-parser-output .geo-dms,.mw-parser-output .geo-dec{display:inline}.mw-parser-"| __truncated__ "34°01'N 118°25'W<U+FEFF> / <U+FEFF>34.01°N 118.41°W<U+FEFF> / 34.01; -118.41<U+FEFF> (2 Los Angeles)" "41°50'N 87°41'W<U+FEFF> / <U+FEFF>41.83°N 87.68°W<U+FEFF> / 41.83; -87.68<U+FEFF> (3 Chicago)" "29°47'N 95°23'W<U+FEFF> / <U+FEFF>29.78°N 95.39°W<U+FEFF> / 29.78; -95.39<U+FEFF> (4 Houston)" ...

As in the previous case, we still need to clean the data before we can use it.

We’ll use regular expressions to remove endnotes and commas in the population numbers, and clean the variable names. (We’ll come back to this later in the course.)

Now we’re ready to generate the figure:

New York CityLos AngelesChicagoHoustonPhoenixPhiladelphiaSan AntonioSan DiegoDallasSan JoseAustinJacksonvilleFort WorthColumbusCharlotteSan FranciscoIndianapolisSeattleDenverWashingtonBostonEl PasoNashvilleDetroitOklahoma CityPortlandLas VegasMemphisLouisvilleBaltimoreMilwaukeeAlbuquerqueTucsonFresnoMesaSacramentoAtlantaKansas CityColorado SpringsOmahaRaleighMiamiLong BeachVirginia BeachOaklandMinneapolisTulsaTampaArlingtonNew OrleansWichitaBakersfieldClevelandAuroraAnaheimHonoluluSanta AnaRiversideCorpus ChristiLexingtonHendersonStocktonSaint PaulCincinnatiSt. LouisPittsburghGreensboroLincolnAnchoragePlanoOrlandoIrvineNewarkDurhamChula VistaToledoFort WayneSt. PetersburgLaredoJersey CityChandlerMadisonLubbockScottsdaleRenoBuffaloGilbertGlendaleNorth Las VegasWinston–SalemChesapeakeNorfolkFremontGarlandIrvingHialeahRichmondBoiseSpokaneBaton RougeTacomaSan BernardinoModestoFontanaDes MoinesMoreno ValleySanta ClaritaFayettevilleBirminghamOxnardRochesterPort St. LucieGrand RapidsHuntsvilleSalt Lake CityFriscoYonkersAmarilloGlendaleHuntington BeachMcKinneyMontgomeryAugustaAuroraAkronLittle RockTempeColumbusOverland ParkGrand PrairieTallahasseeCape CoralMobileKnoxvilleShreveportWorcesterOntarioVancouverSioux FallsChattanoogaBrownsvilleFort LauderdaleProvidenceNewport NewsRancho CucamongaSanta RosaPeoriaOceansideElk GroveSalemPembroke PinesEugeneGarden GroveCaryFort CollinsCoronaSpringfieldJacksonAlexandriaHaywardClarksvilleLakewoodLancasterSalinasPalmdaleHollywoodSpringfieldMaconKansas CitySunnyvalePomonaKilleenEscondidoPasadenaNapervilleBellevueJolietMurfreesboroMidlandRockfordPatersonSavannahBridgeportTorranceMcAllenSyracuseSurpriseDentonRosevilleThorntonMiramarPasadenaMesquiteOlatheDaytonCarrolltonWacoOrangeFullertonCharlestonWest Valley CityVisaliaHamptonGainesvilleWarrenCoral SpringsCedar RapidsRound RockSterling HeightsKentColumbiaSanta ClaraNew HavenStamfordConcordElizabethAthensThousand OaksLafayetteSimi ValleyTopekaNormanFargoWilmingtonAbileneOdessaColumbiaPearlandVictorvilleHartfordVallejoAllentownBerkeleyRichardsonArvadaAnn ArborRochesterCambridgeSugar LandLansingEvansvilleCollege StationFairfieldClearwaterBeaumontIndependenceProvoWest JordanMurrietaPalm BayEl MonteCarlsbadNorth CharlestonTemeculaClovisSpringfieldMeridianWestminsterCosta MesaHigh PointManchesterPuebloLakelandPompano BeachWest Palm BeachAntiochEverettDowneyLowellCentennialElginRichmondPeoriaBroken ArrowMiami GardensBillingsJurupa ValleySandy SpringsGreshamLewisvilleHillsboroVenturaGreeleyInglewoodWaterburyLeague CitySanta MariaTylerDavieDaly CityBoulderAllenWest CovinaSparksWichita FallsGreen BaySan MateoNorwalkRialtoLas CrucesChicoEl CajonBurbankSouth BendRentonVistaDavenportEdinburgTuscaloosaCarmelSpokane ValleySan AngeloVacavilleBend05010015020025030001M2M3M4M5M6M7M8M
rankpopulation

Let’s take a subset of the larger cities (population>1000000).

New York CityLos AngelesChicagoHoustonPhoenixPhiladelphiaSan AntonioSan DiegoDallasSan Jose2468101M2M3M4M5M6M7M8M
rankpopulation