After this module you should be able to… |
Perform basic coding operations in R |
Produce simple thematic maps of epidemiologic data in R. |
Module 4: Spatial Mapping in R
Additional Resources
Geocomputation with R by Robin Lovelace. This will be a recurring ‘additional resource’ as it provides lots of useful insight and strategy for working with spatial data in R
. I encourage you to browse it quickly now, but return often when you have questions about how to handle geographic data (especially of class sf
) in R
.
An introduction to the ggplot2
package. This is just one of dozens of great online resources introducing the grammar of graphics approach to plotting in R
.
A basic introduction to the tmap
package This is also only one of many introductions to the tmap
mapping package. tmap
builds on the grammar of graphics philosophy of ggplot2
, but brings a lot of tools useful for thematic mapping!
Introduction to TIDYCENSUS
The tidycensus package by Kyle Walker is an R package designed to facilitate the process of acquiring and working with US Census Bureau (USCB) population data in the R environment.
It is designed to streamline the data wrangling process for spatial Census data analysis.
With tidycensus R users can acquire census reported variables of interest AND geometric and spatial attributes which helps facilitate mapping and spatial analysis.
US Census Bureau data consists of:
US decennial census data from 2010 and 2020 requested using
.1-year and 5-year American Community Survey (ACS) population and demographic data requested using
.Population Projection Estimates (PEP) requested using
.ACS Public Use Microdata requested using
.ACS Migration Flows requested using
.
Note this is not a comprehensive tutorial on tidycensus.
It will give you an introduction to the capabilities of the package.
For more information on USCB data products visit https://www.census.gov/data.html.
Geography in Tidycensus
The USCB reports data from the national level to county to census block level.
Importantly, not all data sources report data at all geographies.
The decennial census and 5-year ACS estimates report down to census block.
However, PEP only reports down to county level and metro/micro-politan areas.
Searching for variables in Tidycensus
One challenge is extracting variable names from the USCB data sources.
The tidycensus package contains a
By browsing or subsetting the table, you can search for the relevant variables needed for your analysis.
# install.packages("tidycensus")
library(tidycensus)
#Read in ACS variables
<- load_variables(2020, "acs5")
vars
head(vars)
Extracting the data in Tidycensus
Let’s work through examples on how to extract ACS.
The parameters passed through the
- geography which the the spatial resolution you want reported.
Here we want county level data.
- state is an optional parameter if you want a specific state.
The default is NULL if you want the US.
- variables listed the needed variables from the list above.
Here we chose the variable for median household income.
year (self-explanatory?) 5.
geometry is an important parameter. When set to
the data frame will include the spatial boundaries of the counties in your data frame and will produce an object.
#Extract ACS median household income at county level for GA. 2020
<- get_acs(
ga_hinc geography = "county",
state = "Georgia",
variables= c("B19013_001"),
year = 2020,
geometry = T)
head(ga_hinc)
The resulting data frame a row for each county.
Each row lists the county name, the estimate of the variable, and the associated margins of error (otherwise referred to as sampling error from ACS).
Additionally, we have a geographic identification (GEOID) and the geometric features column.
Next let’s work through examples on how to extract PEP.
The parameters passed through the
geography which the the spatial resolution you want reported. Here we want county level data.
state is an optional parameter if you want a specific state. The default is NULL if you want the US.
product lists three optioins: (1) population totals (“population”), (2) populations by characteristics (“characteristics”), and (3) housing units (“housing”). Here we request populations by sex and race breakdowns.
**breakdown* lists RACE, SEX, AGE (for years 2020+), and HISPANIC.
year (self-explanatory?)
geometry again set to
the data frame will include the spatial boundaries of the counties in your data frame and will produce an object.
#Extract PEP populations by Race-Sex at county level for GA. 2019
<- get_estimates(
ga_sexrace geography = "county",
state = "Georgia",
product= c("characteristics"),
breakdown = c("SEX","RACE"),
breakdown_labels = T,
year = 2019,
geometry = T)
head(ga_sexrace)
The resulting data frame a row for each county.
Each row lists the county name, the value of the variable.
For example, the population estimate for 2019 for the White population in Fulton County is 483838 across both sexes.
To map population data from the US Census Bureau, you must register a Census API key first. To obtain a census API go to https://api.census.gov/data/key_signup.html.
The in R type census_api_key(“YOUR KEY”, install = TRUE) to install the census API key. You only need to do this once.
From tidycensus you can extract information from the decennial census, Population Estimation Projection (PEP), and American Community Survey (ACS). Let’s see some examples:
- First you may need to look up which variables to download from the data set: For example, we can look up all the listed variables in the ACS and find which variables have label “Cherokee”.
# Load variables for the 2020 Census decennial data
<- load_variables(year = 2020, dataset = "acs5", cache = TRUE)
vars head(vars)
name | label | concept | geography |
B01001_001 | Estimate!!Total: | SEX BY AGE | block group |
B01001_002 | Estimate!!Total:!!Male: | SEX BY AGE | block group |
B01001_003 | Estimate!!Total:!!Male:!!Under 5 years | SEX BY AGE | block group |
B01001_004 | Estimate!!Total:!!Male:!!5 to 9 years | SEX BY AGE | block group |
B01001_005 | Estimate!!Total:!!Male:!!10 to 14 years | SEX BY AGE | block group |
B01001_006 | Estimate!!Total:!!Male:!!15 to 17 years | SEX BY AGE | block group |
- This code extracts PEP population counts at the county level for Oklahoma.
#Extract county level population counts
<- get_estimates(
ok_population geography = "county",
state = "Oklahoma",
product= c("population"),
year = 2022,
geometry = T)
- This code extracts PEP population counts by demographic breakdowns at the county level for Oklahoma. The PEP reported breakdowns are not available for recent years via tidycensus, and must be downloaded directly from USCB.gov.
#Extract county level population counts by characteristics
<- get_estimates(
ok_characteristics geography = "county",
state = "Oklahoma",
product= c("characteristics"),
breakdown = c("RACE", "SEX"),
breakdown_labels = T,
year = 2019,
geometry = T)
- This code extracts ACS reported Cherokee population estimates at county level for Oklahoma.
<- get_acs(
ok_cherokee geography = "county",
state = "Oklahoma",
variables = "B02014_009",
year = 2022,
geometry=T
)
Introduction to Shapefile and Tigris
Shapefiles are a popular method of representing geospatial vector data in GIS.
The shapefile can store data in the form of points, lines, or polygons.
They can be used to represent administrative boundaries (think country/state/county borders) or roads, rivers, lakes etc.
These shapes, together with data attributes that are linked to each shape, create the representation of the geographic data.
Three mandatory files related to shapefiles use the extensions (.shp, .shx, .dbf).
The actual shapefile relates to the.shp file.
Shapefiles can be found on the internet with various publishers.
For example,the US census bureau publishes shapefiles for the entire US.
The USCB’s TIGER/Line database includes high-quality series of geographic datasets suitable for spatial and cartographic analyses.
The TIGER/Line shapefiles include three general types of data:
Legal entities: Legal geographic entities including county and state boundaries.
Statistical entities: Non legal entities but boundaries with statistical meaning, i.e., census tracts and. block groups.
Geographic features: Geographic datasets linked with aggregrate demographic information including water and road fetures.
From the USCB website, we downloaded the shapefile that contains all roads for the state of GA.
Using the
This gives an
Using the
library(sf)
#Read in a shapefile
<- st_read("data/shapefiles/tl_2019_13_prisecroads.shp")
shp
class(shp)
head(shp)
#Plot boundaries contained in shapefile.
tm_shape(shp) +
tm_lines(col = "black")
The tigris package simplifies the process for R users of obtaining and using Census geographic datasets.
The full list of shapefiles in tigris are available here
For example, the
This function produces an
# install.pakcages("tigris", dep =T)
library(tigris)
library(tmap)
#Call in state boundaries for US
<- states()
st head(st)
Additionally, we can extract shapefiles corresponding to specific counties within a state.
library(tigris)
library(tmap)
#Extract county borders for GA from the tigris package
<- counties("GA")
ga_counties
#Plot the county borders
tm_shape(ga_counties) +
tm_borders(col = "black")
Let’s add to this map by adding roads and water areas.
We get the roads for GA using the
To obtain the water areas, we must obtain water geometries for each county within the state of GA.
We use the
#Extract the FIPS identifier for each county in GA
<- unique(ga_counties$COUNTYFP)
fips
#Extract the primary and secondary roads for GA
<- primary_secondary_roads("GA")
road_data
#To create areas of water, we must extract for each FIPS ID separately.
#Create a blank list
<- list()
water_list
#For each county c, extract the water area using area_water()
#This takes a while!!!
for(c in 1:length(fips)){
<- area_water("GA", fips[c])
water_list[[c]]
}
#Use the do.call and rbind functions to compile the list into a data frame
<- do.call("rbind", water_list)
water_data
#Map the county borders, areas of water, and roads.
tm_shape(ga_counties) +
tm_borders(col = "black") +
tm_shape(water_data) +
tm_borders(col = "blue") +
tm_shape(road_data) +
tm_lines(col = "orange")
Introduction to sf
A new class of spatial objects has been defined called simple features sf.
Illustration by Allison Horst, see Simple Features Blog
What is a feature?
A feature is thought of as a thing, or an object in the real world, such as a building or a tree. Features have geometry describing where on Earth the feature is located, and they have attributes, which describe other properties.
Simple feature geometry types
The following seven simple feature types are most common.
Type | Description |
---|---|
POINT | zero-dimensional geometry containing single point |
LINESTRING | sequence of points connected by straight, non-self intersecting line pieces; one-dimensional geometry |
POLYGON | geometry with a positive area (two-dimensional); sequence of points form a closed, non-self intersecting ring; the first ring denotes the exterior ring, zero or more subsequent rings denote holes in this exterior ring |
MULTIPOINT | set of points; a MULTIPOINT is simple if no two Points in the MULTIPOINT are equal |
MULTILINESTRING | set of linestrings |
MULTIPOLYGON | set of polygons |
We are going to use data from New Haven CT to work with different kinds of spatial data.
First we need to run the commands below to use the
The New Haven data contains several
blocks is a data frame with population demographics broken down by block.
These attributes are saved within spatial polygons (boundaries of the blocks).
breach is a data frame with location data for public disturbances (crimes).
These attributes are saved with spatial points (x,y coordinates of the breaches).
roads contains spatial lines that mark road trajectories.
#Install and Call in the sf and GISTools libraries.
library(sf)
#Call in the newhaven data, notice this calls in multiple dataframes, examine a by clickin on them in the evironment pane
list2env(readRDS('data/newhaven.rds'), envir=environment())
<environment: R_GlobalEnv>
Below shows the blocks sf data frame which gives an example of how a
Below the geometry type is listed as a polygon which captures the boundaries of the census blocks.
It also shows the bounding box of the minimum and maximum (x,y) coordinates.
Each column denotes the spatial attributes or characteristics for each block (row), e.g., The POP1990 column reports the total population in 1990 for block 1 (pop=3071).
The last column geometry holds the spatial features of each block meaning the geographic boundaries of the polygon.
Many of the dplyr functions described above can be used on an sf data frame.
For example, we can use the
Additionally, we can create a new variable called RentOcc that calculates the number of housing units that are renter occupied from to current variables: (a) the of housing units in
#Use the filter function to subset to blocks with population >1000
# Use mutate function to create RentOcc = number of renter occ housing units.
<- blocks%>%
blocks_subset filter(POP1990>1000) %>%
mutate(RentOcc = HSE_UNITS * P_RENTROCC/100)
Spatial Joins combine multiple spatial data sets in the case where both data sets are spatial features.
To join a
The data in data set
If
For example, if there is a location in
In the below code, we merge together the data from the block data set (population attributes of the census blocks) and the breach data set (locations of crimes).
The new data set called block_breach is a polygon object, with attributes of the polygon containing each particular point being joined as columns.
Note: If you want to join a spatial data set with a non-sf object, you can use the
The
The
The observations that are unmatched are discarded.
<- breach %>%
temp st_join(blocks)
An Introduction to TMAP
This section will concentrate on learning the functions in the tmap (Tennekes, 2015).
This package focuses on mapping the spatial distribution of data attributes.
It has a similar grammar to plotting with the *ggplot package in that it seeks to handle each element of the map separately in a series of layers.
The process of making maps using tmap is one in which a series of layers are added to the map.
First
The input to
Layer 1
For example, we may want to plot the New Haven crime data.
The first layer of the plot is the geographic polygons (census block boundaries) of New Haven.
We input the
We plot the borders spatial layer of the data which outlines census block boundaries.
To add custom styles to the background, type
?tm_style
?tm_layout
tm_shape(blocks)+
tm_borders(col = "red", lty = "dashed") +
tm_style("natural", bg.color = "lightblue") +
tm_layout(title = "New Haven CT",
title.size = 1,
title.position = c(0.55, "top"))
So what you can see in the above code is one set of layers for the borders where we customize the look, color, and outline.
Layer 2
A plot of geographic boundaries alone does not give much information.
We add a second layer where we can include the spatial attributes of the areas (polygons).
To add the second layer to the current plot, add another
For example, here we will plot the percent of renter occupied housing units.
tm_shape(blocks)+
tm_borders(col = "red", lty = "dashed") +
tm_shape(blocks) +
tm_polygons(col = "P_RENTROCC", palette = "Blues") +
tm_layout(title = "New Haven CT",
title.size = 1,
title.position = c(0.55, "top"))
Layer 3
The third layer of the New Haven plot may include the point layer which plots the geographic coordinates of the breaches of the peace.
Extending upon layers 1 and 2, for layer 3, we use the
?tm_style
?tm_layout
tm_shape(blocks)+
tm_polygons(col = "P_RENTROCC", palette = "Blues") +
tm_borders(col = "red", lty = "dashed") +
tm_shape(breach) +
tm_dots(col = "blue", size = 0.1) +
tm_layout(title = "New Haven CT",
title.size = 1,
title.position = c(0.55, "top"))
Layer 4
The fourth layer of the New Haven plot may include the lines layer which plots the roads in New Haven.
Again, add another
We add line data using the
?tm_style
?tm_layout
st_crs(roads) <- st_crs(blocks) #this sets the projection of roads to be equal to the blocks sf dataframe
tm_shape(blocks)+
tm_borders(col = "red", lty = "dashed") +
tm_polygons(col = "P_RENTROCC", palette = "Blues", alpha = 0.5) +
tm_shape(breach) +
tm_dots(col = "blue", size = 0.1) +
tm_shape(roads) +
tm_lines(col = "purple", alpha = 0.5) +
tm_layout(title = "New Haven CT",
title.size = 1,
title.position = c(0.55, "top"))
Saving Your Map
Having created a map in a window, you may now want to save the map for either printing or incorporating into a document.
There are a few simple steps to save a map in multiple formats.
If you plot the map in the Viewer pane (bottom right), you can click on the Export button to save as a pdf or copy to clipboard.
To directly save a map as a pdf/png or jep.
Above the open street map object was assigned to “om_map”.
Taking a pdf export for example, print the map after using the
command, which lists the file path and the pdf file name and extension.
<- tm_shape(breach)+
om_map tm_dots()
#Save as a pdf
pdf("mymap.pdf")
print(om_map)
off()
#Save as a jpeg
jpeg("filepath/mymap.jpg")
print(om_map)
dev.off()
Below is an example of using the
Note how we have changed
Alternatively, we can use the
We create a data frame with