Until now, this book has focused on non-spatial tabular datasets. This chapter will make the shift to processing and visualizing vector geospatial data. Vector data consists of features that represent geographic phenomena. Points can be used to represent small objects such as wells, weather stations, or field plot locations. Lines represent one-dimensional linear features such as roads and streams. Polygons represent two-dimensional natural features such as lakes, vegetation patches, and burn scars as well as administrative units such as counties, districts, and counties. The type of geometry used to represent a particular feature may depend on the scale of the measurement. Points can be used to represent city locations at a continental scale, whereas a polygon would be used to map the boundary of an individual city.
Each feature in a vector dataset is described by a set of spatial coordinates and one or more attributes that characterize the feature. For example, a vector dataset of forest stands would include the vertices of each stand boundary polygon as well an attribute table containing variables such as stand age, tree species abundances, and timber volume for each stand. This book uses the sf package (Pebesma 2022) to work with vector datasets as “simple features” in R. This approach uses data frame objects to store vector data. Each feature is represented by one row in the data frame, with attributes stored as columns and spatial information stored in a special geometry column. This means that we can continue to use the same dplyr and tidyr functions that have already been covered to work with geospatial data. The ggplot2 package can also be used to generate maps from data frames containing geospatial data.
This chapter will use the sf package along with several of the tidyverse packages from the previous chapters and several other packages that contain some helpful graphics functions. If these packages are not already installed, you can install them using the
install.packages() function or by selecting Tools > Install Packages in the RStudio menu.
library(sf) library(rgdal) library(ggplot2) library(dplyr) library(tidyr) library(scales) library(RColorBrewer) library(units) library(cowplot)
This chapter uses several datasets that characterize the historical distribution of tornadoes in the state of Oklahoma. They are provided in ESRI shapefile format, which was developed several decades ago but remains one of the widely used file formats for vector geospatial data. It is a multiple file format, where separate files contain the feature geometries, attribute table, spatial indices, and coordinate reference system. All files have the same filename with different extensions. The shapefiles are imported to
sf objects using the
st_read() function. The
quiet = FALSE argument suppresses output to the console when importing spatial datasets. To read in a shapefile, it is only necessary to specify the filename with a “.shp” extension. However, all the files, including the “.shp” file as well as the “.dbf”, “.shx”, and “.prj” files, need to be present in the directory from which the data are read. Remember that all of the code in this book is written with the assumption that data are being read from the user’s working directory.
ok_counties.shp dataset contains county boundaries for the state of Oklahoma. The
ok_tornado_point.shp dataset and the
ok_tornado_path.shp dataset contain historical information about tornadoes in Oklahoma. The points are the initial locations of tornado touchdown, and the paths are lines that identify the path of each tornado after touchdown. These data were derived from larger, national-level datasets generated by the National Oceanographic and Atmospheric Administration (NOAA) National Weather Service Storm Prediction Center (https://www.spc.noaa.gov/gis/svrgis/).
<- st_read("ok_counties.shp", quiet = TRUE) okcounty <- st_read("ok_tornado_point.shp", quiet = TRUE) tpoint <- st_read("ok_tornado_path.shp", quiet = TRUE) tpath class(okcounty) ##  "sf" "data.frame" glimpse(okcounty) ## Rows: 77 ## Columns: 8 ## $ STATEFP <chr> "40", "40", "40", "40", "40", "40", "40",… ## $ COUNTYFP <chr> "077", "025", "011", "107", "105", "153",… ## $ COUNTYNS <chr> "01101826", "01101800", "01101793", "0110… ## $ AFFGEOID <chr> "0500000US40077", "0500000US40025", "0500… ## $ GEOID <chr> "40077", "40025", "40011", "40107", "4010… ## $ NAME <chr> "Latimer", "Cimarron", "Blaine", "Okfuske… ## $ LSAD <chr> "06", "06", "06", "06", "06", "06", "06",… ## $ geometry <POLYGON [°]> POLYGON ((-95.50766 35.0292..., P…
sf objects contain a column called
geometry. This is a special column that contains the geospatial information about the location of each feature. This column should not be modified directly. It is used by the functions in the sf package for geospatial data processing.
An older package that is still frequently used for working with geospatial data in R is sp. This package has been around for much longer than sf, and there are a number of other R packages that work with geospatial data objects defined in sp. Therefore, it is sometimes necessary to convert geospatial data between these two object classes. The following examples show how to convert
okcounty into an sp
SpatialPolygonsDataFrame object and then back to an
<- as(okcounty, 'Spatial') okcounty_sp class(okcounty_sp) ##  "SpatialPolygonsDataFrame" ## attr(,"package") ##  "sp" <- st_as_sf(okcounty_sp) okcounty_sf class(okcounty_sf) ##  "sf" "data.frame"
To generate a map of counties using
ggplot() with a sf object, the
geom_sf() function is used. This map shows the borders of all counties in Oklahoma. No plot aesthetics are specified, and
fill = NA makes the counties transparent. The map is displayed over the default gray background, and graticules of longitude and latitude are shown and labeled. (Figure 5.1).
ggplot(data = okcounty) + geom_sf(fill = NA)
The tornado dataset contains data from 1950-2021. If the objective is to map tornadoes for only recent years, then the
sf object needs to be modified to contain only the desired years. Because
sf objects are a type of data frame, they can be modified using the dplyr functions that were covered in Chapter 3. The tornado data include a number of columns with information on the timing, locations, and effects of the tornadoes.
names(tpoint) ##  "om" "yr" "mo" "dy" "date" ##  "time" "tz" "st" "stf" "stn" ##  "mag" "inj" "fat" "loss" "closs" ##  "slat" "slon" "elat" "elon" "len" ##  "wid" "fc" "geometry"
yr column indicates the year of each tornado and can be used to filter the data to a smaller year range from 2016-2021. This column is retained in the output along with
om, a unique ID code, and
date, the date of each tornado.
<- tpoint %>% tpoint_16_21 filter(yr >= 2016 & yr <= 2021) %>% select(om, yr, date) <- tpath %>% tpath_16_21 filter(yr >= 2016 & yr <= 2021) %>% select(om, yr, date)
Now, the tornado points and paths can be mapped. Two
geom_sf() functions are used, one to map the county boundaries and another to map the tornadoes. Because each function maps a different dataset, the
data argument must be provided in each
geom_sf() function instead of in the
ggplot() function. The
theme_bw() function is also used to display the map over a white background while retaining the graticules (Figure 5.2).
ggplot() + geom_sf(data = okcounty, fill = NA) + geom_sf(data = tpoint_16_21) + theme_bw()
To make the tornado path lines easier to see in relation to the county boundaries, they are displayed in red and their sizes are increased to be larger than the default line width of 0.5 (Figure 5.3).
ggplot() + geom_sf(data = okcounty, fill = NA) + geom_sf(data = tpath_16_21, color = "red", size = 1) + theme_bw()
To view the years of the tornadoes on the map, an aesthetic can be specified. The
color argument is specified as
as.factor(yr) so that the year is displayed as a discrete variable instead of a continuous variable. The
scale_color_discrete() function is then used to set the legend name. The
theme_void() function removes the plot axes and labels and shows only the map (Figure 5.4).
ggplot() + geom_sf(data = tpoint_16_21, aes(color = as.factor(yr))) + geom_sf(data = okcounty, fill = NA) + scale_color_discrete(name = "Year") + coord_sf(datum = NA) + theme_void()
facet_wrap() can be used to display the tornadoes for each year on a separate map. The sizes of the points are reduced slightly so that they are easier to see on the smaller maps (Figure 5.5).
ggplot() + geom_sf(data = okcounty, fill = NA, color = "gray") + geom_sf(data = tpoint_16_21, size = 0.75) + facet_wrap(vars(yr), ncol = 2) + coord_sf(datum = NA) + theme_void()
The number of tornado points in each county can be calculated using the
st_join() function to carry out a spatial join. This is different from the join functions covered in Chapter 3 in that
st_join() links rows from the two tables based on the spatial locations instead of their attributes. Afterward, each row in
countypnt contains additional columns from the
okcounty dataset that correspond to the county that the tornado is within.
<- st_join(tpoint_16_21, okcounty) countypnt glimpse(countypnt) ## Rows: 434 ## Columns: 11 ## $ om <dbl> 613662, 613675, 613676, 613677, 613678, 6… ## $ yr <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016,… ## $ date <chr> "2016-03-23", "2016-03-30", "2016-03-30",… ## $ STATEFP <chr> "40", "40", "40", "40", "40", "40", "40",… ## $ COUNTYFP <chr> "001", "113", "105", "131", "035", "139",… ## $ COUNTYNS <chr> "01101788", "01101844", "01101840", "0110… ## $ AFFGEOID <chr> "0500000US40001", "0500000US40113", "0500… ## $ GEOID <chr> "40001", "40113", "40105", "40131", "4003… ## $ NAME <chr> "Adair", "Osage", "Nowata", "Rogers", "Cr… ## $ LSAD <chr> "06", "06", "06", "06", "06", "06", "06",… ## $ geometry <POINT [°]> POINT (-94.5042 35.6916), POINT (-9…
The joined data frame still contains one record for each tornado. To compute the total number of tornadoes per county,
countypnt must be grouped by the
GEOID county code. Then, the numbers of tornadoes in each county can be calculated using
summarize() with the
n() function to count the records in each group. Before grouping and summarizing,
countypnt must be converted from an
sf object to a normal data frame using
<- st_drop_geometry(countypnt) countypnt <- countypnt %>% countysum group_by(GEOID) %>% summarize(tcnt = n()) glimpse(countysum) ## Rows: 75 ## Columns: 2 ## $ GEOID <chr> "40001", "40005", "40007", "40009", "40011",… ## $ tcnt <int> 6, 3, 4, 8, 1, 4, 10, 5, 7, 5, 3, 12, 10, 5,…
okcounty is joined to
countysum so that each polygon is associated with the appropriate tornado summary. Using
left_join() instead of
inner_join() ensures that all of the county polygons are retained in the output of the join. A few counties that had no tornadoes during 2016-2021 are missing from
countysum, and therefore have
NA values in the joined table. In this case, we know that
NA means zero tornadoes, so the
NA values are replaced by zeroes. The
mutate() function computes the area of each county using
st_area() and then calculates the density of tornadoes per county. Density is initially in tornadoes per square meter, but is converted to tornadoes per 1000 km2. The
st_area() function returns a column with explicit measurement units, but these are removed using the
drop_units() function for simplicity. For more information about measurement units, consult the documentation of the units package (Pebesma et al. 2022).
<- okcounty %>% countymap left_join(countysum, by = "GEOID") %>% replace(is.na(.), 0) %>% mutate(area = st_area(okcounty), tdens = 10^6 * 10^3 * tcnt / area) %>% drop_units() glimpse(countymap) ## Rows: 77 ## Columns: 11 ## $ STATEFP <chr> "40", "40", "40", "40", "40", "40", "40",… ## $ COUNTYFP <chr> "077", "025", "011", "107", "105", "153",… ## $ COUNTYNS <chr> "01101826", "01101800", "01101793", "0110… ## $ AFFGEOID <chr> "0500000US40077", "0500000US40025", "0500… ## $ GEOID <chr> "40077", "40025", "40011", "40107", "4010… ## $ NAME <chr> "Latimer", "Cimarron", "Blaine", "Okfuske… ## $ LSAD <chr> "06", "06", "06", "06", "06", "06", "06",… ## $ tcnt <dbl> 1, 12, 1, 10, 6, 2, 6, 0, 4, 9, 3, 10, 12… ## $ geometry <POLYGON [°]> POLYGON ((-95.50766 35.0292..., P… ## $ area <dbl> 1890663261, 4766283042, 2427121029, 16572… ## $ tdens <dbl> 0.5289149, 2.5176851, 0.4120108, 6.034094…
st_write() function can be used to save
sf objects to a variety of file formats. In many cases, the function can determine the output format from the output filename extension. The following code saves the county-level tornado summaries in ESRI shapefile format. The
append = FALSE option overwrites the shapefile if it already exists.
st_write(countymap, dsn = "oktornadosum.shp", append = FALSE)
An alternative format for storing geospatial data is the Open Geospatial Consortium (OGC) GeoPackage. The data are stored in a SQLite database that may contain one or more layers. In this example, the
delete_dsn = TRUE argument overwrites the entire GeoPackage. Leaving this argument at its default value of
FALSE would add a new layer to an existing database.
st_write(countymap, dsn = "oktornado.gpkg", layer = "countysum", delete_dsn = TRUE)
layer_options argument must be set to
"RFC7946 = YES" to save the data in the newest GeoJSON specification.
st_write(countymap, "oktornado.geojson", layer_options = "RFC7946 = YES")
Another way to display the tornadoes is as a choropleth map, where summary statistics for each county are displayed as different colors. The county-level tornado density can be as a choropleth using the
fill aesthetic with
geom_sf(). By default, the fill colors are based on a dark-to-light blue color ramp. The
theme_void() function eliminates the axes and graticules and displays only the map on a white background (Figure 5.6).
ggplot(data = countymap) + geom_sf(aes(fill = tdens)) + theme_void()
In general, choropleth maps are most effective when mapping rates, such as the density of tornadoes per area or the prevalence of a disease per number of people. In some cases, it is necessary to map count data such as the number of tornadoes or disease cases in each county. People tend to naturally associate sizes with quantitie when viewing maps, and displaying the counts as graduated symbols is often an effective approach.
To map symbols, the county polygons must first be converted to points. The
st_centroid() generates a point feature located at the centroid of each county. The
st_geometry_type() function returns the feature geometry type. Setting
by_geometry = FALSE returns one geometry type for the entire dataset instead of for every feature.
st_geometry_type(okcounty, by_geometry=FALSE) ##  POLYGON ## 18 Levels: GEOMETRY POINT LINESTRING POLYGON ... TRIANGLE = st_centroid(countymap) okcntrd st_geometry_type(okcntrd, by_geometry = FALSE) ##  POINT ## 18 Levels: GEOMETRY POINT LINESTRING POLYGON ... TRIANGLE
The tornado counts can be mapped using the
okcentrd dataset with the
size aesthetic. One point is plotted for each county centroid, and the size of the point is proportional to the number of tornadoes in the county (Figure 5.7).
ggplot() + geom_sf(data = okcntrd, aes(size = tcnt)) + geom_sf(data = okcounty, fill = NA) + theme_void()
Additional ggplot2 functions can be added to improve the appearance of the map. The
scale_fill_distiller() function allows the specification of a different color ramp. This example uses the “YlOrRd” palette from the RColorBrewer package. The
pretty_breaks() function from the scales package is used to automatically select a set of breaks for the legend. The legend is moved to the bottom of the map to better accomodate the longer legend title (Figure 5.8).
Note that using a “complete” theme like
theme_void() will remove any settings made by a previous
theme() function. Therefore, it is necessary to call
theme() in the
ggplot() call to implement the combined settings. Also, note how the
expression() function is used in specifying the
name argument for
scale_fill_distiller() to combine text with a superscript.
ggplot(data = countymap) + geom_sf(aes(fill = tdens)) + scale_fill_distiller(name = expression("Tornadoes/1000 km"^2), palette = "YlOrRd", breaks = pretty_breaks(), direction = 1) + theme_void() + theme(legend.position = "bottom")
The RColorBrewer package provides a selection of palettes designed for choropleth mapping (Harrower and Brewer 2003). The
display_brewer_all() function generates a chart with examples of all the available palettes (Figure 5.9).
There are three types of ColorBrewer palettes. The top group is Figure 5.9 contains sequential palettes that are suitable for mapping ordered data along numerical scales (e.g., temperatures ranging from 0 to 30 degrees C) or ordinal categories (e.g., temperatures classified as cold, warm, and hot). These sequential palettes may include a single color or multiple colors, but have no clear breaks in the scale. The middle group in Figure 5.9 contains qualitative palettes, which use color to distinguish between different categories without implying order. The lower group in Figure 5.9 contains divergent palettes that emphasize a specific breakpoint in the data. Divergent palettes are often used to indicate values that are above or below the mean or to highlight values that are higher or lower than zero. More details about these palettes, including recommendations for color schemes that are most effective for different types of computer displays and for accommodating color-blind viewers, are available at http://colorbrewer2.org.
The ColorBrewer palettes each contain a finite number of colors that are intended to be associated with categories in a choropleth map. Note that the
scale_fill_distiller() function used to generate the color scale for the map in Figure 5.8 operates a bit differently. This function takes a ColorBrewer palette and converts it to a continuous color ramp. The next map example will show how to define categories and map each one as a distinctive color. To view the colors for a given number of categories and a specific palette, the
display.brewer.pal() function is used with the number of categories as the first argument and the palette name as the second palette (Figure 5.10).
Rather than using continuous scales for color and size, it is often recommended to aggregate the data into a small number of classes (typically 3-6). Using discrete classes makes it easier to associate each color or symbol in the map with a specific range of values. To accomplish this step, we need to add a couple of new classified variables using
cut() function is used to split the continuous variables based on user-specified breaks. The incidence variable is split based on quantiles (i.e., percentiles) defined in the
qbrks object. The population breaks are manually specified.
<- 4 numclas <- seq(0, 1, length.out = numclas + 1) qbrks qbrks##  0.00 0.25 0.50 0.75 1.00 <- countymap %>% countymap mutate(tdens_c1 = cut(tdens, breaks = quantile(tdens, breaks = qbrks), include.lowest = T))
tdens_c1 column is a discrete factor instead of a continuous numerical variable as in the previous example. As a result, the
scale_fill_brewer() function is now used instead of
scale_fill_distiller(). The comma-separated numbers specify the range of tornado densities for each of the categories (Figure 5.11).
ggplot(data = countymap) + geom_sf(aes(fill = tdens_c1)) + scale_fill_brewer(name = expression("Tornadoes/1000 km"^2), palette = "YlOrRd") + theme_void() + theme(legend.position = "bottom")
Similar to choropleth maps, graduated symbol maps are often easier to interpret if they include a limited number of symbol sizes. To generate a classified map of tornado counts, they are converted to discrete categories using the
cut() function. Instead of using quantiles, the breakpoints for the classification are selected manually and stored in the
<- max(okcntrd$tcnt) maxcnt <- c(0, 2, 5, 10, maxcnt) brkpts <- okcntrd %>% okcntrd mutate(tcnt_c1 = cut(tcnt, breaks = brkpts, include.lowest = T))
The resulting map has four symbol sizes, each associated with a specific range of tornado counts (Figure 5.12).
ggplot(data = okcntrd) + geom_sf(aes(size = tcnt_c1)) + scale_size_discrete(name="Tornadoes") + geom_sf(data = okcounty, fill = NA) + theme_void() + theme(legend.position = "bottom")
By default, maps and charts generated using
ggplot() are output to the Plots tab in the lower right-hand corner of the RStudio GUI. However, it is often necessary to export maps and other graphics to files and explicitly specify their dimensions and resolution. This is usually the case when generating graphics for publications that must meet specific size and formatting criteria. The simplest way to do this is with the
ggsave() function. This example exports the most recent output of
ggplot() to a portable network graphics (PNG) file called
tornado.png with dimensions of 6 x 4 inches and a resolution of 300 pixels per inch. Other units besides inches can be used by specifying the
ggsave("tornado.png", width = 6, height = 4, dpi = 300) ggsave("tornado2.png", width = 15, height = 10, units = "cm", dpi = 100)
A variety of graphical file types are supported, including TIFF, JPEG, and PDF files. Note that the arguments vary with the type of file being created. When saving a JPEG, the
quality option can be specified. When saving a TIFF, a
compression type can be specified. When saving a PDF, the
dpi argument is ignored.
ggsave("tornado.jpeg", width = 6, height = 4, dpi = 300, quality = 90) ggsave("tornado.tiff", width = 6, height = 4, dpi = 300, compression = "lzw") ggsave("tornado.pdf", width = 6, height = 4)
The output of
ggplot() can also be saved as an R object that can be output to graphics files using
plot argument is used to specify the
ggplot object to be saved.
<- ggplot(data = countymap) + choropleth geom_sf(aes(fill = tdens_c1)) + scale_fill_brewer(name="Density", palette = "YlOrRd") + theme_void() <- ggplot(data = okcntrd) + gradsymbol geom_sf(aes(size = tcnt_c1)) + scale_size_discrete(name="Count") + geom_sf(data = okcounty, fill = NA) + theme_void() ggsave("choropleth.tiff", plot = choropleth, width = 6, height = 4, dpi = 300, compression = "lzw") ggsave("gradsymbol.tiff", plot = gradsymbol, width = 6, height = 4, dpi = 300, compression = "lzw")
Saved graphs and maps can also be combined into a composite figure using the cowplot package (Wilke 2020). The
plot_grid() function provides a variety of options for arranging figures in a regular grid. This basic example provides a label for each subplot and specifies additional arguments to plot the maps in a single column, justify the labels, move the labels to the top of each map, and align the maps horizontally and vertically, so they are exactly the same size.
plot_grid(choropleth, gradsymbol, labels = c("A) Choropleth Map", "B) Graduated Symbol Map", label_size = 12), ncol = 1, hjust = 0, label_x = 0, align = "hv")
A variety of additional resources for working with the sf package can be found at https://r-spatial.github.io/sf/index.html. These include a link to the sf “cheatsheet” as well as a variety of articles, vignettes, and blog posts that provide additional examples of how to work with vector geospatial data using this package. The book Geocomputation with R by Robin Lovelace and others (Lovelace, Nowosad, and Muenchow 2019) is also an excellent reference that encompasses sf as well as many other R packages for working with geospatial data. Look for their forthcoming second edition for the most up-to-date information on geospatial data processing with R.
Generate a map of tornado paths where the paths from each year are displayed as a different color, similar to the map of tornado points in Figure 5.4. Create a composite figure containing the map of tornado paths and the map of tornado points using
Summarize the density of tornado points by both county and year and generate a faceted plot that displays maps of county-level tornado density from 2016-2021.
Generate four choropleth maps of tornado density based on quantile breaks with numbers of classes ranging from 3 to 6. Create a composite figure containing the four maps using
plot_grid()and examine how the number of classes changes the interpretation of the map.