4.3 Mapping

Often, data will include a spatial component, and you will want to map the data either for exploratory data analysis or to present interesting aspects of the data to others. R has a range of capabilities for mapping data. The simplest techniques involve using data that includes latitude and longitude values and using these location values as the x and y aesthetics in a regular plot. R also has the ability to work with more complex spatial data objects and import shapefiles through extensions like the sp package.

In this section, we will cover the basics of mapping in R and touch on some of the more advanced possibilities. We will also present some useful packages for making quick but attractive maps in R. R also now has the capability to make interactive maps using the plotly and leaflet packages; in the end of this section, we’ll present these packages and explain a bit more about htmlWidgets in general.

4.3.1 Basics of mapping

4.3.1.1 Creating maps with ggplot2

The most basic way to map data in R is to create a regular ggplot object and map longitude to the x aesthetic and latitude to the y aesthetic. You can use this technique to create maps of geographic areas, like states or countries, and to map locations as points, lines, and other shapes. The ggplot2 package includes a few datasets with geographic information that can be accessed with the map_data function. We’ll pull one of these to use as an example of this basic method of mapping.

You can use the map_data function from the ggplot2 package to pull data for maps at different levels (“usa”, “state”, “world”, “county”). The data you pull give locations of the borders of geographic polygons like states and counties. For example, you can get the polygon location data for U.S. states by running the following code:

library(ggplot2)
us_map <- map_data("state")
head(us_map, 3)
       long      lat group order  region subregion
1 -87.46201 30.38968     1     1 alabama      <NA>
2 -87.48493 30.37249     1     2 alabama      <NA>
3 -87.52503 30.37249     1     3 alabama      <NA>

Notice that the dataframe includes columns with location (long and lat). It also includes a column describing the order in which these points should be connected to form a polygon (order), the name of the state (region), and a group column that separates the points into unique polygons that should be plotted (more on this in a minute).

If you plot the points for a couple of state, mapping longitude to the x aesthetic and latitude to the y aesthetic, you can see that the points show the outline of the state:

us_map %>% 
  filter(region %in% c("north carolina", "south carolina")) %>%
  ggplot(aes(x = long, y = lat)) +
  geom_point()

If you try to join these points by just using a path geom rather than a points geom, however, you’ll have a problem:

us_map %>% 
  filter(region %in% c("north carolina", "south carolina")) %>%
  ggplot(aes(x = long, y = lat)) +
  geom_path()

If you create a path for all the points in the map, without separating polygons for different geographic groupings (like states or islands), the path will be drawn without picking up the pen between one state’s polygon and the next state’s polygon, resulting in unwanted connecting lines.

Mapping a group aesthetic in the ggplot object fixes this problem. This will plot a separate path or polygon for each separate polygon. In the U.S. states data, each polygon’s group is specified by the group column. No two states share a group, and some states have more than one group (if, for example, they have islands). Here is the code for mapping the group column to the group aesthetic to create the map:

us_map %>% 
  filter(region %in% c("north carolina", "south carolina")) %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_path()

You may have noticed that we used a path rather than line geom to plot the state borders in the previous maps. This is because the line geom connects points by their order on the x-axis. While you often want that for statistical graphs, for maps in ggplot2 the x-axis is longitude, and we want to connect the points in a way that outlines the geographic areas. The geom_path function connects the points in the order they appear in the dataframe, which typically gives us the desired plot for mapping geographic areas. You likely will also sometimes want to use a polygon geom for mapping geographic areas, as shown in some of the following examples.

If you would like to set the color inside each geographic area, you should use a polygon geom rather than a path geom. You can then use the fill aesthetic to set the color inside the polygon and the color aesthetic to set the color of the border. For example, to set the interior of the states to blue and the borders to black, you can run:

us_map %>% 
  filter(region %in% c("north carolina", "south carolina")) %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "lightblue", color = "black")

To get rid of the x- and y-axes and the background grid, you can add the “void” theme to the ggplot output:

us_map %>% 
  filter(region %in% c("north carolina", "south carolina")) %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "lightblue", color = "black") + 
  theme_void()

To extend this code to map the full continental U.S., just remove the line of the pipe chain that filtered the state mapping data to North and South Carolina:

us_map %>% 
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "lightblue", color = "black") + 
  theme_void()

In the previous few graphs, we used a constant aesthetic for the fill color. However, you can map a variable to the fill to create a choropleth map with a ggplot object. For example, the votes.repub dataset in the maps package gives some voting data by state and year:

data(votes.repub)
head(votes.repub)
            1856  1860  1864  1868  1872  1876  1880  1884  1888  1892
Alabama       NA    NA    NA 51.44 53.19 40.02 36.98 38.44 32.28  3.95
Alaska        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
Arizona       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
Arkansas      NA    NA    NA 53.73 52.17 39.88 39.55 40.50 38.07 32.01
California 18.77 32.96 58.63 50.24 56.38 50.88 48.92 52.08 49.95 43.76
Colorado      NA    NA    NA    NA    NA    NA 51.28 54.39 55.31 41.13
            1896  1900  1904  1908  1912  1916  1920  1924  1928  1932
Alabama    28.13 34.67 20.65 24.38  8.26 21.97 30.98 27.01 48.49 14.15
Alaska        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
Arizona       NA    NA    NA    NA 12.74 35.37 55.41 41.26 57.57 30.53
Arkansas   25.11 35.04 40.25 37.31 19.73 28.01 38.73 29.28 39.33 12.91
California 49.13 54.48 61.90 55.46  0.58 46.26 66.24 57.21 64.70 37.40
Colorado   13.84 42.04 55.27 46.88 21.88 34.75 59.32 57.02 64.72 41.43
            1936  1940  1944  1948  1952  1956  1960 1964 1968 1972  1976
Alabama    12.82 14.34 18.20 19.04 35.02 39.39 41.75 69.5 14.0 72.4 43.48
Alaska        NA    NA    NA    NA    NA    NA 50.94 34.1 45.3 58.1 62.91
Arizona    26.93 36.01 40.90 43.82 58.35 60.99 55.52 50.4 54.8 64.7 58.62
Arkansas   17.86 20.87 29.84 21.02 43.76 45.82 43.06 43.9 30.8 68.9 34.97
California 31.70 41.35 42.99 47.14 56.39 55.40 50.10 40.9 47.8 55.0 50.89
Colorado   37.09 50.92 53.21 46.52 60.27 59.49 54.63 38.7 50.5 62.6 55.89

To create a choropleth for one of the years, you can tidy the data, join it with the U.S. data by state, and then map the voting percentages to the fill aesthetic:

library(dplyr)
library(viridis)

votes.repub %>%
  tbl_df() %>%
  mutate(state = rownames(votes.repub),
         state = tolower(state)) %>%
  right_join(us_map, by = c("state" = "region")) %>%
  ggplot(aes(x = long, y = lat, group = group, fill = `1976`)) +
  geom_polygon(color = "black") + 
  theme_void() + 
  scale_fill_viridis(name = "Republican\nvotes (%)")

A few interesting things about this example are:

  • This code uses piping and tidyverse functions to clean the data, merge it with the geographic data, and pipe to ggplot. See earlier sections of this book to find out more about tidying data.
  • The votes.repub data initially is a matrix. The tbl_df function from dplyr is used to convert it to a tibble.
  • The state names were originally in the row names of votes.repub. The mutate function is used to move these into a column of the dataframe. The names are then converted to lowercase to allow easier merging with the geographic data.
  • The voting data includes Alaska and Hawaii, but the geographic data does not. Therefore, we’ve used right_join to join the two datasets, so only non-missing values from the us_map geographic data will be kept.
  • Because the column names for the years do not follow the rules for naming R objects (“1976” starts with a number), we’ve surrounded the column name with backticks when calling it in the aesthetic statement.
  • We want the borders of the states to always be black, so we’ve set that aesthetic as a constant rather than mapping it to a variable by including it in an aes call.
  • We’ve added a void theme (theme_void) to take out axes and background, and we added a custom color scale from the viridis package (scale_fill_viridis) to customize the colors used in the choropleth.

If you have data with point locations, you can add those points to a map created with ggplot, too, by adding a point geom. As an example, we’ll use some data related to the popular “Serial” podcast. The podcast covered a murder in Baltimore. David Robinson posted a dataset of locations related to the show on GitHub, which you can read in directly to R to use for some of the mapping examples in this subset:

library(readr)
serial <- read_csv(paste0("https://raw.githubusercontent.com/",
                          "dgrtwo/serial-ggvis/master/input_data/",
                          "serial_podcast_data/serial_map_data.csv"))
head(serial, 3)
# A tibble: 3 x 5
      x     y      Type  Name Description
  <int> <int>     <chr> <chr>       <chr>
1   356   437 cell-site  L688        <NA>
2   740   360 cell-site  L698        <NA>
3   910   340 cell-site  L654        <NA>

He figured out a way to convert the x and y coordinates in this data to latitude and longitude coordinates, and the following code cleans up the data using that algorithm. The murder occurred when cell phones were just becoming popular, and cell phone data was used in the case. The code also adds a column for whether of not the location is a cell tower.

serial <- serial %>%
    mutate(long = -76.8854 + 0.00017022 * x,
           lat  = 39.23822 + 1.371014e-04 * y,
           tower = Type == "cell-site")
serial %>%
  slice(c(1:3, (n() - 3):(n())))
# A tibble: 7 x 8
      x     y          Type                            Name
  <int> <int>         <chr>                           <chr>
1   356   437     cell-site                            L688
2   740   360     cell-site                            L698
3   910   340     cell-site                            L654
4   960   830 base-location Campfield Early Learning Center
5   580  1230 base-location               Owings Mills Mall
6   720   496 base-location                   Adnan's house
7   954   410 base-location                    Jenn's house
# ... with 4 more variables: Description <chr>, long <dbl>, lat <dbl>,
#   tower <lgl>

We can use ggplot to map these data on a base map of Baltimore City and Baltimore County in Maryland. To do so, use the map_data function from ggplot2 to pull the “county” map. By specifying the region parameter as “maryland”, you can limit the returned geographic polygon data Maryland counties.

maryland <- map_data('county', region = 'maryland')
head(maryland)
       long      lat group order   region subregion
1 -78.64992 39.53982     1     1 maryland  allegany
2 -78.70148 39.55128     1     2 maryland  allegany
3 -78.74159 39.57420     1     3 maryland  allegany
4 -78.75878 39.58566     1     4 maryland  allegany
5 -78.74732 39.61430     1     5 maryland  allegany
6 -78.74732 39.63149     1     6 maryland  allegany

This data includes a column named subregion with the county. You can use that column to filter to just the data for Baltimore City (“baltimore city”) or Baltimore County (“baltimore”):

baltimore <- maryland %>%
  filter(subregion %in% c("baltimore city", "baltimore"))
head(baltimore, 3)
       long      lat group order   region subregion
1 -76.88521 39.35074     3   114 maryland baltimore
2 -76.89094 39.37939     3   115 maryland baltimore
3 -76.88521 39.40804     3   116 maryland baltimore

If you create a ggplot object with this data and add a polygon geom, you will have a base map of these two counties:

ggplot(baltimore, aes(x = long, y = lat, group = group)) + 
  geom_polygon(fill = "lightblue", color = "black") + 
  theme_void()

To add the locations from the serial data to this map, you just need to add a point geom, specifying the dataframe to use with the data parameter:

ggplot(baltimore, aes(x = long, y = lat, group = group)) + 
  geom_polygon(fill = "lightblue", color = "black") + 
  geom_point(data = serial, aes(group = NULL, color = tower)) + 
  theme_void() + 
  scale_color_manual(name = "Cell tower", values = c("black", "red"))

When you add a geom to a ggplot object with mapped aesthetics, the geom will inherit those aesthetics unless you explicitly override them with an aes call in the geom function. That is why we did not have to explicitly map longitude to x and latitude to y in the aes call when adding the points to the map (although, as a note, if the column names for the longitude and latitude columns had been different in the baltimore and serial dataframes, we would have needed to reset these aesthetics when adding the points).

Further, we mapped the group column in the geographic data to the group aesthetic, so the polygons would be plotted correctly. However, the serial dataframe does not have a column for group. Therefore, we need to unset the group aesthetic mapping in the point geom. We do that by specifying group = NULL in the aes statement of the point geom.

Note that we’ve also customized the map a bit by setting constant colors for the fill for the two counties (fill = "lightblue") and by setting the colors and legend name for the points using scale_color_manual. By mapping the color of the points to the tower column in the dataframe, we show points that are cell towers in a different color than all other points.

The ggplot function requires that you input data in a dataframe. In the examples shown in this section, we started with dataframes that included geographic locations (latitude and longitude) as columns. This is the required format of data for mapping with ggplot2 (or with extensions like ggmap). Sometimes, however, you will want to plot geographic data in R that is in a different format. In particular, most R functions that read shapefiles will read the data into a spatial object rather than a dataframe. To map this data with ggplot2 and related packages, you will need to transform the data into a dataframe. You can do this using the fortify function from ggplot2. We’ll cover this process in more detail in a later section, when we present spatial objects.

For a great step-by-step example of creating a beautiful map of age distribution in Switzerland with ggplot2, see this blog post by Timo Grossenbacher. This tutorial also provides a good example of customizing ggplot output using grid graphics, which we cover in a later subsection.

4.3.2 ggmap, Google Maps API

In the previous subsection, we used built-in datasets of geographic data to create the background when mapping point locations. This works well for locations for which map_data datasets exist at the appropriate aggregation level. However, there are only a few countries with data at a sub-country level available through this function. Further, you may want to use a base map that includes elements like roads or terrain.

The ggmap package allows you to do this by using tools from Google Maps directly from R. This package allows you to pull a base map from Google Maps and a few other map servers, which you can then plot and to which you can add points, polygons, lines, and other elements using ggplot2 functions. This package uses the Google Maps API, so you should read their terms of service and make sure you follow them if using this package.

You use the get_map function from this package to get base maps for different locations. To tell the function where you would like the map centered, you can either use the longitude and latitude of the center point of the map or you can use a character string to specify a location. If you do the second, get_map will use the Google Maps API to geocode the string to a latitude and longitude and then get the map (think of searching in Google Maps in the search box for a location). This will work well for most cities, and you can also use it with landmarks, but it might fail to geocode less well-known locations. You can also input an address as a character string when pulling a base map and Google will usually be able to successfully geocode and pull the right map. You can use the zoom parameter to set the amount the map is zoomed in on that location; this value should be between 3 and 20, with lower values more zoomed out and higher values more zoomed in.

For example, you can use the following code to pull a map of Beijing:

## install.packages("ggmap")
library(ggmap)
beijing <- get_map("Beijing", zoom = 12)

If you find that you’re getting an error like “Error: GeomRasterAnn was built with an incompatible version of ggproto” when you try to use the ggmap package, try reinstalling the package directly from GitHub using install_github(“dkahle/ggmap”) (you’ll need the devtools package installed and loaded to use install_github).

The get_map function returns a ggmap object. You can plot this object using the ggmap function:

ggmap(beijing)

The output of ggmap is a ggplot object, so you can add elements to it in the same way you would work with any other ggplot object. For example, to set the void theme and add a title, you could run:

ggmap(beijing) + 
  theme_void() + 
  ggtitle("Beijing, China")

While the default source for maps with get_map is Google Maps, you can also use the function to pull maps from OpenStreetMap and Stamen Maps. Further, you can specify the type of map, which allows you to pull a variety of maps including street maps and terrain maps. You specify where to get the map using the source parameter and what type of map to use with the maptype parameter.

Here are example maps of Estes Park, in the mountains of Colorado, pulled using different map sources and map types. Also, note that we’ve used the option extent = "device" when calling ggmap, which specifies that the map should fill the whole plot area, instead of leaving room for axis labels and titles. Finally, as with any ggplot object, we can save each map to an object. We do that here so we can plot them together using the grid.arrange function, which we’ll describe in more detail in a later section in this course.

map_1 <- get_map("Estes Park", zoom = 12,
                 source = "google", maptype = "terrain") %>%
  ggmap(extent = "device")

map_2 <- get_map("Estes Park", zoom = 12,
                 source = "stamen", maptype = "watercolor") %>%
  ggmap(extent = "device")

map_3 <- get_map("Estes Park", zoom = 12,
                 source = "google", maptype = "hybrid") %>%
  ggmap(extent = "device")

library(gridExtra)
grid.arrange(map_1, map_2, map_3, nrow = 1) 

The get_map function is sending a request and getting a response from the Google API. Therefore, this function will not work if your computer is offline or if your computer is not able to access the Google API (for example, because of a firewall).

In the above examples, get_map is pulling a map based on a character string of the location (e.g., “Estes Park”). The get_map function also allows you to request a map based on latitude and longitude. For example, to get a map centered at a longitude of 2.35 degrees and a latitude of 48.86 degrees, you can run:

get_map(c(2.35, 48.86), zoom = 10) %>%
  ggmap(extent = "device")
Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
instead

Once you have pulled one of these base maps into R, you can add ggplot elements to them, including point and polygon geoms for locations. For example, you could pull in a base map of the Baltimore County area and add the elements we plotted from the serial dataframe in the last subsection:

get_map("Baltimore County", zoom = 10, 
        source = "stamen", maptype = "toner") %>%
  ggmap() + 
  geom_polygon(data = baltimore, aes(x = long, y = lat, group = group),
               color = "navy", fill = "lightblue", alpha = 0.2) + 
  geom_point(data = serial, aes(x = long, y = lat, color = tower)) + 
  theme_void() + 
  scale_color_manual(name = "Cell tower", values = c("black", "red"))

Note that we used alpha to add some transparency to the polygons so you could see the base map through them.

Now that we’ve gone through some examples, here is a step-by-step review of how the mapping process works with ggmap:

  1. The get_map function pulls in a base map from the Google Maps API (or another map server like Stamen Maps). The returned value is a ggmap object.
  2. The ggmap function plots this ggmap object and returns a ggplot object. You can use this resulting ggplot object as you would any other ggplot object (e.g., add geoms, change theme).
  3. Call other ggplot2 functions on this output to add locations and customize the map. Map longitude in the data to the x aesthetic and latitude to the y aesthetic. Note that you are adding locations using a new dataframe for the geom. Just as with regular ggplot objects, if you use a new dataframe for a geom, you must specify it with the data parameter for that geom. Because geoms do not take dataframes as their first arguments, you can’t specify the dataframe first without “data =” and rely on position with geoms. (By contrast, the ggplot function does take the data parameter as its first argument, so that’s why you can get away with not using “data =” when specifying a dataframe in the original ggplot call for a regular ggplot object.)

You can use the ggmap package to do a number of other interesting tasks related to geographic data. For example, the package allows you to use the Google Maps API, through the geocode function, to get the latitude and longitude of specific locations based on character strings of the location or its address. For example, you can get the location of the Supreme Court of the United States by calling:

geocode("Supreme Court of the United States")
        lon      lat
1 -77.00444 38.89064

You could also get its location by calling its address:

geocode("1 First St NE, Washington, DC")
        lon      lat
1 -77.00465 38.89051

You can compute map distances, too, using the mapdist function with two locations:

mapdist("Baltimore, MD",
        "1 First St NE, Washington, DC") %>%
  select(from, to, miles)
           from                            to    miles
1 Baltimore, MD 1 First St NE, Washington, DC 37.89608

To find out more about how Google Maps is performing this and other tasks, you can read its API documentation.

For these GIS-style tasks, the ggmap package is not running its own algorithms but rather using the Google Maps API. This package cannot do other GIS tasks, like finding the centroids or areas of spatial polygons. To use R as a GIS for more substantive tasks, you’ll need to use other R packages, like sp and rgdal.

4.3.3 Mapping US counties and states

If you need to map US states and counties, the choroplethr and choroplethrMaps packages offer functions for fast and straightforward mapping. This package also offers an interesting example of incorporating mapping functions within an R package. You can explore the code for the package, as well as some documentation, at the choroplethr package’s GitHub page: https://github.com/trulia/choroplethr.

This subsection is included to illustrate an interesting mapping package that may prove useful to readers mapping US-based data. The package itself is also an interesting example of a package built to facilitate mapping. However, the details of this subsection are not necessary to master to understand the rest of the material on mapping, so you can skip this section if the applications do not appear relevant to your work.

As an example, we’ll use data on county-level population in 2012 that comes as the dataset df_pop_county with the choroplethr package. This dataset gives the population of each county (value) and the county FIPS number (region), which is a unique identification number for each US county.

library(choroplethr)
library(choroplethrMaps)
data(df_pop_county)
df_pop_county %>% slice(1:3)
# A tibble: 3 x 2
  region  value
   <dbl>  <dbl>
1   1001  54590
2   1003 183226
3   1005  27469

As long as you are using a dataframe with a numeric column named region with each county’s FIPS code and a column named value with the value you’d like to map (population in this case), you can create a choropleth just by running the county_choropleth function on the dataframe.

county_choropleth(df_pop_county)

If you want to only plot some of states, you can use the state_zoom argument:

county_choropleth(df_pop_county, state_zoom = c("colorado", "wyoming"))

To plot values over a reference map from Google Maps, you can use the reference_map argument:

county_choropleth(df_pop_county, state_zoom = c("north carolina"),
                  reference_map = TRUE)

This example is using one of the datasets that comes with the choroplethr package, but you can map any dataset that includes a column with county FIPS and a column with the value you would like to plot. All you have to do is (1) make sure the county FIPS is in a numeric class and (2) name the columns for FIPS and the value to plot as “region” and “value”, respectively (the rename function from dplyr is useful here). For example, here is a dataframe giving storm events that were listed in NOAA’s Storm Events database near Hurricane Floyd’s track:

library(readr)
floyd_events <- read_csv("data/floyd_events.csv") 
floyd_events %>% slice(1:3)
# A tibble: 3 x 4
  begin_date   end_date  fips       type
      <date>     <date> <chr>      <chr>
1 1999-09-16 1999-09-17 25011 Heavy Rain
2 1999-09-16 1999-09-17 25001 Heavy Rain
3 1999-09-16 1999-09-17 25015 Heavy Rain

You can use the following code to plot the number of events listed for each US county by cleaning and summarizing the data in a pipe chain and then piping the output to the county_choropleth function. The choropleth mapping functions require that each county is included only once, so we used group_by and summarize to collapse the dataframe to have only a single observation for each county.

floyd_events %>% 
  group_by(fips) %>%
  dplyr::summarize(n_events = n()) %>%
  mutate(fips = as.numeric(fips)) %>%
  dplyr::rename(region = fips, 
         value = n_events) %>%
  county_choropleth(state_zoom = c("north carolina", "virginia"),
                    reference_map = TRUE)

The map created by county_choropleth (and the other maps created by functions in the choroplethr package) is a ggplot object, so you can add elements to it. For example, to create a map of flood events that includes the track of Hurricane Floyd on the map, you can run:

floyd_track <- read_csv("data/floyd_track.csv")

floyd_events %>% 
  dplyr::group_by(fips) %>%
  dplyr::summarize(flood = sum(grepl("Flood", type))) %>%
  dplyr::mutate(fips = as.numeric(fips)) %>%
  dplyr::rename(region = fips, 
                value = flood) %>%
  county_choropleth(state_zoom = c("north carolina", "maryland", 
                                   "delaware", "new jersey",
                                   "virginia", "south carolina",
                                   "pennsylvania", "new york",
                                   "connecticut", "massachusetts",
                                   "new hampshire", "vermont",
                                   "maine", "rhode island"),
                    reference_map = TRUE) + 
  geom_path(data = floyd_track, aes(x = -longitude, y = latitude,
                                    group = NA),
            color = "red")

To create county choropleths with the choroplethr package that are more customized, you can use the package’s CountyChoropleth, which is an R6 object for creating custom county choropleths. To create an object, you can run CountyChoropleth$new with the data you’d like to map. As with county_choropleth, this data should have a column named “region” with county FIPS codes in a numeric class and a column named “values” with the values to plot. To map counties in which a flood event was reported around the time of Floyd, you can start by cleaning your data and then creating an object using CountyChoropleth$new:

floyd_floods <- floyd_events %>% 
  filter(grepl("Flood", type)) %>%
  mutate(fips = as.numeric(fips)) %>%
  group_by(fips) %>%
  dplyr::summarize(value = 1) %>%
  ungroup() %>%
  dplyr::rename(region = fips) 

floyd_map <- CountyChoropleth$new(floyd_floods)

As a note, in cleaning the data here, we wanted to limit the dataset to only observations where the event type included the word “Flood” (this will pull events listed as “Flood” or “Flash Flood”), so we’ve used the grepl function to filter to just those observations.

Once you have created a basic object using CountyChoropleth, you can use the methods for this type of object to customize the map substantially. For example, you can set the states using the set_zoom method:

floyd_map$set_zoom(c("north carolina", "maryland", 
                     "delaware", "new jersey",
                     "virginia", "south carolina",
                     "pennsylvania", "new york",
                     "connecticut", "massachusetts",
                     "new hampshire", "vermont",
                     "maine", "rhode island"))

At any point, you can render the object using the render method (or render_with_reference_map, to plot the map with the Google reference map added):

floyd_map$render()

To find out what options are available for this object type, in terms of methods you can use or attributes you can change, you can run:

names(floyd_map)
 [1] ".__enclos_env__"           "add_state_outline"        
 [3] "ggplot_polygon"            "projection"               
 [5] "ggplot_scale"              "warn"                     
 [7] "legend"                    "title"                    
 [9] "choropleth.df"             "map.df"                   
[11] "user.df"                   "clone"                    
[13] "clip"                      "initialize"               
[15] "set_zoom"                  "render_state_outline"     
[17] "render_helper"             "render"                   
[19] "set_num_colors"            "get_zoom"                 
[21] "format_levels"             "theme_inset"              
[23] "theme_clean"               "get_scale"                
[25] "prepare_map"               "bind"                     
[27] "discretize"                "render_with_reference_map"
[29] "get_choropleth_as_polygon" "get_reference_map"        
[31] "get_y_scale"               "get_x_scale"              
[33] "get_bounding_box"          "get_max_lat"              
[35] "get_min_lat"               "get_max_long"             
[37] "get_min_long"             

The following code shows an example of customizing this county choropleth map:

floyd_map$add_state_outline <- TRUE
floyd_map$ggplot_scale <- scale_fill_manual(values = c("yellow"),
                                            guide = FALSE)
floyd_xlim <- floyd_map$get_bounding_box()[c(1, 3)]
floyd_ylim <- floyd_map$get_bounding_box()[c(2, 4)]

a <- floyd_map$render() + 
  geom_path(data = floyd_track, aes(x = -longitude, y = latitude,
                                    group = NA),
            color = "red", size = 2, alpha = 0.6) + 
            xlim(floyd_map$get_bounding_box()[c(1, 3)]) + 
            ylim(floyd_map$get_bounding_box()[c(2, 4)])
            
b <- floyd_map$render_with_reference_map() + 
  geom_path(data = floyd_track, aes(x = -longitude, y = latitude,
                                    group = NA),
            color = "red", size = 2, alpha = 0.6) + 
            xlim(floyd_xlim) + 
            ylim(floyd_ylim)
            
library(gridExtra)
grid.arrange(a, b, ncol = 2)

Here, we’ve used $add_state_outline to change the object to include state outlines (this only shows up when you render the map without a background reference map). We’ve also used $ggplot_scale to customize the colors used for plotting counties with flood events and to remove the map legend. The get_bounding_box method pulls the range of latitudes and longitudes for the current map. We’re planning to add the hurricane track to the map, and the hurricane track extends well beyond this range. To later limit the map to these states, we’re using this get_bounding_box method to get these boundaries, and then we’ve used those values for the xlim and ylim functions when we create the final ggplot objects. Finally, the rendered maps are ggplot objects, so to include the hurricane track, we can add ggplot elements to the map using +, as with any ggplot object. We used the grid.arrange function from the gridExtra package to put the two maps (with and without the background Google map) side-by-side.

4.3.4 More advanced mapping – Spatial objects

So far, we have relied on ggplot and related packages for mapping. However, there are other systems for mapping in R. In particular, geographic data in R is often stored in spatial objects (e.g., SpatialPolygons, SpatialPointsDataframe), particularly when it is read in from the shapefiles commonly used to store spatial data outside of R.

In this subsection we will introduce these spatial objects, show how to work with them in R (including how to convert them to dataframes so they can be used with the ggplot-based mapping covered in earlier subsections), and briefly describe shapefiles.

4.3.4.1 Spatial objects in R

R has a series of special object types for spatial data. For many mapping / GIS tasks, you will need your data to be in one of these objects. These spatial objects include objects that just contain geographies (e.g., locations along the borders of countries) or objects that contain geographies and associated attributes of each element of the geography (e.g., county boundaries as well as the population of each country). The most common spatial objects in R are:

  • SpatialPolygons
  • SpatialPoints
  • SpatialLines
  • SpatialPolygonsDataFrame
  • SpatialPointsDataFrame
  • SpatialLinesDataFrame

The tigris package lets you pull spatial data directly from the US Census. This data comes into R as a spatial object. To provide a basic overview of working with spatial object in R, we will use an example spatial object pulled with this package.

The tigris package offers a very convenient way to pull a variety of geographic data for the United States. To find out more, check out the article “tigris: An R Package to Access and Work with Geographic Data from the US Census Bureau” in The R Journal.

The tigris package includes a function called tracts that allows you to pull the geographic data on boundaries of U.S. Census tracts. You can use the state and county parameters to limit the result to certain counties, and you can set cb = FALSE if a lower-resolution (and smaller) file is adequate. To pull census tract boundaries for Denver, CO, you can run:

library(tigris)
library(sp)
denver_tracts <- tracts(state = "CO", county = 31, cb = TRUE)

By running class on the returned object, you can see that this function has returned a SpatialPolygonsDataFrame object.

class(denver_tracts)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"

Spatial objects like this have a plot methods that can be called to plot the object. This means that you can map these census tract boundaries by calling:

plot(denver_tracts)

There number of other methods for this specific object type. For example, bbox will print out the bounding box of the spatial object (range of latitudes and longitudes covered by the data).

bbox(denver_tracts)
         min        max
x -105.10993 -104.60030
y   39.61443   39.91425

The is.projected and proj4string functions give you some information about the current Coordinate Reference System of the data (we describe more about Coordinate Reference Systems later in this subsection).

is.projected(denver_tracts)
[1] FALSE
proj4string(denver_tracts)
[1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"

If a spatial object includes attribute data (i.e., is an object type *DataFrame), you can access a “slot” in the spatial object to pull out an attribute dataframe using @. For example, here’s the beginning of the dataframe for the denver_tracts spatial object:

head(denver_tracts@data)
   STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID  NAME LSAD
25      08      031  000201 1400000US08031000201 08031000201  2.01   CT
26      08      031  000302 1400000US08031000302 08031000302  3.02   CT
27      08      031  001101 1400000US08031001101 08031001101 11.01   CT
28      08      031  002802 1400000US08031002802 08031002802 28.02   CT
29      08      031  003300 1400000US08031003300 08031003300    33   CT
30      08      031  004006 1400000US08031004006 08031004006 40.06   CT
     ALAND AWATER
25 2084579      0
26 1444043      0
27  898885      0
28  886798      0
29 1288718      0
30 1953041      0

For this spatial object, the data includes identifying information (state, county, tract), but also some attribute data (area of the tract that is land, area of the tract that is water).

You can add different layers of spatial objects onto the same plot. To do that, just use add = TRUE for added layers. For example, to add primary roads to the Denver census tract map, you can pull a spatial object with roads using the primary_roads function from the tigris package (note: this data includes roads across the U.S. and so might take a few seconds to download or render) and then use plot with add = TRUE to add the roads to the map:

roads <- primary_roads()

plot(denver_tracts, col = "lightblue")
plot(roads, add = TRUE, col = "darkred")

Geographic data outside of R is often stored in shapefiles. If you read in a shapefile from a file (more details on shapefiles later in this section), it will automatically be read in as one of these shape objects. However, you can also convert other data into shape objects. For example, you could convert the dataframes with spatial data on U.S. state boundaries that we used earlier in this book into spatial objects. Functions from sp package (e.g., SpatialPoints, SpatialPointsDataFrame) can be used to convert data stored in dataframes into spatial objects.

There likely will also be cases when you have a spatial object but would like to plot it using ggplot-type mapping. In this case, you need to convert the spatial object to a dataframe, since ggplot will only input dataframes. You can use the fortify function from ggplot2 to do this. For example, to convert the Denver census tracts spatial object to a dataframe, you could call:

denver_tracts_df <- fortify(denver_tracts)
denver_tracts_df %>%
  dplyr::select(1:4) %>% dplyr::slice(1:5)
# A tibble: 5 x 4
       long      lat order  hole
      <dbl>    <dbl> <int> <lgl>
1 -105.0251 39.79400     1 FALSE
2 -105.0213 39.79398     2 FALSE
3 -105.0208 39.79109     3 FALSE
4 -105.0158 39.79107     4 FALSE
5 -105.0064 39.79105     5 FALSE

Now you can use the data in to create a map using ggplot2 functions:

denver_tracts_df %>%
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(fill = "lightblue", color = "black") + 
  theme_void()

4.3.4.2 Coordinate reference systems

Spatial objects will have a Coordinate Reference Systems (CRSs), which specifies how points on a curved earth are laid out on a two-dimensional map. A CRS can be geographic (e.g., WGS84, for longitude-latitude data) or projected (e.g., UTM, NADS83). The full details of map projections is beyond the scope of this course, but if you’d like to find out more details, this section of the documentation for QGIS is very helpful. For working with spatial objects in R, it is important to realize that spatial objects have a Coordinate Reference System attribute and that you can run into problems if you try to work directly with two spatial objects with different Coordinate Reference Systems.

To find out the CRS for a spatial object (like a SpatialPoints object) that already has a CRS, you can use proj4string. For example, to get the CRS of the Denver census tract data, you can run:

proj4string(denver_tracts)
[1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"

If you create a spatial object from a dataframe in R, rather than reading it in from a shapefile, you will need to specify the data’s Coordinate Reference System. You can use proj4string to set this attribute by calling it on the left of the assignment arrow:

## Generic code
proj4string(my_spatial_object) <- "+proj=longlat +datum=NAD83"

Note that this call does not create a projection or reproject the data. Rather, this call is specify to R the CRS the data currently is in.

The CRS function from the sp package creates CRS class objects that can be used in this specification. You input projection arguments into this function as a character string (for example, CRS("+proj=longlat +datum=NAD27")). You can also, however, use a shorter EPSG code for a projection (for example, CRS("+init=epsg:28992")). The http://www.spatialreference.org website lists these projection strings and can be useful in determining a string to use when setting projection information or re-projecting data.

library(sp)
CRS("+proj=longlat +datum=NAD27")
CRS arguments:
 +proj=longlat +datum=NAD27 +ellps=clrk66
+nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat 
CRS("+init=epsg:28992")
CRS arguments:
 +init=epsg:28992 +proj=sterea +lat_0=52.15616055555555
+lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000
+ellps=bessel
+towgs84=565.4171,50.3319,465.5524,-0.398957,0.343988,-1.87740,4.0725
+units=m +no_defs 

If a spatial object has a CRS and you want to change it, you should do so using the spTransform function from the rgdal package. You input the spatial object whose CRS you want to change as well as the CRS object to which to change it:

## Generic code
my_spatial_object <- spTransform(my_spatial_object,
                                 CRS = CRS("+init=epsg:4267"))

If you want to ensure that the CRS of two spatial objects agree, you can use proj4string to pull the CRS from one of the spatial objects and specify that as the output CRS for an spTransform call on the other object, like:

## Generic code
my_spatial_object <- spTransform(my_spatial_object,
                                 CRS = proj4string(another_sp_object))

If you are interested in finding out more, Melanie Frazier has created an excellent resource on Coordinate Reference Systems and maps in R: https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/OverviewCoordinateReferenceSystems.pdf

The coord_map function in ggplot2 can help you in plotting maps with different projections. This function does not change any aspect of the data being mapped, but rather changes the projection when mapping the data. In fact, since this function is used with ggplot-style mapping, all data being mapped will be in dataframes rather than spatial objects and so will not have specifications for CRS. The following examples, which are adapted from the help file for the coord_map function, show example output when the coord_map element is added to a map of the United States:

usamap <- map_data("state") %>%
  ggplot(aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

map_1 <- usamap + coord_map() + ggtitle("default") 
map_2 <- usamap + coord_map("gilbert") + ggtitle("+ coord_map('gilbert')")
map_3 <- usamap + coord_map("conic", lat0 = 30) + 
  ggtitle("+ coord_map('conic', lat0 = 30)")

grid.arrange(map_1, map_2, map_3, ncol = 1)

4.3.4.3 Shapefiles

Shapefiles are a file format that is often used for saving and sharing geographic data, particularly when using GIS software. This format is not R-specific, but R can read in and write out shapefiles. The format is typically not a single file, but rather a directory of related files in a directory. Shapefiles often include both geographic information and also data describing attributes (for example, a shapefile might include the locations of country borders as well as the population of each of the countries).

To read shapefiles into R, use the readOGR function from the rgdal package. You can also write out spatial objects you’ve created or modified in R to shapefiles using the writeOGR from the same package. The readShape* family of functions from the maptools package can also be used to read shapefiles into R. These functions all read the spatial data in as a spatial object. For example, the shapefiles of country borders and populations would be read in as a SpatialPolygonsDataFrame object. Once you read a shapefile into R, you can work with the spatial object in the same way we showed how to work with the Denver census tracts spatial object earlier in this subsection.

To find out more about shapefiles, R, and ggplot2, check out the wiki listing at https://github.com/tidyverse/ggplot2/wiki/plotting-polygon-shapefiles.

4.3.4.4 R as GIS

In addition to mapping, you can also use R for a number of GIS-style tasks, including:

  • Clipping
  • Creating buffers
  • Measuring areas of polygons
  • Counting points in polygons

These tasks can be done with GIS software, and if you are doing extensive GIS work, it may be worthwhile to use specialized software. However, if you just need to do a few GIS tasks as part of a larger workflow, you should consider using R for these steps. Some advantages to using R for GIS tasks are:

  • R is free
  • You can write all code in a script, so research is more reproducible
  • You save time and effort by staying in one software system, rather than moving data between different software

To show some of the GIS-style tasks that can be done from R, we’ll use some driver-level data from the Fatality Analysis Reporting System (FARS) for 2001–2010, which we have saved as fars_colorado.RData:

load("data/fars_colorado.RData")
driver_data %>% 
  dplyr::select(1:5) %>% dplyr::slice(1:5)
# A tibble: 5 x 5
  state st_case county                date latitude
  <dbl>   <dbl>  <dbl>              <dttm>    <dbl>
1     8   80001     51 2001-01-01 10:00:00 39.10972
2     8   80002     31 2001-01-04 19:00:00 39.68215
3     8   80003     31 2001-01-03 07:00:00 39.63500
4     8   80004     31 2001-01-05 20:00:00 39.71304
5     8   80005     29 2001-01-05 10:00:00 39.09733

This data is currently in a dataframe, which means we can map it using ggplot2:

map_data("county", region = "Colorado") %>%
  ggplot(aes(x = long, y = lat, group = subregion)) + 
  geom_polygon(color = "gray", fill = NA) + 
  theme_void() + 
  geom_point(data = driver_data,
             aes(x = longitud, y = latitude, group = NULL),
             alpha = 0.5, size = 0.7) 

The dataset includes a column for the county in which each accident occurred, so you can also aggregate the data by county and use a function from the choroplethr package to quickly create a county-specific choropleth of accident counts (note that, because the data is driver specific, this will count every car in an accident):

library(stringr)
county_accidents <- driver_data %>%
  dplyr::mutate(county = str_pad(county, width = 3,
                          side = "left", pad = "0")) %>%
  tidyr::unite(region, state, county, sep = "") %>%
  dplyr::group_by(region) %>%
  dplyr::summarize(value = n()) %>%
  dplyr::mutate(region = as.numeric(region))
county_accidents %>% slice(1:4)
# A tibble: 4 x 2
  region value
   <dbl> <int>
1   8001   617
2   8003    77
3   8005   522
4   8007    41
county_choropleth(county_accidents, state_zoom = "colorado")

As a note, this code uses the str_pad function from the stringr package to pad 1- or 2-digit county FIPS codes with leading zeros before pasting them to the state FIPS code and uses the n function from dplyr with summarize to count the number of observations ins each county.

This technique of creating a choropleth only worked because we had a column in the data linking accidents to counties. In same cases, you will want to create a choropleth based on counts of points but will not have this linking information in the data. For example, we might want to look at accident counts by census tract in Denver. To do this, we’ll need to link each accident (point) to a census tract (polygon), and then we can count up the number of points linked to each polygon. We can do this with some of the GIS-style tools available in R.

To start, we’ve created a dataframe with only accidents in Denver (based on the county column in the accident data):

denver_fars <- driver_data %>% 
  filter(county == 31)

Now, we want to count up how many of these accidents occurred in each of the census tract shapes we pulled from the US Census earlier in this section using the tigris package. We can do this using the poly.counts function from GISTools package. However, before we can use that function, we need to make sure both objects are spatial objects, rather than dataframes, and that they have the same CRS.

The census tract data is already in a spatial object. We can change put the denver_fars object in a spatial object by resetting its coordinates attribute to specify which of its columns show longitude and latitude. Next, we need to specify what CRS the data has. Because it is coming from a dataframe of longitude and latitude data, the WSG84 system should be reasonable:

library(sp)
denver_fars_sp <- denver_fars 
coordinates(denver_fars_sp) <- c("longitud", "latitude")
proj4string(denver_fars_sp) <- CRS("+init=epsg:4326")

This object is now also a spatial object:

class(denver_fars_sp)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"

To be able to pair up polygons and points, the spatial objects need to have the same CRS. To help later with calculating the area of each polygon, we’ll use a projected CRS that is reasonable for Colorado and reproject the spatial data using the spTransform function:

denver_tracts_proj <- spTransform(denver_tracts, 
                                  CRS("+init=epsg:26954"))
denver_fars_proj <- spTransform(denver_fars_sp, 
                                CRS(proj4string(denver_tracts_proj)))

Now that the objects with the accident locations and with the census tracts are both spatial objects with the same CRS, we can combine them on a map. Because they are spatial objects, we can do that using plot:

plot(denver_tracts_proj)
plot(denver_fars_proj, add = TRUE, col = "red", pch = 1)

Now, that both datasets are in spatial objects and have the same CRS, you can use the poly.counts function to count how many of the accidents are in each census tract. This function inputs a spatial points object and a spatial polygons object and outputs a numeric vector with the count of points in each polygon:

library(GISTools)
tract_counts <- poly.counts(denver_fars_proj, denver_tracts_proj)
head(tract_counts)
25 26 27 28 29 30 
 7  2  2  0  0  4 

You can use a choropleth to show these accident counts. In this case, the quickest way to do this is probably to use the choropleth function in the GISTools package.

choropleth(denver_tracts, tract_counts)

There are other functions in R that do other GIS tasks. For example, There is function in the GISTools package that calculates the area of each polygon.

head(poly.areas(denver_tracts_proj))
       25        26        27        28        29        30 
2100172.2 1442824.1  897886.3  881530.5 1282812.2 1948187.1 

You can use this functionality to create a choropleth of the rate of fatal accidents per population in Denver census tracts:

choropleth(denver_tracts, 
           tract_counts / poly.areas(denver_tracts_proj))

4.3.4.5 Raster data

When mapping in R, you may also need to map raster data. You can think of raster data as data shown with pixels— the graphing region is divided into even squares, and color is constant within each square.

There is a function in the raster package that allows you to “rasterize” data. That is, you take spatial points data, divide the region into squares, and count the number of points (or other summary) within each square. When you do this, you need to set the x- and y-range for the raster squares. You can use bbox on a spatial object to get an idea of its ranges to help you specify these limits. You can use the res parameter in raster to set how large the raster boxes should be. For example, here is some code for rasterizing the accident data for Denver:

library(raster)
bbox(denver_fars_sp)
                min       max
longitud -105.10973 -104.0122
latitude   39.61715   39.8381
denver_raster <- raster(xmn = -105.09, ymn = 39.60,
                        xmx = -104.71, ymx = 39.86,
                        res = 0.02)
den_acc_raster <- rasterize(geometry(denver_fars_sp),
                            denver_raster,
                            fun = "count")

You can use the image function to plot this raster alone:

image(den_acc_raster, col = terrain.colors(5))

You can use plot with add = TRUE to add the raster to a base plot of Denver. In this case, you will likely want to set some transparency (alpha) so you can see the base map through the raster:

plot(denver_tracts)
plot(den_acc_raster, add = TRUE, alpha = 0.5)

4.3.5 Where to find more on mapping with R

There is a lot more you can learn about mapping in R than we could cover here. Here are some good resources if you would like to learn more: