Module 3 Data Wrangling

In this chapter we summarize the various packages and functions that can be used to manipulate and map spatial data in R.

Note: We link to the R documentation for each package below. Click on a package name highlighted in blue to review its CRAN and help documentation.

Additional Resources

dplyr cheatsheet

sf cheatsheet

sf introduction

3.1 Introduction to DPLYR

The dplyr package is part of the tidyverse (a collection of R packages) and provides a suite of tools for manipulating and summarizing data tables and spatial data. dplyr tools can be used with a piping syntax that chains a sequence of functions together. Think of dplyr functions as transformation spells that transform data from one thing into another.

Here are a few useful functions:

  • select(): selects a subset of specified columns and returns them in the order listed.
  • filter(): selects a subset of rows in a data frame based on user-defined conditional statements.
  • mutate(): creates and adds new columns based on a given set of operations.
  • recode(): recodes the values of a column based on a set of replacements.
  • arrange(): changes the order of rows according to the columns specified.
  • desc(): orders a column in descending order.
  • summarize(): summarizes values with functions that are passed, e.g., mean, max.
  • group_by(): groups values for a given category.
  • full_join(), right_join(), left_join(): merge data sets, keeping either all observations or a subset depending on the join type.

The dplyr functions take a data frame as the first argument and layer functions using what we refer to as the piping operator \(%>%\). It condenses many lines of code into a single chunk. To illustrate dplyr tools, we use the AirPassengers data set that is available in base R.

First we must install the R packages if they have not already been installed.

Next we call the tidyverse and dplyr packages.

# Call libraries
library(tidyverse)
library(dplyr)
library(tidyr)

# To explore more of dplyr documentation and examples use code below.
#vignette("dplyr", package = "dplyr")

Next we read in the AirPassengers data using the \(data()\) function. This data set records monthly totals of international airline passengers from 1949 to 1960. Using the \(class()\) function, you can see what type of R object it is. This object is a time series (ts) and needs to be converted to a data frame.

#Read in the AirPassengers data
data("AirPassengers")
#Determine the type of R object
class(AirPassengers)
[1] "ts"

We transform the time series into a data frame with the below commands. The rows are the different years and the columns are months Jan-Dec.

#Transform the time series data into a data frame
air_passengers_df <- matrix(AirPassengers, ncol = 12, nrow = 12, byrow = T)
rownames(air_passengers_df) <- 1949:1960
colnames(air_passengers_df) <- c("Jan", "Feb", "Mar", "Apr", "May",
                                 "Jun", "Jul", "Aug", "Sep", "Oct",
                                 "Nov", "Dev")

air_passengers_df <- data.frame(air_passengers_df)
head(air_passengers_df)
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229

What is piping? A piping syntax allows us to manipulate/mutate the data using a sequence of transformations, i.e., operations are chained together. The pipeline operator is denoted using %>%.
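To make piping concrete, here is a minimal sketch (using the built-in \(airquality\) data, which is separate from our example) showing that a pipe simply rewrites a nested function call so it reads left to right:

#Nested form: read inside-out.
nested_result <- head(filter(airquality, Month == 5))

#Piped form: each step feeds into the next.
piped_result <- airquality %>%
  filter(Month == 5) %>%
  head()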

For example, in many cases we need to switch from wide data to long data or vice-versa. In this case, we want a monthly count as a unique row, so we want to transform this data from wide (going across columns) to long (going down rows).

Sequence of operations:

  • Create a year variable in the data because we want each row to be a unique month-year combination. The \(rownames()\) function extracts the names of each row (i.e., the year corresponding to the row). The \(as.numeric()\) function turns characters into numbers.
air_passengers_df$year <- as.numeric(rownames(air_passengers_df))
  • Create a new data set called air_passengers_long and assign (<-) the piped result to it.
  • Use the gather() function from the tidyr package to convert the data to long.
air_passengers_long <- air_passengers_df %>%
  gather(key= "Month", value = "count", -year)

head(air_passengers_long)
  year Month count
1 1949   Jan   112
2 1950   Jan   115
3 1951   Jan   145
4 1952   Jan   171
5 1953   Jan   196
6 1954   Jan   204

If we want to filter the data to only include the month of January, we use the \(filter()\) function shown in the first code chunk below. If we want to filter to Jan and years greater than 1955, we use the second chunk.

#Use the filter function to subset only to data for Jan.
jan_only_df <- air_passengers_long %>%
  filter(Month == "Jan")

#Use the filter function to subset only to data for Jan AND for years > 1955
jan1955_only_df <- air_passengers_long %>%
  filter(Month == "Jan" & year > 1955)

We can arrange the dataset in order of a variable (e.g., by year) using the arrange() function. \(arrange()\) orders the rows by the values selected in the columns. It sorts from smallest to largest.

#Arrange the data by year
arrange_df <- air_passengers_long %>%
  arrange(year)
head(arrange_df)
  year Month count
1 1949   Jan   112
2 1949   Feb   118
3 1949   Mar   132
4 1949   Apr   129
5 1949   May   121
6 1949   Jun   135

Two important functions are the mutate() and summarize() functions. We create new variables (not already defined) using the \(mutate()\) function. For example, if we want to create a new logical variable that is TRUE if year > 1955, and FALSE otherwise, we can use the code below, which uses the combination of the \(mutate()\) and \(ifelse()\) functions.

#Create a new logical variable using the mutate function: if year > 1955 then TRUE, else FALSE.
mutate_df <- air_passengers_df %>%
  mutate(year_indicator = ifelse(year > 1955, TRUE, FALSE))

If we want to summarize the data, we can use the summarize() function. For example, if we want the mean number of passengers by year, we first use the group_by() function to define which groups we want to summarize, in this case it is year. Then we use the summarize() function to define what summaries we want, e.g., mean, sd, median, and others.

# First group_by year to group the data by year variable.
#Then summarize the mean number of passengers to get the mean BY year.
summary_df <- air_passengers_long %>%
  group_by(year) %>%
  summarize(mean_number = mean(count))

head(summary_df)
# A tibble: 6 × 2
   year mean_number
  <dbl>       <dbl>
1  1949        127.
2  1950        140.
3  1951        170.
4  1952        197 
5  1953        225 
6  1954        239.

The piping syntax allows us to perform commands in sequence in one code chunk rather than as multiple separate transformations. The code below shows how we can combine commands into one executable chunk. Note: The sequence of commands is important. For example, we must create the year variable using the \(mutate()\) function before we filter by year.

library(dplyr)
library(tidyr)
data("AirPassengers")

air_passengers_mean_data <- data.frame(matrix(AirPassengers, ncol = 12, nrow = 12, byrow = T)) %>%
  dplyr::rename(c("Jan" = "X1", "Feb" = "X2", "Mar" = "X3", "Apr" = "X4", "May" = "X5",
                  "Jun" = "X6", "Jul" = "X7", "Aug" = "X8", "Sep" = "X9", "Oct" = "X10",
                  "Nov" = "X11", "Dec" = "X12")) %>%
  dplyr::mutate(year = 1949:1960,
                year_indicator = ifelse(year > 1955, TRUE, FALSE)) %>%
  tidyr::gather(key = "Month", value = "count", -year, -year_indicator) %>%
  dplyr::filter(Month == "Jan") %>%
  dplyr::arrange(year) %>%
  dplyr::group_by(year) %>%
  dplyr::summarize(mean_number = mean(count))

air_passengers_mean_data
# A tibble: 12 × 2
    year mean_number
   <int>       <dbl>
 1  1949         112
 2  1950         115
 3  1951         145
 4  1952         171
 5  1953         196
 6  1954         204
 7  1955         242
 8  1956         284
 9  1957         315
10  1958         340
11  1959         360
12  1960         417

3.2 Introduction to sf

A class of spatial objects called simple features (sf) has been defined. An \(sf\) object appears as a data frame with an extra column named geometry that contains the geometric information for the spatial part of the object. For the rest of the chapter, we will learn how to manipulate spatial features.

Illustration by Allison Horst, see Simple Features Blog

What is a feature?

A feature is thought of as a thing, or an object in the real world, such as a building or a tree. Features have geometry describing where on Earth the feature is located, and they have attributes, which describe other properties.

Simple feature geometry types

The following simple feature geometry types are the most common.

  • POINT: zero-dimensional geometry containing a single point.
  • LINESTRING: sequence of points connected by straight, non-self-intersecting line pieces; one-dimensional geometry.
  • POLYGON: geometry with a positive area (two-dimensional); sequences of points form closed, non-self-intersecting rings; the first ring denotes the exterior ring, zero or more subsequent rings denote holes in this exterior ring.
  • MULTIPOINT: set of points; a MULTIPOINT is simple if no two points in the MULTIPOINT are equal.
  • MULTILINESTRING: set of linestrings.
  • MULTIPOLYGON: set of polygons.
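
As a quick standalone sketch (not tied to the New Haven data used below), the \(sf\) package provides a constructor for each geometry type, and \(st\_sf()\) binds geometries to attributes to form an \(sf\) data frame:

library(sf)

#Construct one geometry of each basic type by hand.
pt <- st_point(c(1, 2))                          #POINT
ln <- st_linestring(rbind(c(0, 0), c(1, 1)))     #LINESTRING
pg <- st_polygon(list(rbind(c(0, 0), c(1, 0),
                            c(1, 1), c(0, 0))))  #POLYGON (closed ring)

#Bind the geometries to attributes to create an sf data frame.
toy_sf <- st_sf(id = 1:3, geometry = st_sfc(pt, ln, pg))
toy_sf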

We are going to use data from New Haven CT to work with different kinds of spatial data. First we need to run the commands below to use the \(sf\) and \(GISTools\) packages and the New Haven data. The New Haven data contains several spatial data sets, which we convert to \(sf\) data frames below:

  1. blocks is a data frame with population demographics broken down by block. These attributes are saved within spatial polygons (boundaries of the blocks).

  2. breach is a data frame with location data for public disturbances (crimes). These attributes are saved with spatial points (x,y coordinates of the breaches).

  3. roads contains spatial lines that mark road trajectories.

#Install and call in the sf and GISTools libraries.
library(sf)
library(dplyr)
library(GISTools) #has been deprecated

#Call in the newhaven data
data(newhaven)

#Convert the sp objects from newhaven into sf objects.
blocks_sf <- st_as_sf(blocks)
breach_sf <- st_as_sf(breach)

The output below shows the blocks_sf data frame, which gives an example of how an \(sf\) data frame contains both spatial features and the attributes of those features. The geometry type is listed as POLYGON, which captures the boundaries of the census blocks. The header also shows the bounding box of the minimum and maximum (x, y) coordinates. Each column denotes a spatial attribute or characteristic for each block (row); e.g., the POP1990 column reports the total population in 1990 (pop = 3071 for the block in row 1). The last column, geometry, holds the spatial features of each block, i.e., the geographic boundary of the polygon.

head(blocks_sf)
Simple feature collection with 6 features and 28 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 534687.6 ymin: 177306.3 xmax: 569625.3 ymax: 188464.6
Projected CRS: SOURCECRS
  NEWH075H_ NEWH075H_I HSE_UNITS OCCUPIED VACANT  P_VACANT P_OWNEROCC
0         2         69       763      725     38  4.980341   0.393185
1         3         72       510      480     30  5.882353  20.392157
2         4         64       389      362     27  6.940874  57.840617
3         5         68       429      397     32  7.459207  19.813520
4         6         67       443      385     58 13.092551  80.361174
5         7        133       588      548     40  6.802721  52.551020
  P_RENTROCC NEWH075P_ NEWH075P_I POP1990  P_MALES P_FEMALES   P_WHITE
0  94.626474         2        380    2396 40.02504  59.97496  7.095159
1  73.725490         3        385    3071 39.07522  60.92478 87.105177
2  35.218509         4        394     996 47.38956  52.61044 32.931727
3  72.727273         5        399    1336 42.66467  57.33533 11.452096
4   6.546275         6        404     915 46.22951  53.77049 73.442623
5  40.646259         7        406    1318 50.91047  49.08953 87.784522
    P_BLACK P_AMERI_ES P_ASIAN_PI  P_OTHER  P_UNDER5    P_5_13  P_14_17
0 87.020033   0.584307   0.041736 5.258765 12.813022 24.707846 7.888147
1 10.452621   0.195376   0.521003 1.725822  1.921198  2.474764 0.814067
2 66.265060   0.100402   0.200803 0.502008 10.441767 13.554217 5.722892
3 85.553892   0.523952   0.523952 1.946108 10.853293 17.739521 7.709581
4 24.371585   0.327869   1.420765 0.437158  6.229508  8.633880 2.950820
5  7.435508   0.758725   0.834598 3.186646  8.725341  8.194234 3.641882
    P_18_24   P_25_34   P_35_44  P_45_54  P_55_64   P_65_74   P_75_UP
0 12.479132 16.026711  8.555927 5.759599 4.924875  4.048414  2.796327
1 71.149463  7.359166  4.037773 1.595571 1.758385  3.712146  5.177467
2  8.835341 17.670683 17.871486 8.734940 5.923695  7.931727  3.313253
3 12.425150 18.113772 10.853293 9.056886 6.287425  4.266467  2.694611
4  7.103825 17.267760 16.830601 8.415301 7.431694 14.426230 10.710383
5 10.091047 29.286798 12.898331 7.814871 7.814871  6.904401  4.628225
                        geometry
0 POLYGON ((540989.5 186028.3...
1 POLYGON ((539949.9 187487.6...
2 POLYGON ((537497.6 184616.7...
3 POLYGON ((537497.6 184616.7...
4 POLYGON ((536589.3 184217.5...
5 POLYGON ((568032.4 183170.2...

Many of the dplyr functions described above can be used on an sf data frame. For example, we can use the \(filter()\) function to only include larger populations with POP1990 > 1000. Additionally, we can create a new variable called RentOcc that calculates the number of renter-occupied housing units from two current variables: (a) the number of housing units \(HSE\_UNITS\), and (b) the percent of renter-occupied units \(P\_RENTROCC\).

#Use the filter function to subset to blocks with population >1000
# Use mutate function to create RentOcc = number of renter occ housing units.
blocks_subset <- blocks_sf %>%
  filter(POP1990>1000) %>%
  mutate(RentOcc = HSE_UNITS * P_RENTROCC/100)

Spatial joins combine multiple spatial data sets in the case where both data sets are spatial features. To join an \(sf\) data set, use the \(st\_join(x,y)\) function. The data in data set \(y\) are merged in with the data in \(x\). Values of \(y\) that fall outside the boundaries of \(x\) are discarded; for example, a location in \(y\) outside all boundaries in \(x\) will be removed.

In the code below, we merge together the data from the blocks_sf data set (population attributes of the census blocks) and the breach_sf data set (locations of crimes). The new data set, called block_breach_sf, is a point object, with the attributes of the polygon containing each particular point joined as columns.

Note: If you want to join a spatial data set with a non-sf object, you can use the \(left\_join()\), \(right\_join()\), or \(full\_join()\) functions. The \(left\_join(x,y)\) function keeps all observations in \(x\) and adds the matching columns from \(y\). The \(right\_join(x,y)\) function keeps all observations in \(y\). Observations in the other table that are unmatched are discarded.

block_breach_sf <- breach_sf %>%
  st_join(blocks_sf)
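
For example, a minimal sketch of an attribute (non-spatial) join, where \(block\_lookup\) is a hypothetical plain data frame keyed on the NEWH075H_ column of \(blocks\_sf\):

#Hypothetical (made-up) attribute table keyed on NEWH075H_.
block_lookup <- data.frame(NEWH075H_ = c(2, 3, 4),
                           region = c("A", "A", "B"))

#left_join keeps every block; blocks without a match get NA for region.
blocks_joined <- blocks_sf %>%
  left_join(block_lookup, by = "NEWH075H_")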

3.3 An Introduction to TMAP

This section will concentrate on learning the functions in the tmap package (Tennekes, 2015). This package focuses on mapping the spatial distribution of data attributes. It has a similar grammar to plotting with the ggplot2 package in that it handles each element of the map separately in a series of layers.

The process of making maps using tmap is one in which a series of layers are added to the map. First \(tm_shape()\) is specified, followed by an aesthetic function that specifies what is to be plotted. The input to \(tm_shape()\) is an sf object.

3.3.1 Layer 1

For example, we may want to plot the New Haven crime data. The first layer of the plot is the geographic polygons (census block boundaries) of New Haven. We input the \(blocks\_sf\) object in the \(tm\_shape()\) function. We plot the borders spatial layer of the data which outlines census block boundaries. To add custom styles to the background, type \(?tm\_style\) and \(?tm\_layout\) in the console pane to see possible options in the documentation.

?tm_style
?tm_layout

tm_shape(blocks_sf)+
  tm_borders(col = "red", lty = "dashed") +
  tm_style("natural", bg.color = "lightblue") +
  tm_layout(title = "New Haven CT",
            title.size = 1,
            title.position = c(0.55, "top"))

So what you can see in the above code is a single set of layers for the borders, where we customize the color, line type, and background style.

3.3.2 Layer 2

A plot of geographic boundaries alone does not give much information. We add a second layer where we can include the spatial attributes of the areas (polygons). To add the second layer to the current plot, add another \(tm\_shape()\) function, and add the second \(sf\) object input that contains the spatial area data. For example, here we will plot the percent of renter occupied housing units.

tm_shape(blocks_sf)+
  tm_borders(col = "red", lty = "dashed") +
  tm_shape(blocks_sf) +
  tm_polygons(col = "P_RENTROCC", palette = "Blues") +
  tm_layout(title = "New Haven CT",
            title.size = 1,
            title.position = c(0.55, "top"))

3.3.3 Layer 3

The third layer of the New Haven plot may include the point layer which plots the geographic coordinates of the breaches of the peace. Extending upon layers 1 and 2, for layer 3, we use the \(breach\_sf\) data as our data layer and add point data using the \(tm\_dots()\) command.

?tm_style
?tm_layout

tm_shape(blocks_sf)+
  tm_borders(col = "red", lty = "dashed") +
  tm_polygons(col = "P_RENTROCC", palette = "Blues") +
    tm_shape(breach_sf) +
      tm_dots(col = "blue", size = 0.1) +
  tm_layout(title = "New Haven CT",
            title.size = 1,
            title.position = c(0.55, "top"))

3.3.4 Layer 4

The fourth layer of the New Haven plot may include the lines layer which plots the roads in New Haven. Again, add another \(tm\_shape()\) function, and add the third \(sf\) object input that contains the lines data (roads_sf). We add line data using the \(tm\_lines()\) command.

?tm_style
?tm_layout
roads_sf <- st_as_sf(roads)
st_crs(roads_sf) <- st_crs(blocks_sf)

tm_shape(blocks_sf)+
  tm_borders(col = "red", lty = "dashed") +
  tm_polygons(col = "P_RENTROCC", palette = "Blues", alpha = 0.5) +
    tm_shape(breach_sf) +
      tm_dots(col = "blue", size = 0.1) +
        tm_shape(roads_sf) +
          tm_lines(col = "purple", alpha = 0.5) +
  tm_layout(title = "New Haven CT",
            title.size = 1,
            title.position = c(0.55, "top"))

3.4 Projections

The ellipsoidal Earth can be projected onto a flat surface, i.e., a map. A map is a flattened, filtered, and scaled representation of the world, balancing reality with generality. Map coordinates of a point are computed from its ellipsoidal latitude and longitude by a standard formula known as a map projection. It is impossible to flatten a round object without distortion, and this results in trade-offs between direction, shape, and distance. An ellipsoid is a simple model describing the basic shape of the Earth, and all mapping coordinate reference systems (CRS) are based on this shape. A particular CRS can be referenced by its EPSG code. Useful information on projections can be found at https://www.nceas.ucsb.edu/sites/default/files/2020-04/OverviewCoordinateReferenceSystems.pdf.
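
You can inspect the full definition behind any EPSG code with the \(st\_crs()\) function from \(sf\); for example:

library(sf)
#Print the well-known text definition of WGS84 (EPSG 4326).
st_crs(4326)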

There are two options:

  • Unprojected/Geographic: Locations on Earth’s surface are referenced using latitude and longitude. The latitude and longitude coordinates of a location will differ depending on the CRS. Two geographic CRSs are most common in the U.S.: WGS84 (EPSG: 4326), used by Google Earth and the US Dept of Defense and common for global reference systems; and NAD83 (EPSG: 4269), used commonly by US federal agencies.
  • Projected: The ellipsoidal Earth projected onto a flat surface using a standard formula derived from latitude and longitude. A common projection, UTM, is described below.

The UTM projection (applied with either WGS84 or NAD83) is commonly used in research because it tends to be more locally accurate and has attributes that make estimating distance easy and accurate. Positions are described using Easting and Northing coordinates. Easting: the x-coordinate, an eastward-measured distance. Northing: the y-coordinate, a northward-measured distance. It is important to note the UTM projection results in coordinates measured in meters. When layering multiple spatial features, all spatial data must be set to the same CRS.
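
As a sketch of why a metric projection matters for measurement (New Haven, CT falls in UTM zone 18N, EPSG 32618; \(my\_sf\) here is a placeholder for any sf layer with a well-defined CRS):

#my_sf is a placeholder for an sf layer with a well-defined CRS.
my_sf_utm <- st_transform(my_sf, crs = 32618) #UTM zone 18N

#Coordinates are now in meters, so distances come back in meters.
centroids <- st_centroid(st_geometry(my_sf_utm))
st_distance(centroids[1], centroids[2])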

When we look at the \(blocks\_sf\) and \(breach\_sf\) spatial data, we see they are set to the projected CRS “SOURCECRS”. This is fine if we only use these layers, because their CRSs match. But if we want to add layers from outside sources, we need to make the CRS the same across sources. For example, the \(places\) data set does not have a CRS assigned, i.e., CRS: NA. In this case, we can use the \(st\_crs()\) function to set its CRS to be the same as that of \(blocks\_sf\).

places_sf <- st_as_sf(places)
places_sf$geometry #Uh Oh no CRS is defined
Geometry set for 9 features 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 539618.6 ymin: 149437.8 xmax: 561995.3 ymax: 181792.2
CRS:           NA
First 5 geometries:
#Set CRS equal to the CRS for blocks_sf.
st_crs(places_sf) <- st_crs(blocks_sf)
places_sf$geometry #Check again.
Geometry set for 9 features 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 539618.6 ymin: 149437.8 xmax: 561995.3 ymax: 181792.2
Projected CRS: SOURCECRS
First 5 geometries:

In contrast, if you have another data set that uses a different projection (e.g., Google Maps and OpenStreetMap may use a different projection), then you can use the \(st\_transform()\) function to transform the CRS from one type to another. Below we transform both \(blocks\_sf\) and \(breach\_sf\) to the WGS84 system described above.

#To use OpenStreetMap, first transform the projections to EPSG 4326,
#equivalent to the proj4 string "+proj=longlat +ellps=WGS84".
blocks_transform <- st_transform(blocks_sf, crs = 4326)
breach_transform <- st_transform(breach_sf, crs = 4326)

3.4.1 Adding Context

In some situations a map with background context may be more informative. There are a number of options for doing this, including OpenStreetMap, Google Maps, and Leaflet. This requires some additional packages to be downloaded and installed in R.

install.packages("OpenStreetMap", dep = T)

Using OpenStreetMap, the approach is to define the area of interest, download and plot the map tiles from OpenStreetMap, and then plot your data over the tiles. First we define the bounding box that consists of latitude and longitude minimums and maximums. Using the bounding box, we extract a raster data layer using the \(read\_osm()\) function from the tmaptools package. This raster layer is mapped using tmap, and we include a second layer of breach locations.

# library(OpenStreetMap)
library(tmaptools)

#Set to plot mode
tmap_mode("plot")

#Set the geographic bounding box
box <- st_bbox(blocks_transform)
box
     xmin      ymin      xmax      ymax 
-72.99850  41.23903 -72.86060  41.35040 
#return a spatial raster data based on bounding box to include landscape map.
osm_data <- read_osm(box, type = "osm")

#Plot the raster data as first layer and breach locations as second layer.
om_map <- tm_shape(osm_data) +
  tm_rgb() +
  tm_shape(breach_transform) +
  tm_dots(col = "red", size = 0.1)

print(om_map)

3.4.2 Saving Your Map

Having created a map in a window, you may now want to save the map for either printing or incorporating into a document. There are a few simple steps to save a map in multiple formats.

  1. If you plot the map in the Viewer pane (bottom right), you can click on the Export button to save as a pdf or copy to clipboard.

  2. To directly save a map as a pdf, png, or jpeg: above, the OpenStreetMap object was assigned to “om_map”. Taking a pdf export as an example, print the map after calling the \(pdf()\) function, which takes the filepath and the pdf file name and extension.

#Save as a pdf
pdf("filepath/mymap.pdf")
print(om_map)
dev.off()

#Save as a jpeg
jpeg("filepath/mymap.jpg")
print(om_map)
dev.off()
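
Alternatively, tmap provides its own \(tmap\_save()\) function, which infers the output format from the file extension:

#Save the tmap object directly; the extension determines the format.
tmap_save(om_map, "filepath/mymap.png")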

3.4.3 Interactive Mapping

Interactive maps are a great way to include multiple pieces of information within one map. The \(mapview\) package allows you to create interactive maps of your spatial data over a zoomable basemap.

# #Change tmap to give interactive view
# library(mapview)
#  
# #Plot an interactive map using mapview
# mapview(breach_transform)
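
Alternatively, tmap itself supports interactive maps. Switching to view mode renders subsequent tmap calls on a zoomable basemap; a short sketch:

#Switch tmap to interactive viewing mode.
tmap_mode("view")

tm_shape(breach_transform) +
  tm_dots(col = "red")

#Return to static plotting when finished.
tmap_mode("plot")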

3.5 Introduction to TIDYCENSUS

The tidycensus package by Kyle Walker is an R package designed to facilitate the process of acquiring and working with US Census Bureau (USCB) population data in the R environment. It is designed to streamline the data wrangling process for spatial Census data analysis. With tidycensus R users can acquire census reported variables of interest AND geometric and spatial attributes which helps facilitate mapping and spatial analysis. US Census Bureau data consists of:

  1. US decennial census data from 2010 and 2020 requested using \(get\_decennial()\).
  2. 1-year and 5-year American Community Survey (ACS) population and demographic data requested using \(get\_acs()\).
  3. Population Estimates Program (PEP) data requested using \(get\_estimates()\).
  4. ACS Public Use Microdata requested using \(get\_pums()\).
  5. ACS Migration Flows requested using \(get\_flows()\).

Note this is not a comprehensive tutorial on tidycensus. It will give you an introduction to the capabilities of the package. For more information on USCB data products visit https://www.census.gov/data.html.
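
Note also that the USCB requires a free API key for data requests (available at https://api.census.gov/data/key_signup.html). A one-time setup sketch:

library(tidycensus)
#Store your key in .Renviron so future sessions can find it.
census_api_key("YOUR_KEY_HERE", install = TRUE)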

3.5.1 Geography in Tidycensus

The USCB reports data at geographies ranging from the national level down to the county and census block levels. Importantly, not all data sources report data at all geographies. The decennial census and 5-year ACS estimates report down to the census block and block group levels. However, PEP only reports down to the county level and metropolitan/micropolitan areas.

3.5.2 Searching for variables in Tidycensus

One challenge is extracting variable names from the USCB data sources. The tidycensus package contains a \(load\_variables()\) function that lists variable names and their descriptions. By browsing or subsetting the table, you can search for the relevant variables needed for your analysis.

# install.packages("tidycensus")
library(tidycensus)
#Read in ACS variables
vars <- load_variables(2020, "acs5")
head(vars)
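
Since the variable table has thousands of rows, it is often easier to subset it with dplyr. For example, a sketch searching the concept field for median household income:

#Search the concept descriptions for median household income variables.
income_vars <- vars %>%
  dplyr::filter(grepl("MEDIAN HOUSEHOLD INCOME", concept))
head(income_vars)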

3.5.3 Extracting the data in Tidycensus

Let’s work through examples of how to extract ACS data. The parameters passed through the \(get\_acs()\) function require the:

  1. geography, which is the spatial resolution you want reported. Here we want county-level data.
  2. state, an optional parameter if you want a specific state. The default is NULL if you want the whole US.
  3. variables, which lists the needed variables from the table above. Here we chose the variable for median household income.
  4. year, the survey year.
  5. geometry, an important parameter. When set to \(T\), the data frame will include the spatial boundaries of the counties and will produce an \(sf\) object.

#Extract ACS median household income at county level for GA. 2020
ga_hinc <- get_acs(
  geography = "county",
  state = "Georgia",
  variables= c("B19013_001"),
  year = 2020,
  geometry = T)

head(ga_hinc)

The resulting data frame has a row for each county. Each row lists the county name, the estimate of the variable, and the associated margin of error (i.e., the sampling error reported by ACS). Additionally, we have a geographic identifier (GEOID) and the geometric features column.
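
Because we set geometry = T, the result is an \(sf\) object, so it can be mapped directly with the tmap layering introduced above; a sketch:

#Map the ACS median household income estimates by county.
tm_shape(ga_hinc) +
  tm_polygons(col = "estimate", palette = "Blues") +
  tm_layout(title = "Median household income, GA counties")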

Next let’s work through examples of how to extract PEP data. The parameters passed through the \(get\_estimates()\) function require the:

  1. geography, which is the spatial resolution you want reported. Here we want county-level data.
  2. state, an optional parameter if you want a specific state. The default is NULL if you want the whole US.
  3. product, which lists three options: (1) population totals (“population”), (2) populations by characteristics (“characteristics”), and (3) housing units (“housing”). Here we request populations by sex and race breakdowns.
  4. breakdown, which lists RACE, SEX, AGE (for years 2020+), and HISPANIC.
  5. year, the estimate year.
  6. geometry, which again when set to \(T\) includes the spatial boundaries of the counties and produces an \(sf\) object.

#Extract PEP populations by Race-Sex at county level for GA. 2019
ga_sexrace <- get_estimates(
  geography = "county",
  state = "Georgia",
  product= c("characteristics"),
  breakdown = c("SEX","RACE"),
  breakdown_labels = T,
  year = 2019,
  geometry = T)

head(ga_sexrace)

The resulting data frame has a row for each county and demographic group. Each row lists the county name and the value of the variable. For example, the 2019 population estimate for the White population in Fulton County is 483838 across both sexes.
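
To verify a figure like this, we can filter the returned data frame with dplyr (a sketch; the exact SEX and RACE label strings come from breakdown_labels = T and may vary slightly by year):

#Subset to the White population of Fulton County, both sexes combined.
ga_sexrace %>%
  dplyr::filter(grepl("Fulton", NAME),
                SEX == "Both sexes",
                grepl("White", RACE))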

3.6 Introduction to Shapefile and Tigris

Shapefiles are a popular method of representing geospatial vector data in GIS. A shapefile can store data in the form of points, lines, or polygons. They can be used to represent administrative boundaries (think country/state/county borders) or roads, rivers, lakes, etc. These shapes, together with data attributes that are linked to each shape, create the representation of the geographic data. Three mandatory files use the extensions .shp, .shx, and .dbf; the actual shapefile relates to the .shp file.

Shapefiles can be found on the internet from various publishers. For example, the US Census Bureau publishes shapefiles for the entire US. The USCB’s TIGER/Line database includes a high-quality series of geographic datasets suitable for spatial and cartographic analyses. The TIGER/Line shapefiles include three general types of data:

  1. Legal entities: legal geographic entities, including county and state boundaries.
  2. Statistical entities: non-legal boundaries with statistical meaning, i.e., census tracts and block groups.
  3. Geographic features: geographic datasets linked with aggregate demographic information, including water and road features.

From the USCB website, we downloaded the shapefile that contains all roads for the state of GA. Using the \(sf\) package, we read in the shapefile using the \(st\_read\) function. This gives an \(sf\) data frame that contains the geometric features from the shapefile. Using the \(tmap\) package, we can plot the \(sf\) object.

#Read in a shapefile
shp <- st_read("data/shapefiles/tl_2019_13_prisecroads.shp")
class(shp)
head(shp)

#Plot boundaries contained in shapefile.
tm_shape(shp) +
  tm_lines(col = "black")

The tigris package simplifies the process of obtaining and using Census geographic datasets in R. The full list of shapefiles available through tigris is available here.

For example, the \(states()\) function can be used to download the boundaries of U.S. states. This function produces an \(sf\) data frame with geometric features.

# install.packages("tigris", dep = T)
library(tigris)

#Call in state boundaries for US
st <- states()
head(st)

#Plot the boundaries contained in the shapefile
tm_shape(st) +
  tm_borders(col = "black")

Additionally, we can extract shapefiles corresponding to specific counties within a state.

#Extract county borders for GA from the tigris package
ga_counties <- counties("GA")

#Plot the county borders
tm_shape(ga_counties) +
  tm_borders(col = "black")

Let’s add to this map by adding roads and water areas. We get the roads for GA using the \(primary\_secondary\_roads()\) function that requires the state parameter to be set. To obtain the water areas, we must obtain water geometries for each county within the state of GA. We use the \(area\_water()\) function that has state and county parameters.

#Extract the FIPS identifier for each county in GA
fips <- unique(ga_counties$COUNTYFP)

#Extract the primary and secondary roads for GA
road_data <- primary_secondary_roads("GA")

#To create areas of water, we must extract for each FIPS ID separately.
#Create a blank list
water_list <- list()

#For each county c, extract the water area using area_water()
#This takes a while!!!
for(c in 1:length(fips)){
  water_list[[c]] <- area_water("GA", fips[c])
}

#Use the do.call and rbind functions to compile the list into a data frame
water_data <- do.call("rbind", water_list)


#Map the county borders, areas of water, and roads.
tm_shape(ga_counties) +
  tm_borders(col = "black") +
  tm_shape(water_data) +
  tm_borders(col = "blue") +
  tm_shape(road_data) +
  tm_lines(col = "orange")

3.7 Introduction to Tidygeocoder

In many cases, data consist of physical addresses. In order to analyze physical address data, it is necessary to convert these locations to latitude and longitude coordinate pairs. The tidygeocoder package includes functions that take physical addresses as inputs and convert these to geographic locations. In addition to the usage examples below, see the Getting Started vignette (https://jessecambon.github.io/tidygeocoder/articles/tidygeocoder.html) and blog posts on tidygeocoder.

For example, below we create a data frame (tibble) of addresses, which we name \(some\_addresses\). Using the \(geocode()\) function in the tidygeocoder package, we set the following parameters:

  • address: a single column of street addresses. Alternatively, you can specify street, city, county, state, and postal code separately.
  • method: determines the map API used for finding locations. Options include OpenStreetMap/Nominatim (osm), US Census (census), ArcGIS (arcgis), and Google (google).
library(tidygeocoder)
#To explore tidygeocoder, execute command below
vignette("tidygeocoder")

# create a dataframe with addresses
some_addresses <- tibble::tribble(
~name,                  ~addr,
"White House",          "1600 Pennsylvania Ave NW, Washington, DC",
"Transamerica Pyramid", "600 Montgomery St, San Francisco, CA 94111",
"Willis Tower",         "233 S Wacker Dr, Chicago, IL 60606"
)

# geocode the addresses
lat_longs <- some_addresses %>%
  geocode(address = addr,
          method = 'osm')
head(lat_longs)
# A tibble: 3 × 4
  name                 addr                                         lat   long
  <chr>                <chr>                                      <dbl>  <dbl>
1 White House          1600 Pennsylvania Ave NW, Washington, DC    38.9  -77.0
2 Transamerica Pyramid 600 Montgomery St, San Francisco, CA 94111  37.8 -122. 
3 Willis Tower         233 S Wacker Dr, Chicago, IL 60606          41.9  -87.6

Below is an example of using the \(geocode()\) function to extract US census data including coordinates and census identifiers such as census block number and FIPS code. Note how we have changed \(method = "census"\) and we have set the \(api\_options\) to list census geographies.

census_full1 <- some_addresses %>% geocode(
  address = addr,
  method = "census",
  full_results = TRUE,
  api_options = list(census_return_type = 'geographies'))

census_full1
# A tibble: 3 × 15
  name         addr    lat   long    id input_address match_indicator match_type
  <chr>        <chr> <dbl>  <dbl> <int> <chr>         <chr>           <chr>     
1 White House  1600…  38.9  -77.0     1 "1600 Pennsy… Match           Exact     
2 Transameric… 600 …  37.8 -122.      2 "600 Montgom… Match           Exact     
3 Willis Tower 233 …  41.9  -87.6     3 "233 S Wacke… Match           Exact     
# ℹ 7 more variables: matched_address <chr>, tiger_line_id <chr>,
#   tiger_side <chr>, state_fips <chr>, county_fips <chr>, census_tract <chr>,
#   census_block <chr>

Alternatively, we can use the \(reverse\_geocode()\) function to extract physical addresses from geographical pairs of latitude and longitude. We create a data frame with \(lat\) and \(long\) values. We input these coordinates into the \(reverse\_geocode()\) function and set the method to \(osm\).

lat_longs1 <- tibble(
  latitude = c(38.895865, 43.6534817),
  longitude = c(-77.0307713, -79.3839347)
)

rev1 <- lat_longs1 %>%
  reverse_geocode(lat = latitude, long = longitude, address = addr, method = "osm")

rev1
# A tibble: 2 × 3
  latitude longitude addr                                                       
     <dbl>     <dbl> <chr>                                                      
1     38.9     -77.0 Freedom Plaza, 1455, Pennsylvania Avenue Northwest, Ward 2…
2     43.7     -79.4 Toronto City Hall, 100, Queen Street West, Financial Distr…

Lab 2 Activity

  1. For this lab, you will be reading in your project data which will contain different data tables dependent on your particular project. Read in your data and explore the data using the \(View()\) and \(head()\) functions.

In-person Each group will develop code to extract and read in the different project data in class.

Asynchronous Below we give code to read in the cholera case data from 1854 London and the Soho Neighborhood polygon boundaries. Run the code below but change the data directory \(data.dir\) to your local folder which houses the London data.

### CHANGE THIS FILEPATH TO YOUR OWN.
data.dir <- "/Users/emilypeterson/Library/CloudStorage/OneDrive-EmoryUniversity/INFO 532/Projects2023/Cholera/"

# You should have all these files housed in one folder.
file_names <- c("Cholera_Deaths_period_0_MOR.shp",
                "Cholera_Deaths_period_1_JC.shp",
                "Cholera_Deaths_period_2_ECB.shp",
                "Cholera_Deaths_period_3_LW.shp",
                "Cholera_Deaths_period_4.shp")



cholera_data_list <- list()


for(i in 1:length(file_names)){
  cholera_data_list[[i]] <- read_sf(paste0(data.dir, file_names[i]))
}

#Create point data for cases of cholera
cholera_data <- do.call("rbind", cholera_data_list)

#Read in shape file of London-Soho neighborhood
soho_boundaries <- read_sf(paste0(data.dir, "Soho__345782_Neighborhoods.shp"))

#Read in shape file of Soho streets
soho_streets <- read_sf(paste0(data.dir, "Soho_Streets.shp"))

#Read in Soho census file in csv format.

soho_census <- read.csv(paste0(data.dir, "Soho_Neighborhood_Census_Data.csv"))

#Merge soho census and soho boundary data
soho_areas <- left_join(soho_boundaries, soho_census)

head(cholera_data)
head(soho_areas)
head(soho_streets)
  2. Write a paragraph with a section header “Data” that describes the above to an audience not familiar with your project. Describe the data reported in each data table. This will include the following:
    1. The “relevant” variables contained in the data table.
    2. The geographical units captured in the data. Include projection information.
    3. The type of data (points, lines, polygons) and what they represent.

Hint: Write this as a data paragraph in a report for a reader not familiar with this project. A brief summary of the data you will be analyzing.

  3. Use the \(filter()\) function to subset your data based on a given rule. For example, using the cholera data, we can subset cases to only include those during period 2.

In-person Pick a subset of the data and filter based on a particular variable of interest.

Asynchronous Run the code below but change the filter rule to only include cases during period 4.

#Filter to only period 2 cases
cholera_data_subset <- cholera_data %>%
  filter(Period == 2)
  4. Use the \(mutate()\) function to create a new variable based on a given rule. For example, using the cholera data, we can create a new logical variable called period2_indicator. This variable equals 1 if in period 2 and 0 otherwise. Show the new variable using the \(head()\) function. Then use the \(group\_by()\) and \(summarize()\) functions to show a summary of your data based on a group attribute. For example, below we summarize the number of cases per period.

In-person Create a new variable based on the variables in your current data, then summarize the number of cases/points based on a grouping variable.

Asynchronous Run the code below but change the mutate rule to create a new variable that is 1 if in period 4, and 0 otherwise. Summarize the number of cases grouped by your new variable.

#Create new variable period2_indicator =1 if in period 2 otherwise 0.
cholera_data2 <- cholera_data %>%
  mutate(period2_indicator = ifelse(Period == 2, 1 ,0))

#Show the new variable
head(cholera_data2)

#Summarize number of cases per period.
summary_output <- cholera_data %>%
  group_by(Period) %>%
  summarize(count = n())

#Show summary
summary_output
  5. Use the \(st\_join()\) function to create a new data frame that contains the point data and the area data corresponding to those points. Hint: pseudo code has been given to you below to merge the two data sets.
spatial_join_data <- st_join("put your points data set here", "put your polygon data set here")
head(spatial_join_data)
  6. Use layering in \(tmap\) to produce a preliminary plot of both a choropleth (polygon) map of area attributes and points representing location data. Below is an example of pseudo code to help you.
tm_shape("ADD NAME FOR POLYGON DATASET") +
  tm_polygons(col = "ADD VARIABLE NAME THAT CAPTURES THE AREA ATTRIBUTES, (ex. POPULATION)") +
  tm_shape("ADD NAME OF DATASET WITH POINTS DATA") +
    tm_dots("ADD COLOR FOR DOTS") +
      tm_shape("ADD LINES DATA SET NAME") +
        tm_lines("ADD LINES COLOR")
  7. Use tmaptools to create a plot of your point data against a raster landscape background, using the cholera point case data. We use the \(st\_bbox()\) function to get the bounding box. Now use the \(read\_osm(box, type = "osm")\) code to get the landscape data. Plot both the raster landscape and the points together.
tmap_mode("plot")
#Get the bounding box
box <- st_bbox(cholera_data)

raster_map <- tmaptools::read_osm(box, type = "osm")

#Add TMAP code here.
  8. Write a descriptive paragraph called Exploratory Analysis that describes the maps and the various patterns in the observed data. Based on these maps, add descriptions of the spatial relationships you observe in the maps from items 6 and 7.

  9. Compile the R Markdown file with your code and text descriptions. Convert it to a pdf and submit to Canvas. Remember to save it with your name and lab number so we can easily identify you.