7 Spatial Analysis
Spatial analysis provides an expansive analytical landscape, with theoretical underpinnings dating to at least the 1950s (Berry and Marble (1968)), and has contributed to the development of geographic information systems as well as spatial tools in more statistically oriented programming environments like R. We won’t attempt thorough coverage of these methods; for more focused treatment of R-based methods, we refer the reader to sources such as Lovelace, Nowosad, and Muenchow (2019) and Hijmans (n.d.).
In this chapter, we’ll continue working with spatial data and explore spatial analysis methods. We’ll look at a selection of useful geospatial abstraction and analysis methods, also called geoprocessing tools in a GIS. The R spatial world has grown in recent years to include an increasingly good array of tools. This chapter will focus on vector GIS methods. By “vector” we don’t mean the vector data objects of R nomenclature (Python’s pandas package, which uses a similar data structure, calls these “series” instead of vectors), but rather feature geometries and their spatial relationships based on the coordinate reference system.
7.1 Data frame operations
But before we get into those specifically spatial operations, it’s important to remember that feature data are also data frames, so we can use the methods we already know with them. For instance, we can look at properties of variables and then filter for features that meet a criterion, like all climate station points above 2000 m elevation, or all above 38°N latitude. To be able to work with latitude as a variable, we’ll need to use remove=FALSE (the default is to remove the coordinate variables) to retain them when we use st_as_sf. In Appendix A7.1, you can see more of the exploratory work of how we chose our values.
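As a minimal sketch of that remove=FALSE behavior (with made-up stations and coordinates; ptsSF is just an illustrative name):

```r
library(sf); library(tibble)
pts <- tibble(STATION = c("A", "B"),
              LONGITUDE = c(-120.1, -119.5), LATITUDE = c(38.2, 37.9))
ptsSF <- st_as_sf(pts, coords = c("LONGITUDE", "LATITUDE"),
                  crs = 4326, remove = FALSE)
names(ptsSF)  # LONGITUDE and LATITUDE survive as attribute fields
# with the default remove = TRUE, they would exist only in the geometry
```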
Adding a basemap with maptiles: We’d like to have a basemap, so we’ll create one with the maptiles package (install it if you don’t have it). Warning: the get_tiles function goes online to get the basemap data, so if you don’t have a good internet connection or the site goes down, this may fail. We can then display the basemap with tm_rgb in the code below. For temperature, we’ll get a reasonable color scheme by reversing the RColorBrewer RdBu palette with rev (which took me a long time to figure out; color schemes are much more challenging than you might think because there are many highly varied uses of color).
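For instance, here’s a quick sketch of what rev does to the vector of hex colors that brewer.pal returns:

```r
library(RColorBrewer)
brewer.pal(8, "RdBu")       # runs from red (mapped to low values) to blue
rev(brewer.pal(8, "RdBu"))  # reversed, so blue represents the cold (low) end
```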
library(tmap); library(RColorBrewer); library(sf); library(tidyverse); library(maptiles); library(igisci)
tmap_mode("plot")
newname <- unique(str_sub(sierraFeb$STATION_NAME, 1,
                          str_locate(sierraFeb$STATION_NAME, ",") - 1))
sierraFeb2000 <- st_as_sf(bind_cols(sierraFeb, STATION = newname) %>%
                            dplyr::select(-STATION_NAME) %>%
                            filter(ELEVATION >= 2000, !is.na(TEMPERATURE)),
                          coords = c("LONGITUDE", "LATITUDE"), crs = 4326)
sierraBase <- get_tiles(sierraFeb2000)
tm_shape(sierraBase) + tm_rgb() +
  tm_shape(sierraFeb2000) +
    tm_symbols(col = "TEMPERATURE", midpoint = NA,
               palette = rev(brewer.pal(8, "RdBu"))) +
    tm_text(text = "STATION", size = 0.5, auto.placement = T,
            xmod = 0.5, ymod = 0.5) +
  tm_graticules(lines = F)
Now let’s include LATITUDE. Let’s see what we get by filtering for both ELEVATION >= 2000 and LATITUDE >= 38:
sierraFeb %>%
  filter(ELEVATION >= 2000 & LATITUDE >= 38)
## # A tibble: 1 x 7
##   STATION_NAME     COUNTY ELEVATION LATITUDE LONGITUDE PRECIPITATION TEMPERATURE
##   <chr>            <fct>      <dbl>    <dbl>     <dbl>         <dbl>       <dbl>
## 1 BODIE CALIFORNI~ Mono       2551.     38.2     -119.          39.6        -4.4
The only one left is Bodie. Maybe that’s why Bodie, a ghost town now, has such a reputation for being so friggin’ cold (at least for California). I’ve been snowed on there in summer.
7.1.1 Using grouped summaries, and filtering by a selection
We’ve been using February Sierra climate data for demonstrating various things, but this wasn’t the original form of the data downloaded, so let’s look at the original data and use some dplyr methods to restructure it, in this case to derive annual summaries. We’ll use the monthly normals to derive annual values.
We looked at the very useful
summarize() process earlier. We’ll use this and a selection process to create an annual Sierra climate dataframe we can map and analyze. The California monthly normals were downloaded from the National Centers for Environmental Information at NOAA, where you can select desired parameters, monthly normals as frequency, and limit to one state.
First we’ll have a quick look at the data to see how it’s structured. These were downloaded as monthly normals for the State of California, as of 2010, so each station has 12 monthly records with DATE coded 201001 through 201012, with obviously the same LATITUDE and LONGITUDE repeated, but varying monthly values for climate variables like `MLY-PRCP-NORMAL`:
## # A tibble: 15 x 11
##    STATION   STATION_NAME   ELEVATION LATITUDE LONGITUDE   DATE `MLY-PRCP-NORMA~
##    <chr>     <chr>          <chr>     <chr>    <chr>      <dbl>            <dbl>
##  1 GHCND:US~ TWENTYNINE PA~ 602       34.12806 -116.036~ 201001            13.2
##  2 GHCND:US~ TWENTYNINE PA~ 602       34.12806 -116.036~ 201002            15.0
##  3 GHCND:US~ TWENTYNINE PA~ 602       34.12806 -116.036~ 201003            11.4
##  4 GHCND:US~ TWENTYNINE PA~ 602       34.12806 -116.036~ 201004             3.3
##  5 GHCND:US~ TWENTYNINE PA~ 602       34.12806 -116.036~ 201005             2.29
##  6 GHCND:US~ TWENTYNINE PA~ 602       34.12806 -116.036~ 201006             0.25
##  7 GHCND:US~ TWENTYNINE PA~ 602       34.12806 -116.036~ 201007            12.2
##  8 GHCND:US~ TWENTYNINE PA~ 602       34.12806 -116.036~ 201008            20.6
##  9 GHCND:US~ TWENTYNINE PA~ 602       34.12806 -116.036~ 201009            10.2
## 10 GHCND:US~ TWENTYNINE PA~ 602       34.12806 -116.036~ 201010             5.08
## 11 GHCND:US~ TWENTYNINE PA~ 602       34.12806 -116.036~ 201011             5.08
## 12 GHCND:US~ TWENTYNINE PA~ 602       34.12806 -116.036~ 201012            14.7
## 13 GHCND:US~ BUTTONWILLOW ~ 82        35.4047  -119.4731 201001            32.5
## 14 GHCND:US~ BUTTONWILLOW ~ 82        35.4047  -119.4731 201002            30.0
## 15 GHCND:US~ BUTTONWILLOW ~ 82        35.4047  -119.4731 201003            30.7
## # ... with 4 more variables: MLY-SNOW-NORMAL <dbl>, MLY-TAVG-NORMAL <dbl>,
## #   MLY-TMAX-NORMAL <dbl>, MLY-TMIN-NORMAL <dbl>
To get just the Sierra data, it seemed easiest to just provide a list of relevant county names to filter the counties, do a bit of field renaming, then read in a previously created selection of Sierra weather/climate stations.
sierraCounties <- st_make_valid(CA_counties) %>%
  filter(NAME %in% c("Alpine","Amador","Butte","Calaveras","El Dorado",
                     "Fresno","Inyo","Kern","Lassen","Madera","Mariposa",
                     "Mono","Nevada","Placer","Plumas","Sacramento","Shasta",
                     "Sierra","Tehama","Tulare","Tuolumne","Yuba"))
normals <- read_csv(ex("sierra/908277.csv")) %>%
  mutate(STATION = str_sub(STATION, 7, str_length(STATION)))
sierraStations <- read_csv(ex("sierra/sierraStations.csv"))
To get annual values, we’ll want to use the stations as groups in a
summarize process. For values that stay the same for a station (
ELEVATION), we’ll just use
first() to get just one of the 12 identical monthly values. For values that vary monthly, we’ll
sum() the monthly precipitations and get the
mean() of monthly temperatures to get appropriate annual values. We’ll also use a
right_join to keep only the stations that are in the Sierra. Have a look at this script to make sure you understand what it’s doing; it’s got several elements you’ve been introduced to before, and that you should understand. At the end, we’ll use
st_as_sf to make sf data out of the data frame, with remove=F to retain the LONGITUDE and LATITUDE variables in case we want to use them as separate variables.
sierraAnnual <- right_join(sierraStations, normals, by = "STATION") %>%
  filter(!is.na(STATION_NA)) %>%
  dplyr::select(-STATION_NA) %>%
  group_by(STATION_NAME) %>%
  summarize(LONGITUDE = first(LONGITUDE),
            LATITUDE = first(LATITUDE),
            ELEVATION = first(ELEVATION),
            PRECIPITATION = sum(`MLY-PRCP-NORMAL`),
            TEMPERATURE = mean(`MLY-TAVG-NORMAL`)) %>%
  mutate(STATION_NAME = str_sub(STATION_NAME, 1, str_length(STATION_NAME) - 6)) %>%
  filter(PRECIPITATION > 0) %>%
  filter(TEMPERATURE > -100) %>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, remove = F)
ggplot() +
  geom_sf(data = CA_counties, aes(), fill = "gray") +
  geom_sf(data = sierraCounties, aes(), fill = "white") +
  geom_sf(data = sierraAnnual, aes(col = PRECIPITATION)) +
  scale_color_gradient(low = "orange", high = "blue")
7.2 Spatial analysis operations
Again, there is a lot more to spatial analysis than we have time to cover. But we’ll explore some particularly useful spatial analysis operations, especially those that contribute to statistical analysis methods we’ll be looking at soon. We’ll start by continuing to look at subsetting or filtering methods, but ones that use spatial relationships to identify what we want to retain or remove from consideration.
7.2.1 Using topology to subset
Using spatial relationships can be useful in filtering our data, and there are quite a few topological relationships that can be explored. See Lovelace, Nowosad, and Muenchow (2019) for a lot more about topological operations. We’ll look at a relatively simple one that identifies whether a feature is within another one, and apply this method to filter for features within a selection of 5 counties, which we’ll start by identifying by name.
nSierraCo <- CA_counties %>%
  filter(NAME %in% c("Plumas","Butte","Sierra","Nevada","Yuba"))
Then we’ll use those counties to select towns (places) that occur within them…
CA_places <- st_read(ex("sierra/CA_places.shp"))
nCAsel <- lengths(st_within(CA_places, nSierraCo)) > 0  # to get TRUE & FALSE
nSierraPlaces <- CA_places[nCAsel,]
… and do the same for the
sierraFeb weather stations:
sierra <- st_as_sf(read_csv(ex("sierra/sierraFeb.csv")),
                   coords = c("LONGITUDE","LATITUDE"), crs = 4326)
nCAselSta <- lengths(st_within(sierra, nSierraCo)) > 0  # to get TRUE & FALSE
nSierraStations <- sierra[nCAselSta,]
library(maptiles)
nsierraBase <- get_tiles(nSierraStations, provider = "OpenTopoMap")
library(tmap)
tmap_mode("plot")
tm_shape(nsierraBase) + tm_rgb(alpha = 0.5) +
  tm_shape(nSierraCo) + tm_borders() + tm_text("NAME") +
  tm_shape(nSierraStations) + tm_symbols(col = "blue") +
  tm_shape(nSierraPlaces) + tm_symbols(col = "red", alpha = 0.5) +
    tm_text("AREANAME")
So far, the above is just subsetting for a map, which may be all we’re wanting to do, but we’ll apply this selection to a distance function in the next section to explore a method using a reduced data set.
7.2.2 Centroids
Related to the topological concept of relating features inside other features is creating a new feature in the middle of an existing one, specifically a point placed in the middle of a polygon: a centroid. Centroids are useful when you need point data for some kind of analysis, but that point needs to represent the polygon it’s contained within. The methods we’ll see in the code below include:
st_centroid(): derives a single point for each feature that should be approximately in its middle (and must be topologically contained within it, so not fooled by an annulus (“doughnut”) shape, for example).
st_make_valid(): makes an invalid geometry valid (or just checks for it). This function has become essential now that sf supports spherical as well as planar data, which matters for BayAreaTracts, which ends up containing “slivers” where boundaries slightly under- or overlap since they were originally built from a planar projection. st_make_valid() appears to make minor (and typically invisible) corrections to these slivers.
st_bbox(): reads the bounding box, or spatial extent, of any dataset, which we can then use to set the map scale to focus on that area. In this case, we’ll focus on the census tract centroids instead of the statewide data.
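Here’s a quick sketch (with made-up points) of what st_bbox returns: a named vector with xmin, ymin, xmax, and ymax, whose elements can be pulled out to set plot limits:

```r
library(sf)
pts <- st_as_sf(data.frame(x = c(-122.5, -121.8), y = c(37.3, 38.1)),
                coords = c("x", "y"), crs = 4326)
bnd <- st_bbox(pts)
bnd           # named: xmin ymin xmax ymax
bnd["xmin"]   # individual limits by name (or by position, e.g. bnd[1])
```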
Let’s see the effect of the centroids and bbox. First just centroids:
library(tidyverse); library(sf); library(igisci)
CA_counties <- st_make_valid(CA_counties)  # needed fix for data
ggplot() + geom_sf(data=CA_counties) + geom_sf(data=st_centroid(CA_counties))
Here’s an example that also applies the bounding box to establish a mapping scale that covers the Bay Area while some of the data (TRI locations and CAfreeways) are state-wide:
library(tidyverse)
library(sf)
library(igisci)
BayAreaTracts <- st_make_valid(BayAreaTracts)
censusCentroids <- st_centroid(BayAreaTracts)
TRI_sp <- st_as_sf(read_csv(ex("TRI/TRI_2017_CA.csv")),
                   coords = c("LONGITUDE", "LATITUDE"),
                   crs = 4326)  # simple way to specify coordinate reference
bnd <- st_bbox(censusCentroids)
ggplot() +
  geom_sf(data = BayAreaCounties, aes(fill = NAME)) +
  geom_sf(data = censusCentroids) +
  geom_sf(data = CAfreeways, color = "grey", alpha = 0.5) +
  geom_sf(data = TRI_sp, color = "yellow") +
  coord_sf(xlim = c(bnd[1], bnd[3]), ylim = c(bnd[2], bnd[4])) +
  labs(title = "Bay Area Counties, Freeways and Census Tract Centroids")
7.2.3 Distance
Distance is a fundamental spatial measure, used not only to create spatial data (distance between points in surveying, or distance to GNSS satellites) but also to analyze it. Note that we can either be referring to planar distances (from projected coordinate systems) or spherical great-circle distances (from latitude and longitude). Two common spatial operations involving distance in vector spatial analysis are (a) deriving distances among features, and (b) creating buffer polygons at a specific distance away from features. In raster spatial analysis, we’ll look at deriving distances from target features to each raster cell.
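To see the planar/spherical distinction concretely, here’s a sketch with two made-up points one degree of longitude apart: computed from latitude/longitude the result is a great-circle distance, while after projecting to UTM it’s a planar distance (the two differ slightly):

```r
library(sf)
p <- st_sfc(st_point(c(-120, 38)), st_point(c(-119, 38)), crs = 4326)
st_distance(p)[1, 2]                  # great-circle distance in metres (GCS input)
pUTM <- st_transform(p, crs = 32611)  # UTM zone 11N, a projected CRS
st_distance(pUTM)[1, 2]               # planar distance in metres
```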
7.2.3.1 Distances among features
The sf function st_distance and the terra function distance derive distances between features, either among the features of one data set or between all features in one data set and all features in another.
To see how this works, we’ll look at a purposefully small dataset: a selection of Marble Mountains soil CO2 sampling sites and in-cave water quality sampling sites. We’ll start by reading in the data, and filtering to just get cave water samples and a small set of nearby soil CO2 sampling sites:
library(igisci); library(sf); library(tidyverse)
soilCO2all <- st_read(ex("marbles/co2july95.shp"))
cave_wsamples <- st_read(ex("marbles/samples.shp")) %>%
  filter((SAMPLES_ID >= 50) & (SAMPLES_ID < 60))  # these are the cave samples
soilCO2 <- soilCO2all %>% filter(LOC > 20)  # soil CO2 samples in the same general area
Here are the six selected soil CO2 samples:
library(tmap); library(maptiles)
marblesTopoBase <- get_tiles(soilCO2, provider = "OpenTopoMap")
tmap_mode("plot")
tm_shape(marblesTopoBase) + tm_rgb(alpha = 0.6) +
  tm_shape(soilCO2) + tm_symbols(col = "CO2_", palette = "Reds", size = 4) +
  tm_shape(soilCO2) + tm_text("LOC")
If you just provide the 6 soil CO2 sample points as a single feature data set input to the st_distance function, it returns a matrix of distances between each pair of points, with a diagonal of zeros where the distance would be from a point to itself:
st_distance(soilCO2)
## Units: [m]
##           [,1]      [,2]       [,3]       [,4]     [,5]      [,6]
## [1,]    0.0000  370.5894 1155.64981 1207.62172 3333.529 1439.2366
## [2,]  370.5894    0.0000  973.73559 1039.45041 3067.106 1248.9347
## [3,] 1155.6498  973.7356    0.00000   75.14093 2207.019  283.6722
## [4,] 1207.6217 1039.4504   75.14093    0.00000 2174.493  237.7024
## [5,] 3333.5290 3067.1061 2207.01903 2174.49276    0.000 1938.2985
## [6,] 1439.2366 1248.9347  283.67215  237.70236 1938.299    0.0000
Then we’ll look at distances between this same set of soil CO2 samples with water samples collected in caves, where the effect of elevated soil CO2 values might influence solution processes reflected in cave waters.
library(tmap)
tmap_mode("plot")
marblesTopoBase <- get_tiles(soilCO2, provider = "OpenTopoMap")
tm_shape(marblesTopoBase) + tm_rgb() +
  tm_shape(cave_wsamples) + tm_symbols(col = "CATOT", palette = "Blues") +
  tm_shape(cave_wsamples) + tm_text("SAMPLES_ID") +
  tm_shape(soilCO2) + tm_symbols(col = "CO2_", palette = "Reds") +
  tm_shape(soilCO2) + tm_text("LOC")
soilwater <- st_distance(soilCO2, cave_wsamples)
soilwater
## Units: [m]
##          [,1]     [,2]     [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 2384.210 2271.521 2169.301 1985.6093 1980.0314 2004.1697 1905.9543
## [2,] 2029.748 1920.783 1826.579 1815.2653 1820.0496 1835.9465 1798.1216
## [3,] 1611.381 1480.688 1334.287  842.8228  846.3210  863.1026  849.7313
## [4,] 1637.998 1506.938 1357.582  781.9741  782.4215  801.6262  776.0732
## [5,] 1590.469 1578.730 1532.465 1526.3315 1567.5677 1520.1399 1820.3651
## [6,] 1506.358 1375.757 1220.000  567.8210  578.1919  589.1443  638.8615
In this case, the 6 soil CO2 samples are the rows, and the 7 cave water sample locations are the columns. We aren’t really relating the values, just looking at distances. An analysis of these data might not be very informative, since the caves aren’t very near the soil samples and conduit cave hydrology doesn’t lend itself to looking at Euclidean distance, but the purpose of this example is just to comprehend the results of sf::st_distance or, similarly, terra::distance.
7.2.3.2 Distance to the nearest feature, abstracted from the distance matrix
However, let’s process the matrix a bit to find the distance from each soil CO2 sample to the closest cave water sample:
soilCO2d <- soilCO2 %>% mutate(dist2cave = min(soilwater[1,]))
for (i in 1:length(soilwater[,1]))
  soilCO2d[i, "dist2cave"] <- min(soilwater[i,])
soilCO2d %>%
  select(DESCRIPTIO, dist2cave) %>%
  st_set_geometry(NULL)
##                                              DESCRIPTIO     dist2cave
## 1                Super Sink moraine, low point W of 241 1905.9543 [m]
## 2      Small sink, first in string on hike to SuperSink 1798.1216 [m]
## 3        meadow on bench above Mbl Valley, 25' E of PCT  842.8228 [m]
## 4        red fir forest S of upper meadow, 30' W of PCT  776.0732 [m]
## 5            saddle betw Lower Skyhigh (E) & Frying Pan 1520.1399 [m]
## 6 E end of wide soil-filled 10m fissure N of lower camp  567.8210 [m]
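As an aside, the row-by-row loop above can also be written without a loop: apply with MARGIN = 1 takes the minimum across each row of the distance matrix. A sketch with a small made-up matrix (the same idea works on any st_distance result, e.g. apply(soilwater, 1, min)):

```r
# Each row is one origin feature; the row minimum is the distance to its nearest target
dmat <- matrix(c(10, 25, 5,
                 40, 15, 30), nrow = 2, byrow = TRUE)
apply(dmat, 1, min)  # 5 15
```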
… or since we can also look at distances to lines or polygons, we can find the distance from CO2 sampling locations to the closest stream:
library(igisci); library(sf); library(tidyverse)
soilCO2all <- st_read(ex("marbles/co2july95.shp"))
streams <- st_read(ex("marbles/streams.shp"))
strCO2 <- st_distance(soilCO2all, streams)
strCO2d <- soilCO2all %>% mutate(dist2stream = min(strCO2[1,]))
for (i in 1:length(strCO2[,1]))
  strCO2d[i, "dist2stream"] <- min(strCO2[i,])
marblesTopoBase <- get_tiles(soilCO2, provider = "OpenTopoMap")
tm_shape(marblesTopoBase) + tm_rgb() +
  tm_shape(streams) + tm_lines(col = "blue") +
  tm_shape(strCO2d) + tm_symbols(col = "dist2stream")
7.2.3.3 Nearest feature detection
As we just saw, the matrix derived from st_distance applied to feature data sets can be used for further analyses, such as this distance to the nearest place. But there’s another approach using a specific function that identifies the nearest feature: st_nearest_feature. We’ll look at this with the previously filtered northern Sierra places and weather stations. We start by using st_nearest_feature to create an index vector of the nearest place to each station, and grab its location geometry:
nearest_Place <- st_nearest_feature(nSierraStations, nSierraPlaces)
near_Place_loc <- nSierraPlaces$geometry[nearest_Place]
Then we add the distance to that nearest place to our sf. Note that we’re using st_distance with two vector inputs as before, but ending up with just one distance per feature (instead of a matrix) thanks to the setting by_element=TRUE:
nSierraStations <- nSierraStations %>%
  mutate(d2Place = st_distance(nSierraStations, near_Place_loc, by_element = TRUE),
         d2Place = units::drop_units(d2Place))
… and of course map the results:
sierraStreetBase <- get_tiles(nSierraStations, provider = "OpenStreetMap")
tm_shape(sierraStreetBase) + tm_rgb() +
  tm_shape(nSierraCo) + tm_borders() + tm_text("NAME") +
  tm_shape(nSierraStations) + tm_symbols(col = "d2Place") +
  tm_shape(nSierraPlaces) + tm_symbols(col = "blue", alpha = 0.5)
That’s probably enough examples of using distance to nearest feature so we can see how it works, but to see an example with a larger data set, an example in Appendix A7.2 looks at the proximity of TRI sites to health data at census tracts.
7.2.4 Buffers
Creating buffers, or polygons defining the area within some distance of a feature, is commonly used in GIS. Since you need to specify that distance (or read it from a variable for each feature), you need to know what the horizontal distance units are for your data. If in GCS, these will be decimal degrees, and 1 degree is a long way, about 111 km (or 69 miles), though that differs for longitude, where 1 degree is 111 km * cos(latitude). If in UTM, the horizontal distance will be in metres, but in the US, state plane coordinates are typically in feet. So let’s read the trails shapefile from the Marble Mountains and look for its units:
library(igisci)
library(sf); library(tidyverse)
trails <- st_read(ex("marbles/trails.shp"))
Then we know we’re in metres, so we’ll create a 100-m buffer this way.
trail_buff0 <- st_buffer(trails, 100)
ggplot(trail_buff0) + geom_sf()
Since the spatial data are in UTM, we can just specify the distance in metres. However, if the data were in decimal degrees of latitude and longitude (GCS), we would need to let the function know we’re providing metres so it can work with the GCS, using units::set_units(100, "m") instead of just 100 in the example above where we created a 100 m buffer.
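A sketch of both points just made, using a made-up lat/long point: the arithmetic for the shrinking length of a degree of longitude, and the set_units call that lets st_buffer interpret a metre distance on GCS data:

```r
library(sf)
# 1 degree of longitude is roughly 111 km * cos(latitude)
round(111 * cos(c(0, 38, 60) * pi / 180), 1)  # km per degree at 0, 38, 60 degrees N

# buffering GCS data with an explicit metre distance
pt_gcs <- st_sfc(st_point(c(-120, 38)), crs = 4326)
buff100m <- st_buffer(pt_gcs, units::set_units(100, "m"))
```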
7.2.5 Spatial Overlay: Union and Intersection
Overlay operations are described in the
sf cheat sheet under “Geometry operations.” These are useful to explore, but a couple we’ll look at are union and intersection.
Normally these methods take multiple inputs, but we’ll start with one that can also be used to dissolve boundaries: st_union. If only one input is provided, it appears to do a dissolve:
trail_buff <- st_union(trail_buff0)
ggplot(trail_buff) + geom_sf()
For a clearer example with the normal multiple input, we’ll intersect a 100m buffer around streams and a 100m buffer around trails …
streams <- st_read(ex("marbles/streams.shp"))
trail_buff <- st_buffer(trails, 100)
str_buff <- st_buffer(streams, 100)
strtrInt <- st_intersection(trail_buff, str_buff)
…to show areas that are close to streams and trails:
ggplot(strtrInt) + geom_sf(fill="green") + geom_sf(data=trails, col="red") + geom_sf(data=streams, col="blue")
Or how about a union of these two buffers? We’ll also use st_union with a single input (the outer st_union in the code) to dissolve those internal overlaps:
strtrUnion <- st_union(st_union(trail_buff, str_buff))
ggplot(strtrUnion) + geom_sf(fill = "green") +
  geom_sf(data = trails, col = "red") +
  geom_sf(data = streams, col = "blue")
7.2.6 Clip with st_crop
Clipping a GIS layer to a rectangle (or to a polygon) is often useful. We’ll clip to a rectangle based on latitude and longitude limits we specify; however, since our data are in UTM, we’ll need to use st_transform to get the clip rectangle into the right coordinates:
xmin=-123.21; xmax=-123.18; ymin=41.55; ymax=41.57
clipMatrix <- rbind(c(xmin,ymin), c(xmin,ymax), c(xmax,ymax),
                    c(xmax,ymin), c(xmin,ymin))
clipGCS <- st_sfc(st_polygon(list(clipMatrix)), crs = 4326)
bufferCrop <- st_crop(strtrUnion, st_transform(clipGCS, crs = st_crs(strtrUnion)))
bnd <- st_bbox(bufferCrop)
ggplot(bufferCrop) + geom_sf(fill = "green") +
  geom_sf(data = trails, col = "red") +
  geom_sf(data = streams, col = "blue") +
  coord_sf(xlim = c(bnd[1], bnd[3]), ylim = c(bnd[2], bnd[4]))
7.2.6.1 sf or terra for vector spatial operations?
Note: there are other geometric operations in sf beyond what we’ve looked at. See the cheat sheet and other resources at https://r-spatial.github.io/sf . The terra system also has many vector tools (such as terra::buffer) that work with and create SpatVector spatial data objects. As noted earlier, these S4 spatial data can be created from (S3) sf data with terra::vect, and vice versa with sf::st_as_sf. We’ll mostly focus on sf for vector and terra for raster, but to learn more about SpatVector data and operations in terra, see https://rspatial.org .
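A minimal sketch of that round trip (with a made-up point):

```r
library(sf); library(terra)
ptSF <- st_as_sf(data.frame(x = -120, y = 38),
                 coords = c("x", "y"), crs = 4326)  # an sf (S3) object
ptVect <- vect(ptSF)        # sf to terra SpatVector (S4)
ptBack <- st_as_sf(ptVect)  # and back to sf
```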
For instance, terra has a similar operation for one of the things we just did, but instead of using a single-input union to dissolve the internal boundaries like st_union did, it uses aggregate to remove boundaries between polygons with the same codes:
trails <- vect(ex("marbles/trails.shp"))
trailbuff <- aggregate(buffer(trails, 100))
plot(trailbuff)
7.2.7 Spatial join with st_join
A spatial join can do many things, but we’ll just look at its simplest application – connecting a point with the polygon that it’s in. In this case, we want to join the attribute data in
BayAreaTracts with EPA Toxic Release Inventory (TRI) point data (at factories and other point sources) and then display a few ethnicity variables from the census, associated spatially with the TRI point locations. We’ll be making better maps later; this is a quick
plot() display to show that the spatial join worked.
library(igisci); library(tidyverse); library(sf)
TRI_sp <- st_as_sf(read_csv(ex("TRI/TRI_2017_CA.csv")),
                   coords = c("LONGITUDE", "LATITUDE"), crs = 4326) %>%
  st_join(st_make_valid(BayAreaTracts)) %>%
  filter(!is.na(TRACT)) %>%  # removes points not in the Bay Area -- not in BayAreaTracts
  dplyr::select(POP2012:HISPANIC)
plot(TRI_sp)
7.3 Exploring methods
We’ve looked at a selection of feature-based spatial analysis functions, mostly in the sf package. There are a lot more discussed in the “Spatial data operations” and “Geometry operations” chapters of Geocomputation (https://geocompr.robinlovelace.net/, Lovelace, Nowosad, and Muenchow (2019)) that are worth exploring (and still others in the terra package described at https://rspatial.org, Hijmans (n.d.)), especially when you run into a task that the tools we’ve learned won’t do. There is also good coverage of transformation and reprojection methods in the “Reprojecting geographic data” chapter, and of data access methods in the “Geographic data I/O” chapter, of Geocomputation. And the Simple Features for R reference site (Simple Features for R (n.d.)) includes a complete reference for all of its functions, as does https://rspatial.org for terra.
7.4 Exercises
- (Assuming you wrote them in the last chapter) read in your western states “data/W_States.shp” and peaks “data/peaks.shp” data, then use a spatial join to add the peak points to the states to provide a new attribute, maximum elevation, and display that using geom_sf_text() with the state polygons. Note that joining from the states to the peaks works because there’s a one-to-one correspondence between states and maximum peaks. (If you didn’t write them, you’ll need to repeat your code that built them from the previous chapter.)
- Also using the western states data, change the map to the view mode, but don’t use the state borders since the basemap will have them. Just before adding shapes, set the basemap to leaflet::providers$Esri.NatGeoWorldMap, then continue to the peaks after the + to see the peaks on a National Geographic basemap.
- Using two shape files of your own choice, not from igisci, read them in as sf data, and perform a spatial join to join the polygons to the point data. Then display the point feature data frame, and create a ggplot scatter plot or boxplot graph where the two variables are related, preferably with a separate categorical variable to provide color.
- From the above spatially joined data, create a map where the points are colored by data brought in from the spatial join; this can be either a categorical or continuous variable.
- Create an sf that represents all areas within 50 km of a TRI facility in California that has >10000 pounds of total stack air release for all gases, clipped (intersected) with the state of California as provided in the CA counties data set. Then use this to create a map of California showing the clipped buffered areas in red, with California counties and those selected large-release TRI facilities in black dots, as shown here. To create a variable representing 50 km to provide to the dist parameter in st_buffer [this is needed since the spatial data are in GCS], use the units::set_units function, something like this:
D50km <- units::set_units(50, "km")