Chapter 6 Spatial Models: Spatial Autocorrelation and Cluster Analysis

6.1 Introduction

This practical focusses on cluster analysis (i.e. determining whether your data are spatially clustered, also known as spatially autocorrelated) and, optionally, autoregressive models. These are given a comprehensive treatment (equations, references, etc.).

Essentially, the practical describes how to undertake cluster analyses using different methods for statistically testing the similarity of nearby observations (i.e. spatial autocorrelation). It extends the practicals from GEOG2150 to show other techniques. It then introduces autoregressive models as an optional practical. These are statistical models, like regression, that explicitly accommodate observation location in their input parameters. Underpinning all of these is some concept of nearby, defined either through adjacency or through some distance specification.

Recall that Tobler’s First Law of Geography posits that:

everything is related to everything else, but near things are more related than distant things. (Tobler 1970)

Here, Tobler is arguing (although by his own admission he used the term in a tongue-in-cheek way) that geographically located phenomena that are near to one another tend to have similar characteristics - or that, if this were not the case -

“the full range of conditions anywhere on the Earth’s surface could in principle be found packed within any small area. There would be no regions of approximately homogeneous conditions to be described by giving attributes to area objects. Topographic surfaces would vary chaotically, with slopes that were everywhere infinite, and the contours of such surfaces would be infinitely dense and contorted. Spatial analysis, and indeed life itself, would be impossible.” (De Smith, Goodchild, and Longley 2007, p44)

I think this is saying there would be no such thing as Geography!

Autocorrelation is the similarity between attribute values of nearby observations when examined over the entire study region. Measures of spatial autocorrelation describe the degree of attribute similarity and essentially allow Tobler’s Law to be evaluated.

There are a number of approaches for doing this, as described in Chapters 7 and 8 of Brunsdon and Comber (2018) and in some of the relevant contributions in the GIS Body of Knowledge (see AM-22 - Global Measures of Spatial Association http://gistbok.ucgis.org/bok-topics/global-measures-spatial-association and FC-37 - Spatial Autocorrelation http://gistbok.ucgis.org/bok-topics/spatial-autocorrelation).

Here we will use a spatially lagged mean, Moran’s I (Moran 1950), Local Indicators of Spatial Association (LISA) (Anselin 1995) and Getis and Ord’s G statistic (Getis and Ord 1992) to determine whether there is clustering in our data. We can undertake these tests very easily in R using tools in the spdep package.

6.2 Data and Packages

For this section we will work with the Brexit referendum data summarised over the 380 UK Local Authority Districts (LADs), with socio-economic data collected from the 2011 Census. The data were made available by The Electoral Commission and used by Beecham, Williams, and Comber (2020) for some advanced visual analysis of the trends. They were also used in Week 2 for the EDA work.

The code below loads the data, transforms it from WGS84 (in degrees) to the OSGB projection (in metres) and generates a map of the leave share.

# package installation checks
if (!is.element("sf", installed.packages()))
    install.packages("sf", dep = T)
if (!is.element("tmap", installed.packages()))
    install.packages("tmap", dep = T)
if (!is.element("spdep", installed.packages()))
    install.packages("spdep", dep = T)
if (!is.element("spatialreg", installed.packages()))
    install.packages("spatialreg", dep = T)
if (!is.element("tidyverse", installed.packages()))
    install.packages("tidyverse", dep = T)
# load packages
library(sf)         # for spatial data
library(tmap)       # for mapping
library(spdep)      # for spatial analysis
library(spatialreg) # for spatial regression 
library(tidyverse)  # for everything else! 
# load the data
gb = st_read('data_gb.geojson', quiet = T)
# transform to OSGB projection
gb <- st_transform(gb, 27700)
# create the map
p1 = 
  tm_shape(gb) + 
  tm_polygons("share_leave", midpoint = 0.5, lwd = 0.1) + 
  tm_layout(frame = F)
p1 

The map shows some distinct and well-reported patterns. However, a key question a geographer might ask is whether the spatial patterns of the leave vote can be examined in a more quantifiable way. Lagged mean plots can do this.

6.3 Spatial Autocorrelation and Clusters

6.3.1 Neighbours and Lagged Mean Plots

One simple approach for determining clusters is to compare the value in each LAD with the average value of its neighbours. This can be achieved via the lag.listw function in the spdep library. This library provides a number of tools for handling spatial data.

Neighbours can be defined in several ways, but a common definition is that a pair of areas (or polygons) which share some part of their boundaries are neighbours. If this is the “queen’s case” definition, then even a pair of areas meeting at a single corner point are considered neighbours. The more restrictive “rook’s case” requires that the common boundary must be a linear feature.

Neighbour lists of either case can be extracted using the poly2nb function in the spdep package. These are stored in an nb object – basically a list of neighbouring polygons for each LAD polygon. Note that in this case Queen’s case neighbours are computed. This is the default. To compute the Rook’s case, the optional argument queen=FALSE would be added to the call to poly2nb below.
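For reference, the Rook’s case call would simply be the following one-line variant (not run in this practical):

# rook's case adjacency: corner-only contiguities are excluded
g.nb.rook <- poly2nb(gb, queen = FALSE)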

# determine adjacency 
g.nb <- poly2nb(gb) 
# examine
g.nb
## Neighbour list object:
## Number of regions: 380 
## Number of nonzero links: 1888 
## Percentage nonzero weights: 1.307479 
## Average number of links: 4.968421 
## 6 regions with no links:
## 101 125 126 171 174 179

Now from this it is clear that some LADs have no neighbours: observations 101, 125, 126, 171, 174 and 179. On inspection these are island LADs. You can examine these visually using the code below and then zooming in - have a look at the Isle of Wight or Anglesey, for example.

# examine zero links locations
gb$rn = rownames(gb) 
tmap_mode("view")
tm_shape(gb) + 
  tm_borders() +
  tm_text(text = "rn") +
  tm_basemap("OpenStreetMap")
tmap_mode("plot")

Now in order to complete the adjacency analysis, the list needs to be updated with links for island LADs. The table below shows the links and record numbers (IDs) of adjacent LADs:

Table 6.1: The island LADs.

ID code   Authority Name     Adjacent to                           Adjacent ID codes
101       Scilly Isles       Cornwall                              100
125       Isle of Wight      New Forest, Southampton, Portsmouth   247, 124, 123
126       Anglesey           Gwynedd                               127
171       Orkneys            Highland, Shetland                    165, 174
174       Shetland Islands   Orkneys                               171
179       Outer Hebrides     Highland                              165

We can use this information (discovered by examining the data using the zoomable online maps) to edit the adjacency list as below. Notice the double bracket syntax used with list format data (g.nb) and the use of as.integer to force the numeric values to be integers:

## edit the links
g.nb[[101]] = as.integer(100)
g.nb[[125]] = as.integer(c(247, 124, 123))
g.nb[[126]] = as.integer(127)
g.nb[[171]] = as.integer(c(165, 174))
g.nb[[174]] = as.integer(171)
g.nb[[179]] = as.integer(165)

We can re-examine the adjacency list and see that there are now no zero links.

g.nb
## Neighbour list object:
## Number of regions: 380 
## Number of nonzero links: 1897 
## Percentage nonzero weights: 1.313712 
## Average number of links: 4.992105

It is also possible to plot an nb object – this represents the neighbours as a network. The nb2lines function in spdep requires a coords parameter giving the centroids of the (LAD) polygons regarded as neighbours. In the code below, this is created by the st_centroid function nested inside the st_geometry function. You should examine the help for these to see what they do if it is not apparent from their names. The results may be plotted as a map as in the figure below.

# Create a line layer showing Queen's case contiguities
gg.net <- nb2lines(g.nb,coords=st_geometry(st_centroid(gb)), as_sf = F) 
## Warning: st_centroid assumes attributes are constant over geometries
# Plot the contiguity and the GB layer
tm_shape(gb) + tm_borders(col='grey') + 
  tm_shape(gg.net) + tm_lines(col='red', alpha = 0.5)

The next stage is to consider the lagged mean plot – as discussed above, this is a plot of the value of \(z_i\) for each polygon \(i\) against the mean of the \(z\)-values for the neighbours of polygon \(i\). If \(\delta_i\) is the set of indices of the neighbours of polygon \(i\), and \(|\delta_i|\) is the number of elements in this set, then this mean (denoted as \(\tilde{z_i}\)) is defined by

\[ \tilde{z_i} = \sum_{j \in \delta_i} \frac{1}{|\delta_i|} z_j \label{laggedmean} \]

Thus, the lagged mean is a weighted combination of values of the neighbours. In this case, the weights are the same within each neighbour list, but in some cases they may differ (for example, if weighting were inversely related to distance between polygon centres).

Neighbour weights can be generated using nb2listw, which results in listw objects. The function lag.listw then computes a spatially lagged mean (i.e. a vector of \(\tilde{z_i}\) values) – here, these are calculated and then mapped as in the figure below, alongside the original map. What this shows is a smoother general pattern of the spatial trends in the data.

# compute the weighted list from the nb object
g.lw = nb2listw(g.nb)
# compute the lagged means
gb$lagged.means <- lag.listw(g.lw, gb$share_leave)
p2 = 
  tm_shape(gb) + 
  tm_polygons(col='lagged.means', midpoint = 0.5, lwd = 0.1) + 
  tm_layout(frame = F)
tmap_arrange(p1, p2)
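As a quick sanity check, the lagged mean for a single LAD can be reproduced by hand: because nb2listw uses row-standardised weights by default, each lagged mean is just a simple average of the neighbours’ values. A minimal sketch using the objects created above:

# verify the lagged mean for the first LAD
i <- 1
mean(gb$share_leave[g.nb[[i]]])   # simple average of the neighbours' values
gb$lagged.means[i]                # should match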

A lagged mean plot is produced – as described, this is a scatter plot of \(z_i\) (share_leave) against the lagged mean, \(\tilde{z_i}\). Here the line \(x = y\) is added to the plot as a point of reference. The idea is that when nearby polygons tend to have similar \(z_i\) values, there should be a linear trend in the plots. However, if each \(z_i\) is independent, then \(\tilde{z_i}\) will be uncorrelated to \(z_i\) and the plots will show no pattern. Below, code is given to produce the plot shown:

ggplot(data = gb, aes(x = share_leave, y = lagged.means)) +
  geom_point(shape = 1, alpha = 0.5) +
  geom_hline(yintercept = mean(gb$lagged.means), lty = 2) +
  geom_vline(xintercept = mean(gb$share_leave), lty = 2) +
  geom_abline() +
  coord_equal()

The key reason for this plot is to see whether there is a linear relationship; such a pattern suggests that autocorrelation is present, i.e. that there is some degree of positive association between \(z_i\) and \(\tilde{z_i}\). This means generally that when \(z_i\) is above average, so is \(\tilde{z_i}\), and when one is below average, so is the other.

In this case we have the first piece of evidence that there is spatial autocorrelation in the leave share vote: the lagged mean plot indicates that some spatial autocorrelation is present. However, a key question a geographer might ask is whether the spatial patterns of the leave vote are statistically (significantly) clustered.

6.3.2 Moran’s I for determining Clusters and Spatial Autocorrelation

Having found spatial autocorrelation in the lagged mean plots, we can test for this statistically to determine whether nearby observations (such as the share_leave values in the LAD data) are spatially clustered - that is, whether co-dependency between pairs of adjacent or neighbouring values is present. Moran’s I, LISA and G statistics allow us to test for this, and provide a measure of the degree of clustering amongst neighbouring observations. The Moran’s I coefficient (Moran 1950) is defined as follows:

\[ I = \frac{n}{\sum_i \sum_j w_{ij}} \frac{\sum_i \sum_j w_{ij} (z_i - \bar{z}) (z_j - \bar{z})}{\sum_i (z_i - \bar{z})^2} \label{morani} \]

where \(w_{ij}\) is the \(i,j\)th element of a neighbour or adjacency weights matrix \(\mathbf{W}\), as defined above using the nb2listw function, specifying the degree of dependency between polygons \(i\) and \(j\), \(z\) is the attribute examined for autocorrelation and \(\bar{z}\) is the mean of that attribute.

From the above equation it is clear that in order to compute some measure of autocorrelation, information about observation nearness or adjacency is needed: for example, which LADs in the UK are adjacent to, or neighbouring, which other LADs. We can use the weighted list g.lw created above. The code below uses this to compute Moran’s I for the vote leave share variable (share_leave) with the moran.test() function:

# do a Moran's I
moran.test(x = gb$share_leave, listw = g.lw) 
## 
##  Moran I test under randomisation
## 
## data:  gb$share_leave  
## weights: g.lw    
## 
## Moran I statistic standard deviate = 18.268, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.626791902      -0.002638522       0.001187123

This tells us that the attribute is highly autocorrelated: Moran’s I has theoretical limits from -1 for perfect dispersion or repulsion, to +1 for perfect clustering, with 0 indicating a perfectly random pattern.
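To connect the reported statistic back to the equation above, \(I\) can also be computed from first principles using the weights matrix - a sketch using the objects created earlier (listw2mat converts the weights list to a matrix):

# compute Moran's I directly from its definition
W  <- listw2mat(g.lw)    # row-standardised weights matrix
z  <- gb$share_leave
zc <- z - mean(z)        # centred attribute values
n  <- length(z)
(n / sum(W)) * sum(W * outer(zc, zc)) / sum(zc^2)

This should reproduce (up to rounding) the Moran’s I statistic reported by moran.test.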

In order to explore this a bit more deeply we can use a Moran plot or Moran scatterplot – see Anselin (1995); Anselin (2019). This generates a scatter plot of the average of the neighbouring LADs’ leave share (referred to as spatially lagged) against the leave share value recorded for each LAD. The four quadrants are determined by the mean for each of these and indicate high-high values (top right) and low-low values (bottom left). So the top right shows LADs which have high leave share rates and are surrounded by LADs which also have high leave share rates.

This is the same as the lagged mean plot above but, in addition to the ggplot version earlier, this approach also identifies points with a high influence on the best-fit line of the plot.

# create the Moran plot
moran.plot(x = gb$share_leave, listw = g.lw, asp = 1)

This shows a strong positive linear trend reflecting the autocorrelation of the share_leave attribute. The labelled points are those with a high influence in providing a best-fit line to the plot.

Returning to the Moran’s I value, this suggests a large amount of spatial autocorrelation in the share_leave attribute. This is fine, but one problem is deciding whether the above value is sufficiently high to suggest that an autocorrelated process model is a plausible alternative to an assumption that the voting rates are independent. There are two issues here:

  1. Is this value of \(I\) a relatively large level on an absolute scale?
  2. How likely is the observed \(I\) value, or a larger value, to be observed if the rates were independent?

The first of these is a benchmarking problem. Like correlation, Moran’s I is a dimensionless property so that, for example, with a given set of polygons and associated matrix, area-normalised rates of rainfall would have the same Moran’s I regardless of whether rainfall was measured in millimetres or inches. However, while standard correlation is always restricted to lie within the range [-1, +1], making, say, a value of 0.8 reasonably easy to interpret, the range of Moran’s I varies with the nature of the lw object, the weighted adjacency matrix, \(\mathbf{W}\).

To overcome this, the function below determines the range of possible \(I\) values for any given lw object:

moran.range <- function(lw) {
  wmat <- listw2mat(lw)
  return(range(eigen((wmat + t(wmat)) / 2)$values))
}
moran.range(g.lw)
## [1] -0.8657588  1.1079643

Here, \(I\) can range between -0.87 and 1.11. Thus on an absolute scale the reported value suggests a high level of spatial autocorrelation: i.e. that the leave share of the vote is strongly clustered.
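The second issue above - how likely the observed \(I\) would be if the rates were independent - can be checked with a permutation (Monte Carlo) test, which repeatedly shuffles the attribute values across the LADs and compares the observed statistic with the simulated distribution. The moran.mc function in spdep does this:

# Monte Carlo (permutation) test of Moran's I
set.seed(123)    # for reproducibility
moran.mc(gb$share_leave, listw = g.lw, nsim = 999)

However, another key question a geographer might ask is where the leave vote is significantly clustered. This can be done with a LISA statistic.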

Task 1

Undertake a Moran’s I test and Moran plot of the private_transport_to_work variable. Comment on how it compares to the share_leave attribute.

6.3.3 LISA for determining local clusters and spatial autocorrelation

The value of \(I\) above is a global value - it summarises the whole study area. In reality some regions of the study area may be more clustered than others. Anselin (1995) proposed the idea of local indicators of spatial association (LISAs). He states two requirements for a LISA:

  • The LISA for each observation gives an indication of the extent of significant spatial clustering of similar values around that observation.
  • The sum (or mean) of LISAs for all observations is proportional to a global indicator of spatial association.

It is possible to apply a statistical test to the LISA for each observation, and thus test whether the local contribution to clustering around an observation is significantly different from zero. This provides a framework for identifying localities where there is a significant degree of spatial autocorrelation or clustering (or repulsion). An initial example of a LISA may be derived from the Moran’s I index.

This uses the values in adjacent polygons to derive \(I_i\), a local measure of clustering. It indicates repulsion when \(I_i\) is a large negative value, clustering when it is a large positive value, and if the magnitude is not particularly large (for either positive or negative values) this suggests that there is little evidence for either clustering or repulsion. For each \(I_i\), a significance test may be carried out against a hypothesis of no spatial association.
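One common formulation of the local Moran’s I for observation \(i\) (Anselin 1995) is

\[ I_i = \frac{(z_i - \bar{z})}{\frac{1}{n}\sum_{k} (z_k - \bar{z})^2} \sum_{j} w_{ij} (z_j - \bar{z}) \]

so that the sum of the \(I_i\) values is proportional to the global \(I\) defined earlier, satisfying the second LISA requirement above.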

The code snippets below undertake this:

  1. Compute the local Moran’s I (and map the result)
# note how the result is assigned directly to gb
gb$lI <- localmoran(x = gb$share_leave, listw = g.lw)[, 1] 
# create the map
p3 = tm_shape(gb) +
  tm_polygons(col= 'lI',title= "Local Moran's I", lwd = 0.1,
              legend.format=list (flag= "+"), midpoint = 0) +
  tm_style('col_blind') +
  tm_layout(frame = F)
# print the map
p3

The map shows there is some evidence for both positive and negative \(I_i\) values - i.e. that there is some clustering and some dispersion. However, it is useful to consider the p-values for each of these values, as described above.

  2. Compute the p-values

These are calculated and mapped below. In this case a manual shading scheme (i.e. one in which the shading interval breaks are specified directly) is used, based on conventional critical p-values.

# Create the local p values
gb$pval <- localmoran(gb$share_leave,g.lw)[, 5]

The p-values can be plotted. Here the palette has been selected manually, via the palette parameter. The palette shades higher numbers in a lighter colour to reflect the fact that lower p-values are more important.

# create a palette
my.pal = c("#31A354", "#A1D99B","#E5F5E0",   "lightgrey")
# create the map
p4 = tm_shape(gb) +
  tm_polygons(col= 'pval',title= "p-value",breaks= c(0, 0.01, 0.05, 0.10, 1), 
              border.col = "black", lwd = 0.1,
              palette = my.pal) +
  tm_layout(frame = F) 
# plot it
p4

From the resultant map, a number of places can be identified where the p-value is notably low (dark greens) and therefore statistically surprising, suggesting the possibility of a cluster of either high or low values - in this case in highly localised areas in Scotland, the East Midlands, Birmingham and around London.

It is instructive to examine the maps together:

tmap_arrange(p1, p3, p4,ncol = 3)

And we can use the p-values to highlight the significantly spatially autocorrelated local areas, as below:

index  = gb$pval <= 0.05
p1 + tm_shape(gb[index,]) +tm_borders()

Task 2

Undertake a LISA analysis of the degree_educated variable and map the significantly autocorrelated LADs as in the figure above (no need to use midpoint = 0.5).

6.3.4 Getis-Ord G statistic

An alternative to the analyses and maps above is to use the Getis and Ord \(G\) statistic (Getis and Ord 1992). Instead of similarity between adjacent areas, this examines similarity within a pre-defined distance in order to identify whether either high or low values cluster spatially. Statistically significant clusters are those with high values whose surrounding areas, within the predefined distance, also share high values.
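In its original form (Getis and Ord 1992) the statistic for area \(i\) is

\[ G_i(d) = \frac{\sum_{j \ne i} w_{ij}(d) z_j}{\sum_{j \ne i} z_j} \]

where \(w_{ij}(d)\) is 1 if area \(j\) lies within distance \(d\) of area \(i\) and 0 otherwise. The localG function used below reports a standardised (z-score) version of this.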

Here we will use the LAD centroids and a search radius of 100 km and, as before, some of the islands need to be connected in the network:

g.nb2 <- dnearneigh(st_centroid(gb), 0, 100000)
## Warning: st_centroid assumes attributes are constant over geometries
g.nb2
## Neighbour list object:
## Number of regions: 380 
## Number of nonzero links: 26724 
## Percentage nonzero weights: 18.50693 
## Average number of links: 70.32632 
## 4 regions with no links:
## 101 171 174 179
# connect islands as before
g.nb2[[101]] = as.integer(100)
g.nb2[[171]] = as.integer(c(165, 174))
g.nb2[[174]] = as.integer(171)
g.nb2[[179]] = as.integer(165)
g.nb2
## Neighbour list object:
## Number of regions: 380 
## Number of nonzero links: 26729 
## Percentage nonzero weights: 18.51039 
## Average number of links: 70.33947

Having defined a new set of neighbours, the weighted matrix can be created as before and we can use this to calculate the \(G\) Statistic:

g.lw2 = nb2listw(g.nb2, style = 'B')
gb$gstat <- localG(gb$share_leave, g.lw2)

This can be mapped in the same way as the LISA (with a subtly different syntax). This shows clusters of high leave votes and low ones.

p5 = tm_shape(gb) + 
  tm_polygons(col= 'gstat',title= "G Statistic", lwd = 0.1, 
              legend.format=list (flag= "+"), palette = "RdBu", midpoint = 0) + 
  tm_layout(frame = F)
p5

The impacts (benefits?) of incorporating a broader idea of neighbourhood can be seen here: the key leave areas have a positive \(G\) (in blue) and the key remain areas a negative one (in red). So the interpretation of the \(G\) is slightly different:

  • in a LISA negative values indicate dispersion and positive ones clustering. But we have no idea about the values of the variable itself;
  • in a \(G\) analysis the results are given as z-scores (a way of standardising the distribution of numeric variables to the same scale) and larger values indicate high clustering and the direction (positive or negative) indicates high or low value clusters.
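For example, the z-scores could be used to flag hot and cold spots at roughly the 5% (two-sided) level - a minimal sketch using the gstat values computed above:

# flag hot / cold spots from the G statistic z-scores
gb$hotspot <- cut(as.numeric(gb$gstat),
                  breaks = c(-Inf, -1.96, 1.96, Inf),
                  labels = c("cold spot", "not significant", "hot spot"))
table(gb$hotspot)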

Task 3

Recreate the \(G\) statistic, but this time using a neighbourhood defined over 60 km, and compare the mapped results with those for a 100 km radius. You will need to connect the islands in the same way as for the 100 km case.

6.3.5 Answers to Tasks

Task 1: Undertake a Moran’s I test and Moran plot of the private_transport_to_work variable. Comment on how it compares to the share_leave attribute.

# Task 1
# moran test
moran.test(gb$private_transport_to_work, g.lw) 
## 
##  Moran I test under randomisation
## 
## data:  gb$private_transport_to_work  
## weights: g.lw    
## 
## Moran I statistic standard deviate = 21.754, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.741982709      -0.002638522       0.001171646
# moran plot
moran.plot(gb$private_transport_to_work, g.lw)

There are a number of things to note from this plot:

  • the spatial autocorrelation is stronger (0.74) indicating a stronger degree of clustering
  • the Moran plot reflects this with a tighter scatterplot
  • there is an easier explanation: levels of private transport use for travel to work are strongly related to urban / rural context, and these by definition have a strong geographical component: nearby areas are likely to have similar levels of this variable just because of what it captures (from high car dependency in rural areas with little public transport infrastructure, to low levels of car commuting in well-served urban ones).

Task 2: Undertake a LISA analysis of the degree_educated variable and map the significantly autocorrelated LADs as in the figure above (no need to use midpoint = 0.5).

The results are shown below. They show some areas of significant spatial autocorrelation of low proportions in the East, the East Midlands, South Wales and the Black Country, and significant areas of high proportions around London and Cambridge.

# Task 2
# create the p-value
pval <- localmoran(gb$degree_educated,g.lw)[, 5]
index  = pval <= 0.05
tmap_mode("plot")
## tmap mode set to plotting
p0 = tm_shape(gb) + tm_polygons("degree_educated", lwd = 0.1) +
  tm_layout(legend.position = c("left", "top")) 
p0 + tm_shape(gb[index,]) +tm_borders()

Task 3: Recreate the \(G\) Statistic, but this time using neighbourhood defined over 60 km and compare the mapped results with those for a 100 km radius. You will need to connect the islands in the same ways as for the 100 km.

# Task 3
# create new neighbourhood layer
g.nb3 <- dnearneigh(st_centroid(gb), 0, 60000)
## Warning: st_centroid assumes attributes are constant over geometries
# examine empty slots
g.nb3
## Neighbour list object:
## Number of regions: 380 
## Number of nonzero links: 12462 
## Percentage nonzero weights: 8.630194 
## Average number of links: 32.79474 
## 5 regions with no links:
## 101 165 171 174 179
# connect islands as before
g.nb3[[101]] = as.integer(100)
g.nb3[[165]] = as.integer(171)
g.nb3[[171]] = as.integer(c(165, 174))
g.nb3[[174]] = as.integer(171)
g.nb3[[179]] = as.integer(165)
# convert to weighted neighbourhood list 
g.lw3 = nb2listw(g.nb3, style = 'B')
# calculate the G stat
gb$gstat2 <- localG(gb$share_leave, g.lw3)
# plot
p6 = tm_shape(gb) + 
  tm_polygons(col= 'gstat2',title= "G Statistic", lwd = 0.1, 
              legend.format=list (flag= "+"), palette = "RdBu", midpoint = 0) + 
  tm_layout(frame = F)
# compare, adding titles
tmap_arrange(p5 + tm_layout(title = "100 km"), 
             p6 + tm_layout(title = "60 km"))

This demonstrates subtly different spatial patterns and hints at the different ways that the concept of a neighbourhood can be defined.

6.4 Optional: Autoregressive models

In the previous sections different measures of spatial autocorrelation were calculated. However, up to this point no consideration has been given to a model of a spatially autocorrelated process. In this section, two spatial models will be considered, referred to as spatial autoregressive models. What they do essentially is to regress the \(z_i\) value for any given polygon on values of \(z_j\) for neighbouring polygons. The two models that will be considered are the simultaneous autoregressive (SAR) and conditional autoregressive (CAR) models. In each case, the models can also be thought of as multivariate distributions for \(\mathbf{z}\), with the variance/covariance matrix being dependent on the \(\mathbf{W}\) matrix considered earlier.

This section is based on Chapter 7 of Brunsdon and Comber (2018).

6.4.1 Data and Packages

For this section we will work with the Pennsylvania lung cancer data. This is a set of county-level lung cancer counts for 2002. The counts are stratified by ethnicity (with the rather broad categories ‘white’ and ‘other’), gender, and age (‘under 40’, ‘40 to 59’, ‘60 to 69’ and ‘over 70’). In addition, a table of the proportion of smokers per county is provided. This is included in the SpatialEpi package.

The code below loads the packages and data, converts the data to an sf object, calculates the smoking rate and maps it.

# package installation checks
if (!is.element("sf", installed.packages()))
    install.packages("sf", dep = T)
if (!is.element("tmap", installed.packages()))
    install.packages("tmap", dep = T)
if (!is.element("spdep", installed.packages()))
    install.packages("spdep", dep = T)
if (!is.element("spatialreg", installed.packages()))
    install.packages("spatialreg", dep = T)
if (!is.element("SpatialEpi", installed.packages()))
    install.packages("SpatialEpi", dep = T)
if (!is.element("plyr", installed.packages()))
    install.packages("plyr", dep = T)
# load packages
library(sf)         # for spatial data
library(tmap)       # for mapping
library(spdep)      # for spatial analysis
library(spatialreg) # for spatial regression 
library(SpatialEpi) # for the lung cancer data
library(plyr)       # for some data wrangling! 
# load the data
data(pennLC)
# Extract the spatial info
penn.sp <- pennLC$spatial.polygon
# create the SpatialPolygonsDataFrame - sp format
penn.sp = SpatialPolygonsDataFrame(penn.sp, 
                                   data = data.frame(pennLC$smoking), 
                                   match.ID = F)
# create the sf object
penn.sf = st_as_sf(penn.sp)
# convert to utm projection
penn.sf = st_transform(penn.sf, 3724)
# Obtain the smoking rates
penn.sf$smk <- pennLC$smoking$smoking * 100
# create the map
tm_shape(penn.sf) + tm_polygons("smk", palette = "YlOrRd", title= "% of Popn.") + 
  tm_layout(legend.outside = T)

6.4.2 Simultaneous AutoRegressive models

The SAR model may be specified as \[ z_i = \mu + \sum_{j=1}^{j=n} b_{ij} \left(z_j - \mu \right) + \epsilon_i \]

where \(\epsilon_i\) has a Gaussian distribution with mean 0 and variance \(\sigma_i^2\) (often \(\sigma_i^2 = \sigma^2 \ \forall \ i\), so that the variance of \(\epsilon_i\) is constant across zones), \(\textrm{E}(z_i) = \mu\) and \(b_{ij}\) are constants, with \(b_{ii} = 0\) and usually \(b_{ij} = 0\) if polygon \(i\) is not adjacent to polygon \(j\) - thus, one possibility is that \(b_{ij}\) is \(\lambda w_{ij}\). Here, \(\lambda\) is a parameter specifying the degree of spatial dependence. When \(\lambda=0\) there is no dependence, when it is positive then positive autocorrelation exists, and when it is negative, negative correlation exists. \(\mu\) is an overall level constant (as it is in a standard normal distribution model). If the rows of \(\mathbf{W}\) are normalised to sum to one, then the deviation from \(\mu\) for \(z_i\) is dependent on the deviation from \(\mu\) for the \(z_j\) values for its neighbours.
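Note that, mirroring the matrix form given for the CAR model below, the SAR model can also be written as a multivariate normal distribution. Assuming a constant error variance \(\sigma^2\) and writing \(\mathbf{B} = [b_{ij}]\), rearranging the model gives

\[ \mathbf{z} \sim N\left(\mu \mathbf{1}, \ \sigma^2\left[(\mathbf{I}-\mathbf{B})^{\top}(\mathbf{I}-\mathbf{B})\right]^{-1}\right) \]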

The SAR model may be calibrated using the spautolm function from the spatialreg package. This uses the same notation as other regression functions to specify models, such as lm.

An initial model can be specified by using the notation for a linear model with just a constant term for the mean of the predicted variable - this is \(\mu\) in the equation above. This is simply Var.Name ~ 1 with Var.Name replaced with the actual variable name of interest (for example smk in the smoking rate example). A further parameter, family, specifies whether a SAR or a CAR model is fitted. The function returns a regression model object - among other things this allows the values of coefficients, fitted values and so on to be extracted. The example below first calculates the neighbour or adjacency weights matrix \(\mathbf{W}\) as before and then the SAR model.

# find neighbours
penn.nb <- poly2nb(penn.sf)
# neighbour weights matrix
penn.lw <- nb2listw(penn.nb)
sar.res <- spautolm(smk~1,listw=penn.lw,data=penn.sf)
sar.res
## 
## Call:
## spautolm(formula = smk ~ 1, data = penn.sf, listw = penn.lw)
## 
## Coefficients:
## (Intercept)      lambda 
##  23.8102179   0.6258136 
## 
## Log likelihood: -142.7405

This is a simple model of neighbouring values of \(z_i\). From this it can be seen that \(\lambda =\) 0.626, and \(\mu\) = 23.810. While the estimate for \(\mu\) is easily interpretable, deciding whether the reported level of \(\lambda\) is important is harder. One possibility is to find the standard error of \(\lambda\) - this is reported as the lambda.se component of the spatial autoregression object:

sar.res$lambda.se
## [1] 0.1132383

An approximate 95% confidence interval can be found in the standard way - by finding a band given by the estimate of \(\lambda\) plus or minus two times the standard error:

sar.res$lambda + c(-2,2)*sar.res$lambda.se
## [1] 0.3993369 0.8522903

As before, this suggests that a null hypothesis of \(\lambda=0\) is highly unlikely.

In the next section, the SAR and CAR models will be expanded to consider further predictor variables, rather than just neighbouring values of \(z_i\).

6.4.3 Conditional AutoRegressive models

The CAR model is specified by

\[ z_i| \left\{ z_j : j \ne i \right\} \sim N \left( \mu + \sum_{j=1}^{j=n} c_{ij} \left(z_j - \mu \right), \tau_i^2 \right) \]

where, in addition to the above definitions, \(N(.,.)\) denotes a normal distribution with the usual mean and variance parameters, \(\tau_i^2\) is the conditional variance of \(z_i\) given \(\left\{ z_j : j \ne i \right\}\) and \(c_{ij}\) are constants such that \(c_{ii} = 0\) and, as with \(b_{ij}\) in the SAR model, typically \(c_{ij} = 0\) if polygon \(i\) is not adjacent to polygon \(j\). Again, a common model is to set \(c_{ij} = \lambda w_{ij}\). \(\mu\) and \(\lambda\) have similar interpretations to the SAR model. A detailed discussion in Cressie (2015) refers to the matrices \(\mathbf{B} = [b_{ij}]\) and \(\mathbf{C}=[c_{ij}]\) as ‘spatial dependence’ matrices. Note that this model can be expressed as a multivariate normal distribution in \(\textbf{z}\) as

\[ \textbf{z} \sim N(\mu \mathbf{1}, (\mathbf{I} - \mathbf{C})^{-1}\mathbf{T}) \label{carmat} \]

where \(\mathbf{1}\) is a column vector of 1’s (of size \(n\)) and \(\mathbf{T}\) is a diagonal matrix composed of the \(\tau_i\)’s - see Besag (1974) for example. Note that this suggests that the matrix \((\mathbf{I} - \mathbf{C})^{-1}\mathbf{T}\) must be symmetrical and also positive definite. If the \(\mathbf{W}\) matrix is row-normalised, and the \(c_{ij} = \lambda w_{ij}\) model is used, then this implies that \(\tau_i\) must be proportional to \(\left[ \sum_j c_{ij} \right]^{-1}\).

It is also possible to calibrate CAR models in the same way, and similarly obtain an approximate confidence interval for \(\lambda\). This is achieved - in our example - via the family parameter:

car.res <- spautolm(smk~1, listw=penn.lw, family="CAR", data=penn.sf)
## Warning in spautolm(smk ~ 1, listw = penn.lw, family = "CAR", data = penn.sf):
## Non-symmetric spatial weights in CAR model
car.res
## 
## Call:
## spautolm(formula = smk ~ 1, data = penn.sf, listw = penn.lw, 
##     family = "CAR")
## 
## Coefficients:
## (Intercept)      lambda 
##  24.0981682   0.9199742 
## 
## Log likelihood: -142.4412

This is another simple model of neighbouring values of \(z_i\). From this it can be seen that \(\lambda =\) 0.920, and \(\mu\) = 24.098.

Again the lambda.se component of the spatial autoregression object can be examined and an approximate 95% confidence interval can be found, which again suggests that the null hypothesis of \(\lambda=0\) is highly unlikely.

car.res$lambda.se
## [1] 0.08976478
car.res$lambda + c(-2,2)*car.res$lambda.se
## [1] 0.7404447 1.0995038

6.4.4 Models with predictors: A Bivariate Example

Both the CAR and SAR models can be modified to include predictor variables in as well as incorporate autocorrelation effects. The example below expands the SAR model above, but a similar approach can be taken with CAR models.

This is achieved by replacing a constant \(\mu\) by an observation specific \(\mu_i\) for each \(z_i\), where \(\mu_i\) is some function of a predictor variable (say \(P_i\)). If the relationship between \(\mu_i\) and \(P_i\) is linear, we can write, for the SAR case:

\[ z_i = a_0 + a_1 P_i + \sum_{j=1}^{j=n} b_{ij} \left(z_j - a_0 - a_1 P_j \right) + \epsilon_i \]

where \(a_0\) and \(a_1\) are effectively intercept and slope terms in a regression model. The key difference between this kind of model and a standard ordinary least squares (OLS) model is that for the OLS case the \(z_i\) values are assumed to be independent, whereas here nearby \(z_j\) values influence \(z_i\) as well as the predictor variable.

Calibrating models such as that in the equation above in R is straightforward, and involves including predictor variables in the model argument for spautolm. In the following example, a new data item, the per-county lung cancer rate for Pennsylvania in 2002, is computed and used as the \(z_i\) variable. This time the role of the smoking uptake variable is changed to that of the predictor variable, \(P_i\). This is achieved via a two-stage process:

  1. Compute the per-county lung cancer rates
  2. Compute the regression model

Recall that the pennLC object is a list, and one of the elements (called data) is a data frame giving the counts of population, and lung cancer incidence, for each county in Pennsylvania subdivided by race (‘white’ or ‘other’), gender (‘male’ or ‘female’), and age (‘Under 40’, ‘40 to 59’, ‘60 to 69’, and ‘Over 70’). The format of the data frame uses a county column and three substrata columns - together specifying a combination of county, age, gender and ethnicity. (NB we now use the term ethnicity but unfortunately the supplied data uses race). Two further columns then specify the count of cases for that county/substrata combination, and also the overall population for the same county substrata combination:

head(pennLC$data)
##   county cases population race gender      age
## 1  adams     0       1492    o      f Under.40
## 2  adams     0        365    o      f    40.59
## 3  adams     1         68    o      f    60.69
## 4  adams     0         73    o      f      70+
## 5  adams     0      23351    w      f Under.40
## 6  adams     5      12136    w      f    40.59

For example, it may be seen that Adams County has 0 cases of lung cancer among non-white females under 40 (o denotes ‘other’, i.e. non-white) out of a total population of 1492 non-white females under 40 in that county. Using the plyr package, it is possible to create a data frame showing the total number of cases over all combinations of age, ethnicity and gender for each county:

totcases <- ddply(pennLC$data,c("county"),numcolwise(sum))
head(totcases)
##      county cases population
## 1     adams    55      91292
## 2 allegheny  1275    1281666
## 3 armstrong    49      72392
## 4    beaver   172     181412
## 5   bedford    37      49984
## 6     berks   308     373638
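If you prefer the tidyverse, the same aggregation could be done with dplyr - a sketch, assuming dplyr is installed (the dplyr:: prefix avoids name clashes with plyr):

# dplyr equivalent of the ddply call above (base R pipe, R >= 4.1)
totcases2 <- pennLC$data |>
  dplyr::group_by(county) |>
  dplyr::summarise(cases = sum(cases), population = sum(population))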

Having created a data frame of county-based lung cancer incident and population counts, the cancer rates per 10,000 population are computed. These are added as a new column to the totcases data frame:

totcases <- transform(totcases,rate=10000*cases/population)

Thus, totcases now has three columns, and is ready to provide input to the regression model - below this variable is inspected (using head) and a boxplot drawn in the figure below:

head(totcases)
##      county cases population     rate
## 1     adams    55      91292 6.024624
## 2 allegheny  1275    1281666 9.947990
## 3 armstrong    49      72392 6.768704
## 4    beaver   172     181412 9.481181
## 5   bedford    37      49984 7.402369
## 6     berks   308     373638 8.243273
# Check the distribution of rates
boxplot(totcases$rate,horizontal=TRUE,
        xlab='Cancer Rate (Cases per 10,000 Popn.)')

It is now possible to calibrate the spatial regression model. As stated earlier, the \(z_i\) variable here is related to the cancer rate, and the predictor is smoking uptake. Note that in this case an additional weighting variable is added, based on the population variable, and also that a square root transform is applied. This allows for the fact that the random variable here is actually the count of cancer cases - possibly a Poisson distributed variable - and the square root transform can stabilise the variance of Poisson count data. Since the square root rate is essentially

\[ \sqrt{\frac{\textrm{No. of Cases}}{\textrm{Population}}} \]

and population is assumed a fixed quantity - then the numerator above will have an approximately fixed variance, and be reasonably approximated by a normal distribution. Dividing this by the square root of population then makes the variance inversely proportional to the population. Hence, weighting by population is also appropriate here.
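As an aside, the variance-stabilising property of the square root can be seen via the delta method (a standard result, stated here for completeness): if a count \(X\) is Poisson distributed with mean \(\lambda\), then approximately

\[ \textrm{Var}\left(\sqrt{X}\right) \approx \left( \frac{1}{2\sqrt{\lambda}} \right)^2 \textrm{Var}(X) = \frac{\lambda}{4\lambda} = \frac{1}{4} \]

so the variance of the transformed count is roughly constant whatever the underlying rate. Taking these facts into account, the SAR model may be calibrated and assessed: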

sar.mod <- spautolm(rate ~ sqrt(penn.sf$smk), listw = penn.lw,
                    weights = population, data = totcases)
summary(sar.mod)
## 
## Call: spautolm(formula = rate ~ sqrt(penn.sf$smk), data = totcases, 
##     listw = penn.lw, weights = population)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5.56117 -1.11970 -0.33268  0.50932  5.01263 
## 
## Coefficients: 
##                   Estimate Std. Error z value  Pr(>|z|)
## (Intercept)       -0.50976    2.31802 -0.2199    0.8259
## sqrt(penn.sf$smk)  1.84445    0.47170  3.9102 9.222e-05
## 
## Lambda: 0.3335 LR test value: 4.0998 p-value: 0.042889 
## Numerical Hessian standard error of lambda: 0.15562 
## 
## Log likelihood: -124.4119 
## ML residual variance (sigma squared): 218030, (sigma: 466.94)
## Number of observations: 67 
## Number of parameters estimated: 4 
## AIC: NA (not available for weighted model)

The ‘coefficients’ section in the output may be interpreted in a similar way to a standard regression model. From this it can be seen that the rate of smoking does influence the rate of occurrence of lung cancer - or at least that there is evidence to reject a null hypothesis that it does not affect cancer rates, with \(p\)=9.22 \(\times\) 10\({}^{-5}\). The lambda section provides a \(p\)-value for the null hypothesis that \(\lambda=0\) - that is, that there is no spatial autocorrelation in the cancer rates. Here, \(p\)=0.043, so at the 5% level there is evidence to reject the null hypothesis, i.e. some spatial autocorrelation is present.

Thus, the analysis here suggests that smoking is linked to lung cancer, but that lung cancer rates are spatially autocorrelated. This is possibly because other factors that influence lung cancer (possibly age, or risk associated with occupation) are geographically clustered. Since these factors are not included in the model, information about their spatial arrangement might be inferred via nearby occurrence of lung cancer.

This simple bivariate example can be expanded with other factors simply by including them in the formula passed to the spautolm function. The coefficients can be interpreted in the usual way with the spatial component in the model described by the interpretation of lambda.
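For instance, if a further county-level covariate were available in totcases, the call might look like the sketch below (the column name some_covariate is hypothetical - it is not in the supplied data):

# hypothetical extension of the SAR model with an additional predictor
sar.mod2 <- spautolm(rate ~ sqrt(penn.sf$smk) + some_covariate,
                     listw = penn.lw, weights = population, data = totcases)
summary(sar.mod2)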

References

Anselin, Luc. 1995. “Local Indicators of Spatial Association—LISA.” Geographical Analysis 27 (2): 93–115.
———. 2019. “The Moran Scatterplot as an ESDA Tool to Assess Local Instability in Spatial Association.” In Spatial Analytical Perspectives on GIS, 111–26. Routledge.
Beecham, Roger, Nick Williams, and Alexis Comber. 2020. “Regionally-Structured Explanations Behind Area-Level Populism: An Update to Recent Ecological Analyses.” PLoS ONE 15 (3): e0229974.
Besag, Julian. 1974. “Spatial Interaction and the Statistical Analysis of Lattice Systems.” Journal of the Royal Statistical Society: Series B (Methodological) 36 (2): 192–225.
Brunsdon, Chris, and Lex Comber. 2018. An Introduction to R for Spatial Analysis and Mapping (2e). Sage.
Brunsdon, Chris, A Stewart Fotheringham, and Martin E Charlton. 1996. “Geographically Weighted Regression: A Method for Exploring Spatial Nonstationarity.” Geographical Analysis 28 (4): 281–98.
Cressie, Noel. 2015. Statistics for Spatial Data. John Wiley & Sons.
De Smith, Michael John, Michael F Goodchild, and Paul Longley. 2007. Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools. Troubador Publishing Ltd.
Getis, Arthur, and J Keith Ord. 1992. “The Analysis of Spatial Association by Use of Distance Statistics.” Geographical Analysis 24 (3): 189–206.
Moran, Patrick AP. 1950. “Notes on Continuous Stochastic Phenomena.” Biometrika 37 (1/2): 17–23.
Tobler, Waldo R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” Economic Geography 46 (sup1): 234–40.