Module 7 Spatial Correlations and Regressions

In this tutorial, we will review methods to assess spatial relationships across polygons or areas, i.e., do areas closer together have similar values? We may want to evaluate the relationship between values across areas or the relationship between multiple variables across areas.

Learning Objectives

  • How to assess if there is spatial correlation in a variable across areas (polygons) and how strong is that relationship.

  • How to model this relationship using spatial regression methods to be able to interpolate values.

  • How to evaluate relationships between outcome and predictor variables across areas.

Additional Resources Morans I

spdep package

spatialreg package

Spatial AutoRegression

SAR Model

7.1 Moran’s I: An Index of Autocorrelation

Spatial correlation is how spatial data are different from other kinds of data. There is correlation across areas, i.e., they cannot be treated as independent from each other. Methods we use for independent data are not appropriate in this case. We can assess how much spatially connected neighbors are correlated using the Moran’s I coefficient.

What is the Moran’s I coefficient?

Moran’s I is a value that lies between (-1,1) and indicates a degree of autocorrelation. A high MI 1 to -1 indicates the data are not independent. In contrast a value close to 0 indicates independence. MI is defined as

\[I = \frac{n}{\sum_i\sum_j w_{ij}} \frac{\sum_i\sum_j w_{ij} (z_i - \bar{z}) (z_j - \bar{z})}{\sum_i (z_i - \bar{z})^2}\]

where \(w_{ij}\) is (i,j) elements of a weights matrix \(\mathbf{W}\) which specifies the degree of dependence between areas \(i\) and \(j\).

In summary, we can quantitatively assess the correlation/relationship between area specific values of a variable of interest. For the rest of the lecture, we will go through the steps to calculate Moran’s I. We will use the Newhaven, CT data for our example case. The New Haven census block data contains demographic variables across census blocks including the percent Black population and the percent of renter occupied units.

7.1.1 Step 1: Define Neighbors

We use a definition in which areas are neighbors if they share a common linear boundary or edge. First install and call the spdep package in R. This package can be used to obtain the neighborhood structure and to assess spatial relationships. Use the \(ploy2nb()\) function to calculate neighbors based on common boundaries.

library(sf)
library(GISTools)
library(dplyr)
library(spdep)
library(tmap)

data(newhaven)
blocks_sf <- st_as_sf(blocks)
blocks.nb <- poly2nb(blocks_sf, queen=F)
blocks.nb
Neighbour list object:
Number of regions: 129 
Number of nonzero links: 642 
Percentage nonzero weights: 3.857941 
Average number of links: 4.976744 

The output shows there are 129 regions with 642 links or shared boundaries and the average number of neighbors per region is 4.98.

7.1.2 Step 2: Plotting the neighborhood structure

We can view the neighborhood structure using the \(nb2lines()\) functions. The inputs are the neighborhood objects \(blocks.nb\) and the geometry from the \(sf\) data frame. The blue lines identify the connected neighbors for each region (in this case census tract). It also identifies the number of neighbors for each region.

blocks_net <- nb2lines(blocks.nb, coords = blocks_sf$geometry)

tm_shape(blocks_sf) +
  tm_borders(col = "lightgrey") +
  tm_shape(blocks_net) +
  tm_lines(col = "blue", lwd=1)

7.1.3 Step 3: The W matrix of weights for each area

Use the \(nb2listw()\) function to identify the neighbors and weight matrix for each region. For example, below we input the neighborhood matrix blocks.nb into the function. The function returns a list where each element in the list is specifications for a given region. Looking at census tract 1, we see it’s neighbors are tracts 2 and 7. Additionally, we see the weights for each neighbor is 0.5 meaning they are split equally.

Commonly, we set \(w_{ij}=1\) if areas \(i\) and \(j\) have connecting boundaries, and \(w_{ij}=0\) if they do not connect. But they can also be split equally across areas as long as \(\sum_i w_{ij}=1\).

Can you get the neighbors and weights for census tract 2?

blocks_lw <- nb2listw(blocks.nb)
blocks_lw$neighbours[[1]]
[1] 2 7
blocks_lw$weights[[1]]
[1] 0.5 0.5
blocks_lw$neighbours[[2]]
[1]  1  3  4  7 26 28
blocks_lw$weights[[2]]
[1] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667

7.1.4 Step 4: Calculate Moran’s I

Using the list object you calculated above, you can calculate Moran’s I using the \(moran.test()\) function shown in the code below. The first input is the original \(sf\) dataframe and we use the “$P_RENTROCC” to identify the variable we want to assess. In our application, we will assess the spatial relationship of the percent of renter occupied units across census tracts. Our hypothesis is that areas closer together have similar values in the density of renter occupied units. The second input is the list object calculated above to specify the neighborhood matrix.

moran.test(blocks_sf$P_RENTROCC, blocks_lw)

    Moran I test under randomisation

data:  blocks_sf$P_RENTROCC  
weights: blocks_lw    

Moran I statistic standard deviate = 7.9159, p-value = 1.228e-15
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.433504542      -0.007812500       0.003108165 

This gives a MI statistics of 0.434. Does this mean the MI value is sufficiently high to suggest that an auto-correlated process is appropriate versus considering the observations independent?

The output gives a p-value \(<0.001\) indicating a high degree of significance, i.e., the null hypothesis that data are independent has been disproved.

7.2 Spatial Regression Models

Moran’s I is a measure of spatial auto-correlation. What it does not do!

  • Model the underlying spatial process
  • Give us a way to interpolate values across areas.

Spatial Autoregression

A spatial method used to describe the spatial relationship between dependent and independent variables.

Spatial autoregression models regress values for any given polygon on the values of their neighbors. We are going to look at a simultaneous autoregressive model (SAR). \[z_i = \mu + \sum_{j=1}^n b_{ij} (z_j - \mu) + \epsilon_i\]

In this model,

\(\mu\) represents the overall mean (average) of \(z's\) across all areas.

The second term gives a weighted mean of \(z's\) from your neighbors, i.e., it will pull the values up or down from the overall mean based on your neighboring values. \(b_{ij}\) are constants where \(b_{ij} = 0\) if the areas are no neighbors, and \(b_{ij} = \lambda \cdot w_{ij}\) where \(\lambda\) is a parameter specifying the degree of spatial dependence. When \(\lambda = 0\) there is no spatial dependence. If \(\lambda>0\) then there is positive correlation, and conversely, if \(\lambda<0\) there is negative correlation.

\(\epsilon_i\) is the “noise” or variance of the distribution.

Below we perform spatial auto-regression of the renter occupied units across census tracts. First call the (spatialreg package)[https://cran.r-project.org/web/packages/spatialreg/spatialreg.pdf]. Then use the \(spautolm()\) function. We specify the regression formula to use percent renter occupied units (P_RENTROCC) and regression this value for area \(i\) on all other areas \(-i\).

library(spatialreg)
sar_reg <- spautolm(formula = P_RENTROCC ~ 1, listw = blocks_lw, family = "SAR", data = blocks_sf)

sar_reg

Call:
spautolm(formula = P_RENTROCC ~ 1, data = blocks_sf, listw = blocks_lw, 
    family = "SAR")

Coefficients:
(Intercept)      lambda 
 58.6585502   0.6516197 

Log likelihood: -549.5625 

From this it can be seen that \(\lambda = 0.65\) and \(\mu = 58.66\). The estimate of the intercept can be interpreted as the overall mean of percent renter occupied units across all census tracts. While the estimate of \(\lambda\) captures the correlation across areas in percent renter occupied units. It is a bit harder to find whether this correlation is significant.

To assess whether this correlation is significant, the below code calculates the standard error of \(\lambda\). This is reported using \(lambda.se\). An approximate 5% confidence interval can be found in the standard way by finding a band given by the estimate of \(\lambda\) plus or minus two times the standard error. If zero is included within this band, then \(\lambda\) is not significantly different from 0. Otherwise, if zero is not included, they \(\lambda = 0\) is highly unlikely.

sar_reg$lambda.se
[1] 0.07772932
sar_reg$lambda + c(-2,2) * sar_reg$lambda.se
[1] 0.4961610 0.8070783

7.2.1 Models with Predictor Variables:

In this case, we want to account for spatial autocorrelation AND for the impact of other predictor variables on our outcome \(z\). We extend the regression model above to include the predictor \(P_i\) within the regression equation.

\[z_i = a_o + a_1P_i + \sum_{j=1}^n b_{ij} (z_j - a_0 - a_1P_i) + \epsilon_i\]

  • \(a_0\) represents the overall mean (average) of \(z's\) across all areas (intercept).

  • \(a_1\) represents the predictor coefficient and captures that change in outcome \(z\) for each unit change in your predictor \(P\) (Slope).

  • \(\epsilon_i\) is the “noise” or variance of the distribution like above.

  • Again the second term is a weighted mean of neighboring values.

  • \(P_i\): the predictor value for area \(i\).

For your purposes we want to assess the relationship/association between the density of the Black population and the percent renter occupied units in each census tract. In other words, we are regressing the Black population density on the percent renter occupied units to assess how much impact the predictor variable has on our outcome of interest.

sar_mod <- spautolm(formula = P_BLACK ~ P_RENTROCC, listw = blocks_lw, family = "SAR", data = blocks_sf)

summary(sar_mod)
## 
## Call: 
## spautolm(formula = P_BLACK ~ P_RENTROCC, data = blocks_sf, listw = blocks_lw, 
##     family = "SAR")
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -45.5444 -10.8738  -2.2969  10.1044  51.7011 
## 
## Coefficients: 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept)  9.272209  10.697035  0.8668    0.3861
## P_RENTROCC   0.419308   0.087585  4.7875 1.689e-06
## 
## Lambda: 0.85432 LR test value: 119.88 p-value: < 2.22e-16 
## Numerical Hessian standard error of lambda: 0.045901 
## 
## Log likelihood: -553.1096 
## ML residual variance (sigma squared): 245.29, (sigma: 15.662)
## Number of observations: 129 
## Number of parameters estimated: 4 
## AIC: 1114.2

What can we say about the relationship between percent of renter occupied units and the Black population density?

The “coefficients” section in the output may be interpreted in a similar way to a standard regression. The intercept captures the overall average proportion Black population. For every one percent increase in the renter occupied units, there is a 41.9 percent increase in the Black proportion of the population. The p-value for the predictor variable is \(<0.001\) showing a significant relationship between predictor and outcome. Lastly, in this output we have the estimated \(\lambda = 0.85\) and the associated p-value \(<0.001\).

In summary, we have learned about spatial auto-correlation and spatial regression. We have assessed spatial relationship across regions in regards to outcomes of interest. We have been able to use these methods to show significant spatial patterns and a lack of independence across census tracts in New Haven.

Lab 7 Activity No specific lab for this lecture. You will work on finalizing your final report due Dec 6. The spatial analyses of KDE (lecture 6) and Spatial autocorrelation (lecture 7) should be included in your report.

Your report must contain the following sections:

  1. Introduction: Your introduction to describe the big picture public health question of your project. State to the reader why this is an important problem to address. Give background on the topic.

  2. Data Description: Describe the data sources used and what consistutes your polygon, points and lines data

  3. Methods: Describe the GIS and spatial analysis methods you will use (please do not list function names). For example, if you use a spatial regression, describe/state the method used is spatial regression and the outcome and the predictor variables.

  4. Results: Show the maps you believe gives a detailed picture of the public health problem. You must choose across all the maps you have made which ones give the most relevant details. Include spatial analyses results from KDE, Moran’s I and spatial regression. Interpret the results from these analyses.

  5. Conclusion: Give a summary of your results and the take-away messages for public health professionals.

  6. Citations

Important note: Write your report for an audience who does not know spatial analysis and GIS. There is no page requirements for this report.