Chapter 4 Regression

Overview

The main part of this practical shows how create models for understanding and for prediction. Confusingly these are both referred to as inference in the literature and so later in the worksheet there is small section distinguishing between these.

After this introduction the practical has the following parts:

standard linear regression, or ordinary least squares (OLS) regression.
generalised linear models (GLMs) with binomial (or logistic) regression
generalised linear models (GLMs) with Poisson regression.
the influence of outliers and robust regression

An understanding these is important as it underpins the ideas behind spatial regressions and spatial models covered in later sessions.

An R script with all the code for this practical is provided.

There is an excellent, recent and in-depth overview here which you may find useful to look at after the practical: https://mlu-explain.github.io/linear-regression/

Which regression approach?

So when do we use the different regression approaches? The key consideration relates to the distribution of the response variable. The response variable is the thing we are trying to make a model of, the $y$. The target or predictor variables, the $x$’s are what we use to model $y$.

Standard OLS Regression Standard OLS (ordinary least squares) regression is used when the response variable is normally distributed. Typically it is a continuous numeric variable that is expected to have a linear relationship with the predictor variables. That is, the response variable increases or decreases linearly with changes in the predictor variables. OLS seeks to determine coefficients for each predictor variable that describe how changes in the predictor variable are associated with changes in the target variable.

GLM Logistic or Binomial regression A Binomial distribution is one with 2 peaks at each end of the distribution. It is typically found when the response variable is binary or dichotomous - a Yes-No or True-False response. Rather than determine coefficients estimates that show how response variable change with predictors, in logistic regression they estimate provide an estimate of the probability one of the two responses using a GLM (generalised linear model).

GLM Poisson regression A Poisson distribution is one that is skewed to one end of the distribution or another. Count variables tend to bunch in this way, but percentages can as well. These are numeric but typically have integer or whole number values (1, 2, 10, 345, etc). Examples include counts of people with a degree in population survey. Variables with a Poisson distribution typically have an equal mean and variance. Similar to logistic regression, the Poisson GLM estimates probabilities of the observing the count.

Robust Regression Outliers are observation in the data that have different characteristics to other observations. They can weaken models and affect the regression coefficient estimates. They can be ignored (the data are the data after all!), removed and a standard regression applied, or their influence in the model can be down weighted using what is called a robust regression

Packages

You will need to load the following packages, testing for the presence on your R installation as before (although in a different way!):

packages <- c("sf", "tidyverse","tmap", "readxl", "rgl", "car", "MASS", "spdep", "readxl", "cowplot")
# check which packages are not installed
not_installed <- packages[!packages %in% installed.packages()[, "Package"]]
# install missing packages
if (length(not_installed) > 1) {
  install.packages(not_installed, repos = "https://cran.rstudio.com/", dep = T)
}
# load the packages
library(sf)         # for spatial data
library(tidyverse)  # for data wrangling & mapping (ggplot2)
library(readxl)     # for loading Excel data
library(rgl)        # for 3d plotting 
## NB If you have a MAc you will need to install XQuartz for the rgl package 
# - see https://www.xquartz.org
library(car)        # for some stats tools
library(MASS)       # for robust regression
library(readxl)     # for loading remote Excel data
library(cowplot)    # for combining plots

4.1 Standard linear regression models (OLS)

Regression models are the standard method for constructing predictive and explanatory models. They tell us how changes in one variable (the target variable or independent variable, $y$) are associated with changes in explanatory variables, or dependent variables, $x_1$, $x_2$, etc. They underpin classic statistical inference and most of the approaches that are commonly referred to as machine learning and AI - these are really just fancy flavours of regression.

Classic linear regression is referred to Ordinary least squares (OLS) regression because they estimate the relationship between one or more independent variables and a dependent variable, $y$ using a hyperplane (i.e. a multi-dimensional line - more on this later!) that minimises the sum of the squared difference between the observed values of $y$ and the values predicted by the model ($\hat{y\ }$, called ‘$y$-hat’).

What we are looking for in the plots and table above are correlations that look like straight lines, showing that values in one variable change linearly with changes in another. The correlations between the variables indicate that a number of the variables are highly correlated (< -0.7 or > +0.7) with each other and with the response variable MedInc, and others less so. This provides us with confidence that we can construct a reasonable model from this data - at least some of the variables are correlated with MedInc.

We will start with a hypothesized model of Median Income in the state of Georgia at a county level, that is associated with rurality (PctRurual), whether people went to university (PctBach), how old the population is (PctEld), the number of people that are foreign born (PctFB), the percentage of the population classed as being in poverty (PctPov) and as being black (PctBlack) (NB I recognise that there are lots of issues with these labels but that is another module / session).

In R we can construct a regression model using the function lm which stands for linear model. This is a standard OLS regression function, and although simple it is very powerful. There are many variants and functions for manipulating lm outputs: remember R was originally a stats package.

The equation for a basic linear regression is:

\[\begin{equation} y_i = \beta_{0} + \sum_{k=1}^{m}\beta_{k}x_{ik} + \epsilon_i \label{eq:lm2} \end{equation}\]

where for observations indexed by $i=1 \dots n$, $y_i$ is the response variable, $x_{ik}$ is the value of the $k^{th}$ predictor variable, $m$ is the number of predictor variables, $\beta_0$ is the intercept term, $\beta_k$ is the regression coefficient for the $k_{th}$ predictor variable and $\epsilon_i$ is the random error term that is independently normally distributed with zero mean and common variance $\sigma^2$. OLS is commonly used for model estimation in linear regression models.

You will need to load the georgia data. This is a commonly used dataset in R (for example, it is contained within the spgwr, GWmodel and GISTools packages and is now being used in the gwverse package (A. Comber et al. 2022). It has census data for the counties in Georgia, USA.

download.file("https://github.com/gwverse/gw/blob/main/data/georgia.rda?raw=True",
              "./georgia.RData", mode = "wb")
load("georgia.RData")

You should examine this in the usual way:

# spatial data
georgia
# sumamry of variables
summary(georgia)
# a map of the counties 
ggplot(georgia) + geom_sf(fill = NA)

And have a look at the target variable:

ggplot(georgia) + geom_sf(aes(fill = MedInc)) +
  scale_fill_distiller(palette = "Blues") +
  theme_minimal()

Figure 4.1: A map of the Median Income variable.

You will see that georgia contains a number of variables for the counties including the percentage of the population in each County that:

is Rural (PctRural)
have a college degree (PctBach)
are elderly (PctEld)
that are foreign born (PctFB)
that are classed as being in poverty (PctPov)
that are black (PctBlack)

and the median income of the county (MedInc in dollars)

And as we have done some EDA now, you might want to explore the properties of the attributes in georgia layer. Note we will be creating models of median income (MedInc). Do these summaries tell us anything useful? (Answer: yes they do!) The code below plots histograms and paired scatterplots of the variables. Note the use of dplyr::select below to force R to use the select function in dplyr. This is because other functions also called select may have been loaded by other packages.

## variable histograms
# set the pipe and drop the sf geometry
georgia |> st_drop_geometry() |> 
  # select the numerical variables of interest
  dplyr::select(ID, TotPop90:PctBlack, MedInc) |>
  # convert to long format
  pivot_longer(-ID) |>
  # pass to a faceted ggplot 
  ggplot(aes(x = value)) + geom_histogram(bins = 30, col = "red", fill = "salmon") + 
  facet_wrap(~name, scales = "free", ncol = 4)
# correlations
cor.tab = georgia %>% st_drop_geometry() %>% 
  dplyr::select(MedInc, TotPop90:PctBlack) %>% cor() 
round(cor.tab, 3)
# paired scatterplots
plot(st_drop_geometry(georgia[, c(14, 3:8)]), 
     pch = 1, cex = 0.7, col = "grey30", upper.panel=panel.smooth)

In our case, the response variable $y$ is median income (MedInc), and the 6 (i.e. $k$) socio-economic variables are PctRural, PctBach, PctEld, PctFB, PctPov, PctBlack are the different predictor variables, the $k$ different $x$’s.

Specifying the model

The basic formula in R to specify a linear regression model, lm is as follows (don’t run this - just look at it!):

# don't run this! 
model = lm(variable_we_are_trying_to_predict ~ predictor_variable1 +
              predictor_variable2...)

where the ~ sign (tilde) separates the variable we are trying to model (the response variable) from the variables we will use to predict it, the predictor variables. The predictor variables are usually separated by a + sign - meaning that they are linear terms in the model added together.

We can specify our first model.

m = lm(formula = MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = georgia)

The summary function here provides a print out of the model:

summary(m)

## 
## Call:
## lm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + PctPov + 
##     PctBlack, data = georgia)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12420.3  -2989.7   -616.3   2209.5  25820.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52598.95    3108.93  16.919  < 2e-16 ***
## PctRural       73.77      20.43   3.611 0.000414 ***
## PctBach       697.26     112.21   6.214 4.73e-09 ***
## PctEld       -788.62     179.79  -4.386 2.14e-05 ***
## PctFB       -1290.30     473.88  -2.723 0.007229 ** 
## PctPov       -954.00     104.59  -9.121 4.19e-16 ***
## PctBlack       33.13      37.17   0.891 0.374140    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5155 on 152 degrees of freedom
## Multiple R-squared:  0.7685, Adjusted R-squared:  0.7593 
## F-statistic: 84.09 on 6 and 152 DF,  p-value: < 2.2e-16

There are lots of things to look at in the summary of the model but we will focus on 3 aspects:

First, Significance. If we examine the model we can see that nearly all of the dependent variables are significant predictors of Median Income. The level of significance is indicated by the p-value in the column $Pr(>|t|)$ and the asterisks $*$. For each coefficient $\beta_k$ these are the significance levels associated with the NULL hypothesis $\textsf{H}_0 : \beta_k = 0$ against the alternative $\textsf{H}_1 : \beta_k \ne 0$. This corresponds to the probability that one would obtain a $t$-statistic for $\beta_k$ at least as large as the one observed given $\textsf{H}_0$ is true.

The p-value provides a measure of how surprising the relationship is and indicative of whether the relationship between $x_k$ and $y$ found in this data could have been found by chance. Very small p-values suggest that the level of association found here could not have come from a random sample of data. Cutting to the chase, small p-values mean strong evidence for a non-zero regression coefficient.

So in this case we can say:

‘changes in PctFB are significantly associated with changes in MedInc at the $0.001$ level’. This means that the association between the predictor and the response variables is not one that would be found by chance in a series of random samples $99.999\%$ of the time. Formally we can reject the Null hypothesis that there is no relationship between changes in MedInc and PctFB
‘changes in PctRural, PctBach, PctEld, and PctPov are significantly associated with changes in MedInc at the at the $<0.001$ level’, which is actually an indicator of less than $0.001$. So were are even more confident that the observed relationship between these variables and MedInc are not by chance and can again confidently reject the Null hypotheses.
PctBlack was not found to be significantly associated with MedInc.

Second, the Coefficient Estimates. These are the $\beta_{k}$ values that tell you the rate of change between $x_k$ and $y$ in the column ‘Estimate’ in the summary table. The first of these is the Intercept. This is $\beta_0$ in the regression equation and tells you the value of $y$ if $x_1 \cdots x_k$ are all zero.

For the dependent or predictor variables, $x_k$, we can now see how changes of 1 unit for the different predictor variables are associated with changes in MedInc:

The association of PctRural is positive but small: each increase in 1% of PctRural is associated with an increase of $73.7 in median income;
The association of PctBach is positive, with each additional 1% of the population with a bachelors degree associated with an in increase of $697;
each increase of 1% of PctEld is associated with a decrease in median income of $789;
each increase of 1% of PctFB is associated with a decrease in median income of $1290;
each increase of 1% of PctPov is associated with a decrease in median income of $954;

So the regression model m gives us an idea of how different factors are associated with the target variable for Median Income in the counties of the state of Georgia and their interaction. Note the way that this is described: we have no indication of causality between the factors and Median Income, and so we cannot say that changes in coefficient estimates would explain changes in Median Income, rather they are associated with Median Income changes. This syntax is important - these are statistical associations.

Third, the model fit given by the $R^2$ value. This provides a measure of model fit. This is calculated as the difference between the actual values of $y$ (e.g. of MedInc) and the values predicted by the model - sometimes written as $\hat{y\ }$ (called ‘$y$-hat’). A good description is here http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit

Generally $R^2$’s of greater than 0.7 indicate a reasonably well fitting model. Some users of this method describe a low $R^2$ as ‘bad’ - they are not - it just means you a have a more imprecise model - the predictions are a bit wrong-er! Essentially there is a sliding scale of reliability of the predictions - based on the relative amounts of deterministic and random variability in the system being analysed - and $R^2$ is a measure of this.

Task 1

Create an OLS model with the PctEld variable as the target variable and PctRural, PctPov, PctBach and PctFB as the target variables. Interpret the model outputs including the coefficient estimates, significance and model fit.

4.2 Binomial Logistic Regression for binary variables

Generalized Linear Models (GLMs) allow regression models to be specified in situations where the response and / or the predictor variables do not have a normal distribution. An example is where the response is a True-False, Yes-No or 1-0. The practical replicates some of the results from A. J. Comber, Brunsdon, and Radburn (2011), evaluating geographic and perceived access to health facilities. It develops an analysis of a binary response variable describing perceptions of difficulty in GP access in Leicestershire using data collected via a survey.

You should load some new data :

data.pt = st_read("place_survey_pts.gpkg", quiet = T)

Have a look at this data and its attributes in the usual way:

# spatial data 
data.pt

## Simple feature collection with 8530 features and 20 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 428996.3 ymin: 278160.6 xmax: 488368.7 ymax: 339447.7
## Projected CRS: OSGB36 / British National Grid
## First 10 features:
##         REF1               GP                    Dentist          Chemist
## 1  705030002      Fairly easy                Fairly easy      Fairly easy
## 2  705030004      Fairly easy Neither easy nor difficult      Fairly easy
## 3  705030005      Fairly easy                Fairly easy      Fairly easy
## 4  705030890        Very easy                  Very easy        Very easy
## 5  705030891   Very difficult             Very difficult   Very difficult
## 6  705030894 Fairly difficult           Fairly difficult Fairly difficult
## 7  705030895      Fairly easy                Fairly easy      Fairly easy
## 8  705030899      Fairly easy                Fairly easy      Fairly easy
## 9  705030095        Very easy                  Very easy        Very easy
## 10 705030101        Very easy                  Very easy        Very easy
##                     LocalHosp    Health     Illness IMD2007ALL      Hosp
## 1                 Fairly easy      Good Do not have  13.506194  9115.503
## 2  Neither easy nor difficult      Good Do not have  13.506194  8227.506
## 3                 Fairly easy Very good Do not have  13.506194  8401.373
## 4  Neither easy nor difficult Very good Do not have   7.728222 13995.208
## 5              Very difficult Very good Do not have   7.728222 14055.528
## 6            Fairly difficult      Good Do not have   7.728222 13905.840
## 7                 Fairly easy Very good Do not have   7.728222 14877.166
## 8  Neither easy nor difficult Very good Do not have   7.728222 15033.763
## 9  Neither easy nor difficult      Good Do not have   5.714707 13837.316
## 10           Fairly difficult      Good Do not have  11.839684 13271.580
##          AE       Doc      label  popeast popnorth                       rural
## 1  13763.71 4097.7625 31UEFU0010 440376.8 300305.5 Hamlet & Isolated Dwellings
## 2  13280.71 3384.7649 31UEFU0010 440376.8 300305.5 Hamlet & Isolated Dwellings
## 3  13454.58 3558.6326 31UEFU0010 440376.8 300305.5 Hamlet & Isolated Dwellings
## 4  18111.41 3135.1070 31UEGA0002 439700.0 305037.9                     Village
## 5  18171.73 3074.7871 31UEGA0002 439700.0 305037.9                     Village
## 6  18345.04 2723.0989 31UEGA0002 439700.0 305037.9                     Village
## 7  19316.37 3694.4257 31UEGA0011 439571.9 306769.6                     Village
## 8  19472.97 3851.0225 31UEGA0011 439571.9 306769.6                     Village
## 9  17467.90 1188.2823 31UEFW0006 443344.2 305529.1 Hamlet & Isolated Dwellings
## 10 18243.05  413.1231 31UEFW0004 442862.8 305615.2             Town and Fringe
##               oac    REF1_1 cars coords_x1 coords_x2                      geom
## 1     Countryside 705030002    4  440376.8  300305.5 POINT (440376.8 300305.5)
## 2     Countryside 705030004    2  440376.8  300305.5 POINT (440376.8 300305.5)
## 3     Countryside 705030005    2  440376.8  300305.5 POINT (440376.8 300305.5)
## 4     Countryside 705030890    2  439700.0  305037.9   POINT (439700 305037.9)
## 5     Countryside 705030891    3  439700.0  305037.9   POINT (439700 305037.9)
## 6     Countryside 705030894    1  439700.0  305037.9   POINT (439700 305037.9)
## 7     Countryside 705030895    2  439571.9  306769.6 POINT (439571.9 306769.6)
## 8     Countryside 705030899    3  439571.9  306769.6 POINT (439571.9 306769.6)
## 9     Countryside 705030095    1  443344.2  305529.1 POINT (443344.2 305529.1)
## 10 Typical Traits 705030101    1  442862.8  305615.2 POINT (442862.8 305615.2)

# summary
summary(data.pt)

##       REF1                GP              Dentist            Chemist         
##  Min.   :705000001   Length:8530        Length:8530        Length:8530       
##  1st Qu.:705012238   Class :character   Class :character   Class :character  
##  Median :705031305   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :705030817                                                           
##  3rd Qu.:705050463                                                           
##  Max.   :705063148                                                           
##   LocalHosp            Health            Illness            IMD2007ALL    
##  Length:8530        Length:8530        Length:8530        Min.   : 0.000  
##  Class :character   Class :character   Class :character   1st Qu.: 5.795  
##  Mode  :character   Mode  :character   Mode  :character   Median : 8.548  
##                                                           Mean   :10.506  
##                                                           3rd Qu.:13.422  
##                                                           Max.   :50.983  
##       Hosp                AE             Doc          label          
##  Min.   :   35.31   Min.   : 3115   Min.   :   0   Length:8530       
##  1st Qu.: 4754.64   1st Qu.: 7829   1st Qu.: 658   Class :character  
##  Median : 7640.60   Median :13435   Median :1088   Mode  :character  
##  Mean   : 9482.88   Mean   :13803   Mean   :1445                     
##  3rd Qu.:12898.21   3rd Qu.:19023   3rd Qu.:1709                     
##  Max.   :26891.84   Max.   :28811   Max.   :8526                     
##     popeast          popnorth         rural               oac           
##  Min.   :428996   Min.   :278161   Length:8530        Length:8530       
##  1st Qu.:446560   1st Qu.:297519   Class :character   Class :character  
##  Median :455275   Median :304299   Mode  :character   Mode  :character  
##  Mean   :456315   Mean   :306403                                        
##  3rd Qu.:463235   3rd Qu.:316888                                        
##  Max.   :488369   Max.   :339448                                        
##      REF1_1               cars         coords_x1        coords_x2     
##  Min.   :705000001   Min.   :0.000   Min.   :428996   Min.   :278161  
##  1st Qu.:705012238   1st Qu.:1.000   1st Qu.:446560   1st Qu.:297519  
##  Median :705031305   Median :1.000   Median :455275   Median :304299  
##  Mean   :705030817   Mean   :1.359   Mean   :456315   Mean   :306403  
##  3rd Qu.:705050463   3rd Qu.:2.000   3rd Qu.:463235   3rd Qu.:316888  
##  Max.   :705063148   Max.   :4.000   Max.   :488369   Max.   :339448  
##             geom     
##  POINT        :8530  
##  epsg:27700   :   0  
##  +proj=tmer...:   0  
##                      
##                      
##

# map of locations
ggplot(data.pt) + 
  geom_sf() + theme_minimal()

Figure 4.2: A map of the observation locations.

You should see that there are 8530 data points, distributed around Leicestershire and that there are a number of attributes of potential interest.

What we are interested in this analysis is the degree to which different factors can examine respondent perceptions of their difficulty in accessing GP surgeries. The following variables will be used in this analysis:

GP: perceptions of difficulty in their access to GPs;
Health: their declared health status;
Illness: whether they have a limiting long term illness;
Doc: distance from the respondent’s home postcode to the nearest GP surgery - this was calculated after the survey;
cars: the number of cars they own, specifically we can use this to determine non-car ownership.

A bit of pre-processing is needed to get these into a format that is suitable for analysis. First the data is subsetted to include the variables of interest.

data.pt <- data.pt |> dplyr::select(GP, Health, Illness, Doc, cars)

The text descriptions for GP, Health, Illness and cars need to be converted to binary variables and the piping syntax with transmute can be used to wrangle our data into a suitable format:

# convert the data 
data.pt |> 
  transmute(
    DocDiff = (GP == "Fairly difficult" | GP == "Very difficult")+0,
    BadHealth = (Health == "Bad" | Health ==    "Very bad")+0,
    Illness = (Illness == "Do have")+0,
    NoCar = (cars == 0)+0,
    DocDist = Doc/1000) -> data.pt

Note that the code snippet above has converted the logical TRUE and FALSE values to 1s and 0s by adding the +0 to the end of the logical tests, and that distance has been converted to km.

data.pt

## Simple feature collection with 8530 features and 5 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 428996.3 ymin: 278160.6 xmax: 488368.7 ymax: 339447.7
## Projected CRS: OSGB36 / British National Grid
## First 10 features:
##    DocDiff BadHealth Illness NoCar   DocDist                      geom
## 1        0         0       0     0 4.0977625 POINT (440376.8 300305.5)
## 2        0         0       0     0 3.3847649 POINT (440376.8 300305.5)
## 3        0         0       0     0 3.5586326 POINT (440376.8 300305.5)
## 4        0         0       0     0 3.1351070   POINT (439700 305037.9)
## 5        1         0       0     0 3.0747871   POINT (439700 305037.9)
## 6        1         0       0     0 2.7230989   POINT (439700 305037.9)
## 7        0         0       0     0 3.6944257 POINT (439571.9 306769.6)
## 8        0         0       0     0 3.8510225 POINT (439571.9 306769.6)
## 9        0         0       0     0 1.1882823 POINT (443344.2 305529.1)
## 10       0         0       0     0 0.4131231 POINT (442862.8 305615.2)

Logistic regression should be used when modelling a binary outcome from a set of continuous and / or binary predictor variables. We are now in a position to create a logistic regression model and have a look at the results. Note the specification of the family as binomial with a logit link in the code below.

#create the model
m.glm = glm(formula = DocDiff~Illness + BadHealth + DocDist + NoCar, 
            data = data.pt, 
            family= binomial(link = "logit"))
# inspect the results
summary(m.glm)

## 
## Call:
## glm(formula = DocDiff ~ Illness + BadHealth + DocDist + NoCar, 
##     family = binomial(link = "logit"), data = data.pt)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.10048    0.10373 -39.530  < 2e-16 ***
## Illness      0.58943    0.11042   5.338 9.39e-08 ***
## BadHealth    0.52257    0.17616   2.966  0.00301 ** 
## DocDist      0.29524    0.02988   9.879  < 2e-16 ***
## NoCar        1.33701    0.11031  12.121  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3318.6  on 8529  degrees of freedom
## Residual deviance: 3039.5  on 8525  degrees of freedom
## AIC: 3049.5
## 
## Number of Fisher Scoring iterations: 6

This shows that all of the inputs are significant predictors of experiencing difficulty over GP access. This model is of the form:

\[\begin{equation} pr(y_1 = 1) = logit(\beta_0 + \beta_{1}x_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}+\beta_{4}x_{4}) \label{eq:eglm} \end{equation}\]

where the logit function is defined as

\[\begin{equation} logit(Q) = \frac{exp(Q)}{1 + exp(Q)} \end{equation}\]

and $y_1$ is a 0/1 indicator showing whether the respondent expressed difficulty in their access to GPs, $x_1$ is an 0/1 indicator variable showing whether the respondent said they had a Long Term Illness, $x_2$ is an indicator variable stating whether the respondent considered they were in Bad Health, $x_3$ is the distance from the respondent to their nearest GP surgery based on road network distance and $x_4$ is a 0/1 indication of car ownership.

So what does the model say? We have to use the exponential or antilog of the coefficient estimates using the exp function to generate probabilities (odds ratios). The confidence intervals that show the range of plausible values (i.e. within a 95% probability of not being found by chance), in absence of a model fit statistic.

# odds ratios
exp(coef(m.glm))

## (Intercept)     Illness   BadHealth     DocDist       NoCar 
##  0.01656477  1.80295494  1.68634808  1.34344909  3.80765553

# confidence intervals
exp(confint(m.glm, level = 0.95))

## Waiting for profiling to be done...

##                  2.5 %    97.5 %
## (Intercept) 0.01346685 0.0202271
## Illness     1.45106558 2.2376869
## BadHealth   1.18425653 2.3651566
## DocDist     1.26583777 1.4233458
## NoCar       3.06412578 4.7230562

What these show is that the probability of respondents experiencing difficulty in accessing their GP …

is 1.8 times more likely for those with Long Term Illness than for those without, with a CI or likely range of between 1.5 and 2.2 times;
is 3.8 times more likely for those without a car than for those who have one, with a CI of between 3.1 and 4.7 times;
is 1.7 times more likely for respondents with Bad Health than for those without, with a CI of between 1.2 and 2.4;
increases by 34% (1.34) for each extra km to the nearest GP surgery, with a CI of between 27% and 42% .

The paper by A. J. Comber, Brunsdon, and Radburn (2011) has very slightly different results due to some data cleaning differences. It also shows how these odds ratio (likelihoods) vary spatially and demonstrates that distance does not work in this way for hospitals (one of the main aims of the paper was to unpick access and accessibility in the health and geography literature).

Task 2

Create and interpret a binomial GLM with the target variable of LocalHosp (difficulty in access to a local hospital) using the following target variables:

Health: their declared health status;
Illness: whether they have a limiting long term illness;
Hosp: distance from the respondent’s home postcode to the nearest GP surgery - this was calculated after the survey;
cars: the number of cars they own, specifically we can use this to determine non-car ownership.

A bit of pre-processing is needed to get these into a format that is suitable for analysis. First the data reloaded and then subsetted to include the variables of interest.

data.pt = st_read("place_survey_pts.gpkg", quiet = T)
data.pt <- data.pt %>% dplyr::select(LocalHosp, Health, Illness, Hosp, cars)

Again the text descriptions need to be converted to binary variables:

# convert the data 
data.pt %>% 
  transmute(
    HospDiff = (LocalHosp == "Fairly difficult" | LocalHosp == "Very difficult") + 0,
    BadHealth = (Health == "Bad" | Health == "Very bad") + 0,
    Illness = (Illness == "Do have") + 0,
    NoCar = (cars == 0) + 0,
    HospDist = Hosp/1000) -> data.pt

NB You are advised to attempt the tasks after the timetabled practical session. Worked answers are provided in the last section of the worksheet.

4.3 Poisson Regression for count variables

An important assumption in OLS is that the $y_i$ variable is assumed to have a normal distribution, whose mean is determined by the predictor variables (as $\beta_0 + \sum_k \beta_kx_{ik}$) and whose variance is $\sigma^2$. For many data, this is not the case: counts, binary responses (Yes/No, True/False, etc).

Suppose the value of $y_i$ was a count of some phenomenon - for example the number of crimes committed in a month, or the number of houses sold (rather than the price). Then, the normal distribution is not a sensible model because:

$y_i$ can only take integer values
$y_i$ takes a minimum value of zero

or, more bluntly, $y_i$ is not normally distributed! Typically count data are modelled by the Poisson distribution:

\[ \textsf{Prob}(y_i=j) = \frac{e^{-\lambda}\lambda^j}{j!} \] where $\lambda$ is the mean value of $y_i$. There is a constraint that $\lambda > 0$. Here $\lambda$ is a constant parameter, but if this depended on some predictor variables, we would have a prediction model similar to OLS, but with $y_i$ as a Poisson distributed count variable rather than a normally distributed continuous one. In principal, one could specify the model so that $\lambda_i$, the mean value for observation $i$, is given by $\lambda_i = \beta_0 + \sum_k\beta_kx_{ij}$ however, this could lead to problems if a negative value of $\lambda_i$ were predicted. One way of overcoming this is to specify a log-transformed relationship:

\[ \log(\lambda_i) = \beta_0 + \sum_{k=1}^{k=m} \beta_k x_{ik} \]

which is the standard form for Poisson regression.

Taking antilogs on both sides, the effect on the expected value is a multiplicative one:

\[ \lambda_i = e^{\beta_0} \times e^{\beta_1 x_{i1}} \times \cdots e^{\beta_k x_{ik}} \]

so that the expected number of observations changes by a scale factor (i.e. a ratio) for a unit change in a predictor variable. In particular for a coefficient $\beta_k$, the factor is $e^{\beta_k}$ - this is useful in interpreting the model.

A final - and extra - feature found in some Poisson regression model is the idea of an offset. For many geographical count-based regressions, there is a ‘population at risk’ - for example the total population of an area when counting people having a particular illness. In this case an expected count is essentially a rate of occurrence multiplied by the population at risk. The model can be written as

\[ \lambda_i = P_i \times e^{\beta_0} \times e^{\beta_1 x_{i1}} \times \cdots e^{\beta_k x_{ik}} \] where $P_i$ is the population at risk. Writing $P_i$ as $e^{\log(P_i)}$ this is really just another predictor term (in $\log(P_i)$) but with the coefficient being fixed at the value 1, rather than being calibrated. Such a term can be specified in R - and is referred to as an offset term.

We can download some crime data for Georgia from here https://georgiadata.org/data/data-tables and then read into our session using the readxl package

library(readxl)
crime = read_excel("georgia.xlsx", sheet = "index_crime")

You can examine what is loaded

crime
names(crime)

Here, we will use the 2019 Total Index Crimes Reported variable which you can examine the distribution of after converting to integer (counts) and linking to the georgia layer:

# reformat crime 
# notice the renaming with mutate and the selection of the variables
crime <- 
  crime |> 
  mutate(County = tolower(County),
         Crime = as.integer(`2020 Reported Violent Crime Rate per 100,000 Population`)) |>
  dplyr::select(County, Crime)
# link to georgia 
georgia <- 
  georgia |> 
  mutate(Name = tolower(Name)) |>
  left_join(crime, by = c("Name" = "County"))
# histogram
ggplot(georgia, aes(x = Crime)) +
  geom_histogram(bins = 20, col = "red", fill = "salmon")

Figure 4.3: The distribution of the crime data.

Here the response variable is Crime. This is the $y$-variable. A number of variables are used as predictors:

PctPov: an indicator of economic and social deprivation. Higher values correspond to more deprived areas.
PctBach: the percentage of people with bachelors degrees.
PctFB: the percentage of people with who are foreign born.
TotPop90: the population at risk - the log of this will be the offset variable here.

The regression is set out below:

m.pois <- glm(formula = Crime~PctPov + PctBach + PctFB + offset(log(TotPop90)),
              data=georgia, 
              family=poisson())
summary(m.pois)

## 
## Call:
## glm(formula = Crime ~ PctPov + PctBach + PctFB + offset(log(TotPop90)), 
##     family = poisson(), data = georgia)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.399411   0.027419 -196.92   <2e-16 ***
## PctPov       0.065010   0.000987   65.87   <2e-16 ***
## PctBach     -0.068933   0.001567  -43.99   <2e-16 ***
## PctFB       -0.135629   0.007665  -17.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 45077  on 158  degrees of freedom
## Residual deviance: 22370  on 155  degrees of freedom
## AIC: 23311
## 
## Number of Fisher Scoring iterations: 5

The poverty coefficient (PctPov) is positive and significant. The PctBach (degrees) and PctFB (foreign born) are both negative and and significant, suggesting a link between crime risk with higher levels of poverty, lower levels of people with degrees and who are foreign born. Again the antilogs of the coefficients are used to interpret the coefficients using the function, exp:

# coefficients
coef(m.pois)

## (Intercept)      PctPov     PctBach       PctFB 
## -5.39941086  0.06500999 -0.06893345 -0.13562875

# antilogs
exp(coef(m.pois))

## (Intercept)      PctPov     PctBach       PctFB 
## 0.004519243 1.067169689 0.933388793 0.873166736

These show that an increase of 1% of

poverty corresponds to 7% rise in the crime rate
bachelor degrees corresponds to the 7% decrease in the crime rate
foreign born corresponds to to 13% decrease in the crime rate

Again we generate confidence intervals that show the range of plausible values (i.e. within a 95% probability of not being found by chance), in absence of a model fit statistic.

# confidence intervals
round(exp(confint(m.pois, level = 0.95)), 3)

## Waiting for profiling to be done...

##             2.5 % 97.5 %
## (Intercept) 0.004  0.005
## PctPov      1.065  1.069
## PctBach     0.931  0.936
## PctFB       0.860  0.886

In this case the CIs are very tight, suggesting a strong model fit.

4.4 Outliers and Robust regression

So we have made an OLS model m. We have a measure of model fit and some idea of the relationships between the independent variable (or target variable $y$) and the dependent variables (or explanatory variables, or predictor variables $x_1$, $x_2$, etc) through the coefficient estimates, and we know the degree to which those relationships might be found by chance through the p-values indicating significance.

Now the model has been fitted to find the line in multivariate feature space that best fits the data points. This is how OLS works. To illustrate what is meant by this, the code below creates a second model using just 2 variables to predict Median Income:

# 2nd OLS model
m2 = lm(MedInc~PctEld+PctBach, data = georgia) 
# summary
summary(m2)

## 
## Call:
## lm(formula = MedInc ~ PctEld + PctBach, data = georgia)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27452  -4604   -751   2842  36394 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50012.7     3555.2  14.067  < 2e-16 ***
## PctEld       -1620.3      227.6  -7.120 3.71e-11 ***
## PctBach        562.5      123.1   4.571 9.83e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7832 on 156 degrees of freedom
## Multiple R-squared:  0.4515, Adjusted R-squared:  0.4445 
## F-statistic: 64.21 on 2 and 156 DF,  p-value: < 2.2e-16

Although both covariates are significant, this is a much looser model, with a much lower $R^2$.

What a linear regression model tries to do is to find the hyperplane that minimising the distance between it and all the observations, the data points. The best fitting hyperplane is the one that has the minimum summed distance between each observation value and the plane. We can visualise this for our second model using a 3D plot. The code below should generate a pop-out window that you can manipulate (rotate, zoom, etc):

plot3d(m2, col="orange", type = 's', size=1,
       vars = georgia |> st_drop_geometry() |> dplyr::select(MedInc, PctEld, PctBach) )
# run the code below to see the plot in your RStudio session 
# rglwidget(width = 600, height = 1100, reuse = FALSE)

If you rotate this, you will see that some of the observations (points) are very close to the hyperplane and others are far away. This means that the model describes the relationship between MedInc and PctEld and PctFBwell for the data points near to the plane, but less well for those further way. The ones that are far away are called model outliers. Essentially these are observations (data points) that the model, indicated by the plane in this case, does not predict very well.

We will examine outliers in this section.

Outliers are records or observations that the model does not describe very well, either vastly over-predicting Median Income or under-predicting it. In the m2 model above these were easy to find, since they were just the points lying a long way from the regression plane. However, when there are multiple predictors then the hyperplane is fitted to a multi-dimensional space, which in the case of model m created earlier is a 7-dimensional space, which is difficult to visualise and explore.

For outliers, a useful diagnostic is the studentised residual. Plotting these and looking for notably high or low values is a helpful way of spotting outliers in the regression model. A simple way of doing this is just to plot the residuals in order:

s.resids <- rstudent(m2)
plot(s.resids,type='h')
abline(h=c(-2,2),col='lightskyblue')

Figure 4.4: A plot of the studentised residuals.

Here we see some residuals are relatively high. Normally, one might expect standardised residuals to lie between -2 and +2. Adding lines to this plot helps to identify when this is not the case - as is done above.

Another useful job is to identify which are the outlying cases.

# create a tibble of georgia and the residuals
df_out <- tibble(Name=georgia$Name,Residual=s.resids)
# select observations whose absolute residual value is > 2
df_out[abs(df_out$Residual)>2,]

## # A tibble: 7 × 2
##   Name          Residual
##   <chr>            <dbl>
## 1 chattahoochee    -2.74
## 2 clarke           -3.98
## 3 fayette           3.29
## 4 forsyth           5.03
## 5 harris            2.19
## 6 henry             2.21
## 7 towns             2.29

Negative values here indicate a much lower actual income than the model predicts, positive values suggest it is much higher than predicted by the model.

Finally, we need to check another model assumption in lm - that the error term $\epsilon_i$’s are normally distributed. The best way to do this is via a qq-plot which plots the empirical quantiles of a variable whose distribution we are interested in against the theoretical distribution we are testing for. If the variable is from this distribution, the results should yield a straight line - frequently a reference line is provided to test this. Here s.resids is the variable of interest - the studentised residuals, and these are compared to the $t$ distribution. Although the $\epsilon_i$’s have a normal distribution, theory shows that the studentised residuals derived from a sample coming from this model have a $t$ distribution. However, the qqPlot function from the package car does all the work, and requires only the regression model object to create a plot:

qqPlot(m2)

Figure 4.5: A qq plot of the model residuals.

This demonstrates that mostly the residuals do conform to this, with a couple of extreme outliers at the each end - labelled in the plot. Consulting the earlier table suggests these are Clarke and Forsyth Counties.

df_out[c(29, 58), ]

## # A tibble: 2 × 2
##   Name    Residual
##   <chr>      <dbl>
## 1 clarke     -3.98
## 2 forsyth     5.03

It is also important to consider spatial patterns in the residuals, from a geographical viewpoint. The code below plots the spatial distribution of the outliers, defined as being further than 2 standard deviations away from the mean:

# add the residuals to the data 
georgia$s.resids = s.resids
# indicate if they are outliers
georgia$outlier <- cut(georgia$s.resids,
                       breaks=c(-Inf, -2, 2, Inf), 
                       labels=c("low","normal","high"))
# map the spatial distribution of outliers
ggplot(data = georgia) + 
  geom_sf(aes(fill = outlier))

Figure 4.6: A map of the outliers - observations where Median Income is over-estimated (red) and under-estimated by the model.

So the presence of a few outliers weakens our models and might ‘throw’ the regression coefficient estimates. One way to investigate this is to remove the outlying observations and re-fit the model. The removal is best done using the dplyr approach and the filter function, before re-fitting the model.

# original model
m = lm(formula = MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = georgia) 
# filter outliers
georgia %>% filter(abs(rstudent(m)) <= 2) -> georgia.f
# filtered model
m.f <- lm(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = georgia.f)

This done, the compareCoefs function in car lets the results and their standard errors (SE) be compared:

compareCoefs(m,m.f,print=T, pvals = T)

Table 4.1: A comparison of regression models with outliers (Model 1) and without (Model 2).
	Model 1	SE 1	Pr(>\|z\|) 1	Model 2	SE 2	Pr(>\|z\|) 2
(Intercept)	52598.947	3108.933	0.000	52904.556	2644.075	0.000
PctRural	73.771	20.432	0.000	67.450	16.686	0.000
PctBach	697.256	112.208	0.000	727.747	104.936	0.000
PctEld	-788.619	179.790	0.000	-924.904	146.770	0.000
PctFB	-1290.304	473.881	0.006	-1484.786	459.329	0.001
PctPov	-954.003	104.591	0.000	-886.720	92.269	0.000
PctBlack	33.132	37.170	0.373	42.998	31.465	0.172

There are a couple of things to consider when comparing Model 1 and Model 2:

the difference between the coefficient estimates
the change in the Standard Errors between the models
the change in the p-values (Pr(>|z|))

It can be seen that some of the coefficient estimates alter notably between the models with and without outliers. Some are small differences but others are large: consider effect on PctFB and PctBlack.

Robust regression offers a more nuanced approach. Here, instead of simply ignoring or removing the outliers, it is assumed that for whatever reason, they are subject to much more variance than the remaining observations - possibly due to some characteristic of the sampling process, or - more likely here - due to some unusual circumstances in the geographical locality (for example the presence of a large university skewing income figures in and around Atlanta compared to most of the State). Rather than totally ignoring such observations, in a robust regression, they are down weighted to allow for this. This is less extreme than removing outliers - which in a sense is extreme down weighting to the value zero!

There are a number of approaches to robust regression - here we will use one proposed by Huber (2011). This uses an algorithm as below:

Fit a standard regression model and note the residuals.
Compute the weight for each observation as a function of absolute residual size.
Refit the regression model with these weights. Update the residual values.
Repeat from step 2 until regression coefficient estimates converge.

This is all implemented in the rlm function in the package MASS. It is very easy to use - once you have loaded MASS it just works like lm:

m.r <- rlm(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = georgia)
m.r

## Call:
## rlm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + 
##     PctPov + PctBlack, data = georgia)
## Converged in 9 iterations
## 
## Coefficients:
## (Intercept)    PctRural     PctBach      PctEld       PctFB      PctPov 
## 51607.73916    63.37478   677.87963  -777.78234 -1224.69054  -880.53199 
##    PctBlack 
##    31.79542 
## 
## Degrees of freedom: 159 total; 152 residual
## Scale estimate: 4190

Here the coefficient estimates are printed out, and it may also be seen that the model converged in 9 iterations, and that the scale factor $\tau$ was estimated as 4190. As with lm you can print out a summary:

summary(m.r)

## 
## Call: rlm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + 
##     PctPov + PctBlack, data = georgia)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12848.5  -2832.6   -176.5   2501.2  27420.4 
## 
## Coefficients:
##             Value      Std. Error t value   
## (Intercept) 51607.7392  2737.7192    18.8506
## PctRural       63.3748    17.9920     3.5224
## PctBach       677.8796    98.8100     6.8604
## PctEld       -777.7823   158.3230    -4.9126
## PctFB       -1224.6905   417.2985    -2.9348
## PctPov       -880.5320    92.1023    -9.5604
## PctBlack       31.7954    32.7318     0.9714
## 
## Residual standard error: 4188 on 152 degrees of freedom

Finally, coefficients from all three approaches can be compared:

coef_tab <- data.frame(OLS=coef(m),`Outliers Deleted`=coef(m.f),Robust=coef(m.r))
coef_tab

Table 4.2: A comparison of regression model coefficient estimates.
	OLS	Outliers.Deleted	Robust
(Intercept)	52598.947	52904.556	51607.739
PctRural	73.771	67.450	63.375
PctBach	697.256	727.747	677.880
PctEld	-788.619	-924.904	-777.782
PctFB	-1290.304	-1484.786	-1224.691
PctPov	-954.003	-886.720	-880.532
PctBlack	33.132	42.998	31.795

For some coefficients, the robust estimate is closer to the standard OLS, and for others, it is closer to the Outliers.Deleted estimate. However in others, it differs from OLS, but not in the same direction as the Outliers.Deleted case.

So the options for models when data have outliers that ‘throw’ the model are

Ignore them: the data is the data and provides a true picture of measurements that were made
Filter them: just use the observations that predict the target variable well
Down weight them: use a robust regression

As a precautionary approach, I suggest working with the robust approach if there are some notable outliers. However, it could be argued that the approach of removing them entirely is misleading - essentially trying to pass off a model as universal but intentionally ignoring observations that contradict it. Although that might be quite extreme, the point holds to some extent, even with robust regression: those counties do have those statistics, and so should perhaps not be ignored entirely.

Addendum: Prediction vs Inference

You will need to the OLS regression models created earlier into your workspace:

m = lm(formula = MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = georgia)

Recall that the regression model m gives us an idea of how different factors are associated in different ways with Median Income in the counties of the state of Georgia and their interaction. This is key to understanding all predictive and explanatory models, whether spatial or not.

If we can examine the model we can make a series of statements about how the various factors relate to the target variable, MedInc representing County median income (in 1000s of dollars):

summary(m)

## 
## Call:
## lm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + PctPov + 
##     PctBlack, data = georgia)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12420.3  -2989.7   -616.3   2209.5  25820.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52598.95    3108.93  16.919  < 2e-16 ***
## PctRural       73.77      20.43   3.611 0.000414 ***
## PctBach       697.26     112.21   6.214 4.73e-09 ***
## PctEld       -788.62     179.79  -4.386 2.14e-05 ***
## PctFB       -1290.30     473.88  -2.723 0.007229 ** 
## PctPov       -954.00     104.59  -9.121 4.19e-16 ***
## PctBlack       33.13      37.17   0.891 0.374140    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5155 on 152 degrees of freedom
## Multiple R-squared:  0.7685, Adjusted R-squared:  0.7593 
## F-statistic: 84.09 on 6 and 152 DF,  p-value: < 2.2e-16

There are 2 things to note:

First, the phrasing of the interpretation of the model outputs is key:

“each additional 1% of the population that has a degree is associated with an increase of $697.256 in median income”
“each additional 1% of the population that is elderly is associated with an increase of $-788.619 in median income” (i.e. a decrease)
etc

This is inference. These coefficient estimate enhance our understanding of the process being investigated. We are not as yet able to say anything about what is causing these variations in Median Income as we have no theoretical model to infer causality.

Second, m, as with any model, can be used for Prediction. In this case, to predict county level Median Income given inputs for the model factors (predictor variables). The table below shows some hypothetical data for hypothetical Counties and the predicted Median Income for these:

# create some test data 
test = data.frame(
  PctRural = c(10, 70, 50, 70, 10),
  PctBach  = c(5, 5, 10, 10, 25),
  PctEld   = c(2, 5, 5, 12, 15),
  PctFB    = c(1, 1, 1, 5, 5),
  PctPov   = c(20, 10, 8, 25, 25),  
  PctBlack = c(5, 75, 20, 30, 75)) 
# predict using the model
predIncome = round(predict(m, newdata = test), 3)
# display
data.frame(test, Predicted.Income = predIncome)

Table 4.3: Predicted Median Income values from the model ‘m’
PctRural	PctBach	PctEld	PctFB	PctPov	PctBlack	Predicted.Income
10	5	2	1	20	5	35041.00
70	5	5	1	10	75	48960.65
50	10	5	1	8	20	51057.26
70	10	12	5	25	30	25964.39
10	25	15	5	25	75	31122.10

In summary, statistical models are constructed for 2 primary purposes: to predict or to infer some process. The distinction between the two activities has sometimes been blurred as the volume of activities in data analytics and statistics has increased, and the background training of those involved these activities has diversified. Sometimes the terminology is confusing, especially the use of the term inference or infer.

In truth, both prediction and inference seek to infer $y$ from $x$ in some way through some function $f$, where $x$ represents the inputs (predictor, independent, etc) variables and $y$ is the target (or dependent) variable and $\epsilon$ is some random error term:

\[\begin{equation} y = f(x) + \epsilon{} \label{eq:eg} \end{equation}\]

It is always useful to return to core knowledge to (re-)establish fundamental paradigms, and in this case to consider how others regard the distinction between prediction and other kinds of inference:

…in a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth. In this case one might be interested in how the individual input variables affect the prices—that is, how much extra will a house be worth if it has a view of the river? This is an inference problem. Alternatively, one may simply be interested in predicting the value of a home given its characteristics: is this house under- or over-valued? This is a prediction problem. (James et al. (2013); p20, bold emphasis added).

Viewed through this lens, prediction and inference can be distinguished as follows:

Prediction uses the estimated function $f$ to forecast $y$, for example over unsampled areas when data for those areas becomes available, or into the future, given different future states.

Inference uses estimated function $f$ to understand the impact of the the inputs $x$ on the outcome $y$. It is often associated with process understanding, with the aim of, for example, determining how much of $y$ is explained by a particular $x$, for example $x_i$, through its partial derivative $\partial{f} / \partial{x_i}$.

The aim in both cases is to identify the function $f$ that best approximates the relationships or structure in the data. And, although Prediction and Inference are both in a sense inferential with one focusing on forecasting and the other on process understanding, and both may use the same estimation procedure to determine $f$, the major difference is that they have different procedural considerations. This is especially the case for how the issues around independence and collinearity amongst predictor variables are treated for example: highly correlated input variables may make it difficult to separate variable effects in when the objective is inference, whereas this does not matter for prediction, which is concerned only with the accuracy and reliability of the forecast.

Answers to Tasks

Task 1 Create an OLS model with the PctEld variable as the target variable and PctRural, PctPov, PctBach and PctFB as the target variables. Interpret the model outputs including the coefficient estimates, significance and model fit.

The model is specified and summarised below

# Task 1
m = lm(PctEld~PctRural+PctPov+PctFB+PctBach, data = georgia)
summary(m)

## 
## Call:
## lm(formula = PctEld ~ PctRural + PctPov + PctFB + PctBach, data = georgia)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6344 -1.3926 -0.2233  1.2571 11.7801 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.246854   1.271934   5.698 6.00e-08 ***
## PctRural     0.020454   0.008992   2.275  0.02430 *  
## PctPov       0.194533   0.028182   6.903 1.25e-10 ***
## PctFB       -0.563562   0.208723  -2.700  0.00771 ** 
## PctBach     -0.006093   0.050055  -0.122  0.90328    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.325 on 154 degrees of freedom
## Multiple R-squared:  0.4447, Adjusted R-squared:  0.4303 
## F-statistic: 30.84 on 4 and 154 DF,  p-value: < 2.2e-16

This can be interpreted as follows:

all of the predictor variables are statistically significant except PctBach
the coefficient estimates of the significant predictors indicate that
- each 1% increase in PctRural is associated with a 0.02% increase in PctEld suggesting that is has a very weak effect
- each 1% increase in PctPov is associated with a 0.19% increase in PctEld suggesting that is has a somewhat weak but positive effect
- each 1% increase in PctFB is associated with a 0.56% decrease in PctEld suggesting that is has a strong negative effect
the model fit is fairly low (an $R^2$ of 0.44)

Task 2 Create and interpret a binomial GLM with the target variable of LocalHosp (difficulty in access to a local hospital) using the following target variables:

Health: their declared health status;
Illness: whether they have a limiting long term illness;
Hosp: distance from the respondent’s home postcode to the nearest GP surgery - this was calculated after the survey;
cars: the number of cars they own, specifically we can use this to determine non-car ownership.

A bit of pre-processing is needed to get these into a format that is suitable for analysis. First the data reloaded and then subsetted to include the variables of interest.

# Task 2
data.pt = st_read("place_survey_pts.gpkg", quiet = T)
data.pt <- data.pt %>% dplyr::select(LocalHosp, Health, Illness, Hosp, cars)

Again the text descriptions need to be converted to binary variables:

# convert the data 
data.pt %>% 
  transmute(
    HospDiff = (LocalHosp == "Fairly difficult" | LocalHosp == "Very difficult")+0,
    BadHealth = (Health == "Bad" | Health ==    "Very bad")+0,
    Illness = (Illness == "Do have")+0,
    NoCar = (cars == 0)+0,
    HospDist = Hosp/1000) -> data.pt

#create the model
m.glm = glm(formula = HospDiff~Illness + BadHealth + HospDist + NoCar, 
            data = data.pt, 
            family= binomial(link = "logit"))
# inspect the results
summary(m.glm)

## 
## Call:
## glm(formula = HospDiff ~ Illness + BadHealth + HospDist + NoCar, 
##     family = binomial(link = "logit"), data = data.pt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.583811   0.053153 -29.797  < 2e-16 ***
## Illness      0.233799   0.059518   3.928 8.56e-05 ***
## BadHealth    0.404305   0.118860   3.402  0.00067 ***
## HospDist     0.001931   0.004076   0.474  0.63566    
## NoCar        0.479664   0.068617   6.990 2.74e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8583.8  on 8529  degrees of freedom
## Residual deviance: 8482.7  on 8525  degrees of freedom
## AIC: 8492.7
## 
## Number of Fisher Scoring iterations: 4

# odds ratios
exp(coef(m.glm))

## (Intercept)     Illness   BadHealth    HospDist       NoCar 
##   0.2051916   1.2633908   1.4982615   1.0019330   1.6155322

# confidence intervals
exp(confint(m.glm, level = 0.95))

## Waiting for profiling to be done...

##                 2.5 %    97.5 %
## (Intercept) 0.1847910 0.2276029
## Illness     1.1238324 1.4191901
## BadHealth   1.1841057 1.8875011
## HospDist    0.9939228 1.0099336
## NoCar       1.4111448 1.8467651

What these show is that there are significant statistical relationships between all of the variables and experiencing difficulty in access to Local Hospitals. The proportion of respondents experiencing access difficulty…

increases with Long Term Illness (1.3 times more likely than for those without long term illness), with a CI or likely range of between 1.1 and 1.4 times;
increases with a lack of car ownership (1.6 times more likely than for those who own a car), with a CI of between 1.4 and 1.8 times;
increases if the respondent has Bad Health (1.5 times more likely than for those with Bad Health), with a CI of between 1.2 and 1.9;

Distance (HospDist) is not significant and as also reflected in the odds ratios of around 1

Optional: the MAUP

4.4.1 Introduction

The Modifiable Areal Unit Problem (or MAUP) is a manifestation of the Ecological Fallacy (Robinson (2009)) in spatial data (Openshaw (1984a), Openshaw (1984b)). In outline, the MAUP posits that statistical relationships, trends and correlations trends may exhibit very different properties when data are aggregated over different reporting units at different scales as they frequently are in analyses of spatial data. A common example is census data reported over different geographies. In this additional activities you will examine the impact of the MAUP using census data aggregated over different geographies.

The MAUP describes the distortion of rate calculations and their impact on statistical analyses whether aimed at prediction or inference. It results in differences in outcomes when data or processes are summarised in some way over different spatial units, which are then used inputs to a model. The model outputs will vary when constructed from data aggregated over different areal units (Openshaw (1984a)) because of differences in scale that drive the aggregation. Openshaw’s original publication on the MAUP can be downloaded from http://www.qmrg.org.uk/catmog/index.html.

The MAUP is a long standing and well recognised issue in geography, but is only starting to be discovered in other domains that use spatial data such as landscape ecology, climate change, land use decision making. And of course, in this time of the many new forms of big spatial data (from crowdsourcing, sensors, social media analysis etc), consideration of the MAUP has never been more important. The additional reading at the end of the session provides some accessible examples.

We have just made some model and inferred some relationships between the target variable median income and various socio-economic factors. This was done for the state of Georgia n the USA with data aggregated over the 159 counties in Georgia. The code below loads another version of the Georgia data, constructed from the same population data, but now aggregated over 35 zones.

# load the data
load("georgia_z.RData")

We can examine the distribution of the MedInc variable over the counties and the zones using ggplot2 :

p1 = 
  ggplot(georgia) + geom_sf(aes(fill = MedInc)) +
  scale_fill_distiller(palette = "YlGnBu") 
p2 = 
  ggplot(georgia_z) + geom_sf(aes(fill = MedInc)) +
  scale_fill_distiller(palette = "YlGnBu")

We can examine the distributions in the data after creating a data, frame for each:

g = georgia |> st_drop_geometry() |> dplyr::select(MedInc, PctRural: PctBlack)
gz = georgia_z |> st_drop_geometry() |> dplyr::select(MedInc, PctRural: PctBlack)
summary(g)

##      MedInc         PctRural         PctBach          PctEld     
##  Min.   :23456   Min.   :  2.50   Min.   : 4.20   Min.   : 1.46  
##  1st Qu.:29773   1st Qu.: 54.70   1st Qu.: 7.60   1st Qu.: 9.81  
##  Median :34307   Median : 72.30   Median : 9.40   Median :12.07  
##  Mean   :37147   Mean   : 70.18   Mean   :10.95   Mean   :11.74  
##  3rd Qu.:41204   3rd Qu.:100.00   3rd Qu.:12.00   3rd Qu.:13.70  
##  Max.   :81603   Max.   :100.00   Max.   :37.50   Max.   :22.96  
##      PctFB           PctPov         PctBlack    
##  Min.   :0.040   Min.   : 2.60   Min.   : 0.00  
##  1st Qu.:0.415   1st Qu.:14.05   1st Qu.:11.75  
##  Median :0.720   Median :18.60   Median :27.64  
##  Mean   :1.131   Mean   :19.34   Mean   :27.39  
##  3rd Qu.:1.265   3rd Qu.:24.65   3rd Qu.:40.06  
##  Max.   :6.740   Max.   :35.90   Max.   :79.64

summary(g)

##      MedInc         PctRural         PctBach          PctEld     
##  Min.   :23456   Min.   :  2.50   Min.   : 4.20   Min.   : 1.46  
##  1st Qu.:29773   1st Qu.: 54.70   1st Qu.: 7.60   1st Qu.: 9.81  
##  Median :34307   Median : 72.30   Median : 9.40   Median :12.07  
##  Mean   :37147   Mean   : 70.18   Mean   :10.95   Mean   :11.74  
##  3rd Qu.:41204   3rd Qu.:100.00   3rd Qu.:12.00   3rd Qu.:13.70  
##  Max.   :81603   Max.   :100.00   Max.   :37.50   Max.   :22.96  
##      PctFB           PctPov         PctBlack    
##  Min.   :0.040   Min.   : 2.60   Min.   : 0.00  
##  1st Qu.:0.415   1st Qu.:14.05   1st Qu.:11.75  
##  Median :0.720   Median :18.60   Median :27.64  
##  Mean   :1.131   Mean   :19.34   Mean   :27.39  
##  3rd Qu.:1.265   3rd Qu.:24.65   3rd Qu.:40.06  
##  Max.   :6.740   Max.   :35.90   Max.   :79.64

You can see that there are differences - notably in the high values of median income, but a similar picture is present.

4.4.2 Impact on Correlations of Counts

Then the cor function can be applied to the data frame, and for legibility the round function is also applied:

# generate correlations
cor_g = cor(g)
cor_z = cor(gz)
# examine
round(cor_g, 3)

##          MedInc PctRural PctBach PctEld  PctFB PctPov PctBlack
## MedInc    1.000   -0.169   0.523 -0.615  0.320 -0.818   -0.536
## PctRural -0.169    1.000  -0.619  0.390 -0.547  0.174   -0.069
## PctBach   0.523   -0.619   1.000 -0.458  0.672 -0.402   -0.109
## PctEld   -0.615    0.390  -0.458  1.000 -0.483  0.568    0.297
## PctFB     0.320   -0.547   0.672 -0.483  1.000 -0.329   -0.112
## PctPov   -0.818    0.174  -0.402  0.568 -0.329  1.000    0.736
## PctBlack -0.536   -0.069  -0.109  0.297 -0.112  0.736    1.000

round(cor_z, 3)

##          MedInc PctRural PctBach PctEld  PctFB PctPov PctBlack
## MedInc    1.000   -0.443   0.660 -0.651  0.703 -0.858   -0.423
## PctRural -0.443    1.000  -0.720  0.640 -0.664  0.297   -0.196
## PctBach   0.660   -0.720   1.000 -0.644  0.877 -0.461   -0.067
## PctEld   -0.651    0.640  -0.644  1.000 -0.662  0.599    0.201
## PctFB     0.703   -0.664   0.877 -0.662  1.000 -0.560   -0.192
## PctPov   -0.858    0.297  -0.461  0.599 -0.560  1.000    0.712
## PctBlack -0.423   -0.196  -0.067  0.201 -0.192  0.712    1.000

There are some important differences in the correlations at different scales, with the following having correlations that have a difference greater than 0.2:

MedInc and PctRural
MedInc and PctFB
PctRural and PctEld
PctBach and PctFB
PctFb and PctPov

These show important difference in the levels of correlation, which are just based on frequency counts, and similar patterns would be shown if we did the same analysis for rates or percentages.

4.4.3 Impact on models

The code below recreates the OLS model from above using the original data and the data aggregated into zones

m = lm(formula = MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, 
       data = georgia) 
mz = lm(formula = MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, 
        data = georgia_z)

When we examine the model summaries, very different inferences are drawn both in terms of the variables that are significant, and the coefficient estimates describing the rate at which changes in the different predictors variables are associated with the changes in Median Income:

summary(m)
summary(mz)

This is also shown when the coefficients are compared directly. Notice, for example the huge difference in coefficients and significance values for PctFB:

compareCoefs(m,mz,print=T, pvals = T)

## Calls:
## 1: lm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + PctPov + 
##   PctBlack, data = georgia)
## 2: lm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + PctPov + 
##   PctBlack, data = georgia_z)
## 
##             Model 1 Model 2
## (Intercept)   52599   44077
## SE             3109    4910
## Pr(>|z|)    < 2e-16 < 2e-16
##                            
## PctRural       73.8    60.0
## SE             20.4    40.3
## Pr(>|z|)    0.00031 0.13620
##                            
## PctBach         697     369
## SE              112     216
## Pr(>|z|)    5.2e-10 0.08826
##                            
## PctEld         -789     -65
## SE              180     329
## Pr(>|z|)    1.2e-05 0.84351
##                            
## PctFB         -1290     678
## SE              474    1055
## Pr(>|z|)    0.00647 0.52015
##                            
## PctPov         -954   -1125
## SE              105     168
## Pr(>|z|)    < 2e-16 2.3e-11
##                            
## PctBlack       33.1   160.8
## SE             37.2    60.3
## Pr(>|z|)    0.37273 0.00763
##

So… all of this implies that the trends and relationships in the data that you identify in your models, in your statistical analyses, etc, are heavily contingent on the scales at which the data are aggregated, and thus your understanding of the problem in hand. This is the Modifiable Areal Unit Problem!

Additional Resources

Regression

OLS regression: https://ademos.people.uic.edu/Chapter12.html
Poisson and logistic regression: https://ademos.people.uic.edu/Chapter19.html
a description of the assumptions behind OLS regression: https://statisticsbyjim.com/regression/ols-linear-regression-assumptions/
a description of how OLS works (ignore the equations!): https://towardsdatascience.com/understanding-the-ols-method-for-simple-linear-regression-e0a4e8f692cc
a tutorial doing an OLS by hand: http://science.clemson.edu/physics/labs/tutorials/excel/regression.html

The MAUP

Nice accessible lecture about the MAUP: https://www.ed.ac.uk/files/imports/fileManager/RFlowerdew_Slides.pdf
Commentary on Big Data: Brunsdon, C. and Comber, A., 2020. Opening practice: supporting reproducibility and critical spatial data science. Journal of Geographical Systems, pp.1-20 - see https://link.springer.com/article/10.1007/s10109-020-00334-2

References

Comber, Alexis J, Chris Brunsdon, and Robert Radburn. 2011. “A Spatial Analysis of Variations in Health Access: Linking Geography, Socio-Economic Status and Access Perceptions.” International Journal of Health Geographics 10 (1): 1–11.

Comber, Alexis, Martin Callaghan, Paul Harris, Binbin Lu, Nick Malleson, and Chris Brunsdon. 2022. “Gwverse: A Template for a New Generic Geographically Weighted r Package.” Geographical Analysis 54 (3): 685–709.

Huber, Peter J. 2011. “Robust Statistics.” In International Encyclopedia of Statistical Science, 1248–51. Springer.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.

Openshaw, Stan. 1984a. “CATMOG 38: The Modifiable Areal Unit Problem.” In Norwich: Geo-Abstracts.

———. 1984b. “Ecological Fallacies and the Analysis of Areal Census Data.” Environment and Planning A 16 (1): 17–31.

Robinson, William S. 2009. “Ecological Correlations and the Behavior of Individuals.” International Journal of Epidemiology 38 (2): 337–41.