Part 1 GAM with Splines

In this session you will:

  • Set up your RStudio session, loading data and packages
  • Undertake and unpick a standard GAM regression
  • Extend this to a GAM with splines (also referred to as “smooths”)

1.1 Data and Packages

The first code chuck loads the packages required, checking for the presence on your computer and installing them if necessary. Remember that you can execute chunks of code by highlighting it and clicking the Run button, or by placing your cursor inside it and pressing Cmd+Shift+Enter.

# package list
packages <- c("parlitools", "tidyverse","sf", "mgcv", "ggspatial", "cols4all", "cowplot")
# check which packages are not installed
not_installed <- packages[!packages %in% installed.packages()[, "Package"]]
# install missing packages
if (length(not_installed) > 1) {
  install.packages(not_installed, repos = "https://cran.rstudio.com/", dep = T)
}
# load the packages
library(parlitools) # for UK voting data 
library(tidyverse)  # for data manipulation and plotting
library(sf)         # for spatial data
library(mgcv)       # for GAMs
library(ggspatial)  # for map backdrops
library(cols4all)   # for nice shading in graphs and maps
library(cowplot)    # for managing plots

Next some data of the 2016 UK Brexit vote are created. Census and voting data can be obtained from the parlitools R package (Odell 2020). This includes data on voting for the 632 Westminster parliamentary constituencies in Great Britain (i.e. excluding those in Northern Ireland, for which some data is missing). Information on the data and the package can be found at . The voting data were linked to socio-economic factors that were associated with the Leave vote as described in Beecham et al. (2020), from census data also contained in the parlitools R package. The data are reported over hexagonal cartograms. (NB The true constituency areas are available from the UK Office of National Statistics at https://geoportal.statistics.gov.uk/datasets/d3650c649bff40768badafdf0518a4af_0/.)

## 1.1 create the data
# Great Britain, not NI
NI <- str_detect(west_hex_map$gss_code,'^N')
n <- sum(!NI) ## Cheeky recast of logicals to numeric 0/1
hex.sp <- west_hex_map[!NI,] ## Subset of map
st_crs(hex.sp) = st_crs(hex.sp)
## 1.2 Join the other data tables
hex.sp <- hex.sp %>% rename(ons_const_id=gss_code) %>% 
  left_join(bes_2015,by = 'ons_const_id') %>%
  left_join(leave_votes_west, by = 'ons_const_id') %>% 
  left_join(census_11, by = 'ons_const_id') 
## 1.3 Select variables for analysis
hex.sp <- 
  hex.sp %>% 
  mutate(over65 = age_65_to_74+ age_75_to_84    + age_85_to_89 + age_90_plus, 
         ttu = nssecsemi_routine + nssecroutine + nsseclower_supervisor, 
         lhosp = industry_accommodation,
         manu = industry_manufacturing,
         degree = qual_level_4,
         badhealth = health_bad + health_very_bad,
         bornuk = born_uk,
         leave = figure_to_use *100, 
         to15 = turnout_15) %>% 
  select(ons_const_id, leave, to15, over65,  ttu, lhosp, manu, degree, badhealth, bornuk, region_name)
## 1.4 create regions
# tidy region names
hex.sp$region_name = gsub(" Euro Region", "", hex.sp$region_name) 
hex.sp$region_name = gsub(" and the Humber", "", hex.sp$region_name) 
hex.sp$region_name = gsub("Eastern", "East", hex.sp$region_name) 
regions_hex.sp <- 
  hex.sp %>% 
  group_by(region_name) %>%
  summarise(geometry = sf::st_union(geometry)) %>%
  ungroup()

If you want to examine the data outside or R/RStudio (for example in a GIS), it can be saved off to a GeoPackage format:

st_write(hex.sp, "hex.sp.gpkg", quiet = T, delete_layer = T)

The variables are described in Table 1.1.

Table 1.1: The variables used to construct the GGP-GAM of Leave voting rates, for each parliamentary consituency.
Variable Description
leave Leave vote (%)
to15 Turn out in the 2015 general election (%)
over65 Over 65s (%)
ttu Transport trade and utilities employment (%)
lhosp Leisure & hospitality employment (%)
manu Manufacturing employment (%)
degree Degree (level 4) qualification (%)
badhealth Bad health (%)
bornuk UK born (%)

The spatial distribution of the response variable and the UK regions for reference in shown in Figure 1.1. The voting pattern reflects some well recognised local trends: low leave vote in Scotland and London, high leave vote in the Midlands and North.

p1 <-  
  ggplot() +
  geom_sf(data = hex.sp,aes(fill=(leave-50)), col = NA) + 
  scale_fill_continuous_c4a_div(palette="tol.sunset",name='Leave (%) - 50%' ) + 
  theme_bw() +
  theme(axis.title=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank())
p2 <- 
  ggplot() +
  geom_sf(data = regions_hex.sp, aes(fill = region_name), col = NA) +
  scale_fill_discrete_c4a_cat(palette="brewer.paired",name='Regions') + 
  theme_bw() +
    theme(axis.title=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank()) 

plot_grid(p1,p2, ncol = 2)
The Brexit leave vote majority over parliamentary constituencies with the UK regions, recast into hexagons.

Figure 1.1: The Brexit leave vote majority over parliamentary constituencies with the UK regions, recast into hexagons.

1.2 Regression

The standard form for a linear regression, which are commonly solved using an ordinary least squares approach (OLS) is:

\[y = \beta_{0} + \sum_{m}^{k=1}\beta_{k}x_{k} + \epsilon \]

where \(y\) is the response variable, \(x_{k}\) is the value of the \(k^{th}\) predictor variable, \(m\) is the number of predictor variables, \(\beta_0\) is the intercept term, \(\beta_k\) is the regression coefficient for the \(k^{th}\) predictor variable and \(\epsilon\) is the random error term.

We can make an initial model of land price and examine the result:

reg.mod = lm(leave~ to15 + over65 + ttu + lhosp + manu + degree + badhealth + bornuk, 
             data = hex.sp)
summary(reg.mod)
## 
## Call:
## lm(formula = leave ~ to15 + over65 + ttu + lhosp + manu + degree + 
##     badhealth + bornuk, data = hex.sp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.8099  -2.6869   0.3864   3.0490  13.9272 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 147.18794    5.04424  29.179  < 2e-16 ***
## to15         -0.90338    0.06559 -13.773  < 2e-16 ***
## over65        1.29320    0.07921  16.327  < 2e-16 ***
## ttu          -0.25512    0.07267  -3.511 0.000479 ***
## lhosp        -1.16661    0.13615  -8.569  < 2e-16 ***
## manu          0.56592    0.08414   6.726 3.95e-11 ***
## degree       -0.99318    0.05906 -16.817  < 2e-16 ***
## badhealth    -2.05115    0.20333 -10.088  < 2e-16 ***
## bornuk       -0.09962    0.03339  -2.984 0.002957 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.088 on 623 degrees of freedom
## Multiple R-squared:  0.8047, Adjusted R-squared:  0.8022 
## F-statistic: 320.8 on 8 and 623 DF,  p-value: < 2.2e-16

We have a model with a good \(R^2\) value, describing the relationship between the different covariates and the leave vote. And the coefficient estimates can be interpreted in the usual way:

  • each additional 1% of the electorate that voted in the 2015 election, is associated with a 0.9% decrease in the leave vote;
  • each additional 1% of the population who are 0ver 65, is associated with a 1.3% increase in the leave vote;
  • …etc

This should all be relatively familiar, and sets the scene for GAMs in the next section.

1.3 GAM Regression

GAMs are Generalised Additive Models. They were originally proposed by Hastie and Tibshirani (1987). The gam function in the mcgv package (Wood 2018) can be used to fit a regression model in the same way,. The model created by the code below generates the same fixed parametric coefficient estimates and leading to the same interpretations:

gam.mod = gam(leave~ to15 + over65 + ttu + lhosp + manu + degree + badhealth + bornuk, 
              data = hex.sp)
summary(gam.mod)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## leave ~ to15 + over65 + ttu + lhosp + manu + degree + badhealth + 
##     bornuk
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 147.18794    5.04424  29.179  < 2e-16 ***
## to15         -0.90338    0.06559 -13.773  < 2e-16 ***
## over65        1.29320    0.07921  16.327  < 2e-16 ***
## ttu          -0.25512    0.07267  -3.511 0.000479 ***
## lhosp        -1.16661    0.13615  -8.569  < 2e-16 ***
## manu          0.56592    0.08414   6.726 3.95e-11 ***
## degree       -0.99318    0.05906 -16.817  < 2e-16 ***
## badhealth    -2.05115    0.20333 -10.088  < 2e-16 ***
## bornuk       -0.09962    0.03339  -2.984 0.002957 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  0.802   Deviance explained = 80.5%
## GCV = 26.257  Scale est. = 25.883    n = 632

However, GAMs also allow users to examine non-linear relationships through the uses of smooths or splines (more on these later). The basic ideas behind the GAM framework are that:

  1. Relationships between the dependent variable and predictor variables follow smooth patterns that can be linear or non-linear.
  2. It is possible to estimate these smooth relationships individually and simultaneously and then predict the target variable by adding them up.

These ideas are illustrated in the figure below from https://multithreaded.stitchfix.com/blog/2015/07/30/gam/:

where \(Y\) is the target variable \(E(Y)\) is the expected value, \(g(Y)\) is the link function that links the expected value to the predictor variables \(x_1 \dots x_p\), and the terms \(s_1(x_1), \dots , s_p(x_p)\) indicate the smooth, nonparametric (non-linear) functions. Formal descriptions of GAMs and splines are provided in Part 2.

The code below plots the target variable against the predictor variables. Here is is evident that each of the relationships with the target variable are relatively smooth, but importantly that the the assumption of linear relationships estimated under a standard regression may not be appropriate in all cases:

hex.sp %>% st_drop_geometry() %>% 
  select(-ons_const_id , -region_name) %>% 
  pivot_longer(-leave) %>% 
  ggplot(aes(leave, value)) + 
  #geom_point(alpha = 0.2) + 
  geom_smooth() +
  facet_wrap(~name, scales = "free")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Scatterplots of the target variable (Brexit leave vote) with the covariates.

Figure 1.2: Scatterplots of the target variable (Brexit leave vote) with the covariates.

1.4 GAMs with Splines / Smooths

To handle these non-linearities GAMs can be specified with smooths, which are also referred to as splines. These are best illustrated through some simulated data which is created in the code snippet below along with a standard linear model:

# make simulated data
set.seed(1)
dat <- gamSim(1,n=100,scale=2)[, 1:4]
## Gu & Wahba 4 term additive model
head(dat)
##           y        x0        x1        x2
## 1 14.531995 0.2655087 0.6547239 0.2675082
## 2 16.113160 0.3721239 0.3531973 0.2186453
## 3  9.583515 0.5728534 0.2702601 0.5167968
## 4 15.686696 0.9082078 0.9926841 0.2689506
## 5  8.221554 0.2016819 0.6334933 0.1811683
## 6  9.903443 0.8983897 0.2132081 0.5185761
# create OLS regression model 
mod = lm(y ~ x1 + x2, data = dat)
summary(mod)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8749 -1.3978 -0.0907  1.6120  7.9196 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.2807     0.7534   9.664 7.14e-16 ***
## x1            8.0138     1.0482   7.645 1.51e-11 ***
## x2           -5.9673     1.0222  -5.838 7.01e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.835 on 97 degrees of freedom
## Multiple R-squared:  0.4859, Adjusted R-squared:  0.4753 
## F-statistic: 45.84 on 2 and 97 DF,  p-value: 9.685e-15

Everything is fine superficially here: we can see a positive effect of \(x_1\) on \(y\), a negative one for \(x_2\), with a reasonable \(R^2\). A similar story is found if some diagnostic plots are generated as in Figure 1.3:

# Residuals vs. Fitted plot
p1 = data.frame(fitted = mod$fitted.values, resids = mod$residuals) %>%
  ggplot(aes(x = fitted, y = resids)) + 
  geom_point(col = "tomato", alpha = 0.5, pch = 19) +
  theme_bw() + ggtitle("Residuals vs Fitted") +
  geom_smooth(se = FALSE, col = "tomato") +
  xlab("Fitted values") + ylab("Residuals")

# Quantiles (QQ) plot
p2 = data.frame(studentised.resids = rstudent(mod)) %>%
  ggplot(aes(sample = studentised.resids)) + 
  stat_qq(col = "dodgerblue", pch = 1, cex = 2.5) + 
  stat_qq_line(col = "dodgerblue", lty = 2) +
  theme_bw() + ggtitle("Normal QQ") +
  xlab("Theoertical Quantiles") + ylab("Studentised Residuals")

plot_grid(p1, p2, ncol = 1)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Diagnostics pllots of the OLS regression of simulated data.

Figure 1.3: Diagnostics pllots of the OLS regression of simulated data.

So far so good. However, if the relationships between the target variable \(y\) and the 2 covariates are examined as in Figure 1.4, then there are perhaps some reasons for concern: although there is a broadly linear and the positive effect for \(x_1\), it is a bit curvy; but the relationship with \(x_2\) is difficult to characterise in a linear way - it is initially sharply positive for low values, and then negative.

dat %>% select(y, x1, x2) %>%
  pivot_longer(-y) %>%
  ggplot(aes(x = value, y = y)) +
  geom_point() +
  facet_wrap(~name) + 
  xlab("")+ ylab("leave") +
  geom_smooth(se = FALSE) +
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Scatterplots of the simualted target variable agsinst the covariates.

Figure 1.4: Scatterplots of the simualted target variable agsinst the covariates.

This is a very common situation: the assumptions of linearity of a standard linear model are often violated, or the model does not capture the nuances of the data very well. Options for handling this may be to explore non-linear relationships such as polynomial regression with quadratic or cubic terms, or through more complex specifications involving sines or cosines etc.

However, these may fail with highly complex relationships, particularly they may fail to capture the true amount of ‘wiggliness’ in the data or they may overfit it. And we frequently have to randomly guess about the form of the relationship until we find a decent model form and associated fits.

One way of dealing with this non-linearity and unknown wiggliness is to fit a series of mini models to different parts of the data in a piecewise way. This can be done by dividing the data into chunks at various points (or knots) along the x-axis, and then fit a linear or polynomial model to that subset of data. The following code fits a linear regression for subsets of \(x_2\) identified manually from the figure above and plots them (Figure ??).

# subset - these define the knots
xa =  dat %>% filter(x2< 0.25)
xb =  dat %>% filter(x2>= 0.25 & x2 < 0.45) 
xc =  dat %>% filter(x2>= 0.45 & x2 < 0.75) 
xd =  dat %>% filter(x2>= 0.75)
# plot
p1 = ggplot(data = rbind(xa, xb, xc, xd), aes(x = x2, y = y)) + 
    geom_point(col = NA) +
    geom_point(data = xa, aes(x2, y), col = "tomato") +
      stat_smooth(data = xa,  aes(x2, y), method = "lm", 
                  se = FALSE, color = "tomato") +
    geom_point(data = xb, aes(x2, y), col = "dodgerblue") +
      stat_smooth(data = xb,  aes(x2, y), method = "lm", 
                  se = FALSE, color = "dodgerblue") +
    geom_point(data = xc, aes(x2, y), col = "goldenrod") +
      stat_smooth(data = xc,  aes(x2, y), method = "lm", 
                  se = FALSE, color = "goldenrod") +
    geom_point(data = xd, aes(x2, y), col = "darkgreen") +
      stat_smooth(data = xd,  aes(x2, y), method = "lm", 
                  se = FALSE, color = "darkgreen") 

And it is easy to can see how these could be extended to other relationships for each subset. The code below uses the example of quadratic relationships between the x2 and y in the subsets and plots them in Figure 1.5:

p2 = ggplot(data = rbind(xa, xb, xc, xd), aes(x = x2, y = y)) + geom_point(col = NA) +
    geom_point(data = xa, aes(x2, y), col = "tomato") +
    stat_smooth(data = xa,  aes(x2, y), method = "lm", 
                formula = y ~ x + I(x^2),
                se = FALSE, color = "tomato") +
    geom_point(data = xb, aes(x2, y), col = "dodgerblue") +
    stat_smooth(data = xb,  aes(x2, y), method = "lm", 
                formula = y ~ x + I(x^2),
                se = FALSE, color = "dodgerblue") +
    geom_point(data = xc, aes(x2, y), col = "goldenrod") +
    stat_smooth(data = xc,  aes(x2, y), method = "lm", 
                formula = y ~ x + I(x^2),
                se = FALSE, color = "goldenrod") +
    geom_point(data = xd, aes(x2, y), col = "darkgreen") +
    stat_smooth(data = xd,  aes(x2, y), method = "lm", 
                formula = y ~ x + I(x^2),
                se = FALSE, color = "darkgreen") 
plot_grid(p1, p2)
Scatterplot of the x2 covariate, subseted into chunks with 'within-variable' quadratic regression trend lines.

Figure 1.5: Scatterplot of the x2 covariate, subseted into chunks with ‘within-variable’ quadratic regression trend lines.

You can see from the plots above how, for a single variable - recall we are just examining x2 in the simulated data, many different relationships with the target variable (y) may be present within the data, and how these can be extracted using knots.

In the code above, the knots - points at which a within-data local relationships were found - were identified manually and mini-models for each subset were only created in within the code for the plot.

However, these ideas of multiple “within the variable” models, underpin the notion of splines or smooths in a Generalized Additive Model. A formal definition is provided in the next session.

For a GAM with splines can be fitted to the data:

gam.mod = gam(y ~ s(x1) + s(x2), data = dat)
gam.check(gam.mod)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 29 iterations.
## The RMS GCV score gradient at convergence was 4.258793e-05 .
## The Hessian was positive definite.
## Model rank =  19 / 19 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##         k'  edf k-index p-value
## s(x1) 9.00 5.71    1.16    0.92
## s(x2) 9.00 5.72    0.99    0.43
summary(gam.mod)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x1) + s(x2)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.832      0.213   41.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##         edf Ref.df     F p-value    
## s(x1) 5.710  6.859 10.86  <2e-16 ***
## s(x2) 5.723  6.863 17.67  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.704   Deviance explained = 73.8%
## GCV = 5.1794  Scale est. = 4.5354    n = 100

The model is well tuned with 9 knots for each covariate and a fixed term for the intercept. The fit has increased from the OLS regression model (mod) defined earlier and the coefficients can be extracted:

coef(gam.mod)
## (Intercept)     s(x1).1     s(x1).2     s(x1).3     s(x1).4     s(x1).5 
##   8.8317045   1.3821500   1.4110734   2.5028131   2.2725032   0.9366032 
##     s(x1).6     s(x1).7     s(x1).8     s(x1).9     s(x2).1     s(x2).2 
##  -1.3508776   1.1290149  -4.1634269  -1.7907158  -5.0219858 -15.5418200 
##     s(x2).3     s(x2).4     s(x2).5     s(x2).6     s(x2).7     s(x2).8 
##  -3.6806831  -5.4304512  -1.4428636   5.1484617   1.4518634  18.6901232 
##     s(x2).9 
##   6.3781799

Notice that there is one coefficient for fixed term and each knot in the smooth. We can see, for example,that the coefficients for x1 have mostly positive slopes but also negative ones in places within the data, and that those for x2 are mostly negative.

plot(coef(gam.mod)[2:10])
abline(h = 0)

Returning to the Brexit data, a GAM with splines / smooths is constructed using the code below. Note how There are many types of spline / smooths (see the help page help(s) for the full set of options). Here the defaults for the gam function were accepted.

gam.bx.mod <- gam(leave~ s(to15) + s(over65) + s(ttu) + s(lhosp) + s(manu) + 
                    s(degree) + s(badhealth) + s(bornuk), 
                  data = hex.sp)

It is possible to check the model to see whether the number of knots has been sufficiently optimised using the gam.check function from the mcgv package. This prints out a number of diagnostic plots as before: QQ-plot, residuals against predictors, a histogram of residuals and response (\(y\)) vs. fitted values (\(\hat{y})\).

gam.check(gam.bx.mod)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 11 iterations.
## The RMS GCV score gradient at convergence was 1.387693e-06 .
## The Hessian was positive definite.
## Model rank =  73 / 73 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value    
## s(to15)      9.00 2.76    0.97    0.21    
## s(over65)    9.00 2.90    0.98    0.28    
## s(ttu)       9.00 4.76    1.02    0.67    
## s(lhosp)     9.00 2.47    1.04    0.84    
## s(manu)      9.00 2.74    1.02    0.67    
## s(degree)    9.00 2.80    0.90  <2e-16 ***
## s(badhealth) 9.00 4.41    1.00    0.42    
## s(bornuk)    9.00 1.00    0.94    0.09 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Perhaps of most interest is the printed summary of the gam.check function, which indicates the number of knots, \(k\) in each spline / smooth and values for the effective degrees of freedom (edf). The edf values can be interpreted as an indicator of the non-linearity of the relationship of the response (pricel) with the predictor variables - i.e. the effective degrees of freedom would have a value of ~1 if the model penalized the smooth term to a near linear relationship. Higher edf values indicate increasing non-linearity in the relationship. Here all edf values are less than \(k\) indicating that the model is adequately specified. Here the p-values relate to the knots parameter \(k\) and whether it may be too low, especially if the reported edf is close to the k' value, the maximum possible EDF for the term. These all look OK.

In the mcgv package, the parameter which controls the degree of smoothing of the data via spline knots is optimised during GAM fitting. This will be returned to in the context of spatially varying coefficient models with Gaussian Process splines in the next session.

Finally, it is also possible to examine the full summary of the GAM including fixed (parametric) coefficient estimates, which can be can be interpreted as linear terms in the model, and a summary of the smooths. All are significant and the model fit (\(R^2\)) has increased compared to the initial models (reg.mod and gam.mod).

summary(gam.bx.mod)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## leave ~ s(to15) + s(over65) + s(ttu) + s(lhosp) + s(manu) + s(degree) + 
##     s(badhealth) + s(bornuk)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  52.0583     0.1907   272.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F  p-value    
## s(to15)      2.764  3.532 59.02  < 2e-16 ***
## s(over65)    2.898  3.760 80.85  < 2e-16 ***
## s(ttu)       4.757  5.880  4.25 0.000402 ***
## s(lhosp)     2.474  3.133 12.95  < 2e-16 ***
## s(manu)      2.742  3.517 14.53  < 2e-16 ***
## s(degree)    2.796  3.608 61.85  < 2e-16 ***
## s(badhealth) 4.408  5.469 21.21  < 2e-16 ***
## s(bornuk)    1.000  1.000 10.39 0.001332 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.824   Deviance explained = 83.1%
## GCV = 23.936  Scale est. = 22.996    n = 632

For the smooth terms (splines), the p-values relate to the smooths defined over the variable space, and their significance can be interpreted as indicating whether they vary within that space. The p-values indicate that all of the covariates are significant locally except manu and badhealth. Any insignificant p-values can be interpreted as indicating that although there are effects, these do not vary locally within this attribute space.

Again the coefficients can be examined, and here the bornuk ones are extremely low.

mat = matrix(coef(gam.bx.mod)[-1], ncol = 8)
colnames(mat) = names(hex.sp)[3:10]
mat
##              to15      over65        ttu       lhosp        manu      degree
##  [1,] -0.03208693 -0.86832884  0.1146659  0.09466284  1.00212782  0.12030573
##  [2,]  0.88319856  0.03839976 -3.0721789  0.76683457  0.34803006  0.18855487
##  [3,]  0.09169345 -0.35863446 -0.8519068 -0.03125884 -0.09331350  0.04480240
##  [4,]  0.69376063  0.17423738  0.6588155 -0.83502142  0.09496868 -0.31274332
##  [5,] -0.18046641  0.30819461 -0.1973320  0.29238087  0.02102715  0.07893378
##  [6,] -0.65060032 -0.16447298  0.7619501 -0.79444011 -0.11200009  0.46008706
##  [7,] -0.21480666  0.31592848 -0.1920892  0.37609755  0.01249560  0.04370047
##  [8,]  4.22577112  3.15070352 -3.7124649 -3.86661485  0.49231716 -2.98262376
##  [9,] -4.94241406  4.89424920 -1.1810484 -1.08596851  1.14810055 -8.24086777
##         badhealth        bornuk
##  [1,]  1.15025416 -3.596843e-10
##  [2,] -3.02825998 -3.876715e-10
##  [3,] -0.16663199  1.938178e-11
##  [4,]  1.27332722 -2.366977e-10
##  [5,]  0.08460121 -1.293325e-12
##  [6,] -1.33469348  1.983867e-10
##  [7,] -0.20290879 -4.053034e-13
##  [8,]  7.41570212  6.654922e-10
##  [9,] -4.91423243 -1.370869e+00

1.5 Summary

The aim of this session was to set the scene for spatially varying coefficient models with GAMs in the next session. It aimed to introduce and illustrate the workings of GAMs and particularly to provide a generalised understanding of what GAM smooths do and how they do it. A key consideration of the specification of knots (\(k\)) in splines and thus nature of the smoothing they specify. These are optimised by the gam function in the mcgv package.

References

Beecham, R., Williams, N., and Comber, A., 2020. Regionally-structured explanations behind area-level populism: An update to recent ecological analyses. Plos one, 15 (3), e0229974.
Hastie, T. and Tibshirani, R., 1987. Generalized additive models: Some applications. Journal of the American Statistical Association, 82 (398), 371–386.
Odell, E., 2020. Package ‘parlitools’. R package version, 12.
Wood, S., 2018. Mixed GAM computation vehicle with automatic smoothness estimation. R package version 1.8-24.