Module 5A Generalized Additive Models

Reading

Basic Concepts

Motivation and Examples

Motivation

Real data rarely depend on a single predictor and may/maynot be normal. We often have several covariates, each with its own nonlinear effect, plus possible interactions. How can we extend penalized splines to this multivariable setting while keeping models interpretable and smoothness data-driven?

Answer: Generalized Additive Model: A GAM models the mean via a link: \[ g\!\big(\mathbb{E}[Y]\big) = \beta_0 + \sum_{j=1}^p s_j(X_j), \] where \(g(\cdot)\) is the link, \(s_j(\cdot)\) are smooth functions (e.g., splines), and \(p\) is the number of predictors.

where:

  • \(Y\) is the response variable,

  • \(g()\) is the link function that relates the expected value of \(Y\) \((E(Y)\) to the linear predictor.

  • \(\beta_0\) is the intercept term.

  • \(s_j(X_j)\) are smooth functions, such as splines, applied to the predictor variable \(X_j\).

  • \(p\) is the number of predictor variables.

  • Each smooth function \(s_j(\cdot)\) allows for greater flexibile, non-linear relationships between \(Y\) and \(X_j\).

Why GAMs?

  • Flexible: captures separate nonlinear effects \(s_j(X_j)\) for each predictor.
  • Interpretable: additive structure → clear partial effects and plots.
  • Automatic smoothing: ech \(\lambda\) is learned from the data (e.g., REML/GCV).
  • Extensible: include factors, linear terms, and interactions via tensor-product smooths (e.g., \(s_{12}(x_1,x_2)\)).

Motivating Example: Low Birthweight in GA Recall from example in Module 3.

Dataset: a cohort of live births from GA born in the year 2001 (N=77,340)

Variables:

  • \(ptb\): indicator for preterm birth for whether baby from pregnancy \(i\) was born <37 weeks.

  • \(age\): mothers age at delivery

  • \(male\): indicator of baby’s sex (1=male, 0=female)

  • \(tobacco\): indicator for mother tobacco use during pregnancy test (1=yes, 0=no)

In previous analysis, we fit this using a logistic regression model, which assumed a binomial data model with a logit-link. We regressed the outcome (probability of low-birth weight) on age, sex, and tobacco status.

But we may want to account for non-linear age effect:

\[ \begin{aligned} \text{ptb}_i &\sim \text{Bernoulli}(p_i),\\ \text{logit}(p_i) &= \beta_0 + \beta_1\,\text{male}_i + \beta_2\,\text{tobacco}_i + s(\text{age}_i). \end{aligned} \]

where \(s(age_i)\) is a smooth function of age.

Note: By default, mgcv::gam() uses cubic regression splines (bs = "cr") with k = 10 basis functions (including the intercept of the smooth).

library (mgcv)
library(ggplot2)
library(dplyr)
# Pre-term births with non-linear age effect
load(paste0(getwd(), '/data/PTB.RData'))

fit.gam = gam(ptb~ s(age) + male + tobacco ,
              family = binomial, data = dat)
summary(fit.gam)

Family: binomial 
Link function: logit 

Formula:
ptb ~ s(age) + male + tobacco

Parametric coefficients:
            Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -2.44226    0.01913 -127.647  < 2e-16 ***
maleM        0.07274    0.02588    2.811  0.00494 ** 
tobacco      0.39016    0.05356    7.284 3.24e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
         edf Ref.df Chi.sq p-value    
s(age) 3.314  4.146  70.17  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.00177   Deviance explained = 0.304%
UBRE = -0.42095  Scale est. = 1         n = 77340

Model Interpretation:

  • The response variable ptb is modeled using a binomial family with a logit link function.

  • The formula includes:

    • A smooth term for age (s(age)).
    • Linear terms for male and tobacco.
  • Intercept: The log-odds of ptb when all predictors are at their reference levels is estimated as \(-2.44\) (highly significant, \(p < 2 \times 10^{-16}\)).

  • maleM (male effect): Being male is associated with a small but significant increase in the log-odds of ptb (estimate = \(0.073\), \(p = 0.00494\)). Tobacco use is associated with a significant increase in the log-odds of ptb (estimate = \(0.39\), \(p < 3.24 \times 10^{-13}\)).

  • Smooth Term for Age (s(age)) : The smooth function for age has an estimated effective degrees of freedom (edf) of \(3.31\), indicating a moderately flexible relationship. Is highly significant (\(p < 2 \times 10^{-16}\)), suggesting a non-linear relationship between age and the log-odds of ptb.

  • Adjusted \(R^2\): The model explains only \(0.177\%\) of the variability in the response, indicating weak predictive power.

  • Deviance explained: The model accounts for \(0.304\%\) of the total deviance, further supporting limited explanatory capability. -

  • UBRE score: - A measure of model fit, with a value of \(-0.42095\). UBRE is similar to AIC for non-normal data: Lower is better!

  • Sample size: - The model was fit on \(77,340\) observations.

Conclusion: While the predictors (age, male, and tobacco) show significant associations with ptb, the overall model has limited explanatory power for predicting ptb.


Checking GAMs

mgcv::gam.check(fit.gam)


Method: UBRE   Optimizer: outer newton
full convergence after 3 iterations.
Gradient range [9.816712e-07,9.816712e-07]
(score -0.4209495 & scale 1).
Hessian positive definite, eigenvalue range [1.285937e-05,1.285937e-05].
Model rank =  12 / 12 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

         k'  edf k-index p-value
s(age) 9.00 3.31    0.94    0.54

Interpretation:

  • Residual plots for fitted values are a good way to check residuals in the normal case. But for binary outcomes, they look odd and are not helpful.
  • \(k' = 9\rightarrow\) the smooth for age was represented using 9 basis functions (default).
  • \(edf=3.31\rightarrow\) the effective degrees of freedom are much lower than 9, indicating that the smooth for age is relatively simple- not over wiggly or overfitted. It if is close to \(k\) then it is overfitting.
  • \(k-index = 0.94\) and \(p-value = 0.44 \rightarrow\)
    • No evidence that the basis dimension \(k\) is too small.
    • The smooth has enough flexibility to capture the nonlinear pattern in age.
    • If the k-index were close to 0 or p-value < 0.05, it would suggest that the smooth is constrained by too few basis functions and should be increased.

Looking at the plot below, what is the trend of age and preterm birth?

Biologically, log-odds of preterm birth is higher at very young ages, lowest around 29 years, and then increased again.


Uncertainty Bands via a Bayesian View

One question we have not yet answered is how we obtain the 95% confidence intervals shown in the smooth function plots.

We assume the underlying smooth function \(s(x)\) is smooth and regularized.
This assumption can be formalized as a prior in a Bayesian framework, where the spline penalty corresponds to a prior distribution on the spline coefficients.

In Bayesian statistics, the parameters \(\boldsymbol{\beta}\) are treated as random variables.
A ridge-type penalty corresponds to placing a (possibly improper) Gaussian prior on \(\boldsymbol{\beta}\):

\[ p(\boldsymbol{\beta}) \propto \exp\!\left(-\tfrac{1}{2}\,\boldsymbol{\beta}^\top \mathbf{K}\,\boldsymbol{\beta}\right), \]

where \(\mathbf{K}\) is the penalty matrix encoding curvature (e.g., based on the integrated squared second derivative).

For Gaussian data, this leads to an approximate posterior distribution:

\[ [\boldsymbol{\beta}\mid \mathbf{y}, \lambda] \sim N\!\left(\hat{\boldsymbol{\beta}},\; \sigma^2\,(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{K})^{-1}\right), \]

where:

  • \(\hat{\boldsymbol{\beta}}\): penalized (posterior mode) estimate,
  • \(\sigma^2\): residual variance,
  • \(\lambda\): smoothing parameter controlling the trade-off between fit and smoothness,
  • \(\mathbf{K}\): penalty matrix.

For non-Gaussian or generalized models, replace \(\mathbf{X}^\top \mathbf{X}/\sigma^2\) by the observed Fisher information \(\hat{\mathbf{I}}\) (the Hessian of the negative log-likelihood at \(\hat{\boldsymbol{\beta}}\)):

\[ [\boldsymbol{\beta}\mid \mathbf{y}, \lambda] \sim N\!\left(\hat{\boldsymbol{\beta}},\; (\hat{\mathbf{I}} + \lambda \mathbf{K})^{-1}\right). \]

For quasi-likelihood models (e.g., quasi-Poisson), scale by the dispersion parameter \(\phi\) as appropriate.

In this framework, the 95% intervals shown for each smooth \(s_j(x)\) are Bayesian credible intervals derived from the approximate posterior distribution of the smooth function.
Importantly, as shown in Wood (2017, §6.10), these credible intervals also demonstrate good frequentist coverage—meaning that the “Bayesian” intervals behave well as classical confidence intervals in repeated-sampling terms.


Multiple Smoothers

So far, we have focused on fitting a single smooth function. In real-world applications, we often have multiple predictors, each contributing its own nonlinear effect on the response. This leads to additive models with multiple smooth components.

Consider the model with two continuous variables \(x_i\) and \(z_i\):

\[ y_i = \beta_0 + g_1(x_i) + g_2(z_i) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2) \]

Here:

  • \(g_1(\cdot)\) and \(g_2(\cdot)\) are smooth, potentially nonlinear functions of their respective predictors.
  • The model can be extended to generalized responses (e.g., binomial, Poisson) through an appropriate link function.

Each smooth can be written as a basis expansion with a curvature penalty: \[ \begin{aligned} g_1(x) &= \sum_{m=1}^{M_1} \beta_{m1}\, b_{m1}^{(d)}(x),\qquad g_2(z) = \sum_{m=1}^{M_2} \beta_{m2}\, b_{m2}^{(d)}(z), \end{aligned} \] with roughness penalties \[ \lambda_1 \int \big[g_1''(x)\big]^2\,dx \;+\; \lambda_2 \int \big[g_2''(z)\big]^2\,dz. \]

We add penalization to each term to induce smoothness so that each function remains flexible but avoids overfitting.


Motivating Example: Air Pollution and Mortality

Scientific Question: What is the association between daily mortality counts and daily concentrations of fine particulate matter (PM₂.₅)?

Context:
Fine particulate matter (PM₂.₅) consists of solid and liquid particles in the air with a diameter less than \(2.5\,\mu\text{m}\). These particles mainly arise from:

  • Power generation,
  • Vehicle emissions, and
  • Industrial activities.

Data Sources (New York City, 2001–2005):

  • Mortality Data: Daily counts of non-accidental deaths (age ≥ 65) from the CDC.
  • Air Pollution Data: Daily PM₂.₅ concentrations from the U.S. EPA.
  • Meteorology Data: Temperature and humidity from the National Climatic Data Center (NOAA).

Time-Series Health Model

We are interested in the association between daily variation in mortality counts and daily variation in air pollution exposure (PM₂.₅). In a time-series design, the population serves as the unit of analysis—our outcome represents total daily deaths in a defined area. Time-varying confounders (e.g., temperature, humidity, long-term trends) are flexibly controlled using smooth functions.

Temporal Lag: There is often a delay between exposure and health response.We therefore model the association between daily mortality and previous-day exposure.

Let

  • \(y_t\) denote the daily count of non-accidental deaths (age ≥ 65) on day \(t\),
  • \(x_{t-1}\) be the lag-1 PM₂.₅ concentration (exposure on the previous day), and
  • \(\text{DOW}_t\) denote the day of week (a categorical factor).

GAM Formula

We model the expected deaths with a log link (quasi-Poisson): \[ \begin{aligned} \log \mathbb{E}[y_t] &= \beta_0 \;+\; f_{\text{PM}}\!\big(x_{t-1}\big) \;+\; \boldsymbol{\alpha}^{\top}\mathbf{1}\{\text{DOW}_t\} \;+\; f_{\text{Season}}(\text{DOY}_t) \;+\; f_{\text{Trend}}(t) \\ &\quad \;+\; f_{\text{Temp}}(\text{Temp}_t) \;+\; f_{\text{Dp}}(\text{DpTemp}_t) \;+\; f_{\text{RMTemp}}(\text{rmTemp}_t) \;+\; f_{\text{RMDp}}(\text{rmDpTemp}_t). \end{aligned} \]

Each \(f_j\) is a penalized spline with

\[ f_j(u) = \sum_{m=1}^{k_j} \beta_{jm} b_{jm}(u) \]

and penalty \(\lambda_j\,\boldsymbol{\beta}_j^\top \mathbf{K}_j \boldsymbol{\beta}_j.\)

Fitting in R

# Lag exposure variable
health$pm25_lag1 <- c(NA, head(health$pm25, -1))
health$doy <- as.integer(format(health$date, "%j"))   # Day of year
health$fdow <- factor(health$dow)
health$fdow <- relevel(health$fdow, ref = "Sunday")

fit <- gam(
  cr65plus ~ s(pm25_lag1) + fdow +
    s(doy, bs = "cc", k = 30) +             # cyclic annual seasonality
    s(date2, k = 100) +                     # long-term trend
    s(Temp) + s(DpTemp) +                   # same-day meteorology
    s(rmTemp) + s(rmDpTemp),                # running means (lagged effects)
  family = quasipoisson(link = "log"),
  data = health,
  method = "REML" #better for GAMs
)
summary(fit)

Family: quasipoisson 
Link function: log 

Formula:
cr65plus ~ s(pm25_lag1) + fdow + s(doy, bs = "cc", k = 30) + 
    s(date2, k = 100) + s(Temp) + s(DpTemp) + s(rmTemp) + s(rmDpTemp)

Parametric coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.226470   0.007835 539.468   <2e-16 ***
fdowFriday    -0.011713   0.011098  -1.055   0.2914    
fdowMonday     0.023848   0.010998   2.168   0.0303 *  
fdowSaturday  -0.008918   0.011068  -0.806   0.4205    
fdowThursday   0.003806   0.011051   0.344   0.7306    
fdowTuesday   -0.001758   0.011058  -0.159   0.8737    
fdowWednesday -0.001287   0.011060  -0.116   0.9074    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                edf Ref.df     F p-value    
s(pm25_lag1)  1.000  1.000 5.427 0.01994 *  
s(doy)       13.651 28.000 7.597 < 2e-16 ***
s(date2)     11.111 13.579 9.336 < 2e-16 ***
s(Temp)       1.687  2.164 6.506 0.00126 ** 
s(DpTemp)     1.002  1.004 4.422 0.03539 *  
s(rmTemp)     5.420  6.652 2.914 0.00596 ** 
s(rmDpTemp)   1.000  1.000 7.893 0.00501 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.455   Deviance explained = 46.4%
-REML = 1079.7  Scale est. = 1.0884    n = 1825

Model Interpretation

  • Lagged PM₂.₅ association:
    The smooth function of previous-day PM₂.₅ (\(s(\text{pm25\_lag1})\)) is statistically significant (p = 0.0234), indicating that higher PM₂.₅ levels are associated with increased daily mortality among adults aged 65+.

  • Seasonality and long-term trends:
    The smooth terms for day-of-year (\(s(\text{doy})\)) and calendar time (\(s(\text{date2})\)) are both highly significant (p < 2 × 10⁻¹⁶), reflecting strong seasonal and gradual long-term patterns in mortality rates.

  • Meteorological covariates:
    Temperature (\(s(\text{Temp})\), p = 0.0009), dew point temperature (\(s(\text{DpTemp})\), p = 0.064), and their running means (\(s(\text{rmTemp})\), p = 0.002; \(s(\text{rmDpTemp})\), p = 0.003) all show significant smooth effects, suggesting both same-day and lagged weather conditions influence mortality.

  • Overall model performance:
    The model explains 49.7% of the deviance (R²₍adj₎ = 0.48), indicating a strong fit for daily mortality time series data. Smoothers with higher effective degrees of freedom (edf) capture complex nonlinear patterns, while penalization prevents overfitting.

Model Diagnostics

Again, we use gam.check() to evaluate the adequacy of the spline basis dimensions and the overall stability of the fitted GAM. This diagnostic checks whether each smooth term has enough flexibility (via its basis size \(k\)) to capture the underlying pattern in the data, and verifies model convergence, gradient behavior, and Hessian stability.
If a term shows a low k-index (< 1) with a significant p-value, it suggests that the chosen \(k\) is too low and the smooth may be underfitting.
Otherwise, large p-values indicate that the current basis dimension is sufficient for that term.

gam.check(fit)


Method: REML   Optimizer: outer newton
full convergence after 10 iterations.
Gradient range [-0.0002595022,0.0008164376]
(score 1079.705 & scale 1.08843).
Hessian positive definite, eigenvalue range [7.693588e-06,906.0847].
Model rank =  179 / 179 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

                k'   edf k-index p-value    
s(pm25_lag1)  9.00  1.00    1.03    0.93    
s(doy)       28.00 13.65    1.02    0.81    
s(date2)     99.00 11.11    0.93  <2e-16 ***
s(Temp)       9.00  1.69    1.00    0.52    
s(DpTemp)     9.00  1.00    1.00    0.51    
s(rmTemp)     9.00  5.42    1.05    0.97    
s(rmDpTemp)   9.00  1.00    1.01    0.65    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Basis dimension adequacy:
    Most smooth terms have high k-index values (~1) and non-significant p-values, indicating that the chosen basis dimensions (\(k\)) were adequate for capturing the data’s complexity.

  • Potential under-smoothing:
    If any term had a lower k-index and significant p-value, suggesting that the basis dimension might be slightly restrictive for long-term trends. Increasing \(k\) could allow more temporal flexibility if needed.

  • Overall adequacy:
    The diagnostic results overall confirm that the spline basis choices and penalization settings are appropriate.


Autocorrelation in residuals


r colorize(“Summary”, “purple”)

Section Description
What is the Model? A Generalized Additive Model (GAM) extends GLMs by allowing each predictor to have its own smooth, potentially nonlinear effect: \(g(\mathbb{E}[Y]) = \beta_0 + \sum_{j=1}^p s_j(X_j)\).
Why Use GAMs? To capture nonlinear relationships between predictors and outcomes while keeping the model additive and interpretable. Each smooth \(s_j(X_j)\) shows a separate partial effect.
Key Idea Replace linear terms with penalized splines; the penalty \(\lambda_j \boldsymbol{\beta}_j'\mathbf{K}_j\boldsymbol{\beta}_j\) controls smoothness. Each \(\lambda_j\) is estimated from data (e.g., REML or GCV).
Bayesian View The penalty acts as a Gaussian prior on spline coefficients: \(p(\boldsymbol{\beta}) \propto \exp[-\tfrac{1}{2}\boldsymbol{\beta}'\mathbf{K}\boldsymbol{\beta}]\). The resulting uncertainty bands are approximate Bayesian credible intervals with good frequentist coverage.
Multiple Smoothers GAMs combine several smooth functions additively: \(y_i = \beta_0 + g_1(x_i) + g_2(z_i) + \varepsilon_i\), each with its own smoothing parameter.
Interactions Use tensor-product smooths \(s(x_1, x_2)\) to model smooth interactions between predictors.
Model Checking gam.check() assesses whether basis dimension \(k\) is large enough and whether smooths are over- or under-fitted. A \(k\)-index near 1 indicates adequate flexibility.
Example 1 – Preterm Birth Logistic GAM with \(s(\text{age})\) reveals a U-shaped age effect on log-odds of preterm birth—highest at youngest and oldest mothers.
Example 2 – Air Pollution Poisson GAM of daily deaths vs. lagged PM₂.₅ shows nonlinear exposure effects, seasonality, and long-term trend using multiple smooths.
Interpretation Each smooth \(s_j(X_j)\) displays the partial, nonlinear effect of \(X_j\) on the response scale defined by the link. Shaded bands show uncertainty.
In Summary GAMs unify splines and GLMs, offering a flexible, interpretable, and data-driven framework for modeling nonlinear effects across multiple predictors.