Module 5 Cheat Sheet

Overview

  • Generalized Additive Models (GAMs) extend GLMs by allowing nonlinear smooth effects of predictors.
  • Core idea (Module 5A): model the mean via a link as an additive sum of smooth functions: \[ g(\mathbb{E}[Y_i]) = \beta_0 + \sum_{j=1}^p s_j(X_{ij}). \]
  • Smooths \(s_j(\cdot)\) are represented using basis functions (splines) plus penalties that control wiggliness.
  • Module 5A focuses on:
    • additive smooths (one or many \(s_j\)),
    • penalized estimation and uncertainty bands,
    • a time-series GAM for NYC mortality and PM\(_{2.5}\).
  • Module 5B extends GAMs to:
    • smooth interactions between continuous variables using tensor-product smooths \(s_{12}(x,z)\),
    • varying-coefficient smooths where \(s(x)\) varies across groups using by=.

1. Generalized Additive Models (Module 5A)

Overview

  • GAMs are GLMs where linear terms are replaced by smooth functions: \[ g(\mathbb{E}[Y_i]) = \beta_0 + \sum_{j=1}^p s_j(X_{ij}). \]
  • Each \(s_j(\cdot)\) captures a potentially nonlinear effect of predictor \(X_j\) on the link scale.
  • Implementation in R: mgcv::gam() with smooth terms like s(age), s(pm25_lag1), s(doy, bs = "cc"), etc.

Basis representation and roughness penalties

  • A single smooth \(s(x)\) can be written as a spline basis expansion: \[ s(x) = \sum_{m=1}^{k} \beta_m b_m(x), \] where \(b_m(x)\) are spline basis functions and \(\beta_m\) are coefficients.

  • To avoid overfitting, curvature is penalized; for example, \[ \lambda \int [s''(x)]^2 \, dx, \] where \(\lambda \ge 0\) is a smoothing parameter (larger \(\lambda\) = smoother \(s\)).

  • With multiple smooths \(s_1(x)\) and \(s_2(z)\): \[ s_1(x) = \sum_{m=1}^{k_1} \beta_{1m} b_{1m}(x), \quad s_2(z) = \sum_{m=1}^{k_2} \beta_{2m} b_{2m}(z), \] with separate penalties \[ \lambda_1 \int [s_1''(x)]^2 dx + \lambda_2 \int [s_2''(z)]^2 dz. \]

Penalized least squares / penalized likelihood

  • Gaussian case (least squares): \[ \min_{\boldsymbol{\beta}}\Big\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda_1 \boldsymbol{\beta}_1^\top \mathbf{K}_1 \boldsymbol{\beta}_1 + \lambda_2 \boldsymbol{\beta}_2^\top \mathbf{K}_2 \boldsymbol{\beta}_2 \Big\}, \] where \(\mathbf{K}_j\) encodes curvature of \(s_j\).
  • General GAM (any exponential-family distribution with link \(g\)): \[ g(\mathbb{E}[Y_i]) = \beta_0 + s_1(x_i) + s_2(z_i), \] estimated by minimizing the penalized negative log-likelihood: \[ \min_{\boldsymbol{\beta}}\Big\{ -\ell(\boldsymbol{\beta}\mid\mathbf{y}) + \lambda_1 \boldsymbol{\beta}_1^\top \mathbf{K}_1 \boldsymbol{\beta}_1 + \lambda_2 \boldsymbol{\beta}_2^\top \mathbf{K}_2 \boldsymbol{\beta}_2 \Big\}. \]
  • Smoothing parameters \(\lambda_j\) are chosen from the data (e.g., REML, GCV).
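
For a fixed \(\lambda\), the Gaussian criterion above has the closed-form solution \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{K})^{-1}\mathbf{X}^\top\mathbf{y}\). The base-R sketch below illustrates this for a single smooth under stated assumptions: the data are simulated, the basis is a B-spline basis from splines::bs(), and K is a second-difference (P-spline style) stand-in for the curvature penalty, not the matrix mgcv builds internally.

```r
library(splines)  # ships with base R; provides bs()

set.seed(1)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)   # simulated data (assumption)

# Basis matrix: column m holds b_m(x) evaluated at the data
B <- unclass(bs(x, df = 20, intercept = TRUE))
k <- ncol(B)

# Second-difference penalty K = D'D, a discrete stand-in for the integral of s''(x)^2
D <- diff(diag(k), differences = 2)
K <- crossprod(D)

# Penalized least squares coefficients for a fixed lambda
fit_pls <- function(lambda) {
  solve(crossprod(B) + lambda * K, crossprod(B, y))
}

# Value of the roughness penalty beta' K beta at a given coefficient vector
penalty <- function(beta) drop(t(beta) %*% K %*% beta)

beta_wiggly <- fit_pls(1e-4)   # weak penalty: flexible curve
beta_smooth <- fit_pls(1e+2)   # strong penalty: much smoother curve

c(wiggly = penalty(beta_wiggly), smooth = penalty(beta_smooth))
```

The roughness term shrinks as \(\lambda\) grows, which is the sense in which larger \(\lambda\) gives a smoother \(s\). In mgcv the same trade-off is resolved automatically by REML or GCV rather than by fixing \(\lambda\) by hand.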

Example 1: PTB logistic GAM (nonlinear age effect)

  • Data: GA birth cohort; outcome = preterm birth (ptb), predictors = age, male, tobacco.
  • Model: \[ \begin{aligned} \text{ptb}_i &\sim \text{Bernoulli}(p_i),\\ \text{logit}(p_i) &= \beta_0 + \beta_1 \,\text{male}_i + \beta_2 \,\text{tobacco}_i + s(\text{age}_i). \end{aligned} \]
  • s(age) uses a spline basis (in mgcv the default is a thin plate regression spline, bs = "tp", with \(k=10\); a cubic regression spline is requested with bs = "cr").
  • Effective degrees of freedom (edf) for s(age) summarize the complexity of the age effect:
    • edf \(\approx 1\) → nearly linear,
    • edf \(> 1\) → nonlinear shape.
  • Fitted smooth: U-shaped relationship in log-odds of preterm birth: highest risk at very young ages, lowest around age \(\approx 29\), then increasing again at older ages.
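
A minimal sketch of how a model of this form can be fit with mgcv. The GA birth cohort is not available here, so the data frame below is simulated with a U-shaped age effect purely for illustration; the coefficients and sample size are assumptions, and only the variable names ptb, age, male and tobacco come from the example above.

```r
library(mgcv)

set.seed(42)
n <- 2000
dat <- data.frame(
  age     = runif(n, 15, 45),
  male    = rbinom(n, 1, 0.5),
  tobacco = rbinom(n, 1, 0.1)
)

# Simulated U-shaped log-odds with its minimum at age 29 (assumption, not the cohort)
eta <- -2.2 + 0.1 * dat$male + 0.4 * dat$tobacco + 0.004 * (dat$age - 29)^2
dat$ptb <- rbinom(n, 1, plogis(eta))

# Logistic GAM: parametric terms for male and tobacco, one smooth for age.
# bs = "cr" requests a cubic regression spline explicitly.
fit_ptb <- gam(ptb ~ male + tobacco + s(age, bs = "cr", k = 10),
               family = binomial, data = dat, method = "REML")

summary(fit_ptb)$s.table      # edf and approximate test for s(age)
plot(fit_ptb, shade = TRUE)   # smooth on the log-odds scale with a pointwise band
```

The plotted curve is centered and shown on the link (log-odds) scale, so its shape, not its absolute level, is what is interpreted.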

2. Multiple Smoothers & Time-Series GAM (Module 5A)

NYC mortality and air pollution example

  • Data (NYC, 2001–2005):
    • Outcome: daily non-accidental deaths, age \(\ge 65\) (cr65plus).
    • Main exposure: PM\(_{2.5}\) (daily, plus lagged versions).
    • Confounders: temperature, dew point temperature, long-term and seasonal time trends, day of week.

Time-series GAM model

  • Let:
    • \(y_t\) = daily count of deaths,
    • \(x_{t-1}\) = lag-1 PM\(_{2.5}\) (previous-day exposure),
    • \(\text{DOW}_t\) = day of week,
    • \(\text{DOY}_t\) = day of year,
    • \(t\) = time index,
    • \(T_t\), \(Dp_t\) = same-day temperature and dew point temperature,
    • \(\text{rmTemp}_t\), \(\text{rmDp}_t\) = running-mean meteorology (lags).
  • Example model (quasi-Poisson, log link): \[ \begin{aligned} \log \mathbb{E}[y_t] &= \beta_0 + s(\text{pm25\_lag1}_t) + \boldsymbol{\alpha}^\top \mathbf{1}\{\text{DOW}_t\} \\ &\quad + s(\text{DOY}_t) + s(t) + s(T_t) + s(Dp_t) + s(\text{rmTemp}_t) + s(\text{rmDp}_t). \end{aligned} \]
  • Each smooth \(s(\cdot)\) is represented by spline bases with its own penalty.
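
The structure of this model can be sketched in mgcv as follows. The NYC data are not reproduced here, so the series below is simulated; the variable names follow the ones used elsewhere in these notes (cr65plus, pm25.lag1, fdow, doy, date2, Temp), while the effect sizes and the reduced set of meteorology terms are assumptions made to keep the example short.

```r
library(mgcv)

set.seed(7)
n <- 5 * 365
sim <- data.frame(date2 = seq_len(n))                 # time index t
sim$doy       <- (sim$date2 - 1) %% 365 + 1           # day of year
sim$fdow      <- factor((sim$date2 - 1) %% 7)         # day of week as a factor
sim$Temp      <- 55 + 20 * sin(2 * pi * (sim$doy - 110) / 365) + rnorm(n, sd = 5)
sim$pm25.lag1 <- rgamma(n, shape = 4, scale = 3.5)    # simulated lag-1 exposure

# Simulated daily counts with a seasonal cycle and a small exposure effect
mu <- exp(4 + 0.1 * cos(2 * pi * sim$doy / 365) + 0.002 * sim$pm25.lag1)
sim$cr65plus <- rpois(n, mu)

fit_ts <- gam(
  cr65plus ~ s(pm25.lag1) +              # exposure-response smooth
    fdow +                               # day-of-week indicators
    s(doy, bs = "cc", k = 30) +          # cyclic seasonal smooth
    s(date2, k = 20) +                   # long-term trend
    s(Temp),                             # same-day temperature
  family = quasipoisson(link = "log"), data = sim
)

summary(fit_ts)
```

The cyclic basis (bs = "cc") forces the seasonal smooth to join up at the start and end of the year, which a default basis would not guarantee.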

Interpretation highlights

  • s(pm25_lag1):
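    • Statistically significant smooth; the fitted curve indicates that higher lagged PM\(_{2.5}\) is associated with increased daily mortality among older adults (the test only shows the smooth is not flat, so the direction is read from the plot).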
  • s(doy) and s(date2), where date2 is the time index \(t\):
    • Highly significant → strong seasonal and long-term patterns in mortality.
  • s(Temp), s(DpTemp), s(rmTemp), s(rmDpTemp):
    • Capture complex, nonlinear meteorological effects (both same-day and lagged).

Model performance (from Module 5A)

  • Deviance explained \(\approx 49.7\%\); adjusted \(R^2 \approx 0.48\).
  • Smoothers with higher edf capture more complex shapes, but penalization keeps curves from overfitting.

Model checking: gam.check()

  • gam.check(fit) evaluates:
    • basis dimension \(k\) adequacy for each smooth,
    • residual diagnostics and stability.
  • Key output:
    • \(k'\): maximum possible edf for the smooth (the basis dimension after identifiability constraints),
    • edf: effective degrees of freedom,
    • \(k\)-index and p-value for each smooth.
  • Interpretation:
    • \(k\)-index near 1 and large p-value → chosen \(k\) is adequate.
    • Low \(k\)-index (well below 1) with small p-value, especially when edf is close to \(k'\) → basis may be too small; increase \(k\), refit, and check again.
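
A short sketch of the basis-dimension check, using simulated data with a deliberately wiggly signal (an assumption for illustration). mgcv's k.check() returns the same table that gam.check() prints, without drawing the residual plots.

```r
library(mgcv)

set.seed(3)
n <- 500
x <- runif(n)
y <- sin(12 * pi * x) + rnorm(n, sd = 0.3)   # simulated wiggly signal (assumption)

fit_small <- gam(y ~ s(x, k = 5),  method = "REML")   # basis too small for this signal
fit_large <- gam(y ~ s(x, k = 40), method = "REML")   # more generous basis

# Columns: k', edf, k-index, p-value
k.check(fit_small)
k.check(fit_large)

# gam.check(fit_large) prints the same table and adds residual diagnostic plots
```

With the small basis the residuals typically retain visible structure in x, which is what a low k-index flags; refitting with larger k and re-running the check is the usual remedy.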

3. Uncertainty Bands for Smooths (Module 5A)

Penalized estimation and approximate distribution

  • Let a smooth \(s_j(x)\) have basis vector \(\mathbf{b}(x)\) and coefficients \(\boldsymbol{\beta}_j\): \[ s_j(x) = \mathbf{b}(x)^\top \boldsymbol{\beta}_j. \]
  • GAM estimation maximizes: \[ \ell(\boldsymbol{\beta};\mathbf{y}) - \tfrac{1}{2}\sum_j \lambda_j \boldsymbol{\beta}_j^\top \mathbf{K}_j \boldsymbol{\beta}_j, \] where \(\mathbf{K}_j\) is a penalty matrix and \(\lambda_j\) controls smoothness.
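  • Near the optimum, inference uses the large-sample approximation \[ \boldsymbol{\beta} \mid \mathbf{y} \approx N\!\Big(\hat{\boldsymbol{\beta}}, \mathbf{V}_\beta\Big), \quad \mathbf{V}_\beta \approx \big(\hat{\mathbf{I}} + \sum_j \lambda_j \mathbf{K}_j\big)^{-1}, \] with \(\hat{\mathbf{I}}\) the information matrix of the unpenalized likelihood evaluated at \(\hat{\boldsymbol{\beta}}\). The penalty biases \(\hat{\boldsymbol{\beta}}\), so this is best read as an approximate posterior (the default covariance in mgcv) rather than a sampling distribution centered at the true coefficients.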

Standard errors and bands for \(s_j(x)\)

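  • For a given \(x\): \[ \hat{s}_j(x) = \mathbf{b}(x)^\top \hat{\boldsymbol{\beta}}_j, \quad \mathrm{SE}\{\hat{s}_j(x)\} = \sqrt{\mathbf{b}(x)^\top \mathbf{V}_{\beta,j} \mathbf{b}(x)}, \] where \(\mathbf{V}_{\beta,j}\) is the block of \(\mathbf{V}_\beta\) belonging to \(\boldsymbol{\beta}_j\).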
  • Pointwise 95% interval on the linear predictor scale: \[ \hat{s}_j(x) \pm 1.96 \times \mathrm{SE}\{\hat{s}_j(x)\}. \]

Response scale interpretation

  • For non-identity links, intervals are first computed for the linear predictor \(\hat{\eta}(x)\) on the link scale, then both endpoints are transformed with the inverse link:
    • Poisson/log: \(\hat{\mu}(x) = \exp\{\hat{\eta}(x)\}\),
    • Binomial/logit: \(\hat{p}(x) = \operatorname{logit}^{-1}\{\hat{\eta}(x)\}\).
  • In mgcv plots, shaded bands represent pointwise confidence intervals for \(s_j(x)\) on the link scale:
    • Narrow where data are dense,
    • Wider where data are sparse.
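
The band calculation can be reproduced by hand from a fitted mgcv model. The sketch below uses simulated Poisson data (an assumption) and works with the full linear predictor \(\hat{\eta}(x)\), intercept included, rather than a single centered smooth; predict(type = "lpmatrix") supplies the rows \(\mathbf{b}(x)^\top\) and vcov() supplies \(\mathbf{V}_\beta\).

```r
library(mgcv)

set.seed(11)
n <- 400
x <- runif(n, 0, 10)
y <- rpois(n, exp(0.5 + 0.3 * sin(x)))   # simulated counts (assumption)

fit  <- gam(y ~ s(x), family = poisson, method = "REML")
newd <- data.frame(x = seq(0, 10, length.out = 50))

# Link-scale fit and standard error from predict()
p <- predict(fit, newd, type = "link", se.fit = TRUE)

# Same quantities by hand: each row of Xp is the basis vector at one x (plus intercept)
Xp  <- predict(fit, newd, type = "lpmatrix")
eta <- drop(Xp %*% coef(fit))
se  <- sqrt(rowSums((Xp %*% vcov(fit)) * Xp))

# 95% pointwise band on the link scale, then mapped to the response scale
mu    <- exp(eta)
lower <- exp(eta - 1.96 * se)
upper <- exp(eta + 1.96 * se)

all.equal(as.numeric(p$se.fit), as.numeric(se))   # manual SE agrees with predict()
```

Transforming the endpoints, rather than computing a symmetric interval around \(\hat{\mu}\), keeps the band inside the valid range of the response (positive for counts, between 0 and 1 for probabilities).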

4. Tensor-Product Smooths for Interactions (Module 5B)

Motivation

  • Additive GAMs from Module 5A allow separate smooths \(s_1(x)\) and \(s_2(z)\), but no smooth interaction.
  • In many applications, the effect of one continuous predictor changes smoothly across levels of another:
    • e.g., joint effect of temperature and PM\(_{2.5}\) on mortality.
  • We want a bivariate surface \(s_{12}(x,z)\), not a smooth of the product \(xz\).

Model form with interaction

  • For two continuous predictors \(x\) and \(z\): \[ g(\mathbb{E}[Y]) = \beta_0 + s_1(x) + s_2(z) + s_{12}(x,z), \] where:
    • \(s_1(x)\) = smooth main effect of \(x\),
    • \(s_2(z)\) = smooth main effect of \(z\),
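    • \(s_{12}(x,z)\) = smooth interaction (bivariate surface). In mgcv, this main-effects-plus-interaction form is fit with ti(x, z) for \(s_{12}\); te(x, z) on its own already contains both main effects.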

Basis via tensor products

  • Let:
    • \(\mathbf{b}_x(x) \in \mathbb{R}^Q\) = spline basis for \(x\),
    • \(\mathbf{b}_z(z) \in \mathbb{R}^P\) = spline basis for \(z\).
  • Tensor-product basis: \[ \mathbf{b}_{xz}(x,z) = \mathbf{b}_x(x) \otimes \mathbf{b}_z(z), \] where \(\otimes\) is the Kronecker product.
  • Bivariate smooth: \[ s_{12}(x,z) = \mathbf{b}_{xz}(x,z)^\top \boldsymbol{\beta}, \] with \(Q \times P\) basis functions and coefficients before identifiability constraints are applied.

Anisotropic penalties

  • 1D penalty matrices:
    • \(\mathbf{K}_x\) controls curvature in \(x\)-direction,
    • \(\mathbf{K}_z\) controls curvature in \(z\)-direction.
  • Tensor-product penalty: \[ \mathcal{P}(\boldsymbol{\beta}) = \lambda_x \boldsymbol{\beta}^\top (\mathbf{K}_x \otimes \mathbf{I}_P)\boldsymbol{\beta} + \lambda_z \boldsymbol{\beta}^\top (\mathbf{I}_Q \otimes \mathbf{K}_z)\boldsymbol{\beta}, \] where:
    • \(\lambda_x\) controls smoothing horizontally (vary \(x\), hold \(z\)),
    • \(\lambda_z\) controls smoothing vertically (vary \(z\), hold \(x\)).
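
The Kronecker structure above can be checked numerically in base R. This is a toy sketch: the basis values are random stand-ins for \(\mathbf{b}_x(x)\) and \(\mathbf{b}_z(z)\) at a single point, and second-difference matrices stand in for \(\mathbf{K}_x\) and \(\mathbf{K}_z\) (both assumptions). Arranging the coefficients in a \(Q \times P\) grid Theta makes the two penalty directions explicit.

```r
set.seed(5)
Q <- 4; P <- 3
bx <- runif(Q)                           # stand-in for b_x(x) at one point
bz <- runif(P)                           # stand-in for b_z(z) at one point
Theta <- matrix(rnorm(Q * P), Q, P)      # coefficient grid, Theta[q, p]
beta  <- as.vector(t(Theta))             # ordering that matches kronecker(bx, bz)

bxz <- as.vector(kronecker(bx, bz))      # tensor-product basis vector, length Q * P
s12 <- sum(bxz * beta)                   # s_12(x, z) = b_xz' beta

# Difference penalties as simple stand-ins for K_x and K_z
Kx <- crossprod(diff(diag(Q), differences = 2))
Kz <- crossprod(diff(diag(P), differences = 2))

pen_x <- drop(t(beta) %*% kronecker(Kx, diag(P)) %*% beta)   # roughness along x
pen_z <- drop(t(beta) %*% kronecker(diag(Q), Kz) %*% beta)   # roughness along z

# Equivalent grid forms: penalize each column of Theta in x, each row in z
c(pen_x, sum(diag(t(Theta) %*% Kx %*% Theta)))
c(pen_z, sum(diag(Theta %*% Kz %*% t(Theta))))
```

Because each direction has its own \(\lambda\), the surface can be smooth in \(x\) and wiggly in \(z\) (or the reverse), which matters when the two predictors are measured in different units.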

Example: NYC mortality with Temp × PM\(_{2.5}\) interaction

  • Model (Poisson/log link): \[ \log \mathbb{E}[y_t] = \beta_0 + s_{12}(\text{Temp}_t, \text{PM}_{t-1}) + \boldsymbol{\alpha}^\top \mathbf{1}\{\text{DOW}_t\} + s(\text{DOY}_t) + s(t), \] where \(s_{12}(\text{Temp},\text{PM})\) is implemented as te(Temp, pm25.lag1).
  • Interpretation from Module 5B:
    • Tensor-product smooth for Temp × PM\(_{2.5}\) is highly significant.
    • Certain combinations of higher temperature and PM\(_{2.5}\) are associated with elevated mortality risk.
  • Basis dimension:
    • Marginal bases for Temp and PM\(_{2.5}\) often default to small \(k\) (e.g., \(k_x = k_z = 5\)),
    • Tensor-product grid has up to \(Q \times P = 5 \times 5 = 25\) basis functions,
    • Penalization shrinks many directions toward zero; effective degrees of freedom (edf \(\approx 5.1\) in the example) reflect the complexity of the fitted surface.
  • Note: Unlike univariate smooths, te(x,z) does not automatically use \(k=10\) per margin; defaults are smaller unless specified via k = c(k_x, k_z).
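
A sketch of fitting a tensor-product surface with explicit marginal basis sizes. The data are simulated with an interaction that appears only at high temperatures (an assumption for illustration); only the variable names Temp and pm25.lag1 are taken from the example above.

```r
library(mgcv)

set.seed(21)
n <- 1500
d <- data.frame(
  Temp      = runif(n, 20, 90),
  pm25.lag1 = rgamma(n, shape = 4, scale = 3.5)
)

# Simulated log-rate: exposure matters more on hot days (assumption)
eta <- 4 + 0.0005 * pmax(d$Temp - 70, 0) * d$pm25.lag1
d$y <- rpois(n, exp(eta))

# Tensor-product smooth with 5 basis functions per margin
fit_te <- gam(y ~ te(Temp, pm25.lag1, k = c(5, 5)),
              family = poisson, data = d, method = "REML")

length(coef(fit_te))        # intercept plus the constrained tensor-product coefficients
summary(fit_te)$s.table     # edf of the fitted surface

# Contour view of the fitted surface on the link scale
vis.gam(fit_te, view = c("Temp", "pm25.lag1"), plot.type = "contour", type = "link")
```

The 5 × 5 grid gives 25 basis functions; one is absorbed by the sum-to-zero constraint that separates the surface from the intercept, and penalization then reduces the edf further.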

5. Varying-Coefficient Smooths (Effect Modification, Module 5B)

Motivation

  • Sometimes the effect of a continuous predictor \(x\) differs by categories of a grouping variable \(G\) (e.g., day of week, sex, site).
  • We want \(s(x)\) to vary by group, but we are not modeling a full 2D continuous surface in \((x,G)\).
  • Example from Module 5B: PM\(_{2.5}\)–mortality relationship varying by day of week.

Model form

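  • Let \(G\) be a categorical variable with a reference group and other levels (in mgcv, code \(G\) as an ordered factor so that by = G creates smooths only for the non-reference levels; with an unordered factor, by = G creates a separate centered smooth for every level).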
  • Varying-coefficient GAM: \[ g(\mathbb{E}[Y]) = \beta_0 + \beta_G G + s(x) + s(x,\text{by}=G), \] where:
    • \(s(x)\) = baseline smooth (reference group),
    • \(s(x,\text{by}=G)\) = group-specific difference smooths,
    • \(\beta_G G\) = parametric main effect of \(G\) (needed because each smooth is centered and so cannot absorb differences in group means).

Basis representation

  • Suppose \(G\) has \(C\) categories and \(b_m(x)\), \(m=1,\dots,k\) are the basis functions for \(s(x)\). Then: \[ s(x) + s(x,\text{by}=G) = \sum_{m=1}^k \beta_m b_m(x) + \sum_{c=1}^{C-1} \sum_{m=1}^k \gamma_{mc} b_m(x) I(G = c), \] where:
    • first term = baseline smooth for the reference group,
    • second term = \((C-1)\) deviation smooths, one per non-reference group.

Penalties

  • Each smooth (baseline and deviations) gets its own penalty: \[ \lambda_{\text{base}} \boldsymbol{\beta}^\top K \boldsymbol{\beta} + \sum_{c=1}^{C-1} \lambda_c \boldsymbol{\gamma}_c^\top K \boldsymbol{\gamma}_c. \]
  • Interpretation:
    • \(\boldsymbol{\gamma}_c\) = coefficients for group-\(c\) deviation curve,
    • \(K\) = shared curvature penalty matrix,
    • large \(\lambda_c\) → deviation smooth shrinks toward 0 (group curve similar to baseline),
    • small \(\lambda_c\) → more flexible, group-specific shape.
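
A sketch of the baseline-plus-deviations structure in mgcv, using simulated data with three groups (an assumption for illustration). The grouping variable is supplied twice: as a plain factor g for the parametric group means, and as an ordered factor og in by =, which is what makes mgcv skip the first level and treat the remaining smooths as differences from the baseline.

```r
library(mgcv)

set.seed(9)
n <- 1200
g <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
x <- runif(n)

# Groups a and b share one curve (b is shifted); group c has an extra curved effect
y <- 1 + 0.5 * (g == "b") + sin(2 * pi * x) + (g == "c") * 1.5 * x^2 +
  rnorm(n, sd = 0.3)

d <- data.frame(y = y, x = x, g = g, og = as.ordered(g))

fit_vc <- gam(y ~ g +             # parametric group means
                s(x) +            # baseline smooth (reference level "a")
                s(x, by = og),    # difference smooths for levels "b" and "c"
              data = d, method = "REML")

summary(fit_vc)$s.table   # one row for the baseline, one per non-reference level
```

A difference smooth whose edf is small and whose test is not significant suggests that group's curve is close to the baseline, which is how the day-of-week results in these notes are read.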

Example: PM\(_{2.5}\) effect by day of week (NYC)

  • Model (Poisson/log link): \[ \log \mathbb{E}[y_t] = \beta_0 + s(\text{PM}_{t-1}) + s(\text{PM}_{t-1},\text{by}=\text{DOW}_t) + s(\text{DOY}_t) + s(t), \] implemented as:

    alldeaths ~
      s(pm25.lag1) +
      s(pm25.lag1, by = fdow) +
      s(doy, bs = "cc", k = 30) +
      s(date2, k = 100)

  • Interpretation from Module 5B:
    • Baseline smooth s(pm25.lag1) has edf ≈ 1 → effectively linear for the reference day (Sunday).
    • Deviation smooths s(pm25.lag1, by = fdowX) have very low edf (1–2) and are not statistically significant.
    • No strong evidence that the PM\(_{2.5}\)–mortality relationship is nonlinear or varies by day of week.
    • Seasonal (s(doy)) and long-term (s(date2)) trends remain dominant smooth components.

Summary

  • What is the Model? A Generalized Additive Model (GAM) extends GLMs by allowing each predictor to have its own smooth, potentially nonlinear effect: \[g(\mathbb{E}[Y]) = \beta_0 + \sum_{j=1}^p s_j(X_j).\] Smooths \(s_j(\cdot)\) are estimated from data using spline bases and curvature penalties.
  • Multiple smoothers (NYC example): In time-series GAMs, several smooths are combined additively: \[\log \mathbb{E}[y_t] = \beta_0 + s(\text{PM}) + s(\text{DOY}) + s(t) + s(\text{Temp}) + \dots\] allowing flexible control for confounding (weather, seasonality, trend) while estimating the exposure-response curve.
  • Penalization: Smooths are controlled by penalties of the form \(\lambda_j \boldsymbol{\beta}_j^\top K_j \boldsymbol{\beta}_j\), where \(\lambda_j\) is a smoothing parameter and \(K_j\) encodes curvature. Estimation proceeds by maximizing a penalized log-likelihood.
  • Uncertainty bands: GAM smooths have an approximate multivariate normal distribution for coefficients. Standard errors for \(s_j(x)\) are obtained via \(\mathrm{Var}\{\hat{s}_j(x)\} = \mathbf{b}(x)^\top \mathbf{V}_{\beta,j} \mathbf{b}(x)\), giving pointwise intervals \(\hat{s}_j(x) \pm 1.96\,\mathrm{SE}\{\hat{s}_j(x)\}\). Shaded bands in mgcv plots are these pointwise intervals on the link scale.
  • Tensor-product smooths: Smooth interactions between two continuous predictors are modeled by a tensor-product smooth \(s_{12}(x,z)\) with basis \(\mathbf{b}_{xz}(x,z) = \mathbf{b}_x(x)\otimes\mathbf{b}_z(z)\). The penalty \[\mathcal{P}(\boldsymbol{\beta}) = \lambda_x \boldsymbol{\beta}^\top (\mathbf{K}_x \otimes \mathbf{I}_P)\boldsymbol{\beta} + \lambda_z \boldsymbol{\beta}^\top (\mathbf{I}_Q \otimes \mathbf{K}_z)\boldsymbol{\beta}\] allows anisotropic smoothing along \(x\) and \(z\) directions.
  • Varying-coefficient smooths: When a smooth effect of \(x\) varies across groups \(G\), GAMs use a baseline smooth \(s(x)\) plus deviation smooths \(s(x,\text{by}=G)\): \[s(x) + s(x,\text{by}=G) = \sum_m \beta_m b_m(x) + \sum_{c=1}^{C-1}\sum_m \gamma_{mc} b_m(x) I(G=c).\] Each deviation smooth has its own penalty, shrinking group curves toward the baseline if \(\lambda_c\) is large.
  • Estimation & diagnostics: Smoothing parameters are selected by REML or GCV. Use gam.check() to assess basis dimension \(k\) and smooth adequacy (via \(k\)-index and p-values). EDF summarize smooth complexity; values near 1 indicate near-linearity.
  • Interpretation: For univariate \(s_j(x)\), interpret the shape of the smooth and its band on the link scale (e.g., log-odds, log-rate). For tensor-product smooths \(s_{12}(x,z)\), use 3D surfaces or heatmaps. For varying-coefficient smooths, compare baseline and group-specific curves.
  • Key takeaway: GAMs provide a flexible, interpretable, and data-driven framework: they extend GLMs with nonlinear smooths, allow complex time-series adjustments, and handle smooth interactions and effect modification through tensor-product and varying-coefficient smooths, as illustrated by the PTB and NYC mortality examples.