Module 5 Part 3: Generalized Additive Models

Reading

From Wood, S. Generalized Additive Models: An Introduction with R, 2nd Edition.
A nice resource for fitting GAMS: https://wiki.qcbs.ca/r_workshop8

Basic Concepts

Generalized Additive Models (logistic)
Additive models with multiple covariates.
Using $g a m ()$ in package $m g c v ()$ to fit semiparametric models with parametric slopes and nonparametric terms.
Bivariate splines
Interpreting non-linear effects

Binary Outcome Exmaple

Motivating Exaple: Recall from example in Module 4.

Dataset: a cohort of live births from GA born in the year 2001 (N=77,340)

Variables:

$p t b$ : indicator for preterm birth for whether baby from pregnancy $i$ was born <37 weeks.
$a g e$ : mothers age at delivery
$m a l e$ : indicator of baby’s sex (1=male, 0=female)
$t o b a c c o$ : indicator for mother tobacco use during pregnancy test (1=yes, 0=no)

Previous analysis

Loading required package: nlme

This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.


Attaching package: 'dplyr'

The following object is masked from 'package:nlme':

    collapse

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


Call:
glm(formula = ptb ~ age + male + tobacco, family = binomial(link = "logit"), 
    data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5160  -0.4236  -0.4103  -0.4088   2.2500  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.4212664  0.0631355 -38.350  < 2e-16 ***
age         -0.0006295  0.0021596  -0.291  0.77068    
maleM        0.0723659  0.0258672   2.798  0.00515 ** 
tobacco      0.4096495  0.0534627   7.662 1.83e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 44908  on 77339  degrees of freedom
Residual deviance: 44846  on 77336  degrees of freedom
AIC: 44854

Number of Fisher Scoring iterations: 5

Plot diagnostics are NOT very helpful with binary responses!

Generalized Additive Model

To account for non-linear age effect:

$\begin{aligned} p t b_{i} & \sim B e r n o u l l i (p_{i}) \\ l o g i t (p_{i}) & = β_{0} + β_{1} m a l e_{i} + β_{2} t o b a c c o_{i} + s (a g e_{i}) \end{aligned}$

where $s (a g e_{i})$ is a smooth function of age. The above model is known as a generalized additive model (GAMs).

Important

Generalized Additive Model

The general formula for a Generalized Additive Model (GAM) is:

$g (E [Y]) = β_{0} + \sum_{j = 1}^{p} s_{j} (X_{j})$

where:

$Y$ is the response variable,
$g ()$ is the link function that relates the expected value of $Y$ $(E (Y)$ to the linear predictor.
$β_{0}$ is the intercept term.
$s_{j} (X_{j})$ are smooth functions, such as splines, applied to the predictor variable $X_{j}$ .
$p$ is the number of predictor variables.
In GAMs, the smooth function $s_{j} (\cdot)$ allows for greater flexibility in modeling non-linear relationships.

fit.gam = gam(ptb~ s(age) + male + tobacco ,
              family = binomial, data = dat)
summary(fit.gam)


Family: binomial 
Link function: logit 

Formula:
ptb ~ s(age) + male + tobacco

Parametric coefficients:
            Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -2.44226    0.01913 -127.647  < 2e-16 ***
maleM        0.07274    0.02588    2.811  0.00494 ** 
tobacco      0.39016    0.05356    7.284 3.24e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
         edf Ref.df Chi.sq p-value    
s(age) 3.314  4.146  70.17  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.00177   Deviance explained = 0.304%
UBRE = -0.42095  Scale est. = 1         n = 77340

Interpretation:

Model Overview

The response variable ptb is modeled using a binomial family with a logit link function.
The formula includes:
- A smooth term for age (s(age)).
- Linear terms for male and tobacco.

Coefficients

Intercept: The log-odds of ptb when all predictors are at their reference levels is estimated as $- 2.44$ (highly significant, $p < 2 \times 10^{- 16}$ ).
maleM (male effect): Being male is associated with a small but significant increase in the log-odds of ptb (estimate = $0.073$ , $p = 0.00494$ ). Tobacco use is associated with a significant increase in the log-odds of ptb (estimate = $0.39$ , $p < 3.24 \times 10^{- 13}$ ).
Smooth Term for Age (s(age)) : The smooth function for age has an estimated effective degrees of freedom (edf) of $3.31$ , indicating a moderately flexible relationship. Is highly significant ( $p < 2 \times 10^{- 16}$ ), suggesting a non-linear relationship between age and the log-odds of ptb.

Model Fit and Diagnostics

Adjusted $R^{2}$ : The model explains only $0.177 %$ of the variability in the response, indicating weak predictive power.
Deviance explained: The model accounts for $0.304 %$ of the total deviance, further supporting limited explanatory capability. -
UBRE score: - A measure of model fit, with a value of $- 0.42095$ .
Sample size: - The model was fit on $77, 340$ observations.

Conclusion: While the predictors (age, male, and tobacco) show significant associations with ptb, the overall model has limited explanatory power for predicting ptb.

Checking GAMs

mgcv::gam.check(fit.gam)


Method: UBRE   Optimizer: outer newton
full convergence after 3 iterations.
Gradient range [9.816712e-07,9.816712e-07]
(score -0.4209495 & scale 1).
Hessian positive definite, eigenvalue range [1.285937e-05,1.285937e-05].
Model rank =  12 / 12 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

         k'  edf k-index p-value
s(age) 9.00 3.31    0.95    0.73

Looking at the plot below, what is the trend of age and preterm birth?

Biologically, log-odds of preterm birth is higher at very young ages, lowest around 29 years, and then increased again.

Bayesian Credible Intervals

We are assuming the underlying function is smooth. This assumption can be formalized as a prior in a Bayesian model.

In Bayesian statistics, the parameters $β$ are treated as random variables. A ridge penalty corresponds to an improper Gaussian prior:

$f_{β} \propto \exp (- \frac{β^{'} B β}{2}) .$

For Gaussian data, this results in a posterior distribution:

$[β ∣ y, λ] \sim N (\hat{β}, σ^{2} (X^{'} X + λ B)^{- 1}),$

where:

$\hat{β}$ is the estimated parameter vector.
$σ^{2}$ is the variance.
$λ$ is the penalty parameter.
$B$ is the penalty matrix.

For a general likelihood, the Fisher Information matrix (Hessian of the negative log-likelihood at $\hat{β}$ ) is used:

$[β ∣ y, λ] \sim N (\hat{β}, {(\hat{I} + λ B)}^{- 1}),$

where $\hat{I}$ is the Fisher Information matrix. For quasi-Poisson data, multiply the penalty by $ϕ$ .

In particular, the “Bayesian credible intervals” plotted in mgcv::gam have Frequentist coverage probabilities. For more details, see Section 6.10 in Wood (2017).

Multiple Smoothers

Let’s consider the additive model for two continuous variables $x_{i}$ and $z_{i}$ :

$y_{i} = β_{0} + g_{1} (x_{i}) + g_{z} (z_{i}) + ϵ_{i}, ϵ_{i} \sim N (0, σ^{2})$ where $g_{1} (), g_{2} ()$ denote smooth relationships between response $y_{i}$ and predictors $x_{i}$ and the $z_{i}$ .

Again, extends to generalized linear models (binomial, Poisson, etc).

We again express non-linear functions using basis functions:

$\begin{aligned} g_{1} (x_{i}) & = \sum_{m = 1}^{M} B_{m 1} b_{m 1} (x_{i}) \\ g_{2} (z_{i}) & = \sum_{m = 1}^{M} B_{m 2} b_{m 2} (z_{i}) \end{aligned}$ - Induce smoothing by penalizing regression coefficients. Note we also use smoothers to control for confounders flexibly.

Motivating Example: Associations between Mortality and Fine Particulate Matter

Fine Particulate Matter (PM2.5)

Represents a mixture of solid and liquid particles in the air that are less than $2.5 μ m$ in diameter.
Mainly arises from combustion sources, including:
- Power generation
- Vehicle emissions
- Industrial operations.

Scientific Question: What is the association between daily mortality counts and daily concentration of outdoor PM2.5 air pollution?

Data Sources:

Mortality Data: Daily counts of non-accidental deaths (age $\geq 65$ ) in the 5-county New York City area (2001–2005) obtained from the National Center for Health Statistics (CDC).
Air Pollution Data: Daily PM2.5 concentrations from the Environmental Protection Agency (EPA).
Meteorology Data: Daily meteorological conditions from the National Climatic Data Center (NOAA).

Time Series Health Model

We are interested in the association between daily variation in mortality counts and daily variation in exposure (PM2.5)
In a time-series design we view population as the unit of analysis, i.e., outcome = total mortality counts arising from the population.
Confounders that vary smoothly in time can be easily controlled for by including smoothers.

Let $y_{t}$ denote the death count on day $t$ , and $x_{t}$ be the corresponding PM2.5 level.

Temporal Delay

There is typically a temporal delay between exposure and outcome. To address this, let’s examine the association between daily mortality and previous-day exposure.

The model is given by:

$l o g (y_{t}) = β_{0} + g_{1} (x_{t - 1}) + confounders + ϵ_{t}, ϵ_{t} \overset{iid}{\sim} N (0, σ^{2}) .$

Focus of Interest

We are particularly interested in $g_{1} (x_{t - 1})$ , the smooth function of lagged PM2.5 ( $x_{t - 1}$ ). Modeling log-transformed death counts helps improve the normality of residuals.

Confounders to Consider:

Day of the week.
Seasonality and long-term trends.
Same-day temperature.
Same-day dew point temperature.
Previous day’s temperature and dew point temperature.

Let’s model the joint effects:

# create lag variable of pm25: used later on
health$pm25.lag1 = c(NA, health$pm25[1:1825])
health$date2=order(health$date)
health$fdow = factor(health$dow)
health$fdow = relevel (health$fdow, ref = "Sunday")
# use a more intuitive reference level:
fit = gam(log(cr65plus)~s(pm25)+fdow+s(date2, k = 100)+s(Temp)+ s(DpTemp)+s(rmTemp)+
            s(rmDpTemp), data = health)
summary(fit)


Family: gaussian 
Link function: identity 

Formula:
log(cr65plus) ~ s(pm25) + fdow + s(date2, k = 100) + s(Temp) + 
    s(DpTemp) + s(rmTemp) + s(rmDpTemp)

Parametric coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.2174859  0.0077639 543.220   <2e-16 ***
fdowFriday    -0.0103892  0.0109781  -0.946   0.3441    
fdowMonday     0.0234074  0.0109481   2.138   0.0327 *  
fdowSaturday  -0.0071663  0.0109200  -0.656   0.5117    
fdowThursday   0.0041357  0.0109771   0.377   0.7064    
fdowTuesday    0.0002408  0.0109574   0.022   0.9825    
fdowWednesday  0.0011193  0.0109764   0.102   0.9188    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
               edf Ref.df      F  p-value    
s(pm25)      2.198  2.796  2.527 0.068785 .  
s(date2)    60.958 73.528  6.877  < 2e-16 ***
s(Temp)      1.000  1.000 17.434 3.14e-05 ***
s(DpTemp)    1.524  1.908  3.272 0.030233 *  
s(rmTemp)    5.538  6.762  3.443 0.001410 ** 
s(rmDpTemp)  1.000  1.000 13.239 0.000282 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.465   Deviance explained = 48.8%
GCV = 0.01619  Scale est. = 0.015487  n = 1826

Run diagnostics on this model to determine if it is a good fit!

Run the $g a m . c h e c k ()$ function on the fit to get diagnostic $p l o t ()$ to get model output.
What are the findings from this diagnostic checks?
It is always good to check the correlations between variables. VIFs are mode complicated due to expanding covariates with a spline basis. Here we will be concerned if there are high correlations with PM2.g.lag1. Run the code below to obtain VIFs:
Analyze residuals for evidence of autocorrelation. If residuals are correlated, inference is not valid and p-values tend to be small. Use $a c f (r e s i d (f i t))$ ,
Interpret model findings.

# cor(health[,c("pm25","date2","Temp","DpTemp","rmTemp","rmDpTemp")])
# temp=lm(log(cr65plus)~pm25+fdow+date2+
# Temp+DpTemp+rmTemp+rmDpTemp, data = health)
# car::vif(temp)

# library(corrplot)
# temp = health[,c('pm25.lag1','date2','Temp','DpTemp','rmTemp','rmDpTemp')]
# corrplot.mixed(cor(temp,use='complete.obs'))

Example of Invalid Model

To see the impact of autocorrelation, we remove date:

fit.nodate = gam(log(cr65plus)~s(pm25.lag1)+fdow+s(Temp)+ s(DpTemp)+s(rmTemp)+s(rmDpTemp), data = health)

acf(resid(fit.nodate))

summary(fit.nodate)


Family: gaussian 
Link function: identity 

Formula:
log(cr65plus) ~ s(pm25.lag1) + fdow + s(Temp) + s(DpTemp) + s(rmTemp) + 
    s(rmDpTemp)

Parametric coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.2196867  0.0086125 489.950   <2e-16 ***
fdowFriday    -0.0149662  0.0121799  -1.229   0.2193    
fdowMonday     0.0245747  0.0121649   2.020   0.0435 *  
fdowSaturday  -0.0108324  0.0121525  -0.891   0.3729    
fdowThursday  -0.0001314  0.0121833  -0.011   0.9914    
fdowTuesday   -0.0029427  0.0121768  -0.242   0.8091    
fdowWednesday -0.0007306  0.0121867  -0.060   0.9522    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
               edf Ref.df      F p-value    
s(pm25.lag1) 1.000  1.000 47.599  <2e-16 ***
s(Temp)      1.683  2.177  4.480  0.0103 *  
s(DpTemp)    1.000  1.000  1.406  0.2359    
s(rmTemp)    7.705  8.572  6.781  <2e-16 ***
s(rmDpTemp)  2.111  2.762  1.906  0.1233    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.337   Deviance explained = 34.4%
GCV = 0.019388  Scale est. = 0.01917   n = 1825

Now we no longer properly account for the dependence between nearby observations. This leads to significant autocorrelation ( $> 2 \cdot 1 / \sqrt{n}$ ).

Autocorrelation in residuals

One person’s mean structure is another person’s correlation structure.
If you don’t have covariates to model the correlation structure, you can specify correlated errors.
This is beyond the scope of this course, but a nice tutorial is available at https://petolau.github.io/Analyzing-double-seasonal-time-series-with-GAM-in-R/

Bivariate Splines

Under an additive model framework, we can also consider smooth effects of two variables jointly.

Important

Bivariate Splines

Specifically, we can think of $f (x_{i}, z_{i})$ as a smooth function of two variables. $y_{i} = f (x_{i}, z_{i}) + ϵ_{i}$

We can think of $f (x_{i}, z_{i})$ as a surface.

We can use a 2-dimensional splines to model $f (x_{i}, z_{i})$ .

Let $s_{i} = (x_{i}, z_{i})$ be some pair of covariate values, and let $k_{m} = (x_{m}, z_{m})$ denote the $m^{t h}$ knot in the domain of $x_{i}$ and $z_{i}$ . We can express the smooth function as

$f (x_{i}, z_{i}) = β_{0} + \sum_{m = 1}^{M} β_{m} b_{m} ({\tilde{s}}_{i}, {\tilde{k}}_{m})$

Note that $b_{m} (,)$ is a basis function that maps $R \times R \to R$ .

Bivariate Basis Functions Using Thin-Plate Splines

One popular bivariate basis function uses thin-plate splines, which extend to $s_{i} \in R^{d}$ and $\partial l_{g}$ penalties.

These splines are designed to interpolate and approximate smooth surfaces over two dimensions (bivariate).

We consider $d = 2$ and $l = 2$ (smoothness penalty involves 2nd order derivative).

$f ({\tilde{s}}_{i}) = β_{0} + β_{1} x_{i} + β_{2} z_{i} + \sum_{m = 1}^{M} β_{2 + m} b_{m} ({\tilde{s}}_{i}, {\tilde{k}}_{m})$

using radial basis function given by

$b_{m} ({\tilde{s}}_{i}, {\tilde{k}}_{m}) = | | {\tilde{s}}_{i} - {\tilde{k}}_{m} | |^{2} \log | | {\tilde{s}}_{i} - {\tilde{k}}_{m} | |$ shoul

where $| | s_{i} - k_{m} | |$ is the Euclidean distance between the covariate $s_{i}$ and the knot location $k_{m}$ .

The radial basis kernel is $r^{2} l o g r$ .

The thin-plate spline is sensitive to the scale of each variable, but invariant to rotation (i.e., isotropic).

It is best suited for variables measured on the same scale, such as geographical distance.

Joint Effects of Temperature and Dew Point Temperature

mgcv::gam uses thin plate regression splines as the default smoothers of a single variable (as well as two variables).

This is the general idea: - Construct the $n \times n$ matrix from $E$ from $| | s_{i} - s_{i^{'}} | |^{2} l o g (| | s_{i} - s_{i^{'}} | |)$

Use SVD to find a low rank representation.
In practice, there are some additional things to worry about to make 1 (for the intercept), and $x$ and $z$ in the null space of the penalty.
Then estimate the $β_{0}, β_{1}$ for $x$ , and $β_{2}$ for $z$ (unpenalized) and $β_{3}, . . ., β_{k}$ (penalized) which reduces computation costs.

To define a bivariate smoother, simply specify $s (v a r 1, v a r 2)$ in the equation formula.

fit_thin <- gam(log(alldeaths) ~ s(date2,k=100) +
                  s(Temp, DpTemp, k=20), data = health)
summary(fit_thin)


Family: gaussian 
Link function: identity 

Formula:
log(alldeaths) ~ s(date2, k = 100) + s(Temp, DpTemp, k = 20)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.962063   0.002049    2422   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                 edf Ref.df     F p-value    
s(date2)       67.37  80.08 9.597  <2e-16 ***
s(Temp,DpTemp) 12.66  15.76 4.327  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.464   Deviance explained = 48.8%
GCV = 0.008021  Scale est. = 0.0076651  n = 1826

Interpretation:

Family and Link Function: The model assumes a Gaussian distribution for the response variable with an identity link function, meaning predictions are on the same scale as the response variable $\log (alldeaths)$ .
Model Formula: The model includes:
- A smooth function of date2 with $k = 100$ (basis dimension for flexibility).
- A bivariate smooth function of Temp and DpTemp with $k = 20$ .
Parametric Coefficients:
- The intercept is 4.962, representing the baseline log of deaths when all covariates are at their mean.
- The intercept’s $t$ -value is extremely high (2422), and the associated $p$ -value is $< 2 e - 16$ , indicating it is highly significant.
Smooth Terms:
- s(date2):
  - Effective degrees of freedom (edf): 67.37 suggesting a highly flexible fit for the time trend.
  - $F$ -value: 9.597 with a $p$ -value $< 2 e - 16$ , indicating that the temporal trend is highly significant.
- s(Temp, DpTemp):
  - edf: 12.66 indicating a moderately flexible surface for the relationship between temperature and dew point temperature.
  - $F$ -value: 4.327 with a $p$ -value <2e-16, suggesting the smooth term for Temp and DpTemp is also highly significant.
Model Performance:
- Adjusted $R^{2}$ : 0.464, meaning 46.4% of the variability in the response (log of deaths) is explained by the model.
- Deviance Explained: 48.8%, similar to $R^{2}$ , indicating a good fit but room for improvement.
- Generalized Cross-Validation (GCV) score: 0.008021 smaller is better! Mostly used for model comparison.
- Scale Estimate $σ^{2}$ : 0.0076651, indicating the residual variance.
Key Takeaways:
- Both smooth terms (date2 and the bivariate smoother of Temp and DpTemp) are highly significant and contribute meaningfully to explaining variability in the log of deaths.
- The model captures temporal trends and the effect of temperature and dew point interactions effectively, but the explained variance suggests other factors might also influence the response variable.