The response variable ptb is modeled using a binomial family with a logit link function.
The formula includes:
A smooth term for age (s(age)).
Linear terms for male and tobacco.
Coefficients
Intercept: The log-odds of ptb when all predictors are at their reference levels is estimated as \(-2.44\) (highly significant, \(p < 2 \times 10^{-16}\)).
maleM (male effect): Being male is associated with a small but significant increase in the log-odds of ptb (estimate = \(0.073\), \(p = 0.00494\)). Tobacco use is associated with a significant increase in the log-odds of ptb (estimate = \(0.39\), \(p < 3.24 \times 10^{-13}\)).
Smooth Term for Age (s(age)) : The smooth function for age has an estimated effective degrees of freedom (edf) of \(3.31\), indicating a moderately flexible relationship. Is highly significant (\(p < 2 \times 10^{-16}\)), suggesting a non-linear relationship between age and the log-odds of ptb.
Model Fit and Diagnostics
Adjusted\(R^2\): The model explains only \(0.177\%\) of the variability in the response, indicating weak predictive power.
Deviance explained: The model accounts for \(0.304\%\) of the total deviance, further supporting limited explanatory capability. -
UBRE score: - A measure of model fit, with a value of \(-0.42095\).
Sample size: - The model was fit on \(77,340\) observations.
Conclusion: While the predictors (age, male, and tobacco) show significant associations with ptb, the overall model has limited explanatory power for predicting ptb.
Checking GAMs
mgcv::gam.check(fit.gam)
Method: UBRE Optimizer: outer newton
full convergence after 3 iterations.
Gradient range [9.816712e-07,9.816712e-07]
(score -0.4209495 & scale 1).
Hessian positive definite, eigenvalue range [1.285937e-05,1.285937e-05].
Model rank = 12 / 12
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(age) 9.00 3.31 0.96 0.95
Looking at the plot below, what is the trend of age and preterm birth?
Biologically, log-odds of preterm birth is higher at very young ages, lowest around 29 years, and then increased again.
Bayesian Credible Intervals
We are assuming the underlying function is smooth. This assumption can be formalized as a prior in a Bayesian model.
In Bayesian statistics, the parameters \(\beta\) are treated as random variables. A ridge penalty corresponds to an improper Gaussian prior:
\[
f_\beta \propto \exp\left(-\frac{\beta^\prime B \beta}{2}\right).
\]
For Gaussian data, this results in a posterior distribution:
\[
[\beta \mid y, \lambda] \sim N\left(\hat{\beta}, \sigma^2(X^\prime X + \lambda B)^{-1}\right),
\]
where:
\(\hat{\beta}\) is the estimated parameter vector.
\(\sigma^2\) is the variance.
\(\lambda\) is the penalty parameter.
\(B\) is the penalty matrix.
For a general likelihood, the Fisher Information matrix (Hessian of the negative log-likelihood at \(\hat{\beta}\)) is used:
where \(\hat{I}\) is the Fisher Information matrix. For quasi-Poisson data, multiply the penalty by \(\phi\).
In particular, the “Bayesian credible intervals” plotted in mgcv::gam have Frequentist coverage probabilities. For more details, see Section 6.10 in Wood (2017).
Multiple Smoothers
Let’s consider the additive model for two continuous variables \(x_i\) and \(z_i\):
\[
y_i = \beta_0 + g_1(x_i) + g_z(z_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)
\] where \(g_1(), g_2()\) denote smooth relationships between response \(y_i\) and predictors \(x_i\) and the \(z_i\).
Again, extends to generalized linear models (binomial, Poisson, etc).
We again express non-linear functions using basis functions:
\[
\begin{align*}
g_1(x_i ) & = \sum_{m=1}^M B_{m1} b_{m1}(x_i)\\
g_2(z_i ) & = \sum_{m=1}^M B_{m2} b_{m2}(z_i)
\end{align*}
\] - Induce smoothing by penalizing regression coefficients. Note we also use smoothers to control for confounders flexibly.
Motivating Example: Associations between Mortality and Fine Particulate Matter
Fine Particulate Matter (PM2.5)
Represents a mixture of solid and liquid particles in the air that are less than \(2.5 \, \mu\text{m}\) in diameter.
Mainly arises from combustion sources, including:
Power generation
Vehicle emissions
Industrial operations.
Scientific Question: What is the association between daily mortality counts and daily concentration of outdoor PM2.5 air pollution?
Data Sources:
Mortality Data: Daily counts of non-accidental deaths (age \(\geq 65\)) in the 5-county New York City area (2001–2005) obtained from the National Center for Health Statistics (CDC).
Air Pollution Data: Daily PM2.5 concentrations from the Environmental Protection Agency (EPA).
Meteorology Data: Daily meteorological conditions from the National Climatic Data Center (NOAA).
Time Series Health Model
We are interested in the association between daily variation in mortality counts and daily variation in exposure (PM2.5)
In a time-series design we view population as the unit of analysis, i.e., outcome = total mortality counts arising from the population.
Confounders that vary smoothly in time can be easily controlled for by including smoothers.
Let \(y_t\) denote the death count on day \(t\), and \(x_t\) be the corresponding PM2.5 level.
Temporal Delay
There is typically a temporal delay between exposure and outcome. To address this, let’s examine the association between daily mortality and previous-day exposure.
We are particularly interested in \(g_1(x_{t-1})\), the smooth function of lagged PM2.5 (\(x_{t-1}\)). Modeling log-transformed death counts helps improve the normality of residuals.
Confounders to Consider:
Day of the week.
Seasonality and long-term trends.
Same-day temperature.
Same-day dew point temperature.
Previous day’s temperature and dew point temperature.
Let’s model the joint effects:
# create lag variable of pm25: used later onhealth$pm25.lag1 =c(NA, health$pm25[1:1825])health$date2=order(health$date)health$fdow =factor(health$dow)health$fdow =relevel (health$fdow, ref ="Sunday")# use a more intuitive reference level:fit =gam(log(cr65plus)~s(pm25)+fdow+s(date2, k =100)+s(Temp)+s(DpTemp)+s(rmTemp)+s(rmDpTemp), data = health)summary(fit)
Run diagnostics on this model to determine if it is a good fit!
Run the \(gam.check()\) function on the fit to get diagnostic \(plot()\) to get model output.
What are the findings from this diagnostic checks?
It is always good to check the correlations between variables. VIFs are mode complicated due to expanding covariates with a spline basis. Here we will be concerned if there are high correlations with PM2.g.lag1. Run the code below to obtain VIFs:
Analyze residuals for evidence of autocorrelation. If residuals are correlated, inference is not valid and p-values tend to be small. Use \(acf(resid(fit))\),
Under an additive model framework, we can also consider smooth effects of two variables jointly.
Important
Bivariate Splines
Specifically, we can think of \(f(x_i, z_i)\) as a smooth function of two variables. \[
y_i = f(x_i, z_i) + \epsilon_i
\]
We can think of \(f(x_i, z_i)\) as a surface.
We can use a 2-dimensional splines to model \(f(x_i, z_i)\).
Let \(\mathbf{s}_i = (x_i, z_i)\) be some pair of covariate values, and let \(\mathbf{k}_m = (x_m, z_m)\) denote the \(m^{th}\) knot in the domain of \(x_i\) and \(z_i\). We can express the smooth function as
where \(||\mathbf{s}_i - \mathbf{k}_m||\) is the Euclidean distance between the covariate \(\mathbf{s}_i\) and the knot location \(\mathbf{k}_m\).
The radial basis kernel is \(r^2 log r\).
The thin-plate spline is sensitive to the scale of each variable, but invariant to rotation (i.e., isotropic).
It is best suited for variables measured on the same scale, such as geographical distance.
Joint Effects of Temperature and Dew Point Temperature
mgcv::gam uses thin plate regression splines as the default smoothers of a single variable (as well as two variables).
This is the general idea: - Construct the \(n \times n\) matrix from \(\mathbf{E}\) from \(||\mathbf{s}_i - \mathbf{s}_{i'}||^2 log(||\mathbf{s}_i - \mathbf{s}_{i'}||)\)
Use SVD to find a low rank representation.
In practice, there are some additional things to worry about to make 1 (for the intercept), and \(x\) and \(z\) in the null space of the penalty.
Then estimate the \(\beta_0, \beta_1\) for \(x\) , and \(\beta_2\) for \(z\) (unpenalized) and \(\beta_3, ..., \beta_k\) (penalized) which reduces computation costs.
To define a bivariate smoother, simply specify \(s(var1, var2)\) in the equation formula.
fit_thin <-gam(log(alldeaths) ~s(date2,k=100) +s(Temp, DpTemp, k=20), data = health)summary(fit_thin)
Family and Link Function: The model assumes a Gaussian distribution for the response variable with an identity link function, meaning predictions are on the same scale as the response variable \(\log(\text{alldeaths})\).
Model Formula: The model includes:
A smooth function of date2 with \(k=100\) (basis dimension for flexibility).
A bivariate smooth function of Temp and DpTemp with \(k=20\).
Parametric Coefficients:
The intercept is 4.962, representing the baseline log of deaths when all covariates are at their mean.
The intercept’s \(t\)-value is extremely high (2422), and the associated \(p\)-value is \(<2e-16\), indicating it is highly significant.
Smooth Terms:
s(date2):
Effective degrees of freedom (edf): 67.37 suggesting a highly flexible fit for the time trend.
\(F\)-value: 9.597 with a \(p\)-value \(<2e-16\), indicating that the temporal trend is highly significant.
s(Temp, DpTemp):
edf: 12.66 indicating a moderately flexible surface for the relationship between temperature and dew point temperature.
\(F\)-value: 4.327 with a \(p\)-value <2e-16, suggesting the smooth term for Temp and DpTemp is also highly significant.
Model Performance:
Adjusted \(R^2\): 0.464, meaning 46.4% of the variability in the response (log of deaths) is explained by the model.
Deviance Explained: 48.8%, similar to \(R^2\), indicating a good fit but room for improvement.
Generalized Cross-Validation (GCV) score: 0.008021 smaller is better! Mostly used for model comparison.
Scale Estimate \(\sigma^2\) : 0.0076651, indicating the residual variance.
Key Takeaways:
Both smooth terms (date2 and the bivariate smoother of Temp and DpTemp) are highly significant and contribute meaningfully to explaining variability in the log of deaths.
The model captures temporal trends and the effect of temperature and dew point interactions effectively, but the explained variance suggests other factors might also influence the response variable.