3 Smoothing splines

3.1 Introduction

In the previous section we learn how to fit regression splines by specifying the knots and a set of basis function. It should be easy to see that a higher number of knots will lead to a lower MSE because we will be overfitting the features of the curve.

The model below is fitted with natural splines with 25 knots.

Clearly the curve seems to be overfitting the data.

We will use a similar idea to the one used in regularisation (module 4). We select many knots but penalise for the roughness of the fitting.

Remember that the 1st derivative indicates the slope of the curve and the second derivative is the speed of change of the slope. Thus, the second derivative of a curve is associated with the roughness of the curve.

We will then use the second derivative as the penalisation term in the residual sum of squares.

\(\sum_{i=1}^{n} (y_i-f(x_i))^2 + \lambda \int f''(t)^2 dt\)

The result is a smoothing spline. For smoothing splines, the number of knots is not as important given that the penalisation term will handle the roughness.

The animation below, shows the fitting of smoothing splines, with amounts of penalisation (lambda), and automatic choice of number of knots given by the smooth.spline function in R. The cross validated MSE is also shown.

A large \(\lambda\) results in a smooth curve (a straight line in the limit) and a smaller \(\lambda\) leads to a more rough curve. The optimal \(\lambda\) can be chosen by cross-validation.

The smoothing splines can be incorporated in the generalised linear models framework which is usually referred as generalised additive models (GAM). Rather than a linear effect of a predictor, we can have a smoothing spline modeling the association of the predictor with the outcome:

\(g\left(E(y|\mathbf{x} ) \right) = \beta_0 + f_1(x_{1}) + f_2(x_{2}) + ... + f_k(x_{k})\)

Notice that \(f_p(x_p)\) can be a linear function \(\beta_p x_p\) and if all \(f\)’s are linear function, the model above is a GLM.

3.2 Readings

Read the following chapters of An introduction to statistical learning:

3.3 Practice session

Task 1 - Fit a smoothing spline

We will continue the example using the dataset tricepsavailable in the MultiKink package. The data contains the measurement of the triceps skin fold of 892 females (variable triceps) and we want to model its association with age, using smoothing cubic splines.

The function smooth.spline() fits smoothing cubic splines. We can provide the penalisation and/or number of knots, df, or just use the defaults.

library(splines)
library(MultiKink) #for the data
library(ggplot2)   #for the plots
set.seed(1974)     #fix the random generator seed 

data("triceps")   #load the dataset triceps
                  #notice that the variable of interest
                  #it is also called triceps. Don't get 
                  #confused!

#smooth spline with automatic number of knots chosen
#and penalisation chosen by leave-one-out CV (this is the
#option cv=T, otherwise generalized’ cross-validation is used)
sspline <- smooth.spline(triceps$age, 
                         triceps$triceps, 
                         cv=TRUE) 
## Warning in smooth.spline(triceps$age, triceps$triceps, cv = TRUE): cross-validation with non-unique 'x' values seems doubtful
plot(triceps$age, triceps$triceps)
lines(sspline, col="blue")

*Note: The generalised cross-validation (default), in this case, selects a very low value for \(\lambda\) which results in under smoothing. It is not clear why that is the case.

Let’s get the triceps predicted value for the ages of 10 and 30

predict(sspline, x=c(10,30))
## $x
## [1] 10 30
## 
## $y
## [1]  6.470573 14.290032

The predicted values are 6.4705727 and 14.2900322, respectively.

As mentioned before, we could have chosen the amount of penalisation and this would lead to a different smoothing. For example, for

sspline <- smooth.spline(triceps$age, 
                         triceps$triceps, lambda=.0001) 
plot(triceps$age, triceps$triceps)
lines(sspline, col="blue")

TRY IT YOURSELF:

  1. Fit smoothing splines with df=19 and df=30 and compare them with the one above. Comment on the results.
See the solution code
sspline <- smooth.spline(triceps$age, 
                         triceps$triceps, lambda=.0001) 
sspline19 <- smooth.spline(triceps$age, 
                         triceps$triceps, df=19) 
sspline30 <- smooth.spline(triceps$age, 
                         triceps$triceps, df=30) 
plot(triceps$age, triceps$triceps)
lines(sspline, col="blue")
lines(sspline19, col="red")
lines(sspline30, col="green")

The spline with df=19 corresponds to having a lambda=.0001, so the result is the same. If you run sspline and sspline you can see that the parameters are similar.

The spline with df=30 will be more flexible (more rough).

Task 2 - Fit an additive model

The dataset bmd.csv contains 169 measurement of bone mineral density (variable bmd) in men and women of different age. We want to fit a model for bmd using age, sex and bmi as predictors.

#read the data and compute BMI
bmd.data     <- 
  read.csv("https://www.dropbox.com/s/c6mhgatkotuze8o/bmd.csv?dl=1", 
           stringsAsFactors = TRUE)
bmd.data$bmi <- bmd.data$weight_kg / (bmd.data$height_cm/100)^2

We will use the function gam() from the mgcv library. Notice that there is another gam function from the gam package (from the help file: “Note that this version of gam is different from the function with the same name in the R library mgcv, which uses only smoothing splines with a focus on automatic smoothing parameter selection via GCV.”)

The function s() within the model indicates that we want to smoothing spline for that predictor. There are several types of splines implemented in the function. We will use the option basis spline equal to cubic regression splines: bs="cr". The function s() has similarities to the smooth.spline() but is implemented

#libraries that we will need
library(mgcv)   #package for gam
set.seed(1974) #fix the random generator seed 
bmd.gam <- gam(bmd ~ s(age, bs="cr")+ s(bmi, bs="cr") + sex, data=bmd.data)

summary(bmd.gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## bmd ~ s(age, bs = "cr") + s(bmi, bs = "cr") + sex
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.74193    0.01456   50.97  < 2e-16 ***
## sexM         0.08092    0.02064    3.92 0.000131 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F  p-value    
## s(age) 1.035  1.070 17.65 3.02e-05 ***
## s(bmi) 5.687  6.611 10.09  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.381   Deviance explained = 40.9%
## GCV = 0.018099  Scale est. = 0.017165  n = 169

The effective number of degrees of freedom for age is approximately 1, which suggests that the effect of age is linear. We can plot the fitted splines for each predictor. Notice, however, that the plot is not in the original scale.

plot(bmd.gam)

We can also plot the surface fitted for each sex. For example, for women,

# Let's create a grid to be used in  persp() 
steps <- 60
age <- with(bmd.data, 
            seq(min(age), max(age), 
                length = steps) )

bmi <-  with(bmd.data, 
             seq(min(bmi), max(bmi), 
                 length = steps) )

newdat <- expand.grid(age = age,             #grid
                      bmi = bmi, 
                      sex="F")  

bmdpred <- matrix(predict(bmd.gam, newdat), 
                  steps, steps)             #predictions

# Now plot it
p <- persp(age, bmi,
           bmdpred, 
           theta = 65,                    #angle of the perspective
           col = "green")

# To add the points, you need the same 3d transformation
obs <- with(bmd.data[bmd.data$sex=="F", ], 
            trans3d(age, bmi, bmd, p))
pred <- with(bmd.data[bmd.data$sex=="F", ], 
             trans3d(age, bmi, fitted(bmd.gam)[bmd.data$sex=="F"], p))


# Add segments to show the points and where they are in 3d
points(obs, col = "red", pch = 16)
segments(obs$x, obs$y, pred$x, pred$y)

TRY IT YOURSELF:

  1. How would the plot for sex=M compare to the surface above?
See the solution code

The surface would have exactly the same shape but would be separated by the 0.08 units (estimate for the variable sex in the model)

  1. Fit a linear model with the same variables and compare the AIC of the linear model with the previous GAM model
See the solution code
bmd.glm <- glm(bmd ~ age+bmi+ sex, data=bmd.data)
summary(bmd.glm)
AIC(bmd.glm, bmd.gam)

The GAM model seems to fit better because it has a lower AIC.

3.4 Exercises

Solve the following exercise:

  1. The dataset SA_heart.csv contains on coronary heart disease status (variable chd) and several risk factors including the cumulative tobacco consumption tobacco, systolic sbp, and age
  1. Fit a GAM logistic model for chd with splines for the predictor tobacco,sbp and age
See the solution code for a)
    library(caret)
    library(splines)
    set.seed(2001)
    SA_heart <- read.csv("https://www.dropbox.com/s/cwkw3p91zyizcqz/SA_heart.csv?dl=1")
    
    chd.gam <- gam(chd ~ s(tobacco, bs="cr") + 
                     s(sbp, bs="cr") +
                     s(age, bs="cr"), 
                   family = binomial, 
                   data=SA_heart)
    summary(chd.gam)
    plot(chd.gam)

  1. Find the AUC ROC for the model above.

  2. Compare the AUC ROC of the GAM model with a logistic regression with linear effects for the predictors.

  1. The dataset fev.csv contains the measurements of forced expiratory volume (FEV) tests, evaluating the pulmonary capacity in 654 children and young adults.
  1. Plot the association between fev and height and fit a smoothing spline for fev using height as a predictor

  2. Plot the association between fev and age and fit a smoothing spline for fev using age as a predictor

  3. Fit a GAM model for fev with smoothing slipes for height and age and also add sex to the model.

  4. Comment in the results of the fitted GAM model and plot the fitted splines for the predictors