2 Piecewise Regression and Splines

2.1 Introduction

An alternative to fit all data points with a single polynomial curve, is to fit segments to different parts of the data, with breakpoints (knots) at pre-determined places.

We can further require continuity, meaning that the segments have to be connected

Again, the knots need to be specified and the regression equation becomes:

$y = \beta_0 + \beta_1 x + \beta_2 (x-k_1)_{+} + \beta_3 (x-k_2)_{+} + ... + \beta_6 (x-k_p)_{+} + \varepsilon\\$

where $(x-k)_{+} = \begin{cases} 0, & \text{ if } x < k \\ x-k, & \text{ if } x \geq k\\ \end{cases}$

The approach above may be extended to use polynomial segments. For example, using cubic segments, the model would become

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x-k_1)_{+} + \beta_5 (x-k_1)^2_{+} + \beta_6 (x-k_1)^3_{+} + ... + \varepsilon$

The above model will be a smooth curve within the intervals bounded by the knots, but the “connection” between the segments will not be smooth. To force smoothness over the entire fitted curve we can restrict the the model above to only include the cubic components:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x-k_1)^3_{+} + \beta_5 (x-k_2)^3_{+} + \beta_6 (x-k_3)^3_{+} + ... + \varepsilon$

This will guarantee that the fitted curve is differentiable, with no sharp changes in the direction. This is called a cubic spline.

An improvement of the fitting of splines in the boundary of the data is achieved by using linear fitting before the first knot and after the last one.

2.2 Readings

Read the following chapters of An introduction to statistical learning:

2.3 Practice session

Task 1 - Fit a piecewise linear regression

We will continue the example using the dataset triceps available in the MultiKink package. The data contains the measurement of the triceps skin fold of 892 females (variable triceps) and we want to model its association with age, using piecewise linear regression with knots at 5,10,20,30 and 40.

First, we will load the data

#libraries that we will need
#install.packages("MultiKink")  
library(MultiKink) #for the data
library(ggplot2)   #for the plots
set.seed(1974)     #fix the random generator seed 

data("triceps")   #load the dataset triceps
                  #notice that the variable of interest
                  #it is also called tricets. Don't get 
                  #confused!

And plot the scatter for triceps and age

 #simple scatter
  #we can store the scatter in an object 
  #to use it later
 tri.age.plot <- ggplot(triceps, aes(x=age, y=triceps)) +
                 geom_point(alpha=0.55, color="black") + 
                 theme_minimal() 
 tri.age.plot

We will fit linear models within the intervals defined by the knots. The predict() will give us the fitted lines.

###Piecewise regression

pred1 <- predict(lm(triceps~age, 
                    data = triceps[triceps$age<5,]))
pred2 <- predict(lm(triceps~age, 
                    data = triceps[triceps$age >=5 & triceps$age<10,]))
pred3 <- predict(lm(triceps~age, 
                    data = triceps[triceps$age>=10 & triceps$age<20,]))
pred4 <- predict(lm(triceps~age, 
                    data = triceps[triceps$age>=20 & triceps$age<30,]))
pred5 <- predict(lm(triceps~age, 
                    data = triceps[triceps$age>=30 & triceps$age<40,]))
pred6 <- predict(lm(triceps~age, 
                    data = triceps[triceps$age>=40,]))

And we can now add the segments to the scatter plot

tri.age.plot + 
  geom_line(data=triceps[triceps$age<5,], 
            aes(y = pred1, x=age), size = 1, col="blue") +
  geom_line(data=triceps[triceps$age >=5 & triceps$age<10,], 
            aes(y = pred2, x=age), size = 1, col="blue") +
  geom_line(data=triceps[triceps$age>=10 & triceps$age<20,], 
            aes(y = pred3, x=age), size = 1, col="blue") +
  geom_line(data=triceps[triceps$age>=20 & triceps$age<30,], 
            aes(y = pred4, x=age), size = 1, col="blue") +
  geom_line(data=triceps[triceps$age>=30 & triceps$age<40,], 
            aes(y = pred5, x=age), size = 1, col="blue") +
  geom_line(data=triceps[triceps$age>=40,], 
            aes(y = pred6, x=age), size = 1, col="blue")

We can restrict the segments to be connected, i.e., to fit a continuous line. The model is

$y = \beta_0 + \beta_1 x + \beta_2 (x-k_1)_{+} + \beta_3 (x-k_2)_{+} + ... + \beta_6 (x-k_p)_{+} + \varepsilon\\$

where $(x-k)_{+} = \begin{cases} 0, & \text{ if } x < k \\ x-k, & \text{ if } x \geq k\\ \end{cases}$

So, we need to add the terms $(x-k)$ when $x \geq k$ . We will do this by adding $I((age-k)*(age >= k))$ terms to the linear model. Note that $(age >= k)$ is a logical statement that will be 0 ( $FALSE$ ) of 1 ( $TRUE$ ) and I() evaluates that all expression.

pred7 <- predict(lm(triceps~ age + I((age-5)*(age>=5)) +
                                   I((age-10)*(age >= 10)) +
                                   I((age-20)*(age >= 20)) +
                                   I((age-30)*(age >= 30)) +
                                  I((age-40)*(age >= 40)),
                    data = triceps))

tri.age.plot +
  geom_line(data=triceps, 
            aes(y = pred7, x=age), size = 1, col="blue")

TRY IT YOURSELF:

Using the same knots as above, fit a quadratic piecewise regression

See the solution code

 pred.quad <- predict(lm(triceps~ age + I(age^2) + 
                    I((age-5)*(age>=5)) + I((age-5)^2*(age>=5)) +
                    I((age-10)*(age >= 10)) + I((age-10)^2*(age>=10)) +
                    I((age-20)*(age >= 20)) + I((age-20)^2*(age>=20)) +
                    I((age-30)*(age >= 30)) + I((age-30)^2*(age>=30)) +
                    I((age-40)*(age >= 40)) + I((age-40)^2*(age>=40)),
                    data = triceps))

tri.age.plot +
  geom_line(data=triceps, 
            aes(y = pred.quad, x=age), size = 1, col="blue")

Task 2 - Fit a natural cubic spline

We will the same dataset triceps as in TASK 1 to fit a natural cubic spline for the association of age and triceps.

The function bs() in the splines package generates the B-spline basis matrix for a polynomial spline, and the function ns() in the same library generates the B-spline basis matrix matrix for a natural cubic spline (restriction that the fitted curve linear at the extremes). We will compare both.

library(splines)
library(MultiKink) #for the data
library(ggplot2)   #for the plots
set.seed(1974)     #fix the random generator seed 

data("triceps")   #load the dataset triceps
                  #notice that the variable of interest
                  #it is also called triceps. Don't get 
                  #confused!

#linear model with the natural cubic splines function 
cub.splines.bs <- lm(triceps ~ bs(age, knots = c(5,10,20,30,40)), 
                   data=triceps)
summary(cub.splines.bs)

## 
## Call:
## lm(formula = triceps ~ bs(age, knots = c(5, 10, 20, 30, 40)), 
##     data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5234  -1.6912  -0.2917   1.1356  23.0922 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                              6.9598     0.9729   7.154 1.77e-12 ***
## bs(age, knots = c(5, 10, 20, 30, 40))1   2.5367     1.7154   1.479   0.1396    
## bs(age, knots = c(5, 10, 20, 30, 40))2  -0.3032     0.9629  -0.315   0.7529    
## bs(age, knots = c(5, 10, 20, 30, 40))3  -1.9092     1.2993  -1.469   0.1421    
## bs(age, knots = c(5, 10, 20, 30, 40))4   7.4056     1.2179   6.081 1.78e-09 ***
## bs(age, knots = c(5, 10, 20, 30, 40))5   6.1050     1.4043   4.347 1.54e-05 ***
## bs(age, knots = c(5, 10, 20, 30, 40))6  10.1770     1.5427   6.597 7.23e-11 ***
## bs(age, knots = c(5, 10, 20, 30, 40))7   3.9428     1.9082   2.066   0.0391 *  
## bs(age, knots = c(5, 10, 20, 30, 40))8  10.1473     1.7545   5.784 1.01e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.743 on 883 degrees of freedom
## Multiple R-squared:  0.4261, Adjusted R-squared:  0.4209 
## F-statistic: 81.94 on 8 and 883 DF,  p-value: < 2.2e-16

cub.splines.ns <- lm(triceps ~ ns(age, knots = c(5,10,20,30,40)), 
                   data=triceps)

summary(cub.splines.ns)

## 
## Call:
## lm(formula = triceps ~ ns(age, knots = c(5, 10, 20, 30, 40)), 
##     data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4875  -1.6873  -0.3665   1.1146  22.8643 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                              8.3811     0.5219  16.059  < 2e-16 ***
## ns(age, knots = c(5, 10, 20, 30, 40))1  -3.5592     0.6712  -5.303 1.44e-07 ***
## ns(age, knots = c(5, 10, 20, 30, 40))2   5.7803     1.0379   5.569 3.39e-08 ***
## ns(age, knots = c(5, 10, 20, 30, 40))3   5.5118     0.9416   5.853 6.78e-09 ***
## ns(age, knots = c(5, 10, 20, 30, 40))4   6.9070     0.9050   7.632 5.99e-14 ***
## ns(age, knots = c(5, 10, 20, 30, 40))5   5.4136     1.3783   3.928 9.24e-05 ***
## ns(age, knots = c(5, 10, 20, 30, 40))6   6.6460     1.0829   6.137 1.27e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.759 on 885 degrees of freedom
## Multiple R-squared:  0.4199, Adjusted R-squared:  0.416 
## F-statistic: 106.8 on 6 and 885 DF,  p-value: < 2.2e-16

Notice that are less regression parameters for the natural spline due to the linearity restriction. We can see this in the plot. To plot we could either get predictions from the fitted models or fit the models in the ggplot function directly:

 #simple scatter
 tri.age.plot <- ggplot(triceps, aes(x=age, y=triceps)) +
                 geom_point(alpha=0.55, color="black") + 
                 theme_minimal() 
 
  tri.age.plot +
    stat_smooth(method = "lm", 
               formula = y~bs(x,knots = c(5,10,20,30,40)), 
               lty = 1, col = "blue") + 
    stat_smooth(method = "lm", 
               formula = y~ns(x,knots = c(5,10,20,30,40)), 
               lty = 1, col = "red")

TRY IT YOURSELF:

Fit a natural spline with 6 degrees of freedom and compare it with the natural spline using knots = c(5,10,20,30,40). What is the difference?

See the solution code

 tri.age.plot +
    stat_smooth(method = "lm", 
               formula = y~ns(x,knots = c(5,10,20,30,40)), 
               lty = 1, col = "red") + 
    stat_smooth(method = "lm", 
               formula = y~ns(x,df=6), 
               lty = 1, col = "yellow")

#df=6 also chooses 5 knots but the knots
#are based on the quantiles of the data
#in this case the knots are at values:
attr(ns(triceps$age, df=6), "knots")

## 16.66667% 33.33333%       50% 66.66667% 83.33333% 
##      3.76      7.67     12.21     18.12     32.55

Calculate the MSE (or the root mean squared error) for the models using natural cubic splines with $df$ from 2 (linear model) up to 20. You can use the library caret.

See the solution code

library(caret)
set.seed(1001)

#repeated CV for the MSE
trC.lm <- trainControl(method = "repeatedcv", 
                         number = 10, 
                         repeats = 10)

#function to fit a spline with x degrees of freedom
  my.spline.f <- function(x) {
    #need to construct the model formula
    spline.formula <- as.formula(paste("triceps ~ ns(age, df=",x, ")" ))                                          
    pol.model <- train(spline.formula,                   
                        data = triceps,            
                        method = "lm",
                        trControl = trC.lm)
    
    RMSE.cv = pol.model$results[2]                #extracts the RMSE
  }
  
  #RMSE
  t(sapply(2:20, my.spline.f))                    #Computes the RMSE for splines
                                                  #with df degrees 2 to 20
  ###########################################################################
  #if you want to plot the curves,
  #it is tricky to get ggplot to work 
  #within a loop. This is a solutions:
  col.ran <- sample(colours(), 20)                  #colours for the lines
  my.plot<- tri.age.plot                            #scatterplot
  for (i in 2:20){
      #builds the stat_smooth with df=i
      loop_input <-  paste("stat_smooth(method = \"lm\", 
                          formula = y~ns(x,df=",i,"), 
                          lty = 1, col =\"",col.ran[i],"\", 
                          se = FALSE)", sep="")
      
      #updates the scatter plot with 
      #the new spline
      my.plot <- my.plot + eval(parse(text=loop_input))    
  }
  
 my.plot

2.4 Exercises

Solve the following exercise:

The dataset SA_heart.csv contains on coronary heart disease status (variable chd) and several risk factors including the cumulative tobacco consumption tobacco.

Fit a logistic model for chd using the predictor tobacco (as a linear effect) and compute its AIC
Plot the fitted curve in a)
Fit a logistic model for chd with a natural cubic spline for the predictor tobacco, with $df$ 5 and 10. Compute the AIC of the two models.
Plot the fitted curves in c)
Compute the cross-validated ROC of the models a) and c) (use the caret package)

See the solution code for e)

    library(caret)
    library(splines)
    set.seed(2001)
    SA_heart <- read.csv("https://www.dropbox.com/s/cwkw3p91zyizcqz/SA_heart.csv?dl=1")
    
    # caret will give an error for factors coded as 0 and 1
    # because it uses the factors names to create
    # names of internal variables. This way it is better
    #to use an outcome variable with strings as the factor names
    SA_heart$chd.f <- ifelse(SA_heart$chd ==1, 
                             "chd", 
                             "nochd")     
    
    #sets the control for 10-fold cross-validation, 10 times
    # the classProbs = TRUE and summaryFunction = twoClassSummary
    # store the information to compute the area under the ROC
    trC.lm <- trainControl(method = "repeatedcv", 
                           number = 10, 
                           repeats = 10,
                           classProbs = TRUE,                 #necessary for 
                           summaryFunction = twoClassSummary) #the AUC ROC

     #linear effect
     roc.l <- train(form = chd.f  ~ tobacco,
                    data = SA_heart,
                    method = "glm",
                    family = "binomial",
                    trControl = trC.lm,
                    metric = "ROC") 
     
     roc.5 <- train(form = chd.f  ~ ns(tobacco,df=5),
                    data = SA_heart,
                    method = "glm",
                    family = "binomial",
                    trControl = trC.lm,
                    metric = "ROC") 
     #cubic effect
     roc.10 <- train(form = chd.f  ~ ns(tobacco,df=10),
                    data = SA_heart,
                    method = "glm",
                    family = "binomial",
                    trControl = trC.lm,
                    metric = "ROC") 
  
     roc.l
     roc.5
     roc.10

Which model would you prefer?

The dataset fev.csv contains the measurements of forced expiratory volume (FEV) tests, evaluating the pulmonary capacity in 654 children and young adults.

Plot the association between fev and height and fit a linear model for fev using height as a predictor
Fit a model for fev with a cubic b-spline for the predictor height, with $df$ 5 and 10.
Fit a model for fev with a natural cubin spline for the predictor height, with $df$ 5 and 10.
Plot the fitted curves for models a), b) and c)
compare the cross-validated MSE of the models a), b) and c)