# 1 Polynomial Regression

## 1.1 Introduction

The extension of the linear models \(y=\beta_0 + \beta_1x + \varepsilon\) to include higher degree polynomial terms \(x^2\), \(x^3\), …, \(x^p\) is straightforward. Each additional term can be viewed as another predictor in the regression equation:

\(y=\beta_0 + \beta_1x + \beta_2x^2 + \dots + \beta_px^p + \varepsilon\)

This allows the fit of more flexible models representing the association between the outcome and some continuous predictors.

However, in practice, we hardly go beyond the degree 3. If the association between the outcome and predictor is highly non-linear, than it is preferable to use the methods that we will discuss in the next sections.

## 1.3 Practice session

### Task 1 - Fit a cubic model

The dataset **triceps** is available in the `MultiKink`

package.

The data contains the measurement of the triceps skin fold of 892 females
(variable *triceps*) and we want to model its association with **age**.

First, we will load the data

```
#libraries that we will need
#install.packages("MultiKink")
library(MultiKink) #for the data
library(ggplot2) #for the plots
```

`## Warning: package 'ggplot2' was built under R version 4.1.2`

```
set.seed(1974) #fix the random generator seed
data("triceps") #load the dataset triceps
#notice that the variable of interest
#it is also called triceps. Don't get
#confused!
```

And plot the scatter for **triceps** and **age**

```
#simple scatter
#we can store the scatter in an object
#to use it later
<- ggplot(triceps, aes(x=age, y=triceps)) +
tri.age.plot geom_point(alpha=0.55, color="black") +
theme_minimal()
tri.age.plot
```

To fit a cubic model we can write all the terms using the `I()`

function to
evaluate \(x^2\) and \(x^3\), otherwise R will not use the quadratic and cubic terms:

```
<- lm(triceps~age + I(age^2) + I(age^3),
model.cubic data=triceps)
summary(model.cubic)
```

```
##
## Call:
## lm(formula = triceps ~ age + I(age^2) + I(age^3), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5832 -1.9284 -0.5415 1.3283 24.4440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.004e+00 3.831e-01 20.893 < 2e-16 ***
## age -3.157e-01 7.721e-02 -4.089 4.73e-05 ***
## I(age^2) 3.101e-02 3.964e-03 7.824 1.45e-14 ***
## I(age^3) -4.566e-04 5.612e-05 -8.135 1.38e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.868 on 888 degrees of freedom
## Multiple R-squared: 0.3836, Adjusted R-squared: 0.3815
## F-statistic: 184.2 on 3 and 888 DF, p-value: < 2.2e-16
```

Another option is to use the `poly()`

function. Note, however, that the this
function fits a **linear transformation** of the terms \(x, x^2,x^3\). This is
for numerical stability given that those three terms will be highly correlated.
Thus, the regression coefficients will not be the same but the model is just
a reparameterisation of the one above and its predictions are exactly the same.

```
<- lm(triceps~poly(age,3),
model.cubic.poly data=triceps)
summary(model.cubic.poly)
```

```
##
## Call:
## lm(formula = triceps ~ poly(age, 3), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5832 -1.9284 -0.5415 1.3283 24.4440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.7024 0.1295 74.911 < 2e-16 ***
## poly(age, 3)1 85.2594 3.8682 22.041 < 2e-16 ***
## poly(age, 3)2 -3.1638 3.8682 -0.818 0.414
## poly(age, 3)3 -31.4683 3.8682 -8.135 1.38e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.868 on 888 degrees of freedom
## Multiple R-squared: 0.3836, Adjusted R-squared: 0.3815
## F-statistic: 184.2 on 3 and 888 DF, p-value: < 2.2e-16
```

If you look at the predictions of both model, the results are exactly the same

`plot(predict(model.cubic.poly), predict(model.cubic)) `

We can also fit and plot the cubic model using `ggplot`

. We will use the initial
scatter plot an add the fitted curve

```
#plots
+
tri.age.plot stat_smooth(method = "lm",
formula = y~poly(x,3,raw=T), size = 1)
```

**TRY IT YOURSELF:**

- Add a linear fit to the plot above

## See the solution code

```
+
tri.age.plot stat_smooth(method = "lm",
formula = y~poly(x,3,raw=T), size = 1) +
stat_smooth(method = "lm",
formula = y~poly(x,1,raw=T), lty = 2, col = "red" , size = 1)
```

### Task 2 - Mean Squared Error for the quadratic model

We will use the same dataset and the same variables as in TASK 1 but now we want to compute the cross-validated MSE for the quadratic model

There are multiple ways of doing this. We will take advantage of the
easy implementation of cross-validation in the `caret`

package. We will do
10-fold cross-validations and repeat it 10 times:

`library(caret)`

`## Loading required package: lattice`

```
set.seed(1234)
#repeated CV for the MSE
<- trainControl(method = "repeatedcv",
trC.lm number = 10, #10-fold cross-validation
repeats = 10) #10 times
#function to fit a polynomial model of degree x
<- train(triceps ~ poly(age,2),
pol.model data = triceps,
method = "lm",
trControl = trC.lm)
#this is the root mean squared error
$results[2] pol.model
```

```
## RMSE
## 1 3.988524
```

**TRY IT YOURSELF:**

- Calculate the MSE (or the root mean squared error) for the model using a degree 4 polynomial, through cross-validation

## See the solution code

```
set.seed(1001)
#repeated CV for the MSE
<- trainControl(method = "repeatedcv",
trC.lm number = 10, #10-fold cross-validation
repeats = 10) #10 times
#function to fit a polynomial model of degree x
<- train(triceps ~ poly(age,4),
pol.model data = triceps,
method = "lm",
trControl = trC.lm)
#this is the root mean squared error
$results[2] pol.model
```

```
## RMSE
## 1 3.782918
```

- Calculate the MSE (or the root mean squared error) for the models using polynomials from degree 1 (linear) up to 10

## See the solution code

```
set.seed(1001)
#repeated CV for the MSE
<- trainControl(method = "repeatedcv",
trC.lm number = 10,
repeats = 10)
#function to fit a polynomial model of degree x
<- function(x) {
my.pol.f <-poly(triceps$age, x, raw=T) #design matrix with age,
xx#age^2, ..., age^10
<- cbind(triceps=triceps$triceps, xx) #dataset with the added
new.data #poly terms
<- train(triceps ~ ., #the . uses all the
pol.model data = new.data, #predictors
method = "lm",
trControl = trC.lm)
= pol.model$results[2]
RMSE.cv
}
#RMSE
t(sapply(1:10, my.pol.f)) #applies the function
#to poly degrees 1 to 10
```

## 1.4 Exercises

Solve the following exercise:

- The dataset SA_heart.csv
contains on coronary heart disease status (variable
*chd*) and several risk factors including the cumulative tobacco comsumption*tobacco*.

Fit a logistic model for

*chd*using the predictor*tobacco*(as a linear effect) and compute its AICPlot the fitted curve in a)

Fit a logistic model for

*chd*allowing a cubic effect of the predictor*tobacco*and compute its AIC.Plot the fitted curve in c)

Compute the cross-validated ROC of the models a) and c) (use the

`caret`

package)

## See the solution code for e)

```
library(caret)
set.seed(2001)
<- read.csv("https://www.dropbox.com/s/cwkw3p91zyizcqz/SA_heart.csv?dl=1")
SA_heart
# caret will give an error for factors coded as 0 and 1
# because it uses the factors names to create
# names of internal variables. This way it is better
#to use an outcome variable with strings as the factor names
$chd.f <- ifelse(SA_heart$chd ==1,
SA_heart"chd",
"nochd")
#sets the control for 10-fold cross-validation, 10 times
# the classProbs = TRUE and summaryFunction = twoClassSummary
# store the information to compute the area under the ROC
<- trainControl(method = "repeatedcv",
trC.lm number = 10,
repeats = 10,
classProbs = TRUE, #necessary for
summaryFunction = twoClassSummary) #the AUC ROC
#linear effect
<- train(form = chd.f ~ tobacco,
roc.l data = SA_heart,
method = "glm",
family = "binomial",
trControl = trC.lm,
metric = "ROC")
#cubic effect
<- train(form = chd.f ~ poly(tobacco,3),
roc.c data = SA_heart,
method = "glm",
family = "binomial",
trControl = trC.lm,
metric = "ROC")
roc.l roc.c
```

- Which model would you prefer?