Chapter 8 Polynomial Regression
8.1 The Assumption of Linearity
- So far, we have worked under the assumption that the relationships between predictors and the response are (first-order) linear.
- In reality, they are almost never exactly linear.
- However, as long as the relationships are approximately linear, linear models work fine.
- But what can we do when the relationships are clearly non-linear?
Example
Figure 8.1 shows a fitted regression line from the model \(y = \beta_0 + \beta_1 x + \epsilon\): clearly not a good fit!
Why not add \(x^2\) in the equation in order to bend the straight line? See Figure 8.2.
The fitted regression line from the model \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon\) looks somewhat better, especially in the left tail.
Why not make the line even more flexible by adding an \(x^3\) term? See Figure 8.3.
The fitted regression line from model \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon\) looks just right.
8.2 Polynomial Regression Models
- We have just implemented polynomial regression - as easy as that!
- In general, polynomial models are of the form \[ y = f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \ldots + \beta_d x^d + \epsilon, \] where \(d\) is called the degree of the polynomial.
- Notice how convenient this is: the non-linear relationship between \(y\) and \(x\) is captured by the polynomial terms but the models remain linear in the parameters/coefficients (the models are additive).
- This means we can use standard Least Squares for estimation!
- For example, previously we fitted a third-order polynomial, which can also be written as \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \] with the predictors simply being \(x_1 = x\), \(x_2 = x^2\) and \(x_3 = x^3\); the sketch below verifies this equivalence numerically.
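As a quick illustration (a minimal sketch on simulated data, so the variable names and numbers are purely illustrative), the cubic fit can be obtained either through the formula interface or by constructing the polynomial columns by hand and running an ordinary multiple linear regression:

set.seed(1)
x.sim = runif(100, -2, 2)
y.sim = 1 + 2*x.sim - 3*x.sim^2 + 0.5*x.sim^3 + rnorm(100)

fit.poly = lm(y.sim ~ x.sim + I(x.sim^2) + I(x.sim^3))          # cubic polynomial via the formula

X1 = x.sim; X2 = x.sim^2; X3 = x.sim^3                          # predictors built by hand
fit.manual = lm(y.sim ~ X1 + X2 + X3)                           # standard least squares regression

all.equal( unname(coef(fit.poly)), unname(coef(fit.manual)) )   # identical coefficient estimates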
What degree \(d\) should I choose?
- What about \(d\) here - do we have a tuning problem again?
- In this case, not really. In practice, values of \(d\) greater than 3 or 4 are rarely used.
- This is because if \(d\) is large the fitted curve becomes overly flexible/complex and can take very strange shapes.
- We want the curve to pass through the data in a smooth way.
Example: Overfitting
We fit a polynomial of degree 3 to some training data. On the left of Figure 8.4, we compare the training data (red points) with their fitted values (that is, the model predictions at the training data inputs) joined up as a line. On the right, we compare some test data with their model-predicted values (again, joined up as a line). We can see that the learned curve fits relatively well on both the training data and the test data (it is very similar in both plots, noting that the fit is only plotted across the range of test points in the right-hand plot).

We then fit a polynomial of degree 14 to the same training data. On the left of Figure 8.5, we can see that the fitted curve tries to pass through every training data point. The effect of this overfitting on prediction is clearly seen on the right.
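A minimal sketch of this experiment on simulated data (illustrative only; the figures in the notes use their own training and test sets, and the exact errors depend on the random seed) might look as follows.

set.seed(2)
n = 30
f = function(x) 0.1*x^3 - x^2 + 2*x                         # a smooth underlying trend
train = data.frame(x = runif(n, 0, 10)); train$y = f(train$x) + rnorm(n, sd = 2)
test  = data.frame(x = runif(n, 0, 10)); test$y  = f(test$x)  + rnorm(n, sd = 2)

fit3  = lm(y ~ poly(x, 3),  data = train)                   # flexible enough
fit14 = lm(y ~ poly(x, 14), data = train)                   # overly flexible

mse = function(fit, data) mean( (data$y - predict(fit, newdata = data))^2 )
c(train3 = mse(fit3, train),   test3 = mse(fit3, test),
  train14 = mse(fit14, train), test14 = mse(fit14, test))
# Typically the degree-14 fit has the smaller training error but the larger test error.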
Example: Wage data
We consider the Wage dataset from the library ISLR, which contains income and demographic information for males who reside in the central Atlantic region of the United States. Each panel of Figure 8.6 plots wage against age, together with fitted polynomial models \(\hat{f}(x)\) (solid lines) of wage as a function of age, of degrees 2, 3 and 4 respectively. We can see some slight differences between the models for this data.
We fit these polynomial models because we are interested in modelling wage as a function \(\hat{f}\) of the predictor age as well as we can, leading to an understanding of the relationship between age and wage. For example, in the left-hand panel we have a degree-2 polynomial fit \(\hat{f}\). This allows us to estimate wage at a particular value of age, \(x_0\) say, as
\[
\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0 + \hat{\beta}_2 x_0^2
\]
where \(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2\) are the least squares fitted coefficients.
Least squares also returns variance estimates for each of the fitted coefficients \(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2\), as well as the covariances between pairs of coefficient estimates. We can use these to estimate the point-wise standard error of the (mean) prediction \(\hat{f}(x_0)\).
The coloured points on either side of the solid-line fitted curve therefore trace out \(2\times\) standard error curves, that is, they correspond to \(\hat{f}(x) \pm 2 \times \mathrm{SE}[\hat{f}(x)]\), where \(\mathrm{SE}[\hat{f}(x)]\) is the standard error of the mean prediction at \(x\). We plot twice the standard error because, for normally distributed error terms, this quantity corresponds to an approximate \(95\%\) confidence interval (in this case for the mean prediction at \(x\)).
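For reference, here is a sketch of how a panel like the left one of Figure 8.6 (degree-2 fit with \(\pm 2\) standard error bands) could be produced with the Wage data from ISLR; this mirrors the plotting approach used in the practical demonstration below, and the exact plotting options are an assumption rather than the code that produced the figure.

library(ISLR)
fit2 = lm(wage ~ poly(age, 2), data = Wage)                  # degree-2 polynomial fit

age.grid = seq(min(Wage$age), max(Wage$age), length.out = 100)
pred = predict(fit2, newdata = data.frame(age = age.grid), se = TRUE)

plot(Wage$age, Wage$wage, col = "darkgrey", xlab = "Age", ylab = "Wage")
lines(age.grid, pred$fit, lwd = 2, col = "blue")             # fitted curve
matlines(age.grid, cbind(pred$fit - 2*pred$se.fit,
                         pred$fit + 2*pred$se.fit),
         lwd = 1.4, col = "blue", lty = 3)                   # +/- 2 SE bands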
Disadvantages
- The fitted curve from polynomial regression is obtained by global training. That is, we use the entire range of values of the predictor to fit the curve.
- This can be problematic: if we get new samples from a specific subregion of the predictor, this might change the shape of the curve in other subregions (see the sketch after this list)!
- Ideally, we would like the curve to adapt locally within subregions which are relevant to the analysis.
- A first simple approach for resolving this is via step functions (see Chapter 9).
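The sketch below (simulated data, purely illustrative) demonstrates this global behaviour: adding new observations only at large values of \(x\) also changes the fitted curve, and hence the predictions, at small values of \(x\).

set.seed(3)
x.obs = runif(100, 0, 10)
y.obs = sin(x.obs) + rnorm(100, sd = 0.3)
fit.old = lm(y.obs ~ poly(x.obs, 3))

# Add 20 extra observations in the subregion x > 8 only, and refit
x.extra = runif(20, 8, 10)
x.obs2 = c(x.obs, x.extra)
y.obs2 = c(y.obs, sin(x.extra) + rnorm(20, sd = 0.3))
fit.new = lm(y.obs2 ~ poly(x.obs2, 3))

# Predictions at x = 1 change even though no new data were added near x = 1
predict(fit.old, newdata = data.frame(x.obs = 1))
predict(fit.new, newdata = data.frame(x.obs2 = 1))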
8.3 Practical Demonstration
In this section we will start with the analysis of the Boston dataset included in the R library MASS. This dataset consists of 506 samples. The response variable is the median value of owner-occupied homes in Boston (medv). The dataset has 13 associated predictor variables.
Initial steps
We load the library MASS (installing the package is not needed for MASS), check the help document for Boston, check for missing entries, and use the commands names() and head() to get a better picture of what this dataset looks like.
library(MASS)
help(Boston)
sum(is.na(Boston))
## [1] 0
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
We will analyse medv with respect to the predictor lstat (percentage of lower status population).
head(cbind(Boston$medv, Boston$lstat))
## [,1] [,2]
## [1,] 24.0 4.98
## [2,] 21.6 9.14
## [3,] 34.7 4.03
## [4,] 33.4 2.94
## [5,] 36.2 5.33
## [6,] 28.7 5.21
For convenience we will just name the response y and the predictor x. We will also pre-define the labels for the x- and y-axes that we will use repeatedly in figures throughout this practical.
y = Boston$medv
x = Boston$lstat
y.lab = 'Median Property Value'
x.lab = 'Lower Status (%)'
First, let's plot the two variables.
plot( x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "", bty = 'l' )
The plot suggests a non-linear relationship between medv and lstat.
Polynomial regression
Let us start by fitting a degree-2 polynomial to the data using the command lm() and summarising the results using summary().
poly2 = lm( y ~ x + I(x^2) )
summary( poly2 )
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <0.0000000000000002 ***
## x -2.332821 0.123803 -18.84 <0.0000000000000002 ***
## I(x^2) 0.043547 0.003745 11.63 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 0.00000000000000022
Important note: above we need to enclose x^2 within the wrapper function I(). This is because the ^ operator has a special meaning inside an lm() formula (it denotes crossing of terms), so arithmetic expressions built from such operators must be protected with I() in order to be interpreted literally; otherwise we get the wrong model! Besides that, we see the usual output from summary().
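To see what goes wrong without I() (a quick illustrative check, not part of the original analysis): in formula syntax x^2 collapses to x, so the quadratic term silently disappears.

coef( lm(y ~ x + x^2) )    # ^ is treated as formula crossing: same fit as lm(y ~ x)
coef( lm(y ~ x + I(x^2)) ) # I() protects the arithmetic: the intended quadratic fit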
The above syntax is not very convenient because we need to add further polynomial terms manually. For instance, for a degree-4 polynomial we would need to use lm(y ~ x + I(x^2) + I(x^3) + I(x^4)). Fortunately, the function poly() makes things easier for us.
poly2 = lm(y ~ poly(x, 2, raw = TRUE))
summary(poly2)
##
## Call:
## lm(formula = y ~ poly(x, 2, raw = TRUE))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <0.0000000000000002 ***
## poly(x, 2, raw = TRUE)1 -2.332821 0.123803 -18.84 <0.0000000000000002 ***
## poly(x, 2, raw = TRUE)2 0.043547 0.003745 11.63 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 0.00000000000000022
The argument raw = TRUE above is used in order to obtain the same coefficients as previously. If we do not use this argument, the coefficients will differ because they are calculated on a different (orthogonal) basis.
However, this is often not that important: with polynomial regression we are usually not interested in the individual regression coefficients as the model becomes more complex. In terms of the fitted curve, poly(x, 2, raw = TRUE) and poly(x, 2) give the same result!
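As a quick check of this claim (an extra verification, not part of the original analysis), we can compare the fitted values from the two parameterisations; since they span the same column space we would expect the comparison to return TRUE.

poly2.orth = lm(y ~ poly(x, 2))                                   # orthogonal basis
all.equal( unname(fitted(poly2)), unname(fitted(poly2.orth)) )    # fitted curves coincide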
Now, let's see how we can produce a plot similar to those shown in the Wage data example. A first important step is to create an object, which we name sort.x, containing the values of the predictor x sorted in ascending order. Without sort.x the fitted curve would be drawn as a zig-zag, because lines() joins points in the order they are supplied. We then use predict() with sort.x as input in order to proceed to the next steps.
sort.x = sort(x)
sort.x[1:10] # the first 10 sorted values of x
## [1] 1.73 1.92 1.98 2.47 2.87 2.88 2.94 2.96 2.97 2.98
pred2 = predict(poly2, newdata = list(x = sort.x), se = TRUE)
names(pred2)
## [1] "fit" "se.fit" "df" "residual.scale"
The object pred2 contains fit, the fitted values, and se.fit, the standard errors of the mean prediction, which we need in order to construct the approximate 95% confidence intervals (of the mean prediction).
With this information we can construct the confidence intervals using cbind(). Let's see what the first 10 fitted values and confidence intervals look like.
pred2$fit[1:10] # the first 10 fitted values of the curve
## 1 2 3 4 5 6 7 8
## 38.95656 38.54352 38.41374 37.36561 36.52550 36.50468 36.37992 36.33840
## 9 10
## 36.31765 36.29691
se.bands2 = cbind( pred2$fit - 2 * pred2$se.fit,
                   pred2$fit + 2 * pred2$se.fit )
se.bands2[1:10,] # the first 10 confidence intervals of the curve
## [,1] [,2]
## 1 37.58243 40.33069
## 2 37.20668 39.88036
## 3 37.08853 39.73895
## 4 36.13278 38.59845
## 5 35.36453 37.68647
## 6 35.34546 37.66390
## 7 35.23118 37.52865
## 8 35.19314 37.48365
## 9 35.17413 37.46117
## 10 35.15513 37.43870
Now that we have all that is needed we can finally produce the plot!
Final detail: below we use lines() for pred2$fit because this is a vector, but for se.bands2, which is a matrix, we have to use matlines().
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "Degree-2 polynomial", bty = 'l')
lines(sort.x, pred2$fit, lwd = 2, col = "red")
matlines(sort.x, se.bands2, lwd = 1.4, col = "red", lty = 3)
Having seen this procedure in detail once, plotting the fitted curves for polynomials of different degrees is straightforward. The code below produces a plot of degree-2 up to degree-5 polynomial fits.
poly3 = lm(y ~ poly(x, 3))
poly4 = lm(y ~ poly(x, 4))
poly5 = lm(y ~ poly(x, 5))

pred3 = predict(poly3, newdata = list(x = sort.x), se = TRUE)
pred4 = predict(poly4, newdata = list(x = sort.x), se = TRUE)
pred5 = predict(poly5, newdata = list(x = sort.x), se = TRUE)

se.bands3 = cbind(pred3$fit + 2*pred3$se.fit, pred3$fit - 2*pred3$se.fit)
se.bands4 = cbind(pred4$fit + 2*pred4$se.fit, pred4$fit - 2*pred4$se.fit)
se.bands5 = cbind(pred5$fit + 2*pred5$se.fit, pred5$fit - 2*pred5$se.fit)
par(mfrow = c(2,2))
# Degree-2
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "Degree-2 polynomial", bty = 'l')
lines(sort.x, pred2$fit, lwd = 2, col = "red")
matlines(sort.x, se.bands2, lwd = 2, col = "red", lty = 3)
# Degree-3
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "Degree-3 polynomial", bty = 'l')
lines(sort.x, pred3$fit, lwd = 2, col = "darkviolet")
matlines(sort.x, se.bands3, lwd = 2, col = "darkviolet", lty = 3)
# Degree-4
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "Degree-4 polynomial", bty = 'l')
lines(sort.x, pred4$fit, lwd = 2, col = "blue")
matlines(sort.x, se.bands4, lwd = 2, col = "blue", lty = 3)
# Degree-5
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "Degree-5 polynomial", bty = 'l')
lines(sort.x, pred5$fit, lwd = 2, col = "black")
matlines(sort.x, se.bands5, lwd = 2, col = "black", lty = 3)
For this data it is not clear which curve fits best. All four curves look reasonable given the data that we have available. Without further indications, one may choose the degree-2 polynomial since it is simpler and seems to do about as well as the others.
However, if we want to base our decision on a more formal procedure, rather than on a subjective judgement, one option is to use the classical statistical methodology of analysis of variance (ANOVA). Specifically, we perform sequential comparisons based on the F-test, comparing first the linear model vs. the quadratic model (degree-2 polynomial), then the quadratic model vs. the cubic model (degree-3 polynomial), and so on. We therefore have to fit the simple linear model, and we also choose to fit the degree-6 polynomial to investigate the effect of one further polynomial term as well. We can perform this analysis in RStudio using the command anova() as displayed below.
poly1 = lm(y ~ x)
poly6 = lm(y ~ poly(x, 6))
anova(poly1, poly2, poly3, poly4, poly5, poly6)
## Analysis of Variance Table
##
## Model 1: y ~ x
## Model 2: y ~ poly(x, 2, raw = TRUE)
## Model 3: y ~ poly(x, 3)
## Model 4: y ~ poly(x, 4)
## Model 5: y ~ poly(x, 5)
## Model 6: y ~ poly(x, 6)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 503 15347 1 4125.1 151.8623 < 0.00000000000000022 ***
## 3 502 14616 1 731.8 26.9390 0.0000003061 ***
## 4 501 13968 1 647.8 23.8477 0.0000014062 ***
## 5 500 13597 1 370.7 13.6453 0.0002452 ***
## 6 499 13555 1 42.4 1.5596 0.2123125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis checked at each step is that the additional polynomial term does not improve the fit, i.e. that the decrease in RSS is not significant. The hypothesis is rejected if the p-value (column Pr(>F)) is smaller than a given significance level (say 0.05). If the hypothesis is rejected, we move on to the next comparison. Based on this reasoning and the results above we would select the degree-5 polynomial.
Note that alternatively, we can always use (yes, you guessed it…) cross-validation to decide the degree of the polynomial model!
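As a sketch of how that could look here, assuming we are happy to refit the models with glm() so that we can use cv.glm() from the boot package (the exact CV errors will depend on the random fold assignment):

library(boot)
set.seed(1)
cv.errors = rep(NA, 6)
for (d in 1:6) {
  # glm() with the default gaussian family fits the same model as lm()
  glm.fit = glm(medv ~ poly(lstat, d), data = Boston)
  cv.errors[d] = cv.glm(Boston, glm.fit, K = 10)$delta[1]  # 10-fold CV estimate of test error
}
cv.errors             # CV error for degrees 1 to 6
which.min(cv.errors)  # degree with the smallest estimated test error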
We could also try polynomial regression with multiple predictors. For example, we can include the average number of rooms per dwelling, rm, as a second predictor. Note that in this case we also include the interaction term \(x_1 x_2\).
x1 = Boston$lstat
x2 = Boston$rm

polym1 <- lm(y ~ poly(x1, 2) + poly(x2, 2) + x1:x2)
summary(polym1)
##
## Call:
## lm(formula = y ~ poly(x1, 2) + poly(x2, 2) + x1:x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.9520 -2.5690 -0.4172 2.0432 27.6293
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.20765 4.91368 8.183 0.000000000000002315 ***
## poly(x1, 2)1 106.09456 61.94682 1.713 0.08739 .
## poly(x1, 2)2 13.47448 7.30579 1.844 0.06572 .
## poly(x2, 2)1 108.16782 12.75828 8.478 0.000000000000000259 ***
## poly(x2, 2)2 36.83909 6.27520 5.871 0.000000007919304425 ***
## x1:x2 -0.23121 0.06422 -3.600 0.00035 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.554 on 500 degrees of freedom
## Multiple R-squared: 0.7573, Adjusted R-squared: 0.7548
## F-statistic: 312 on 5 and 500 DF, p-value: < 0.00000000000000022
# Alternatively, polym() makes things easier for us.
polym2 = lm(y ~ polym(x1, x2, degree = 2))
summary(polym2)
##
## Call:
## lm(formula = y ~ polym(x1, x2, degree = 2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.9520 -2.5690 -0.4172 2.0432 27.6293
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.8222 0.2828 77.178 < 0.0000000000000002
## polym(x1, x2, degree = 2)1.0 -127.0837 6.6032 -19.246 < 0.0000000000000002
## polym(x1, x2, degree = 2)2.0 13.4745 7.3058 1.844 0.06572
## polym(x1, x2, degree = 2)0.1 61.9766 6.0811 10.192 < 0.0000000000000002
## polym(x1, x2, degree = 2)1.1 -585.8312 162.7253 -3.600 0.00035
## polym(x1, x2, degree = 2)0.2 36.8391 6.2752 5.871 0.00000000792
##
## (Intercept) ***
## polym(x1, x2, degree = 2)1.0 ***
## polym(x1, x2, degree = 2)2.0 .
## polym(x1, x2, degree = 2)0.1 ***
## polym(x1, x2, degree = 2)1.1 ***
## polym(x1, x2, degree = 2)0.2 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.554 on 500 degrees of freedom
## Multiple R-squared: 0.7573, Adjusted R-squared: 0.7548
## F-statistic: 312 on 5 and 500 DF, p-value: < 0.00000000000000022
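The two parameterisations above contain exactly the same five terms (\(x_1\), \(x_1^2\), \(x_2\), \(x_2^2\) and \(x_1 x_2\)), which is why the residual standard error and \(R^2\) agree. As a final sanity check (not part of the original output), we would expect their fitted values to be identical.

# polym(x1, x2, degree = 2) spans the same model space as
# poly(x1, 2) + poly(x2, 2) + x1:x2, so the fitted values coincide.
all.equal( unname(fitted(polym1)), unname(fitted(polym2)) )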