9.1 Splines
A regression spline fits a piecewise polynomial to the range of X partitioned by knots; K knots produce K + 1 piecewise polynomials (James et al. 2013). The polynomials can be of any degree d, but d is usually in the range [0, 3], most commonly 3 (a cubic spline). To avoid discontinuities in the fit, a degree-d spline is constrained to have continuous derivatives up to order d − 1 at each knot.
A cubic spline fit to a data set with K knots performs least squares regression with an intercept and 3 + K predictors, of the form
\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \beta_4 h(x_i, \xi_1) + \beta_5 h(x_i, \xi_2) + \dots + \beta_{K+3} h(x_i, \xi_K) + \epsilon_i\]
where \(\xi_1, \dots, \xi_K\) are the knots and \(h(X, \xi)\) is the truncated power basis function, \(h(X, \xi) = (X - \xi)^3\) if \(X > \xi\), and 0 otherwise.
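To make the basis concrete, here is a minimal sketch of fitting a cubic regression spline by ordinary least squares on the truncated power basis. The data, knot locations, and the `truncated_power_basis` helper are illustrative assumptions, not from the source.

```python
import numpy as np

# Toy data (illustrative only).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

knots = np.array([2.5, 5.0, 7.5])  # K = 3 knots, chosen arbitrarily here

def truncated_power_basis(x, knots, degree=3):
    """Columns: intercept, x, x^2, x^3, then h(x, xi) = (x - xi)^3 if x > xi, else 0."""
    cols = [np.ones_like(x)] + [x**d for d in range(1, degree + 1)]
    cols += [np.where(x > xi, (x - xi) ** degree, 0.0) for xi in knots]
    return np.column_stack(cols)

X = truncated_power_basis(x, knots)            # n rows, 1 + 3 + K columns
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares coefficients
y_hat = X @ beta                               # fitted values
```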
Splines can have high variance at the outer range of the predictors. A natural spline is a regression spline additionally constrained to be linear in the boundary regions, i.e., before the first knot and after the last knot.
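As a sketch, a natural cubic spline can be fit with the `cr()` (natural cubic regression spline) basis from the patsy library; the data and the choice of 5 degrees of freedom are illustrative assumptions, not from the source.

```python
import numpy as np
from patsy import dmatrix  # patsy provides a natural cubic spline basis via cr()

# Toy data (illustrative only).
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Natural cubic spline basis with 5 degrees of freedom; patsy places the knots.
X = np.asarray(dmatrix("cr(x, df=5)", {"x": x}))
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares fit
y_hat = X @ beta
```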
How many knots should there be, and where should they be placed? It is common to place the knots at uniform quantiles of the data, so that roughly equal numbers of observations fall between successive knots. The number of knots is typically chosen by trial and error, using cross-validation to minimize the cross-validated RSS. The number of knots is usually expressed in terms of degrees of freedom: a cubic spline has K + 3 + 1 degrees of freedom (the K + 3 predictors plus the intercept). A natural spline has K + 3 + 1 − 4 = K degrees of freedom, because the linearity constraint in each of the two boundary regions removes two degrees of freedom per boundary.
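The sketch below illustrates one way to choose the number of knots by k-fold cross-validation, placing the knots at uniform quantiles of the training data. The data, fold scheme, and candidate values of K are illustrative assumptions.

```python
import numpy as np

# Toy data (illustrative only).
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

def tpb(x, knots, degree=3):
    """Truncated power basis: intercept, polynomial terms, one hinge term per knot."""
    cols = [np.ones_like(x)] + [x**d for d in range(1, degree + 1)]
    cols += [np.where(x > xi, (x - xi) ** degree, 0.0) for xi in knots]
    return np.column_stack(cols)

def cv_rss(x, y, K, n_folds=5):
    """Mean held-out RSS for a cubic spline with K knots at uniform quantiles."""
    folds = np.arange(x.size) % n_folds          # interleaved fold assignment
    rss = 0.0
    for f in range(n_folds):
        tr, te = folds != f, folds == f
        knots = np.quantile(x[tr], np.linspace(0, 1, K + 2)[1:-1])  # interior quantiles
        beta, *_ = np.linalg.lstsq(tpb(x[tr], knots), y[tr], rcond=None)
        rss += np.sum((y[te] - tpb(x[te], knots) @ beta) ** 2)
    return rss / n_folds

best_K = min(range(1, 11), key=lambda K: cv_rss(x, y, K))  # K with lowest CV RSS
```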
A further constraint can be added to reduce overfitting by enforcing smoothness in the spline. Instead of choosing \(g\) to minimize the residual sum of squares \(\sum_i (y_i - g(x_i))^2\) alone, minimize a loss function with an additional penalty for variability:
\[L = \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt.\]
The function \(g(x)\) that minimizes this loss is a natural cubic spline with knots at each of \(x_1, \dots, x_n\); this is called a smoothing spline. The larger \(\lambda\) is, the greater the penalty on variation in the spline. In a smoothing spline you do not optimize the number or location of the knots: there is a knot at every training observation. Instead, you optimize \(\lambda\), for example by cross-validation to minimize the cross-validated RSS. Leave-one-out cross-validation (LOOCV) can be computed efficiently for smoothing splines.
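As a sketch, SciPy's `make_smoothing_spline` (available in SciPy 1.10 and later) fits a smoothing spline directly; when `lam` is not supplied it chooses \(\lambda\) by generalized cross-validation, a close relative of the LOOCV criterion described above. The data and the example \(\lambda\) values are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline  # requires SciPy >= 1.10

# Toy data (illustrative only); x must be strictly increasing.
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

spl_auto = make_smoothing_spline(x, y)               # lambda chosen by GCV
spl_wiggly = make_smoothing_spline(x, y, lam=1e-4)   # small lambda: rougher fit
spl_stiff = make_smoothing_spline(x, y, lam=10.0)    # large lambda: nearly linear fit

grid = np.linspace(0, 10, 400)
fitted = spl_auto(grid)                              # evaluate the fitted spline
```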
References
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. 1st ed. New York, NY: Springer. http://faculty.marshall.usc.edu/gareth-james/ISL/book.html.