7 Moving Beyond Linearity
Load the usual packages:
library(tidyverse)
library(tidymodels)
library(broom)
library(gt)
library(patchwork)
library(tictoc)
# Load my R package and set the ggplot theme
library(dunnr)
extrafont::loadfonts(device = "win", quiet = TRUE)
theme_set(theme_td())
set_geom_fonts()
set_palette()
In the previous chapter, we saw how to improve upon standard least squares linear regression using ridge regression, the lasso, PCA, and other techniques. In that setting, the complexity of the linear model is constrained in order to reduce the variance of the estimates. In this chapter, we relax the linearity assumption while still trying to maintain some interpretability, with these methods:
- Polynomial regression extends the linear model by adding extra predictors, obtained by raising the original predictors to a power.
- Step functions cut the range of a variable into \(K\) distinct regions to produce a qualitative variable.
- Regression splines are a flexible combination of polynomials and step functions that involve polynomial functions fit to data in \(K\) distinct regions.
- Smoothing splines are similar to regression splines but involve a smoothness penalty.
- Local regression is similar to splines but allows smooth overlaps across regions.
- Generalized additive models allow us to extend the above methods to deal with multiple predictors.
7.1 Polynomial Regression
Polynomial regression involves raising one or more predictors to a degree \(d\), each with its own coefficient:
\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d + \epsilon_i. \]
For a large enough degree \(d\), polynomial regression can produce an extremely non-linear curve, though it is unusual to use \(d\) greater than 3 or 4 because the curve can become overly flexible and produce some strange shapes.
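As a small aside (my own check, not from the text): R's poly() uses orthogonal polynomials by default, which give the same fitted values as the raw powers \(x, x^2, \dots, x^d\), just with numerically stabler, uncorrelated basis columns.
# A minimal sketch comparing raw and orthogonal polynomial bases
w <- ISLR2::Wage
fit_raw <- lm(wage ~ poly(age, 4, raw = TRUE), data = w) # age, age^2, age^3, age^4
fit_orth <- lm(wage ~ poly(age, 4), data = w)            # orthogonalized basis
all.equal(fitted(fit_raw), fitted(fit_orth)) # same fitted values (up to numerical tolerance)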
Re-create Figure 7.1 with the wage data:
wage <- ISLR2::Wage
wage_poly_4_linear_fit <- lm(wage ~ poly(age, 4), data = wage)
wage_poly_4_logistic_fit <- glm(I(wage > 250) ~ poly(age, 4),
                                data = wage, family = binomial)
# Grid of age values for predictions
age_grid <- seq(18, 80, length.out = 63)
p1 <- wage_poly_4_linear_fit %>%
  augment(newdata = tibble(age = age_grid), interval = "confidence") %>%
ggplot(aes(x = age)) +
geom_point(data = wage, aes(y = wage),
shape = 21, color = "grey50", alpha = 0.5) +
geom_line(aes(y = .fitted), color = "darkblue", size = 1.5) +
geom_line(aes(y = .lower),
lty = 2, color = "darkblue", size = 1) +
geom_line(aes(y = .upper),
lty = 2, color = "darkblue", size = 1) +
labs(y = "wage")
p2 <- wage_poly_4_logistic_fit %>%
  augment(newdata = tibble(age = age_grid),
          type.predict = "link", se_fit = TRUE) %>%
mutate(
# Have to compute CIs manually for logistic regression
.lower = .fitted - 1.96 * .se.fit,
.upper = .fitted + 1.96 * .se.fit,
# Convert from log-odds to probability scales
across(c(.fitted, .lower, .upper), ~ exp(.x) / (1 + exp(.x)))
  ) %>%
  ggplot(aes(x = age)) +
geom_rug(data = wage %>% filter(wage > 250),
sides = "t", color = "grey50", alpha = 0.5) +
geom_rug(data = wage %>% filter(wage <= 250),
sides = "b", color = "grey50", alpha = 0.5) +
geom_line(aes(y = .fitted), color = "darkblue", size = 1.5) +
geom_line(aes(y = .lower), lty = 2, color = "darkblue", size = 1) +
geom_line(aes(y = .upper), lty = 2, color = "darkblue", size = 1) +
coord_cartesian(ylim = c(0, 0.2)) +
labs(y = "Pr(wage > 250 | age)")
p1 + p2
7.2 Step Functions
Using polynomial functions of the features as predictors in a linear model imposes a global structure on the non-linear function of \(X\). We can instead use step functions in order to avoid imposing such a global structure. Here we break the range of \(X\) into bins, and fit a different constant in each bin. This amounts to converting a continuous variable into an ordered categorical variable.
This involves using a set of \(K\) cutpoints \(c_k\) which corresponds to dummy variables \(C_k(X)\):
\[ \begin{align} C_0(X) &= I(X < c_1), \\ C_1(X) &= I(c_1 \leq X < c_2), \\ C_2(X) &= I(c_2 \leq X < c_3), \\ &\vdots \\ C_{K-1}(X) &= I(c_{K-1} \leq X < c_K), \\ C_K(X) &= I(c_K \leq X), \end{align} \]
where \(I()\) is an indicator function that returns a 1 or 0 if the condition is true or false. The least squares linear model is then:
\[ y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2 (x_i) + \dots + \beta_K C_K (x_i) + \epsilon_i. \]
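To make the encoding concrete, here is a small sketch of my own (the break points are just illustrative) showing how cut() produces the factor that lm() then expands into the \(C_k(X)\) dummy variables:
# cut() bins a continuous variable; model.matrix() shows the resulting C_k(X) dummies
x <- c(20, 35, 50, 65, 78)
x_binned <- cut(x, breaks = c(17, 33.5, 49, 64.5, 80))
levels(x_binned)
model.matrix(~ x_binned) # the first bin is absorbed into the intercept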
To re-create Figure 7.2, I'll use cut with 4 breaks to separate the age predictor:
wage_step_linear_fit <- lm(wage ~ cut(age, breaks = 4), data = wage)
wage_step_logistic_fit <- glm(I(wage > 250) ~ cut(age, breaks = 4),
                              data = wage, family = binomial)
p1 <- wage_step_linear_fit %>%
  augment(newdata = tibble(age = age_grid),
          interval = "confidence", level = 0.50) %>%
ggplot(aes(x = age)) +
geom_point(data = wage, aes(y = wage),
shape = 21, color = "grey50", alpha = 0.5) +
geom_line(aes(y = .fitted), color = "darkgreen", size = 1.5) +
geom_line(aes(y = .lower),
lty = 2, color = "darkgreen", size = 1) +
geom_line(aes(y = .upper),
lty = 2, color = "darkgreen", size = 1) +
labs(y = "wage")
p2 <- wage_step_logistic_fit %>%
  augment(newdata = tibble(age = age_grid),
          type.predict = "link", se_fit = TRUE) %>%
mutate(
.lower = .fitted - 1.96 * .se.fit,
.upper = .fitted + 1.96 * .se.fit,
across(c(.fitted, .lower, .upper), ~ exp(.x) / (1 + exp(.x)))
  ) %>%
  ggplot(aes(x = age)) +
geom_rug(data = wage %>% filter(wage > 250),
sides = "t", color = "grey50", alpha = 0.5) +
geom_rug(data = wage %>% filter(wage <= 250),
sides = "b", color = "grey50", alpha = 0.5) +
geom_line(aes(y = .fitted), color = "darkgreen", size = 1.5) +
geom_line(aes(y = .lower), lty = 2, color = "darkgreen", size = 1) +
geom_line(aes(y = .upper), lty = 2, color = "darkgreen", size = 1) +
coord_cartesian(ylim = c(0, 0.2)) +
labs(y = "Pr(wage > 250 | age)")
p1 + p2
Unfortunately, unless there are natural breakpoints in the predictors, piecewise-constant functions can miss the action. For example, in the left-hand panel of Figure 7.2, the first bin clearly misses the increasing trend of wage with age. Nevertheless, step function approaches are very popular in biostatistics and epidemiology, among other disciplines. For example, 5-year age groups are often used to define the bins.
7.3 Basis Functions
Polynomial and piecewise-constant regression models are in fact special cases of a basis function approach. The idea is to have at hand a family of functions or transformations that can be applied to a variable \(X\): \(b_1(X), b_2(X), \dots, b_K(X)\). Instead of fitting a linear model in \(X\), we fit the model
\[ y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2 (x_i) + \dots + \beta_K b_K (x_i) + \epsilon_i. \]
In polynomial regression, these basis functions were \(b_j(x_i) = x_i^j\). In piecewise-constant regression, they were \(b_j(x_i) = I(c_j \leq x_i < c_{j+1})\). Despite the increased complexity, this still amounts to estimating the unknown regression coefficients \(\beta\), for which all the least squares tools and models apply.
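To convince myself, here is a minimal sketch (my own, with arbitrary basis choices) showing that both cases are just ordinary least squares on a constructed basis matrix:
# Polynomial and piecewise-constant regression are both least squares on a
# matrix of basis function columns b_1(x), ..., b_K(x)
x <- wage$age
y <- wage$wage
B_poly <- cbind(x, x^2, x^3)              # polynomial basis
B_step <- model.matrix(~ cut(x, 4))[, -1] # piecewise-constant (dummy) basis
fit_poly <- lm(y ~ B_poly)
fit_step <- lm(y ~ B_step)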
7.4 Regression Splines
7.4.1 Piecewise Polynomials
Instead of fitting a high-degree polynomial over the entire range of \(X\), piecewise polynomial regression involves fitting separate low-degree polynomials over different regions of \(X\). For example, a piecewise cubic polynomial works by fitting a cubic regression model of the form
\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i. \]
where the coefficients \(\beta_0, \beta_1, \beta_2,\) and \(\beta_3\) differ in parts of the range of \(X\). The points where the coefficients change are called knots.
With 0 knots, we have the standard cubic polynomial with \(d = 3\) as described in section 7.1. With a single knot at point \(c\), this takes the form:
\[ \begin{align} y_i &= \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i \ \ \ \text{if} \ \ x_i < c \\ &= \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i \ \ \ \text{if} \ \ x_i \geq c. \end{align} \]
Functionally, this is essentially fitting two separate regression equations on subsets of \(X\), with 8 degrees of freedom for the eight regression coefficients.
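Here is a hedged sketch of what this unconstrained piecewise fit looks like in code, with a single knot that I arbitrarily place at age 50: two independent cubic fits, for 2 × 4 = 8 coefficients in total.
# An unconstrained piecewise cubic: two separate cubic regressions split at one knot
fit_left <- lm(wage ~ poly(age, 3, raw = TRUE), data = wage, subset = age < 50)
fit_right <- lm(wage ~ poly(age, 3, raw = TRUE), data = wage, subset = age >= 50)
length(coef(fit_left)) + length(coef(fit_right)) # 8 degrees of freedom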
7.4.2 Constraints and Splines
The problem with piecewise polynomials is that the resulting fit can be discontinuous, like in the top left panel of Figure 7.3.
To remedy this, we can fit a piecewise polynomial under the constraint that the fitted curve must be continuous, like in the top right panel.
The bottom left panel shows the result of two additional constraints: that the first and second derivatives are also continuous at age = 50. This is called a cubic spline, which generally has \(K + 4\) degrees of freedom (5 in this example, with \(K = 1\) knot).
The lower right panel shows a linear spline.
7.4.3 The Spline Basis Representation
In order to implement the continuity constraints for regression splines, we can use the basis model. A cubic spline with \(K\) knots can be modeled as
\[ y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2 (x_i) + \dots + \beta_{K+3} b_{K+3} (x_i) + \epsilon_i, \]
for an appropriate choice of basis functions \(b_1, b_2, \dots, b_{K+3}\). The model can then be fit with least squares.
There are many equivalent representations of cubic splines using different basis functions. The most direct is to start with the cubic polynomial and then add one truncated power basis function per knot:
\[ \begin{align} h(x, \xi) = (x - \xi)^3_+ &= (x - \xi)^3 \ \ \ \text{if} \ \ \ x > \xi \\ &= 0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{otherwise}, \end{align} \]
where \(\xi\) is the knot. One can show that adding the term \(\beta_4 h(x, \xi)\) to the cubic spline model above will lead to a discontinuity in only the third derivative at \(\xi\), but remain continuous in the first and second derivatives.
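Here is a small sketch of my own of the truncated power basis, with knots at ages 25, 40, and 60 (the same knots used in the figure below). Fitting least squares on these columns spans the same space as bs() with the same knots, so the fitted values should agree up to numerical tolerance (the basis itself differs):
# The truncated power basis h(x, xi) = (x - xi)^3_+ for a cubic spline
h <- function(x, xi) pmax(0, x - xi)^3
tp_basis <- with(wage, cbind(age, age^2, age^3, h(age, 25), h(age, 40), h(age, 60)))
fit_tp <- lm(wage$wage ~ tp_basis)
fit_bs <- lm(wage ~ splines::bs(age, knots = c(25, 40, 60)), data = wage)
all.equal(fitted(fit_tp), fitted(fit_bs), tolerance = 1e-6)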
Unfortunately, these splines have high variance at boundaries of the predictors. This can be reduced with natural splines which have additional constraints at the boundaries to produce more stable estimates. We can show this with confidence intervals of the models fit with cubic and natural cubic splines, as in Figure 7.4:
library(splines)
# Use just a subset of the data to mimic the figure
set.seed(20)
d <- wage %>%
  filter(wage < 300) %>%
  slice_sample(n = 500)

wage_bs_linear_fit <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = d)
wage_ns_linear_fit <- lm(wage ~ ns(age, knots = c(25, 40, 60)), data = d)
bind_rows(
  wage_bs_linear_fit %>%
    augment(newdata = tibble(age = age_grid), interval = "confidence") %>%
    mutate(model = "Cubic spline"),
  wage_ns_linear_fit %>%
    augment(newdata = tibble(age = age_grid), interval = "confidence") %>%
    mutate(model = "Natural cubic spline")
) %>%
  mutate(model = fct_rev(model)) %>%
ggplot(aes(x = age)) +
geom_point(data = d, aes(y = wage),
shape = 21, color = "grey50", alpha = 0.5) +
geom_line(aes(y = .fitted, color = model), size = 1.5) +
geom_line(aes(y = .lower, color = model), lty = 2, size = 1) +
geom_line(aes(y = .upper, color = model), lty = 2, size = 1) +
geom_vline(xintercept = c(25, 40, 60), lty = 2) +
coord_cartesian(ylim = c(40, 300)) +
theme(legend.position = c(0.7, 0.9)) +
labs(color = NULL)
7.4.4 Choosing the Number and Locations of the Knots
When we fit a spline, where should we place the knots? The regression spline is most flexible in regions that contain a lot of knots, because in those regions the polynomial coefficients can change rapidly. Hence, one option is to place more knots in places where we feel the function might vary most rapidly, and to place fewer knots where it seems more stable. While this option can work well, in practice it is common to place knots in a uniform fashion. One way to do this is to specify the desired degrees of freedom, and then have the software automatically place the corresponding number of knots at uniform quantiles of the data.
The splines::ns() function, when provided the df argument, computes knot locations based on percentiles.
For the age data and df = 4 (3 knots):
age_ns <- ns(wage$age, df = 4)
attr(age_ns, "knots")
## 25% 50% 75%
## 33.75 42.00 51.00
Fit the model and re-create Figure 7.5:
wage_ns_linear_fit <- lm(wage ~ ns(age, df = 4), data = wage)
wage_ns_logistic_fit <- glm(I(wage > 250) ~ ns(age, df = 4),
                            data = wage, family = binomial)
p1 <- wage_ns_linear_fit %>%
  augment(newdata = tibble(age = age_grid), interval = "confidence") %>%
ggplot(aes(x = age)) +
geom_point(data = wage, aes(y = wage),
shape = 21, color = "grey50", alpha = 0.5) +
geom_line(aes(y = .fitted), color = "red", size = 1.5) +
geom_line(aes(y = .lower), color = "red", lty = 2, size = 1) +
geom_line(aes(y = .upper), color = "red", lty = 2, size = 1) +
geom_vline(xintercept = attr(age_ns, "knots"), lty = 2)
p2 <- wage_ns_logistic_fit %>%
  augment(newdata = tibble(age = age_grid),
          type.predict = "link", se_fit = TRUE) %>%
mutate(
.lower = .fitted - 1.96 * .se.fit,
.upper = .fitted + 1.96 * .se.fit,
across(c(.fitted, .lower, .upper), ~ exp(.x) / (1 + exp(.x)))
  ) %>%
  ggplot(aes(x = age)) +
geom_rug(data = wage %>% filter(wage > 250),
sides = "t", color = "grey50", alpha = 0.5) +
geom_rug(data = wage %>% filter(wage <= 250),
sides = "b", color = "grey50", alpha = 0.5) +
geom_line(aes(y = .fitted), color = "red", size = 1.5) +
geom_line(aes(y = .lower), lty = 2, color = "red", size = 1) +
geom_line(aes(y = .upper), lty = 2, color = "red", size = 1) +
geom_vline(xintercept = attr(age_ns, "knots"), lty = 2) +
coord_cartesian(ylim = c(0, 0.2)) +
labs(y = "Pr(wage > 250 | age)")
p1 + p2
The automatic choice of knot location is usually sufficient, but how do we decide on the number of knots \(K\)?
The same way we usually choose model hyperparameters: resampling.
For this, we turn to tidymodels, which has step_bs() and step_ns() functions for specifying basis splines and natural splines.
set.seed(93)
wage_resamples <- vfold_cv(wage, v = 10)

wage_lm_bs_rec <- recipe(wage ~ age, data = wage) %>%
  step_bs(age, deg_free = tune(), degree = tune())
wage_lm_ns_rec <- recipe(wage ~ age, data = wage) %>%
  step_ns(age, deg_free = tune())

lm_spec <- linear_reg()

wage_lm_bs_workflow <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(wage_lm_bs_rec)
wage_lm_ns_workflow <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(wage_lm_ns_rec)
If I understand Figure 7.6 correctly, the results being displayed in the right panel aren’t all cubic splines because those models can’t have <3 degrees of freedom.
For example, if I set the degrees of freedom to df = 2 and the polynomial degree to degree = 3, then bs() will calculate a negative number of knots.
It will automatically adjust df so that the number of knots equals 0 and return a warning:
lm(wage ~ bs(age, df = 2, degree = 3), data = wage)
## Warning in bs(age, df = 2, degree = 3): 'df' was too small; have used 3
##
## Call:
## lm(formula = wage ~ bs(age, df = 2, degree = 3), data = wage)
##
## Coefficients:
## (Intercept) bs(age, df = 2, degree = 3)1
## 58.69 102.64
## bs(age, df = 2, degree = 3)2 bs(age, df = 2, degree = 3)3
## 48.76 40.80
So in choosing the hyperparameters for these models, I will vary both deg_free and degree so that it fits a linear spline, a quadratic spline, and 7 cubic splines (with varying numbers of knots) as follows:
bs_df_grid <- bind_rows(
  tibble(deg_free = 1:3, degree = 1:3),
  tibble(deg_free = 4:10, degree = 3)
)
bs_df_grid
## # A tibble: 10 × 2
## deg_free degree
## <int> <dbl>
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 3
## 5 5 3
## 6 6 3
## 7 7 3
## 8 8 3
## 9 9 3
## 10 10 3
wage_lm_bs_tune <- tune_grid(
  wage_lm_bs_workflow, resamples = wage_resamples,
  grid = bs_df_grid
)
The way degrees of freedom work in natural splines doesn't require this adjustment; I can just provide deg_free = 1:10:
wage_lm_ns_tune <- tune_grid(
  wage_lm_ns_workflow, resamples = wage_resamples,
  grid = bs_df_grid %>% select(deg_free)
)
Finally, re-create Figure 7.6:
p1 <- collect_metrics(wage_lm_ns_tune) %>%
  filter(.metric == "rmse") %>%
mutate(mse = mean^2) %>%
ggplot(aes(x = deg_free, y = mse)) +
geom_line(color = "red") +
geom_point(fill = "red", color = "white", shape = 21, size = 3) +
scale_x_continuous("Degrees of freedom of natural spline",
breaks = seq(2, 10, 2)) +
labs(y = "Mean squared error")
p2 <- collect_metrics(wage_lm_bs_tune) %>%
  filter(.metric == "rmse") %>%
mutate(mse = mean^2) %>%
ggplot(aes(x = deg_free, y = mse)) +
geom_line(color = "blue") +
geom_point(fill = "blue", color = "white", shape = 21, size = 3) +
scale_x_continuous("Degrees of freedom of cubic spline",
breaks = seq(2, 10, 2)) +
labs(y = NULL)
p1 + p2
7.4.5 Comparison to Polynomial Regression
Figure 7.7 compares a natural cubic spline with 15 degrees of freedom to a degree-15 polynomial regression on the wage data:
wage_ns_15_fit <- lm(wage ~ ns(age, df = 15), data = wage)
wage_poly_15_fit <- lm(wage ~ poly(age, degree = 15), data = wage)
bind_rows(
  wage_ns_15_fit %>%
    augment(newdata = tibble(age = age_grid)) %>%
    mutate(model = "Natural cubic spline"),
  wage_poly_15_fit %>%
    # Extend the age range a bit to see more of the fit
    augment(newdata = tibble(age = c(age_grid, 80.5))) %>%
    mutate(model = "Polynomial")
) %>%
  ggplot(aes(x = age)) +
geom_point(data = wage, aes(y = wage),
shape = 21, color = "grey50", alpha = 0.5) +
geom_line(aes(y = .fitted, color = model), size = 1.5) +
theme(legend.position = c(0.8, 0.8)) +
labs(color = NULL)
The extra flexibility in the polynomial produces undesirable results at the boundaries, while the natural cubic spline still provides a reasonable fit to the data. Regression splines often give superior results to polynomial regression. This is because unlike polynomials, which must use a high degree (exponent in the highest monomial term, e.g. \(X^{15}\)) to produce flexible fits, splines introduce flexibility by increasing the number of knots but keeping the degree fixed. Generally, this approach produces more stable estimates. Splines also allow us to place more knots, and hence flexibility, over regions where the function \(f\) seems to be changing rapidly, and fewer knots where \(f\) appears more stable.
7.5 Smoothing Splines
7.5.1 An Overview of Smoothing Splines
To fit a curve \(g(x)\) to a set of data, we want \(\text{RSS} = \sum_{i=1}^n (y_i - g(x_i))^2\) to be small. If there are no constraints on \(g(x)\), however, then it will simply overfit the data completely for \(\text{RSS} = 0\).
There are a number of ways to ensure that \(g\) is smooth. One way is to find \(g\) that minimizes
\[ \text{RSS} + \lambda \int g'' (t)^2 dt \]
where \(\lambda\) is a non-negative penalty parameter (that will control the bias-variance trade-off as we’ve seen before). The function \(g\) that minimizes the above is known as a smoothing spline. By placing the constraint on the second derivative \(g''\), we encourage \(g\) to be smooth.
It turns out that the optimal \(g(x)\) is a natural cubic spline with knots at the unique values of \(x\) (and continuous first and second derivatives at each knot). However, it is not the same spline we would get with the basis function approach in Section 7.4.3 with knots at each value of \(x\) – rather, it is a shrunken version of that spline, where \(\lambda\) controls the level of shrinkage.
7.5.2 Choosing the Smoothing Parameter \(\lambda\)
It may seem like a smoothing spline, with knots at every unique value of \(x_i\), will have far too many degrees of freedom, but the tuning parameter \(\lambda\) controls the effective degrees of freedom \(df_{\lambda}\). The higher \(df_{\lambda}\), the more flexible (lower bias, higher variance) the smoothing spline.
Unlike the other spline methods, we do not need to select the number or location of the knots. We instead have to choose the value of \(\lambda\), usually via cross-validation. It turns out that the LOOCV error can be computed very efficiently for smoothing splines, with essentially the same cost as computing a single fit, with this formula:
\[ \text{RSS}_{cv} (\lambda) = \sum_{i=1}^n (y_i - \hat{g}_{\lambda}^{(-i)} (x_i))^2 = \sum_{i=1}^n \left[\frac{y_i - \hat{g}_{\lambda}(x_i)}{1 - \{\textbf{S}_{\lambda}\}_{ii}}\right]^2, \]
where \(\hat{g}_{\lambda}^{(-i)}\) is the fitted value for this smoothing spline evaluated at \(x_i\), with the fit using all training observations except for \((x_i, y_i)\).
In contrast, \(\hat{g}_{\lambda}(x_i)\) indicates the smoothing spline function fit to all of the training observations and evaluated at \(x_i\). This remarkable formula says that we can compute each of these leave- one-out fits using only \(\hat{g}_{\lambda}\), the original fit to all of the data! We have a very similar formula (5.2) on page 202 in Chapter 5 for least squares linear regression. Using (5.2), we can very quickly perform LOOCV for the regression splines discussed earlier in this chapter, as well as for least squares regression using arbitrary basis functions.
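As a quick sanity check of the analogous least squares shortcut, here is a sketch of my own using a regression spline with a fixed basis (explicit knots and boundary knots): the LOOCV residuals from a single fit, \(e_i / (1 - h_{ii})\), match a brute-force leave-one-out refit.
# LOOCV residuals for a least squares fit on a fixed spline basis
spline_basis <- splines::bs(wage$age, knots = c(25, 40, 60),
                            Boundary.knots = c(18, 80))
fit <- lm(wage$wage ~ spline_basis)
loocv_shortcut <- residuals(fit) / (1 - hatvalues(fit))
# Brute-force check on the first few observations
brute <- sapply(1:5, function(i) {
  refit <- lm(wage$wage[-i] ~ spline_basis[-i, ])
  wage$wage[i] - sum(c(1, spline_basis[i, ]) * coef(refit))
})
all.equal(unname(loocv_shortcut[1:5]), brute)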
The smooth.spline() function fits a smoothing spline to data as follows:
wage_smooth_df_16 <- smooth.spline(x = wage$age, y = wage$wage, df = 16)
wage_smooth_df_16
## Call:
## smooth.spline(x = wage$age, y = wage$wage, df = 16)
##
## Smoothing Parameter spar= 0.4732071 lambda= 0.0006537868 (13 iterations)
## Equivalent Degrees of Freedom (Df): 16.00237
## Penalized Criterion (RSS): 61597.01
## GCV: 1599.69
wage_smooth_cv <- smooth.spline(x = wage$age, y = wage$wage, cv = TRUE)
wage_smooth_cv
## Call:
## smooth.spline(x = wage$age, y = wage$wage, cv = TRUE)
##
## Smoothing Parameter spar= 0.6988943 lambda= 0.02792303 (12 iterations)
## Equivalent Degrees of Freedom (Df): 6.794596
## Penalized Criterion (RSS): 75215.9
## PRESS(l.o.o. CV): 1593.383
bind_rows(
as_tibble(predict(wage_smooth_df_16, age_grid)) %>%
mutate(model = "16 degrees of freedom"),
as_tibble(predict(wage_smooth_cv, age_grid)) %>%
mutate(model = "6.8 degrees of freedom (LOOCV)")
) %>%
  ggplot() +
geom_point(data = wage, aes(x = age, y = wage),
shape = 21, color = "grey50", alpha = 0.5) +
geom_line(aes(x = x, y = y, color = model), size = 1.5) +
theme(legend.position = "top") +
labs(color = NULL)
The red curve indicates the fit obtained from pre-specifying that we would like a smoothing spline with 16 effective degrees of freedom. The blue curve is the smoothing spline obtained when \(\lambda\) is chosen using LOOCV; in this case, the value of \(\lambda\) chosen results in 6.8 effective degrees of freedom (computed using (7.13)). For this data, there is little discernible difference between the two smoothing splines, beyond the fact that the one with 16 degrees of freedom seems slightly wigglier. Since there is little difference between the two fits, the smoothing spline fit with 6.8 degrees of freedom is preferable, since in general simpler models are better unless the data provides evidence in support of a more complex model.
7.6 Local Regression
Local regression involves computing the fit at a target point \(x_0\) using just the nearby (local) training observations. It involves a number of choices, such as the weighting function \(K\), and whether to fit a linear, constant, or quadratic regression. The most important choice is the span \(s\), which is the proportion of points used to compute the local regression at \(x_0\). It is analogous to \(\lambda\) in smoothing splines: the smaller the value of \(s\), the more local and flexible the fit; the larger the value, the less flexible.
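Before reaching for loess() below, here is a sketch of my own of the mechanics at a single target point \(x_0\): take the fraction \(s\) of nearest training points, weight them with a kernel (tricube here), and run weighted least squares. This is only illustrative; it won't reproduce loess() exactly, which defaults to a local quadratic fit among other refinements.
# Local linear regression at a single target point x0
local_fit_at <- function(x0, x, y, s = 0.2) {
  n_local <- ceiling(s * length(x))
  d <- abs(x - x0)
  d_max <- sort(d)[n_local]                         # distance to the furthest of the s * n nearest points
  w <- ifelse(d <= d_max, (1 - (d / d_max)^3)^3, 0) # tricube weights
  fit <- lm(y ~ x, weights = w)                     # weighted least squares on the local points
  predict(fit, newdata = data.frame(x = x0))
}
local_fit_at(x0 = 40, x = wage$age, y = wage$wage, s = 0.2)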
wage_local_s_0.2 <- loess(wage ~ age, data = wage, span = 0.2)
wage_local_s_0.7 <- loess(wage ~ age, data = wage, span = 0.7)

wage_local_s_0.2; wage_local_s_0.7
## Call:
## loess(formula = wage ~ age, data = wage, span = 0.2)
##
## Number of Observations: 3000
## Equivalent Number of Parameters: 16.42
## Residual Standard Error: 39.92
## Call:
## loess(formula = wage ~ age, data = wage, span = 0.7)
##
## Number of Observations: 3000
## Equivalent Number of Parameters: 5.3
## Residual Standard Error: 39.9
bind_rows(
  augment(wage_local_s_0.2, newdata = tibble(age = age_grid)) %>%
    mutate(model = "Span is 0.2 (16.4 degrees of freedom)"),
  augment(wage_local_s_0.7, newdata = tibble(age = age_grid)) %>%
    mutate(model = "Span is 0.7 (5.3 degrees of freedom)")
) %>%
  ggplot(aes(x = age)) +
geom_point(data = wage, aes(y = wage),
shape = 21, color = "grey50", alpha = 0.5) +
geom_line(aes(y = .fitted, color = model), size = 1.5) +
theme(legend.position = "top") +
labs(color = NULL)
7.7 Generalized Additive Models
In previous sections, we have presented approaches for flexibly predicting a response \(Y\) with a single predictor \(X\). Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity.
7.7.1 GAMs for Regression Problems
To extend the multiple linear regression model, we replace the coefficient components \(\beta_j x_{ij}\) with smooth non-linear functions \(f_j (x_{ij})\):
\[ y_i = \beta_0 + f_1 (x_{i1}) + f_2 (x_{i2}) + \dots + f_p (x_{ip}) + \epsilon_i. \]
Consider the task of fitting the model:
\[ \texttt{wage} = \beta_0 + f_1(\texttt{year}) + f_2 (\texttt{age}) + f_3 (\texttt{education}) + \epsilon. \]
In the example in Figure 7.11, the quantitative variables year and age are fit using natural splines, and the qualitative variable education (with 5 levels) is fit via the usual dummy variable approach.
wage_gam_ns_fit <- lm(wage ~ ns(year, 4) + ns(age, 5) + education, data = wage)
# Though the model wasn't fit with `gam::gam()`, we can still use the
# `gam::plot.Gam()` function on the `lm` object to retrieve the smooth functions
# Grab the data from the plot object so I can use it in `ggplot2`
d <- invisible(gam::plot.Gam(wage_gam_ns_fit))
d <- map(
  d$preplot,
  ~ tibble(x = .$x, y = .$y, y_se = .$se.y, xlab = .$xlab, ylab = .$ylab) %>%
    mutate(y_lower = y - y_se, y_upper = y + y_se)
)
p1 <- d$`ns(year, 4)` %>%
  ggplot(aes(x)) +
geom_line(aes(y = y), color = "red") +
geom_line(aes(y = y_lower), lty = 2, color = "red") +
geom_line(aes(y = y_upper), lty = 2, color = "red") +
labs(x = NULL, title = d$`ns(year, 4)`$ylab, y = NULL) +
ylim(c(-20, 20))
p2 <- d$`ns(age, 5)` %>%
  ggplot(aes(x)) +
geom_line(aes(y = y), color = "red") +
geom_line(aes(y = y_lower), lty = 2, color = "red") +
geom_line(aes(y = y_upper), lty = 2, color = "red") +
labs(x = NULL, title = d$`ns(age, 5)`$ylab, y = NULL)
p3 <- d$education %>%
  ggplot(aes(x)) +
geom_errorbar(aes(y = y, ymin = y_lower, ymax = y_upper)) +
labs(x = NULL, title = d$education$ylab, y = NULL) +
theme(axis.text.x = element_text(angle = 45, size = 8, vjust = 0.7))
p1 + p2 + p3
For Figure 7.12, we use the gam package and specify smoothing splines:
library(gam)
## Loaded gam 1.20.2
wage_gam_smooth_fit <-
  gam(wage ~ s(year, 4) + s(age, 5) + education,
      data = wage)
d <- plot(wage_gam_smooth_fit, se = TRUE, col = "blue")
d <- map(
  d$preplot,
  ~ tibble(x = .$x, y = .$y, y_se = .$se.y, xlab = .$xlab, ylab = .$ylab) %>%
    mutate(y_lower = y - y_se, y_upper = y + y_se)
)
p1 <- d$`s(year, 4)` %>%
  ggplot(aes(x)) +
geom_line(aes(y = y), color = "blue") +
geom_line(aes(y = y_lower), lty = 2, color = "blue") +
geom_line(aes(y = y_upper), lty = 2, color = "blue") +
labs(x = NULL, title = d$`s(year, 4)`$ylab, y = NULL) +
ylim(c(-20, 20))
p2 <- d$`s(age, 5)` %>%
  ggplot(aes(x)) +
geom_line(aes(y = y), color = "blue") +
geom_line(aes(y = y_lower), lty = 2, color = "blue") +
geom_line(aes(y = y_upper), lty = 2, color = "blue") +
labs(x = NULL, title = d$`s(age, 5)`$ylab, y = NULL)
p3 <- d$education %>%
  ggplot(aes(x)) +
geom_errorbar(aes(y = y, ymin = y_lower, ymax = y_upper)) +
labs(x = NULL, title = d$education$ylab, y = NULL) +
theme(axis.text.x = element_text(angle = 45, size = 8, vjust = 0.7))
p1 + p2 + p3
The fitted functions are essentially equivalent between the natural and smoothing splines.
Fitting a GAM with a smoothing spline is not quite as simple as fitting a GAM with a natural spline, since in the case of smoothing splines, least squares cannot be used. However, standard software such as the gam() function in R can be used to fit GAMs using smoothing splines, via an approach known as backfitting. This method fits a model involving multiple predictors by repeatedly updating the fit for each predictor in turn, holding the others fixed. The beauty of this approach is that each time we update a function, we simply apply the fitting method for that variable to a partial residual.
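To make the backfitting idea concrete, here is a rough sketch of my own for a two-term additive model (dropping the qualitative education term for brevity); the real gam() implementation adds convergence checks, handles parametric terms, and more.
# Backfitting wage ~ beta0 + f1(year) + f2(age) by smoothing partial residuals
y <- wage$wage
beta0 <- mean(y)
f1 <- f2 <- rep(0, length(y))
for (iter in 1:20) {
  # Update f1(year) holding f2(age) fixed: smooth the partial residual on year
  ss1 <- smooth.spline(wage$year, y - beta0 - f2, df = 4)
  f1 <- predict(ss1, wage$year)$y
  f1 <- f1 - mean(f1) # center so that beta0 stays identifiable
  # Update f2(age) holding f1(year) fixed
  ss2 <- smooth.spline(wage$age, y - beta0 - f1, df = 5)
  f2 <- predict(ss2, wage$age)$y
  f2 <- f2 - mean(f2)
}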
We do not have to use splines as the building blocks for GAMs: we can just as well use local regression, polynomial regression, or any combination of the approaches seen earlier in this chapter in order to create a GAM.
Pros and Cons of GAMs
Pros:
- GAMs allow a non-linear \(f_j\) to be fit to each \(X_j\), which is a way to flexibly and (nearly) automatically capture non-linearity.
- Non-linear fits can improve prediction accuracy.
- Because the models are additive, we can examine the effect of each \(X_j\) on \(Y\) separately, while holding others fixed.
- Smoothness of each \(f_j\) can be summarized via degrees of freedom.
Cons:
- The main limitation is the additive restriction: with many variables, important interactions can be missed. However, we can manually add interaction terms like \(X_j \times X_k\) or low-dimensional interaction functions \(f_{jk}(X_j, X_k)\) (see the sketch just below).
GAMs provide a useful compromise between linear and fully non-parametric models.
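As noted in the cons above, interactions can be added manually. A sketch using the gam package's bivariate local regression term, as in the ISLR lab (the span here is arbitrary, not tuned):
# Adding a year-by-age interaction surface via a two-dimensional lo() term
wage_gam_interact <- gam(wage ~ lo(year, age, span = 0.5) + education, data = wage)
summary(wage_gam_interact)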
7.7.2 GAMs for Classification Problems
GAMs are easily extended to other response distributions via the family argument, such as binomial for logistic regression classification.
wage_gam_fit_binom <- gam(
  I(wage > 250) ~ year + s(age, 5) + education,
  family = binomial, data = wage
)
d <- plot(wage_gam_fit_binom)
d <- map(
  d$preplot,
  ~ tibble(x = .$x, y = .$y, y_se = .$se.y, xlab = .$xlab, ylab = .$ylab) %>%
    mutate(y_lower = y - y_se, y_upper = y + y_se)
)
p1 <- d$year %>%
  ggplot(aes(x)) +
geom_line(aes(y = y), color = "green") +
geom_line(aes(y = y_lower), lty = 2, color = "green") +
geom_line(aes(y = y_upper), lty = 2, color = "green") +
labs(x = NULL, title = d$year$ylab, y = NULL) +
ylim(c(-4, 4))
p2 <- d$`s(age, 5)` %>%
  ggplot(aes(x)) +
geom_line(aes(y = y), color = "green") +
geom_line(aes(y = y_lower), lty = 2, color = "green") +
geom_line(aes(y = y_upper), lty = 2, color = "green") +
labs(x = NULL, title = d$`s(age, 5)`$ylab, y = NULL)
p3 <- d$education %>%
  ggplot(aes(x)) +
geom_errorbar(aes(y = y, ymin = y_lower, ymax = y_upper)) +
labs(x = NULL, title = d$education$ylab, y = NULL) +
theme(axis.text.x = element_text(angle = 45, size = 8, vjust = 0.7))
p1 + p2 + p3
There are 0 occurrences of wage > 250 for education = "1. < HS Grad", which is causing the very wide SE range. Re-fit the model excluding that value of education:
wage_gam_fit_binom <- gam(
  I(wage > 250) ~ year + s(age, 5) + education,
  family = binomial, data = filter(wage, education != "1. < HS Grad")
)
d <- plot(wage_gam_fit_binom)
d <- map(
  d$preplot,
  ~ tibble(x = .$x, y = .$y, y_se = .$se.y, xlab = .$xlab, ylab = .$ylab) %>%
    mutate(y_lower = y - y_se, y_upper = y + y_se)
)
p1 <- d$year %>%
  ggplot(aes(x)) +
geom_line(aes(y = y), color = "green") +
geom_line(aes(y = y_lower), lty = 2, color = "green") +
geom_line(aes(y = y_upper), lty = 2, color = "green") +
labs(x = NULL, title = d$year$ylab, y = NULL) +
ylim(c(-4, 4))
p2 <- d$`s(age, 5)` %>%
  ggplot(aes(x)) +
geom_line(aes(y = y), color = "green") +
geom_line(aes(y = y_lower), lty = 2, color = "green") +
geom_line(aes(y = y_upper), lty = 2, color = "green") +
labs(x = NULL, title = d$`s(age, 5)`$ylab, y = NULL)
p3 <- d$education %>%
  ggplot(aes(x)) +
geom_errorbar(aes(y = y, ymin = y_lower, ymax = y_upper)) +
labs(x = NULL, title = d$education$ylab, y = NULL) +
theme(axis.text.x = element_text(angle = 45, size = 8, vjust = 0.7))
p1 + p2 + p3
7.8 Lab: Non-linear Modeling
7.8.1 Polynomial Regression and Step Functions
The figure was re-produced in section 7.1.
To determine the simplest sufficient polynomial model, we can perform \(F\)-tests with anova().
wage_poly_1_linear_fit <- lm(wage ~ age, data = wage)
wage_poly_2_linear_fit <- lm(wage ~ poly(age, 2), data = wage)
wage_poly_3_linear_fit <- lm(wage ~ poly(age, 3), data = wage)
wage_poly_5_linear_fit <- lm(wage ~ poly(age, 5), data = wage)
anova(wage_poly_1_linear_fit, wage_poly_2_linear_fit, wage_poly_3_linear_fit,
wage_poly_4_linear_fit, wage_poly_5_linear_fit)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The cubic fit is probably the best choice here.
Equivalently, because of the way poly() creates orthogonalized polynomials, we can get the same \(p\)-values and \(F\)-statistics (= square of the \(t\)-statistics) like this:
tidy(wage_poly_5_linear_fit) %>%
transmute(term, t = statistic, `F` = t^2, p = scales::pvalue(p.value)) %>%
gt() %>%
fmt_number(columns = c(t, `F`), n_sigfig = 4)
| term          | t       | F      | p      |
|---------------|---------|--------|--------|
| (Intercept)   | 153.3   | 23,490 | <0.001 |
| poly(age, 5)1 | 11.20   | 125.4  | <0.001 |
| poly(age, 5)2 | −11.98  | 143.6  | <0.001 |
| poly(age, 5)3 | 3.145   | 9.889  | 0.002  |
| poly(age, 5)4 | −1.952  | 3.810  | 0.051  |
| poly(age, 5)5 | −0.8972 | 0.8050 | 0.370  |
However, this only works because the polynomials are orthogonal and there is just the one predictor.
Otherwise, anova() must be used for these model comparisons.
The step function was fit and visualized in section 7.2.
7.8.2 Splines
The cubic spline and natural spline fits are shown in sections 7.4.3 and 7.4.4. The smoothing spline fits are shown in section 7.5.2, and the local regression fits in section 7.6.
7.8.3 GAMs
Figures 7.11 and 7.12 are re-produced in section 7.7.1.
Use anova() with different variations of the year variable:
wage_gam_no_year <-
  gam(wage ~ s(age, 5) + education, data = wage)
wage_gam_linear_year <-
  gam(wage ~ year + s(age, 5) + education, data = wage)
anova(wage_gam_no_year, wage_gam_linear_year, wage_gam_smooth_fit, test = "F")
## Analysis of Deviance Table
##
## Model 1: wage ~ s(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2990 3711731
## 2 2989 3693842 1 17889.2 14.4771 0.0001447 ***
## 3 2986 3689770 3 4071.1 1.0982 0.3485661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We find evidence that a model with a linear function of year is better than one that omits year, but no evidence that a non-linear function of year is needed. This can also be seen in the model summary():
summary(wage_gam_smooth_fit)
##
## Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = wage)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -119.43 -19.70 -3.33 14.17 213.48
##
## (Dispersion Parameter for gaussian family taken to be 1235.69)
##
## Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29887.75
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(year, 4) 1 27162 27162 21.981 2.877e-06 ***
## s(age, 5) 1 195338 195338 158.081 < 2.2e-16 ***
## education 4 1069726 267432 216.423 < 2.2e-16 ***
## Residuals 2986 3689770 1236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(year, 4) 3 1.086 0.3537
## s(age, 5) 4 32.380 <2e-16 ***
## education
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see under Anova for Parametric Effects that all of the terms are statistically significant, even for a linear relationship. Under Anova for Nonparametric Effects, we see that only the age term shows a statistically significant non-linear improvement over a linear fit.
The logistic regression example is in section 7.7.2.
7.9 Exercises
Applied
6. Polynomial and step function regression on wage
- Find the optimal degree \(d\) for polynomial regression.
set.seed(201)
wage_resamples <- vfold_cv(wage, v = 10)
lm_spec <- linear_reg()
wage_poly_spec <- recipe(wage ~ age, data = wage) %>%
  step_poly(age, degree = tune())
wage_poly_workflow <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(wage_poly_spec)

degree_grid <- tibble(degree = 1:10)
wage_poly_tune <- tune_grid(wage_poly_workflow, resamples = wage_resamples,
                            grid = degree_grid)
autoplot(wage_poly_tune) +
scale_x_continuous(breaks = 1:10)
The best degree values:
show_best(wage_poly_tune, metric = "rmse")
## # A tibble: 5 × 7
## degree .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 9 rmse standard 39.8 10 0.868 Preprocessor09_Model1
## 2 7 rmse standard 39.8 10 0.878 Preprocessor07_Model1
## 3 6 rmse standard 39.8 10 0.874 Preprocessor06_Model1
## 4 10 rmse standard 39.8 10 0.870 Preprocessor10_Model1
## 5 5 rmse standard 39.8 10 0.872 Preprocessor05_Model1
But it is clear from the above figure that there is little difference past degree 2 or 3.
We can find the least complex model within one standard error of the best model with this function:
(best_degree <- select_by_one_std_err(wage_poly_tune, metric = "rmse", degree))
## # A tibble: 1 × 9
## degree .metric .estimator mean n std_err .config .best .bound
## <int> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
## 1 2 rmse standard 39.9 10 0.882 Preprocessor02_Mod… 39.8 40.7
And here is how the fit looks overlaid on the data:
age_grid <- tibble(age = 18:80)
finalize_workflow(wage_poly_workflow, best_degree) %>%
fit(data = wage) %>%
augment(new_data = age_grid) %>%
ggplot(aes(x = age)) +
geom_point(data = wage, aes(y = wage),
shape = 21, color = "grey50", alpha = 0.5) +
geom_line(aes(y = .pred), color = td_colors$nice$soft_orange, size = 2)
- Find the optimal number of cuts \(K\) for step function regression.
wage_step_spec <- recipe(wage ~ age, data = wage) %>%
  step_discretize(age, num_breaks = tune(),
                  # Need to adjust this from the default (=10)
                  min_unique = 2)
wage_step_workflow <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(wage_step_spec)

breaks_grid <- tibble(num_breaks = 2:20)
wage_step_tune <- tune_grid(wage_step_workflow, resamples = wage_resamples,
                            grid = breaks_grid)
autoplot(wage_step_tune)
show_best(wage_step_tune, metric = "rmse")
## # A tibble: 5 × 7
## num_breaks .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 20 rmse standard 40.0 10 0.874 Preprocessor19_Model1
## 2 15 rmse standard 40.0 10 0.853 Preprocessor14_Model1
## 3 14 rmse standard 40.0 10 0.871 Preprocessor13_Model1
## 4 19 rmse standard 40.0 10 0.874 Preprocessor18_Model1
## 5 17 rmse standard 40.0 10 0.872 Preprocessor16_Model1
The best model by RMSE has the most num_breaks, but selecting within one std_err gives:
(best_num_breaks <-
   select_by_one_std_err(wage_step_tune, metric = "rmse", num_breaks))
## # A tibble: 1 × 9
## num_breaks .metric .estimator mean n std_err .config .best .bound
## <int> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
## 1 3 rmse standard 40.7 10 0.831 Preprocessor02… 40.0 40.9
For posterity, here is the best model by RMSE, and best model within one standard error:
bind_rows(
finalize_workflow(wage_step_workflow,
select_best(wage_step_tune, metric = "rmse")) %>%
fit(data = wage) %>%
augment(new_data = age_grid) %>%
mutate(model = "best by RMSE"),
finalize_workflow(wage_step_workflow, best_num_breaks) %>%
fit(data = wage) %>%
augment(new_data = age_grid) %>%
mutate(model = "best within 1 SE")
) %>%
  ggplot(aes(x = age)) +
geom_point(data = wage, aes(y = wage),
shape = 21, color = "grey50", alpha = 0.5) +
geom_line(aes(y = .pred, color = model), size = 2) +
theme(legend.position = "top") +
labs(color = NULL)
7. Explore other predictors in wage
In tidymodels, the implementation of smoothing splines and GAMs uses the mgcv package as the default (and currently only) engine, so I'll use that for this example.
detach("package:gam", unload = TRUE)
library(mgcv)
glimpse(wage)
## Rows: 3,000
## Columns: 11
## $ year <int> 2006, 2004, 2003, 2003, 2005, 2008, 2009, 2008, 2006, 2004,…
## $ age <int> 18, 24, 45, 43, 50, 54, 44, 30, 41, 52, 45, 34, 35, 39, 54,…
## $ maritl <fct> 1. Never Married, 1. Never Married, 2. Married, 2. Married,…
## $ race <fct> 1. White, 1. White, 1. White, 3. Asian, 1. White, 1. White,…
## $ education <fct> 1. < HS Grad, 4. College Grad, 3. Some College, 4. College …
## $ region <fct> 2. Middle Atlantic, 2. Middle Atlantic, 2. Middle Atlantic,…
## $ jobclass <fct> 1. Industrial, 2. Information, 1. Industrial, 2. Informatio…
## $ health <fct> 1. <=Good, 2. >=Very Good, 1. <=Good, 2. >=Very Good, 1. <=…
## $ health_ins <fct> 2. No, 2. No, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Ye…
## $ logwage <dbl> 4.318063, 4.255273, 4.875061, 5.041393, 4.318063, 4.845098,…
## $ wage <dbl> 75.04315, 70.47602, 130.98218, 154.68529, 75.04315, 127.115…
Every other feature in this data set is categorical.
Here is the wage distribution by the different categories:
wage %>%
  select(wage, where(is.factor), -education) %>%
rownames_to_column() %>%
pivot_longer(cols = -c(wage, rowname),
names_to = "var", values_to = "val") %>%
ggplot(aes(y = val, x = wage)) +
geom_boxplot() +
facet_wrap(~ var, scales = "free_y", ncol = 2)
Some general observations:
- People with better health have slightly higher wages.
- People with health_ins have higher wages.
- People with jobclass = "Information" have slightly higher wages.
- People with maritl = "Married" have higher wages than "Never Married".
- People with race = "Other" have lower wages.
- The region variable has a single value.
Since these are all categorical, there aren’t any new smoothing/spline/polynomial/etc. functions to try out. Instead I will add the variables and determine if they improve the fit:
wage_gam_1 <- gam(wage ~ s(age, k = 5) + s(year, k = 4) + education, data = wage)
summary(wage_gam_1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## wage ~ s(age, k = 5) + s(year, k = 4) + education
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.529 2.152 39.746 < 2e-16 ***
## education2. HS Grad 10.896 2.429 4.487 7.51e-06 ***
## education3. Some College 23.415 2.556 9.159 < 2e-16 ***
## education4. College Grad 38.058 2.541 14.980 < 2e-16 ***
## education5. Advanced Degree 62.568 2.759 22.682 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 3.381 3.787 57.96 < 2e-16 ***
## s(year) 1.073 1.142 11.44 0.000392 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.289 Deviance explained = 29.1%
## GCV = 1241.3 Scale est. = 1237.4 n = 3000
Like we saw in the lab (section 7.8.3), the year variable can be modeled as linear.
wage_gam_2 <- gam(wage ~ s(age, k = 5) + year + education, data = wage)
summary(wage_gam_2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## wage ~ s(age, k = 5) + year + education
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2261.3683 637.2311 -3.549 0.000393 ***
## year 1.1701 0.3177 3.683 0.000235 ***
## education2. HS Grad 10.9002 2.4285 4.488 7.45e-06 ***
## education3. Some College 23.4227 2.5564 9.163 < 2e-16 ***
## education4. College Grad 38.0606 2.5406 14.981 < 2e-16 ***
## education5. Advanced Degree 62.5714 2.7586 22.683 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 3.381 3.787 57.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.289 Deviance explained = 29.1%
## GCV = 1241.3 Scale est. = 1237.4 n = 3000
Now add the new categorical variables:
wage_gam_3 <- gam(
  wage ~ s(age, k = 5) + year + education + health + health_ins + jobclass +
    maritl + race,
  data = wage
)
summary(wage_gam_3)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## wage ~ s(age, k = 5) + year + education + health + health_ins +
## jobclass + maritl + race
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2485.6495 612.4533 -4.059 5.07e-05 ***
## year 1.2794 0.3053 4.190 2.87e-05 ***
## education2. HS Grad 7.6142 2.3528 3.236 0.00122 **
## education3. Some College 18.1558 2.5025 7.255 5.10e-13 ***
## education4. College Grad 30.7018 2.5315 12.128 < 2e-16 ***
## education5. Advanced Degree 53.2910 2.7935 19.077 < 2e-16 ***
## health2. >=Very Good 6.2775 1.4111 4.449 8.96e-06 ***
## health_ins2. No -16.5462 1.4018 -11.803 < 2e-16 ***
## jobclass2. Information 3.5045 1.3148 2.665 0.00773 **
## maritl2. Married 13.8176 1.7956 7.695 1.91e-14 ***
## maritl3. Widowed 0.7310 7.9577 0.092 0.92681
## maritl4. Divorced 0.5206 2.9195 0.178 0.85849
## maritl5. Separated 7.7757 4.8456 1.605 0.10867
## race2. Black -4.7753 2.1317 -2.240 0.02516 *
## race3. Asian -2.7584 2.5858 -1.067 0.28616
## race4. Other -5.7192 5.6258 -1.017 0.30943
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 2.779 3.31 18.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.346 Deviance explained = 35%
## GCV = 1146.5 Scale est. = 1139.3 n = 3000
par(mar = c(2, 2, 2, 2))
plot(wage_gam_3, pages = 1, all.terms = TRUE, shade = TRUE)
It does provide a statistically significant improvement:
anova(wage_gam_2, wage_gam_3, test = "F")
## Analysis of Deviance Table
##
## Model 1: wage ~ s(age, k = 5) + year + education
## Model 2: wage ~ s(age, k = 5) + year + education + health + health_ins +
## jobclass + maritl + race
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2990.2 3700651
## 2 2980.7 3396564 9.5239 304087 28.024 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
8. Non-linear models of auto
auto <- ISLR2::Auto %>%
  # Origin is a factor coded as numeric
  mutate(
    origin = factor(origin, levels = c(1, 2, 3),
                    labels = c("American", "European", "Japanese"))
  )
glimpse(auto)
## Rows: 392
## Columns: 9
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin <fct> American, American, American, American, American, America…
## $ name <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…
The variables (excluding name) vs mpg:
p1 <- auto %>%
  select(mpg, acceleration, displacement, year, horsepower) %>%
rownames_to_column() %>%
pivot_longer(cols = -c(rowname, mpg), names_to = "var") %>%
ggplot(aes(x = value, y = mpg)) +
geom_point() +
geom_smooth(method = "gam") +
facet_wrap(~ var, scales = "free_x")
p2 <- auto %>%
  transmute(mpg, cylinders = factor(cylinders), origin) %>%
rownames_to_column() %>%
pivot_longer(cols = -c(rowname, mpg), names_to = "var") %>%
ggplot(aes(y = factor(value), x = mpg)) +
geom_boxplot() +
facet_wrap(~ var, scales = "free_y")
p1 / p2 +
  plot_layout(heights = c(2, 1))
Fit a GAM with all of the variables:
auto_gam <- gam(
  mpg ~ s(acceleration, k = 5) + s(displacement, k = 5) + s(horsepower, k = 5) +
    s(year, k = 5) + s(cylinders, k = 3) + origin,
  data = auto
)
summary(auto_gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## mpg ~ s(acceleration, k = 5) + s(displacement, k = 5) + s(horsepower,
## k = 5) + s(year, k = 5) + s(cylinders, k = 3) + origin
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.25555 0.22312 104.227 <2e-16 ***
## originEuropean 0.09133 0.52543 0.174 0.8621
## originJapanese 0.86600 0.51939 1.667 0.0963 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(acceleration) 2.364 2.942 10.931 1.64e-06 ***
## s(displacement) 3.305 3.735 11.353 < 2e-16 ***
## s(horsepower) 2.635 3.185 31.694 < 2e-16 ***
## s(year) 3.398 3.796 82.425 < 2e-16 ***
## s(cylinders) 2.000 2.000 3.212 0.0414 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.873 Deviance explained = 87.8%
## GCV = 8.086 Scale est. = 7.7415 n = 392
This suggests that there are non-linear relationships in acceleration, displacement, horsepower, and year.
Visualize the smooth functions:
plot(auto_gam, pages = 1, shade = TRUE)
9. Predicting nitrogen oxide concentration in boston
boston <- ISLR2::Boston
glimpse(boston)
## Rows: 506
## Columns: 13
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
Visualize the nox and dis variables:
p1 <- boston %>%
  select(nox, dis) %>%
pivot_longer(cols = c(nox, dis)) %>%
ggplot(aes(x = value)) +
geom_boxplot(aes(y = 0, fill = name), outlier.shape = NA) +
geom_jitter(aes(y = 1, color = name), alpha = 0.5) +
facet_wrap(~ name, scales = "free_x") +
  dunnr::remove_axis("y") +
  theme(legend.position = "none")
p2 <- boston %>%
  ggplot(aes(x = dis, y = nox)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "gam")
p1 / p2
Looks like higher values of dis (the distance from Boston employment centers) are associated with lower nox (nitrogen oxide concentration).
- Cubic polynomial regression.
lm_spec <- linear_reg()
boston_poly_3_recipe <- recipe(nox ~ dis, data = boston) %>%
  step_poly(dis, degree = 3)
boston_poly_3_fit <- workflow() %>%
  add_recipe(boston_poly_3_recipe) %>%
  add_model(lm_spec) %>%
  fit(data = boston)
summary(boston_poly_3_fit %>% extract_fit_engine())
##
## Call:
## stats::lm(formula = ..y ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## dis_poly_1 -2.003096 0.062071 -32.271 < 2e-16 ***
## dis_poly_2 0.856330 0.062071 13.796 < 2e-16 ***
## dis_poly_3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
Visualize the cubic fit:
dis_grid <- tibble(dis = seq(min(boston$dis), max(boston$dis),
                             length.out = 100))
boston_poly_3_pred <- bind_cols(
  dis_grid,
  # I'll use `predict` instead of `augment` because I don't think I can get
  # SEs/CIs easily with the latter
  predict(boston_poly_3_fit, new_data = dis_grid),
  predict(boston_poly_3_fit, new_data = dis_grid, type = "conf_int")
)
p_base <- ggplot(boston, aes(x = dis)) +
  geom_point(aes(y = nox), color = "grey50", alpha = 0.5) +
  scale_x_continuous(breaks = seq(2, 12, 2))
p_base +
  geom_line(data = boston_poly_3_pred, aes(y = .pred),
            color = td_colors$nice$spanish_blue, size = 2) +
  geom_ribbon(data = boston_poly_3_pred,
              aes(ymin = .pred_lower, ymax = .pred_upper),
              fill = td_colors$nice$spanish_blue, alpha = 0.5)
- Polynomial fits from 1 to 10 degrees.
Since we’re not tune()
ing the degree
parameter on resamples here (just fitting a model for each degree
to the full data set) I can’t use the typical tune_grid()/fit_resamples()
approach.
boston_recipe <- recipe(nox ~ dis, data = boston)

boston_poly_fits <-
  tibble(
    degree = 1:10,
    poly_fit = map(
      degree,
      ~ workflow() %>%
        add_recipe(boston_recipe %>% step_poly(dis, degree = .x)) %>%
        add_model(lm_spec) %>%
        fit(data = boston)
    )
  ) %>%
  mutate(
    poly_pred = map(
      poly_fit,
      ~ bind_cols(
        dis_grid,
        predict(.x, new_data = dis_grid),
        predict(.x, new_data = dis_grid, type = "conf_int")
      )
    ),
    mod_rss = map_dbl(poly_fit, ~ glance(.x)$deviance)
  )
p_base +
  geom_line(
    data = boston_poly_fits %>%
      mutate(
        degree_label = fct_inorder(paste0(degree, " (", round(mod_rss, 3), ")"),
                                   ordered = TRUE)
      ) %>%
      unnest(poly_pred),
    aes(y = .pred, color = degree_label), size = 1
  ) +
  scale_color_viridis_d(end = 0.8) +
  labs(color = "Degree (RSS)") +
  coord_cartesian(ylim = c(0.35, 0.9))
- Choose the optimal degree by cross-validation.
Now I can tune():
boston_poly_recipe <- recipe(nox ~ dis, data = boston) %>%
  step_poly(dis, degree = tune())
boston_poly_workflow <- workflow() %>%
  add_recipe(boston_poly_recipe) %>%
  add_model(lm_spec)
degree_grid <- tibble(degree = 1:10)

set.seed(4928)
boston_resamples <- vfold_cv(boston, v = 10)

boston_poly_tune <-
  tune_grid(boston_poly_workflow, resamples = boston_resamples,
            grid = degree_grid)
autoplot(boston_poly_tune) +
scale_x_continuous(breaks = 1:10)
Both the 3-degree and 4-degree fits appear best. I would choose the 3-degree since it is simpler. You could also make a case for the quadratic fit by the one SE rule:
select_by_one_std_err(boston_poly_tune, metric = "rmse", degree)
## # A tibble: 1 × 9
## degree .metric .estimator mean n std_err .config .best .bound
## <int> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
## 1 2 rmse standard 0.0636 10 0.00211 Preprocessor02_M… 0.0618 0.0641
- Regression splines.
With 4 degrees of freedom, a cubic spline has a single knot.
By default, it will be placed at the median of dis:
attr(bs(boston$dis, df = 4), "knots")
## 50%
## 3.20745
boston_spline_4_recipe <- recipe(nox ~ dis, data = boston) %>%
  step_bs(dis, deg_free = 4)
boston_spline_4_fit <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(boston_spline_4_recipe) %>%
  fit(data = boston)
extract_fit_engine(boston_spline_4_fit) %>%
summary()
##
## Call:
## stats::lm(formula = ..y ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.124622 -0.039259 -0.008514 0.020850 0.193891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.73447 0.01460 50.306 < 2e-16 ***
## dis_bs_1 -0.05810 0.02186 -2.658 0.00812 **
## dis_bs_2 -0.46356 0.02366 -19.596 < 2e-16 ***
## dis_bs_3 -0.19979 0.04311 -4.634 4.58e-06 ***
## dis_bs_4 -0.38881 0.04551 -8.544 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06195 on 501 degrees of freedom
## Multiple R-squared: 0.7164, Adjusted R-squared: 0.7142
## F-statistic: 316.5 on 4 and 501 DF, p-value: < 2.2e-16
boston_spline_4_pred <- bind_cols(
  dis_grid,
  predict(boston_spline_4_fit, new_data = dis_grid),
  predict(boston_spline_4_fit, new_data = dis_grid, type = "conf_int")
)
p_base +
  geom_line(data = boston_spline_4_pred, aes(y = .pred),
            color = td_colors$nice$ruby_red, size = 2) +
  geom_ribbon(data = boston_spline_4_pred,
              aes(ymin = .pred_lower, ymax = .pred_upper),
              fill = td_colors$nice$ruby_red, alpha = 0.5)
- Regression splines for a range of degrees of freedom.
I’ll range deg_free from 3 (no knots) to 10 (7 knots):
boston_spline_fits <-
  tibble(
    deg_free = 3:10,
    spline_fit = map(
      deg_free,
      ~ workflow() %>%
        add_recipe(boston_recipe %>% step_bs(dis, deg_free = .x)) %>%
        add_model(lm_spec) %>%
        fit(data = boston)
    )
  ) %>%
  mutate(
    spline_pred = map(
      spline_fit,
      ~ bind_cols(
        dis_grid,
        predict(.x, new_data = dis_grid),
        predict(.x, new_data = dis_grid, type = "conf_int")
      )
    ),
    mod_rss = map_dbl(spline_fit, ~ glance(.x)$deviance)
  )
p_base +
  geom_line(
    data = boston_spline_fits %>%
      mutate(
        df_label = fct_inorder(paste0(deg_free, " (", round(mod_rss, 3), ")"),
                               ordered = TRUE)
      ) %>%
      unnest(spline_pred),
    aes(y = .pred, color = df_label), size = 1
  ) +
  scale_color_viridis_d(end = 0.8) +
  labs(color = "df (RSS)") +
  coord_cartesian(ylim = c(0.35, 0.9))
- Choose the optimal degrees of freedom by cross-validation
boston_spline_recipe <- recipe(nox ~ dis, data = boston) %>%
  step_bs(dis, deg_free = tune())

boston_spline_workflow <- workflow() %>%
  add_recipe(boston_spline_recipe) %>%
  add_model(lm_spec)
deg_free_grid <- tibble(deg_free = 3:15)
boston_spline_tune <-
  tune_grid(boston_spline_workflow, resamples = boston_resamples,
            grid = deg_free_grid)
autoplot(boston_spline_tune) +
scale_x_continuous(breaks = 3:15)
The RMSE estimates are a bit noisier than I would like. The best model within one SE:
select_by_one_std_err(boston_spline_tune, metric = "rmse", deg_free)
## # A tibble: 1 × 9
## deg_free .metric .estimator mean n std_err .config .best .bound
## <int> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
## 1 3 rmse standard 0.0618 10 0.00235 Preprocessor01… 0.0603 0.0626
From the plot, it might be surprising that the best model has 0 knots (deg_free = 3), but relative to the standard errors, there isn’t a huge improvement for larger values.
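The resampling estimates make that easier to see; the mean RMSEs should differ by amounts comparable to their standard errors:
# Mean RMSE and standard error for each candidate deg_free
collect_metrics(boston_spline_tune) %>%
  filter(.metric == "rmse") %>%
  select(deg_free, mean, std_err)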
An alternative way of choosing the model is by percent loss via tune::select_by_pct_loss(). The best model here has an RMSE of 0.0603, while the deg_free = 3 model has an RMSE of 0.0618. This corresponds to a percent loss of (0.0618 - 0.0603) / 0.0603 * 100 = 2.5%. We can choose an upper limit of acceptable percent loss in tune::select_by_pct_loss(); the default is limit = 2, which eliminates deg_free = 3 as an option:
select_by_pct_loss(boston_spline_tune, metric = "rmse", deg_free)
## # A tibble: 1 × 9
## deg_free .metric .estimator mean n std_err .config .best .loss
## <int> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
## 1 9 rmse standard 0.0607 10 0.00245 Preprocessor07_… 0.0603 0.712
By percent loss, deg_free = 9 is chosen.
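That percent loss calculation can also be checked manually from the resampling estimates; something like this should reproduce the 2.5% figure for deg_free = 3:
# Percent loss in mean RMSE relative to the lowest-RMSE candidate
collect_metrics(boston_spline_tune) %>%
  filter(.metric == "rmse") %>%
  mutate(pct_loss = (mean - min(mean)) / min(mean) * 100) %>%
  select(deg_free, mean, pct_loss)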
10. Predicting tuition with college
college <- ISLR2::College %>% janitor::clean_names()
glimpse(college)
## Rows: 777
## Columns: 18
## $ private <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes…
## $ apps <dbl> 1660, 2186, 1428, 417, 193, 587, 353, 1899, 1038, 582, 173…
## $ accept <dbl> 1232, 1924, 1097, 349, 146, 479, 340, 1720, 839, 498, 1425…
## $ enroll <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 484,…
## $ top10perc <dbl> 23, 16, 22, 60, 16, 38, 17, 37, 30, 21, 37, 44, 38, 44, 23…
## $ top25perc <dbl> 52, 29, 50, 89, 44, 62, 45, 68, 63, 44, 75, 77, 64, 73, 46…
## $ f_undergrad <dbl> 2885, 2683, 1036, 510, 249, 678, 416, 1594, 973, 799, 1830…
## $ p_undergrad <dbl> 537, 1227, 99, 63, 869, 41, 230, 32, 306, 78, 110, 44, 638…
## $ outstate <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1559…
## $ room_board <dbl> 3300, 6450, 3750, 5450, 4120, 3335, 5720, 4826, 4400, 3380…
## $ books <dbl> 450, 750, 400, 450, 800, 500, 500, 450, 300, 660, 500, 400…
## $ personal <dbl> 2200, 1500, 1165, 875, 1500, 675, 1500, 850, 500, 1800, 60…
## $ ph_d <dbl> 70, 29, 53, 92, 76, 67, 90, 89, 79, 40, 82, 73, 60, 79, 36…
## $ terminal <dbl> 78, 30, 66, 97, 72, 73, 93, 100, 84, 41, 88, 91, 84, 87, 6…
## $ s_f_ratio <dbl> 18.1, 12.2, 12.9, 7.7, 11.9, 9.4, 11.5, 13.7, 11.3, 11.5, …
## $ perc_alumni <dbl> 12, 16, 30, 37, 2, 11, 26, 37, 23, 15, 31, 41, 21, 32, 26,…
## $ expend <dbl> 7041, 10527, 8735, 19016, 10922, 9727, 8861, 11487, 11644,…
## $ grad_rate <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68, 55…
Some quick EDA:
p1 <- college %>%
  ggplot(aes(x = outstate)) +
  geom_density(fill = td_colors$nice$mellow_yellow, alpha = 0.5) +
  geom_rug() +
  dunnr::remove_axis("y")
p2 <- college %>%
  ggplot(aes(y = private, x = outstate)) +
  geom_boxplot(fill = td_colors$nice$mellow_yellow, alpha = 0.5)
p3 <- college %>%
  select(where(is.double)) %>%
  pivot_longer(cols = -outstate) %>%
  ggplot(aes(x = value, y = outstate)) +
  geom_point(color = "grey50", alpha = 0.3) +
  geom_smooth(method = "gam", color = td_colors$nice$mellow_yellow) +
  facet_wrap(~ name, scales = "free_x", ncol = 4)
(p1 + p2) / p3 +
  plot_layout(heights = c(1, 5))
- Split the data and select a subset of predictors.
set.seed(10291)
college_splits <- initial_split(college, prop = 0.7, strata = outstate)
college_train <- training(college_splits)
college_test <- testing(college_splits)
college_resamples <- vfold_cv(college_train, v = 10)
college_regsubsets <-
  leaps::regsubsets(outstate ~ ., data = college_train,
                    method = "forward", nvmax = 17) %>%
  tidy() %>%
  select(-`(Intercept)`) %>%
  rownames_to_column(var = "n_vars") %>%
  pivot_longer(cols = privateYes:grad_rate, names_to = "var") %>%
  filter(value) %>%
  select(-value) %>%
  group_by(across(-var)) %>%
  summarise(var = paste0(sort(var), collapse = ", "), .groups = "drop") %>%
  mutate(n_vars = as.integer(n_vars)) %>%
  arrange(n_vars)
college_regsubsets %>%
  gt() %>%
  fmt_number(columns = c(r.squared, adj.r.squared, BIC, mallows_cp),
             n_sigfig = 4) %>%
  data_color(columns = r.squared, colors = c("red", "green"), alpha = 0.5) %>%
  data_color(columns = adj.r.squared, colors = c("red", "green"), alpha = 0.5) %>%
  data_color(columns = BIC, colors = c("green", "red"), alpha = 0.5) %>%
  data_color(columns = mallows_cp, colors = c("green", "red"), alpha = 0.5)
n_vars | r.squared | adj.r.squared | BIC | mallows_cp | var |
---|---|---|---|---|---|
1 | 0.4762 | 0.4752 | −337.9 | 664.3 | expend |
2 | 0.6187 | 0.6173 | −503.7 | 339.3 | expend, privateYes |
3 | 0.7010 | 0.6994 | −629.2 | 152.3 | expend, privateYes, room_board |
4 | 0.7319 | 0.7299 | −682.1 | 83.33 | expend, perc_alumni, privateYes, room_board |
5 | 0.7475 | 0.7451 | −708.2 | 49.65 | expend, perc_alumni, ph_d, privateYes, room_board |
6 | 0.7568 | 0.7541 | −722.2 | 30.27 | expend, grad_rate, perc_alumni, ph_d, privateYes, room_board |
7 | 0.7606 | 0.7574 | −724.4 | 23.63 | expend, grad_rate, perc_alumni, personal, ph_d, privateYes, room_board |
8 | 0.7623 | 0.7587 | −722.1 | 21.58 | expend, grad_rate, perc_alumni, personal, ph_d, privateYes, room_board, top10perc |
9 | 0.7632 | 0.7592 | −717.8 | 21.60 | accept, expend, grad_rate, perc_alumni, personal, ph_d, privateYes, room_board, top10perc |
10 | 0.7681 | 0.7637 | −722.8 | 12.33 | accept, apps, expend, grad_rate, perc_alumni, personal, ph_d, privateYes, room_board, top10perc |
11 | 0.7699 | 0.7651 | −720.8 | 10.16 | accept, apps, enroll, expend, grad_rate, perc_alumni, personal, ph_d, privateYes, room_board, top10perc |
12 | 0.7705 | 0.7653 | −716.0 | 10.73 | accept, apps, enroll, expend, grad_rate, perc_alumni, personal, ph_d, privateYes, room_board, terminal, top10perc |
13 | 0.7710 | 0.7653 | −710.7 | 11.75 | accept, apps, books, enroll, expend, grad_rate, perc_alumni, personal, ph_d, privateYes, room_board, terminal, top10perc |
14 | 0.7713 | 0.7653 | −705.3 | 12.86 | accept, apps, books, enroll, expend, grad_rate, p_undergrad, perc_alumni, personal, ph_d, privateYes, room_board, terminal, top10perc |
15 | 0.7716 | 0.7651 | −699.7 | 14.18 | accept, apps, books, enroll, expend, grad_rate, p_undergrad, perc_alumni, personal, ph_d, privateYes, room_board, s_f_ratio, terminal, top10perc |
16 | 0.7717 | 0.7647 | −693.5 | 16.07 | accept, apps, books, enroll, expend, f_undergrad, grad_rate, p_undergrad, perc_alumni, personal, ph_d, privateYes, room_board, s_f_ratio, terminal, top10perc |
17 | 0.7717 | 0.7643 | −687.3 | 18.00 | accept, apps, books, enroll, expend, f_undergrad, grad_rate, p_undergrad, perc_alumni, personal, ph_d, privateYes, room_board, s_f_ratio, terminal, top10perc, top25perc |
By forward stepwise selection, the best model has 12, 7, or 11 variables by highest adjusted \(R^2\), lowest BIC, and lowest \(C_p\), respectively. I’ll go with the 7-variable model to keep things simpler:
college_regsubsets %>%
  filter(BIC == min(BIC)) %>%
  select(n_vars, BIC, var)
## # A tibble: 1 × 3
## n_vars BIC var
## <int> <dbl> <chr>
## 1 7 -724. expend, grad_rate, perc_alumni, personal, ph_d, privateYes, room…
- Fit a GAM and evaluate on the test set.
Here is the GAM spec in tidymodels with mgcv as the engine and tune()-able degrees of freedom:
gam_spec <- gen_additive_mod(adjust_deg_free = tune()) %>%
  # The mode needs to be specified (can be regression or classification)
  set_mode("regression") %>%
  # The engine doesn't, as mgcv is the default
  set_engine("mgcv")
translate(gam_spec)
## GAM Model Specification (regression)
##
## Main Arguments:
## adjust_deg_free = tune()
##
## Computational engine: mgcv
##
## Model fit template:
## mgcv::gam(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
## gamma = tune())
Note that the adjust_deg_free parameter corresponds to gamma in mgcv, which controls model smoothness.
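To make that mapping concrete, here is roughly what the underlying engine call looks like with a fixed gamma (a simplified sketch with a couple of smooth terms, not the tuned workflow built below):
# Sketch of the direct mgcv call; gamma > 1 penalizes wiggliness more heavily,
# producing smoother fits (the terms here are chosen just for illustration)
mgcv::gam(
  outstate ~ s(expend) + s(room_board) + private,
  data = college_train, gamma = 1.5
) %>%
  summary()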
The workflow for GAMs doesn’t use the regular recipe approach (see this example). Instead, smoothing terms are specified with the formula argument:
college_gam_workflow <- workflow() %>%
  add_model(
    gam_spec,
    formula = outstate ~ s(expend) + s(grad_rate) + s(perc_alumni) +
      s(personal) + s(ph_d) + s(room_board) + private
  ) %>%
  add_variables(predictors = everything(), outcomes = outstate)
An explanation from the ?add_model() documentation:
Typically, the terms are extracted from the formula or recipe preprocessing methods. However, some models (like survival and bayesian models) use the formula not to preprocess, but to specify the structure of the model. In those cases, a formula specifying the model structure must be passed unchanged into the model call itself.
Now tune the model for a range of adjust_deg_free:
adjust_deg_free_grid <-
  grid_regular(adjust_deg_free(range = c(0.25, 5)), levels = 30)
college_gam_tune <-
  tune_grid(college_gam_workflow, resamples = college_resamples,
            grid = adjust_deg_free_grid)
autoplot(college_gam_tune)
Take the best model, fit to the full training set, and print the model output:
college_gam_final_workflow <-
  finalize_workflow(college_gam_workflow,
                    select_best(college_gam_tune, metric = "rmse"))

college_gam_fit_training <- college_gam_final_workflow %>%
  fit(data = college_train)

extract_fit_engine(college_gam_fit_training) %>%
  summary()
##
## Family: gaussian
## Link function: identity
##
## Formula:
## outstate ~ s(expend) + s(grad_rate) + s(perc_alumni) + s(personal) +
## s(ph_d) + s(room_board) + private
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8736.4 180.3 48.45 <2e-16 ***
## privateYes 2326.8 227.1 10.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(expend) 5.731 6.914 36.158 < 2e-16 ***
## s(grad_rate) 4.132 5.172 7.176 1.52e-06 ***
## s(perc_alumni) 1.335 1.602 6.965 0.00189 **
## s(personal) 1.384 1.681 3.794 0.01889 *
## s(ph_d) 2.366 3.031 3.414 0.01668 *
## s(room_board) 1.735 2.204 26.541 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.81 Deviance explained = 81.6%
## GCV = 3.2722e+06 Scale est. = 3.0705e+06 n = 542
- Evaluate the model obtained on the test set.
college_test_pred <- college_gam_fit_training %>%
  augment(new_data = college_test)

college_test_pred %>%
  rmse(truth = outstate, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2087.
As a comparison, here is the test RMSE for a purely linear model:
linear_reg() %>%
  fit(
    outstate ~ expend + grad_rate + perc_alumni +
      personal + ph_d + room_board + private,
    data = college_train
  ) %>%
  augment(new_data = college_test) %>%
  rmse(truth = outstate, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2148.
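Beyond RMSE, a yardstick metric set can give a slightly fuller picture of the GAM’s test-set performance; rsq and mae here are just extra context, not part of the exercise:
# Several test-set metrics for the GAM fit
multi_metric <- metric_set(rmse, rsq, mae)
college_test_pred %>%
  multi_metric(truth = outstate, estimate = .pred)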
Plot the actual vs predicted from the GAM fit:
college_test_pred %>%
  ggplot(aes(x = outstate, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(lty = 2)
- For which variables, if any, is there evidence of a non-linear relationship?
Visualize the smooth functions:
extract_fit_engine(college_gam_fit_training) %>%
plot(pages = 1, shade = TRUE)
The expend, grad_rate, and ph_d variables exhibit non-linear relationships.
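A rough numerical check of the same conclusion: smooth terms with estimated degrees of freedom (edf) well above 1 are the ones behaving non-linearly, and the edf values can be pulled straight from the mgcv summary:
# edf near 1 is approximately linear; larger edf indicates a more
# non-linear smooth
summary(extract_fit_engine(college_gam_fit_training))$s.table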
11. Backfitting with simulated data
- Generate response \(Y\) and predictors \(X_1\) and \(X_2\) for \(n = 100\).
Generate some \(\beta\)s:
set.seed(2939)
(betas <- runif(3, min = -10, max = 10))
## [1] -0.7859099 9.1994185 -9.6552425
Then the data-generating model will be:
\[ \begin{align} Y &= -0.786 + 9.199 X_1 - 9.655 X_2 + \epsilon \\ X_1 &\sim N(0, 2) \\ X_2 &\sim N(0, 2) \\ \epsilon &\sim N(0, 5) \end{align} \]
d <- tibble(
  x1 = rnorm(100, 0, 2),
  x2 = rnorm(100, 0, 2),
  epsilon = rnorm(100, 0, 5)
) %>%
  mutate(y = betas[1] + betas[2] * x1 + betas[3] * x2 + epsilon)
glimpse(d)
## Rows: 100
## Columns: 4
## $ x1 <dbl> -0.57226622, -2.51516733, 2.86762452, -0.84437709, 2.69834786,…
## $ x2 <dbl> -6.3077791, 0.9674951, 0.1522335, -0.9901994, 0.6297863, -2.23…
## $ epsilon <dbl> -3.794032194, -4.425177727, 0.422247058, -1.828715019, -5.5262…
## $ y <dbl> 51.0586784, -37.6905641, 24.5469637, -0.8217875, 12.4303000, 6…
- Initialize \(\hat{\beta}_1\).
beta_hat_1 <- -3.14
- Keeping \(\hat{\beta}_1\) fixed, fit the model
\[ Y - \hat{\beta}_1 X_1 = \beta_0 + \beta_2 X_2 + \epsilon \]
d <- d %>% mutate(y_diff = y - beta_hat_1 * x1)
(beta_hat_2 <- lm(y_diff ~ x2, data = d)$coefficients[2])
## x2
## -11.15446
- Keeping \(\hat{\beta}_2\) fixed, fit the model
\[ Y - \hat{\beta}_2 X_2 = \beta_0 + \beta_1 X_1 + \epsilon \]
d <- d %>% mutate(y_diff = y - beta_hat_2 * x2)
(beta_hat_1 <- lm(y_diff ~ x1, data = d)$coefficients[2])
## x1
## 9.062398
- Repeat 1000 times and plot the coefficient estimates.
beta_hat_0 <- numeric(0)
beta_hat_1 <- c(-3.14)
beta_hat_2 <- numeric(0)
for (i in 1:1000) {
  d <- d %>% mutate(y_diff = y - beta_hat_1[i] * x1)
  m <- lm(y_diff ~ x2, data = d)
  beta_hat_2 <- c(beta_hat_2, m$coefficients[2])

  d <- d %>% mutate(y_diff = y - beta_hat_2[i] * x2)
  m <- lm(y_diff ~ x1, data = d)
  beta_hat_0 <- c(beta_hat_0, m$coefficients[1])
  beta_hat_1 <- c(beta_hat_1, m$coefficients[2])
}
d_estimates <- tibble(
  beta_0 = c(NA_real_, beta_hat_0),
  beta_1 = beta_hat_1,
  beta_2 = c(NA_real_, beta_hat_2)
) %>%
  mutate(i = 0:1000) %>%
  pivot_longer(cols = -i, names_to = "coef", values_to = "estimate") %>%
  filter(!is.na(estimate))
p <- ggplot(d_estimates, aes(x = i, y = estimate, color = coef)) +
  geom_line(size = 1.5)
p
The convergence happens immediately.
- Compare to a multiple linear regression.
d_estimates_mult <- lm(y ~ x1 + x2, data = d) %>%
  tidy() %>%
  mutate(coef = factor(term, levels = c("(Intercept)", "x1", "x2"),
                       labels = c("beta_0", "beta_1", "beta_2")))
p <- p +
  geom_hline(data = d_estimates_mult,
             aes(yintercept = estimate, color = coef), lty = 2, size = 1)
p
Basically the exact same estimates.
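A quick side-by-side of the final backfitting estimates and the lm() coefficients should show the same values to several decimal places:
# Final backfitting estimates vs. the multiple regression coefficients
tibble(
  coef = c("beta_0", "beta_1", "beta_2"),
  backfit = c(tail(beta_hat_0, 1), tail(beta_hat_1, 1), tail(beta_hat_2, 1)),
  multiple_lm = coef(lm(y ~ x1 + x2, data = d))
)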
- How many backfitting iterations were required?
Just a single iteration was needed.
d_estimates %>%
  filter(i < 5) %>%
  pivot_wider(names_from = coef, values_from = estimate)
## # A tibble: 5 × 4
## i beta_1 beta_0 beta_2
## <int> <dbl> <dbl> <dbl>
## 1 0 -3.14 NA NA
## 2 1 9.06 -1.66 -11.2
## 3 2 9.17 -1.57 -9.86
## 4 3 9.17 -1.57 -9.84
## 5 4 9.17 -1.57 -9.84
p +
  coord_cartesian(xlim = c(0, 5)) +
  facet_wrap(~ coef, scales = "free_y", ncol = 1)
12. Backfitting with simulated data (continued)
I’ll use matrix() to initialize such a large data set:
set.seed(1009)
betas <- runif(100, min = -5, max = 5)
# Intercept
beta_0 <- runif(1, min = -5, max = 5)

# Increase number of observations to account for the increase in predictors
n <- 1000

# A matrix of 100 predictors by n = 1000 observations
x <- matrix(rnorm(n * 100, mean = 0, sd = 1), ncol = 100)

# Matrix multiplication to get the response
y <- beta_0 + x %*% betas +
  # Using a smaller random noise term
  rnorm(n, mean = 0, sd = 1)
I’ll use a vector to keep track of the \(\hat{\beta}_j\)s, and a data frame to track estimates over time:
# Initial coefficient guesses
beta_hat <- runif(100, min = -10, max = 10)
# A tibble for keeping track of beta hats
beta_hat_estimates <- tibble(
  iter = 0,
  coef = paste0("beta_", 0:100),
  # Including intercept estimate here (beta_0)
  estimate = c(NA_real_, beta_hat)
)
I’ll try 100 iterations of backfitting:
for (i in 1:100) {
  for (j in 1:100) {
    y_resid <- y - x[, -j] %*% beta_hat[-j]
    m <- lm(y_resid ~ x[, j])
    beta_hat[j] <- m$coefficients[2]
  }
  beta_hat_estimates <- beta_hat_estimates %>%
    bind_rows(
      tibble(
        iter = i,
        coef = paste0("beta_", 0:100),
        estimate = c(m$coefficients[1], beta_hat)
      )
    )
}
And here is how each coefficient estimate evolves:
beta_hat_estimates %>%
  filter(!is.na(estimate)) %>%
  ggplot(aes(x = iter, y = estimate, group = coef)) +
  geom_line(alpha = 0.5)
Fit the full multiple regression model:
m_mult <- lm(y ~ x)
tidy(m_mult)
## # A tibble: 101 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.71 0.0332 -142. 0
## 2 x1 3.56 0.0340 105. 0
## 3 x2 3.88 0.0338 115. 0
## 4 x3 -4.21 0.0335 -126. 0
## 5 x4 -3.86 0.0335 -115. 0
## 6 x5 4.96 0.0339 147. 0
## 7 x6 2.63 0.0343 76.8 0
## 8 x7 -2.64 0.0330 -79.8 0
## 9 x8 0.570 0.0323 17.6 6.47e-60
## 10 x9 2.65 0.0322 82.3 0
## # … with 91 more rows
## # ℹ Use `print(n = ...)` to see more rows
Plot the absolute difference between the multiple regression coefficients and the backfitting coefficient estimates:
tidy(m_mult) %>%
  transmute(coef = paste0("beta_", 0:100), estimate_mult = estimate) %>%
  left_join(
    beta_hat_estimates %>%
      filter(!is.na(estimate)),
    by = "coef"
  ) %>%
  mutate(abs_diff = abs(estimate_mult - estimate)) %>%
  ggplot(aes(x = iter, y = abs_diff, group = coef)) +
  geom_line() +
  coord_cartesian(xlim = c(0, 10)) +
  scale_x_continuous(breaks = 1:10)
Looks like it only takes a few iterations to obtain a good approximation.
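To put a number on "a few", the largest absolute deviation from the lm() estimates at each iteration should drop to near zero within the first handful of iterations:
# Maximum absolute deviation from the multiple regression coefficients,
# by backfitting iteration (first few iterations only)
tidy(m_mult) %>%
  transmute(coef = paste0("beta_", 0:100), estimate_mult = estimate) %>%
  left_join(filter(beta_hat_estimates, !is.na(estimate)), by = "coef") %>%
  group_by(iter) %>%
  summarise(max_abs_diff = max(abs(estimate_mult - estimate))) %>%
  head(5)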
Reproducibility
Sys.time()
## [1] "2022-09-19 10:34:56 AST"
if ("git2r" %in% installed.packages()) {
if (git2r::in_repository()) {
::repository()
git2r
} }
## Local: main C:/Users/tdunn/Documents/learning/islr-tidy
## Remote: main @ origin (https://github.com/taylordunn/islr-tidy)
## Head: [b60974c] 2022-09-19: Working on the empirical examples in Chapter 4 (issue #1)
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.1 (2022-06-23 ucrt)
## os Windows 10 x64 (build 19044)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_Canada.utf8
## ctype English_Canada.utf8
## tz America/Curacao
## date 2022-09-19
## pandoc 2.18 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.0)
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.1)
## backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
## base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.2.0)
## bayestestR 0.12.1 2022-05-02 [1] CRAN (R 4.2.1)
## BiocParallel 1.30.3 2022-06-07 [1] Bioconductor
## bit 4.0.4 2020-08-04 [1] CRAN (R 4.2.1)
## bit64 4.0.5 2020-08-30 [1] CRAN (R 4.2.1)
## bookdown 0.27 2022-06-14 [1] CRAN (R 4.2.1)
## boot 1.3-28 2021-05-03 [2] CRAN (R 4.2.1)
## broom * 1.0.0 2022-07-01 [1] CRAN (R 4.2.1)
## bslib 0.4.0 2022-07-16 [1] CRAN (R 4.2.1)
## cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.1)
## car 3.1-0 2022-06-15 [1] CRAN (R 4.2.1)
## carData 3.0-5 2022-01-06 [1] CRAN (R 4.2.1)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.1)
## checkmate 2.1.0 2022-04-21 [1] CRAN (R 4.2.1)
## class 7.3-20 2022-01-16 [2] CRAN (R 4.2.1)
## cli 3.3.0 2022-04-25 [1] CRAN (R 4.2.1)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.1)
## colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.1)
## combinat 0.0-8 2012-10-29 [1] CRAN (R 4.2.0)
## conflicted 1.1.0 2021-11-26 [1] CRAN (R 4.2.1)
## corpcor 1.6.10 2021-09-16 [1] CRAN (R 4.2.0)
## corrr * 0.4.4 2022-08-16 [1] CRAN (R 4.2.1)
## crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.1)
## datawizard 0.4.1 2022-05-16 [1] CRAN (R 4.2.1)
## DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.1)
## dbplyr 2.2.1 2022-06-27 [1] CRAN (R 4.2.1)
## dials * 1.0.0 2022-06-14 [1] CRAN (R 4.2.1)
## DiceDesign 1.9 2021-02-13 [1] CRAN (R 4.2.1)
## digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.1)
## discrim * 1.0.0 2022-06-23 [1] CRAN (R 4.2.1)
## distill 1.4 2022-05-12 [1] CRAN (R 4.2.1)
## distributional 0.3.0 2022-01-05 [1] CRAN (R 4.2.1)
## doParallel * 1.0.17 2022-02-07 [1] CRAN (R 4.2.1)
## downlit 0.4.2 2022-07-05 [1] CRAN (R 4.2.1)
## dplyr * 1.0.9 2022-04-28 [1] CRAN (R 4.2.1)
## dunnr * 0.2.6 2022-08-07 [1] Github (taylordunn/dunnr@e2a8213)
## ellipse 0.4.3 2022-05-31 [1] CRAN (R 4.2.1)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.1)
## equatiomatic 0.3.1 2022-01-30 [1] CRAN (R 4.2.1)
## evaluate 0.15 2022-02-18 [1] CRAN (R 4.2.1)
## extrafont 0.18 2022-04-12 [1] CRAN (R 4.2.0)
## extrafontdb 1.0 2012-06-11 [1] CRAN (R 4.2.0)
## fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.1)
## farver 2.1.1 2022-07-06 [1] CRAN (R 4.2.1)
## fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.1)
## forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.2.1)
## foreach * 1.5.2 2022-02-02 [1] CRAN (R 4.2.1)
## fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.1)
## furrr 0.3.0 2022-05-04 [1] CRAN (R 4.2.1)
## future 1.27.0 2022-07-22 [1] CRAN (R 4.2.1)
## future.apply 1.9.0 2022-04-25 [1] CRAN (R 4.2.1)
## gargle 1.2.0 2021-07-02 [1] CRAN (R 4.2.1)
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.1)
## GGally 2.1.2 2021-06-21 [1] CRAN (R 4.2.1)
## ggdist * 3.2.0 2022-07-19 [1] CRAN (R 4.2.1)
## ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.2.1)
## ggrepel 0.9.1 2021-01-15 [1] CRAN (R 4.2.1)
## git2r 0.30.1 2022-03-16 [1] CRAN (R 4.2.1)
## glmnet * 4.1-4 2022-04-15 [1] CRAN (R 4.2.1)
## globals 0.15.1 2022-06-24 [1] CRAN (R 4.2.1)
## glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.1)
## googledrive 2.0.0 2021-07-08 [1] CRAN (R 4.2.1)
## googlesheets4 1.0.0 2021-07-21 [1] CRAN (R 4.2.1)
## gower 1.0.0 2022-02-03 [1] CRAN (R 4.2.0)
## GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.2.1)
## gridExtra 2.3 2017-09-09 [1] CRAN (R 4.2.1)
## gt * 0.6.0 2022-05-24 [1] CRAN (R 4.2.1)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 4.2.1)
## hardhat 1.2.0 2022-06-30 [1] CRAN (R 4.2.1)
## haven 2.5.0 2022-04-15 [1] CRAN (R 4.2.1)
## here * 1.0.1 2020-12-13 [1] CRAN (R 4.2.1)
## highr 0.9 2021-04-16 [1] CRAN (R 4.2.1)
## hms 1.1.1 2021-09-26 [1] CRAN (R 4.2.1)
## htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.2.1)
## httpuv 1.6.5 2022-01-05 [1] CRAN (R 4.2.1)
## httr 1.4.3 2022-05-04 [1] CRAN (R 4.2.1)
## igraph 1.3.4 2022-07-19 [1] CRAN (R 4.2.1)
## infer * 1.0.2 2022-06-10 [1] CRAN (R 4.2.1)
## insight 0.18.0 2022-07-05 [1] CRAN (R 4.2.1)
## ipred 0.9-13 2022-06-02 [1] CRAN (R 4.2.1)
## ISLR2 * 1.3-1 2022-01-10 [1] CRAN (R 4.2.1)
## iterators * 1.0.14 2022-02-05 [1] CRAN (R 4.2.1)
## janitor 2.1.0 2021-01-05 [1] CRAN (R 4.2.1)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.1)
## jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.2.1)
## kknn 1.3.1 2016-03-26 [1] CRAN (R 4.2.1)
## klaR 1.7-1 2022-06-27 [1] CRAN (R 4.2.1)
## knitr 1.39 2022-04-26 [1] CRAN (R 4.2.1)
## labeling 0.4.2 2020-10-20 [1] CRAN (R 4.2.0)
## labelled 2.9.1 2022-05-05 [1] CRAN (R 4.2.1)
## later 1.3.0 2021-08-18 [1] CRAN (R 4.2.1)
## lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.1)
## lava 1.6.10 2021-09-02 [1] CRAN (R 4.2.1)
## leaps * 3.1 2020-01-16 [1] CRAN (R 4.2.1)
## lhs 1.1.5 2022-03-22 [1] CRAN (R 4.2.1)
## lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.1)
## listenv 0.8.0 2019-12-05 [1] CRAN (R 4.2.1)
## lubridate 1.8.0 2021-10-07 [1] CRAN (R 4.2.1)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.1)
## MASS 7.3-57 2022-04-22 [2] CRAN (R 4.2.1)
## Matrix * 1.4-1 2022-03-23 [2] CRAN (R 4.2.1)
## matrixStats 0.62.0 2022-04-19 [1] CRAN (R 4.2.1)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.1)
## MetBrewer 0.2.0 2022-03-21 [1] CRAN (R 4.2.1)
## mgcv * 1.8-40 2022-03-29 [2] CRAN (R 4.2.1)
## mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.1)
## mixOmics 6.20.0 2022-04-26 [1] Bioconductor (R 4.2.0)
## modeldata * 1.0.0 2022-07-01 [1] CRAN (R 4.2.1)
## modelr 0.1.8 2020-05-19 [1] CRAN (R 4.2.1)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.1)
## mvtnorm * 1.1-3 2021-10-08 [1] CRAN (R 4.2.0)
## nlme * 3.1-157 2022-03-25 [2] CRAN (R 4.2.1)
## nnet 7.3-17 2022-01-16 [2] CRAN (R 4.2.1)
## parallelly 1.32.1 2022-07-21 [1] CRAN (R 4.2.1)
## parsnip * 1.0.0 2022-06-16 [1] CRAN (R 4.2.1)
## patchwork * 1.1.1 2020-12-17 [1] CRAN (R 4.2.1)
## performance 0.9.1 2022-06-20 [1] CRAN (R 4.2.1)
## pillar 1.8.0 2022-07-18 [1] CRAN (R 4.2.1)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.1)
## pls * 2.8-1 2022-07-16 [1] CRAN (R 4.2.1)
## plyr 1.8.7 2022-03-24 [1] CRAN (R 4.2.1)
## poissonreg * 1.0.0 2022-06-15 [1] CRAN (R 4.2.1)
## prodlim 2019.11.13 2019-11-17 [1] CRAN (R 4.2.1)
## promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.1)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.2.1)
## questionr 0.7.7 2022-01-31 [1] CRAN (R 4.2.1)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.1)
## rARPACK 0.11-0 2016-03-10 [1] CRAN (R 4.2.1)
## RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.2.0)
## Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.1)
## readr * 2.1.2 2022-01-30 [1] CRAN (R 4.2.1)
## readxl 1.4.0 2022-03-28 [1] CRAN (R 4.2.1)
## recipes * 1.0.1 2022-07-07 [1] CRAN (R 4.2.1)
## repr 1.1.4 2022-01-04 [1] CRAN (R 4.2.1)
## reprex 2.0.1 2021-08-05 [1] CRAN (R 4.2.1)
## reshape 0.8.9 2022-04-12 [1] CRAN (R 4.2.1)
## reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.2.1)
## rlang 1.0.4 2022-07-12 [1] CRAN (R 4.2.1)
## rmarkdown 2.14 2022-04-25 [1] CRAN (R 4.2.1)
## rpart 4.1.16 2022-01-24 [2] CRAN (R 4.2.1)
## rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.2.1)
## rsample * 1.1.0 2022-08-08 [1] CRAN (R 4.2.1)
## RSpectra 0.16-1 2022-04-24 [1] CRAN (R 4.2.1)
## rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.2.1)
## Rttf2pt1 1.3.8 2020-01-10 [1] CRAN (R 4.2.1)
## rvest 1.0.2 2021-10-16 [1] CRAN (R 4.2.1)
## sass 0.4.2 2022-07-16 [1] CRAN (R 4.2.1)
## scales * 1.2.0 2022-04-13 [1] CRAN (R 4.2.1)
## see 0.7.1 2022-06-20 [1] CRAN (R 4.2.1)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.1)
## shape 1.4.6 2021-05-19 [1] CRAN (R 4.2.0)
## shiny 1.7.2 2022-07-19 [1] CRAN (R 4.2.1)
## skimr 2.1.4 2022-04-15 [1] CRAN (R 4.2.1)
## snakecase 0.11.0 2019-05-25 [1] CRAN (R 4.2.1)
## stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.1)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.2.1)
## survival 3.3-1 2022-03-03 [2] CRAN (R 4.2.1)
## tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.1)
## tictoc * 1.0.1 2021-04-19 [1] CRAN (R 4.2.0)
## tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.1)
## tidyr * 1.2.0 2022-02-01 [1] CRAN (R 4.2.1)
## tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.1)
## tidyverse * 1.3.2 2022-07-18 [1] CRAN (R 4.2.1)
## timeDate 4021.104 2022-07-19 [1] CRAN (R 4.2.1)
## tune * 1.0.0 2022-07-07 [1] CRAN (R 4.2.1)
## tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.1)
## usethis 2.1.6 2022-05-25 [1] CRAN (R 4.2.1)
## utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.1)
## vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.1)
## viridisLite 0.4.0 2021-04-13 [1] CRAN (R 4.2.1)
## vroom 1.5.7 2021-11-30 [1] CRAN (R 4.2.1)
## withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.1)
## workflows * 1.0.0 2022-07-05 [1] CRAN (R 4.2.1)
## workflowsets * 1.0.0 2022-07-12 [1] CRAN (R 4.2.1)
## xfun 0.31 2022-05-10 [1] CRAN (R 4.2.1)
## xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.1)
## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.1)
## yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
## yardstick * 1.0.0 2022-06-06 [1] CRAN (R 4.2.1)
##
## [1] C:/Users/tdunn/AppData/Local/R/win-library/4.2
## [2] C:/Program Files/R/R-4.2.1/library
##
## ──────────────────────────────────────────────────────────────────────────────