Chapter 4 Contrained least squares

We have seen that OLS regression falls short in the high-dimensional context. It leads to overfitting and as a result in large estimates of regression coefficients. Augmentation of the estimation procedure with a constraint on the regression coefficients is a simple remedy to large parameter estimates. As a consequence it decreases the probability of overfitting. In the following we will discuss methods which minimize the residual sum of squares, \(\rm{RSS}(\beta)\), under some contstraints on the parameter \(\beta\).

4.1 Subset selection

The most traditional approach to impose constraints is subset selection. In this approach we retain only a subset of the variables, and eliminate the rest from the model. OLS is used to estimate the coefficients of the inputs that are retained. More formally, given a subset \(S\subset\{1,\ldots,p\}\) we solve the optimization problem

\[\begin{align*} \hat{\beta}_{S}&={\rm argmin}_{\beta_j=0\;\rm{for\; all}\;j\notin S}\rm{RSS}(\beta). \end{align*}\]

It is easy to show that this is equivalent to OLS regression based on subset \(S\) covariates, i.e.

\[ \hat{\beta}_{S}=({\bf X}_S^T {\bf X}_S)^{-1}{\bf X}_S^T {\bf y}. \]

In practice we need to explore a sequence of subsets \(S_1,\ldots,S_K\) and choose an optimal subset by either a re-sampling approach or by using an information criteria (see Section 1.2). There are a number of different strategies available. Best subsets regression consists of looking at all possible combinations of covariates. Rather than search though all possible subsets, we can seek a good path through them. Two popular approaches are backward stepwise selection which starts with the full model and sequentially deletes covariates, whereas forward stepwise selection starts with the intercept, and then sequentially adds into the model the covariate that most improves the fit.

In R we can use regsubsets from the leaps package or stepAIC from the MASS package to perform subset selection. For example to perform forward stepwise selection based on AIC we proceed as follows.

# Forward regression
fit0 <- lm(y~1,data=dtrain)
fit.fw <- stepAIC(fit0,
                         upper=paste("~", paste(colnames(dtrain[,-10]), 
                                                collapse=" + "))
                  trace = FALSE

We can summarize the stepwise selection process.

Step Df Deviance Resid. Df Resid. Dev AIC
NA NA 9 22.468 10.095
+ X1 1 20.017 8 2.450 -10.064
+ X4 1 0.883 7 1.567 -12.535
+ X9 1 0.376 6 1.191 -13.277

Finally we can retrieve the regression coefficients of the chosen model.

term estimate std.error statistic p.value
(Intercept) 0.210 0.157 1.334 0.231
X1 1.611 0.243 6.624 0.001
X4 -0.508 0.205 -2.475 0.048
X9 -0.322 0.234 -1.376 0.218

4.2 Ridge regression

Subset selection as outlined above works by either including or excluding covariates, i.e. contrain specific regression coefficients to be zero.

An alternative is Ridge Regression, which constrains (or regularizes) the optimization problem by shrinking regression coefficients towards zero. This discourages complex models because models that overfit tend to have larger coefficients. Ridge regression can be formulated as a constrained optimization problem

\[\begin{align*} \hat{\beta}^{\rm Ridge}_{c}&=\rm{argmin}_{\|\beta\|_2^2\leq c}\rm{RSS}(\beta). \end{align*}\]

The geometry of the optimization problem is illustrated in Figure 4.1. It shows the levels sets of \({\rm RSS}(\beta)\), ellipsoids centered around the OLS estimate, and the circular ridge parameter constraint, centered around zero with radius \(c > 0\). The Ridge estimator is then the point where the smallest level set hits the constraint. Exactly at that point the \(\rm{RSS}(\beta)\) is minimized over those \(\beta\)’s that “live” inside the constraint.

Ridge regression geometry.

Figure 4.1: Ridge regression geometry.

Alternatively, ridge regression can be cast as the optimization of the penalised residual sum of squares with a penalty on the magnitude of the coefficients, i.e. 

\[\begin{align*} \hat{\beta}^{\rm Ridge}_{\lambda}&=\rm{argmin}\; \rm{RSS}(\beta)+\lambda\|\beta\|^2_2. \end{align*}\]

Both formulations are equivalent in the sense that there is a one-to-one relationship between the tuning parameters \(c\) and \(\lambda\). We will use more often the latter “penalisation” formulation. The parameter \(\lambda\) is the amount of penalisation. Note that with no penalization, \(\lambda=0\), ridge regression coincides with OLS. Increasing \(\lambda\) has the effect of shrinking the regression coefficients to zero.

The Ridge optimization problem has the closed form solution (see exercises, Section 7)

\[\begin{align*} \hat{\beta}^{\rm Ridge}_{\lambda}&=(\bf X^T \bf X+\lambda \bf I)^{-1}\bf X^T \bf y. \end{align*}\]

Note that for \(\lambda>0\) the matrix \({\bf X}^T {\bf X}+\lambda \bf I\) has always full rank and therefore Ridge regression is well defined even in the high-dimensional context (in contrast to OLS).

Ridge regression is implemented in the package glmnet. We can call

fit.ridge.glmnet <-glmnet(x=xtrain,y=ytrain,alpha=0) 

4.2.1 Choice of penalty parameter

We need to select the tuning parameters is \(\lambda\). We proceed by choosing a grid values \(0<\lambda_1<\lambda_2<\ldots<\lambda_K<\infty\) and proceed as explained in Section 1.2, that is we choose the optimal \(\lambda_{\rm opt}\) by either re-sampling or information criteria. In glmnet we use cross-validation using the command cv.glmnet.

cv.ridge.glmnet <-cv.glmnet(x=xtrain,y=ytrain,alpha=0) 

The tuning parameter with smallest cross-validation error is stored in the argument lambda.min.

## [1] 0.8286695

Another choice is lambda.1se which denotes the largest \(\lambda\) within 1 standard error of the smallest cross-validation error.

## [1] 3.671521

4.2.2 Effective degrees of freedom

Although ridge regression involves all \(p\) covariates the effective degrees of freedom is smaller than \(p\) as we have imposed constraints through the penalty. In the book Hastie, Tibshirani, and Friedman (2001) it is shown that the effective degrees of freedom for Ridge regression, \(\nu^{\rm ridge}_{\lambda}\), is given by

\[\nu^{\rm ridge}_{\lambda}=\sum_{j=1}^{p}\frac{d_j^2}{d_j^2+\lambda},\] where \(d_1,\ldots,d_p\) are the singular values of \(\bf X\).

# get singular values
fit.svd <- svd(xtrain) #fit.svd$d

# ridge degree of freedom for lambdaopt
df_lambdaopt <- sum(fit.svd$d^2/(fit.svd$d^2+cv.ridge.glmnet$lambda.1se))
## [1] 4.390408

Thus, the residual degrees of freedom in our example is \(n-\nu^{\rm ridge}_{\lambda}\)=5.61. Note that this is inline with the “rule of thumb” which we mentioned in Section 3.2.1.

4.2.3 Shrinkage property

In Section 3.2.2 we have seen that in presence of collinearity the OLS estimator becomes unstable (high variance) and that this poses a challenge especially in high-dimensions. A nice property of ridge regression is that it counteracts this by shrinking low-variance components more than high-variance components.

The OLS vector of fitted values \({\bf \hat y}^{\rm OLS}\) is the orthogonal projection of \({\bf y}\) onto the column space of \(\bf X\). In terms of normalized principle components we have

\[{\bf\hat y}^{\rm OLS}=\sum_{j=1}^{p}{\bf u}_j {\bf u}_j^T {\bf y}.\]

Similarly we can represent the Ridge projection as

\[{\bf\hat y}^{\rm Ridge}=\sum_{j=1}^{p}{\bf u}_j \frac{d_j^2}{d_j^2+\lambda}{\bf u}_j^T {\bf y}.\]

This shows that the level of shrinkage \(\frac{d_j^2}{d_j^2+\lambda}\) is largest in the direction of the last principle component, which in return is the direction in the column space of \(\bf X\) with largest variance (see Figure 4.2).

Principle components of 2-dimensional input data.

Figure 4.2: Principle components of 2-dimensional input data.

4.2.4 Bayesian interpretation

Ridge regression has a nice interpretation as the maximum a posteriori (MAP) estimate of a hierarchical bayesian model where the data follows a multivariate regression model \[Y_i|X_i,\beta\sim N(X_i^T\beta,\sigma^2),\; i=1,\ldots,n\] and the regression coefficients are equipped with prior \[\beta_j \sim N(0,\tau^2),\; j=1,\ldots,p.\]

We will elaborate on this connection in the exercises.

4.2.5 Towards non-parametric regression

Ridge regression and high-dimensionality play a role in many subfields of statistics. We illustrate this with the example of smoothing splines for univariate non-parametric regression.

Sometimes it is extremely unlikely that the true function \(f(X)\) is actually linear in \(X\). Consider the following simulation example

# define a non-linear function
n <- 101
x <- seq(0, 1, length.out = n)
fx <- sin(2 * pi * x)

# generate noisy data
y <- fx + rnorm(n, sd = 0.5)

plot(x, y)             # plot of the data
lines(x, fx, lwd = 2)  # plot of true non-linear function f(x)
legend("topright", legend = "f(x)", lty = 1, lwd = 2, bty = "n")

One approach to find a non-parametric approximation are so-called splines which use piecewise polynomials to approximate the unknown function. We assume \(f\) to be represented by a spline function which can be expressed by a set of basis functions

\[ f(X)=\sum_{m=1}^{p}\beta_m B_m(X).\] For example for a cubic spline with \(K\) fixed knots and fixed polynomial degree \(d=3\) (“cubic”) we have \(p=K+d+1\) and the \(B_m(x)\)’s form a B-spline basis (one could also use the truncated-power basis). The coefficients \(\beta_m\) are estimated using OLS. Although we have only one single variable \(X\) the design matrix consists of \(p=K+d+1\) features and the estimation problem becomes quickly high-dimensional.

In R we obtain a B-spline basis with bs and we can plot the basis functions \(B_m(x)\) as follows.

spl <- bs(x,df=10) # cubic spline with p=10 degrees of freedom
plot(spl[,1]~x, ylim=c(0,max(spl)), type='l', lwd=2, col=1, 
     xlab="Cubic B-spline basis", ylab="")
for (j in 2:ncol(spl)) lines(spl[,j]~x, lwd=2, col=j)

The estimated coefficients \(\hat \beta_m\) are obtain using lm.

fit.csp <- lm(y~spl)
#fit.csp <- lm(y~bs(x,df=10))
## (Intercept)        spl1        spl2        spl3        spl4        spl5 
##  -0.1664090   0.6710022   1.0956429   1.2056968   0.9713568   0.2323033 
##        spl6        spl7        spl8        spl9       spl10 
##  -0.2876482  -1.2456044  -0.3914716   0.2894841  -0.4376537

The cubic spline with \(p=10\) degrees of freedom fits the data well as shown in the next plot (in violet).

plot(x, y)
lines(x, fx, lwd = 2)
lines(x, predict(fit.csp), lty = 2, col = "violet",lwd=3)

An alternative approach are so-called smoothing splines, where we take \(p=n\) and the \(B_m(x)\)’s are an n-dimensional set of basis functions representing the family of natural cubic splines with knots at the unique values of \(x_i\), \(i=1,\ldots,n\). The coefficients \(\beta_m\) cannot be estimated using OLS as the number \(p\) of basis function (columns of the design matrix) equals the number of observations \(n\). Smoothing splines overcome this hurdle by imposing a generalized ridge penalty on the spline coefficients \(\beta_j\), i.e.

\[\hat{\beta}_{\lambda}=\rm{argmin}\;\|\bf y- \bf B \beta\|^2+\lambda \beta^T\Omega\beta,\]

where \(\bf B\) is the design matrix with \(jth\) column \((B_j(x_1),\ldots,B_j(x_n))^T\). In practice we can fit smoothing spline using the function smooth.spline. The penalty term is specified by setting the effective degrees of freedom (\(\nu\)) or by selecting \(\lambda\) using cross-validation (see Section 4.2.1).

We fit a smoothing splines to our simulation example.

fit.smsp.df10 <- smooth.spline(x, y, df = 10) # smoothing spline with 10 effective degrees of freedom
fit.smsp.df30 <- smooth.spline(x, y, df = 30) # smoothing spline with 30 effective degrees of freedom <- smooth.spline(x, y) # smoothing spline with effective degrees of freedom estimated by cv
plot(x, y)
lines(x, fx, lwd = 2)
lines(x, fit.smsp.df10$y, lty = 2, col = "blue",lwd=3)
lines(x, fit.smsp.df30$y, lty = 2, col = "green",lwd=3)
lines(x,$y, lty = 2, col="red",lwd=3)

The smoothing spline with \(\nu=30\) (in green) leads to overfitting. The smoothing splines obtained by cross-validation (in red) or by fixing \(\nu=10\) (in blue) are both good approximation of the truth. The corresponding effective degrees of freedom of the cross-validation solution can be retrieved from the model fit.$df
## [1] 6.458247

4.3 Lasso regression

We have seen that ridge regression works in the high-dimensional context and we have also discussed some nice properties. However, a disadvantage of ridge regression is that it does not perform variable selection and therefore the interpretation of the model is more challenging.

In ridge regression we minimize \(\rm RSS(\beta)\) given constraints on the so-called L2-norm of the regression coefficients

\[\|\beta\|^2_2=\sum_{j=1}^p \beta^2_j \leq c.\]

Another very popular approach in high-dimensional statistics is Lasso Regression. The Lasso works very similarly. The only difference is that constraints are imposed on the L1-norm of the coefficients

\[\|\beta\|_1=\sum_{j=1}^p |\beta_j| \leq c.\]

Therefore the Lasso is referred to as L1 regularization (penalization). The change in the form of the constraints (L2 vs L1) has important implications. Figure 4.3 illustrates the geometry of the Lasso optimization. Geometrically the Lasso constraint is a diamond with “corners” (the Ridge constraint is a circle). If the sum of squares “hits” one of these corners then the coefficient corresponding to the axis is shrunk to zero. As \(p\) increases, the multidimensional diamond has an increasing number of corners, and so it is highly likely that some coefficients will be set to zero. Hence, the Lasso performs not only shrinkage but it also sets some coefficients to zero, in other words the Lasso simultaneously performs variable selection. A disadvantage of the “diamond” geometry is that in general there is no closed form solution for the Lasso (the Lasso optimisation problem is not differentiable at the corners of the diamond).

Lasso regression geometry.

Figure 4.3: Lasso regression geometry.

Similar to ridge regression the Lasso can be formulated as a penalisation problem

\[\begin{align*} \hat{\beta}^{\rm Lasso}_{\lambda}&=\rm{argmin}\;\rm{RSS}(\beta)+\lambda\|\beta\|_1. \end{align*}\]

To fit the Lasso we use glmnet.

fit.lasso.glmnet <-glmnet(x=xtrain,y=ytrain,alpha=1) 

The following Figure shows the Lasso solution for a grid of \(\lambda\) values. We note that the Lasso shrinks some coefficients to exactly zero.


We again choose the optimal tuning parameter \(\lambda_{\rm opt}\) by cross-validation.

cv.lasso.glmnet <-cv.glmnet(x=xtrain,y=ytrain,alpha=1) 

## [1] 0.2201019

The coefficient for the optimal model can be extracted using the coef function.

beta.lasso <- coef(fit.lasso.glmnet, s = cv.lasso.glmnet$lambda.min)
names(beta.lasso) <- colnames(xtrain)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)  0.08727244
## V1           1.44830414
## V2          -0.04302609
## V3           .         
## V4          -0.07325330
## V5           .         
## V6           .         
## V7           .         
## V8          -0.24778236
## V9           .

We now discuss some properties of the Lasso.

4.3.1 Numerical optimization and soft thresholding

We mentioned in the beginning that in general the Lasso optimization problem has not a closed from solution. The reason is that the absolute value function is not differential at zero.

In the case of orthonormal design matrix \(\bf X\), ie. \(\bf X^T\bf X=\bf I\), we have

\[\begin{align*} \rm{RSS}(\beta)&=({\bf y}-{\bf X}\beta)^T({\bf y}-{\bf X}\beta)\\ &={\bf y}{\bf y}-2\beta^T\hat\beta^{\rm OLS}+beta^T\hat\beta \end{align*}\]

and therefore the Lasso optimization reduces to \(j=1,\ldots,p\) univariate problems

\[\rm{minimize}\; -\hat\beta_j^{\rm OLS}\beta_j+\beta_j^2+\lambda |\beta_j|.\]

In the exercises we will show that the solution is

\[\hat \beta_{\lambda,j}^{\rm Lasso}=\rm{sign}(\hat\beta_j^{\rm OLS})\left(|\hat\beta_j^{\rm OLS}|-0.5\lambda\right)_{+}.\]

That is, in the orthonormal case, the Lasso is a function of the OLS estimator. This function, depicted in the next figure, is referred to as soft-thresholding.

In general there is no closed-form solution for the Lasso. The optimization has to be performed numerically. An efficient algorithm is implemented in glmnet and is referred to as “Pathwise Coordinate Optimization.” The algorithm updates one regression coefficient at a time using the soft-thresholding function. This is done iteratively until some convergence criteria is met.

4.3.2 Variable selection

The Lasso does not only shrink coefficients to zero but also performs variable selection and therefore leads to more interpretabel models. For the Lasso we can define the set of selected variables

\[\hat S^{\rm Lasso}_{\lambda}=\{j\in (1,\ldots,p); \hat\beta^{\rm Lasso}_{\lambda,j}\neq 0\}\]

In our example this set can be obtained as follows.

Shat <- rownames(beta.lasso)[which(beta.lasso != 0)]
## [1] "(Intercept)" "V1"          "V2"          "V4"          "V8"

An interesting question is whether the Lasso does a good job in variable selection. That is, does \(\hat S^{\rm Lasso}_{\lambda}\) tend to agree with the true set of active variables \(S_0\)? Or, does the Lasso typically under - or over select covariates? These questions are an active field of mathematical statistical research.

4.3.3 Elastic Net

We have encountered the L1 and L2 penalty. The Lasso penalty has the nice property that it leads to sparse solutions, i.e. it simultaneously performs variable selection. A disadvantage is that the Lasso penalty is somewhat indifferent to the choice among a set of strong but correlated variables. The Ridge penalty, on the other hand, tends to shrink the coefficients of correlated variables toward each other. An attempt to take the best of both worlds is the Elastic Net penalty which has the form

\[\lambda \big(\alpha \|\beta\|_2^2+(1-\alpha)\|\beta\|_1\big).\]

The second term encourages highly correlated features to be averaged, while the first term encourages a sparse solution in the coefficients of these averaged features.

In glmnet the Elastic Net penalty is implemented with the argument \(\alpha\). The default is \(\alpha=0\), i.e. the Lasso.

4.4 Diabetes example

We now review what we have learned with an example.The data which we consider consist of observations on 442 patients, with the response of interest being a quantitative measure of disease progression one year after baseline. There are ten baseline variables — age, sex, body-mass index, average blood pressure, and six blood serum measurements — plus quadratic terms, giving a total of \(p=64\) features. The task for a statistician is to construct a model that predicts response \(Y\) from the covariates. The two hopes are, that the model would produce accurate baseline predictions of response for future patients, and also that the form of the model would suggest which covariates were important factors in disease progression.

We start by splitting the data into training and test data.

library(lars) # lars package contains the diabetes data
data <-$y,diabetes$x2))
colnames(data) <- gsub(":",".",colnames(data))
train_ind <- sample(seq(nrow(data)),size=nrow(data)/2)
data_train <- data[train_ind,]
xtrain <- as.matrix(data_train[,-1])
ytrain <- data_train[,1]
data_test <- data[-train_ind,]
xtest <- as.matrix(data_test[,-1])
ytest <- data_test[,1]

We perform subset selection using forward regression.

# Forward regression
fit0 <- lm(y~1,data=data_train)
fit.fw <- stepAIC(fit0,direction="forward",
                                           collapse=" + ")
                  trace = FALSE

The selection process is depicted in the following table.

Step Df Deviance Resid. Df Resid. Dev AIC
NA NA 220 1262297.5 1913.71
+ bmi 1 434735.33 219 827562.1 1822.40
+ ltg 1 155835.95 218 671726.2 1778.30
+ 1 47106.62 217 624619.6 1764.23
+ map 1 29740.28 216 594879.3 1755.45
+ bmi.glu 1 22952.37 215 571926.9 1748.75
+ hdl 1 19077.03 214 552849.9 1743.25
+ sex 1 15702.72 213 537147.2 1738.89
+ hdl.tch 1 9543.83 212 527603.3 1736.92
+ sex.ldl 1 5735.62 211 521867.7 1736.51
+ tch.ltg 1 6279.00 210 515588.7 1735.83
+ 1 5342.10 209 510246.6 1735.53

The regression coefficients and corresponding statistics of the model selected by forward regression are shown next.

term estimate std.error statistic p.value
(Intercept) 155.72 3.36 46.29 0.00
bmi 466.07 81.82 5.70 0.00
ltg 497.33 94.05 5.29 0.00 274.22 76.35 3.59 0.00
map 315.78 80.98 3.90 0.00
bmi.glu 206.59 74.57 2.77 0.01
hdl -392.14 94.40 -4.15 0.00
sex -201.94 80.87 -2.50 0.01
hdl.tch -210.17 87.81 -2.39 0.02
sex.ldl 118.77 74.81 1.59 0.11
tch.ltg -146.12 89.83 -1.63 0.11 119.49 80.78 1.48 0.14

We continue by fitting ridge regression. We show the trace plot and the cross-validation plot.

# Ridge
fit.ridge <- glmnet(xtrain,ytrain,alpha=0) <- cv.glmnet(xtrain,ytrain,alpha=0)


Finally, we run the Lasso approach and show trace and cross-validation plots.

# Lasso
fit.lasso <- glmnet(xtrain,ytrain,alpha=1) <- cv.glmnet(xtrain,ytrain,alpha=1)


We calculate the root-mean-square errors (RMSE) on the test data and compare the generalization error with the full model.

# Full model
fit.full <- lm(y~.,data=data_train)
pred.full <- predict(fit.full,newdata=data_test)
pred.fw <- predict(fit.fw,newdata=data_test)
pred.ridge <- as.vector(predict(fit.ridge,newx=xtest,$lambda.1se))
pred.lasso <- as.vector(predict(fit.lasso,newx=xtest,$lambda.1se))
res.rmse <- data.frame(
kable(res.rmse,digits = 2)
method rmse
full 84.51
forward 59.89
ridge 62.63
lasso 58.47

The Lasso has the lowest RMSE on test data. We plot the regression coefficients for all 3 methods.

We point out that the same analysis can be conducted with the caret package. The code to do so is provided next.

tc <- trainControl(method = "cv", number = 10)

## Ridge
lambda.grid <-$lambda
                       method = "glmnet",
                       tuneGrid = expand.grid(alpha = 0,lambda=lambda.grid),
                       trControl = tc

# CV curve
# Best lambda
# Model coefficients
# Make predictions
fit.ridge.caret %>% predict(xtest,$lambda.1se)%>%head

## Lasso
lambda.grid <-$lambda
                       method = "glmnet",
                       tuneGrid = expand.grid(alpha = 1,lambda=lambda.grid),
                       trControl = tc

# CV curve
# Best lambda
# Model coefficients
# Make predictions
fit.lasso.caret %>% predict(xtest,$lambda.1se)%>%head

## Compare Ridge and Lasso
models <- list(ridge= fit.ridge.caret,ridge = fit.lasso.caret)
resamples(models) %>% summary( metric = "RMSE")


Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.