## 4.2 Constrained linear models

As outlined in the previous section, after doing variable selection with the lasso[^124], two possibilities are: (i) fit a linear model on the lasso-selected predictors; (ii) run a stepwise selection starting from the lasso-selected model to try to further improve it[^125].

Let’s explore the intuitive idea behind (i) in more detail. For the sake of exposition, assume that, among the $$p$$ predictors, lasso zeroed out the first $$q$$ of them[^126]. Then, once $$q$$ is known, we would seek to fit the model

\begin{align*} Y=\beta_0+\beta_1X_1+\cdots+\beta_pX_p+\varepsilon,\quad \text{subject to} \quad\beta_1=\ldots=\beta_q=0. \end{align*}

This is a very simple constraint that we know how to handle: just include the $$p-q$$ remaining predictors in the model and fit it. It is, however, a particular case of a linear constraint on $$\boldsymbol\beta,$$ since $$\beta_1=\ldots=\beta_q=0$$ is expressible as

\begin{align} \begin{pmatrix} \mathbf{I}_q & \mathbf{0}_{q\times (p-q)} \end{pmatrix}_{q\times p}\boldsymbol\beta_{-1}=\mathbf{0}_q,\tag{4.11} \end{align}

where $$\mathbf{I}_q$$ is the $$q\times q$$ identity matrix and $$\boldsymbol\beta_{-1}=(\beta_1,\ldots,\beta_p)'.$$ The constraint in (4.11) can be generalized as $$\mathbf{A}\boldsymbol{\beta}_{-1}=\mathbf{c},$$ which results in the (linearly) constrained linear model

\begin{align} Y=\beta_0+\beta_1X_1+\cdots+\beta_pX_p+\varepsilon,\quad \text{subject to}\quad \mathbf{A}\boldsymbol{\beta}_{-1}=\mathbf{c},\tag{4.12} \end{align}

where $$\mathbf{A}$$ is a $$q\times p$$ matrix[^127] of rank $$q$$ and $$\mathbf{c}\in\mathbb{R}^q.$$ The constrained linear model (4.12) is useful when prior information is available about a linear relation that the coefficients of the linear model must satisfy (e.g., in piecewise polynomial fitting).
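
For instance, the constraint matrix in (4.11) can be built explicitly in R. The following is a minimal sketch with illustrative dimensions $$p=5$$ and $$q=2$$ (both chosen arbitrarily for this example):

```r
# Encode beta_1 = ... = beta_q = 0 as A %*% beta = 0, for p = 5 and q = 2
p <- 5
q <- 2
A <- cbind(diag(q), matrix(0, nrow = q, ncol = p - q)) # q x p, rank q
A
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
```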

Before fitting the model (4.12), let’s assume from now on that the variables $$Y$$ and $$X_1,\ldots,X_p,$$ as well as the sample $$\{(\mathbf{X}_i,Y_i)\}_{i=1}^n,$$ are centered (see the tip at the end of Section 2.4.4). This means that $$\bar Y=0$$ and that $$\bar{\mathbf{X}}:=(\bar X_1,\ldots,\bar X_p)'$$ is the zero vector. More importantly, it also means that $$\beta_0$$ and $$\hat\beta_0$$ are null, hence the intercept is not included in the model. That is, the model

\begin{align} Y = \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon \tag{4.13} \end{align}

is considered. In this setting, $$\boldsymbol\beta=(\beta_1,\ldots,\beta_p)'$$[^128] and $$\hat{\boldsymbol\beta}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$ is the least squares estimator, with the design matrix $$\mathbf{X}$$ now omitting the first column of ones.

Now, the estimator of $$\boldsymbol\beta$$ in (4.13) from a sample $$\{(\mathbf{X}_i,Y_i)\}_{i=1}^n$$ under the linear constraint $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}$$ is defined as

\begin{align} \hat{\boldsymbol{\beta}}_{\mathbf{A}}:=\arg\min_{\substack{\boldsymbol{\beta}\in\mathbb{R}^{p}\\\mathbf{A}\boldsymbol{\beta}=\mathbf{c}}}\text{RSS}_0(\boldsymbol{\beta}),\quad \text{RSS}_0(\boldsymbol{\beta}):=\sum_{i=1}^n(Y_i-\beta_1X_{i1}-\cdots-\beta_p X_{ip})^2.\tag{4.14} \end{align}

Solving (4.14) analytically is possible using Lagrange multipliers: differentiating the Lagrangian $$\mathcal{L}(\boldsymbol{\beta},\boldsymbol{\lambda})=\text{RSS}_0(\boldsymbol{\beta})+2\boldsymbol{\lambda}'(\mathbf{A}\boldsymbol{\beta}-\mathbf{c}),$$ with $$\boldsymbol\lambda\in\mathbb{R}^q$$ the vector of multipliers, and equating the derivatives to zero, the explicit solution to (4.14) can be seen to be

\begin{align} \hat{\boldsymbol\beta}_{\mathbf{A}}=\hat{\boldsymbol\beta}+(\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1} (\mathbf{c}-\mathbf{A}\hat{\boldsymbol\beta}).\tag{4.15} \end{align}

For the general case given in (4.12), in which neither the variables nor the sample are centered, the estimator of the slopes in (4.12) is unaltered and equals (4.15). The intercept is given by \begin{align*} \hat{\beta}_{\mathbf{A},0}=\bar Y-\bar{\mathbf{X}}'\hat{\boldsymbol{\beta}}_{\mathbf{A}}. \end{align*}

The next code illustrates how to fit a linear model with constraints in practice.

```r
# Simulate data
set.seed(123456)
n <- 50
p <- 3
x1 <- rnorm(n, mean = 1)
x2 <- rnorm(n, mean = 2)
x3 <- rnorm(n, mean = 3)
eps <- rnorm(n, sd = 0.5)
y <- 1 + 2 * x1 - 3 * x2 + x3 + eps

# Center the data and compute design matrix
x1Cen <- x1 - mean(x1)
x2Cen <- x2 - mean(x2)
x3Cen <- x3 - mean(x3)
yCen <- y - mean(y)
X <- cbind(x1Cen, x2Cen, x3Cen)

# Linear restrictions to impose:
#   beta_1 + beta_2 + beta_3 = 0
#   beta_2 = -3
# In this case q = 2. The restrictions are encoded as
A <- rbind(c(1, 1, 1),
           c(0, 1, 0))
c <- c(0, -3) # Masks base R's c(), but calls to c() still work

# Fit model without intercept
S <- solve(crossprod(X))
beta_hat <- S %*% t(X) %*% yCen
beta_hat
##             [,1]
## x1Cen  1.9873776
## x2Cen -3.1449015
## x3Cen  0.9828062

# Restricted fit enforcing A * beta = c
beta_hat_A <- beta_hat +
  S %*% t(A) %*% solve(A %*% S %*% t(A)) %*% (c - A %*% beta_hat)
beta_hat_A
##             [,1]
## x1Cen  2.0154729
## x2Cen -3.0000000
## x3Cen  0.9845271

# Intercept of the constrained fit
beta_hat_A_0 <- mean(y) - c(mean(x1), mean(x2), mean(x3)) %*% beta_hat_A
beta_hat_A_0
##         [,1]
## [1,] 1.02824
```
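
As a quick sanity check, we can verify that the constrained estimate satisfies the restrictions and that the slopes of the unconstrained fit coincide with those given by `lm()`:

```r
# The constrained estimate satisfies the restrictions (up to rounding)
A %*% beta_hat_A

# The slopes of lm() match beta_hat, since centering does not affect them
lm(y ~ x1 + x2 + x3)$coefficients
```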

What about inference? In principle, it can be derived analogously to the inference for the unconstrained linear model in Section 2.4, since the distribution of $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$ under the assumptions of the linear model is straightforward to obtain. We keep assuming that the model is centered. Then, recall that (4.15) can be expressed as

\begin{align*} \hat{\boldsymbol\beta}_{\mathbf{A}}=&\,(\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1}\mathbf{c}\\ &+\left(\mathbf{I}-(\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1} \mathbf{A}\right)\hat{\boldsymbol\beta}. \end{align*}

Therefore, using (1.4) and proceeding similarly to (2.11),

\begin{align} \hat{\boldsymbol\beta}_{\mathbf{A}}\sim\mathcal{N}_p\left(\boldsymbol{\beta}+b(\boldsymbol{\beta}, \mathbf{A}, \mathbf{c}, \mathbf{X}),\sigma^2(\mathbf{X}'\mathbf{X})^{-1}-v(\sigma^2,\mathbf{A},\mathbf{X})\right),\tag{4.16} \end{align}

where

\begin{align*} b(\boldsymbol{\beta}, \mathbf{A}, \mathbf{c}, \mathbf{X})&:=(\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1}(\mathbf{c}- \mathbf{A}\boldsymbol{\beta}),\\ v(\sigma^2,\mathbf{A},\mathbf{X})&:=\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1}\mathbf{A}(\mathbf{X}'\mathbf{X})^{-1}. \end{align*}

Inference for constrained linear models is not built into base R. Therefore, we just give a couple of insights into (4.16) and do not pursue inference further. Note that:

- The variances of $$\hat{\beta}_{\mathbf{A},j},$$ $$j=1,\ldots, p,$$ decrease with respect to the variances of $$\hat{\beta}_j,$$ given by the diagonal elements of $$\sigma^2(\mathbf{X}'\mathbf{X})^{-1}.$$ This is perfectly coherent: after all, we are constraining the possible values that the estimator of $$\boldsymbol\beta$$ can take in order to accommodate $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}.$$ More importantly, these variances remain the same irrespective of whether $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}$$ holds or not[^129] (see the sketch after this list).

- The bias of $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$ depends on the veracity of $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}$$. If the restriction holds, then $$b(\boldsymbol{\beta}, \mathbf{A}, \mathbf{c}, \mathbf{X})=\mathbf{0}$$ and $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$ is still unbiased. However, if $$\mathbf{A}\boldsymbol{\beta}\neq\mathbf{c},$$ then $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$ is severely biased in estimating $$\boldsymbol\beta.$$
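
To make the first point concrete, the following small sketch evaluates the covariance reduction $$v(\sigma^2,\mathbf{A},\mathbf{X})$$ for the simulated data above, taking $$\sigma=0.5$$ as in the simulation and reusing the objects `S` and `A` from the previous code:

```r
# Covariance reduction induced by the constraints on the simulated data
sigma2 <- 0.5^2 # True error variance employed in the simulation
v <- sigma2 * S %*% t(A) %*% solve(A %*% S %*% t(A)) %*% A %*% S

# Variances of the unconstrained vs. constrained estimators. The second
# constrained variance is zero: beta_2 is fixed to -3 by the restriction
diag(sigma2 * S)
diag(sigma2 * S - v)
```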

Verify by Monte Carlo that the covariance matrix in (4.16) is correct. To do so:

1. Choose $$\boldsymbol\beta,$$ $$\mathbf{A},$$ and $$\mathbf{c}$$ at your convenience.
2. Sample $$n=50$$ observations for the predictors.
3. Sample $$n=50$$ observations for the responses from a linear model based on $$\boldsymbol\beta.$$ Use the same $$n$$ observations for the predictors from Step 2.
4. Compute $$\hat{\boldsymbol\beta}_{\mathbf{A}}.$$
5. Repeat Steps 3–4 $$M=500$$ times, saving each time $$\hat{\boldsymbol\beta}_{\mathbf{A}}.$$
6. Compute the sample covariance matrix of the $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$’s.
7. Compare it with the covariance matrix in (4.16).
Do the same study for checking the expectation in (4.16), for the cases in which $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}$$ and $$\mathbf{A}\boldsymbol{\beta}\neq\mathbf{c}.$$
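
For reference, a minimal sketch of the covariance check follows, reusing `X`, `S`, `A`, `c`, `sigma2`, and `v` from the previous code chunks; note that the true $$\boldsymbol\beta=(2,-3,1)'$$ of the simulation satisfies $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}.$$

```r
# Monte Carlo check of the covariance matrix in (4.16)
set.seed(123456)
M <- 500
beta <- c(2, -3, 1) # True coefficients; they satisfy A %*% beta = c
beta_hat_A_mc <- matrix(0, nrow = M, ncol = p)
for (m in 1:M) {
  # New responses on the same (already centered) design matrix X; further
  # centering of yMc is unnecessary because X has zero column means
  yMc <- X %*% beta + rnorm(n, sd = 0.5)
  beta_hat_m <- S %*% t(X) %*% yMc # Unconstrained estimate
  beta_hat_A_mc[m, ] <- beta_hat_m + S %*% t(A) %*%
    solve(A %*% S %*% t(A)) %*% (c - A %*% beta_hat_m)
}

# Sample covariance vs. the theoretical covariance in (4.16)
cov(beta_hat_A_mc)
sigma2 * S - v

# Sample mean vs. beta: no bias, since A %*% beta = c holds
colMeans(beta_hat_A_mc)
```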

[^124]: For example, based on the data-driven penalization parameters $$\hat\lambda_{k\text{-CV}}$$ or $$\hat\lambda_{k\text{-1SE}}.$$

[^125]: Note that with this approach we assign to the more computationally efficient lasso the “hard work” of coming up with a set of relevant predictors from the whole dataset, whereas the refinement of that model is done with the more demanding stepwise regression (if the number of predictors is smaller than $$n$$).

[^126]: Note that $$q$$ is a random quantity, but we ignore this fact for the sake of exposition.

[^127]: Therefore, usually not invertible.

[^128]: We do not require the previous notation $$\boldsymbol{\beta}_{-1}$$!

[^129]: They do not depend on $$\mathbf{c}$$!