## 4.2 Constrained linear models

As outlined in the previous section, after doing variable selection with the lasso[^124], two possibilities are: (i) fit a linear model on the lasso-selected predictors; (ii) run a stepwise selection starting from the lasso-selected model to try to further improve it[^125].

Let’s explore the intuitive idea behind (i) in more detail. For the sake of exposition, assume that, among the $$p$$ predictors, lasso zeroed out the first $$q$$ of them[^126]. Then, once $$q$$ is known, we would seek to fit the model

\begin{align*} Y=\beta_0+\beta_1X_1+\cdots+\beta_pX_p+\varepsilon,\quad \text{subject to} \quad\beta_1=\ldots=\beta_q=0. \end{align*}

This is a very simple constraint that we know how to handle: just include the $$p-q$$ remaining predictors in the model and fit it. It is, however, a particular case of a linear constraint on $$\boldsymbol\beta,$$ since $$\beta_1=\ldots=\beta_q=0$$ is expressible as

\begin{align} \begin{pmatrix} \mathbf{I}_q & \mathbf{0}_{q\times (p-q)} \end{pmatrix}_{q\times p}\boldsymbol\beta_{-1}=\mathbf{0}_q,\tag{4.11} \end{align}

where $$\mathbf{I}_q$$ is the $$q\times q$$ identity matrix and $$\boldsymbol\beta_{-1}=(\beta_1,\ldots,\beta_p)'.$$ The constraint in (4.11) can be generalized as $$\mathbf{A}\boldsymbol{\beta}_{-1}=\mathbf{c},$$ which results in the (linearly) constrained linear model

\begin{align} Y=\beta_0+\beta_1X_1+\cdots+\beta_pX_p+\varepsilon,\quad \text{subject to}\quad \mathbf{A}\boldsymbol{\beta}_{-1}=\mathbf{c},\tag{4.12} \end{align}

where $$\mathbf{A}$$ is a $$q\times p$$ matrix[^127] of rank $$q$$ and $$\mathbf{c}\in\mathbb{R}^q.$$ The constrained linear model (4.12) is useful when prior information is available about a linear relation that the coefficients of the linear model must satisfy (e.g., in piecewise polynomial fitting).
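
For instance, the constraint matrix in (4.11) can be built explicitly in R. The following is a minimal sketch with illustrative dimensions $$p=5$$ and $$q=2$$ (both chosen arbitrarily for this example):

```r
# Encode beta_1 = ... = beta_q = 0 as A %*% beta = 0, for p = 5 and q = 2
p <- 5
q <- 2
A <- cbind(diag(q), matrix(0, nrow = q, ncol = p - q)) # q x p, rank q
A
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
```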

Before fitting the model (4.12), let’s assume from now on that the variables $$Y$$ and $$X_1,\ldots,X_p,$$ as well as the sample $$\{(\mathbf{X}_i,Y_i)\}_{i=1}^n,$$ are centered (see the tip at the end of Section 2.4.4). This means that $$\bar Y=0$$ and that $$\bar{\mathbf{X}}:=(\bar X_1,\ldots,\bar X_p)'$$ is the zero vector. More importantly, it also means that $$\beta_0$$ and $$\hat\beta_0$$ are null, hence the intercept is not included in the model. That is, the model

\begin{align} Y = \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon \tag{4.13} \end{align}

is considered. In this setting, $$\boldsymbol\beta=(\beta_1,\ldots,\beta_p)'$$[^128] and $$\hat{\boldsymbol\beta}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$ is the least squares estimator, with the design matrix $$\mathbf{X}$$ now omitting the first column of ones.

Now, the estimator of $$\boldsymbol\beta$$ in (4.13) from a sample $$\{(\mathbf{X}_i,Y_i)\}_{i=1}^n$$ under the linear constraint $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}$$ is defined as

\begin{align} \hat{\boldsymbol{\beta}}_{\mathbf{A}}:=\arg\min_{\substack{\boldsymbol{\beta}\in\mathbb{R}^{p}\\\mathbf{A}\boldsymbol{\beta}=\mathbf{c}}}\text{RSS}_0(\boldsymbol{\beta}),\quad \text{RSS}_0(\boldsymbol{\beta}):=\sum_{i=1}^n(Y_i-\beta_1X_{i1}-\cdots-\beta_p X_{ip})^2.\tag{4.14} \end{align}

Solving (4.14) analytically is possible using Lagrange multipliers: differentiating the Lagrangian $$\mathcal{L}(\boldsymbol{\beta},\boldsymbol{\lambda})=\text{RSS}_0(\boldsymbol{\beta})+2\boldsymbol{\lambda}'(\mathbf{A}\boldsymbol{\beta}-\mathbf{c}),$$ with $$\boldsymbol\lambda\in\mathbb{R}^q$$ the vector of multipliers, and equating the derivatives to zero, the explicit solution to (4.14) can be seen to be

\begin{align} \hat{\boldsymbol\beta}_{\mathbf{A}}=\hat{\boldsymbol\beta}+(\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1} (\mathbf{c}-\mathbf{A}\hat{\boldsymbol\beta}).\tag{4.15} \end{align}

For the general case given in (4.12), in which neither the variables nor the sample are centered, the estimator of the slopes in (4.12) is unaltered and equals (4.15). The intercept is given by \begin{align*} \hat{\beta}_{\mathbf{A},0}=\bar Y-\bar{\mathbf{X}}'\hat{\boldsymbol{\beta}}_{\mathbf{A}}. \end{align*}

The next code illustrates how to fit a linear model with constraints in practice.

```r
# Simulate data
set.seed(123456)
n <- 50
p <- 3
x1 <- rnorm(n, mean = 1)
x2 <- rnorm(n, mean = 2)
x3 <- rnorm(n, mean = 3)
eps <- rnorm(n, sd = 0.5)
y <- 1 + 2 * x1 - 3 * x2 + x3 + eps

# Center the data and compute design matrix
x1Cen <- x1 - mean(x1)
x2Cen <- x2 - mean(x2)
x3Cen <- x3 - mean(x3)
yCen <- y - mean(y)
X <- cbind(x1Cen, x2Cen, x3Cen)

# Linear restrictions to impose:
#   beta_1 + beta_2 + beta_3 = 0
#   beta_2 = -3
# In this case q = 2. The restrictions are encoded as
A <- rbind(c(1, 1, 1),
           c(0, 1, 0))
c <- c(0, -3) # Masks base R's c(), but calls to c() still work

# Fit model without intercept
S <- solve(crossprod(X))
beta_hat <- S %*% t(X) %*% yCen
beta_hat
##             [,1]
## x1Cen  1.9873776
## x2Cen -3.1449015
## x3Cen  0.9828062

# Restricted fit enforcing A * beta = c
beta_hat_A <- beta_hat +
  S %*% t(A) %*% solve(A %*% S %*% t(A)) %*% (c - A %*% beta_hat)
beta_hat_A
##             [,1]
## x1Cen  2.0154729
## x2Cen -3.0000000
## x3Cen  0.9845271

# Intercept of the constrained fit
beta_hat_A_0 <- mean(y) - c(mean(x1), mean(x2), mean(x3)) %*% beta_hat_A
beta_hat_A_0
##         [,1]
## [1,] 1.02824
```
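
As a quick sanity check, we can verify that the constrained estimate satisfies the restrictions and that the slopes of the unconstrained fit coincide with those given by `lm()`:

```r
# The constrained estimate satisfies the restrictions (up to rounding)
A %*% beta_hat_A

# The slopes of lm() match beta_hat, since centering does not affect them
lm(y ~ x1 + x2 + x3)$coefficients
```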

What about inference? In principle, it can be derived analogously to the inference for the unconstrained linear model in Section 2.4, since the distribution of $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$ under the assumptions of the linear model is straightforward to obtain. We keep assuming that the model is centered. Then, recall that (4.15) can be expressed as

\begin{align*} \hat{\boldsymbol\beta}_{\mathbf{A}}=&\,(\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1}\mathbf{c}\\ &+\left(\mathbf{I}-(\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1} \mathbf{A}\right)\hat{\boldsymbol\beta}. \end{align*}

Therefore, using (1.4) and proceeding similarly to (2.11),

\begin{align} \hat{\boldsymbol\beta}_{\mathbf{A}}\sim\mathcal{N}_p\left(\boldsymbol{\beta}+b(\boldsymbol{\beta}, \mathbf{A}, \mathbf{c}, \mathbf{X}),\sigma^2(\mathbf{X}'\mathbf{X})^{-1}-v(\sigma^2,\mathbf{A},\mathbf{X})\right),\tag{4.16} \end{align}

where

\begin{align*} b(\boldsymbol{\beta}, \mathbf{A}, \mathbf{c}, \mathbf{X})&:=(\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1}(\mathbf{c}- \mathbf{A}\boldsymbol{\beta}),\\ v(\sigma^2,\mathbf{A},\mathbf{X})&:=\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1}\mathbf{A}(\mathbf{X}'\mathbf{X})^{-1}. \end{align*}

Inference for constrained linear models is not built into base R. Therefore, we just give a couple of insights into (4.16) and do not pursue inference further. Note that:

- The variances of $$\hat{\beta}_{\mathbf{A},j},$$ $$j=1,\ldots, p,$$ decrease with respect to the variances of $$\hat{\beta}_j,$$ given by the diagonal elements of $$\sigma^2(\mathbf{X}'\mathbf{X})^{-1}.$$ This is perfectly coherent: after all, we are constraining the possible values that the estimator of $$\boldsymbol\beta$$ can take in order to accommodate $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}.$$ More importantly, these variances remain the same irrespective of whether $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}$$ holds or not[^129] (see the sketch after this list).

- The bias of $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$ depends on the veracity of $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}$$. If the restriction holds, then $$b(\boldsymbol{\beta}, \mathbf{A}, \mathbf{c}, \mathbf{X})=\mathbf{0}$$ and $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$ is still unbiased. However, if $$\mathbf{A}\boldsymbol{\beta}\neq\mathbf{c},$$ then $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$ is severely biased in estimating $$\boldsymbol\beta.$$
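
To make the first point concrete, the following small sketch evaluates the covariance reduction $$v(\sigma^2,\mathbf{A},\mathbf{X})$$ for the simulated data above, taking $$\sigma=0.5$$ as in the simulation and reusing the objects `S` and `A` from the previous code:

```r
# Covariance reduction induced by the constraints on the simulated data
sigma2 <- 0.5^2 # True error variance employed in the simulation
v <- sigma2 * S %*% t(A) %*% solve(A %*% S %*% t(A)) %*% A %*% S

# Variances of the unconstrained vs. constrained estimators. The second
# constrained variance is zero: beta_2 is fixed to -3 by the restriction
diag(sigma2 * S)
diag(sigma2 * S - v)
```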

Verify by Monte Carlo that the covariance matrix in (4.16) is correct. To do so:

1. Choose $$\boldsymbol\beta,$$ $$\mathbf{A},$$ and $$\mathbf{c}$$ at your convenience.
2. Sample $$n=50$$ observations for the predictors.
3. Sample $$n=50$$ observations for the responses from a linear model based on $$\boldsymbol\beta.$$ Use the same $$n$$ observations for the predictors from Step 2.
4. Compute $$\hat{\boldsymbol\beta}_{\mathbf{A}}.$$
5. Repeat Steps 3–4 $$M=500$$ times, saving each time $$\hat{\boldsymbol\beta}_{\mathbf{A}}.$$
6. Compute the sample covariance matrix of the $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$’s.
7. Compare it with the covariance matrix in (4.16).
Do the same study for checking the expectation in (4.16), for the cases in which $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}$$ and $$\mathbf{A}\boldsymbol{\beta}\neq\mathbf{c}.$$
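
For reference, a minimal sketch of the covariance check follows, reusing `X`, `S`, `A`, `c`, `sigma2`, and `v` from the previous code chunks; note that the true $$\boldsymbol\beta=(2,-3,1)'$$ of the simulation satisfies $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}.$$

```r
# Monte Carlo check of the covariance matrix in (4.16)
set.seed(123456)
M <- 500
beta <- c(2, -3, 1) # True coefficients; they satisfy A %*% beta = c
beta_hat_A_mc <- matrix(0, nrow = M, ncol = p)
for (m in 1:M) {
  # New responses on the same (already centered) design matrix X; further
  # centering of yMc is unnecessary because X has zero column means
  yMc <- X %*% beta + rnorm(n, sd = 0.5)
  beta_hat_m <- S %*% t(X) %*% yMc # Unconstrained estimate
  beta_hat_A_mc[m, ] <- beta_hat_m + S %*% t(A) %*%
    solve(A %*% S %*% t(A)) %*% (c - A %*% beta_hat_m)
}

# Sample covariance vs. the theoretical covariance in (4.16)
cov(beta_hat_A_mc)
sigma2 * S - v

# Sample mean vs. beta: no bias, since A %*% beta = c holds
colMeans(beta_hat_A_mc)
```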

[^124]: For example, based on the data-driven penalization parameters $$\hat\lambda_{k\text{-CV}}$$ or $$\hat\lambda_{k\text{-1SE}}.$$

[^125]: Note that with this approach we assign to the more computationally efficient lasso the “hard work” of coming up with a set of relevant predictors from the whole dataset, whereas the refinement of that model is done with the more demanding stepwise regression (if the number of predictors is smaller than $$n$$).

[^126]: Note that $$q$$ is a random quantity, but we ignore this fact for the sake of exposition.

[^127]: Therefore, usually not invertible.

[^128]: We do not require the previous notation $$\boldsymbol{\beta}_{-1}$$!

[^129]: They do not depend on $$\mathbf{c}$$!