## 4.2 Constrained linear models

As outlined in the previous section, after doing variable selection with the lasso[^124], two possibilities are: (i) fit a linear model on the lasso-selected predictors; (ii) run a stepwise selection starting from the lasso-selected model to try to further improve it[^125].

Let’s explore the intuitive idea behind (i) in more detail. For the sake of exposition, assume that, among the $$p$$ predictors, lasso zeroed out the first $$q$$ of them[^126]. Then, once $$q$$ is known, we would seek to fit the model

\begin{align*} Y=\beta_0+\beta_1X_1+\cdots+\beta_pX_p+\varepsilon,\quad \text{subject to} \quad\beta_1=\ldots=\beta_q=0. \end{align*}

This is a very simple constraint that we know how to handle: just include the $$p-q$$ remaining predictors in the model and fit it. It is, however, a particular case of a linear constraint on $$\boldsymbol\beta,$$ since $$\beta_1=\ldots=\beta_q=0$$ is expressible as

\begin{align} \begin{pmatrix} \mathbf{I}_q & \mathbf{0}_{q\times (p-q)} \end{pmatrix}_{q\times p}\boldsymbol\beta_{-1}=\mathbf{0}_q,\tag{4.11} \end{align}

where $$\mathbf{I}_q$$ is the $$q\times q$$ identity matrix and $$\boldsymbol\beta_{-1}=(\beta_1,\ldots,\beta_p)'.$$ The constraint in (4.11) can be generalized as $$\mathbf{A}\boldsymbol{\beta}_{-1}=\mathbf{c},$$ which results in the (linearly) constrained linear model

\begin{align} Y=\beta_0+\beta_1X_1+\cdots+\beta_pX_p+\varepsilon,\quad \text{subject to}\quad \mathbf{A}\boldsymbol{\beta}_{-1}=\mathbf{c},\tag{4.12} \end{align}

where $$\mathbf{A}$$ is a $$q\times p$$ matrix[^127] of rank $$q$$ and $$\mathbf{c}\in\mathbb{R}^q.$$ The constrained linear model (4.12) is useful when prior information is available about a linear relation that the coefficients of the linear model must satisfy (e.g., in piecewise polynomial fitting).
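
For instance, the constraint matrix in (4.11) can be built explicitly in R. The following is a minimal sketch with illustrative dimensions $$p=5$$ and $$q=2$$ (both chosen arbitrarily for this example):

```r
# Encode beta_1 = ... = beta_q = 0 as A %*% beta = 0, for p = 5 and q = 2
p <- 5
q <- 2
A <- cbind(diag(q), matrix(0, nrow = q, ncol = p - q)) # q x p, rank q
A
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
```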

Before fitting the model (4.12), let’s assume from now on that the variables $$Y$$ and $$X_1,\ldots,X_p,$$ as well as the sample $$\{(\mathbf{X}_i,Y_i)\}_{i=1}^n,$$ are centered (see the tip at the end of Section 2.4.4). This means that $$\bar Y=0$$ and that $$\bar{\mathbf{X}}:=(\bar X_1,\ldots,\bar X_p)'$$ is the zero vector. More importantly, it also means that $$\beta_0$$ and $$\hat\beta_0$$ are null, hence the intercept is not included in the model. That is, the model

\begin{align} Y = \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon \tag{4.13} \end{align}

is considered. In this setting, $$\boldsymbol\beta=(\beta_1,\ldots,\beta_p)'$$[^128] and $$\hat{\boldsymbol\beta}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$ is the least squares estimator, with the design matrix $$\mathbf{X}$$ now omitting the first column of ones.

Now, the estimator of $$\boldsymbol\beta$$ in (4.13) from a sample $$\{(\mathbf{X}_i,Y_i)\}_{i=1}^n$$ under the linear constraint $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}$$ is defined as

\begin{align} \hat{\boldsymbol{\beta}}_{\mathbf{A}}:=\arg\min_{\substack{\boldsymbol{\beta}\in\mathbb{R}^{p}\\\mathbf{A}\boldsymbol{\beta}=\mathbf{c}}}\text{RSS}_0(\boldsymbol{\beta}),\quad \text{RSS}_0(\boldsymbol{\beta}):=\sum_{i=1}^n(Y_i-\beta_1X_{i1}-\cdots-\beta_p X_{ip})^2.\tag{4.14} \end{align}

Solving (4.14) analytically is possible using Lagrange multipliers: differentiating the Lagrangian $$\mathcal{L}(\boldsymbol{\beta},\boldsymbol{\lambda})=\text{RSS}_0(\boldsymbol{\beta})+2\boldsymbol{\lambda}'(\mathbf{A}\boldsymbol{\beta}-\mathbf{c}),$$ with $$\boldsymbol\lambda\in\mathbb{R}^q$$ the vector of multipliers, and equating the derivatives to zero, the explicit solution to (4.14) can be seen to be

\begin{align} \hat{\boldsymbol\beta}_{\mathbf{A}}=\hat{\boldsymbol\beta}+(\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1} (\mathbf{c}-\mathbf{A}\hat{\boldsymbol\beta}).\tag{4.15} \end{align}

For the general case given in (4.12), in which neither the variables nor the sample are centered, the estimator of the slopes in (4.12) is unaltered and equals (4.15). The intercept is given by \begin{align*} \hat{\beta}_{\mathbf{A},0}=\bar Y-\bar{\mathbf{X}}'\hat{\boldsymbol{\beta}}_{\mathbf{A}}. \end{align*}

The next code illustrates how to fit a linear model with constraints in practice.

```r
# Simulate data
set.seed(123456)
n <- 50
p <- 3
x1 <- rnorm(n, mean = 1)
x2 <- rnorm(n, mean = 2)
x3 <- rnorm(n, mean = 3)
eps <- rnorm(n, sd = 0.5)
y <- 1 + 2 * x1 - 3 * x2 + x3 + eps

# Center the data and compute design matrix
x1Cen <- x1 - mean(x1)
x2Cen <- x2 - mean(x2)
x3Cen <- x3 - mean(x3)
yCen <- y - mean(y)
X <- cbind(x1Cen, x2Cen, x3Cen)

# Linear restrictions to impose:
#   beta_1 + beta_2 + beta_3 = 0
#   beta_2 = -3
# In this case q = 2. The restrictions are encoded as
A <- rbind(c(1, 1, 1),
           c(0, 1, 0))
c <- c(0, -3) # Masks base R's c(), but calls to c() still work

# Fit model without intercept
S <- solve(crossprod(X))
beta_hat <- S %*% t(X) %*% yCen
beta_hat
##             [,1]
## x1Cen  1.9873776
## x2Cen -3.1449015
## x3Cen  0.9828062

# Restricted fit enforcing A * beta = c
beta_hat_A <- beta_hat +
  S %*% t(A) %*% solve(A %*% S %*% t(A)) %*% (c - A %*% beta_hat)
beta_hat_A
##             [,1]
## x1Cen  2.0154729
## x2Cen -3.0000000
## x3Cen  0.9845271

# Intercept of the constrained fit
beta_hat_A_0 <- mean(y) - c(mean(x1), mean(x2), mean(x3)) %*% beta_hat_A
beta_hat_A_0
##         [,1]
## [1,] 1.02824
```
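
As a quick sanity check, we can verify that the constrained estimate satisfies the restrictions and that the slopes of the unconstrained fit coincide with those given by `lm()`:

```r
# The constrained estimate satisfies the restrictions (up to rounding)
A %*% beta_hat_A

# The slopes of lm() match beta_hat, since centering does not affect them
lm(y ~ x1 + x2 + x3)$coefficients
```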

What about inference? In principle, it can be derived analogously to the inference for the unconstrained linear model in Section 2.4, since the distribution of $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$ under the assumptions of the linear model is straightforward to obtain. We keep assuming that the model is centered. Then, recall that (4.15) can be expressed as

\begin{align*} \hat{\boldsymbol\beta}_{\mathbf{A}}=&\,(\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1}\mathbf{c}\\ &+\left(\mathbf{I}-(\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1} \mathbf{A}\right)\hat{\boldsymbol\beta}. \end{align*}

Therefore, using (1.4) and proceeding similarly to (2.11),

\begin{align} \hat{\boldsymbol\beta}_{\mathbf{A}}\sim\mathcal{N}_p\left(\boldsymbol{\beta}+b(\boldsymbol{\beta}, \mathbf{A}, \mathbf{c}, \mathbf{X}),\sigma^2(\mathbf{X}'\mathbf{X})^{-1}-v(\sigma^2,\mathbf{A},\mathbf{X})\right),\tag{4.16} \end{align}

where

\begin{align*} b(\boldsymbol{\beta}, \mathbf{A}, \mathbf{c}, \mathbf{X})&:=(\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1}(\mathbf{c}- \mathbf{A}\boldsymbol{\beta}),\\ v(\sigma^2,\mathbf{A},\mathbf{X})&:=\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{A}'[\mathbf{A} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{A}']^{-1}\mathbf{A}(\mathbf{X}'\mathbf{X})^{-1}. \end{align*}

Inference for constrained linear models is not built into base R. Therefore, we just give a couple of insights into (4.16) and do not pursue inference further. Note that:

- The variances of $$\hat{\beta}_{\mathbf{A},j},$$ $$j=1,\ldots, p,$$ decrease with respect to the variances of $$\hat{\beta}_j,$$ given by the diagonal elements of $$\sigma^2(\mathbf{X}'\mathbf{X})^{-1}.$$ This is perfectly coherent: after all, we are constraining the possible values that the estimator of $$\boldsymbol\beta$$ can take in order to accommodate $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}.$$ More importantly, these variances remain the same irrespective of whether $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}$$ holds or not[^129] (see the sketch after this list).

- The bias of $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$ depends on the veracity of $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}$$. If the restriction holds, then $$b(\boldsymbol{\beta}, \mathbf{A}, \mathbf{c}, \mathbf{X})=\mathbf{0}$$ and $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$ is still unbiased. However, if $$\mathbf{A}\boldsymbol{\beta}\neq\mathbf{c},$$ then $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$ is severely biased in estimating $$\boldsymbol\beta.$$
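
To make the first point concrete, the following small sketch evaluates the covariance reduction $$v(\sigma^2,\mathbf{A},\mathbf{X})$$ for the simulated data above, taking $$\sigma=0.5$$ as in the simulation and reusing the objects `S` and `A` from the previous code:

```r
# Covariance reduction induced by the constraints on the simulated data
sigma2 <- 0.5^2 # True error variance employed in the simulation
v <- sigma2 * S %*% t(A) %*% solve(A %*% S %*% t(A)) %*% A %*% S

# Variances of the unconstrained vs. constrained estimators. The second
# constrained variance is zero: beta_2 is fixed to -3 by the restriction
diag(sigma2 * S)
diag(sigma2 * S - v)
```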

Verify by Monte Carlo that the covariance matrix in (4.16) is correct. To do so:

1. Choose $$\boldsymbol\beta,$$ $$\mathbf{A},$$ and $$\mathbf{c}$$ at your convenience.
2. Sample $$n=50$$ observations for the predictors.
3. Sample $$n=50$$ observations for the responses from a linear model based on $$\boldsymbol\beta.$$ Use the same $$n$$ observations for the predictors from Step 2.
4. Compute $$\hat{\boldsymbol\beta}_{\mathbf{A}}.$$
5. Repeat Steps 3–4 $$M=500$$ times, saving each time $$\hat{\boldsymbol\beta}_{\mathbf{A}}.$$
6. Compute the sample covariance matrix of the $$\hat{\boldsymbol\beta}_{\mathbf{A}}$$’s.
7. Compare it with the covariance matrix in (4.16).
Do the same study for checking the expectation in (4.16), for the cases in which $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}$$ and $$\mathbf{A}\boldsymbol{\beta}\neq\mathbf{c}.$$
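
For reference, a minimal sketch of the covariance check follows, reusing `X`, `S`, `A`, `c`, `sigma2`, and `v` from the previous code chunks; note that the true $$\boldsymbol\beta=(2,-3,1)'$$ of the simulation satisfies $$\mathbf{A}\boldsymbol{\beta}=\mathbf{c}.$$

```r
# Monte Carlo check of the covariance matrix in (4.16)
set.seed(123456)
M <- 500
beta <- c(2, -3, 1) # True coefficients; they satisfy A %*% beta = c
beta_hat_A_mc <- matrix(0, nrow = M, ncol = p)
for (m in 1:M) {
  # New responses on the same (already centered) design matrix X; further
  # centering of yMc is unnecessary because X has zero column means
  yMc <- X %*% beta + rnorm(n, sd = 0.5)
  beta_hat_m <- S %*% t(X) %*% yMc # Unconstrained estimate
  beta_hat_A_mc[m, ] <- beta_hat_m + S %*% t(A) %*%
    solve(A %*% S %*% t(A)) %*% (c - A %*% beta_hat_m)
}

# Sample covariance vs. the theoretical covariance in (4.16)
cov(beta_hat_A_mc)
sigma2 * S - v

# Sample mean vs. beta: no bias, since A %*% beta = c holds
colMeans(beta_hat_A_mc)
```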

[^124]: For example, based on the data-driven penalization parameters $$\hat\lambda_{k\text{-CV}}$$ or $$\hat\lambda_{k\text{-1SE}}.$$

[^125]: Note that with this approach we assign to the more computationally efficient lasso the “hard work” of coming up with a set of relevant predictors from the whole dataset, whereas the refinement of that model is done with the more demanding stepwise regression (if the number of predictors is smaller than $$n$$).

[^126]: Note that $$q$$ is a random quantity, but we ignore this fact for the sake of exposition.

[^127]: Therefore, usually not invertible.

[^128]: We do not require the previous notation $$\boldsymbol{\beta}_{-1}$$!

[^129]: They do not depend on $$\mathbf{c}$$!