Chapter 9 Variable Selection

Previous chapters have shown that the lasso can drop some variables from the model: to reduce variance, the lasso shrinks the least squares estimates towards zero, an approach called shrinkage. PCR is a dimension reduction method that projects the original predictors onto a lower-dimensional space. This chapter presents more approaches for systematic variable selection.

9.1 Model Evaluation Criteria

The coefficient of determination \(R^2\) is a basic measure of model performance. Adding more predictors never decreases \(R^2\), so a subset regression procedure stops adding new variables once the increase in \(R^2\) is no longer meaningful.

The advantage of \(R^2_{adj}\) over \(R^2\) is that it is not a monotonically increasing function of the number of predictors, so it attains a maximum at some subset size. Maximizing \(R^2_{adj}\) is equivalent to minimizing the residual mean square \(\mathrm{MSE}\).
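In terms of the residual and total sums of squares, for \(n\) observations and \(p\) parameters (including the intercept),

\[\begin{equation} R^2_{adj}=1-\frac{\mathrm{SSE}/(n-p)}{\mathrm{SST}/(n-1)}=1-\frac{n-1}{n-p}\left(1-R^2\right) \end{equation}\]

Since \(\mathrm{SST}/(n-1)\) does not depend on which subset is chosen, maximizing \(R^2_{adj}\) amounts to minimizing \(\mathrm{MSE}=\mathrm{SSE}/(n-p)\).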

When prediction of the mean response is the main interest, \(R^2_{pred}\), which is based on the prediction error sum of squares (PRESS) statistic, is preferred. PRESS is also useful for choosing between two competing models.
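PRESS is the sum of squared leave-one-out prediction errors; it can be computed from the ordinary residuals \(e_i\) and the leverages \(h_{ii}\) without refitting the model \(n\) times:

\[\begin{equation} \mathrm{PRESS}=\sum_{i=1}^n\left(y_i-\hat y_{(i)}\right)^2=\sum_{i=1}^n\left(\frac{e_i}{1-h_{ii}}\right)^2, \qquad R^2_{pred}=1-\frac{\mathrm{PRESS}}{\mathrm{SST}} \end{equation}\]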

9.1.1 Mallows \(C_p\)

Besides the above criteria, the Mallows \(C_p\) statistic is an important criterion related to the mean square error of the fitted values. Suppose the fitted subset model has \(p\) terms (including the intercept) and fitted values \(\hat y_i\). \(\mathrm{SSE}(p)\) is its residual sum of squares, which reflects both a bias and a variance component. Let \(\mathrm{SSE}\) denote the residual sum of squares of the ‘true’ full model; the sum of squared bias is then \(\mathrm{SSE}_B(p)=\sum_{i=1}^n(E[y_i]-E[\hat y_i])^2= \mathrm{SSE}(p) - \mathrm{SSE}\). Then Mallows \(C_p\) is

\[\begin{equation} \begin{split} C_p=&\frac{1}{\hat\sigma^2}( \mathrm{SSE}_B(p) + \sum_{i=1}^n\mathrm{Var}[\hat y_i] )\\ =&\frac{1}{\hat\sigma^2}( \mathrm{SSE}(p) - \mathrm{SSE} + \sum_{i=1}^n\mathrm{Var}[\hat y_i] )\\ =&\frac{1}{\hat\sigma^2}( \mathrm{SSE}(p) - (n-p)\hat\sigma^2 + p\hat\sigma^2 )\\ =&\frac{\mathrm{SSE}(p)}{\hat\sigma^2} - n + 2p \end{split} \end{equation}\]

If the supposed subset model is unbiased, \(\mathrm{SSE}_B(p)=0\) and \(E[C_p|\mathrm{Bias}=0] = \frac{(n-p)\sigma^2}{\sigma^2}-(n-2p)=p\). Hence, a plot of \(C_p\) versus \(p\) can help to find the best model among many candidate points: a proper model should have \(C_p\approx p\), and a smaller \(C_p\) is preferred. However, \(C_p\) often increases even as \(\mathrm{SSE}(p)\) decreases when predictors are added, so personal judgment is needed to choose the best tradeoff between a small \(C_p\) and a small \(\mathrm{SSE}(p)\).

9.1.2 Akaike/Bayesian Information Criterion (AIC/BIC)

The Akaike information criterion (AIC) is a penalized fit measure based on maximizing the expected entropy of the model. Like \(C_p\), AIC rewards a smaller \(\mathrm{SSE}\) but penalizes extra terms, so it indicates when the model should stop adding new terms. For a linear regression model it takes the form

\[\begin{equation} \mathrm{AIC}=n\ln\left(\frac1n \mathrm{SSE} \right)+ 2p \tag{9.1} \end{equation}\]

The Bayesian information criterion (BIC) is an extension of the same idea. Schwarz (1978) proposed a version of BIC that places a heavier penalty on adding predictors when the sample size \(n\) is large:

\[\begin{equation} \mathrm{BIC}=n\ln\left(\frac{1}{n} \mathrm{SSE} \right)+ p\ln(n) \tag{9.2} \end{equation}\]
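To illustrate how these criteria are computed in practice, here is a minimal numpy sketch (the function name `fit_criteria` and its arguments are illustrative, not from any particular package). It assumes the design matrix already contains an intercept column and that \(\hat\sigma^2\) for \(C_p\) comes from the full model:

```python
import numpy as np

def fit_criteria(X_sub, y, sigma2_full):
    """Compute SSE, Mallows Cp, AIC, and BIC for one candidate subset model.

    X_sub       : n x p design matrix of the subset model (intercept column included)
    y           : response vector of length n
    sigma2_full : estimate of sigma^2 from the full model (used in Cp)
    """
    n, p = X_sub.shape
    beta, *_ = np.linalg.lstsq(X_sub, y, rcond=None)   # least squares fit
    resid = y - X_sub @ beta
    sse = float(resid @ resid)                         # SSE(p)
    cp = sse / sigma2_full - n + 2 * p                 # Mallows Cp
    aic = n * np.log(sse / n) + 2 * p                  # Eq. (9.1)
    bic = n * np.log(sse / n) + p * np.log(n)          # Eq. (9.2)
    return {"p": p, "SSE": sse, "Cp": cp, "AIC": aic, "BIC": bic}
```

Candidate submodels can then be ranked by any of these criteria.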

9.2 Selection Procedures

9.2.1 All Possible Regressions

Suppose the data have \(p\) candidate predictors. There are \(2^p\) possible models; for example, with \(10\) candidate predictors one can fit \(2^{10}=1024\) models and then select the best one based on the above criteria. For high-dimensional data, fitting all possible regressions is computationally intensive, so in practice people often choose more efficient procedures.

9.2.2 Best Subset Selection

Given a number of selected variables \(k\le p\), there are \(p\choose k\) possible combinations. Fit all \(p\choose k\) models with \(k\) predictors and denote the best one, with the smallest \(\mathrm{SSE}\) or equivalently the largest \(R^2\), as \(M_k\). Repeating this for each \(k=0,1,\dots,p\) yields models \(M_0,M_1,\dots,M_p\), and the final winner can be identified by comparing PRESS, \(C_p\), AIC, or BIC across them, as sketched below.
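The sketch below (again with illustrative names, reusing the `fit_criteria` helper from Section 9.1.2) enumerates all \(2^p\) subsets, so it also covers the all-possible-regressions approach of Section 9.2.1; for each size \(k\) it keeps the smallest-\(\mathrm{SSE}\) model \(M_k\) and then picks the overall winner by BIC, though PRESS or \(C_p\) could be used instead:

```python
from itertools import combinations
import numpy as np

def best_subset(X, y):
    """Best subset selection: for each size k keep the smallest-SSE model M_k,
    then pick the overall winner by BIC (PRESS or Cp could be used instead)."""
    n, p = X.shape
    ones = np.ones((n, 1))                                      # intercept column
    X_full = np.hstack([ones, X])
    beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
    resid_full = y - X_full @ beta_full
    sigma2_full = float(resid_full @ resid_full) / (n - p - 1)  # sigma^2 from full model

    best = {}                                                   # k -> (subset, criteria)
    for k in range(p + 1):
        for idx in combinations(range(p), k):                   # all size-k subsets
            X_sub = np.hstack([ones, X[:, list(idx)]]) if k else ones
            res = fit_criteria(X_sub, y, sigma2_full)
            if k not in best or res["SSE"] < best[k][1]["SSE"]:
                best[k] = (idx, res)                            # current M_k
    # compare M_0, ..., M_p and return the subset with the smallest BIC
    return min(best.values(), key=lambda item: item[1]["BIC"])
```

For the \(p=10\) example above, this loop evaluates all \(1024\) candidate models.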

9.2.3 Stepwise Regression

  • Forward Selection

Forward selection starts from the null model containing only the intercept. In the first step, the variable with the greatest simple correlation with the response is added to the model. If this variable, say \(x_1\), gives a large \(F\) statistic and shows a significant effect on the response, the second step computes the partial correlations between two sets of residuals: one set comes from the newly fitted model \(\hat y=\beta_0+\beta_1x_1\), and the other from regressing each remaining candidate on \(x_1\), that is \(\hat x_j=\alpha_{0j}+\alpha_{1j}x_1\) for the other candidates \(x_j\). The variable with the largest partial correlation with \(y\) is then added to the model. These steps are repeated until the partial \(F\) statistic of the best remaining candidate is small at a given significance level.
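Below is a minimal sketch of forward selection driven by the partial \(F\) statistic (function names and the default level \(\alpha=0.05\) are illustrative; numpy and scipy are assumed available). Choosing the candidate with the largest partial \(F\) is equivalent to choosing the one with the largest partial correlation with \(y\), so this matches the procedure described above:

```python
import numpy as np
from scipy.stats import f as f_dist   # F distribution for the entry threshold

def sse(X, y):
    """Residual sum of squares from the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def forward_selection(X, y, alpha=0.05):
    """At each step add the candidate with the largest partial F statistic;
    stop when that F is not significant at level alpha."""
    n, p = X.shape
    ones = np.ones((n, 1))
    selected, remaining = [], list(range(p))
    while remaining:
        current = np.hstack([ones] + [X[:, [j]] for j in selected])
        sse_cur = sse(current, y)
        df = n - (len(selected) + 2)               # residual df after adding one term
        scores = {}
        for j in remaining:                        # partial F of each candidate
            sse_j = sse(np.hstack([current, X[:, [j]]]), y)
            scores[j] = (sse_cur - sse_j) / (sse_j / df)
        best = max(scores, key=scores.get)
        if scores[best] < f_dist.ppf(1 - alpha, 1, df):
            break                                  # no significant candidate: stop
        selected.append(best)
        remaining.remove(best)
    return selected
```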

  • Backward Elimination

Backward elimination starts from the full model with all candidate predictors. Given a preselected cutoff value \(F_0\), each round removes the variable with the smallest partial \(F\) statistic and refits the model with the remaining predictors. The rounds continue, dropping one variable at a time, until every remaining predictor has a partial \(F_j>F_0\).
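A matching sketch of backward elimination, reusing the `sse` helper above; the stay threshold `F0` is an arbitrary illustrative choice:

```python
def backward_elimination(X, y, F0=4.0):
    """Drop, in each round, the predictor with the smallest partial F,
    until every remaining predictor has partial F greater than F0."""
    n, p = X.shape
    ones = np.ones((n, 1))
    selected = list(range(p))
    while selected:
        current = np.hstack([ones] + [X[:, [j]] for j in selected])
        sse_cur = sse(current, y)
        df = n - (len(selected) + 1)               # residual df of the current model
        scores = {}
        for j in selected:                         # partial F for deleting each term
            reduced = np.hstack([ones] + [X[:, [k]] for k in selected if k != j])
            scores[j] = (sse(reduced, y) - sse_cur) / (sse_cur / df)
        worst = min(scores, key=scores.get)
        if scores[worst] > F0:
            break                                  # all remaining predictors stay
        selected.remove(worst)
    return selected
```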

  • Stepwise Regression

Stepwise regression combines forward selection and backward elimination. During the forward steps, if any previously added predictor has a partial \(F_j<F_0\), it can also be removed from the model by a backward elimination step.

It is common that some candidate predictors are correlated. At the beginning, a predictor \(x_1\) with the greatest simple correlation with the response is added to the model. However, after a subset of related predictors has been added, \(x_1\) may become ‘useless’ in the model. In this case, backward elimination is necessary to achieve the best solution.

9.3 Underfitting and Overfitting

Suppose the true model is \(\mathbf{y}=\mathbf{X}\boldsymbol\beta +\boldsymbol\varepsilon=\mathbf{X}_1\boldsymbol\beta_1 + \mathbf{X}_2\boldsymbol\beta_2 + \boldsymbol\varepsilon\), where \(\mathbf{X}\) has full rank \(r(\mathbf{X})=r =r_1+r_2\), \(E[\boldsymbol\varepsilon]=0\), and \(Cov[\boldsymbol\varepsilon]= \sigma^2\mathbf{I}_n\). The normal equations \(\mathbf{X'X}\boldsymbol\beta=\mathbf{X'y}\) can be rewritten as

\[\begin{equation} \begin{split} \mathbf{X}_1'\mathbf{X}_1\boldsymbol\beta^0_1+\mathbf{X}_1'\mathbf{X}_2\boldsymbol\beta^0_2&=\mathbf{X}_1'\mathbf{y}\\ \mathbf{X}_2'\mathbf{X}_1\boldsymbol\beta^0_1+\mathbf{X}_2'\mathbf{X}_2\boldsymbol\beta^0_2&=\mathbf{X}_2'\mathbf{y}\\ \end{split} \end{equation}\]

Let \(\mathbf{P}_i=\mathbf{X}_i(\mathbf{X}_i'\mathbf{X}_i)^{-}\mathbf{X}'_i\), \(i=1,2\), and

\[\begin{equation} \begin{split} \mathbf{M}_1=&(\mathbf{X}'_1\mathbf{X}_1)^{-}\mathbf{X}'_1\mathbf{X}_2\\ \mathbf{M}_2=&\mathbf{X}'_2(\mathbf{I}-\mathbf{P}_1)\mathbf{X}_2 \end{split} \end{equation}\]

Then,

\[\begin{equation} \begin{split} \boldsymbol\beta^0_1=&(\mathbf{X}'_1\mathbf{X}_1)^{-}\mathbf{X}'_1(\mathbf{y}-\mathbf{X}_2\boldsymbol\beta^0_2)\\ \boldsymbol\beta^0_2=&[\mathbf{X}'_2(\mathbf{I}-\mathbf{P}_1)\mathbf{X}_2]^{-}\mathbf{X}'_2(\mathbf{I}-\mathbf{P}_1)\mathbf{y}=\mathbf{M}^{-}_2\mathbf{X}'_2(\mathbf{I}-\mathbf{P}_1)\mathbf{y}\\ \hat\sigma^2=&\frac{1}{n-r}(\mathbf{y}-\mathbf{X}_1\boldsymbol\beta^0_1-\mathbf{X}_2\boldsymbol\beta^0_2)'(\mathbf{y}-\mathbf{X}_1\boldsymbol\beta^0_1-\mathbf{X}_2\boldsymbol\beta^0_2) \end{split} \end{equation}\]
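The expression for \(\boldsymbol\beta^0_2\) comes from solving the first normal equation for \(\boldsymbol\beta^0_1\) and substituting it into the second:

\[\begin{equation} \mathbf{X}'_2\mathbf{P}_1(\mathbf{y}-\mathbf{X}_2\boldsymbol\beta^0_2)+\mathbf{X}'_2\mathbf{X}_2\boldsymbol\beta^0_2=\mathbf{X}'_2\mathbf{y} \quad\Longrightarrow\quad \mathbf{X}'_2(\mathbf{I}-\mathbf{P}_1)\mathbf{X}_2\boldsymbol\beta^0_2=\mathbf{X}'_2(\mathbf{I}-\mathbf{P}_1)\mathbf{y} \end{equation}\]

which is exactly \(\mathbf{M}_2\boldsymbol\beta^0_2=\mathbf{X}'_2(\mathbf{I}-\mathbf{P}_1)\mathbf{y}\).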

9.3.1 Underfitting

In practice, due to data limitations or other reasons, one may use only a subset of the true predictors to fit the model. If the fitted model \(\mathbf{y}=\mathbf{X}_1\boldsymbol\beta_1 + \boldsymbol\varepsilon\) does not contain \(\mathbf{X}_2\) and \(\boldsymbol\beta_2\), the least squares solutions are

\[\begin{equation} \begin{split} \boldsymbol\beta^0_{1,H}=&(\mathbf{X}'_1\mathbf{X}_1)^{-}\mathbf{X}'_1\mathbf{y}\\ \hat\sigma^2_{1,H}=&\frac{1}{n-r_1}\mathbf{y}'(\mathbf{I}-\mathbf{P}_1)\mathbf{y} \end{split} \end{equation}\]

It is clear that \(\boldsymbol\beta^0_{1,H}\) and \(\hat\sigma^2_{1,H}\) are biased estimates because

\[\begin{equation} \begin{split} E[\boldsymbol\beta^0_{1,H}]=&(\mathbf{X}'_1\mathbf{X}_1)^{-}\mathbf{X}'_1\mathbf{X}_1\boldsymbol\beta_1+(\mathbf{X}'_1\mathbf{X}_1)^{-}\mathbf{X}'_1\mathbf{X}_2\boldsymbol\beta_2\\ =&\mathbf{H}\boldsymbol\beta_1+\mathbf{M}_1\boldsymbol\beta_2 \end{split} \end{equation}\]

and

\[\begin{equation} E[\hat\sigma^2_{1,H}]=\sigma^2 + \frac{1}{n-r_1}\boldsymbol\beta'_2\mathbf{X}'_2(\mathbf{I}-\mathbf{P}_1)\mathbf{X}_2\boldsymbol\beta_2 =\sigma^2 + \frac{1}{n-r_1}\boldsymbol\beta'_2\mathbf{M}_2\boldsymbol\beta_2 \end{equation}\]

\(E[\boldsymbol\beta^0_{1,H}]=\boldsymbol\beta_1\) only when \(\boldsymbol\beta_2=0\) or \(\mathbf{M}_1=0\); the latter means \(\mathbf{X}_1\perp\mathbf{X}_2\), that is \(\mathbf{X}'_1\mathbf{X}_2=0\). For the variance estimate, \(E[\hat\sigma^2_{1,H}]=\sigma^2\) requires \(\boldsymbol\beta_2=0\), because its bias term involves \(\mathbf{M}_2=\mathbf{X}'_2(\mathbf{I}-\mathbf{P}_1)\mathbf{X}_2\), which is not removed by orthogonality of \(\mathbf{X}_1\) and \(\mathbf{X}_2\).

Since predictions at new points \(\mathbf{X}_{0,1}\) are \(\mathbf{\hat Y}_{0,H}=\mathbf{X}_{0,1}\boldsymbol\beta^0_{1,H}\), \(\mathbf{\hat Y}_{0,H}\) is also biased unless \(\boldsymbol\beta_2=0\) or \(\mathbf{X}_1\) is orthogonal to \(\mathbf{X}_2\).

Denote by \(MSE_{H}\) the mean square error matrix of \(\boldsymbol\beta^0_{1,H}\) from the underfitted model and by \(MSE\) that of \(\boldsymbol\beta^0_1\) from the full model, where for any estimator \(MSE=\text{Var-cov}[\boldsymbol{\hat\beta}]+\text{Bias}\cdot\text{Bias}'\). Then

\[\begin{equation} \begin{split} MSE_{H}=&\sigma^2(\mathbf{X}'_1\mathbf{X}_1)^{-} + \mathbf{M}_1\boldsymbol\beta_2\boldsymbol\beta_2'\mathbf{M}'_1\\ MSE=&\sigma^2(\mathbf{X}'_1\mathbf{X}_1)^{-} + \mathbf{M}_1Cov[\boldsymbol\beta_2^0]\mathbf{M}'_1\\ \end{split} \end{equation}\]

If \(Cov[\boldsymbol\beta_2^0]-\boldsymbol\beta_2\boldsymbol\beta_2'\) is positive semidefinite (p.s.d.), then \(MSE\ge MSE_{H}\), so dropping \(\mathbf{X}_2\) can reduce the mean square error of the retained coefficient estimates, as the difference below shows.
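The comparison follows by subtracting the two matrices above:

\[\begin{equation} MSE-MSE_{H}=\mathbf{M}_1\left(Cov[\boldsymbol\beta_2^0]-\boldsymbol\beta_2\boldsymbol\beta_2'\right)\mathbf{M}'_1 \end{equation}\]

which is p.s.d. whenever the middle factor is p.s.d.; roughly speaking, this occurs when the deleted coefficients \(\boldsymbol\beta_2\) are small relative to the standard errors of their full-model estimates.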

9.3.2 Overfitting

In contrast, one may fit a model with extra irrelevant factors: the true model is \(\mathbf{y}=\mathbf{X}_1\boldsymbol\beta_1 + \boldsymbol\varepsilon\), but the fitted model is \(\mathbf{y}=\mathbf{X}_1\boldsymbol\beta_1 + \mathbf{X}_2\boldsymbol\beta_2 + \boldsymbol\varepsilon\).

This case corresponds to \(\boldsymbol\beta_2=0\) in the expressions above, so all of the above estimates become unbiased.

\[\begin{equation} \begin{split} E[\boldsymbol\beta^0_{1,H}]=&\mathbf{H}\boldsymbol\beta_1+\mathbf{M}_1\boldsymbol\beta_2=\mathbf{H}\boldsymbol\beta_1=\boldsymbol\beta_1\\ E[\hat\sigma^2_{1,H}]=&\sigma^2 + \frac{1}{n-r_1}\boldsymbol\beta'_2\mathbf{M}_2\boldsymbol\beta_2=\sigma^2\\ MSE_{H}=&\sigma^2(\mathbf{X}'_1\mathbf{X}_1)^{-} + \mathbf{M}_1\boldsymbol\beta_2\boldsymbol\beta_2'\mathbf{M}'_1=\sigma^2(\mathbf{X}'_1\mathbf{X}_1)^{-}\\ \end{split} \end{equation}\]

An overfitted model fits the data too closely and may capture random noise, or the extra factors may be only accidentally related to the response in this particular data set. Hence, overfitted models can report false positive relationships and tend to perform poorly in prediction.

References

Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics 6 (2): 461–64. https://www.jstor.org/stable/2958889.