Chapter 9 Variable Selection
It has been shown that the lasso can help drop some variables. To reduce variance, the lasso shrinks the least squares estimates towards zero; this technique is called shrinkage. PCR is a dimension reduction method that projects the original predictors onto a lower-dimensional space. This chapter introduces more approaches for systematic variable selection.
9.1 Model Evaluation Criteria
The coefficient of determination $R^2$ is a basic measure of model performance. It is known that adding more predictors always increases $R^2$, so subset regression stops adding new variables when the increase in $R^2$ is no longer significant.
The improvement offered by $R^2_{adj}$ is that it is not a monotonically increasing function of the number of predictors, so one can select the model at which its curve reaches a maximum. Maximizing $R^2_{adj}$ is equivalent to minimizing the residual mean square $MSE$.
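For reference, with $n$ observations and a fitted model containing $p$ parameters (including the intercept), the usual definition is

$$R^2_{adj} = 1 - \frac{SSE/(n-p)}{SST/(n-1)} = 1 - \frac{MSE}{SST/(n-1)},$$

and since $SST/(n-1)$ does not depend on the chosen model, maximizing $R^2_{adj}$ indeed amounts to minimizing $MSE = SSE/(n-p)$.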
When prediction of the mean response is of primary interest, $R^2_{pred}$, based on the prediction error sum of squares (PRESS) statistic, is preferred. PRESS is also useful for choosing between two competing models.
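Concretely, PRESS leaves out one observation at a time; a standard identity for least squares lets it be computed without refitting, using the residuals $e_i$ and the hat-matrix diagonals $h_{ii}$:

$$PRESS = \sum_{i=1}^{n}\left(y_i - \hat y_{(i)}\right)^2 = \sum_{i=1}^{n}\left(\frac{e_i}{1-h_{ii}}\right)^2, \qquad R^2_{pred} = 1 - \frac{PRESS}{SST}.$$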
9.1.1 Mallows' $C_p$
Besides the above criteria, Mallows' $C_p$ statistic is an important criterion related to the mean squared error. Suppose the fitted subset model has $p$ parameters and fitted values $\hat y_i$. Its residual sum of squares $SSE(p)$ contains two components: $SSE$, the error sum of squares under the 'true' model, and the sum of squared bias $SSE_B(p) = \sum_{i=1}^{n}\left(E[y_i] - E[\hat y_i]\right)^2 = SSE(p) - SSE$. Then Mallows' $C_p$ is
$$
\begin{aligned}
C_p &= \frac{1}{\hat\sigma^2}\left(SSE_B(p) + \sum_{i=1}^{n}\operatorname{Var}[\hat y_i]\right)
     = \frac{1}{\hat\sigma^2}\left(SSE(p) - SSE + \sum_{i=1}^{n}\operatorname{Var}[\hat y_i]\right) \\
    &= \frac{1}{\hat\sigma^2}\left(SSE(p) - (n-p)\hat\sigma^2 + p\hat\sigma^2\right)
     = \frac{SSE(p)}{\hat\sigma^2} - n + 2p
\end{aligned}
$$
If the assumed subset model is unbiased, $SSE_B(p) = 0$ and

$$E[C_p \mid \text{Bias} = 0] = \frac{(n-p)\sigma^2}{\sigma^2} - (n - 2p) = p.$$

Hence a plot of $C_p$ versus $p$ helps identify the best model among many candidates. A proper model should have $C_p \approx p$, and smaller $C_p$ is preferred. $C_p$ often increases while $SSE(p)$ decreases as predictors are added, so personal judgment is needed to choose the best tradeoff between a small $C_p$ and a small $SSE(p)$.
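As a minimal illustration (not code from the text), the following numpy-only sketch computes $C_p$ for a subset model, assuming, as is common, that $\hat\sigma^2$ is estimated by the residual mean square of the full model containing all candidate predictors; function and variable names are illustrative.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

def mallows_cp(X_full, X_sub, y):
    """Mallows' C_p = SSE(p)/sigma2_hat - n + 2p for a subset model.

    Both design matrices are assumed to already include an intercept
    column; p is the number of columns (parameters) of the subset model.
    """
    n = len(y)
    sigma2_hat = sse(X_full, y) / (n - X_full.shape[1])  # full-model MSE
    p = X_sub.shape[1]
    return sse(X_sub, y) / sigma2_hat - n + 2 * p
```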
9.1.2 Akaike/Bayesian Information Criterion (AIC/BIC)
The Akaike Information Criterion (AIC) is a penalized goodness-of-fit measure derived from information (entropy) arguments. AIC shares a characteristic with $C_p$: it decreases only when an added term improves the fit enough to offset the penalty, so one can judge when to stop adding new terms.
$$AIC = n\ln\!\left(\frac{SSE}{n}\right) + 2p$$
The Bayesian information criterion (BIC) is an extension of AIC. Schwarz (1978) proposed a version of BIC that penalizes added predictors more heavily when the sample size is large.
$$BIC = n\ln\!\left(\frac{SSE}{n}\right) + p\ln(n)$$
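A small sketch computing these two criteria exactly as defined above (the Gaussian-error forms, with additive constants dropped); names are illustrative:

```python
import numpy as np

def aic_bic(sse, n, p):
    """AIC and BIC in the forms given above (constants dropped).

    sse: residual sum of squares of the fitted model
    n:   number of observations
    p:   number of parameters in the model
    """
    aic = n * np.log(sse / n) + 2 * p
    bic = n * np.log(sse / n) + p * np.log(n)
    return aic, bic
```

When comparing candidate models fitted to the same data, the model with the smaller AIC (or BIC) is preferred.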
9.2 Selection Procedures
9.2.1 All Possible Regressions
Suppose the data have $p$ candidate predictors. There are $2^p$ possible models; for example, 10 candidate predictors yield $2^{10} = 1024$ models. One can then select the best model based on the criteria above. For high-dimensional data, fitting all possible regressions is computationally intensive, so in practice people often choose more efficient procedures.
9.2.2 Best Subset Selection
Given a number of selected variables $k \le p$, there are $\binom{p}{k}$ possible combinations. Fit all $\binom{p}{k}$ models with $k$ predictors and denote the best of them (smallest $SSE$, or equivalently largest $R^2$) by $M_k$. Doing this for each $k = 1, 2, \dots, p$, together with the null model $M_0$, gives models $M_0, M_1, \dots, M_p$, and the final winner can be identified by comparing PRESS, $C_p$, or the other criteria above.
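A brute-force sketch of this procedure (numpy plus itertools; illustrative names, with an intercept added to every fit):

```python
from itertools import combinations
import numpy as np

def best_subsets(X, y, names):
    """For each subset size k, keep the predictor set with the smallest SSE.

    X: n x p matrix of candidate predictors (no intercept column).
    Returns {k: (sse, [variable names])}; the winners M_1,...,M_p can
    then be compared with PRESS, C_p, AIC, or BIC.
    """
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        for idx in combinations(range(p), k):
            Xk = np.column_stack([np.ones(n), X[:, list(idx)]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            resid = y - Xk @ beta
            sse = resid @ resid
            if k not in best or sse < best[k][0]:
                best[k] = (sse, [names[i] for i in idx])
    return best
```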
9.2.3 Stepwise Regression
- Forward Selection
Forward selection starts from the null model containing only an intercept. In the first step, the variable with the greatest simple correlation with the response is added to the model. If this new variable $x_1$ has a large $F$ statistic and shows a significant effect on the response, the second step computes partial correlations between two sets of residuals: one from the newly fitted model $\hat y = \hat\beta_0 + \hat\beta_1 x_1$, and one from regressing each remaining candidate on $x_1$, that is $\hat x_j = \hat\alpha_{0j} + \hat\alpha_{1j}x_1$, $j = 2, 3, \dots, p-1$. The variable with the largest partial correlation with $y$ is then added to the model. These steps repeat until the partial $F$ statistic at a given step is small at the chosen significance level (a code sketch of this procedure appears at the end of this subsection).
- Backward Elimination
Backward elimination starts from the full model with all candidate predictors. Given a preselected value $F_0$, each round removes the variable with the smallest partial $F$ statistic and refits the model with the remaining predictors. This is repeated, dropping one variable per round, until every remaining predictor has a partial $F_j > F_0$.
- Stepwise Regression
Stepwise regression combines forward selection and backward elimination. During the forward steps, if any previously added predictor has a partial $F_j < F_0$, it can be removed from the model by backward elimination.
It is common for some candidate predictors to be correlated. At the beginning, a predictor $x_1$ with the greatest simple correlation with the response is added to the model; however, after a subset of related predictors has been added, $x_1$ may become 'useless' in the model. In such cases, backward elimination is necessary to reach the best solution.
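The following numpy-only sketch illustrates the forward-selection loop using partial $F$ statistics (which rank candidates in the same order as the partial correlations described above); the threshold `f_in` and all names are illustrative assumptions, not values from the text:

```python
import numpy as np

def sse(cols, X, y):
    """SSE of the OLS fit of y on an intercept plus the chosen columns of X."""
    n = len(y)
    Xc = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    r = y - Xc @ beta
    return r @ r

def forward_select(X, y, f_in=4.0):
    """Add, at each step, the candidate with the largest partial F statistic;
    stop when the best candidate's partial F falls below f_in."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    while remaining:
        sse_cur = sse(selected, X, y)
        scores = []
        for j in remaining:
            sse_new = sse(selected + [j], X, y)
            df_res = n - (len(selected) + 2)      # intercept + current terms + new term
            f_j = (sse_cur - sse_new) / (sse_new / df_res)
            scores.append((f_j, j))
        f_best, j_best = max(scores)
        if f_best < f_in:
            break
        selected.append(j_best)
        remaining.remove(j_best)
    return selected
```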
9.3 Underfitting and Overfitting
Suppose the true model is $y = X\beta + \varepsilon = X_1\beta_1 + X_2\beta_2 + \varepsilon$, where $X$ is of full rank with $r(X) = r = r_1 + r_2$, $E[\varepsilon] = 0$, and $\operatorname{Cov}[\varepsilon] = \sigma^2 I_n$. The normal equations $X'X\beta = X'y$ can be rewritten as
$$
\begin{aligned}
X_1'X_1\beta_1^0 + X_1'X_2\beta_2^0 &= X_1'y \\
X_2'X_1\beta_1^0 + X_2'X_2\beta_2^0 &= X_2'y
\end{aligned}
$$
Let $P_i = X_i(X_i'X_i)^{-}X_i'$, $i = 1, 2$, and
$$
\begin{aligned}
M_1 &= (X_1'X_1)^{-}X_1'X_2 \\
M_2 &= X_2'(I - P_1)X_2
\end{aligned}
$$
Then,
$$
\begin{aligned}
\beta_1^0 &= (X_1'X_1)^{-}X_1'(y - X_2\beta_2^0) \\
\beta_2^0 &= \left[X_2'(I-P_1)X_2\right]^{-}X_2'(I-P_1)y = M_2^{-}X_2'(I-P_1)y \\
\hat\sigma^2 &= \frac{1}{n-r}\,(y - X_1\beta_1^0 - X_2\beta_2^0)'(y - X_1\beta_1^0 - X_2\beta_2^0)
\end{aligned}
$$
9.3.1 Underfitting
In practice, due to data limitations or other reasons, one may use only a subset of the true predictors to fit the model. If the fitted model $y = X_1\beta_1 + \varepsilon$ omits $X_2\beta_2$, the least squares solutions are
$$
\begin{aligned}
\beta_{1,H}^0 &= (X_1'X_1)^{-}X_1'y \\
\hat\sigma_{1,H}^2 &= \frac{1}{n-r_1}\,y'(I-P_1)y
\end{aligned}
$$
It is clear that $\beta_{1,H}^0$ and $\hat\sigma_{1,H}^2$ are biased estimates because
$$E[\beta_{1,H}^0] = (X_1'X_1)^{-}X_1'X_1\beta_1 + (X_1'X_1)^{-}X_1'X_2\beta_2 = H\beta_1 + M_1\beta_2, \qquad \text{where } H = (X_1'X_1)^{-}X_1'X_1,$$
and
$$E[\hat\sigma_{1,H}^2] = \sigma^2 + \frac{1}{n-r_1}\beta_2'X_2'(I-P_1)X_2\beta_2 = \sigma^2 + \frac{1}{n-r_1}\beta_2'M_2\beta_2.$$
Thus $E[\beta_{1,H}^0] = \beta_1$ only when $\beta_2 = 0$ or $M_1 = 0$; the latter condition means $X_1 \perp X_2$, i.e. $X_1'X_2 = 0$. The bias of $\hat\sigma_{1,H}^2$ vanishes when $\beta_2 = 0$.
Since $\hat{Y}_{0,H} = X_{0,1}\beta_{1,H}^0$, the prediction $\hat{Y}_{0,H}$ is also biased unless $\beta_2 = 0$ or $X_1$ is orthogonal to $X_2$.
Denote by $MSE_H$ the mean squared error matrix of $\beta_{1,H}^0$ from the underfitted model and by $MSE$ that of $\beta_1^0$ from the full model, where in general $MSE = \operatorname{Cov}[\hat\beta] + \text{Bias}\cdot\text{Bias}'$. Then
$$
\begin{aligned}
MSE_H &= \sigma^2(X_1'X_1)^{-} + M_1\beta_2\beta_2'M_1' \\
MSE &= \sigma^2(X_1'X_1)^{-} + M_1\operatorname{Cov}[\beta_2^0]M_1'
\end{aligned}
$$
If $\operatorname{Cov}[\beta_2^0] - \beta_2\beta_2'$ is a positive semidefinite (p.s.d.) matrix, then $MSE \ge MSE_H$; that is, deleting $X_2$ can improve the precision of the estimates when $\beta_2$ is small relative to its sampling variability.
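The bias term $M_1\beta_2$ can be seen in a small simulation (an illustrative sketch, not from the text): with two correlated predictors and true coefficients $\beta_1 = 2$, $\beta_2 = 3$, the underfitted slope on $x_1$ averages roughly $\beta_1 + 0.7\beta_2$ rather than $\beta_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 2000
beta1, beta2 = 2.0, 3.0
slope_under, slope_full = [], []

for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)   # X2 correlated with X1
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

    # underfitted model: y regressed on x1 only (intercept included)
    Xu = np.column_stack([np.ones(n), x1])
    slope_under.append(np.linalg.lstsq(Xu, y, rcond=None)[0][1])

    # correctly specified model: x1 and x2
    Xf = np.column_stack([np.ones(n), x1, x2])
    slope_full.append(np.linalg.lstsq(Xf, y, rcond=None)[0][1])

print("average slope on x1, underfitted model:", np.mean(slope_under))  # about beta1 + 0.7*beta2 = 4.1
print("average slope on x1, full model       :", np.mean(slope_full))   # about beta1 = 2.0
```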
9.3.2 Overfitting
In contrast, one may fit a model with extra irrelevant factors: the true model is $y = X_1\beta_1 + \varepsilon$ but the fitted model is $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$.
This case corresponds to $\beta_2 = 0$ in the true model. Substituting $\beta_2 = 0$ into the expressions above shows that the estimates are unbiased:
$$
\begin{aligned}
E[\beta_{1,H}^0] &= H\beta_1 + M_1\beta_2 = H\beta_1 \\
E[\hat\sigma_{1,H}^2] &= \sigma^2 + \frac{1}{n-r_1}\beta_2'M_2\beta_2 = \sigma^2 \\
MSE_H &= \sigma^2(X_1'X_1)^{-} + M_1\beta_2\beta_2'M_1' = \sigma^2(X_1'X_1)^{-}
\end{aligned}
$$
An overfitted model fits the data too closely and may capture only random noise, or extra factors that are accidentally related to the response in this particular sample. Hence, overfitted models tend to produce false positive relationships and perform poorly in prediction.
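A small simulation sketch (illustrative assumptions: one true predictor, 20 pure-noise candidates, a modest training sample) shows the typical consequence: the overfitted model's coefficients are still unbiased, but its test-set prediction error is noticeably larger.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_test, reps, extra = 60, 1000, 500, 20
beta1 = 2.0
mse_true, mse_over = [], []

for _ in range(reps):
    # training data: one relevant predictor plus 20 irrelevant ones (beta2 = 0)
    x1 = rng.normal(size=n)
    Z = rng.normal(size=(n, extra))
    y = beta1 * x1 + rng.normal(size=n)

    # independent test data from the same true model
    x1_t = rng.normal(size=n_test)
    Z_t = rng.normal(size=(n_test, extra))
    y_t = beta1 * x1_t + rng.normal(size=n_test)

    X_true = np.column_stack([np.ones(n), x1])
    X_over = np.column_stack([np.ones(n), x1, Z])
    b_true = np.linalg.lstsq(X_true, y, rcond=None)[0]
    b_over = np.linalg.lstsq(X_over, y, rcond=None)[0]

    pred_true = np.column_stack([np.ones(n_test), x1_t]) @ b_true
    pred_over = np.column_stack([np.ones(n_test), x1_t, Z_t]) @ b_over
    mse_true.append(np.mean((y_t - pred_true) ** 2))
    mse_over.append(np.mean((y_t - pred_over) ** 2))

print("average test MSE, correctly specified model:", np.mean(mse_true))
print("average test MSE, overfitted model         :", np.mean(mse_over))
```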