Chapter 8 Multicollinearity

Multicollinearity, or near-linear dependence, refers to models with highly correlated predictors. When data are generated from an experimental design, the treatments in $X$ can be fixed variables and chosen to be orthogonal. But travel-urban form models are observational studies, and nothing can be controlled as in a laboratory. It is known that there are complex correlations among the built-environment predictors themselves.

Although the basic IID assumptions do not require that all predictors in $X$ be independent, when the predictors are near-linearly dependent the model is ill-conditioned and the least-squares estimators are unstable.

8.1 Variance Inflation

Multicollinearity can inflate the variances of the estimates and seriously harm model precision. If some of the predictors are exactly linearly dependent, the matrix $X'X$ is symmetric but singular, so $(X'X)^{-1}$ does not exist. By the spectral decomposition of a symmetric matrix, $X'X = P\Lambda P'$, where $\Lambda = \operatorname{diag}(\lambda_1,\ldots,\lambda_p)$, the $\lambda_j$ are the eigenvalues of $X'X$, and $P$ is an orthogonal matrix whose columns are the normalized eigenvectors. Then the total variance of $\hat\beta_{LS}$ is

$$\operatorname{tr}(\operatorname{Cov}[\hat\beta_{LS}]) = \sigma^2\sum_{j=1}^{p}\frac{1}{\lambda_j}$$

If the predictors are near-linearly dependent, $X'X$ is nearly singular, some $\lambda_j$ may be very small, and the total variance of $\hat\beta_{LS}$ is highly inflated.
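A minimal numeric sketch of this inflation, using simulated data (the variable names and the 0.01 noise scale are illustrative assumptions): the near-collinear design produces a tiny eigenvalue of $X'X$, so $\sum_j 1/\lambda_j$ explodes relative to a well-conditioned design.

```python
import numpy as np

# Toy illustration: near-linear dependence makes one eigenvalue of X'X
# tiny, which inflates the total variance sigma^2 * sum_j 1/lambda_j.
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)

# Nearly collinear design: second column almost duplicates the first.
X_bad = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)])
# Well-conditioned design: second column is an independent draw.
X_good = np.column_stack([x1, rng.normal(size=n)])

def inflate(X):
    # sum of 1/lambda_j over the eigenvalues of X'X
    return (1.0 / np.linalg.eigvalsh(X.T @ X)).sum()

print(inflate(X_bad), inflate(X_good))  # the first is far larger
```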

For the same reason, the correlation matrix $Z'Z$ obtained by unit-length scaling has an inverse with inflated diagonal elements; that is, the diagonal elements of $(Z'Z)^{-1}$ are not all equal to one. These diagonal elements are called variance inflation factors (VIFs), which can be used to examine multicollinearity. The VIF for a particular predictor is

$$\mathrm{VIF}_j = \frac{1}{1-R_j^2}$$

where $R_j^2$ is the coefficient of determination obtained by regressing $x_j$ on all the remaining predictors.

A common approach is to drop the predictor with the greatest VIF and refit the model until all VIFs are less than 10. However, dropping one or more predictors discards information that might be valuable for explaining the response. Because of the complex relationships among predictors, dropping the predictor with the greatest VIF is not always the best choice; sometimes removing a predictor with a moderate VIF makes all VIFs in the refitted model fall below 10. Moreover, there is no unique cutoff for the VIF. When the relationship between predictors and response is weak, or the $R^2$ is low, even VIFs below 10 may dramatically affect the quality of estimation.
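The definition above can be computed directly: regress each centered predictor on the others and apply $1/(1-R_j^2)$. A sketch with simulated data (the `vif` helper and the data-generating choices are illustrative, not a library API):

```python
import numpy as np

# Sketch: VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing the
# (centered) j-th predictor on all the others.
def vif(X):
    X = X - X.mean(axis=0)              # centering plays the intercept's role
    p = X.shape[1]
    out = []
    for j in range(p):
        xj = X[:, j]
        others = np.delete(X, j, axis=1)
        beta, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ beta
        r2 = 1.0 - (resid @ resid) / (xj @ xj)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
z = rng.normal(size=(200, 2))
# Third column nearly duplicates the first -> large VIFs for that pair.
X = np.column_stack([z[:, 0], z[:, 1], z[:, 0] + 0.1 * rng.normal(size=200)])
print(vif(X))
```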

Orthogonalizing the predictors before fitting the model might help. Other approaches, such as ridge regression or principal components regression, can deal with multicollinearity better.

8.2 Ridge Regression

The least-squares method gives unbiased estimates of the regression coefficients. However, multicollinearity leads to inflated variance and makes the estimates unstable and unreliable. To obtain a smaller variance, a tradeoff is to relax the requirement of unbiasedness. Hoerl and Kennard (1970) proposed ridge regression to address nonorthogonal problems. The ridge estimator $\hat\beta_R$ is biased, but its variance is small enough that

$$\mathrm{MSE}(\hat\beta_R) = E[\hat\beta_R-\beta]^2 = \operatorname{Var}[\hat\beta_R]+\operatorname{Bias}[\hat\beta_R]^2 < \mathrm{MSE}(\hat\beta_{LS}) = \operatorname{Var}[\hat\beta_{LS}]$$

The estimates of ridge regression are

$$\hat\beta_R = (X'X+kI)^{-1}X'y$$

where $k \ge 0$ is a selected constant called the biasing parameter. When $k=0$, the ridge estimator reduces to the least-squares estimator.

When $X'X$ is nonsingular and $(X'X)^{-1}$ exists, the ridge estimator is a linear transformation of $\hat\beta_{LS}$; that is, $\hat\beta_R = Z_k\hat\beta_{LS}$ where $Z_k = (X'X+kI)^{-1}X'X$.
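Both the closed form and the linear-transformation identity are easy to verify numerically. A sketch on simulated data (the design, coefficients, and $k=0.8$ are arbitrary choices for illustration):

```python
import numpy as np

# Sketch: ridge estimator (X'X + kI)^{-1} X'y, and the check that it
# equals Z_k @ beta_LS with Z_k = (X'X + kI)^{-1} X'X.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

k = 0.8
XtX = X.T @ X
beta_ls = np.linalg.solve(XtX, X.T @ y)
beta_ridge = np.linalg.solve(XtX + k * np.eye(3), X.T @ y)

Zk = np.linalg.solve(XtX + k * np.eye(3), XtX)   # (X'X + kI)^{-1} X'X
assert np.allclose(beta_ridge, Zk @ beta_ls)     # linear transformation
```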

Recall that the total variance of $\hat\beta_{LS}$ is $\sigma^2\sum_{j=1}^{p}1/\lambda_j$. The total variance of $\hat\beta_R$ is

$$\operatorname{tr}(\operatorname{Cov}[\hat\beta_R]) = \sigma^2\sum_{j=1}^{p}\frac{\lambda_j}{(\lambda_j+k)^2}$$

Thus, introducing $k$ into the model avoids tiny denominators and eliminates the inflated variance. Choosing a proper value of $k$ is a matter of balancing variance against bias. The bias in $\hat\beta_R$ is

$$\operatorname{Bias}(\hat\beta_R)^2 = k^2\beta'(X'X+kI)^{-2}\beta$$

Hence, increasing $k$ reduces the variance but increases the bias. The ridge trace, a plot of $\hat\beta_R$ versus $k$, can help select a suitable value of $k$. First, at the chosen value of $k$, the estimates should be stable. Second, the estimated coefficients should have proper signs and reasonable values. Third, the SSE should also have a reasonable value.
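The numbers behind a ridge trace can be sketched as below: compute $\hat\beta_R$ over a grid of $k$ on a nearly collinear simulated design and watch the coefficients shrink and stabilize (the grid and data are illustrative assumptions):

```python
import numpy as np

# Sketch of a ridge trace: beta_R over a grid of k values.
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n)])  # near-collinear
y = X @ np.array([2.0, 1.0]) + rng.normal(size=n)

ks = np.linspace(0.0, 5.0, 26)
trace = np.array([
    np.linalg.solve(X.T @ X + k * np.eye(2), X.T @ y) for k in ks
])
# The coefficient vector shrinks monotonically as k grows; plotting
# trace against ks gives the ridge trace.
norms = np.linalg.norm(trace, axis=1)
print(trace[0], trace[-1])
```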

Ridge regression will not give a greater $R^2$ than the least-squares method, because the total sum of squares is fixed:

$$\begin{aligned}
SSE(\hat\beta_R) &= (y-X\hat\beta_R)'(y-X\hat\beta_R)\\
&= (y-X\hat\beta_{LS})'(y-X\hat\beta_{LS}) + (\hat\beta_{LS}-\hat\beta_R)'X'X(\hat\beta_{LS}-\hat\beta_R)\\
&= SSE(\hat\beta_{LS}) + (\hat\beta_{LS}-\hat\beta_R)'X'X(\hat\beta_{LS}-\hat\beta_R)\\
&\ge SSE(\hat\beta_{LS})
\end{aligned}$$

The advantage of ridge regression is that it obtains a suitable set of parameter estimates rather than improving the fit. It can have better predictive ability than least squares. It can also be useful for variable selection: variables whose ridge traces are unstable or tend toward zero can be removed from the model.

In many cases, the ridge trace diverges erratically and may revert toward the least-squares estimates. Jensen and Ramirez (2010, 2012) proposed the surrogate model to further improve ridge regression. The surrogate model chooses $k$ depending on the matrix $X$ alone, independent of $y$.

Using a compact singular value decomposition (SVD), the original matrix can be decomposed as $X = PD_\xi Q'$, where $P$ and $Q$ are orthogonal; the columns of $P$ and $Q$ are the left- and right-singular vectors of $X$. It satisfies $P'P = I$, and $D_\xi = \operatorname{diag}(\xi_1,\ldots,\xi_p)$ holds the singular values in decreasing order. Then $X_k = P\,D\big((\xi_i^2+k_i)^{1/2}\big)Q'$ and

$$\begin{aligned}
X'X &= QD_\xi^2Q' &\\
X_k'X_k &= Q(D_\xi^2+K)Q' &\text{(generalized surrogate)}\\
X_k'X_k &= QD_\xi^2Q'+kI &\text{(ordinary surrogate)}
\end{aligned}$$

and the surrogate solution $\hat\beta_S$ satisfies

$$Q(D_\xi^2+K)Q'\hat\beta_S = X_k'y = Q\,D\big((\xi_i^2+k_i)^{1/2}\big)P'y$$

Jensen and Ramirez proved that $SSE(\hat\beta_S) < SSE(\hat\beta_R)$ and that the surrogate model's canonical traces are monotone in $k$.

8.3 Lasso Regression

Ridge regression can be understood as a restricted least-squares problem. Denoting the constraint by $s$, the ridge coefficient estimates solve

$$\min_\beta\left\{\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\Big)^2\right\} \quad\text{subject to}\quad \sum_{j=1}^{p}\beta_j^2\le s$$

Another approach is to replace the constraint $\sum_{j=1}^{p}\beta_j^2\le s$ with $\sum_{j=1}^{p}|\beta_j|\le s$. This method is called lasso regression.

$$\min_\beta\left\{\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\Big)^2\right\} \quad\text{subject to}\quad \sum_{j=1}^{p}|\beta_j|\le s$$

Consider the case of two predictors for a geometric illustration: the quadratic constraint forms a circular (spherical) region, while the absolute-value constraint forms a diamond. The contours of SSE are expanding ellipses centered at the least-squares estimate $\hat\beta_{LS}$, and each ellipse corresponds to a value of $k$.

If the restriction $s$, also called the "budget", is very large, the constraint region covers the point $\hat\beta_{LS}$; in that case $\hat\beta_{LS}=\hat\beta_R$ and $k=0$. When $s$ is small, the solution is the point where the smallest SSE ellipse touches the constraint region, with the corresponding $k$ and SSE.

The lasso constraint has sharp corners on the axes. When an ellipse touches the constraint region at a corner, one of the coefficients equals zero; this cannot happen with the ridge constraint. Therefore, an advantage of the lasso over ridge regression is that it allows some estimates $\beta_j=0$, which makes the results more interpretable. Moreover, lasso regression thereby performs variable selection.
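The contrast is easiest to see in the special case of an orthonormal design, where both penalized problems have closed forms: ridge scales every coefficient by $1/(1+\lambda)$, while the lasso applies soft-thresholding. This is a toy illustration under that orthonormality assumption, not the general lasso algorithm (which needs coordinate descent or LARS); the coefficients and penalty level are made up.

```python
import numpy as np

# Orthonormal-design toy: ridge shrinks every coefficient toward zero,
# the lasso soft-threshold sets small coefficients exactly to zero.
beta_ls = np.array([3.0, -0.4, 1.5, 0.2])   # hypothetical LS estimates
lam = 0.5                                    # penalty level (illustrative)

beta_ridge = beta_ls / (1.0 + lam)           # all entries stay nonzero
beta_lasso = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

print(beta_ridge)
print(beta_lasso)   # the two small entries are zeroed out: sparsity
```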

8.4 Principal Components Regression

Principal components regression (PCR) is a dimension-reduction method that can handle multicollinearity. It uses the spectral decomposition $X'X = Q\Lambda Q'$, where the columns of $Q$ are the orthogonal eigenvectors of $X'X$ and $\Lambda = \operatorname{diag}(\lambda_1,\ldots,\lambda_p)$ holds the eigenvalues in decreasing order, $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p$. The linear model can then be transformed to

$$y = XQQ'\beta+\varepsilon = Z\theta+\varepsilon$$

where $Z = XQ$ and $\theta = Q'\beta$. $\theta$ is called the vector of regression parameters of the principal components, and $Z = [z_1,\ldots,z_p]$ is known as the matrix of principal components. Then $z_j'z_j = \lambda_j$, the $j$th largest eigenvalue of $X'X$.

PCR usually keeps the several $z_j$ with the largest $\lambda_j$, which eliminates multicollinearity. The resulting estimator $\hat\beta_P$ is biased, but its mean squared error $\mathrm{MSE}(\hat\beta_P)$ is smaller than that of least squares, $\mathrm{MSE}(\hat\beta_{LS})$.
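The steps above can be sketched numerically: eigendecompose $X'X$, keep the first $m$ components, regress $y$ on $Z = XQ_m$, and map $\hat\theta$ back to $\hat\beta_P = Q_m\hat\theta$ (the simulated data and the choice $m=2$ are illustrative assumptions):

```python
import numpy as np

# Sketch of PCR: drop the component with the tiny eigenvalue caused by
# the nearly collinear pair, fit on the remaining components.
rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n),
                     rng.normal(size=n)])
y = X @ np.array([1.0, 1.0, -0.5]) + rng.normal(size=n)

lam, Q = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]          # sort eigenvalues decreasing
lam, Q = lam[order], Q[:, order]

m = 2                                   # keep the two large-eigenvalue PCs
Z = X @ Q[:, :m]
theta = np.linalg.solve(Z.T @ Z, Z.T @ y)
beta_pcr = Q[:, :m] @ theta             # map back to the original scale
print(beta_pcr)
```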

This is why some disaggregated travel models can achieve an $R^2$ over 0.5. The limitation is that the principal components are hard to interpret: the results of PCR may only describe the data at hand, reproducible but not replicable for other data.

References

Hoerl, Arthur E., and Robert W. Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1): 55–67. https://doi.org/10.1080/00401706.1970.10488634.
Jensen, D. R., and D. E. Ramirez. 2010. “Surrogate Models in Ill-Conditioned Systems.” Journal of Statistical Planning and Inference 140 (7): 2069–77. https://doi.org/10.1016/j.jspi.2010.02.001.
Jensen, Donald R., and Donald E. Ramirez. 2012. “Variations on Ridge Traces in Regression.” Communications in Statistics - Simulation and Computation 41 (2): 265–78. https://doi.org/10.1080/03610918.2011.586482.