Chapter 8 Multicollinearity
Multicollinearity, or near-linear dependence, refers to models whose predictors are highly correlated. When data come from a designed experiment, the treatments \(X\) can be fixed and orthogonal. Travel-urban form models, however, are observational studies, and nothing can be controlled as in a laboratory; the built-environment predictors are known to have complex correlations among themselves.
Although the basic IID assumptions do not require all predictors \(\mathbf{X}\) to be independent, when the predictors are near-linearly dependent the model is ill-conditioned and the least squares estimators are unstable.
8.1 Variance Inflation
Multicollinearity inflates the variances of the estimates and can seriously degrade model precision. If some predictors are exactly linearly dependent, the matrix \(\mathbf{X'X}\) is symmetric but singular, so \((\mathbf{X'X})^{-1}\) does not exist. By the spectral decomposition of a symmetric matrix, \(\mathbf{X'X}=\mathbf{P\Lambda P}'\), where \(\Lambda=\text{diag}(\lambda_1,...,\lambda_p)\), the \(\lambda_j\)'s are the eigenvalues of \(\mathbf{X'X}\), and \(\mathbf{P}\) is an orthogonal matrix whose columns are the normalized eigenvectors. The total variance of \(\boldsymbol{\hat\beta}_{LS}\) is then \(\sigma^2\sum_{j=1}^p1/\lambda_j\). If the predictors are near-linearly dependent, that is, \(\mathbf{X'X}\) is nearly singular, some \(\lambda_j\)'s are very small and the total variance of \(\boldsymbol{\hat\beta}_{LS}\) is highly inflated.
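The inflation can be seen directly from the eigenvalues. Below is a minimal illustrative sketch, not taken from the original text, using simulated data: two nearly collinear predictors produce one tiny eigenvalue of \(\mathbf{X'X}\), so \(\sum_j 1/\lambda_j\) blows up.

```python
# Minimal sketch with simulated data: near-linear dependence between two
# columns makes one eigenvalue of X'X tiny and inflates sum(1/lambda_j).
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly a copy of x1
X = np.column_stack([x1, x2])

eigvals = np.linalg.eigvalsh(X.T @ X)      # eigenvalues of X'X (ascending)
print(eigvals)                             # the first one is close to zero
print(np.sum(1 / eigvals))                 # total-variance factor, very large
```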
For the same reason, the correlation matrix under unit-length scaling, \(\mathbf{Z'Z}\), has an inverse with inflated variances; that is, the diagonal elements of \((\mathbf{Z'Z})^{-1}\) are not all equal to one. These diagonal elements are called variance inflation factors (VIFs) and can be used to examine multicollinearity. The VIF for a particular predictor is computed as
\[\begin{equation} \mathrm{VIF}_j=\frac{1}{1-R_j^2} \tag{8.1} \end{equation}\]
where \(R_j^2\) is the coefficient of determination obtained by regressing \(x_j\) on all the remaining predictors.
A common approach is to drop the predictor with the greatest VIF and refit the model, repeating until all VIFs are less than 10. However, dropping one or more predictors discards information that might be valuable for explaining the response. Because of the complex relationships among predictors, dropping the predictor with the greatest VIF is not always the best choice; sometimes removing a predictor with a moderate VIF makes all VIFs in the refitted model fall below 10. Moreover, there is no unique cutoff for the VIF. When the relationship between the predictors and the response is weak, or the \(R^2\) is low, VIFs below 10 can still affect the estimates dramatically.
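As a concrete illustration of equation (8.1), the sketch below computes each VIF from the \(R_j^2\) of the auxiliary regressions; the function name and data layout are assumptions made for this example only.

```python
# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing
# column j of X on all remaining columns (plus an intercept).
import numpy as np

def vif(X):
    """Return one VIF per column of the predictor matrix X (n x p)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y_j = X[:, j]
        X_rest = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(X_rest, y_j, rcond=None)
        resid = y_j - X_rest @ beta
        r2 = 1 - resid @ resid / np.sum((y_j - y_j.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)
```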
Orthogonalization before fitting the model might be helpful. Other approaches, such as ridge regression and principal components regression, can deal with multicollinearity better.
8.2 Ridge Regression
The least squares method gives unbiased estimates of the regression coefficients. However, multicollinearity inflates their variance and makes the estimates unstable and unreliable. To obtain a smaller variance, a tradeoff is to relax the requirement of unbiasedness. Hoerl and Kennard (1970) proposed ridge regression to address such nonorthogonal problems. The ridge estimates \(\boldsymbol{\hat\beta}_{R}\) are biased, but their variance is small enough that
\[\begin{equation} \begin{split} \mathrm{MSE}(\boldsymbol{\hat\beta}_{R})&=E[(\boldsymbol{\hat\beta}_{R}-\boldsymbol{\beta})^2]=\mathrm{Var}[\boldsymbol{\hat\beta}_{R}]+\mathrm{Bias}[\boldsymbol{\hat\beta}_{R}]^2\\ &<\mathrm{MSE}(\boldsymbol{\hat\beta}_{LS})=\mathrm{Var}[\boldsymbol{\hat\beta}_{LS}] \end{split} \end{equation}\]
The estimates of ridge regression are
\[\begin{equation} \boldsymbol{\hat\beta}_{R}=(\mathbf{X'X}+k\mathbf{I})^{-1}\mathbf{X'}\mathbf{y} \tag{8.2} \end{equation}\] where \(k\ge0\) is a selected constant called the biasing parameter. When \(k=0\), the ridge estimator reduces to the least squares estimator.
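Equation (8.2) translates directly into code. The sketch below assumes the predictors are already centered and scaled and is only meant to show the closed form.

```python
# Sketch of equation (8.2): ridge estimates from the penalized normal equations.
import numpy as np

def ridge_beta(X, y, k):
    """Ridge estimator (X'X + kI)^{-1} X'y; X assumed centered and scaled."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# With k = 0 this reproduces the least squares estimate:
# np.allclose(ridge_beta(X, y, 0.0), np.linalg.lstsq(X, y, rcond=None)[0])
```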
When \(\mathbf{X'X}\) is nonsingular and \((\mathbf{X'X})^{-1}\) exists, the ridge estimator is a linear transformation of \(\boldsymbol{\hat\beta}_{LS}\); that is, \(\boldsymbol{\hat\beta}_{R}=\mathbf{Z}_k\boldsymbol{\hat\beta}_{LS}\), where \(\mathbf{Z}_k=(\mathbf{X'X}+k\mathbf{I})^{-1}\mathbf{X'X}\).
Recall that the total variance of \(\boldsymbol{\hat\beta}_{LS}\) is \(\sigma^2\sum_{j=1}^p1/\lambda_j\). The total variance of \(\boldsymbol{\hat\beta}_{R}\) is
\[\begin{equation} \mathrm{tr}(\mathrm{Cov}[\boldsymbol{\hat\beta}_{R}])=\sigma^2\sum_{j=1}^p\frac{\lambda_j}{(\lambda_j+k)^2} \end{equation}\]
Thus, introducing \(k\) avoids tiny denominators and tempers the inflated variance. Choosing a proper value of \(k\) is a matter of balancing variance against bias. The bias of \(\boldsymbol{\hat\beta}_{R}\) is
\[\begin{equation} \mathrm{Bias}(\boldsymbol{\hat\beta}_{R})^2=k^2\boldsymbol{\beta}'(\mathbf{X'X}+k\mathbf{I})^{-2}\boldsymbol{\beta} \end{equation}\]
Hence, increasing \(k\) reduces the variance but increases the bias. The ridge trace, a plot of \(\boldsymbol{\hat\beta}_{R}\) versus \(k\), can help select a suitable value of \(k\). First, at the chosen \(k\) the estimates should be stable. Second, the estimated coefficients should have proper signs and reasonable values. Third, the \(\mathrm{SSE}\) should also have a reasonable value.
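A ridge trace can be produced by solving the ridge normal equations over a grid of \(k\). The sketch below uses simulated collinear data and matplotlib; all names and values are illustrative.

```python
# Possible ridge-trace sketch: coefficient paths over a grid of k values.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)          # collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2 * x1 - x2 + 0.5 * x3 + rng.normal(size=n)

ks = np.logspace(-4, 2, 100)
paths = np.array([np.linalg.solve(X.T @ X + k * np.eye(3), X.T @ y) for k in ks])

plt.plot(ks, paths)                               # one line per coefficient
plt.xscale("log")
plt.xlabel("biasing parameter k")
plt.ylabel("ridge coefficients")
plt.title("Ridge trace")
plt.show()
```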
Ridge regression will not give a greater \(R^2\) than the least squares method, because the total sum of squares is fixed and the residual sum of squares can only increase:
\[\begin{equation} \begin{split} \mathrm{SSE}(\boldsymbol{\hat\beta}_{R})&=(\mathbf{y-X}\boldsymbol{\hat\beta}_{R})'(\mathbf{y-X}\boldsymbol{\hat\beta}_{R})\\ &=(\mathbf{y-X}\boldsymbol{\hat\beta}_{LS})'(\mathbf{y-X}\boldsymbol{\hat\beta}_{LS})+(\boldsymbol{\hat\beta}_{LS}-\boldsymbol{\hat\beta}_{R})'\mathbf{X'X}(\boldsymbol{\hat\beta}_{LS}-\boldsymbol{\hat\beta}_{R})\\ &=\mathrm{SSE}(\boldsymbol{\hat\beta}_{LS})+(\boldsymbol{\hat\beta}_{LS}-\boldsymbol{\hat\beta}_{R})'\mathbf{X'X}(\boldsymbol{\hat\beta}_{LS}-\boldsymbol{\hat\beta}_{R})\\ &\ge \mathrm{SSE}(\boldsymbol{\hat\beta}_{LS}) \end{split} \end{equation}\]
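The identity above is easy to verify numerically. The short check below, on simulated data with an arbitrary \(k\), confirms that the ridge \(\mathrm{SSE}\) equals the least squares \(\mathrm{SSE}\) plus a nonnegative quadratic term.

```python
# Check: SSE(beta_R) = SSE(beta_LS) + (beta_LS - beta_R)' X'X (beta_LS - beta_R)
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
beta_r = np.linalg.solve(X.T @ X + 5.0 * np.eye(3), X.T @ y)   # k = 5

sse = lambda b: np.sum((y - X @ b) ** 2)
gap = (beta_ls - beta_r) @ (X.T @ X) @ (beta_ls - beta_r)
print(np.isclose(sse(beta_r), sse(beta_ls) + gap))             # True
```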
The advantage of ridge regression is to obtain a suitable set of parameter estimates rather than to improve the fit. It can have better predictive ability than least squares. It can also be useful for variable selection: variables whose ridge traces are unstable or tend toward zero can be removed from the model.
In many cases, the ridge trace diverges erratically and may revert back toward the least squares estimates. Jensen and Ramirez (D. R. Jensen and Ramirez 2010; Donald R. Jensen and Ramirez 2012) proposed the surrogate model to further improve ridge regression. The surrogate model chooses \(k\) based on the matrix \(\mathbf{X}\) alone, independently of \(\mathbf{y}\).
Using a compact singular value decomposition (SVD), the design matrix can be decomposed as \(\mathbf{X}=\mathbf{PD_{\xi}Q}'\), where the columns of \(\mathbf{P}\) and \(\mathbf{Q}\) are the left and right singular vectors of \(\mathbf{X}\), \(\mathbf{P'P}=\mathbf{Q'Q}=\mathbf{I}\), and \(\mathbf{D}_{\xi}=\text{diag}(\xi_1,...,\xi_p)\) holds the singular values in decreasing order. Then \(\mathbf{X}_k=\mathbf{PD}((\xi_i^2+k_i)^{1/2})\mathbf{Q}'\) and
\[\begin{equation} \begin{split} \mathbf{X'X}=&\mathbf{QD}_\xi^2\mathbf{Q}'\\ \mathbf{X}_k'\mathbf{X}_k=&\mathbf{Q(D_\xi^2+K)}\mathbf{Q}'\quad\text{generalized surrogate}\\ \mathbf{X}_k'\mathbf{X}_k=&\mathbf{QD}_\xi^2\mathbf{Q}'+k\mathbf{I}\quad\text{ordinary surrogate} \end{split} \end{equation}\]
and the surrogate solution \(\boldsymbol{\hat\beta}_{S}\) is
\[\begin{equation} \mathbf{Q(D^2_{\xi}+K)Q}'\boldsymbol{\hat\beta}_{S}=\mathbf{X}_k'\mathbf{y}=\mathbf{QD}((\xi_i^2+k_i)^{1/2})\mathbf{P}'\mathbf{y} \tag{8.3} \end{equation}\]
Jensen and Ramirez proved that \(\mathrm{SSE}(\boldsymbol{\hat\beta}_{S})< \mathrm{SSE}(\boldsymbol{\hat\beta}_{R})\) and that the surrogate model's canonical traces are monotone in \(k\).
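For a scalar \(k\) (the ordinary surrogate), equation (8.3) can be solved directly from the compact SVD. The sketch below is one interpretation under that assumption, shown alongside the SVD form of the ridge estimator for comparison; it is not the authors' own code.

```python
# Sketch of the ordinary surrogate estimator with a single k, via the compact
# SVD X = P diag(xi) Q'. The surrogate shrinks by 1/sqrt(xi^2 + k), which
# depends on X only; ridge shrinks by xi/(xi^2 + k).
import numpy as np

def surrogate_beta(X, y, k):
    P, xi, Qt = np.linalg.svd(X, full_matrices=False)
    return Qt.T @ ((P.T @ y) / np.sqrt(xi**2 + k))

def ridge_beta_svd(X, y, k):
    P, xi, Qt = np.linalg.svd(X, full_matrices=False)
    return Qt.T @ ((xi / (xi**2 + k)) * (P.T @ y))
```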
8.3 Lasso Regression
Ridge regression can be understood as a restricted least squares problem. Denoting the constraint by \(s\), the ridge coefficient estimates solve
\[\begin{equation} \min_{\boldsymbol\beta}\left\{\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\right)^2\right\}\text{ subject to } \sum_{j=1}^p\beta_j^2\le s \end{equation}\]
Another approach is to replace the constraint term \(\sum_{j=1}^p\beta_j^2\le s\) with \(\sum_{j=1}^p|\beta_j|\le s\). This method is called lasso regression.
\[\begin{equation} \min_{\boldsymbol\beta}\left\{\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\right)^2\right\}\text{ subject to } \sum_{j=1}^p|\beta_j|\le s \end{equation}\]
Consider the case of two predictors as a geometric illustration. The quadratic constraint \(\sum_{j}\beta_j^2\le s\) defines a circular region, while the absolute-value constraint \(\sum_{j}|\beta_j|\le s\) defines a diamond. The contours of the \(\mathrm{SSE}\) are ellipses expanding outward from the least squares estimate \(\hat\beta_{LS}\), and each ellipse corresponds to a value of \(k\).
If the restriction \(s\), also called the 'budget', is large enough, the constraint region covers the point \(\hat\beta_{LS}\); then \(\hat\beta_{R}=\hat\beta_{LS}\) and \(k=0\). When \(s\) is small, the solution is the point where the smallest ellipse touches the constraint region, with the corresponding \(k\) and \(\mathrm{SSE}\).
The lasso constraint has sharp corners on the axes. When an ellipse touches the constraint region at a corner, one of the coefficients equals zero; this cannot happen with the ridge constraint. Therefore, an advantage of lasso over ridge regression is that lasso allows some estimates \(\beta_j=0\), which makes the results more interpretable. Moreover, lasso regression can perform variable selection, as the sketch below illustrates.
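The example below compares the two methods on simulated collinear data using scikit-learn's Ridge and Lasso; the penalty values are arbitrary choices for illustration.

```python
# Illustration: lasso typically sets some coefficients exactly to zero,
# while ridge only shrinks them (scikit-learn assumed to be installed).
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)      # nearly duplicates x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 0.5 * x3 + rng.normal(size=n)

print(Ridge(alpha=1.0).fit(X, y).coef_)       # all nonzero, shrunken
print(Lasso(alpha=0.1).fit(X, y).coef_)       # typically some exact zeros
```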
8.4 Principal Components Regression
Principal components regression (PCR) is a dimension-reduction method that can handle multicollinearity. It again uses the decomposition \(\mathbf{X'X}=\mathbf{Q\Lambda Q}'\) (obtainable from the SVD of \(\mathbf{X}\)), where \(\mathbf{Q}\) is the matrix whose columns are the orthonormal eigenvectors of \(\mathbf{X'X}\) and \(\Lambda=\text{diag}(\lambda_1,...,\lambda_p)\) contains the eigenvalues in decreasing order, \(\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p\). The linear model can then be transformed to
\[\begin{equation} \mathbf{y} = \mathbf{XQQ}'\boldsymbol\beta + \varepsilon = \mathbf{Z}\boldsymbol\theta + \varepsilon \end{equation}\]
where \(\mathbf{Z}=\mathbf{XQ}\), \(\boldsymbol\theta=\mathbf{Q}'\boldsymbol\beta\). \(\boldsymbol\theta\) is called the regression parameters of the principal components. \(\mathbf{Z}=\{\mathbf{z}_1,...,\mathbf{z}_p\}\) is known as the matrix of principal components of \(\mathbf{X'X}\). Then \(\mathbf{z}'_j\mathbf{z}_j=\lambda_j\) is the \(j\)th largest eigenvalue of \(\mathbf{X'X}\).
PCR usually retains the few \(\mathbf{z}_j\)'s with the largest \(\lambda_j\)'s, which eliminates the multicollinearity. Its estimates \(\boldsymbol{\hat\beta}_{P}\) are biased, but their mean squared error \(\mathrm{MSE}(\boldsymbol{\hat\beta}_{P})\) can be smaller than that of least squares, \(\mathrm{MSE}(\boldsymbol{\hat\beta}_{LS})\).
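A minimal PCR sketch, assuming centered predictors: keep the leading components, regress \(\mathbf{y}\) on them, and map the estimated \(\boldsymbol\theta\) back to the original scale. The function name and interface are assumptions for this example.

```python
# Hypothetical PCR helper: retain the components with the largest eigenvalues
# of X'X, estimate theta by least squares, and recover beta_P = Q_r theta.
import numpy as np

def pcr_beta(X, y, n_components):
    lam, Q = np.linalg.eigh(X.T @ X)            # eigenvalues in ascending order
    keep = np.argsort(lam)[::-1][:n_components] # indices of largest eigenvalues
    Q_r = Q[:, keep]                            # retained eigenvectors
    Z = X @ Q_r                                 # principal components z_j
    theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return Q_r @ theta                          # estimates on the original scale
```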
With this smaller error, some disaggregated travel models' \(R^2\) can exceed 0.5. The limitation is that the principal components themselves are hard to interpret. The results of PCR may merely describe the data at hand: they are reproducible but not necessarily replicable for other data.