Chapter 5 Linear Models
This chapter does not give a complete introduction to linear regression. It focuses only on critical issues related to travel-urban form models. Moreover, it discusses only the VMT-urban form models, because the response in a regular linear model is a continuous variable. Mode choice models are covered in the chapter on generalized linear models.
5.1 Assumptions
5.1.1 Additive and linearity
For linear models, the most important assumptions are additivity and linearity of the relationship between the response and the predictors. The gravity law implies that travel distance has a multiplicative (inverse) relationship with the ‘masses’ of two places. If population size is a representative of the built environment, the additive relationship will not hold. Previous studies also show that the effect sizes of the built environment with respect to travel are small and complex. There is not sufficient evidence to support or reject the linearity hypothesis.
5.1.2 Independent Identically Distributed (IID)
Another essential assumption is that the random errors are independent and identically distributed (IID). The random error, also called the residual when estimated, refers to the difference between the observed \(\mathbf{y}\) and the fitted \(\mathbf{\hat y}\), where \(\mathbf{\hat y}\) is a linear combination of the predictors \(\mathbf{X}\). The residuals represent the part that cannot be explained by the model.
\[\begin{equation} \mathbf{e}=\mathbf{y}-\mathbf{\hat y} \tag{5.1} \end{equation}\]
The expected value, the variances, and the covariances among the random errors are the first and second moments of the errors. ‘Identical’ means that the random errors should have zero mean and constant variance. The homogeneity of variance is also called homoscedasticity.
\[\begin{equation} E(\varepsilon) = 0 \\ Var(\varepsilon) = \sigma^2\\ \tag{5.2} \end{equation}\]
‘Independent’ requires that the random errors be uncorrelated. That is,
\[\begin{equation} Cov[\varepsilon_i,\varepsilon_j] = 0,\quad i\neq j \tag{5.3} \end{equation}\]
Once the IID conditions are satisfied, the Gauss-Markov theorem (Theorem 5.1) shows that the least-squares method gives the minimum-variance unbiased estimators (MVUE), also called the best linear unbiased estimators (BLUE). These conditions are not strict, which makes the regression method widely applicable.
Theorem 5.1 (Gauss-Markov theorem) For the regression model (1.1) with the assumptions \(E(\varepsilon) = 0\), \(Var(\varepsilon) = \sigma^2\), and uncorrelated errors, the least-squares estimators are unbiased and have minimum variance when compared with all other unbiased estimators that are linear combinations of the \(y_i\). (Montgomery et al., 2021)
Another version states: under Models II-VII, if \(\lambda'\beta\) is estimable and \(\hat\beta\) is any solution to the normal equations, then \(\lambda'\hat\beta\) is a linear unbiased estimator of \(\lambda'\beta\) and, under Model II, the variance of \(\lambda'\hat\beta\) is uniformly less than that of any other linear unbiased estimator of \(\lambda'\beta\) (IX, Theorem E13, p.38).
Unfortunately, many of the predictors are correlated. Moreover, observations from various cities, regions, or counties are very unlikely to be identical; this issue is called heteroscedasticity. Related content is discussed in the sections on diagnosis and validation.
5.1.3 Normality
When conducting hypothesis tests and constructing confidence intervals, the required assumption is \(\mathbf{y|x}\sim N (\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I})\). Maximum likelihood methods also require this assumption.
Evidence has demonstrated that travel distance is not normally distributed. Zipf’s law also suggests that travel distance follows a power-law distribution. Using a logarithm transformation, the skewed distribution can be converted to an approximately normal distribution.
There are some quantitative methods that can examine the normality of the transformed distributions.
5.2 Estimations
5.2.1 Least Squares
- Ordinary Least Squares
The least-squares method can be used to estimate the coefficients \(\beta\) in equation (1.1). The dimension of \(\mathbf{X}\) is \(n\times p\), which means the data contain \(n\) observations and \(p-1\) predictors. The \(p\times1\) vector of least-squares estimators is denoted as \(\hat\beta\), and the solution to the normal equations is
\[\begin{equation} \boldsymbol{\hat\beta}=(\mathbf{X'X})^{-1}\mathbf{X'}\mathbf{y} \tag{5.4} \end{equation}\]
and
\[\begin{equation} \hat\sigma^2=\frac1{n-p}(\mathbf{y-X}\boldsymbol{\hat\beta})'(\mathbf{y-X}\boldsymbol{\hat\beta}) \tag{5.5} \end{equation}\]
This requires \(\mathbf{X'X}\) to be invertible; that is, the covariates are linearly independent and \(\mathbf{X}\) has full rank \(p\) (V., Definition, p.22).
Given the estimated coefficients, the model can give the fitted values of response as:
\[\begin{equation} \mathbf{\hat y}=\mathbf{X}\boldsymbol{\hat\beta}=\mathbf{X}(\mathbf{X'X})^{-1}\mathbf{X'y}= \mathbf{Hy} \tag{5.6} \end{equation}\]
where \(\mathbf{H}=\mathbf{X}(\mathbf{X'X})^{-1}\mathbf{X'}\) is hat matrix and \(\mathbf{e}=\mathbf{y}-\mathbf{\hat y}=\mathbf{y}-\mathbf{X}\boldsymbol{\hat\beta}=(\mathbf{I}-\mathbf{H})\mathbf{y}\)
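The following is a minimal sketch of ordinary least squares via the normal equations, written in Python with numpy on synthetic data (the data, dimensions, and variable names are illustrative assumptions, not from the original studies).

```python
import numpy as np

# Minimal OLS sketch via the normal equations on synthetic data.
rng = np.random.default_rng(0)
n, p = 100, 3                      # n observations, intercept plus p-1 = 2 predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 0.5, -0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)   # requires X to have full column rank p
beta_hat = XtX_inv @ X.T @ y       # equation (5.4)
H = X @ XtX_inv @ X.T              # hat matrix
y_hat = H @ y                      # fitted values, equation (5.6)
e = y - y_hat                      # residuals
sigma2_hat = e @ e / (n - p)       # equation (5.5)
```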
- Generalized Least Squares
When the observations are not independent or have unequal variances, the covariance matrix of the errors is not an identity matrix, and the regression assumption \(V[\boldsymbol{\varepsilon}]=\sigma^2\mathbf{I}\) does not hold. Let \(\mathbf{V}\) be a known \(n\times n\) positive definite matrix with \(V[\boldsymbol{\varepsilon}]=\sigma^2\mathbf{V}\). Then there exists an \(n\times n\) symmetric matrix \(\mathbf{K}\) with rank \(n\) such that \(\mathbf{V}=\mathbf{KK'}\). Let
\[\begin{equation} \mathbf{z}=\mathbf{K^{-1}y},\ \mathbf{B}=\mathbf{K^{-1}X}, \text{and}\ \boldsymbol{\eta}=\mathbf{K^{-1}}\boldsymbol{\varepsilon} \end{equation}\]
The linear model becomes \(\mathbf{z}=\mathbf{B}\boldsymbol{\beta}+\boldsymbol{\eta}\) with \(V[\boldsymbol{\eta}]=\sigma^2\mathbf{I}\). If the model is of full rank, that is, \(rank(\mathbf{X})=p\), then \(\mathbf{X'V^{-1}X}\) is invertible and the generalized least squares solution is
\[\begin{equation} \boldsymbol{\hat\beta}_{GLS}=(\mathbf{X'V^{-1}X})^{-1}\mathbf{X'V^{-1}}\mathbf{y} \tag{5.7} \end{equation}\]
and
\[\begin{equation} \hat\sigma^2_{GLS}=\frac1{n-p}(\mathbf{y-X}\boldsymbol{\hat\beta}_{GLS})'\mathbf{V^{-1}}(\mathbf{y-X}\boldsymbol{\hat\beta}_{GLS}) \tag{5.8} \end{equation}\]
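A minimal sketch of the GLS estimator with a known covariance structure is shown below, again in Python on synthetic data; the diagonal form of \(\mathbf{V}\) is only an illustrative assumption.

```python
import numpy as np

# Minimal GLS sketch with a known positive definite V (assumed diagonal here).
rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
V = np.diag(rng.uniform(0.5, 2.0, size=n))        # known covariance structure
K = np.linalg.cholesky(V)                          # V = K K'
eps = K @ rng.normal(size=n)
y = X @ np.array([2.0, 1.0]) + eps

V_inv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)   # equation (5.7)
resid = y - X @ beta_gls
sigma2_gls = resid @ V_inv @ resid / (n - p)                    # equation (5.8)
```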
5.2.2 Standardized coefficients
The value of \(\hat \beta_j\) means that, holding all other predictors fixed, a one-unit change in \(x_j\) changes the mean of \(Y\) by \(\hat \beta_j\) on average. However, the units of the predictors \(\mathbf{X}\) are very different. Hence, the values of the coefficients are not directly comparable.
Unit normal scaling or unit length scaling can convert \(\hat \beta_j\) to a dimensionless regression coefficient, called a standardized regression coefficient. Let
\[\begin{equation} \begin{split} z_{ij}=&\frac{x_{ij}-\bar x_j}{\sqrt{\sum_{i=1}^{n}(x_{ij}-\bar x_j)^2}},\quad \\ y^{0}_{i}=&\frac{y_{i}-\bar y}{\sqrt{\sum_{i=1}^{n}(y_{i}-\bar y)^2}} \end{split} \tag{5.9} \end{equation}\]
\[\begin{equation} \begin{split} \mathbf{\hat b}=&(\mathbf{Z'Z})^{-1}\mathbf{Z'}\mathbf{y^{0}},\ \text{or}\\ \hat b_j= &\hat\beta_j\sqrt{\frac{\sum_{i=1}^{n}(x_{ij}-\bar x_j)^2}{\sum_{i=1}^{n}(y_{i}-\bar y)^2}},\ j=1,2,...(p-1),\text{ and}\\ \hat\beta_0=&\bar y - \sum_{j=1}^{p-1}\hat\beta_j\bar x_j \end{split} \tag{5.10} \end{equation}\]
Note that \(\mathbf{Z'Z}\) is the correlation matrix of the predictors.
\[\begin{equation} \begin{split} \mathbf{Z'Z}=&\begin{bmatrix} 1 & r_{12} & r_{13} & \dots & r_{1k} \\ r_{21} & 1 & r_{23} & \dots & r_{2k} \\ r_{31} & r_{32} & 1 & \dots & r_{3k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_{k1} & r_{k2} & r_{k3} & \dots & 1 \end{bmatrix},\quad \mathbf{Z'}\mathbf{y^{0}}=&\begin{bmatrix} r_{1y} \\ r_{2y} \\ r_{3y} \\ \vdots \\ r_{ky} \end{bmatrix} \end{split} \tag{5.11} \end{equation}\]
where
\[\begin{equation} \begin{split} r_{ij}=&\frac{\sum_{u=1}^{n}(x_{ui}-\bar x_i)(x_{uj}-\bar x_j)}{\sqrt{\sum_{u=1}^{n}(x_{ui}-\bar x_i)^2\sum_{u=1}^{n}(x_{uj}-\bar x_j)^2}}\\ r_{jy}=&\frac{\sum_{u=1}^{n}(x_{uj}-\bar x_j)(y_{u}-\bar y)}{\sqrt{\sum_{u=1}^{n}(x_{uj}-\bar x_j)^2\sum_{u=1}^{n}(y_{u}-\bar y)^2}} \end{split} \tag{5.12} \end{equation}\]
where \(r_{ij}\) is the simple correlation between \(x_i\) and \(x_j\), and \(r_{jy}\) is the simple correlation between \(x_j\) and \(y\).
It seems that standardized regression coefficients are comparable. However, the value of \(\hat b_j\) still depends on the other predictors in the model. Therefore, comparisons across different models remain problematic.
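A small sketch of unit length scaling and the resulting standardized coefficients, using Python on made-up predictors with very different scales, is shown below.

```python
import numpy as np

# Sketch of unit length scaling and standardized coefficients (equations (5.9)-(5.10)).
rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 2)) * [1.0, 10.0]          # predictors on very different scales
y = 3.0 + 0.8 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(size=n)

Xc = X - X.mean(axis=0)
Z = Xc / np.sqrt((Xc ** 2).sum(axis=0))            # unit length scaling of predictors
y0 = (y - y.mean()) / np.sqrt(((y - y.mean()) ** 2).sum())

b_hat = np.linalg.solve(Z.T @ Z, Z.T @ y0)         # standardized coefficients
# Z.T @ Z is the correlation matrix of the predictors (equation (5.11)).
```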
5.2.3 Elasticity
Elasticity is commonly used to determine the relative importance of a variable in terms of its influence on the dependent variable. It is generally interpreted as the percent change in the dependent variable induced by a 1% change in the independent variable (McCarthy, 2001).
\[e_i=\beta_i\frac{X_i}{Y_i}\approx\frac{\partial Y_i}{\partial X_i}\frac{X_i}{Y_i}\]
Model | Marginal Effects | Elasticity |
---|---|---|
Linear | \(\beta\) | \(\beta\frac{X_i}{Y_i}\) |
Log-linear | \(\beta Y_i\) | \(\beta X_i\) |
Linear-log | \(\beta\frac{1}{X_i}\) | \(\beta\frac{1}{Y_i}\) |
Log-log | \(\beta\frac{Y_i}{X_i}\) | \(\beta\) |
Logit | \(\beta p_i(1-p_i)\) | \(\beta X_i(1-p_i)\) |
Poisson | \(\beta \lambda_i\) | \(\beta X_i\) |
NB | \(\beta \lambda_i\) | \(\beta X_i\) |
It is strange that Reid Ewing and Cervero (2010) use the formula \(\beta \bar X\left(1-\frac{\bar Y}{n}\right)\) for the logit model. In the Poisson and negative binomial models, \(\lambda_i=\exp[\mathbf{x}_i'\boldsymbol{\beta}]\) (Greene, 2018, eq.18-17,21). For the truncated Poisson model, \(\delta_i=\frac{(1-P_{i,0}-\lambda_i P_{i,0})}{(1-P_{i,0})^2}\cdot\lambda_i\beta\) (Greene, 2018, eq.18-23). Hurdle models give separate marginal (partial) effects (Greene, 2018, example 18.20).
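As a simple numeric illustration of the linear-model row of the table, the sketch below evaluates the elasticity \(\beta\frac{\bar X}{\bar Y}\) at the sample means; all values are hypothetical, not taken from any cited study.

```python
# Illustrative elasticity of a linear model evaluated at the sample means,
# e = beta * X_bar / Y_bar (all numbers below are made up).
beta_density = -0.05          # hypothetical coefficient of density on VMT
X_bar = 20.0                  # hypothetical mean density
Y_bar = 25.0                  # hypothetical mean VMT
elasticity = beta_density * X_bar / Y_bar
print(round(elasticity, 3))   # -0.04: a 1% rise in density ~ 0.04% drop in VMT
```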
5.2.4 Combined effects?
Some studies sum up the standardized coefficients or elasticities and call the summation a combined effect. Although these values are dimensionless, this method is problematic because different model specifications and data ranges are not comparable.
5.3 Inference
5.3.1 Analysis of Variance
Analysis of variance (ANOVA) is a fundamental approach in regression analysis. Actually, this method analyzes the variation in means rather than the variances themselves (Casella & Berger, 2002, Ch.11).
Once the linear relationship holds, the variation in the response \(\mathbf{y}\) can be decomposed as
\[\begin{equation} \begin{split} \mathbf{y'y}=&\mathbf{y'Hy}+\mathbf{y'}(\mathbf{I}-\mathbf{H})\mathbf{y}\\ \mathbf{y'y}=&\boldsymbol{\hat\beta}'\mathbf{X'}\mathbf{y}+\mathbf{y'y}-\boldsymbol{\hat\beta}'\mathbf{X'}\mathbf{y}\\ \mathbf{y'y}-n\bar y^2=&\boldsymbol{\hat\beta}'\mathbf{X'}\mathbf{y}-n\bar y^2+\mathbf{y'y}-\boldsymbol{\hat\beta}'\mathbf{X'}\mathbf{y}\\ \sum(y-\bar y)^2=&\sum(\hat y-\bar y)^2+\sum(y-\hat y)^2\\ SST =& SSR + SSE \end{split} \tag{5.13} \end{equation}\]
where \(SST\) is the Sum of Squares Total, \(SSR\) is the Sum of Squares Regression, and \(SSE\) is the Sum of Squares Error. \(SSE=\mathbf{e'e}\) represents the part of the variation not explained by the model.
For Generalized Least Squares method, \(SST=\mathbf{y'V^{-1}y}\), \(SSR= \boldsymbol{\hat\beta'}\mathbf{B'z}=\mathbf{y'V^{-1}X(X'V^{-1}X})^{-1}\mathbf{X'V^{-1}}\mathbf{y}\), and \(SSE=SST-SSR\)
5.3.2 Hypothesis Test
- Significance of Regression
The significance-of-regression test examines whether the linear relationship between the response and the predictors is adequate. The hypotheses for testing model adequacy are
\[\begin{equation} \begin{split} H_0:&\quad \beta_1 = \beta_2 = \cdots =\beta_{p-1}=0\\ H_1:&\quad \text{at least one } \beta_j \neq 0,\ j=1,...,(p-1)\\ \end{split} \tag{5.14} \end{equation}\]
By Theorem D14 (XX, p.90), if an \(n\times1\) random vector \(\mathbf{y}\sim N(\boldsymbol{\mu},\mathbf{I})\), then
\[\begin{equation} \mathbf{y'y} \sim \chi^2(n,\frac12\boldsymbol{\mu'\mu}) \tag{5.15} \end{equation}\]
Recall the assumption of \(\mathbf{y|x}\sim N (\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I})\).
By the additive property of \(\chi^2\) distribution,
\[\begin{equation} \frac{SSE}{\sigma^2}=\frac{\mathbf{y'(I-H)y}}{\sigma^2} \sim \chi^2_{(n-p)}\\ \frac{SSR}{\sigma^2}=\frac{\mathbf{y'Hy}-n\bar y^2}{\sigma^2} \sim \chi^2_{(p-1)}\\ \end{equation}\]
Though \(\sigma^2\) is usually unknown, by the relationship between \(\chi^2\) and \(F\) distributions,
\[\begin{equation} F_0=\frac{MSR}{MSE} \sim F_{(p-1),(n-p),\lambda}\\ \end{equation}\] where \(MSR=SSR/(p-1)\), \(MSE=SSE/(n-p)\), and \(\lambda\) is the non-centrality parameter, which equals zero under \(H_0\). This allows testing the hypotheses at a given significance level \(\alpha\): if the test statistic \(F_0>F_{\alpha,(p-1),(n-p)}\), then one can reject \(H_0\).
If a VMT-urban form model adds many predictors but the adjusted \(R^2\) is still low, the association between travel distance and the built environment might be spurious.
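A minimal sketch of the overall F-test on synthetic data is shown below (Python with scipy; the data and coefficient values are illustrative assumptions).

```python
import numpy as np
from scipy import stats

# Sketch of the significance-of-regression F-test on synthetic data.
rng = np.random.default_rng(3)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.4, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat
SSE = np.sum((y - y_hat) ** 2)
SSR = np.sum((y_hat - y.mean()) ** 2)
F0 = (SSR / (p - 1)) / (SSE / (n - p))      # F0 = MSR / MSE
p_value = stats.f.sf(F0, p - 1, n - p)      # reject H0 if p_value < alpha
```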
- Significance of Coefficients
For testing a specific coefficient, the hypothesis is
\[\begin{equation} \begin{split} H_0:&\quad \beta_j =0\\ H_1:&\quad \beta_j \neq 0\\ \end{split} \tag{5.16} \end{equation}\]
\(\boldsymbol{\hat\beta}\) is a linear combination of \(\mathbf{y}\). Based on the assumption of \(\mathbf{y|x}\sim N (\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I})\), it can be shown that \(\boldsymbol{\hat\beta}\sim N (\boldsymbol{\beta},\sigma^2(\mathbf{X'X})^{-1})\) and
\[\begin{equation} t_0=\frac{\hat\beta_j}{se(\hat\beta_j)}=\frac{\hat\beta_j}{\sqrt{\hat\sigma^2C_{jj}}} \sim t_{(n-p)}\\ \end{equation}\] where \(C_{jj}\) is the element in the \(j\)th row and \(j\)th column of \((\mathbf{X'X})^{-1}\). If \(|t_0|< t_{\alpha/2,(n-p)}\), the test fails to reject \(H_0\), and this predictor can be removed from the model. This test is called a partial or marginal test because the test statistic for \(\beta_j\) depends on all the other predictors in the model.
5.3.3 Confidence Intervals
The above results can also be used to construct a confidence interval for each coefficient. A \(100(1-\alpha)\%\) confidence interval for \(\beta_j\) is
\[\begin{equation} \hat\beta_j-t_{\alpha/2,(n-p)}\sqrt{\hat\sigma^2C_{jj}}\le \beta_j \le \hat\beta_j+t_{\alpha/2,(n-p)}\sqrt{\hat\sigma^2C_{jj}} \end{equation}\]
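The sketch below computes the coefficient t statistics and confidence intervals on the same kind of synthetic data used above (Python with scipy; values are illustrative).

```python
import numpy as np
from scipy import stats

# Sketch of coefficient t-tests and 100(1 - alpha)% confidence intervals.
rng = np.random.default_rng(4)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.4, -0.2]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
sigma2_hat = e @ e / (n - p)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))     # sqrt(sigma^2 * C_jj)

t0 = beta_hat / se                              # t statistics for H0: beta_j = 0
t_crit = stats.t.ppf(1 - 0.05 / 2, n - p)       # alpha = 0.05
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
```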
5.4 Adequacy
The outcomes of estimation and inference alone cannot demonstrate a model's performance. If the primary assumptions are violated, the estimates could be biased and the model could be useless. These problems can also happen when the model is not correctly specified. It is therefore necessary to carry out diagnosis and validation for fitted models.
5.4.1 Goodness of fit
Goodness of fit tells us how well the model can explain the data. The coefficient of determination \(R^2\) is a proportion that assesses the quality of the fitted model.
\[\begin{equation} R^2 =\frac{SSR}{SST}=1-\frac{SSE}{SST} \tag{5.17} \end{equation}\]
When \(R^2\) is close to \(1\), most of the variation in the response can be explained by the fitted model. Although \(R^2\) is not the only criterion for a good model, it is reported in most published papers. Recall the discussion in Part I: aggregated data eliminate the differences among individuals, households, or neighborhoods. Under the new variance structure, \(SSE\) will be much smaller than in a disaggregated model. The \(R^2\) in many disaggregate studies is around 0.3, while the \(R^2\) in some aggregate studies can reach 0.8. A seriously underfitting model's outputs could be biased and unstable.
A fact is that adding predictors to the model will never decrease \(R^2\). Thus, models with different numbers of predictors are not comparable by \(R^2\) alone. Adjusted \(R^2\) addresses this issue by accounting for the degrees of freedom, which represent the amount of independent information available.
\[\begin{equation} \begin{split} df_T =& df_R + df_E\\ n-1=&(p-1)+(n-p) \end{split} \tag{5.18} \end{equation}\]
Then, the mean square (MS) of each sum of squares (SS) can be calculated as \(MS=SS/df\). The mean square error is an estimate of the error variance, \(\hat\sigma^2=MSE=SSE/(n-p)\), where \(n-p\) is its degrees of freedom. The adjusted \(R^2\) is then
\[\begin{equation} R_{adj}^2 = 1-\frac{MSE}{MST} = 1-\frac{SSE/(n-p)}{SST/(n-1)} \tag{5.19} \end{equation}\]
Another similar measure is the \(R^2\) for prediction based on PRESS. Recall that the PRESS statistic is the prediction error sum of squares obtained by fitting the model to \(n-1\) observations and predicting the held-out observation.
\[\begin{equation} PRESS = \sum_{i=1}^n(y_i-\hat y_{(i)})^2= \sum_{i=1}^n\left(\frac{e_i}{1-h_{ii}}\right)^2 \tag{5.20} \end{equation}\]
A model with a smaller PRESS has better prediction ability. The \(R^2\) for prediction is
\[\begin{equation} R_{pred}^2 = 1-\frac{PRESS}{SST} \tag{5.21} \end{equation}\]
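The sketch below computes \(R^2\), adjusted \(R^2\), and the PRESS-based prediction \(R^2\) on synthetic data; it uses the closed-form PRESS residuals from equation (5.20).

```python
import numpy as np

# Sketch of R^2, adjusted R^2, and PRESS-based prediction R^2 on synthetic data.
rng = np.random.default_rng(5)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
e = y - y_hat
SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum(e ** 2)

R2 = 1 - SSE / SST
R2_adj = 1 - (SSE / (n - p)) / (SST / (n - 1))
PRESS = np.sum((e / (1 - np.diag(H))) ** 2)     # equation (5.20)
R2_pred = 1 - PRESS / SST
```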
5.4.2 Residuals Analysis
The major assumptions, both IID and normality, concern the residuals. Residual diagnosis is therefore an essential step in model validation.
Several scaled residuals can help in the diagnosis. Since \(MSE\) estimates the error variance \(\hat\sigma^2\) and \(E[\varepsilon]=0\), standardized residuals should approximately follow a standard normal distribution.
\[d_i=\frac{e_i}{\sqrt{MSE}}=e_i\sqrt{\frac{n-p}{\sum_{i=1}^n e_i^2}},\quad i=1,2,...,n\]
Recall that the residuals are \(\mathbf{e}=\mathbf{y}-\mathbf{\hat y}=(\mathbf{I}-\mathbf{H})\mathbf{y}\) and the hat matrix is \(\mathbf{H}=\mathbf{X}(\mathbf{X'X})^{-1}\mathbf{X'}\). Let \(h_{ii}\) denote the \(i^{th}\) diagonal element of the hat matrix. Studentized residuals can then be expressed as \[r_i=\frac{e_i}{\sqrt{MSE(1-h_{ii})}},\quad i=1,2,...,n\] It can be shown that \(0\le h_{ii}\le1\). An observation with \(h_{ii}\) close to one will return a large value of \(r_i\). An observation \(x_i\) that has a strong influence on the fitted values is called a leverage point.
Ideally, the scaled residuals have zero mean and unit variance. Hence, an observation with \(|d_i|>3\) or \(|r_i|>3\) is a potential outlier.
The Predicted Residual Error Sum of Squares (PRESS) can also be used to detect outliers. This method predicts the \(i^{th}\) fitted response by excluding the \(i^{th}\) observation and examines the influence of that point. The corresponding error is \(e_{(i)}=e_{i}/(1-h_{ii})\) with \(V[e_{(i)}]=\sigma^2/(1-h_{ii})\). Thus, if \(MSE\) is a good estimate of \(\sigma^2\), PRESS residuals are equivalent to studentized residuals.
\[\frac{e_{(i)}}{\sqrt{V[e_{(i)}]}}=\frac{e_i/(1-h_{ii})}{\sqrt{\sigma^2/(1-h_{ii})}}=\frac{e_i}{\sqrt{\sigma^2(1-h_{ii})}}\]
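A sketch of these residual diagnostics on synthetic data is given below; the leverage cutoff \(2p/n\) is a common rule of thumb, stated here as an assumption rather than a rule from the text.

```python
import numpy as np

# Sketch of scaled residual diagnostics: standardized residuals, studentized
# residuals, and leverage values from the hat matrix (synthetic data).
rng = np.random.default_rng(6)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = (np.eye(n) - H) @ y
MSE = e @ e / (n - p)
h = np.diag(H)

d = e / np.sqrt(MSE)                         # standardized residuals
r = e / np.sqrt(MSE * (1 - h))               # studentized residuals
outliers = np.where(np.abs(r) > 3)[0]        # potential outliers
high_leverage = np.where(h > 2 * p / n)[0]   # rule-of-thumb leverage points
```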
- Residual Plot
A residual plot shows the pattern of the residuals against the fitted values \(\mathbf{\hat y}\). If the assumptions are valid, the points should form a horizontal band evenly distributed around the line \(e=0\).
A funnel shape in the residual plot indicates that the variance of the errors is a function of \(\hat y\). A suitable transformation of the response or a predictor could stabilize the variance. A curved shape means the linearity assumption is not valid; it implies that adding quadratic or higher-order terms might be suitable.
- Normal Probability Plot
A histogram of the residuals can check the normality assumption. Because the probability distribution of VMT is highly right-skewed, a log transformation of VMT is reasonable.
A better way is a normal quantile-quantile (QQ) plot of the residuals. An ideal normal sample plots as a straight line. Looking only at the \(R^2\) and p-values cannot disclose this feature.
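The sketch below draws both diagnostic plots for a toy regression (Python with matplotlib and scipy on synthetic data).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Sketch of a residual plot and a normal QQ plot for diagnostics (synthetic data).
rng = np.random.default_rng(7)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2 + 0.5 * x + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x])

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X @ beta_hat, e)                 # residuals vs fitted values
axes[0].axhline(0, color="grey")
axes[0].set_xlabel("Fitted values")
axes[0].set_ylabel("Residuals")
stats.probplot(e, dist="norm", plot=axes[1])     # normal QQ plot
plt.tight_layout()
plt.show()
```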
5.4.3 Heteroscedasticity
When the assumption of constant variance is violated, the linear model is heteroscedastic. Heteroscedasticity is common in urban studies. For example, cities of different sizes are not identical: small cities or rural areas might have homogeneous values of population density, while large cities' densities are more variable.
Recall the generalized least squares estimates (5.7) and (5.8). If the residuals are independent but the variances are not constant, consider a simple regression \(y_i=\beta_1x_i+\varepsilon_i\) with error variance proportional to \(x_i^2\), so that \(\boldsymbol{\varepsilon}\sim MVN(\mathbf{0},\sigma^2\mathbf{V})\) where
\[\begin{equation} \mathbf{V}=\begin{bmatrix} x_1^2 & 0 & \dots & 0 \\ 0 & x_2^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & x_n^2 \end{bmatrix},\quad \mathbf{V}^{-1}=\begin{bmatrix} \frac1{x_1^2} & 0 & \dots & 0 \\ 0 & \frac1{x_2^2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \frac1{x_n^2} \end{bmatrix} \tag{5.22} \end{equation}\]
Then \(\mathbf{X'V^{-1}X}=n\) and the generalized least squares solution is
\[\begin{equation} \hat\beta_{1,WLS}=\frac1n\sum_{i=1}^{n}\frac{y_i}{x_i} \tag{5.23} \end{equation}\]
and
\[\begin{equation} \hat\sigma^2_{WLS}=\frac1{n-1}\sum_{i=1}^{n}\frac{(y_i-\hat\beta_{1}x_i)^2}{x_i^2} \tag{5.24} \end{equation}\]
In a heteroscedastic model, the OLS estimates of the coefficients are still unbiased but no longer efficient, and the estimates of their variances are biased. The corresponding hypothesis tests and confidence intervals would therefore be misleading.
Another special case is a model with aggregated variables, as with geographic units. Let \(u_j\) and \(v_j\) be the response and predictor of the \(j^{th}\) household in a neighborhood, and let \(n_i\) be the sample size in each neighborhood. Then \(y_i=\sum_{j=1}^{n_i}u_j/n_i\) and \(x_i=\sum_{j=1}^{n_i}v_j/n_i\). In this case,
\[\begin{equation} \mathbf{V}=\begin{bmatrix} \frac1{n_1} & 0 & \dots & 0 \\ 0 & \frac1{n_2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \frac1{n_n} \end{bmatrix},\quad \mathbf{V}^{-1}=\begin{bmatrix} n_1 & 0 & \dots & 0 \\ 0 & n_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & n_n \end{bmatrix} \tag{5.25} \end{equation}\]
Then \(\mathbf{X'V^{-1}X}=\sum_{i=1}^nn_ix_i^2\) and the WLS estimate of \(\beta_1\) is
\[\begin{equation} \hat\beta_{1,WLS}=\frac{\sum_{i=1}^{n}n_ix_iy_i}{\sum_{i=1}^{n}n_ix_i^2} \tag{5.26} \end{equation}\]
and
\[\begin{equation} V[\hat\beta_{1,WLS}]=\frac{V[\sum_{i=1}^{n}n_ix_iy_i]}{(\sum_{i=1}^{n}n_ix_i^2)^2}=\frac{\sum_{i=1}^{n}n_i^2x_i^2\sigma^2/n_i}{(\sum_{i=1}^{n}n_ix_i^2)^2}=\frac{\sigma^2}{\sum_{i=1}^{n}n_ix_i^2} \tag{5.27} \end{equation}\]
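A minimal weighted least squares sketch for such aggregated observations, with each neighborhood mean weighted by its sample size \(n_i\), is shown below (Python on synthetic data; the neighborhood counts and coefficient are assumptions).

```python
import numpy as np

# Sketch of WLS for neighborhood-level means, weighting each aggregated
# observation by its sample size n_i (equation (5.26); synthetic data).
rng = np.random.default_rng(8)
m = 40                                    # number of neighborhoods
n_i = rng.integers(5, 50, size=m)         # households sampled per neighborhood
x = rng.uniform(1, 10, size=m)            # aggregated predictor (e.g., mean density)
y = 0.6 * x + rng.normal(scale=1 / np.sqrt(n_i))   # means are less noisy for large n_i

beta_wls = np.sum(n_i * x * y) / np.sum(n_i * x * x)
var_beta_wls = 1.0 / np.sum(n_i * x * x)            # times sigma^2, equation (5.27)
```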
Several procedures, such as Bartlett's likelihood ratio test, the Goldfeld-Quandt test, and the Breusch-Pagan test, can be used to examine heteroscedasticity (Ravishanker & Dey, 2020, 8.1.3, pp.288-290).
5.4.4 Autocorrelation
For spatio-temporal data, the observations often have some relationship over time or space. In these cases, the assumption of independent errors is violated, and the errors of the linear model are said to be autocorrelated. Autocorrelation is also common in urban studies: neighboring geographic entities or successive time periods can impact each other or share a similar environment.
Take a special case of time-series data as an example, assuming the model has constant variance, \(E[\varepsilon]=0\), but \(Cov[\varepsilon_i,\varepsilon_j]=\sigma^2\rho^{|j-i|}\), \(i,j=1,2,...,n\) with \(|\rho|<1\). The variance-covariance matrix, a Toeplitz matrix, is shown below.
\[\begin{equation} \mathbf{V}=\begin{bmatrix} 1 & \rho & \rho^2 & \dots & \rho^{n-1} \\ \rho & 1 & \rho & \dots & \rho^{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \dots & 1 \end{bmatrix},\quad \{\mathbf{V}^{-1}\}_{ij}=\begin{cases} \frac{1}{1-\rho^2} & \text{if } i=j=1,n \\ \frac{1+\rho^2}{1-\rho^2} & \text{if } i=j=2,...,n-1 \\ \frac{-\rho}{1-\rho^2} & \text{if } |j-i|=1 \\ 0 & \text{otherwise} \end{cases} \tag{5.28} \end{equation}\]
This is a linear regression with first-order autoregressive errors (AR(1)). The estimates of \(\boldsymbol{\hat\beta}\) are the same as the GLS solutions, namely \(\boldsymbol{\hat\beta}_{GLS}=(\mathbf{X'V^{-1}X})^{-1}\mathbf{X'V^{-1}}\mathbf{y}\) and \(\widehat{V[\boldsymbol{\hat\beta}_{GLS}]}=\hat\sigma^2_{GLS}(\mathbf{X'V^{-1}X})^{-1}\), where \(\hat\sigma^2_{GLS}=\frac1{n-p}(\mathbf{y-X}\boldsymbol{\hat\beta}_{GLS})'\mathbf{V^{-1}}(\mathbf{y-X}\boldsymbol{\hat\beta}_{GLS})\).
It can be verified that \(V[\boldsymbol{\hat\beta}_{GLS}]\le V[\boldsymbol{\hat\beta}_{OLS}]\) always holds, with equality when \(\mathbf{V=I}\) or \(\rho=0\). This shows that \(\boldsymbol{\hat\beta}_{GLS}\) are the best linear unbiased estimators (BLUE).
This case can be extended to multiple regression models and to the autocorrelation of a stationary stochastic process at lag \(k\). The Durbin-Watson test is used to test the null hypothesis of \(\rho=0\).
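The sketch below computes the Durbin-Watson statistic directly from its definition on simulated AR(1) residuals; the value of \(\rho\) is an illustrative assumption.

```python
import numpy as np

# Sketch of the Durbin-Watson statistic, d = sum((e_t - e_{t-1})^2) / sum(e_t^2).
# Values near 2 suggest rho = 0; values well below 2 suggest positive autocorrelation.
rng = np.random.default_rng(9)
n = 200
e = np.empty(n)
e[0] = rng.normal()
for t in range(1, n):                    # AR(1) residuals with rho = 0.6 for illustration
    e[t] = 0.6 * e[t - 1] + rng.normal()

dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(round(dw, 2))                      # well below 2 here, indicating autocorrelation
```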
5.4.5 Multicollinearity
Multicollinearity, or near-linear dependence, refers to models with highly correlated predictors. When data are generated from an experimental design, the treatments \(X\) can be fixed and orthogonal. But travel-urban form models are observational studies, and nothing can be controlled as in a lab. It is known that there are complex correlations among the built-environment predictors themselves.
Although the basic IID assumptions do not require all predictors \(\mathbf{X}\) to be independent, when the predictors are near-linearly dependent, the model is ill-conditioned and the least-squares estimators are unstable.
Multicollinearity can inflate the variances and seriously impact model precision. If some of the predictors are exactly linearly dependent, the matrix \(\mathbf{X'X}\) is symmetric but non-invertible. By the spectral decomposition of a symmetric matrix, \(\mathbf{X'X}=\mathbf{P\Lambda P'}\), where \(\Lambda=\text{diag}(\lambda_1,...,\lambda_p)\), the \(\lambda_i\)'s are the eigenvalues of \(\mathbf{X'X}\), and \(\mathbf{P}\) is an orthogonal matrix whose columns are the normalized eigenvectors. Then the total variance of \(\boldsymbol{\hat\beta}_{LS}\) is \(\sigma^2\sum_{j=1}^p1/\lambda_j\). If the predictors are near-linearly dependent, that is, nearly singular, some \(\lambda_j\)'s may be very small and the total variance of \(\boldsymbol{\hat\beta}_{LS}\) is highly inflated.
For the same reason, the correlation matrix \(\mathbf{Z'Z}\) from unit length scaling will have an inverse with inflated diagonal elements; that is, the diagonal elements of \((\mathbf{Z'Z})^{-1}\) are not all equal to one. These diagonal elements are called variance inflation factors (VIFs), which can be used to examine multicollinearity. The VIF for a particular predictor is computed as in (5.29):
\[\begin{equation} VIF_j=\frac{1}{1-R_j^2} \tag{5.29} \end{equation}\]
where \(R_j^2\) is the coefficient of determination by regressing \(x_j\) on all the remaining predictors.
A common approach is to drop the predictor with the greatest VIF and refit the model until all VIFs are less than 10. However, dropping one or more predictors loses information that might be valuable for explaining the response. Due to the complexity among predictors, dropping the predictor with the greatest VIF is not always the best choice; sometimes removing a predictor with a moderate VIF makes all VIFs in the refitted model less than 10. Moreover, there is no unique criterion for the VIF threshold. When the relationship between the predictors and the response is weak, or the \(R^2\) is low, even VIFs less than 10 may dramatically affect the quality of estimation.
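A minimal sketch of VIF computation, regressing each predictor on the others as in equation (5.29), is shown below (Python on synthetic, deliberately collinear data).

```python
import numpy as np

# Sketch of variance inflation factors: regress each predictor on the others
# and compute VIF_j = 1 / (1 - R_j^2) (equation (5.29); synthetic data).
rng = np.random.default_rng(10)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)     # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

vifs = []
for j in range(X.shape[1]):
    others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    coef = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
    resid = X[:, j] - others @ coef
    R2_j = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    vifs.append(1 / (1 - R2_j))
print(np.round(vifs, 1))                     # VIFs for x1 and x2 are large
```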
Orthogonalization before fitting the model might be helpful. Other approaches such as ridge regression or principal components regression can deal with multicollinearity better.
5.4.5.1 Ridge Regression and Lasso
The least-squares method gives unbiased estimates of the regression coefficients. However, multicollinearity leads to inflated variances and makes the estimates unstable and unreliable. To get a smaller variance, a tradeoff is to relax the requirement of unbiasedness. Denote by \(\boldsymbol{\hat\beta}_{R}\) a biased estimator whose variance is small enough that
\[\begin{equation} \begin{split} MSE(\boldsymbol{\hat\beta}_{R})&=E[\boldsymbol{\hat\beta}_{R}-\boldsymbol{\beta}]^2=Var[\boldsymbol{\hat\beta}_{R}]+Bias[\boldsymbol{\hat\beta}_{R}]^2\\ &<MSE(\boldsymbol{\hat\beta}_{LS})=Var[\boldsymbol{\hat\beta}_{LS}] \end{split} \end{equation}\]
Hoerl and Kennard (1970) proposed ridge regression to address the nonorthogonal problems.
\[\begin{equation} \boldsymbol{\hat\beta}_{R}=(\mathbf{X'X}+k\mathbf{I})^{-1}\mathbf{X'}\mathbf{y} \tag{5.30} \end{equation}\] where \(k\ge0\) is a selected constant and is called a biasing parameter. When \(k=0\), the ridge estimator reduces to least squares estimators.
When \(\mathbf{X'X}\) is nonsingular and \((\mathbf{X'X})^{-1}\) exists, the ridge estimator is a linear transformation of \(\boldsymbol{\hat\beta}_{LS}\). That is, \(\boldsymbol{\hat\beta}_{R}=Z_k\boldsymbol{\hat\beta}_{LS}\), where \(Z_k=(\mathbf{X'X}+k\mathbf{I})^{-1}\mathbf{X'X}\).
Recall the total-variance of \(\boldsymbol{\hat\beta}_{LS}\) is \(\sigma^2\sum_{j=1}^p1/\lambda_j\). The total-variance of \(\boldsymbol{\hat\beta}_{R}\) is
\[\begin{equation} \text{tr}(Cov[\boldsymbol{\hat\beta}_{R}])=\sigma^2\sum_{j=1}^p\frac{\lambda_j}{(\lambda_j+k)^2} \end{equation}\]
Thus, introducing \(k\) into the model avoids tiny denominators and reduces the inflated variance. Choosing a proper value of \(k\) is a matter of balancing variance and bias. The bias in \(\boldsymbol{\hat\beta}_{R}\) is
\[\begin{equation} Bias(\boldsymbol{\hat\beta}_{R})^2=k^2\boldsymbol{\beta}'(\mathbf{X'X}+k\mathbf{I})^{-2}\boldsymbol{\beta} \end{equation}\]
Hence, increasing \(k\) reduces the variance but increases the bias. A ridge trace, a plot of \(\boldsymbol{\hat\beta}_{R}\) versus \(k\), can help to select a suitable value of \(k\). First, at the chosen value of \(k\), the estimates should be stable. Second, the estimated coefficients should have proper signs and reasonable values. Third, the \(SSE\) should also have a reasonable value.
Ridge regression will not give a greater \(R^2\) than the least-squares method because the total sum of squares is fixed.
\[\begin{equation} \begin{split} SSE(\boldsymbol{\hat\beta}_{R})&=(\mathbf{y-X}\boldsymbol{\hat\beta}_{R})'(\mathbf{y-X}\boldsymbol{\hat\beta}_{R})\\ &=(\mathbf{y-X}\boldsymbol{\hat\beta}_{LS})'(\mathbf{y-X}\boldsymbol{\hat\beta}_{LS})+(\boldsymbol{\hat\beta}_{LS}-\boldsymbol{\hat\beta}_{R})'\mathbf{X'X}(\boldsymbol{\hat\beta}_{LS}-\boldsymbol{\hat\beta}_{R})\\ &=SSE(\boldsymbol{\hat\beta}_{LS})+(\boldsymbol{\hat\beta}_{LS}-\boldsymbol{\hat\beta}_{R})'\mathbf{X'X}(\boldsymbol{\hat\beta}_{LS}-\boldsymbol{\hat\beta}_{R})\\ &\ge SSE(\boldsymbol{\hat\beta}_{LS}) \end{split} \end{equation}\]
The advantage of ridge regression is to obtain a suitable set of parameter estimates rather than to improve the fit. It can have better prediction ability than least squares. It can also be useful for variable selection: variables with unstable ridge traces or traces tending toward zero can be removed from the model.
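A minimal sketch of the ridge estimator (5.30) and a simple ridge trace over a grid of \(k\) values is given below (Python on synthetic, nearly collinear predictors; the scaling choice and grid are assumptions).

```python
import numpy as np

# Sketch of the ridge estimator (equation (5.30)) and a ridge trace over k,
# computed on synthetic, nearly collinear predictors.
rng = np.random.default_rng(11)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)          # near-linear dependence
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)     # ridge is usually applied to scaled predictors
y = x1 - x2 + rng.normal(size=n)
y = y - y.mean()

ks = np.logspace(-4, 1, 20)
trace = np.array([
    np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y) for k in ks
])
# Plotting the rows of `trace` against ks shows the coefficients stabilizing as k grows.
```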
In many cases, the ridge trace diverges erratically and may revert back to the least-squares estimates. Jensen and Ramirez (2010, 2012) proposed the surrogate model to further improve ridge regression. The surrogate model chooses \(k\) depending only on the matrix \(\mathbf{X}\), independent of \(\mathbf{y}\).
Using a compact singular value decomposition (SVD), \(\mathbf{X}=\mathbf{PD_{\xi}Q}'\), the left-singular vectors \(\mathbf{P}\) satisfy \(\mathbf{P'P}=\mathbf{I}\) and the right-singular vectors \(\mathbf{Q}\) are orthogonal. \(\mathbf{D}_{\xi}=\text{diag}(\xi_1,...,\xi_p)\) contains the singular values in decreasing order. Then \(\mathbf{X}_k=\mathbf{PD}((\xi_i^2+k_i)^{1/2})\mathbf{Q}'\) and
\[\begin{equation} \begin{split} \mathbf{X'X}=&\mathbf{QD}_\xi^2\mathbf{Q}'&\\ \mathbf{X}_k'\mathbf{X}_k=&\mathbf{Q(D_\xi^2+K)}\mathbf{Q}'&\quad\text{generalized surrogate}\\ \mathbf{X}_k'\mathbf{X}_k=&\mathbf{QD}_\xi^2\mathbf{Q}'+k\mathbf{I}&\quad\text{ordinary surrogate} \end{split} \end{equation}\]
and the surrogate solution \(\boldsymbol{\hat\beta}_{S}\) is
\[\begin{equation} \mathbf{Q(D^2_{\xi}+K)Q}'\boldsymbol{\hat\beta}_{S}=\mathbf{X}_k'\mathbf{y}=\mathbf{QD}((\xi_i^2+k_i)^{1/2})\mathbf{P}'\mathbf{y} \tag{5.31} \end{equation}\]
Jensen and Ramirez proved that \(SSE(\boldsymbol{\hat\beta}_{S})< SSE(\boldsymbol{\hat\beta}_{R})\) and that the surrogate model's canonical traces are monotone in \(k\).
5.4.5.2 Principal Components Regression (PCR)
- Dimension Reduction Methods and Shrinkage Methods
- Partial Least Squares (PLS)
With the help of some synthetic variables, a disaggregated model's \(R^2\) can exceed 0.5. But the risk is that these techniques may describe only the data at hand, so the results cannot be generalized. This is worthy of deeper investigation.
5.5 Variables Selections
5.5.1 Mallows \(C_p\)
5.5.2 PRESS
Underfitting models
Overfitting models
Interaction Effects
Polynomial Regression Models
Theoretical considerations
5.6 Other Topics
5.6.1 Bayesian approaches (Opt.)
5.6.2 SEM (Opt.)
Another approach uses structural equation modeling (SEM). Two such studies find higher elasticities of per capita VMT with respect to density (-0.38 and -0.238) (Cervero and Murakami 2010; Reid Ewing, Hamidi, et al. 2014).
In general, modeling is case-by-case work. Researchers may prefer different models after weighing sensitivity and robustness, even given the same hypothesis and data. Published papers usually do not show the results of diagnosis and validation. Under these circumstances, comparing or summarizing their outcomes is unreliable.
References
McCarthy, P. S. (2001). Transportation Economics Theory and Practice: A Case Study Approach. Blackwell, Boston.