Chapter 5 Multiple Linear Models
Part II presents some statistical methods relating to travel-urban form models. Chapters 5 to 10 introduce the main methods that can be applied to VMT-urban form models, because the responses are continuous variables. The models of mode choice are placed in the chapter on Generalized Linear Models. The last chapter, Meta-Analysis, introduces the basic ideas and approaches of meta-analysis and of dealing with publication bias.
These contents focus on analyses that are often neglected or omitted in travel-urban form models and may offer some inspiration for improving current studies. They are drawn mostly from several textbooks (Casella and Berger 2002; Montgomery, Peck, and Vining 2021; Ravishanker and Dey 2020) and lecture notes by Dr. Robert Fountain, Dr. Nadeeshani Jayasena, and Dr. Jong Sung Kim. They do not cover the cutting edge of techniques in recently published statistics papers.
5.1 Assumptions
5.1.1 Additivity and linearity
For linear models, the most important assumptions are an additive and linear relationship between the response and the predictors. The gravity law discloses that travel distance has a multiplicative (inverse) relationship with the ‘masses’ of two places. If population size is a representative of the built environment, the additive relationship will not hold. Previous studies also show that the effect sizes of the built environment with respect to travel are small and complex. There is not sufficient evidence to support or reject the linearity hypothesis.
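For example, assuming a gravity-type relation \(T_{ij}\propto M_iM_j/d_{ij}^{2}\) between flow \(T_{ij}\), the ‘masses’ \(M_i,M_j\), and distance \(d_{ij}\), the relationship is multiplicative in levels and becomes additive only after a logarithm transformation:
\[\begin{equation} \log T_{ij} = c + \log M_i + \log M_j - 2\log d_{ij} \end{equation}\]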
5.1.2 Independent and Identically Distributed (IID)
Another essential assumption is that the random errors are independent and identically distributed (IID). The random error is estimated by the residual, which refers to the difference between the observed \(\mathbf{y}\) and the fitted \(\mathbf{\hat y}\), where \(\mathbf{\hat y}\) is a linear combination of the predictors \(\mathbf{X}\). Residuals represent the part that cannot be explained by the model.
\[\begin{equation} \mathbf{e}=\mathbf{y}-\mathbf{\hat y} \tag{5.1} \end{equation}\]
The expected value, the variances, and the covariances of the random errors are their first and second moments. ‘Identically distributed’ means that the random errors should have zero mean and constant variance. The homogeneity of variance is also called homoscedasticity.
\[\begin{equation} E(\varepsilon) = 0, \quad Var(\varepsilon) = \sigma^2 \tag{5.2} \end{equation}\]
‘Independent’ requires that the random errors be uncorrelated. That is,
\[\begin{equation} Cov[\varepsilon_i,\varepsilon_j] = 0,\quad i\neq j \tag{5.3} \end{equation}\]
Once the IID conditions are satisfied, the Gauss–Markov theorem (Theorem 5.1) shows that the least-squares method gives the best linear unbiased estimators (BLUE), that is, the estimators with minimum variance among all linear unbiased estimators. These conditions are not strict, which makes the regression method widely applicable.
Theorem 5.1 (Gauss–Markov theorem) For the regression model (1.1) with the assumptions \(E(\varepsilon) = 0\), \(Var(\varepsilon) = \sigma^2\), and uncorrelated errors, the least-squares estimators are unbiased and have minimum variance when compared with all other unbiased estimators that are linear combinations of the \(y_i\). (Montgomery, Peck, and Vining 2021)
Another version states: under Models II-VII, if \(\lambda'\beta\) is estimable and \(\hat\beta\) is any solution to the normal equations, then \(\lambda'\hat\beta\) is a linear unbiased estimator of \(\lambda'\beta\) and, under Model II, the variance of \(\lambda'\hat\beta\) is uniformly less than that of any other linear unbiased estimator of \(\lambda'\beta\) (IX, Theorem E13, p.38).
Unfortunately, these conditions are often violated in practice. The errors may be correlated across observations, and observations from various cities, regions, or counties are very unlikely to be identically distributed; the unequal-variance issue is called heteroscedasticity. Related content appears in the section on Diagnosis and Validation.
5.1.3 Normality
When conducting hypothesis tests and constructing confidence intervals, the required assumption is \(\mathbf{y|x}\sim N (\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I})\). Maximum likelihood estimation also requires this assumption.
Evidence has demonstrated that travel distance is not normally distributed. Zipf’s law also suggests that travel distance follows a power-law distribution. Using a logarithm transformation, the skewed distribution can be converted to an approximately normal distribution.
There are quantitative methods that can examine the normality of the transformed distribution, as in the sketch below.
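A minimal sketch of such a check, assuming a hypothetical right-skewed sample `distance`; the Shapiro-Wilk test and the sample skewness are compared before and after the transformation:

```python
# A minimal sketch: test normality before and after a log transform.
# `distance` is a hypothetical, lognormally simulated sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
distance = rng.lognormal(mean=2.0, sigma=0.8, size=500)  # right-skewed

for label, x in [("raw", distance), ("log", np.log(distance))]:
    stat, p = stats.shapiro(x)   # Shapiro-Wilk test of normality
    print(f"{label:>3}: W = {stat:.3f}, p = {p:.3f}, "
          f"skewness = {stats.skew(x):.2f}")
```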
5.2 Estimations
5.2.1 Least Squares
- Ordinary Least Squares
The least-squares method can be used to estimate the coefficients \(\boldsymbol\beta\) in equation (1.1). The dimension of \(\mathbf{X}\) is \(n\times p\), which means the data contain \(n\) observations and \(p-1\) predictors. The \(p\times1\) vector of least-squares estimators is denoted by \(\boldsymbol{\hat\beta}\), and the solution to the normal equations is
\[\begin{equation} \boldsymbol{\hat\beta}=(\mathbf{X'X})^{-1}\mathbf{X'}\mathbf{y} \tag{5.4} \end{equation}\]
and
\[\begin{equation} \hat\sigma^2=\frac1{n-p}(\mathbf{y-X}\boldsymbol{\hat\beta})'(\mathbf{y-X}\boldsymbol{\hat\beta}) \tag{5.5} \end{equation}\]
This requires that \(\mathbf{X'X}\) be invertible, that is, the covariates are linearly independent and \(\mathbf{X}\) has rank \(p\) (Kim 2020, V., Definition, p.22).
Given the estimated coefficients, the model gives the fitted values of the response as:
\[\begin{equation} \mathbf{\hat y}=\mathbf{X}\boldsymbol{\hat\beta}=\mathbf{X}(\mathbf{X'X})^{-1}\mathbf{X'y}= \mathbf{Hy} \tag{5.6} \end{equation}\]
where \(\mathbf{H}=\mathbf{X}(\mathbf{X'X})^{-1}\mathbf{X'}\) is the hat matrix and \(\mathbf{e}=\mathbf{y}-\mathbf{\hat y}=\mathbf{y}-\mathbf{X}\boldsymbol{\hat\beta}=(\mathbf{I}-\mathbf{H})\mathbf{y}\).
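A minimal sketch of equations (5.4)-(5.6) on simulated data; all names and values are illustrative, and in practice one would solve the normal equations rather than invert \(\mathbf{X'X}\) explicitly:

```python
# A minimal sketch of OLS via the normal equations, (5.4)-(5.6).
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                              # n observations, p - 1 predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)           # assumes X has full rank p
beta_hat = XtX_inv @ X.T @ y               # (5.4)
H = X @ XtX_inv @ X.T                      # hat matrix
y_hat = H @ y                              # (5.6)
e = y - y_hat                              # residuals, (5.1)
sigma2_hat = (e @ e) / (n - p)             # (5.5)
print(beta_hat.round(2), round(sigma2_hat, 3))
```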
- Generalized Least Squares
When the observations are not independent or have unequal variances, the covariance matrix of the errors is not the identity matrix, and the regression assumption \(V[\boldsymbol{\varepsilon}]=\sigma^2\mathbf{I}\) does not hold. Let \(\mathbf{V}\) be a known \(n\times n\) positive definite matrix with \(V[\boldsymbol{\varepsilon}]=\sigma^2\mathbf{V}\). Then there exists an \(n\times n\) nonsingular symmetric matrix \(\mathbf{K}\) with \(\mathbf{V}=\mathbf{KK'}\). Let
\[\begin{equation} \mathbf{z}=\mathbf{K}^{-1}\mathbf{y},\ \mathbf{B}=\mathbf{K}^{-1}\mathbf{X},\ \text{and}\ \boldsymbol{\eta}=\mathbf{K}^{-1}\boldsymbol{\varepsilon} \end{equation}\]
The linear model becomes \(\mathbf{z}=\mathbf{B}\boldsymbol{\beta}+\boldsymbol{\eta}\) with \(V[\boldsymbol{\eta}]=\sigma^2\mathbf{I}\). If the model has full rank, that is, \(rank(\mathbf{X})=p\), then \(\mathbf{X'V^{-1}X}\) is invertible and the generalized least squares solution is
\[\begin{equation} \boldsymbol{\hat\beta}_{GLS}=(\mathbf{X'V^{-1}X})^{-1}\mathbf{X'V^{-1}}\mathbf{y} \tag{5.7} \end{equation}\]
and
\[\begin{equation} \hat\sigma^2_{GLS}=\frac1{n-p}(\mathbf{y-X}\boldsymbol{\hat\beta}_{GLS})'\mathbf{V^{-1}}(\mathbf{y-X}\boldsymbol{\hat\beta}_{GLS}) \tag{5.8} \end{equation}\]
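A minimal sketch of the GLS solution (5.7)-(5.8); the diagonal \(\mathbf{V}\) below is an assumed heteroscedastic structure chosen only for illustration:

```python
# A minimal sketch of GLS, (5.7)-(5.8), with known unequal variances.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
V = np.diag(np.linspace(0.5, 3.0, n))      # known, unequal error variances
eps = rng.multivariate_normal(np.zeros(n), 0.25 * V)
y = X @ np.array([1.0, 2.0]) + eps

V_inv = np.linalg.inv(V)
A = X.T @ V_inv
beta_gls = np.linalg.solve(A @ X, A @ y)   # (5.7)
r = y - X @ beta_gls
sigma2_gls = (r @ V_inv @ r) / (n - p)     # (5.8)
print(beta_gls.round(2), round(sigma2_gls, 3))
```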
5.2.2 Standardized coefficients
The value of \(\hat \beta_j\) represents the average change in the mean of \(Y\) for each one-unit change in \(x_j\), holding all other predictors fixed. However, the units of the predictors \(\mathbf{X}\) are very different. Hence, the values of the coefficients are not comparable.
Unit normal scaling or unit length scaling can convert \(\hat \beta_j\) to a dimensionless regression coefficient, called the standardized regression coefficient. Let
\[\begin{equation} \begin{split} z_{ij}=&\frac{x_{ij}-\bar x_j}{\sqrt{\sum_{i=1}^{n}(x_{ij}-\bar x_j)^2}}\\ y^{0}_{i}=&\frac{y_{i}-\bar y}{\sqrt{\sum_{i=1}^{n}(y_{i}-\bar y)^2}} \end{split} \tag{5.9} \end{equation}\]
\[\begin{equation} \begin{split} \mathbf{\hat b}=&(\mathbf{Z'Z})^{-1}\mathbf{Z'}\mathbf{y^{0}},\ \text{or}\\ \hat b_j= &\hat\beta_j\sqrt{\frac{\sum_{i=1}^{n}(x_{ij}-\bar x_j)^2}{\sum_{i=1}^{n}(y_{i}-\bar y)^2}},\ j=1,2,...(p-1),\text{ and}\\ \hat\beta_0=&\bar y - \sum_{j=1}^{p-1}\hat\beta_j\bar x_j \end{split} \tag{5.10} \end{equation}\]
Note that \(\mathbf{Z'Z}\) is the correlation matrix of the predictors.
\[\begin{equation} \mathbf{Z'Z}=\begin{bmatrix} 1 & r_{12} & r_{13} & \dots & r_{1k} \\ r_{21} & 1 & r_{23} & \dots & r_{2k} \\ r_{31} & r_{32} & 1 & \dots & r_{3k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_{k1} & r_{k2} & r_{k3} & \dots & 1 \end{bmatrix},\quad \mathbf{Z'}\mathbf{y^{0}}=\begin{bmatrix} r_{1y} \\ r_{2y} \\ r_{3y} \\ \vdots \\ r_{ky} \end{bmatrix} \tag{5.11} \end{equation}\]
where
\[\begin{equation} \begin{split} r_{ij}=&\frac{\sum_{u=1}^{n}(x_{ui}-\bar x_i)(x_{uj}-\bar x_j)}{\sqrt{\sum_{u=1}^{n}(x_{ui}-\bar x_i)^2\sum_{u=1}^{n}(x_{uj}-\bar x_j)^2}}\\ r_{jy}=&\frac{\sum_{u=1}^{n}(x_{uj}-\bar x_j)(y_{u}-\bar y)}{\sqrt{\sum_{u=1}^{n}(x_{uj}-\bar x_j)^2\sum_{u=1}^{n}(y_{u}-\bar y)^2}} \end{split} \tag{5.12} \end{equation}\]
where \(r_{ij}\) is the simple correlation between \(x_i\) and \(x_j\), and \(r_{jy}\) is the simple correlation between \(x_j\) and \(y\).
It seems that standardized regression coefficients are comparable. However, the value of \(\hat b_j\) still depends on the other predictors in the model. Therefore, comparison between different models remains problematic. A sketch of the rescaling in (5.10) follows.
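A minimal sketch, assuming two simulated predictors on very different scales; the standardized coefficients are recovered by rescaling the raw OLS estimates as in (5.10):

```python
# A minimal sketch of unit length scaling, (5.9)-(5.10).
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 2)) * [1.0, 10.0]  # predictors on different scales
y = 0.5 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(size=n)

Xc = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(Xc, y, rcond=None)[0]

ssx = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
ssy = ((y - y.mean()) ** 2).sum()
b_std = beta[1:] * np.sqrt(ssx / ssy)      # (5.10): now unit-free
print(b_std.round(3))                      # roughly equal importance
```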
5.2.3 Elasticity
Elasticity is commonly used to determine the relative importance of a variable in terms of its influence on a dependent variable. It is generally interpreted as the percent change in the dependent variable induced by a 1% change in the independent variable (McCarthy 2001).
\[e_i=\beta_i\frac{X_i}{Y_i}\approx\frac{\partial Y_i}{\partial X_i}\frac{X_i}{Y_i}\]
| Model | Marginal Effects | Elasticity |
|---|---|---|
| Linear | \(\beta\) | \(\beta\frac{X_i}{Y_i}\) |
| Log-linear | \(\beta Y_i\) | \(\beta X_i\) |
| Linear-log | \(\beta\frac{1}{X_i}\) | \(\beta\frac{1}{Y_i}\) |
| Log-log | \(\beta\frac{Y_i}{X_i}\) | \(\beta\) |
| Logit | \(\beta p_i(1-p_i)\) | \(\beta X_i(1-p_i)\) |
| Poisson | \(\beta\lambda_{i}\) | \(\beta X_i\) |
| NB | \(\beta\lambda_{i}\) | \(\beta X_i\) |
It might be a typo that Reid Ewing and Cervero (2010) use the formula \(\beta \bar X\left(1-\frac{\bar Y}{n}\right)\) for the logit model. In the Poisson and negative binomial models, \(\lambda_i=\exp[\mathbf{x}_i'\boldsymbol{\beta}]\) (Greene 2018, eq. 18-17, 21). For the truncated Poisson model, \(\delta_i=\frac{(1-P_{i,0}-\lambda_i P_{i,0})}{(1-P_{i,0})^2}\cdot\lambda_i\beta\) (Greene 2018, eq. 18-23). The hurdle model gives separate marginal (partial) effects (Greene 2018, example 18.20).
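A minimal sketch of the linear-model row of the table above, computing the elasticity at the sample means from simulated data; in a log-log specification the coefficient itself would be the elasticity:

```python
# A minimal sketch: elasticity at the means for a linear model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.uniform(1, 10, size=(200, 1))
y = 2.0 + 1.5 * X[:, 0] + rng.normal(size=200)

fit = sm.OLS(y, sm.add_constant(X)).fit()
beta = fit.params[1]
elasticity = beta * X[:, 0].mean() / y.mean()  # beta * X/Y at the means
print(round(elasticity, 3))
```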
5.2.4 Combined effects?
Some studies sum up the standardized coefficients or elasticities and call the summation the combined effect (Lee 2020). Although these values are dimensionless, this method is problematic because different model specifications and data ranges are not comparable.
5.3 Inference
5.3.1 Analysis of Variance
Analysis of Variance (ANOVA) is the fundamental approach in regression analysis. Actually, this method analyzes the variation in means rather than the variances themselves (Casella and Berger 2002, Ch. 11).
Once the linear relationship holds, the response \(\mathbf{y}\) can be decomposed to
\[\begin{equation} \begin{split} \mathbf{y'y}=&\mathbf{y'Hy}+\mathbf{y'}(\mathbf{I}-\mathbf{H})\mathbf{y}\\ \mathbf{y'y}=&\boldsymbol{\hat\beta}'\mathbf{X'}\mathbf{y}+\mathbf{y'y}-\boldsymbol{\hat\beta}'\mathbf{X'}\mathbf{y}\\ \mathbf{y'y}-n\bar y^2=&\boldsymbol{\hat\beta}'\mathbf{X'}\mathbf{y}-n\bar y^2+\mathbf{y'y}-\boldsymbol{\hat\beta}'\mathbf{X'}\mathbf{y}\\ \sum(y-\bar y)^2=&\sum(\hat y-\bar y)^2+\sum(y-\hat y)^2\\ \mathrm{SST} =& \mathrm{SSR} + \mathrm{SSE} \end{split} \tag{5.13} \end{equation}\]
where \(\mathrm{SST}\) is the Sum of Squares Total, \(\mathrm{SSR}\) is the Sum of Squares Regression, and \(\mathrm{SSE}\) is the Sum of Squares Error. \(\mathrm{SSE}=\mathbf{e'e}\) represents the part that the model cannot explain.
For the generalized least squares method, \(\mathrm{SST}=\mathbf{y'V^{-1}y}\), \(\mathrm{SSR}= \boldsymbol{\hat\beta'}\mathbf{B'z}=\mathbf{y'V^{-1}X(X'V^{-1}X})^{-1}\mathbf{X'V^{-1}}\mathbf{y}\), and \(\mathrm{SSE}=\mathrm{SST}-\mathrm{SSR}\).
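A minimal numerical check of the decomposition (5.13) on simulated data; all names are illustrative:

```python
# A minimal sketch verifying SST = SSR + SSE, (5.13).
import numpy as np

rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.8, -0.3]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
y_hat = H @ y
sst = ((y - y.mean()) ** 2).sum()
ssr = ((y_hat - y.mean()) ** 2).sum()
sse = ((y - y_hat) ** 2).sum()
print(np.isclose(sst, ssr + sse))          # True
```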
5.3.2 Hypothesis Test
- Significance of Regression
The test for significance of regression examines whether the linear relationship between the response and the predictors is adequate. The hypotheses for testing model adequacy are
\[\begin{equation} \begin{split} H_0:&\quad \beta_1 = \beta_2 = \cdots =\beta_{p-1}=0\\ H_1:&\quad \text{at least one } \beta_j \neq 0,\ j=1,2,...,(p-1)\\ \end{split} \tag{5.14} \end{equation}\]
By Theorem D14 (XX, p.90), if an \(n\times1\) random vector \(\mathbf{y}\sim N(\boldsymbol{\mu},\mathbf{I})\), then
\[\begin{equation} \mathbf{y'y} \sim \chi^2(n,\frac12\boldsymbol{\mu'\mu}) \tag{5.15} \end{equation}\]
Recall the assumption of \(\mathbf{y|x}\sim N (\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I})\).
By the additive property of \(\chi^2\) distribution,
\[\begin{equation} \begin{split} \frac{\mathrm{SSE}}{\sigma^2}=&\frac{\mathbf{y'(I-H)y}}{\sigma^2} \sim \chi^2_{(n-p)}\\ \frac{\mathrm{SSR}}{\sigma^2}=&\frac{\mathbf{y'Hy}-n\bar y^2}{\sigma^2} \sim \chi^2_{(p-1),\lambda} \end{split} \tag{5.16} \end{equation}\]
Though \(\sigma^2\) is usually unknown, it cancels in the ratio of the mean squares. By the relationship between the \(\chi^2\) and \(F\) distributions,
\[\begin{equation} F_0=\frac{\mathrm{MSR}}{\mathrm{MSE}}=\frac{\mathrm{SSR}/(p-1)}{\mathrm{SSE}/(n-p)} \sim F_{(p-1),(n-p),\lambda} \end{equation}\]
where \(\lambda\) is the non-centrality parameter, which is zero under \(H_0\). This allows one to test the hypotheses at a given significance level \(\alpha\). If the test statistic \(F_0>F_{\alpha,(p-1),(n-p)}\), then one can reject \(H_0\).
If a VMT-urban form model includes many predictors but the adjusted \(R^2\) is still low, the association between travel distance and built environment might be spurious.
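A minimal sketch of the overall \(F\) test on simulated data; statsmodels reports \(F_0\) and its \(p\)-value directly, and the sketch reproduces them from \(\mathrm{SSR}\) and \(\mathrm{SSE}\):

```python
# A minimal sketch of the significance-of-regression F test.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n, k = 100, 2                              # k = p - 1 predictors
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ np.array([1.0, 0.6, 0.0]) + rng.normal(size=n)

fit = sm.OLS(y, X).fit()
F0 = (fit.ess / k) / (fit.ssr / (n - k - 1))  # MSR / MSE
print(np.isclose(F0, fit.fvalue), round(fit.f_pvalue, 4))
```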
- Significance of Coefficients
For testing a specific coefficient, the hypothesis is
\[\begin{equation} \begin{split} H_0:&\quad \beta_j =0\\ H_1:&\quad \beta_j \neq 0\\ \end{split} \tag{5.17} \end{equation}\]
\(\boldsymbol{\hat\beta}\) is a linear combination of \(\mathbf{y}\). Based on the assumption of \(\mathbf{y|x}\sim N (\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I})\), it can be proved that \(\boldsymbol{\hat\beta}\sim N (\boldsymbol{\beta},\sigma^2(\mathbf{X'X})^{-1})\) and
\[\begin{equation} t_0=\frac{\hat\beta_j}{se(\hat\beta_j)}=\frac{\hat\beta_j}{\sqrt{\hat\sigma^2C_{jj}}} \sim t_{(n-p)} \end{equation}\]
where \(C_{jj}\) is the \(j\)th diagonal element of \((\mathbf{X'X})^{-1}\). If \(|t_0|< t_{\alpha/2,(n-p)}\), then the test fails to reject \(H_0\), and this predictor can be removed from the model. This test is called a partial or marginal test because the test statistic for \(\beta_j\) depends on all the predictors in the model.
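A minimal sketch of this \(t\) test from first principles, using the same simulated design style as the earlier sketches:

```python
# A minimal sketch of the coefficient t test, t0 = beta_j / se(beta_j).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
sigma2 = (e @ e) / (n - p)                 # (5.5)
se = np.sqrt(sigma2 * np.diag(XtX_inv))    # sqrt(sigma^2 * C_jj)
t0 = beta / se
pvals = 2 * stats.t.sf(np.abs(t0), df=n - p)
print(t0.round(2), pvals.round(3))         # last beta should be insignificant
```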
5.3.3 Confidence Intervals
The above results can also be used to construct a confidence interval for each coefficient. A \(100(1-\alpha)\%\) confidence interval for \(\beta_j\) is
\[\begin{equation} \hat\beta_j-t_{\alpha/2,(n-p)}\sqrt{\hat\sigma^2C_{jj}}\le \beta_j \le \hat\beta_j+t_{\alpha/2,(n-p)}\sqrt{\hat\sigma^2C_{jj}} \end{equation}\]
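A minimal sketch obtaining these intervals with statsmodels, whose `conf_int` implements the same \(t\)-based formula; the data are simulated:

```python
# A minimal sketch of 95% confidence intervals for the coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 100
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

fit = sm.OLS(y, X).fit()
print(fit.conf_int(alpha=0.05))            # one [lower, upper] row per beta_j
```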