Chapter 6 Common Methods
In Part I, we discussed that travel, as the response, can be measured by different types of variables such as distance, frequency, and mode choice. When fitting a travel-urban form model, the first step is to choose a suitable model genre. Chapter 5 introduces the main types of models with respect to the different travel variables.
A fitted model always produces some outcomes, but reliable outcomes require the model to be valid, adequate, and ‘healthy.’ Chapter 6 lists several potential issues which are often neglected or omitted in travel-urban form models.
Researchers never stop trying new approaches to address the shortcomings of previous studies. Chapter 7 presents some recent trends in travel-urban form studies that can offer inspiration for future work.
Scholars don’t anticipate that all studies will have the same results, especially for the association between travel and urban form. But they do want to see a more generalized result that can contribute to policy implications. The last chapter, Meta-Analysis, introduces some basic ideas and approaches of meta-analysis and how to deal with publication bias.
6.1 For Travel Distance
6.1.1 Transformations
For continuous response variables, Multiple Linear Regression (MLR) is the proper type of model and Ordinary Least Squares (OLS) is the corresponding algorithm. In travel-urban form studies, travel distance belongs to this category. Other continuous variables, such as travel time and CO\(_2\) emissions, are also common research interests in transportation.
However, these variables share a common feature: their domain is the positive real numbers rather than the whole real line. This may raise a debate over whether zero values are part of these variables or not. As mentioned in the previous chapter, a logarithm transformation converts the positive values to real numbers, and the zero values are excluded automatically. Log transformation can also address the issues of linearity and normality. Hence, the log transform of travel distance is widely used in transportation-related research and practice. A recent example is Alam, Nixon, and Zhang (2018), who use a log-log model for travel demand by transit at the MSA level in the U.S. From a mathematical perspective, the log transform serves two functions: variance stabilizing and linearizing.
- Variance Stabilizing
Equality of variance is a primary assumption of the regression model. When the variance is not constant, the least-squares estimators will not give the minimum variance. Though the estimation is still unbiased, the standard errors of the regression coefficients will be larger and the model becomes insensitive. Montgomery, Peck, and Vining (2021) give several useful variance-stabilizing transformations in Table 6.1.
Relationship | Transformation |
---|---|
\(\sigma^2\propto E[y]\) | \(y^{1/2}\) |
\(\sigma^2\propto (E[y])^2\) | \(\ln(y)\) |
\(\sigma^2\propto (E[y])^3\) | \(y^{-1/2}\) |
\(\sigma^2\propto (E[y])^4\) | \(y^{-1}\) |
A preliminary study using National Household Travel Survey (NHTS) (U.S. Department of Transportation, Federal Highway Administration 2009) data finds that both the mean and standard deviation of household daily VMT are close to 40. A standard deviation proportional to the mean implies \(\sigma^2\propto (E[y])^2\), so by Table 6.1 the logarithm of \(\mathbf{y}\) is a proper choice for variance stabilizing.
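A minimal numpy sketch (simulated data, not the NHTS sample) illustrates why the log works in this situation: the groups below are built so that the standard deviation is proportional to the mean, and the log transform makes the spread roughly constant.

```python
import numpy as np

rng = np.random.default_rng(42)

# Groups whose standard deviation is proportional to the mean
# (sigma^2 proportional to E[y]^2), mimicking the VMT situation above.
group_means = [10.0, 40.0, 160.0]
samples = {m: m * np.exp(0.25 * rng.standard_normal(5000)) for m in group_means}

# sd(y) grows with the group mean, while sd(log y) stays near 0.25.
sd_raw = {m: y.std() for m, y in samples.items()}
sd_log = {m: np.log(y).std() for m, y in samples.items()}
```

On the original scale the spread scales with the mean; on the log scale it is stabilized, which is exactly the \(\sigma^2\propto (E[y])^2\) row of Table 6.1.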
- Linearizing
Another fundamental assumption, linearity, can also be addressed by transformation. If the relationship between response and predictors is linearizable, a suitable transformation can construct an intrinsically linear model. Several common forms from Montgomery, Peck, and Vining (2021) are shown in Table 6.2.
Relationship | Transformation | Linear Form |
---|---|---|
\(y=\beta_0\exp[\beta_1x]\varepsilon\) | \(y^*=\ln(y),\varepsilon^*=\ln\varepsilon\) | \(y^*=\ln \beta_0 +\beta_1x +\varepsilon^*\) |
\(y=\beta_0+\beta_1\ln(x)+\varepsilon\) | \(x^*=\ln(x)\) | \(y=\beta_0 +\beta_1x^*+\varepsilon\) |
\(y=\beta_0x^{\beta_1}\varepsilon\) | \(y^*=\ln(y),x^*=\ln(x),\varepsilon^*=\ln\varepsilon\) | \(y^*=\ln\beta_0 +\beta_1x^* +\varepsilon^*\) |
\(y=x/((\beta_0+\varepsilon)x+\beta_1)\) | \(y^*=1/y,x^*=1/x\) | \(y^*=\beta_0 +\beta_1x^* +\varepsilon\) |
Comparing these forms, the \(\log(y)\) transformation, also called the log-linear model, gives a finite value of the response \(y\) when the predictor \(x\to 0\), while the log-log model (\(y^*=\ln(y),x^*=\ln(x)\)) drives \(y\) to zero or infinity when \(x\to 0\). This is a useful hint when choosing between log-linear and log-log models.
Moreover, the \(\log(y)\) transformation changes the scale of the error term: only one of \(\varepsilon\) and \(\ln\varepsilon\) can have a near-constant variance and a normal distribution. Therefore, residual diagnosis is still an effective way of choosing the proper form of transformation. Prior theories and experience can also help to make a proper choice. Recall Equation (3.1): both the Gravity law and Zipf’s law imply that a logarithm transformation of VMT is suitable.
- A brief discussion
In the literature, some regression models take logarithm transforms of all variables (Alam, Nixon, and Zhang 2018; S. Lee and Lee 2020), some transform only one or a subset of the variables (Perumal and Timmons 2017), while some models keep the original metrics of the data (B. (Brenda). Zhou and Kockelman 2008). Some studies choose the Tobit model to deal with the domain and normality issues (Chatman 2008). The Tobit model assumes the travel distance follows a left-censored normal distribution. That means no log transform is needed, but a special distribution must be assumed. Does an unobserved negative distance or utility exist? The Tobit model makes a very strong assumption and requires both theoretical and empirical evidence. Which approach is the proper one? Many studies don’t explain their choice and treat the log transform as a ‘tradition.’ This question should be answered by checking model adequacy, which is presented in the next chapter.
A correct choice of model type may depend on the data and research design. Obviously, the models with and without log transform have different structures and are not equivalent. The following question is: are their outcomes comparable? This question needs further investigation. If the answer is no, the relevant meta-analyses or summaries should treat them separately.
6.1.2 Estimations
- Coefficients
Estimating the effect size of built environment factors on travel is one of the major goals of travel-urban form studies. In regression analysis, the values of the coefficients represent the effect size. Least squares has been the mainstream method over the past decades. By the Gauss-Markov theorem, the OLS method itself doesn’t require the explanatory variables and response variable to follow a normal distribution. If the residuals \(\varepsilon\) satisfy \(E(\varepsilon) = 0\) and \(Var(\varepsilon) = \sigma^2\), the least-squares method gives the unbiased estimators with minimum variance.
The Ordinary Least Squares (OLS) method can be used to estimate the coefficients \(\boldsymbol{\beta}\). The dimension of \(\mathbf{X}\) is \(n\times p\), which means the data contain \(n\) observations and \(p-1\) predictors. The \(p\times1\) vector of least-squares estimators is denoted as \(\boldsymbol{\hat\beta}\), and the solution to the normal equations is \(\boldsymbol{\hat\beta}=(\mathbf{X'X})^{-1}\mathbf{X'}\mathbf{y}\) and \(\hat\sigma^2=\frac1{n-p}(\mathbf{y-X}\boldsymbol{\hat\beta})'(\mathbf{y-X}\boldsymbol{\hat\beta})\)
This requires \(\mathbf{X'X}\) to be invertible, that is, the covariates are linearly independent and \(\mathbf{X}\) has rank \(p\) (Kim 2020, V., Definition, p.22). When the observations are not independent or have unequal variances, the covariance matrix of the errors is not an identity matrix, and the regression assumption \(V[\boldsymbol{\varepsilon}]=\sigma^2\mathbf{I}\) doesn’t hold. Let \(\mathbf{V}\) be a known \(n\times n\) positive definite matrix with \(V[\boldsymbol{\varepsilon}]=\sigma^2\mathbf{V}\). The generalized least squares solution is \(\boldsymbol{\hat\beta}_{GLS}=(\mathbf{X'V^{-1}X})^{-1}\mathbf{X'V^{-1}}\mathbf{y}\) and \(\hat\sigma^2_{GLS}=\frac1{n-p}(\mathbf{y-X}\boldsymbol{\hat\beta}_{GLS})'\mathbf{V^{-1}}(\mathbf{y-X}\boldsymbol{\hat\beta}_{GLS})\)
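The two closed-form solutions above can be sketched in a few lines of numpy; the design matrix, true coefficients, and weight matrix below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Design matrix X (n x p): intercept plus two hypothetical covariates.
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# OLS: solve the normal equations (X'X) beta = X'y.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_ols
sigma2_hat = resid @ resid / (n - p)          # hat sigma^2

# GLS with a known positive definite V (here a diagonal of made-up weights).
V = np.diag(rng.uniform(0.5, 2.0, size=n))
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
```

Solving the normal equations directly (rather than explicitly inverting \(\mathbf{X'X}\)) is the numerically preferred route; both estimators recover the true coefficients here because the model is correctly specified.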
- Standardized coefficients
It is inevitable that the units of the covariates \(\mathbf{X}\) differ greatly across studies. One remedy is to standardize the values of the coefficients (Lei Zhang et al. 2012; S. Lee and Lee 2020). Unit normal scaling or unit length scaling converts \(\hat \beta_j\) to a dimensionless regression coefficient, called a standardized regression coefficient. A simple expression of the standardized coefficients is \(\hat b_j= \hat\beta_j\sqrt{\frac{\sum_{i=1}^{n}(x_{ij}-\bar x_j)^2}{\sum_{i=1}^{n}(y_{i}-\bar y)^2}}\), \(j=1,2,...,(p-1)\), with intercept \(\hat\beta_0=\bar y - \sum_{j=1}^{p-1}\hat\beta_j\bar x_j\)
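The formula above can be applied directly. In this sketch the covariate scales are hypothetical (a density-like variable in the thousands versus a distance-like variable near 3), chosen so the raw coefficients look wildly different while the standardized ones are comparable.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

x1 = rng.normal(5000.0, 1500.0, n)   # hypothetical density, people per sq. mile
x2 = rng.normal(3.0, 1.0, n)         # hypothetical distance, miles
y = 0.002 * x1 - 4.0 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Standardized coefficients: b_j = beta_j * sqrt(Sxx_j / Syy).
Syy = np.sum((y - y.mean()) ** 2)
b = [beta[j] * np.sqrt(np.sum((X[:, j] - X[:, j].mean()) ** 2) / Syy)
     for j in (1, 2)]
```

The raw \(\hat\beta_1=0.002\) looks negligible next to \(\hat\beta_2=-4\) only because of the units; after scaling, both effects are of the same order of magnitude.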
- Elasticity
As introduced in the previous chapter, elasticity is also commonly used to determine the relative importance of a variable in terms of its influence on the dependent variable. It is generally interpreted as the percent change in the dependent variable induced by a 1% change in the independent variable (McCarthy 2001), as summarized in Table 6.3. The values of elasticity are calculated by \(e_i=\beta_i\frac{X_i}{Y_i}\approx\frac{\partial Y_i}{\partial X_i}\frac{X_i}{Y_i}\)
Model | Marginal Effects | Elasticity |
---|---|---|
Linear | \(\beta\) | \(\beta\frac{X_i}{Y_i}\) |
Log-linear | \(\beta Y_i\) | \(\beta X_i\) |
Linear-log | \(\beta\frac{1}{X_i}\) | \(\beta\frac{1}{Y_i}\) |
Log-log | \(\beta\frac{Y_i}{X_i}\) | \(\beta\) |
Logit | \(\beta p_i(1-p_i)\) | \(\beta X_i(1-p_i)\) |
Poisson | \(\beta\lambda_{i}\) | \(\beta X_i\) |
NB | \(\beta \lambda_{i}\) | \(\beta X_i\) |
It might be a typo that Reid Ewing and Cervero (2010) use the formula \(\beta \bar X\left(1-\frac{\bar Y}{n}\right)\) for the Logit model. In the Poisson and Negative Binomial models, \(\lambda_i=\exp[\mathbf{x}_i'\boldsymbol{\beta}]\) (Greene 2018, eq.18–17, 21). For the truncated Poisson model, \(\delta_i=\frac{(1-P_{i,0}-\lambda_i P_{i,0})}{(1-P_{i,0})^2}\cdot\lambda_i\beta\) (Greene 2018, eq.18–23). A hurdle model gives separate marginal (partial) effects (Greene 2018, example 18.20).
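A small numerical check of the Linear and Log-log rows of Table 6.3: the elasticity \((dy/dx)(x/y)\) is computed by finite differences and compared with the analytic column. The intercepts and coefficients are arbitrary.

```python
import numpy as np

def elasticity(f, x, h=1e-6):
    # Numerical elasticity (dy/dx) * (x / y), cf. Table 6.3.
    return (f(x + h) - f(x - h)) / (2 * h) * x / f(x)

x = np.array([1.0, 5.0, 25.0])
beta0, beta = 10.0, 0.6

lin = lambda t: beta0 + beta * t       # linear model y = b0 + beta*x
loglog = lambda t: 2.0 * t ** beta     # log-log model y = b0 * x^beta

e_lin = elasticity(lin, x)             # varies with x, equals beta*x/y
e_loglog = elasticity(loglog, x)       # constant, equals beta everywhere
```

The log-log elasticity is the constant \(\beta\), while the linear model's elasticity grows with \(x\), which is why a single reported elasticity from a linear model depends on the point (usually the means) at which it is evaluated.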
- A brief discussion
When a study contains two or more travel-urban form models, the models’ responses are the same or similar. Researchers can assume that the observed VMT are randomly sampled from a large population. They often compare the models’ performance by adding or removing one or a few independent variables, and the coefficients from the best-fitted model are recommended. That means the models within a study are usually comparable. But models across studies can choose different combinations of covariates \(\mathbf{X}\), with substantial differences or uncertainties. Recall that the value of \(\hat \beta_j\) is the average change in the mean of \(\mathbf{Y}\) for each one-unit change in \(x_j\), holding all other predictors fixed. Since \(\boldsymbol{\hat\beta}\) are linear combinations of the response and covariates (Montgomery, Peck, and Vining 2021), consistent estimated coefficients across these models should not be taken for granted. Comparing the coefficients among different models requires checking whether their covariate matrices are similar. The framework of D-variables helps to make cross-study analysis.
Both standardized regression coefficients and elasticities try to make the effect sizes comparable in some way. For example, the population densities at the tract level in Virginia and DC have distinct ranges (Lei Zhang et al. 2012); standardized regression coefficients can eliminate the different ranges of the data. As another example, the unit of population density could be people per square mile (Alam, Nixon, and Zhang 2018) or people per square kilometer (Ingvardson and Nielsen 2018); elasticities can eliminate the different units of the data. Another way is to unify the units before fitting the models, but gathering the original data from different studies is a huge challenge. Which measurement of effect size is better for comparison? A simulation test may answer this question. Some studies sum up the standardized regression coefficients or elasticities of Multiple Linear Regression and call the summation the combined effect (S. Lee and Lee 2020). Although these values are dimensionless, standardized regression coefficients and elasticities are derived from the value of \(\boldsymbol{\hat\beta}\). Is the sum of partial regression coefficients meaningful? It needs some mathematical proof.
6.1.3 Inference
Point estimation in the last section tells us what the effect size is; statistical inference tells us how likely it is to be true. Most travel-urban form studies are significance-centered. In a typical paper on this topic, if the p-value of one factor is small enough, the estimate of that factor is accepted. But the p-value is just one piece of inference. Analysis of variance (ANOVA), hypothesis tests, and interval estimation provide more complete information.
- Analysis of Variance
Analysis of Variance (ANOVA) is a fundamental approach in regression analysis. Actually, this method analyzes the variation in means rather than the variances themselves (Casella and Berger 2002, Ch.11). The basic idea is
\[\begin{equation} \begin{split} \mathrm{SST} =& \mathrm{SSR} + \mathrm{SSE}\\ \sum(y-\bar y)^2=&\sum(\hat y-\bar y)^2+\sum(y-\hat y)^2 \end{split} \tag{6.1} \end{equation}\]
where \(\mathrm{SST}\) is the Sum of Squares Total, \(\mathrm{SSR}\) is the Sum of Squares Regression, and \(\mathrm{SSE}\) is the Sum of Squares Error. For the Generalized Least Squares method, \(\mathrm{SST}=\mathbf{y'V^{-1}y}\), \(\mathrm{SSR}=\mathbf{y'V^{-1}X}(\mathbf{X'V^{-1}X})^{-1}\mathbf{X'V^{-1}y}\), and \(\mathrm{SSE}=\mathrm{SST}-\mathrm{SSR}\). \(\mathrm{SSR}\) represents the part of the variance that can be explained by the model. \(\mathrm{SSE}=\mathbf{e'e}\) is the unexplained part, where \(\mathbf{e}=\mathbf{y}-\mathbf{\hat y}=\mathbf{y}-\mathbf{X}\boldsymbol{\hat\beta}\). This process is called variance decomposition.
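Equation (6.1) can be verified numerically: with an intercept in the model, the decomposition holds exactly. A minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ beta

SST = np.sum((y - y.mean()) ** 2)      # total variation
SSR = np.sum((yhat - y.mean()) ** 2)   # explained by the model
SSE = np.sum((y - yhat) ** 2)          # residual (unexplained) part
# With an intercept, SST = SSR + SSE holds exactly.
```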
- Hypothesis Test
The significance of regression tests whether the linear relationship between the response and predictors is adequate. The hypotheses for testing model adequacy are
\[\begin{equation} \begin{split} H_0:&\quad \beta_1 = \beta_2 = \cdots =\beta_{p-1}=0\\ H_1:&\quad \text{at least one } \beta_j \neq 0,\ j=1,...,(p-1)\\ \end{split} \tag{6.2} \end{equation}\]
By Theorem D14 (Kim 2020, XX, p.90), if an \(n\times1\) random vector \(\mathbf{y}\sim N(\boldsymbol{\mu},\mathbf{I})\), then \(\mathbf{y'y} \sim \chi^2(n,\frac12\boldsymbol{\mu'\mu})\). Recall the assumption \(\mathbf{y|x}\sim N (\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I})\). By the additive property of the \(\chi^2\) distribution, \(\frac{\mathrm{SSE}}{\sigma^2}=\frac{\mathbf{y'(I-H)y}}{\sigma^2} \sim \chi^2_{(n-p)}\) and \(\frac{\mathrm{SSR}}{\sigma^2} \sim \chi^2_{(p-1)}\), where \(\mathbf{H}=\mathbf{X(X'X)^{-1}X'}\) is the hat matrix. Though \(\sigma^2\) is usually unknown, by the relationship between the \(\chi^2\) and \(F\) distributions,
\[\begin{equation} F_0=\frac{MSR}{MSE}=\frac{\mathrm{SSR}/(p-1)}{\mathrm{SSE}/(n-p)} \sim F_{(p-1),(n-p),\lambda} \end{equation}\]
where \(\lambda\) is the non-centrality parameter, which is zero under \(H_0\). This allows testing the hypotheses at a given significance level \(\alpha\): if the test statistic \(F_0>F_{\alpha,(p-1),(n-p)}\), then one can reject \(H_0\).
The significance of a specific coefficient is tested with the hypotheses H\(_0\): \(\beta_j =0\) and H\(_1\): \(\beta_j \neq 0\). \(\boldsymbol{\hat\beta}\) is a linear combination of \(\mathbf{y}\). Based on the assumption \(\mathbf{y|x}\sim N (\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I})\), it can be proved that \(\boldsymbol{\hat\beta}\sim N (\boldsymbol{\beta},\sigma^2(\mathbf{X'X})^{-1})\) and
\[\begin{equation} t_0=\frac{\hat\beta_j}{se(\hat\beta_j)}=\frac{\hat\beta_j}{\sqrt{\hat\sigma^2C_{jj}}} \sim t_{(n-p)} \end{equation}\]
where \(C_{jj}\) is the element in the \(j\)th row and \(j\)th column of \((\mathbf{X'X})^{-1}\). If \(|t_0|< t_{\alpha/2,(n-p)}\), the test fails to reject \(H_0\), and this predictor can be removed from the model. This is called a partial or marginal test because the test statistic for \(\beta_j\) depends on all the other predictors in the model.
- Confidence Intervals
The above results can also be used to construct a confidence interval for each coefficient. A \(100(1-\alpha)\%\) confidence interval for \(\beta_j\) is \(\hat\beta_j-t_{\alpha/2,(n-p)}\sqrt{\hat\sigma^2C_{jj}}\le \beta_j \le \hat\beta_j+t_{\alpha/2,(n-p)}\sqrt{\hat\sigma^2C_{jj}}\).
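A sketch of the \(t\) statistics and confidence intervals from the formulas above, on simulated data (the coefficients are made up; \(x_2\)'s true coefficient is set to zero so its test should usually fail to reject \(H_0\)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([2.0, 1.0, 0.0])   # x2 is truly irrelevant
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)        # C = (X'X)^{-1}
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
sigma2 = resid @ resid / (n - p)

se = np.sqrt(sigma2 * np.diag(XtX_inv))     # se(beta_j) = sqrt(sigma^2 C_jj)
t0 = beta / se
t_crit = stats.t.ppf(0.975, df=n - p)       # alpha = 0.05, two-sided

ci = np.column_stack([beta - t_crit * se, beta + t_crit * se])
significant = np.abs(t0) > t_crit
```

The interval for \(\beta_1\) excludes zero while conveying the estimate's uncertainty, which is the extra information a CI carries beyond a bare p-value.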
- A brief discussion
This section shows that statistical inference relies on some probability distributions. Hence, it requires more conditions than the least squares method alone. Checking model adequacy is a necessary step before reaching any conclusion.
ANOVA is a worthwhile method but is rarely seen in travel-urban form studies. Serbanica and Constantin (2017) use a two-way ANOVA to “compare the effects of country group, population growth and city size on green performance.” B. W. Lane (2011) applies multivariate analysis of covariance (MANCOVA) to examine how “the variation in proportional changes in driving is related to variation in the covariates.” This study demonstrates that coefficients are not the only measurement of influencing factors. ANOVA may explain how the effects appear or disappear at various spatial scales. When the variance structure changes across scales, observing the dynamics of the \(SSR\) of D-variables is an interesting topic. Gelman (2005) shows that ANOVA is important for hierarchical models.
Statisticians have shown that the p-value by itself is not sufficient evidence for a hypothesis test (Hubbard and Lindsay 2008; Halsey et al. 2015) and should not be the only criterion for statistical inference (Wasserstein and Lazar 2016). Confidence intervals (CI) are a better measurement, which can exclude the values unlikely to exist in the population (Ranstam 2012). They represent the uncertainty better than the standard error, because \(se\) also depends on the sample size. Most travel-urban form studies provide the \(se\) values of the estimates; only a few give confidence intervals. This calls for empirical or simulation studies to show how CIs can tell more about the effect size.
6.2 For Trip Generation
The frequency of trips or ridership is a count variable. The observed counts of trips \(Y_1,...,Y_n\) are random variables aggregated over differing numbers of individuals or households, with support \(Y\in\{0,1,2,...\}\). The trips, as events, occur randomly within a day or another time period. A usual assumption is that count data follow a Poisson or negative binomial distribution.
- Poisson Regression
The probability mass function (pmf) of the Poisson distribution, together with its canonical form, is
\[\begin{equation} Pr(Y=y) = \frac{e^{-\mu}\mu^y}{y!}=\exp[\log(\mu) y-\mu](y!)^{-1} \end{equation}\]
So the Poisson distribution has a simple link function:
\[\begin{equation} \begin{split} g(\mu_i)&=\log\mu_i=\eta_i=\mathbf{x}'\boldsymbol{\beta}\\ g^{-1}(\eta_i)&=\exp[\eta_i]=\mu_i=\exp[\mathbf{x}'\boldsymbol{\beta}]\\ \end{split} \tag{6.3} \end{equation}\]
The Poisson distribution has the property \(E[y_i]=Var[y_i]=\mu_i\). By taking the log transform, the non-negative parameter space maps to the real line; the transform also converts the multiplicative relationship among predictors into an additive one. The value of the coefficient \(\beta_j\) means that a unit change in the predictor \(x_j\) leads to a \(\beta_j\) change in the log of the mean of the response. Another interpretation is that the mean of the response is multiplied by \(\exp[\beta_j]\) for each unit change in \(x_j\). The iteratively reweighted least squares (IRLS) method can solve the log-linear Poisson model. The key correction step is \(\hat{\eta_i}^{(1)}=\hat{\eta_i}^{(0)} + \frac{y_i-\hat\mu_i^{(0)}}{\hat\mu_i^{(0)}}\), and the diagonal weight matrix is \(w_{ii}=\hat\mu_i^{(0)}\)
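The IRLS steps above can be sketched in numpy for the log-linear Poisson model; the data are simulated and, in practice, one would call a GLM routine rather than hand-roll the loop.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true))     # simulated trip counts

# IRLS for the log-linear Poisson model.
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu                # working response (correction step)
    w = mu                                 # diagonal weights w_ii = mu_i
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))

mu = np.exp(X @ beta)
```

At convergence the score equations \(\mathbf{X}'(\mathbf{y}-\boldsymbol{\mu})=\mathbf{0}\) hold, which is the defining property of the Poisson maximum likelihood fit.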
- Negative Binomial Regression
A restriction of the Poisson distribution is that the mean and variance should be equal or proportional. In many count data, their inequality is called overdispersion, which is very common in trip frequency data. It could arise because daily trips are not independent and homogeneous: they are a mixture of several different purposes or several persons in a household. Adding a new parameter, the mixture rate, can construct a Poisson mixture model and address the overdispersion. Suppose an unobserved random variable follows a gamma distribution \(Z\sim Gamma(r,1/r)\), where \(r\) is the shape parameter. The pdf is
\[\begin{equation} f(z)=\frac{r^r}{\Gamma(r)}z^{r-1}\exp[-rz],\quad z>0 \end{equation}\]
It has \(E[Z]=1\) and \(Var[Z]=1/r\). Then a mixture model can be denoted as a conditional distribution \(Y|Z\sim Pois(\mu Z)\) for some \(\mu>0\), with \(E[Y]=\mu\) and \(Var[Y]=\mu+\frac{\mu^2}{r}\). It is called the Poisson-Gamma distribution, which can represent the inequality of mean and variance. Let \(r\) represent the given number of successes and \(y\) the observed number of failures in a sequence of independent Bernoulli trials; then the success probability is \(p=r/(r+\mu)\). Recalling that \(\Gamma(r+y)=\int_0^\infty z^{r+y-1}\exp[-z]dz\), it can be proved that \(Y\) follows a negative binomial distribution.
The quasi-Poisson model is another simple way of handling overdispersion. It introduces a dispersion parameter \(\phi\), so that the model has \(Var[Y|\eta]=\phi\mu\) with \(\phi>1\). The dispersion parameter is estimated from the Pearson statistic, \(\hat\phi=\frac{1}{n-p}\sum\frac{(Y_i-\hat\mu_i)^2}{\hat\mu_i}\).
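The dispersion estimate \(\hat\phi\) can be illustrated with simulated overdispersed counts drawn from the Poisson-Gamma mixture above. For simplicity an intercept-only model is fitted, so \(\hat\mu_i=\bar y\) and \(p=1\); all numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000

# Overdispersed counts: a Poisson-Gamma (negative binomial) mixture.
r = 2.0                                          # gamma shape; Var[Z] = 1/r
z = rng.gamma(shape=r, scale=1.0 / r, size=n)    # E[Z] = 1
mu = 3.0
y = rng.poisson(mu * z)                          # Var[Y] = mu + mu^2/r = 7.5

# Intercept-only Poisson fit (mu_hat = ybar), Pearson dispersion estimate.
mu_hat = y.mean()
phi_hat = np.sum((y - mu_hat) ** 2 / mu_hat) / (n - 1)
```

Here the theoretical dispersion is \(\phi = Var[Y]/\mu = 7.5/3 = 2.5\), so \(\hat\phi\) well above one flags the overdispersion that an equidispersed Poisson model would miss.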
- Zero-inflated and Hurdle Models
In trip frequency data, there are often more no-trip observations than the Poisson and negative binomial distributions expect. One problem in previous planning studies is manipulating the data by replacing zero values with one (Reid Ewing et al. 2015). But the meaning and mechanism of ‘no participation’ are essentially different from those of the intensity of participation (Greene 2018, 18.4.8).
Zero-inflated model and hurdle model can address this issue.
Both of them assume the data arise from two mechanisms.
For example, Q. Zhang et al. (2019) use the Zero-inflated Negative Binomial model to examine the influences of built environment on trip generation.
In the zero-inflated Poisson/negative binomial model, both mechanisms generate zero observations.
The first mechanism produces the excess zeros with \(\pi_i=Pr(Y_i=0)\); the remaining zero and positive values are generated by the second mechanism \(f(y;\mathbf{x}_i,\boldsymbol\beta)\), a Poisson or negative binomial pmf.
\[\begin{equation} f(y_i;\pi_i,\mu_i)=\begin{cases}\pi_i+(1-\pi_i)f(0;\mu_i)&y_i=0\\ (1-\pi_i)f(y_i;\mu_i)&y_i>0\\ \end{cases} \tag{6.4} \end{equation}\]
The two link functions are
\[\begin{equation} \begin{split} g_0(\pi_i)=&\mathbf{w}'_i\boldsymbol\gamma\\ g_1(\mu_i)=&\mathbf{x}'_i\boldsymbol\beta\\ \end{split} \end{equation}\]
Note that the two mechanisms can have different covariates and coefficients. But both \(\pi_i\) and \(\mu_i\) appear in the two equations and have to be evaluated jointly; the Newton-Raphson algorithm or the EM algorithm can deal with this problem.
Hurdle models are another type of two-step model. They assume that all zero observations are generated by the first mechanism, so the first mechanism does not depend on \(\mathbf{x}_i\) and \(\boldsymbol\beta\). A challenge is that the ordinary Poisson or negative binomial distribution does contain zero values; a zero-truncated distribution addresses this issue.
\[\begin{equation} f(y_i;\pi_i,\mu_i)=\begin{cases}\pi_i&y_i=0\\ (1-\pi_i)\frac{f(y;\mu_i)}{1-f(0;\mu_i)}&y_i>0\\ \end{cases} \tag{6.5} \end{equation}\]
where \(f(0;\mu_i)= \exp[-\mu_i]\) in Poisson model and \(f(0;\mu_i)= (\frac{r}{\mu_i+r})^r\) in negative binomial model.
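Equation (6.5) with a Poisson second mechanism can be written out directly; by construction the probabilities sum to one, with \(\pi_0\) carried entirely by the zeros. The values of \(\pi_0\) and \(\mu\) below are arbitrary.

```python
from math import exp, factorial

def hurdle_poisson_pmf(y, pi0, mu):
    """Hurdle-Poisson pmf: P(Y=0) = pi0; positives follow a
    zero-truncated Poisson, rescaled by (1 - pi0)."""
    if y == 0:
        return pi0
    pois = exp(-mu) * mu ** y / factorial(y)
    return (1 - pi0) * pois / (1 - exp(-mu))   # truncation divides by 1 - f(0; mu)

pi0, mu = 0.4, 2.5
probs = [hurdle_poisson_pmf(y, pi0, mu) for y in range(60)]
```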
- A brief discussion
Because the logarithm of the response is similar to a log-linear model, Choi et al. (2012) take the log transform on both sides of the equation and compare the performance of Poisson regression and log-log models (which they call the multiplicative model). They think “the Poisson model … reflects the varying elasticity of the dependent variable according to the level of independent variables.” They propose that the log-log model is better because of its greater F-statistic and adjusted \(R^2\) than the Poisson model. But a test statistic is not a good measurement for comparing models with different structures, and \(R^2\) is only one piece of evidence for goodness of fit. L. Wang and Currans (2018)’s study shows that the predictions of a log-transform model may have significant bias when detransforming. The comparison between log-transform and Poisson models needs to go back to the properties of the data and the underlying mechanism. The detransformation bias may be due to an inappropriate model structure. Although the two types of models have similar forms of equation, they perform two distinct types of randomization. A convincing comparison still calls for adequacy checking.
There are three conditions for a Poisson process. First, as a stochastic process, the probability of at least one event happening in a time interval is proportional to the length of the interval. Second, the probability of two or more events happening in a small time interval is close to zero. Third, the counts of trips in disjoint time intervals should be independent. In real life, a traveler cannot make two trips at the same time, so the second condition holds. But a household with two workers and two students might make four trips at the same time every morning. Hence, individual count data are more valid than household data when using the Poisson distribution. The independence of counts among differing time intervals may not be valid either: daily trips often belong to a trip chain and require more information at a micro level.
Negative binomial regression has the same link function (Equation (6.3)) as Poisson models. For its advantage in addressing overdispersion, more travel-urban form studies choose negative binomial regression than Poisson models (Reid Ewing et al. 2015). Some studies found that the estimated coefficients are similar in the two types of models (Chatman 2008). Dill et al. (2013) also report that count data models have no obvious advantage in prediction. Research on interval estimates may disclose their difference.
Reid Ewing et al. (2015) were the first to apply hurdle models in a travel-urban form study, though their article does not test the advantage of the hurdle model. The two-step models can better express the decision process discussed in Part I. It is worth comparing the performance of the hurdle, Tobit, and replacing-0-with-1 models in the future.
6.3 For Mode Choice
Mode choice is a classical topic in travel studies. One can choose to take a trip or not, to drive or to use active modes. These discrete response variables cannot be denoted by continuous variables. Generalized Linear Models (GLM) allow the response to follow more general distributions than the normal. GLMs (Equation (1.2)) include three components. The systematic component \(\eta=\mathbf{X}\boldsymbol\beta\) has a similar form to ordinary linear models but without the error term, where \(\boldsymbol\beta\) are unknown coefficients. The random component \(E[Y]=\mu\) specifies the probability distribution of \(Y\), which can have a pdf or pmf from an exponential family. The link function \(g(\cdot)\) connects the systematic and random components together.
- Binomial Response
When a traveler chooses to make a trip or not, the decision follows a Bernoulli distribution. The probability is denoted by \(Pr(\text{choice}=\text{Yes})=\pi\) and \(Pr(\text{choice}=\text{No})=1-\pi\). For \(n\) decisions under the same \(\pi\), let \(Y\) represent the count of choosing ‘Yes,’ which follows a binomial distribution \(Bin(n,\pi)\). For many travelers with different \(\pi\), one has \(Y_i\sim Bin(n_i,\pi_i)\), that is, binary response data. The total number of observations is \(N=\sum_{i=1}^n n_i\). The pmf of the binomial distribution is
\[\begin{equation} Pr(Y_i = y_i) = {{n_i}\choose{y_i}} \pi_i^{y_i} (1-\pi_i)^{n_i-y_i} \end{equation}\]
It is clear that the random component is \(E[y_i]=\pi_i\) and the systematic component is \(\eta_i=\mathbf{X}'_i\boldsymbol\beta\). \(\pi\) is a probability between zero and one, but the log odds of success \(\eta_i\) can take any real number. The canonical form of the binomial distribution is
\[\begin{equation} Pr(Y_i = y_i) = \exp\left[\log(\frac{\pi_i}{1-\pi_i})y_i+n_i\log(1-\pi_i)\right]{{n_i}\choose{y_i}} \end{equation}\]
The canonical link function in logit models transforms the probability to the range of real numbers. In this one-to-one mapping, a probability \(\pi_i>1/2\) gives a positive \(\eta_i\), and a negative \(\eta_i\) corresponds to a \(\pi_i\) less than one half.
\[\begin{equation} \begin{split} g(\pi_i)&=\log\frac{\pi_i}{1-\pi_i}=\eta_i\quad\text{Logit function}\\ g^{-1}(\eta_i)&=\frac{\exp[\eta_i]}{1+\exp[\eta_i]}=\pi_i\quad\text{Logistic function}\\ \end{split} \tag{6.6} \end{equation}\]
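The logit/logistic pair in Equation (6.6) is a one-to-one mapping that is easy to verify numerically; the probabilities below are arbitrary.

```python
import numpy as np

def logit(p):
    # Link function: log odds, maps (0, 1) onto the whole real line.
    return np.log(p / (1 - p))

def logistic(eta):
    # Inverse link: maps any real eta back to a probability in (0, 1).
    return np.exp(eta) / (1 + np.exp(eta))

p = np.array([0.1, 0.5, 0.9])
eta = logit(p)   # negative, zero, positive, matching the mapping above
```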
- Multinomial Response
For a categorical response such as travel mode choice, a traveler has more than two alternatives, including driving, transit, biking, and walking. Generalized logistic regression can address such polychotomous data. The mode choice \(Y_i\) follows the multinomial distribution with \(J\) alternatives. Denote the probability that the \(i\)th traveler chooses the \(j\)th mode by \(\pi_{ij}=Pr(Y_i=j)\). The pmf of the multinomial distribution is
\[\begin{equation} Pr(Y_{i1}=y_{i1}, ..., Y_{iJ}=y_{iJ})= {n_i \choose y_{i1},..., y_{iJ} } \pi_{i1}^{y_{i1}} \cdots \pi_{iJ}^{y_{iJ}} \end{equation}\]
When the data exclude people without trips, the several modes are mutually exclusive and exhaust all observations. That is, \(\sum_{j=1}^J\pi_{ij}=1\) for each \(i\). Once \(J-1\) parameters are evaluated, the last one is determined automatically, i.e. \(\pi_{iJ}=1-\pi_{i1}-\cdots-\pi_{i,J-1}\). The random component is \(\mu_{ij}=n_i\pi_{ij}\) and the systematic component is \(\eta_{ij}=\mathbf{X}_i'\boldsymbol\beta_j\)
\[\begin{equation} \begin{split} g^{-1}(\eta_{ij})&=\frac{\exp[\eta_{ij}]}{\sum_{k=1}^J\exp[\eta_{ik}]}=\pi_{ij}\\ g(\pi_{ij})&=\log\frac{\pi_{ij}}{\pi_{iJ}}=\eta_{ij}\\ \end{split} \tag{6.7} \end{equation}\]
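The inverse link in Equation (6.7) is the softmax function. A small sketch with made-up linear predictors for \(J=4\) modes, taking the last category as the baseline (\(\eta_{iJ}=0\)) so that the log odds against the baseline recover the linear predictors:

```python
import numpy as np

def inv_mlogit(eta):
    """Inverse link of Equation (6.7): eta_j -> pi_j via softmax."""
    e = np.exp(eta - eta.max())   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical linear predictors for 4 modes; the last is the baseline.
eta = np.array([1.2, -0.3, 0.5, 0.0])
pi = inv_mlogit(eta)              # choice probabilities, summing to one
```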
McFadden (1973) proposed the Discrete Choice Model, which is also called the multinomial/conditional logit model. This model introduces \(U_{ij}\) as the random utility of the \(j\)th choice. Then, based on utility maximization theory,
\[\begin{equation} \pi_{ij}=Pr(Y_i=j)=Pr(\max(U_{i1},...,U_{iJ})=U_{ij}) \end{equation}\]
Here \(U_{ij}=\eta_{ij}+\varepsilon_{ij}\), where the error term follows a standard Type I extreme value distribution. The reason is that the difference between two independent extreme value distributions has a logistic distribution; hence, the model can still be solved by logit models. The expected utility may also depend on the characteristics of the alternatives rather than those of individuals. Let \(\mathbf{Z}_j\) represent the characteristics of the \(j\)th alternative; one then has \(\eta_{ij}=\mathbf{Z}_j'\boldsymbol\gamma\). Combining the two sources of utility gives a general form \(\eta_{ij}=\mathbf{X}_i'\boldsymbol\beta_j+\mathbf{Z}_j'\boldsymbol\gamma\)
- A brief discussion
Multinomial logistic models are widely used for mode choice questions. An alternative is the multinomial probit model, which assumes the error terms \(\boldsymbol\varepsilon\sim MVN(\mathbf{0},\Sigma)\), where \(\Sigma\) is a correlation matrix. A related application can be found in Chakour and Eluru (2016).
Logistic models are not robust when the probability \(\pi\) is close to zero or one. For mode choice questions, the proportions of walking, biking, and transit are much smaller than that of driving. In logistic models, the goal is to estimate the unknown vector of parameters \(\boldsymbol\beta\) for the known covariates \(\mathbf{X}_i\). But in the systematic component, \(\eta_i\) is unobserved, so Ordinary Linear Regression doesn’t work in this case. Fortunately, the link function in logit models has a closed form, and the Iteratively Reweighted Least Squares (IRLS) method (Lawson 1961) can get the solution.
In the IRLS algorithm, when the probability and sample size of one mode are small (e.g. \(\hat\pi_i=0.05\)), it is assigned a small weight. “The standard error is artificially compressed, which leads us to overestimate the precision of the proportion estimate.” (Lipsey and Wilson 2001, chap. 3) Sometimes researchers can relieve this issue by combining several modes, such as merging walking and biking into an active mode. Otherwise, one has to look for other algorithms, such as data augmentation by Markov chain Monte Carlo (MCMC), to get more stable estimates.