Chapter 11 Generalized Linear Models for Travel Choice
Generalized Linear Models (GLM) relax the restriction of normality. A GLM allows the response to follow distributions more general than the normal, which is essential for research with discrete response variables, such as mode choice or trip counts.
A GLM (equation (1.2)) includes three components. The systematic component \eta=\mathbf{X}\boldsymbol\beta has a form similar to the ordinary linear model but without an error term, where \boldsymbol\beta is a vector of unknown coefficients. The random component E[Y]=\mu specifies the probability distribution of Y, which can have a pdf or pmf from an exponential family. The link function g(\cdot) connects the systematic component and the random component.
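As a concrete illustration, the three components map directly onto the GLM interface in statsmodels. The sketch below is a minimal example on simulated data; the sample size and coefficient values are arbitrary assumptions, not from any real travel survey.

```python
# A minimal sketch mapping the three GLM components onto statsmodels.
# The simulated data and coefficient values are arbitrary assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))        # systematic component: eta = X beta
eta = X @ np.array([-0.5, 1.0, 0.8])
y = rng.binomial(1, np.exp(eta) / (1 + np.exp(eta)))  # random component: Y ~ Bernoulli(pi)

# The family argument chooses the exponential-family distribution; its
# canonical link g(.) ties the systematic component eta to mu = E[Y].
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.params)
```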
11.1 Binomial Response Models
When a traveler chooses to make a trip or not, the decision follows a Bernoulli distribution. Denote the probability Pr(\text{choice}=\text{Yes})=\pi and Pr(\text{choice}=\text{No})=1-\pi. For n decisions under the same \pi, let Y represent the count of choosing 'Yes'; then Y follows a binomial distribution Bin(n,\pi). For many travelers with different \pi, one has Y_i\sim Bin(n_i,\pi_i), which is binary response data. The total number of observations is N=\sum_{i=1}^n n_i. The pmf of the binomial distribution is
\begin{equation} Pr(Y_i=y_i)={n_i\choose y_i}\pi_i^{y_i}(1-\pi_i)^{n_i-y_i} \end{equation}
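For a quick numeric check of this pmf, the short sketch below compares a direct computation with scipy; the values of n_i, \pi_i, and y_i are arbitrary.

```python
# A small check of the binomial pmf above; the numbers are arbitrary.
from math import comb
from scipy.stats import binom

n_i, pi_i, y_i = 10, 0.3, 4
manual = comb(n_i, y_i) * pi_i**y_i * (1 - pi_i)**(n_i - y_i)
print(manual, binom.pmf(y_i, n_i, pi_i))   # both ~ 0.2001
```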
11.1.1 Logit Models
The random component is E[y_i]=\pi_i and the systematic component is \eta_i=\mathbf{X}_i'\boldsymbol\beta. The probability \pi lies between zero and one, but the log odds of success \eta_i can take any real value. The canonical form of the binomial distribution is
\begin{equation} Pr(Y_i=y_i)=\exp\left[\log\left(\frac{\pi_i}{1-\pi_i}\right)y_i+n_i\log(1-\pi_i)\right]{n_i\choose y_i} \end{equation}
The canonical link function in logit models transforms the probability to the range of real numbers. In this one-to-one mapping, a probability \pi_i>1/2 gives a positive \eta_i, and a negative \eta_i corresponds to a \pi_i less than one half.
\begin{equation} \begin{split} g(\pi_i)&=\log\frac{\pi_i}{1-\pi_i}=\eta_i\quad&\text{Logit function}\\ g^{-1}(\eta_i)&=\frac{\exp[\eta_i]}{1+\exp[\eta_i]}=\pi_i\quad&\text{Logistic function}\\ \end{split} \end{equation}
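A quick numeric check of this one-to-one mapping (the probability values are arbitrary):

```python
# Check that logit and logistic are inverses, and that pi > 1/2 maps
# to a positive eta. The probability values are arbitrary.
import numpy as np

def logit(p):           # g(pi) = log(pi / (1 - pi))
    return np.log(p / (1 - p))

def logistic(eta):      # g^{-1}(eta) = exp(eta) / (1 + exp(eta))
    return np.exp(eta) / (1 + np.exp(eta))

print(logit(0.7))             # positive, since 0.7 > 1/2
print(logit(0.3))             # negative, since 0.3 < 1/2
print(logistic(logit(0.3)))   # 0.3 recovered
```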
The goal is to estimate the unknown vector of parameters \boldsymbol\beta for the known covariates \mathbf{X}_i. But in the systematic component, \eta_i is unobserved, so ordinary linear regression does not work in this case. Fortunately, the link function in logit models has a closed form, and the iteratively reweighted least squares (IRLS) method (Lawson 1961) can obtain the solution. One iteration of this approach includes four steps.
The first step starts from the current estimated coefficients \hat\beta^{(0)}. The current estimates are then \hat\eta_i^{(0)}=\mathbf{x}_i'\hat\beta^{(0)} and \hat\pi_i^{(0)}=\frac{\exp[\hat\eta_i^{(0)}]}{1+\exp[\hat\eta_i^{(0)}]}. But the current \hat\eta_i^{(0)} differs from the true \eta_i.
The second step updates the current estimates by adding a correction term. Using the first two terms of the Taylor series,
\begin{equation} \hat\eta_i^{(1)}=\hat\eta_i^{(0)}+(y_i-\hat\mu_i^{(0)})\cdot\frac{d\hat\eta_i^{(0)}}{d\hat\mu_i^{(0)}} \end{equation}
Since E[Y_i]=\mu_i=n_i\pi_i, it is also easy to get
\begin{equation} \frac{d\eta_i}{d\mu_i}=\frac{1}{n_i}\cdot\frac{d\eta_i}{d\pi_i}=\frac{1}{n_i}\left(\frac{1}{\pi_i}+\frac{1}{1-\pi_i}\right)=\frac{1}{n_i\pi_i(1-\pi_i)} \end{equation}
Therefore,
\begin{equation} \hat\eta_i^{(1)}=\hat\eta_i^{(0)}+\frac{y_i-n_i\hat\pi_i^{(0)}}{n_i\hat\pi_i^{(0)}(1-\hat\pi_i^{(0)})} \end{equation}
The third step is to calculate the diagonal weight matrix \mathbf{W} in the Fisher scoring algorithm. It is known that the binomial distribution has Var[Y_i]=n_i\pi_i(1-\pi_i).
\begin{equation} w_{ii}=\left[Var[Y_i]\left(\frac{d\eta_i}{d\mu_i}\right)^2\right]^{-1}=n_i\hat\pi_i(1-\hat\pi_i) \end{equation}
The final step improves the estimate \hat\beta^{(1)} using the weighted least-squares method
\begin{equation} \hat\beta^{(1)}=(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\hat\eta^{(1)} \end{equation}
The four steps repeat until the procedure converges, and the result gives
\begin{equation} Var[\hat\beta]=(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1} \end{equation}
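To make the four steps concrete, the following is a minimal NumPy sketch of IRLS for the binomial logit model. The inputs X (design matrix), y (success counts), and n (trial counts) are hypothetical placeholders.

```python
# A minimal IRLS sketch for the binomial logit model, using NumPy only.
# X, y, n are hypothetical inputs: design matrix, success counts, trial counts.
import numpy as np

def irls_logit(X, y, n, tol=1e-8, max_iter=25):
    beta = np.zeros(X.shape[1])                     # step 1: current coefficients
    for _ in range(max_iter):
        eta = X @ beta                              # current linear predictor
        pi = np.exp(eta) / (1 + np.exp(eta))        # logistic inverse link
        mu = n * pi
        # step 2: working response via the first-order Taylor correction
        z = eta + (y - mu) / (n * pi * (1 - pi))
        # step 3: Fisher-scoring weights w_ii = n_i * pi_i * (1 - pi_i)
        W = n * pi * (1 - pi)
        # step 4: weighted least-squares update of beta
        beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    cov = np.linalg.inv(X.T @ (W[:, None] * X))     # Var[beta-hat]
    return beta, cov
```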
11.2 Multinomial Response Models
11.2.1 Multinomial Logit Models
For a categorical response such as travel mode choice, a traveler has more than two alternatives, including driving, transit, biking, and walking. Generalized logistic regression can address such polychotomous data.
The mode choice Y_i follows the multinomial distribution with J alternatives. Denote the probability that the ith traveler chooses the jth mode as
\begin{equation} \pi_{ij}=Pr(Y_i=j) \end{equation}
The pmf of the multinomial distribution is
\begin{equation} Pr(Y_{i1}=y_{i1},...,Y_{iJ}=y_{iJ})={n_i\choose y_{i1},...,y_{iJ}}\pi_{i1}^{y_{i1}}\cdots\pi_{iJ}^{y_{iJ}} \end{equation}
When the data exclude people who make no trip, the modes exhaust all observations and are mutually exclusive. That is, \sum_{j=1}^J\pi_{ij}=1 for each i. Once J-1 parameters are estimated, the remaining one is determined automatically; that is, \pi_{iJ}=1-\pi_{i1}-\cdots-\pi_{i,J-1}. The random component is \mu_{ij}=n_i\pi_{ij}.
\begin{equation} \begin{split} g(\pi_{ij})&=\log\frac{\pi_{ij}}{\pi_{iJ}}=\eta_{ij}\\ g^{-1}(\eta_{ij})&=\frac{\exp[\eta_{ij}]}{\sum_{k=1}^J\exp[\eta_{ik}]}=\pi_{ij}\\ \end{split} \end{equation}
The systematic component is \eta_{ij}=\mathbf{X}_i'\boldsymbol\beta_j.
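In practice, a multinomial logit model can be fit with statsmodels' MNLogit, which estimates one coefficient vector \boldsymbol\beta_j per non-reference alternative. The sketch below is a hedged example; the file 'trips.csv' and its columns (mode, income, age) are hypothetical placeholders.

```python
# A hedged sketch of a multinomial logit fit with statsmodels' MNLogit.
# The file 'trips.csv' and its columns (mode, income, age) are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("trips.csv")                 # hypothetical survey data
y = df["mode"]                                # categorical response: drive/transit/bike/walk
X = sm.add_constant(df[["income", "age"]])    # individual characteristics X_i

fit = sm.MNLogit(y, X).fit()
print(fit.summary())   # one beta_j column per non-reference mode
```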
11.2.2 Discrete Choice Models
McFadden (1973) proposed the discrete choice model, also called the multinomial/conditional logit model. This model introduces U_{ij} as the random utility of the jth choice. Then, based on utility maximization theory,
\begin{equation} \pi_{ij}=Pr(Y_i=j)=Pr\left(\max(U_{i1},...,U_{iJ})=U_{ij}\right) \end{equation}
Here U_{ij}=\eta_{ij}+\varepsilon_{ij}, where the error term follows a standard Type I extreme value distribution. Because the difference between two independent extreme value distributions has a logistic distribution, the model can still be solved by logit models.
In the conditional logit model, the expected utility depends on the characteristics of the alternatives rather than those of individuals. Let \mathbf{Z}_j represent the characteristics of the jth alternative; one has \eta_{ij}=\mathbf{Z}_j'\boldsymbol\gamma.
Combining the two sources of utility, a general form is
\begin{equation} \eta_{ij}=\mathbf{X}_i'\boldsymbol\beta_j+\mathbf{Z}_j'\boldsymbol\gamma \end{equation}
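The sketch below computes choice probabilities from this combined utility with NumPy. All arrays (travelers, modes, covariates, coefficients) are simulated and purely illustrative.

```python
# A minimal sketch of choice probabilities from the combined utility
# eta_ij = X_i' beta_j + Z_j' gamma. All arrays here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n, J, p, q = 5, 4, 2, 3                # travelers, modes, X-covariates, Z-attributes
X = rng.normal(size=(n, p))            # individual characteristics X_i
beta = rng.normal(size=(p, J))         # mode-specific coefficients beta_j
Z = rng.normal(size=(J, q))            # alternative attributes Z_j (e.g. cost, time)
gamma = rng.normal(size=q)             # generic coefficients gamma

eta = X @ beta + Z @ gamma             # eta_ij, shape (n, J), broadcasts over travelers
pi = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)   # logit probabilities
print(pi.round(3))                     # each row sums to one
```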
In reality, a travel survey can keep the observations independent by random sampling, but the available travel modes are not independent of each other. The multinomial/conditional probit model can deal with this issue better by assuming the error terms \boldsymbol\varepsilon\sim MVN(\mathbf{0},\Sigma), where \Sigma is a correlation matrix.
11.3 Poisson Response/Loglinear Models
The frequency of trips is count data. The observed trip counts Y_1,...,Y_n are random variables aggregated over differing numbers of individuals or households, with support y\in\{0,1,2,...\}. Trips, as events, occur randomly within a day or another time period. A usual assumption is that count data follow a Poisson distribution.
There are three conditions for a Poisson process. Firstly, as a stochastic process, the probability of at least one event happening in a time interval is proportional to the length of the interval. Secondly, the probability of two or more events happening in a small time interval is close to zero. Finally, the counts of trips in disjoint time intervals should be independent. In real life, a traveler cannot make two trips at the same time, so the second condition holds. But a household with two workers and two students might generate four trips at the same time every morning. Hence, individual-level count data are more valid than household-level data when using the Poisson distribution. The independence of counts across different time intervals may not hold either: daily trips often form a trip chain, which requires more information at a micro level.
The pmf and its canonical form are
\begin{equation} Pr(Y=y) = \frac{e^{-\mu}\mu^y}{y!}=\exp[\log(\mu) y-\mu](y!)^{-1} \end{equation}
So the Poisson distribution has a simple link function:
\begin{equation} \begin{split} g(\mu_i)&=\log\mu_i=\eta_i\\ g^{-1}(\eta_i)&=\exp[\eta_i]=\mu_i\\ \end{split} \tag{11.3} \end{equation}
The Poisson distribution has the property E[y_i]=Var[y_i]=\mu_i. The systematic component is
\begin{equation} \begin{split} \log(\mu_i)=&\mathbf{x}_i'\boldsymbol{\beta}\\ \mu_i=&\exp[\mathbf{x}_i'\boldsymbol{\beta}]\\ \end{split} \end{equation}
The log transform maps the non-negative parameter space to the real numbers. It also converts the multiplicative relationship among predictors to an additive one. The coefficient \beta_j means that a unit change in predictor x_j leads to a \beta_j change in the log of the mean response. Equivalently, the mean response is multiplied by \exp[\beta_j] for each unit change in x_j.
Similarly, the iteratively reweighted least squares (IRLS) method can solve the log-linear Poisson model. The key correction step is
\begin{equation} \hat{\eta_i}^{(1)}=\hat{\eta_i}^{(0)} + \frac{y_i-\hat\mu_i^{(0)}}{\hat\mu_i^{(0)}} \end{equation}
The diagonal weight matrix is
\begin{equation} w_{ii}=\hat\mu_i^{(0)} \end{equation}
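A hedged example of fitting and interpreting a log-linear Poisson model with statsmodels follows; the file 'trips.csv' and its columns (trip_count, hhsize, income) are hypothetical placeholders.

```python
# A hedged sketch of a log-linear Poisson regression with statsmodels.
# 'trips.csv' and its columns (trip_count, hhsize, income) are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("trips.csv")
X = sm.add_constant(df[["hhsize", "income"]])
fit = sm.GLM(df["trip_count"], X, family=sm.families.Poisson()).fit()  # IRLS under the hood

print(fit.params)           # beta_j: change in log mean per unit of x_j
print(np.exp(fit.params))   # multiplicative effect exp(beta_j) on the mean
```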
11.3.1 Negative Binomial Model
The restriction of the Poisson distribution is that the mean and variance are equal. In many count data the variance exceeds the mean, a situation called overdispersion.
Suppose an unobserved random variable follows a gamma distribution Z\sim Gamma(r,1/r), where r is the shape parameter. The pdf is
\begin{equation} f(z)=\frac{r^r}{\Gamma(r)}z^{r-1}\exp[-rz],\quad z>0 \end{equation}
It has E[Z]=1 and Var[Z]=1/r.
Then a mixture model can be denoted as a conditional distribution Y|Z\sim Pois(\mu Z) for some \mu>0, and
\begin{equation} E[Y]=E[E[Y|Z]]=E[\mu Z]=\mu E[Z]=\mu \end{equation}
and
\begin{equation} \begin{split} Var[Y]&=E[Var[Y|Z]] + Var[E[Y|Z]]\\ &=E[\mu Z]+Var[\mu Z]\\ &=\mu E[Z] + \mu^2Var[Z]\\ &=\mu+\frac{\mu^2}{r} \end{split} \end{equation}
This is called the Poisson-Gamma distribution, which can represent the inequality of mean and variance. Let r represent a given number of successes and y the observed number of failures in a sequence of independent Bernoulli trials; then the success probability is p=r/(r+\mu). Recalling that \Gamma(r+y)=\int_0^\infty z^{r+y-1}\exp[-z]dz, it can be proved that Y follows a negative binomial distribution:
\begin{equation} \begin{split} p(y)&=\int_0^\infty p(y|z)\cdot f(z)dz\\ &=\int_0^\infty \frac{(\mu z)^y\exp[-\mu z]}{y!}\cdot\frac{r^r}{\Gamma(r)}z^{r-1}\exp[-rz]dz\\ &=\frac{r^r\mu^y}{y!\Gamma(r)}(\mu+r)^{-y-r}\int_0^\infty [(\mu+r)z]^{y+r-1}\exp[-(\mu+r)z]d(\mu+r)z\\ &=\frac{\Gamma(r+y)}{\Gamma(y+1)\Gamma(r)}\left(\frac{r}{\mu+r}\right)^r\left(\frac{\mu}{\mu+r}\right)^y\\ &={{r+y-1}\choose{r-1}}p^r(1-p)^y, \quad y=0,1,2,... \end{split} \end{equation}
However, the negative binomial distribution (with unknown r) does not belong to the exponential family. Using the maximum likelihood method with a log link, the coefficients can still be estimated.
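The sketch below fits a negative binomial regression by maximum likelihood with statsmodels; the data file and column names are hypothetical placeholders.

```python
# A hedged sketch of negative binomial regression with statsmodels.
# 'trips.csv' and its columns are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("trips.csv")
X = sm.add_constant(df[["hhsize", "income"]])

# sm.NegativeBinomial estimates beta and the dispersion parameter alpha
# (alpha = 1/r, so Var[Y] = mu + alpha * mu^2) jointly by MLE with a log link.
fit = sm.NegativeBinomial(df["trip_count"], X).fit()
print(fit.summary())   # alpha > 0 signals overdispersion relative to Poisson
```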
11.3.2 Quasi-Poisson Model
Another simple way to handle overdispersion is to introduce a dispersion parameter \phi. The quasi-Poisson model has Var[Y|\eta]=\phi\mu, where \phi>1. The estimated \phi is
\begin{equation} \hat\phi=\frac{1}{n-p}\sum_i\frac{(Y_i-\hat\mu_i)^2}{\hat\mu_i} \end{equation}
The extra parameter can also be estimated by maximum likelihood.
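In statsmodels, this Pearson-based estimate of \phi can be requested when fitting the Poisson GLM, as the sketch below shows; the data names are again hypothetical.

```python
# A hedged sketch of a quasi-Poisson fit: the Poisson GLM refit with the
# Pearson chi-square estimate of the dispersion phi. Data names are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("trips.csv")
X = sm.add_constant(df[["hhsize", "income"]])

# scale='X2' sets phi-hat = (1/(n-p)) * sum (y_i - mu_i)^2 / mu_i,
# inflating the standard errors while leaving the coefficients unchanged.
fit = sm.GLM(df["trip_count"], X, family=sm.families.Poisson()).fit(scale="X2")
print(fit.scale)   # estimated dispersion parameter phi
```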
11.3.3 Zero-inflated and Hurdle Models
In trip frequency data, there are often more no-trip observations than the Poisson and negative binomial distributions expect. Two types of models, the zero-inflated model and the hurdle model, can address this issue. Both assume the data arise from two mechanisms.
In the zero-inflated Poisson/negative binomial model, both mechanisms generate zero observations. The first mechanism produces the excess zeros with probability \pi_i=Pr(Y_i=0). The remaining zeros and the positive values are generated by the second mechanism, a Poisson or negative binomial pmf f(y;\mathbf{x}_i,\boldsymbol\beta).
\begin{equation} f(y_i;\pi_i,\mu_i)=\begin{cases}\pi_i+(1-\pi_i)f(0;\mu_i)&y_i=0\\ (1-\pi_i)f(y_i;\mu_i)&y_i>0\\ \end{cases} \tag{11.4} \end{equation}
The two link functions are
\begin{equation} \begin{split} g_0(\pi_i)=&\mathbf{w}'_i\boldsymbol\gamma\\ g_1(\mu_i)=&\mathbf{x}'_i\boldsymbol\beta\\ \end{split} \end{equation}
Note that the two mechanisms can have different covariates and coefficients. But both \pi_i and \mu_i appear in the likelihood and have to be estimated jointly; the Newton-Raphson algorithm or the EM algorithm can deal with this problem.
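A hedged example of the joint estimation with statsmodels' ZeroInflatedPoisson follows; the data file and its columns (trip_count, hhsize, income, carless) are hypothetical placeholders.

```python
# A hedged sketch of a zero-inflated Poisson model with statsmodels.
# 'trips.csv' and its columns are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

df = pd.read_csv("trips.csv")
X = sm.add_constant(df[["hhsize", "income"]])   # covariates x_i for the count part
W = sm.add_constant(df[["carless"]])            # covariates w_i for the zero part

# inflation='logit' uses a logit link g_0 for pi_i; the count part uses log(mu_i).
fit = ZeroInflatedPoisson(df["trip_count"], X, exog_infl=W, inflation="logit").fit()
print(fit.summary())
```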
The hurdle model assumes all zero observations are generated by the first mechanism. Hence the first mechanism does not depend on \mathbf{x}_i and \boldsymbol\beta. A challenge is that the ordinary Poisson or negative binomial distribution does contain zero values; a zero-truncated distribution addresses this issue.
\begin{equation} f(y_i;\pi_i,\mu_i)=\begin{cases}\pi_i&y_i=0\\ (1-\pi_i)\frac{f(y;\mu_i)}{1-f(0;\mu_i)}&y_i>0\\ \end{cases} \tag{11.5} \end{equation}
where f(0;\mu_i)= \exp[-\mu_i] in the Poisson model and f(0;\mu_i)= \left(\frac{r}{\mu_i+r}\right)^r in the negative binomial model.
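The hurdle Poisson model can be fit in two separate parts, as the sketch below illustrates: a logit model for zero versus positive, and a zero-truncated Poisson likelihood maximized over the positive counts. The simulated data are purely illustrative.

```python
# A hedged two-part sketch of a hurdle Poisson model: a logit for zero vs.
# positive, and a zero-truncated Poisson MLE for the positive counts.
import numpy as np
import statsmodels.api as sm
from scipy.optimize import minimize
from scipy.special import gammaln

def trunc_pois_negloglik(beta, y, X):
    """Negative log-likelihood of a zero-truncated Poisson with a log link."""
    mu = np.exp(X @ beta)
    # log f(y; mu) - log(1 - f(0; mu)), with f(0; mu) = exp(-mu)
    ll = y * np.log(mu) - mu - gammaln(y + 1) - np.log1p(-np.exp(-mu))
    return -ll.sum()

rng = np.random.default_rng(1)
n = 500
X = sm.add_constant(rng.normal(size=(n, 1)))                 # count-part covariates
W = X                                                        # zero-part covariates (same here)
y = rng.poisson(2.0, size=n) * rng.binomial(1, 0.7, size=n)  # toy data with extra zeros

# Part 1: logit model for Pr(y > 0), i.e. 1 - pi_i
zero_fit = sm.Logit((y > 0).astype(int), W).fit(disp=0)

# Part 2: truncated Poisson fit on the positive counts only
pos = y > 0
res = minimize(trunc_pois_negloglik, x0=np.zeros(X.shape[1]),
               args=(y[pos], X[pos]), method="BFGS")
print(zero_fit.params, res.x)
```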
“When \pi is close to 0 or close to 1, the standard error is artificially compressed, which leads us to overestimate the precision of the proportion estimate.” (Lipsey and Wilson 2001, chap. 3)