6.4 The multinomial probit model

The multinomial probit model is used to model the choice of the l-th alternative over a set of L mutually exclusive options. We observe the following:

y_{il} = \begin{cases} 1, & \text{if } y_{il}^* \geq \max\left\{\boldsymbol{y}_i^*\right\}, \\ 0, & \text{otherwise,} \end{cases}

where \boldsymbol{y}_i^* = \boldsymbol{X}_{i} \boldsymbol{\delta} + \boldsymbol{\mu}_i, with \boldsymbol{\mu}_i \stackrel{i.i.d.}{\sim} N(\boldsymbol{0}, \boldsymbol{\Sigma}). The vector \boldsymbol{y}_i^* is an unobserved latent vector of dimension L. The matrix \boldsymbol{X}_i = \left[(1 \ \boldsymbol{c}_i^{\top}) \otimes \boldsymbol{I}_L \ \boldsymbol{A}_i\right] is an L \times j matrix of regressors for each alternative, where l = 1, 2, \dots, L, and j = L \times (1 + \text{dim}(\boldsymbol{c}_i)) + a. Here, \boldsymbol{c}_i is a vector of individual-specific characteristics, \boldsymbol{A}_i is an L \times a matrix of alternative-varying regressors, a is the number of alternative-varying regressors, and \boldsymbol{\delta} is a j-dimensional vector of parameters.
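The block structure of \boldsymbol{X}_i can be made concrete with a small sketch. The dimensions and values below are hypothetical, chosen only to illustrate the layout of \left[(1 \ \boldsymbol{c}_i^{\top}) \otimes \boldsymbol{I}_L \ \boldsymbol{A}_i\right]:

```python
# Illustrative construction of X_i = [(1, c_i^T) kron I_L | A_i]; the
# dimensions and values are hypothetical, chosen only to show the layout.
import numpy as np

L = 4                                  # number of alternatives
c_i = np.array([2.5])                  # individual characteristics, dim(c_i) = 1
a = 2                                  # number of alternative-varying regressors
A_i = np.arange(L * a, dtype=float).reshape(L, a)  # alternative attributes

X_i = np.hstack([np.kron(np.concatenate(([1.0], c_i)), np.eye(L)), A_i])
# X_i is L x j with j = L * (1 + dim(c_i)) + a = 4 * 2 + 2 = 10
```

The Kronecker product places an identity block for the intercepts and one scaled identity block per individual characteristic, followed by the alternative-varying columns.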

We simultaneously take into account the alternative-varying regressors (alternative attributes) and the alternative-invariant regressors (individual characteristics).29 The vector \boldsymbol{y}_i^* can be stacked into a multiple regression model with correlated stochastic errors, i.e., \boldsymbol{y}^* = \boldsymbol{X} \boldsymbol{\delta} + \boldsymbol{\mu}, where \boldsymbol{y}^* = \left[\boldsymbol{y}_1^{*\top} \ \boldsymbol{y}_2^{*\top} \ \dots \ \boldsymbol{y}_N^{*\top}\right]^{\top}, \boldsymbol{X} = \left[\boldsymbol{X}_1^{\top} \ \boldsymbol{X}_2^{\top} \ \dots \ \boldsymbol{X}_N^{\top}\right]^{\top}, and \boldsymbol{\mu} = \left[\boldsymbol{\mu}_1^{\top} \ \boldsymbol{\mu}_2^{\top} \ \dots \ \boldsymbol{\mu}_N^{\top}\right]^{\top}.

Following the practice of expressing y_{il}^* relative to y_{iL}^*, we let \boldsymbol{w}_i = \left[w_{i1} \ w_{i2} \ \dots \ w_{i,L-1}\right]^{\top}, where w_{il} = y_{il}^* - y_{iL}^*. We can then write \boldsymbol{w}_i = \boldsymbol{R}_i \boldsymbol{\beta} + \boldsymbol{\epsilon}_i, with \boldsymbol{\epsilon}_i \sim N(\boldsymbol{0}, \boldsymbol{\Omega}), where \boldsymbol{R}_i = \left[(1 \ \boldsymbol{c}_i^{\top}) \otimes \boldsymbol{I}_{L-1} \ \Delta\boldsymbol{A}_i\right] is an (L-1) \times K matrix, and the l-th row of \Delta\boldsymbol{A}_i equals the l-th row of \boldsymbol{A}_i minus its last row, for l = 1, 2, \dots, L-1. That is, the last row of \boldsymbol{A}_i is subtracted from each of its first L-1 rows. The vector \boldsymbol{\beta} is K-dimensional, where K = (L-1) \times (1 + \text{dim}(\boldsymbol{c}_i)) + a.
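The differencing step can be written as a linear map applied to \boldsymbol{y}_i^*. A small sketch with hypothetical dimensions, where M denotes the (L-1) \times L differencing operator \left[\boldsymbol{I}_{L-1} \ -\boldsymbol{1}\right]:

```python
# Sketch of the differencing step; the (L-1) x L matrix M = [I_{L-1} | -1]
# subtracts the baseline utility y_iL^* from the others. Values are hypothetical.
import numpy as np

L = 4
rng = np.random.default_rng(1)
y_star = rng.normal(size=L)                        # latent utilities y_i^*
M = np.hstack([np.eye(L - 1), -np.ones((L - 1, 1))])
w_i = M @ y_star                                   # w_il = y_il^* - y_iL^*

A_i = rng.normal(size=(L, 2))                      # alternative attributes
dA_i = A_i[:L - 1] - A_i[L - 1]                    # last row subtracted from the first L-1 rows
```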

Observe that \boldsymbol{\beta} contains the same last a elements as \boldsymbol{\delta}, that is, the alternative-specific attribute coefficients. However, the first (L-1) \times (1 + \text{dim}(\boldsymbol{c}_i)) elements of \boldsymbol{\beta} are the differences \delta_{jl} - \delta_{jL}, for j = 0, 1, \dots, \text{dim}(\boldsymbol{c}_i) (j = 0 denoting the intercept) and l = 1, 2, \dots, L-1. That is, these elements represent the difference between the coefficients of each qualitative response and the L-th alternative for the individuals’ characteristics. This makes the multinomial probit coefficients difficult to interpret.

Note that in multinomial models, for each alternative-specific attribute, it is only necessary to estimate one coefficient for all alternatives. However, for individuals’ characteristics (non-alternative-specific regressors), it is required to estimate L-1 coefficients, since the coefficient for the base alternative is set equal to 0.

The likelihood function in this model is given by p(\boldsymbol{\beta}, \boldsymbol{\Omega} \mid \boldsymbol{y}, \boldsymbol{R}) = \prod_{i=1}^N \prod_{l=1}^L p_{il}^{y_{il}}, where p_{il} = P\left(y_{il}^* \geq y_{ik}^*, \ k = 1, 2, \dots, L\right).

We assume independent priors for the parameters: \boldsymbol{\beta} \sim N(\boldsymbol{\beta}_0, \boldsymbol{B}_0) \quad \text{and} \quad \boldsymbol{\Omega}^{-1} \sim W(\alpha_0, \boldsymbol{\Sigma}_0), where W denotes the Wishart density.

We can employ Gibbs sampling in this model, as it is a standard Bayesian linear regression model when data augmentation is used for \boldsymbol{w}. The posterior conditional distributions are given by

\begin{equation*} \boldsymbol{\beta}\mid \boldsymbol{\Omega},\boldsymbol{w}\sim{N}(\boldsymbol{\beta}_n,\boldsymbol{B}_n), \end{equation*} \begin{equation*} \boldsymbol{\Omega}^{-1}\mid \boldsymbol{\beta},\boldsymbol{w}\sim{W}(\alpha_n,\boldsymbol{\Sigma}_n), \end{equation*}

where \boldsymbol{B}_n=(\boldsymbol{B}_0^{-1}+\boldsymbol{X}^{*\top}\boldsymbol{X}^*)^{-1}, \boldsymbol{\beta}_n=\boldsymbol{B}_n(\boldsymbol{B}_0^{-1}\boldsymbol{\beta}_0+\boldsymbol{X}^{*\top}\boldsymbol{w}^*), \boldsymbol{\Omega}^{-1}=\boldsymbol{C}^{\top}\boldsymbol{C}, \boldsymbol{X}_i^*=\boldsymbol{C}\boldsymbol{R}_i, \boldsymbol{w}_i^*=\boldsymbol{C}\boldsymbol{w}_i, \boldsymbol{X}^*=\begin{bmatrix}\boldsymbol{X}_1^*\\ \boldsymbol{X}_2^*\\ \vdots\\ \boldsymbol{X}_N^* \end{bmatrix}, \boldsymbol{w}^*=\begin{bmatrix}\boldsymbol{w}_1^*\\ \boldsymbol{w}_2^*\\ \vdots\\ \boldsymbol{w}_N^* \end{bmatrix}, \alpha_n=\alpha_0+N, \boldsymbol{\Sigma}_n=\left(\boldsymbol{\Sigma}_0^{-1}+\sum_{i=1}^N (\boldsymbol{w}_i-\boldsymbol{R}_i\boldsymbol{\beta})(\boldsymbol{w}_i-\boldsymbol{R}_i\boldsymbol{\beta})^{\top}\right)^{-1}.
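The two conditional draws can be sketched in a few lines. The snippet below uses simulated (hypothetical) data and illustrative priors, follows the standard natural-conjugate updating (prior precision plus cross products for \boldsymbol{\beta}; residual outer products for the Wishart scale), and draws the Wishart variate via a sum of Gaussian outer products to stay within numpy:

```python
# Minimal numpy sketch of the two Gibbs conditional draws, with simulated
# (hypothetical) data; dimensions, priors, and seeds are illustrative only.
import numpy as np

rng = np.random.default_rng(2)
N, Lm1, K = 50, 3, 5                          # sample size, L-1, dim(beta)
R = rng.normal(size=(N, Lm1, K))              # stacked R_i matrices
beta_true = rng.normal(size=K)
w = np.einsum('nlk,k->nl', R, beta_true) + rng.normal(size=(N, Lm1))

beta0, B0_inv = np.zeros(K), np.eye(K) / 100.0   # prior beta ~ N(beta0, B0)
alpha0, Sigma0 = Lm1 + 2, np.eye(Lm1)            # prior Omega^{-1} ~ W(alpha0, Sigma0)
Omega_inv = np.eye(Lm1)                          # current value of Omega^{-1}

# --- beta | Omega, w: GLS transform with Omega^{-1} = C'C (C upper triangular) ---
C = np.linalg.cholesky(Omega_inv).T
Xs = (C[None] @ R).reshape(N * Lm1, K)        # X_i^* = C R_i, stacked
ws = (w @ C.T).reshape(N * Lm1)               # w_i^* = C w_i, stacked
Bn = np.linalg.inv(B0_inv + Xs.T @ Xs)
betan = Bn @ (B0_inv @ beta0 + Xs.T @ ws)
beta = rng.multivariate_normal(betan, Bn)

# --- Omega^{-1} | beta, w: Wishart draw via sum of Gaussian outer products ---
resid = w - np.einsum('nlk,k->nl', R, beta)   # residuals w_i - R_i beta
Sigman = np.linalg.inv(np.linalg.inv(Sigma0) + resid.T @ resid)
G = rng.multivariate_normal(np.zeros(Lm1), Sigman, size=alpha0 + N)
Omega_inv = G.T @ G                           # one draw from W(alpha0 + N, Sigma_n)
```

In bayesm these conditional updates are carried out internally by rmnpGibbs; the sketch only mirrors the algebra of the conditional posteriors, iterated together with the latent draws of \boldsymbol{w}.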

We can collapse the multinomial vector \boldsymbol{y}_i into the indicator variable d_i=\sum_{l=1}^{L-1}l\times \mathbb{1}(w_{il}=\max\{\boldsymbol{w}_{i},0\}), so that d_i=0 when all elements of \boldsymbol{w}_i are negative, that is, when the baseline alternative L is chosen.30 Then the distribution of \boldsymbol{w}_i\mid \boldsymbol{\beta},\boldsymbol{\Omega}^{-1},d_i is an (L-1)-dimensional Gaussian distribution truncated to the appropriate cone in \mathbb{R}^{L-1}. R. McCulloch and Rossi (1994) propose drawing from the univariate conditional distributions w_{il}\mid \boldsymbol{w}_{i,-l},\boldsymbol{\beta},\boldsymbol{\Omega}^{-1},d_i\sim TN_{I_{il}}(m_{il},\tau_{ll}^2), where \begin{equation*} I_{il}=\begin{cases} w_{il}>\max\{\boldsymbol{w}_{i,-l},0\}, & d_i=l,\\ w_{il}<\max\{\boldsymbol{w}_{i,-l},0\}, & d_i\neq l,\\ \end{cases} \end{equation*} and permuting the rows and columns of \boldsymbol{\Omega}^{-1} so that the l-th row and column are last, \begin{equation*} \boldsymbol{\Omega}^{-1}=\begin{bmatrix} \boldsymbol{\Omega}_{-l,-l} & \boldsymbol\omega_{-l,l}\\ \boldsymbol\omega_{l,-l} & \omega_{l,l}\\ \end{bmatrix}^{-1} =\begin{bmatrix} \boldsymbol{\Omega}_{-l,-l}^{-1}+{\tau}^{-2}_{ll}\boldsymbol{f}_l\boldsymbol{f}_l^{\top} & -\boldsymbol{f}_l\tau^{-2}_{ll}\\ -{\tau}^{-2}_{ll}\boldsymbol{f}_l^{\top} & {\tau}^{-2}_{ll}\\ \end{bmatrix}, \end{equation*} where \boldsymbol{f}_l=\boldsymbol{\Omega}_{-l,-l}^{-1}\boldsymbol{\omega}_{-l,l}, \tau_{ll}^2= \omega_{ll}-\boldsymbol{\omega}_{l,-l}\boldsymbol{\Omega}^{-1}_{-l,-l}\boldsymbol{\omega}_{-l,l}, m_{il}=\boldsymbol{r}_{il}^{\top}\boldsymbol{\beta}+\boldsymbol{f}_l^{\top}(\boldsymbol{w}_{i,-l}-\boldsymbol{R}_{i,-l}\boldsymbol{\beta}), \boldsymbol{w}_{i,-l} is the (L-2)-dimensional vector of all components of \boldsymbol{w}_i excluding w_{il}, \boldsymbol{R}_{i,-l} is \boldsymbol{R}_i without its l-th row, and \boldsymbol{r}_{il} is the l-th row of \boldsymbol{R}_i, l=1,2,\dots,L-1.
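Each univariate truncated-normal draw can be implemented with the inverse-CDF method. A small self-contained sketch, in which the mean m_{il}, standard deviation \tau_{ll}, and truncation bound are hypothetical values:

```python
# Inverse-CDF sketch of one truncated-normal draw w_il ~ TN_{I_il}(m_il, tau_ll^2);
# the mean, standard deviation, and truncation bound below are hypothetical.
from statistics import NormalDist
import random

def draw_truncnorm(m, tau, bound, above):
    """Draw from N(m, tau^2) truncated to (bound, +inf) if above, else (-inf, bound)."""
    nd = NormalDist(m, tau)
    lo = nd.cdf(bound) if above else 0.0
    hi = 1.0 if above else nd.cdf(bound)
    return nd.inv_cdf(random.uniform(lo, hi))

random.seed(0)
bound = 0.5                                  # plays the role of max(w_{i,-l}, 0)
w_chosen = draw_truncnorm(0.0, 1.0, bound, above=True)    # case d_i = l
w_other = draw_truncnorm(0.0, 1.0, bound, above=False)    # case d_i != l
```

Cycling such draws over l = 1, \dots, L-1 for every individual completes the data-augmentation step of the Gibbs sampler.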

The identified parameters are obtained by normalizing with respect to the first diagonal element of \boldsymbol{\Omega}, that is, \frac{1}{\omega_{1,1}^{0.5}}\boldsymbol{\beta} and \frac{1}{\omega_{1,1}}\boldsymbol{\Omega}.31
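This post-processing step is applied draw by draw. A minimal sketch with hypothetical values of one unidentified draw (\boldsymbol{\beta}, \boldsymbol{\Omega}):

```python
# Post-processing sketch: map an unidentified draw (beta, Omega) to the
# identified space by scaling with omega_{1,1}; the values are hypothetical.
import numpy as np

beta = np.array([0.8, -0.3, 1.2])
Omega = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.5]])

beta_id = beta / np.sqrt(Omega[0, 0])   # beta / omega_{1,1}^{1/2}
Omega_id = Omega / Omega[0, 0]          # Omega / omega_{1,1}
```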

Warning: This model is an example where decisions must be made about setting the model in an identified parameter space versus an unidentified parameter space. The mixing properties of the posterior draws can be better in the latter case (R. E. McCulloch, Polson, and Rossi 2000), which typically results in less computational burden. However, it is important to recover the identified space in a final stage. Additionally, defining priors in the unidentified space may have unintended consequences on the posterior distributions in the identified space (Nobile 2000). The multinomial probit model presented in this section is set in the unidentified space (R. McCulloch and Rossi 1994), while a version of the multinomial probit in the identified space is presented by R. E. McCulloch, Polson, and Rossi (2000).

Example: Choice of fishing mode

We use the dataset 3Fishing.csv from Cameron and Trivedi (2005) in this application. The dependent variable (mode) consists of mutually exclusive fishing alternatives: beach is equal to 1, pier is equal to 2, private boat is equal to 3, and chartered boat (the baseline alternative) is equal to 4. In this model, we have

\mathbf{X}_i = \begin{bmatrix} 1 & 0 & 0 & 0 & \text{Income}_i & 0 & 0 & 0 & \text{Price}_{i,1} & \text{Catch rate}_{i,1}\\ 0 & 1 & 0 & 0 & 0 & \text{Income}_i & 0 & 0 & \text{Price}_{i,2} & \text{Catch rate}_{i,2}\\ 0 & 0 & 1 & 0 & 0 & 0 & \text{Income}_i & 0 & \text{Price}_{i,3} & \text{Catch rate}_{i,3}\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & \text{Income}_i & \text{Price}_{i,4} & \text{Catch rate}_{i,4}\\ \end{bmatrix}

In this example, chartered boat is the base category, the number of choice categories is four, there are two alternative-specific regressors (price and catch rate), and one non-alternative-specific regressor (income). This setting involves the estimation of eight location parameters (\boldsymbol{\beta}): three intercepts, three for income, one for price, and one for catch rate. This is the order of the posterior chains in our GUI. Note that the location coefficients are set equal to 0 for the baseline category. For multinomial models, we strongly recommend using the last category as the baseline.

We also get posterior estimates for a 3\times 3 covariance matrix (four alternatives minus one), where element (1,1) is fixed at 1 due to the identification restriction. In the vectorized output, elements 2 and 4 coincide, as do elements 3 and 7, and elements 6 and 8, due to symmetry of the covariance matrix.

Observe that this identification restriction implies NaN values in the Geweke (1992) and Heidelberger and Welch (1983) tests for element (1,1) of the covariance matrix, and only eight dependence factors associated with the remaining elements of the covariance matrix.

Once our GUI is displayed (see the beginning of this chapter), we should follow the Algorithm below to run multinomial probit models in our GUI (see Chapter 5 for details), which in turn uses the command rmnpGibbs from the bayesm package.

Algorithm: Multinomial Probit model

  1. Select Univariate Models on the top panel

  2. Select Multinomial Probit model using the left radio button

  3. Upload the dataset by first selecting whether there is a header in the file, and the kind of separator in the csv file of the dataset (comma, semicolon, or tab). Then, use the Browse button under the Choose File legend. You should see a preview of the dataset

  4. Select MCMC iterations, burn-in, and thinning parameters using the Range sliders

  5. Select dependent and independent variables using the Formula builder table

  6. Select the number of the Base Alternative

  7. Select the Number of choice categorical alternatives

  8. Select the Number of alternative specific variables

  9. Select the Number of Non-alternative specific variables

  10. Click the Build formula button to generate the formula in R syntax

  11. Set the hyperparameters: mean vector, covariance matrix, scale matrix, and degrees of freedom. This step is optional, as by default our GUI uses non-informative priors

  12. Click the Go! button

  13. Analyze results

  14. Download posterior chains and diagnostic plots using the Download Posterior Chains and Download Posterior Graphs buttons

We ran 100,000 MCMC iterations plus 10,000 as burn-in with a thinning parameter equal to 5, where all priors use default values for the hyperparameters in our GUI. We found that the 95% credible intervals of the coefficient associated with income for the beach and private boat alternatives are (8.58e-06, 8.88e-05) and (3.36e-05, 1.45e-04), respectively. This suggests that the probability of choosing these alternatives, compared to a chartered boat, increases with income. In addition, an increase in the price or a decrease in the catch rate of a specific fishing alternative implies a lower probability of choosing it, as the 95% credible intervals are (-9.91e-03, -3.83e-03) for price and (1.40e-01, 4.62e-01) for catch rate. However, the posterior chain diagnostics suggest there are convergence issues with the posterior draws (see Exercise 5).

References

Cameron, Colin, and Pravin Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.
Geweke, J. 1992. “Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments.” In Bayesian Statistics 4. Oxford: Clarendon Press.
Heidelberger, P., and P. D. Welch. 1983. “Simulation Run Length Control in the Presence of an Initial Transient.” Operations Research 31 (6): 1109–44.
McCulloch, Robert E., Nicholas G. Polson, and Peter E. Rossi. 2000. “A Bayesian Analysis of the Multinomial Probit Model with Fully Identified Parameters.” Journal of Econometrics 99 (1): 173–93.
McCulloch, R., and P. Rossi. 1994. “An Exact Likelihood Analysis of the Multinomial Probit Model.” Journal of Econometrics 64: 207–40.
Nobile, Agostino. 2000. “Comment: Bayesian Multinomial Probit Models with a Normalization Constraint.” Journal of Econometrics 99 (2): 335–45.

  1. Note that this model is not identified if \boldsymbol{\Sigma} is unrestricted. The likelihood function remains the same if a scalar random variable is added to each of the L latent regressions.↩︎

  2. Observe that the identification issue in this model arises because scaling w_{il} by a positive constant does not change the value of d_i.↩︎

  3. Our GUI is based on the bayesm package, which takes this identification restriction into account when displaying the outcomes of the posterior chains.↩︎