## A.3 Multinomial logistic regression

The logistic model seen in Section 5.2.1 can be generalized to categorical variables \(Y\) with more than two possible levels, namely \(\{1,\ldots,J\}.\) Given the predictors \(X_1,\ldots,X_p,\) *multinomial logistic regression* models the probability of each level \(j\) of \(Y\) by

\[\begin{align} p_j(\mathbf{x}):=&\,\mathbb{P}[Y=j|X_1=x_1,\ldots,X_p=x_p]\nonumber\\ =&\,\frac{e^{\beta_{0j}+\beta_{1j}X_1+\cdots+\beta_{pj}X_p}}{1+\sum_{\ell=1}^{J-1}e^{\beta_{0\ell}+\beta_{1\ell}X_1+\cdots+\beta_{p\ell}X_p}} \tag{A.4} \end{align}\]

for \(j=1,\ldots,J-1\) and (for the last level \(J\))

\[\begin{align} p_J(\mathbf{x}):=&\,\mathbb{P}[Y=J|X_1=x_1,\ldots,X_p=x_p]\nonumber\\ =&\,\frac{1}{1+\sum_{\ell=1}^{J-1}e^{\beta_{0\ell}+\beta_{1\ell}X_1+\cdots+\beta_{p\ell}X_p}}. \tag{A.5} \end{align}\]

Note that (A.4) and (A.5) imply that \(\sum_{j=1}^J p_j(\mathbf{x})=1\) and that there are \((J-1)\times(p+1)\) coefficients^{244}. Also, (A.5) reveals that the last level, \(J,\) is given a different treatment. This is because it is the *reference level* (it could be a different one, but it is the tradition to choose the last one).

The multinomial logistic model has an interesting interpretation in terms of logistic regressions. Taking the quotient between (A.4) and (A.5) gives

\[\begin{align} \frac{p_j(\mathbf{x})}{p_J(\mathbf{x})}=e^{\beta_{0j}+\beta_{1j}X_1+\cdots+\beta_{pj}X_p}\tag{A.6} \end{align}\]

for \(j=1,\ldots,J-1.\) Therefore, applying a logarithm to both sides we have:

\[\begin{align} \log\left(\frac{p_j(\mathbf{x})}{p_J(\mathbf{x})}\right)=\beta_{0j}+\beta_{1j}X_1+\cdots+\beta_{pj}X_p.\tag{A.7} \end{align}\]

This equation is indeed very similar to (5.8). If \(J=2,\) it is the same up to a change in the codes for the levels: the logistic regression giving the probability of \(Y=1\) versus \(Y=2.\) On the LHS of (A.7) we have the logarithm of the ratio of two probabilities and on the RHS a linear combination of the predictors. If the probabilities on the LHS were complementary (if they added up to one), then we would have a log-odds and hence a logistic regression for \(Y.\) This is not the situation, but it is close: instead of odds and log-odds, we have *ratios* and *log-ratios* of non complementary probabilities. Also, it gives a good insight on what the multinomial logistic regression is: **a set of \(J-1\) independent logistic regressions** for the probability of \(Y=j\) versus the probability of the reference \(Y=J.\)

Equation (A.6) gives also interpretation on the coefficients of the model since

\[\begin{align*} p_j(\mathbf{x})=e^{\beta_{0j}+\beta_{1j}X_1+\cdots+\beta_{pj}X_p}p_J(\mathbf{x}). \end{align*}\]

Therefore:

- \(e^{\beta_{0j}}\): is the ratio between \(p_j(\mathbf{0})/p_J(\mathbf{0}),\) the probabilities of \(Y=j\) and \(Y=J\) when \(X_1=\ldots=X_p=0.\) If \(e^{\beta_{0j}}>1\) (equivalently, \(\beta_{0j}>0\)), then \(Y=j\) is more likely than \(Y=J.\) If \(e^{\beta_{0j}}<1\) (\(\beta_{0j}<0\)), then \(Y=j\) is less likely than \(Y=J.\)
- \(e^{\beta_{\ell j}},\) \(\ell\geq1\): is the
**multiplicative**increment of the ratio between \(p_j(\mathbf{x})/p_J(\mathbf{x})\) for an increment of one unit in \(X_\ell=x_\ell,\) provided that the remaining variables \(X_1,\ldots,X_{\ell-1},X_{\ell+1},\ldots,X_p\)*do not change*. If \(e^{\beta_{\ell j}}>1\) (equivalently, \(\beta_{\ell j}>0\)), then \(Y=j\) becomes more likely than \(Y=J\) for each increment in \(X_j.\) If \(e^{\beta_{\ell j}}<1\) (\(\beta_{\ell j}<0\)), then \(Y=j\) becomes less likely than \(Y=J.\)

The following code illustrates how to compute a basic multinomial regression employing the `nnet`

package.

```
# Data from the voting intentions in the 1988 Chilean national plebiscite
data(Chile, package = "carData")
summary(Chile)
## region population sex age education income statusquo vote
## C :600 Min. : 3750 F:1379 Min. :18.00 P :1107 Min. : 2500 Min. :-1.80301 A :187
## M :100 1st Qu.: 25000 M:1321 1st Qu.:26.00 PS : 462 1st Qu.: 7500 1st Qu.:-1.00223 N :889
## N :322 Median :175000 Median :36.00 S :1120 Median : 15000 Median :-0.04558 U :588
## S :718 Mean :152222 Mean :38.55 NA's: 11 Mean : 33876 Mean : 0.00000 Y :868
## SA:960 3rd Qu.:250000 3rd Qu.:49.00 3rd Qu.: 35000 3rd Qu.: 0.96857 NA's:168
## Max. :250000 Max. :70.00 Max. :200000 Max. : 2.04859
## NA's :1 NA's :98 NA's :17
# vote is a factor with levels A (abstention), N (against Pinochet),
# U (undecided), Y (for Pinochet)
# Fit of the model done by multinom: Response ~ Predictors
# It is an iterative procedure (maxit sets the maximum number of iterations)
# Read the documentation in ?multinom for more information
mod1 <- nnet::multinom(vote ~ age + education + statusquo, data = Chile,
maxit = 1e3)
## # weights: 24 (15 variable)
## initial value 3476.826258
## iter 10 value 2310.201176
## iter 20 value 2135.385060
## final value 2132.416452
## converged
# Each row of coefficients gives the coefficients of the logistic
# regression of a level versus the reference level (A)
summary(mod1)
## Call:
## nnet::multinom(formula = vote ~ age + education + statusquo,
## data = Chile, maxit = 1000)
##
## Coefficients:
## (Intercept) age educationPS educationS statusquo
## N 0.3002851 0.004829029 0.4101765 -0.1526621 -1.7583872
## U 0.8722750 0.020030032 -1.0293079 -0.6743729 0.3261418
## Y 0.5093217 0.016697208 -0.4419826 -0.6909373 1.8752190
##
## Std. Errors:
## (Intercept) age educationPS educationS statusquo
## N 0.3315229 0.006742834 0.2659012 0.2098064 0.1292517
## U 0.3183088 0.006630914 0.2822363 0.2035971 0.1059440
## Y 0.3333254 0.006915012 0.2836015 0.2131728 0.1197440
##
## Residual Deviance: 4264.833
## AIC: 4294.833
# Set a different level as the reference (N) for easier interpretations
Chile$vote <- relevel(Chile$vote, ref = "N")
mod2 <- nnet::multinom(vote ~ age + education + statusquo, data = Chile,
maxit = 1e3)
## # weights: 24 (15 variable)
## initial value 3476.826258
## iter 10 value 2393.713801
## iter 20 value 2134.438912
## final value 2132.416452
## converged
summary(mod2)
## Call:
## nnet::multinom(formula = vote ~ age + education + statusquo,
## data = Chile, maxit = 1000)
##
## Coefficients:
## (Intercept) age educationPS educationS statusquo
## A -0.3002035 -0.00482911 -0.4101274 0.1525608 1.758307
## U 0.5720544 0.01519931 -1.4394862 -0.5217093 2.084491
## Y 0.2091397 0.01186576 -0.8521205 -0.5382716 3.633550
##
## Std. Errors:
## (Intercept) age educationPS educationS statusquo
## A 0.3315153 0.006742654 0.2658887 0.2098012 0.1292494
## U 0.2448452 0.004819103 0.2116375 0.1505854 0.1091445
## Y 0.2850655 0.005700894 0.2370881 0.1789293 0.1316567
##
## Residual Deviance: 4264.833
## AIC: 4294.833
exp(coef(mod2))
## (Intercept) age educationPS educationS statusquo
## A 0.7406675 0.9951825 0.6635657 1.1648133 5.802607
## U 1.7719034 1.0153154 0.2370495 0.5935052 8.040502
## Y 1.2326171 1.0119364 0.4265095 0.5837564 37.846937
# Some highlights:
# - intercepts do not have too much interpretation (correspond to age = 0).
# A possible solution is to center age by its mean (so age = 0 would
# represent the mean of the ages)
# - both age and statusquo increase the probability of voting Y, A or U
# with respect to voting N -> conservatism increases with ages
# - both age and statusquo increase more the probability of voting Y and U
# than A -> elderly and status quo supporters are more decided to participate
# - a PS level of education increases the probability of voting N. Same for
# a S level of education, but more prone to A
# Prediction of votes -- three profile of voters
newdata <- data.frame(age = c(23, 40, 50),
education = c("PS", "S", "P"),
statusquo = c(-1, 0, 2))
# Probabilities of belonging to each class
predict(mod2, newdata = newdata, type = "probs")
## N A U Y
## 1 0.856057623 0.064885869 0.06343390 0.01562261
## 2 0.208361489 0.148185871 0.40245842 0.24099422
## 3 0.000288924 0.005659661 0.07076828 0.92328313
# Predicted class
predict(mod2, newdata = newdata, type = "class")
## [1] N U Y
## Levels: N A U Y
```

Multinomial logistic regression will suffer from numerical instabilities and its iterative algorithm might even fail to converge if the levels of the categorical variable are very separated (e.g., two data clouds clearly separated corresponding to a different level of the categorical variable).

\((J-1)\) intercepts and \((J-1)\times p\) slopes.↩︎