# C Multinomial logistic regression

The logistic model can be generalized to categorical variables \(Y\) with more than two possible levels, namely \(\{1,\ldots,J\}\). Given the predictors \(X_1,\ldots,X_k\), *multinomial logistic regression* models the probability of each level \(j\) of \(Y\) by \[\begin{align} p_j(\mathbf{x})=\mathbb{P}[Y=j|X_1=x_1,\ldots,X_k=x_k]=\frac{e^{\beta_{0j}+\beta_{1j}x_1+\ldots+\beta_{kj}x_k}}{1+\sum_{l=1}^{J-1}e^{\beta_{0l}+\beta_{1l}x_1+\ldots+\beta_{kl}x_k}} \tag{C.1} \end{align}\] for \(j=1,\ldots,J-1\) and (for the last level \(J\)) \[\begin{align} p_J(\mathbf{x})=\mathbb{P}[Y=J|X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+\sum_{l=1}^{J-1}e^{\beta_{0l}+\beta_{1l}x_1+\ldots+\beta_{kl}x_k}}. \tag{C.2} \end{align}\]

Note that (C.1) and (C.2) imply that \(\sum_{j=1}^J p_j(\mathbf{x})=1\) and that there are \((J-1)\times(k+1)\) coefficients (\(J-1\) intercepts and \((J-1)\times k\) slopes). Also, (C.2) reveals that the last level, \(J\), is given a different treatment. This is because it is the *reference level* (it could be a different one, but it is traditional to choose the last one).
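
As a quick sanity check of (C.1) and (C.2), the following sketch evaluates both formulas for a toy model with \(J=3\) levels and \(k=2\) predictors and verifies that the \(J\) probabilities add up to one. The coefficient matrix `beta` is made up purely for illustration.

```r
# Toy coefficients: each of the J - 1 = 2 rows contains (beta_0j, beta_1j, beta_2j)
beta <- rbind(c(0.5, 1.0, -1.0),  # level j = 1
              c(-0.3, 2.0, 0.5))  # level j = 2
# Evaluate (C.1) for j = 1, ..., J - 1 and (C.2) for the reference level J
multinom_probs <- function(x, beta) {
  eta <- drop(beta %*% c(1, x)) # linear predictors of the non-reference levels
  c(exp(eta), 1) / (1 + sum(exp(eta))) # (p_1, ..., p_{J-1}, p_J)
}
p <- multinom_probs(x = c(0.2, -0.4), beta = beta)
sum(p) # equals 1
```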

Dividing (C.1) by (C.2) and taking logarithms gives \[\begin{align} \log\frac{p_j(\mathbf{x})}{p_J(\mathbf{x})}=\beta_{0j}+\beta_{1j}x_1+\ldots+\beta_{kj}x_k, \tag{C.3} \end{align}\] for \(j=1,\ldots,J-1\). This equation is indeed very similar to (4.7). If \(J=2\), it is the same up to a change in the codes for the levels: the logistic regression giving the probability of \(Y=1\) versus \(Y=2\). On the LHS of (C.3) we have the logarithm of the ratio of two probabilities and on the RHS a linear combination of the predictors. **If the probabilities on the LHS were complementary** (if they added up to one), then we would have a **log-odds and hence a logistic regression** for \(Y\). This is not the situation, but it is close: instead of odds and log-odds, we have *ratios* and *log-ratios* of non-complementary probabilities. This also gives good insight into what multinomial logistic regression is: **a set of \(J-1\) independent “logistic regressions” for the probability of \(Y=j\) versus the probability of the reference \(Y=J\)**.

Equation (C.3) also gives an interpretation of the coefficients of the model, since \[ p_j(\mathbf{x})=e^{\beta_{0j}+\beta_{1j}x_1+\ldots+\beta_{kj}x_k}\,p_J(\mathbf{x}). \] Therefore:

- \(e^{\beta_{0j}}\): the ratio \(p_j(\mathbf{0})/p_J(\mathbf{0})\) between the probabilities of \(Y=j\) and \(Y=J\) when \(X_1=\ldots=X_k=0\). If \(e^{\beta_{0j}}>1\) (equivalently, \(\beta_{0j}>0\)), then \(Y=j\) is more likely than \(Y=J\). If \(e^{\beta_{0j}}<1\) (\(\beta_{0j}<0\)), then \(Y=j\) is less likely than \(Y=J\).
- \(e^{\beta_{lj}}\), \(l\geq1\): the **multiplicative** increment of the ratio \(p_j(\mathbf{x})/p_J(\mathbf{x})\) for an increment of one unit in \(X_l=x_l\), provided that the remaining variables \(X_1,\ldots,X_{l-1},X_{l+1},\ldots,X_k\) *do not change*. If \(e^{\beta_{lj}}>1\) (equivalently, \(\beta_{lj}>0\)), then \(Y=j\) becomes more likely than \(Y=J\) for each increment in \(X_l\). If \(e^{\beta_{lj}}<1\) (\(\beta_{lj}<0\)), then \(Y=j\) becomes less likely than \(Y=J\).
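
The multiplicative interpretation of \(e^{\beta_{lj}}\) can be checked numerically. In this sketch the coefficients are invented for illustration and correspond to a single non-reference level \(j\).

```r
# Coefficients (beta_0j, beta_1j, beta_2j) of a fixed non-reference level j
beta <- c(0.5, 1.0, -1.0)
# Ratio p_j(x) / p_J(x) implied by dividing (C.1) by (C.2)
ratio <- function(x) exp(sum(beta * c(1, x)))
x0 <- c(1, 2)
# Incrementing X_1 by one unit multiplies the ratio by exp(beta_1j)
ratio(x0 + c(1, 0)) / ratio(x0) # equals exp(beta[2])
```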

The following code illustrates how to compute a basic multinomial regression in `R`.

```
# Package included in R that implements multinomial regression
library(nnet)
# Data from the voting intentions in the 1988 Chilean national plebiscite,
# available in the carData package
data(Chile, package = "carData")
summary(Chile)
## region population sex age education
## C :600 Min. : 3750 F:1379 Min. :18.00 P :1107
## M :100 1st Qu.: 25000 M:1321 1st Qu.:26.00 PS : 462
## N :322 Median :175000 Median :36.00 S :1120
## S :718 Mean :152222 Mean :38.55 NA's: 11
## SA:960 3rd Qu.:250000 3rd Qu.:49.00
## Max. :250000 Max. :70.00
## NA's :1
## income statusquo vote
## Min. : 2500 Min. :-1.80301 A :187
## 1st Qu.: 7500 1st Qu.:-1.00223 N :889
## Median : 15000 Median :-0.04558 U :588
## Mean : 33876 Mean : 0.00000 Y :868
## 3rd Qu.: 35000 3rd Qu.: 0.96857 NA's:168
## Max. :200000 Max. : 2.04859
## NA's :98 NA's :17
# vote is a factor with levels A (abstention), N (against Pinochet),
# U (undecided), Y (for Pinochet)
# Fit of the model done by multinom: Response ~ Predictors
# It is an iterative procedure (maxit sets the maximum number of iterations)
# Read the documentation in ?multinom for more information
mod1 <- multinom(vote ~ age + education + statusquo, data = Chile,
                 maxit = 1e3)
## # weights: 24 (15 variable)
## initial value 3476.826258
## iter 10 value 2310.201176
## iter 20 value 2135.385060
## final value 2132.416452
## converged
# Each row of coefficients gives the coefficients of the logistic
# regression of a level versus the reference level (A)
summary(mod1)
## Call:
## multinom(formula = vote ~ age + education + statusquo, data = Chile,
## maxit = 1000)
##
## Coefficients:
## (Intercept) age educationPS educationS statusquo
## N 0.3002851 0.004829029 0.4101765 -0.1526621 -1.7583872
## U 0.8722750 0.020030032 -1.0293079 -0.6743729 0.3261418
## Y 0.5093217 0.016697208 -0.4419826 -0.6909373 1.8752190
##
## Std. Errors:
## (Intercept) age educationPS educationS statusquo
## N 0.3315229 0.006742834 0.2659012 0.2098064 0.1292517
## U 0.3183088 0.006630914 0.2822363 0.2035971 0.1059440
## Y 0.3333254 0.006915012 0.2836015 0.2131728 0.1197440
##
## Residual Deviance: 4264.833
## AIC: 4294.833
# Set a different level as the reference (N) to ease interpretation
Chile$vote <- relevel(Chile$vote, ref = "N")
mod2 <- multinom(vote ~ age + education + statusquo, data = Chile,
                 maxit = 1e3)
## # weights: 24 (15 variable)
## initial value 3476.826258
## iter 10 value 2393.713801
## iter 20 value 2134.438912
## final value 2132.416452
## converged
summary(mod2)
## Call:
## multinom(formula = vote ~ age + education + statusquo, data = Chile,
## maxit = 1000)
##
## Coefficients:
## (Intercept) age educationPS educationS statusquo
## A -0.3002035 -0.00482911 -0.4101274 0.1525608 1.758307
## U 0.5720544 0.01519931 -1.4394862 -0.5217093 2.084491
## Y 0.2091397 0.01186576 -0.8521205 -0.5382716 3.633550
##
## Std. Errors:
## (Intercept) age educationPS educationS statusquo
## A 0.3315153 0.006742654 0.2658887 0.2098012 0.1292494
## U 0.2448452 0.004819103 0.2116375 0.1505854 0.1091445
## Y 0.2850655 0.005700894 0.2370881 0.1789293 0.1316567
##
## Residual Deviance: 4264.833
## AIC: 4294.833
exp(coef(mod2))
## (Intercept) age educationPS educationS statusquo
## A 0.7406675 0.9951825 0.6635657 1.1648133 5.802607
## U 1.7719034 1.0153154 0.2370495 0.5935052 8.040502
## Y 1.2326171 1.0119364 0.4265095 0.5837564 37.846937
# Some highlights:
# - The intercepts have little interpretation (they correspond to age = 0).
#   A possible solution is to center age at its mean (so age = 0 would
#   represent the mean age)
# - Both age and statusquo increase the probability of voting Y or U with
#   respect to voting N -> conservativeness increases with age
# - Both age and statusquo increase the probability of voting Y and U more
#   than that of A -> the elderly and status quo supporters are more
#   decided to participate
# - A PS level of education increases the probability of voting N. The same
#   happens with an S level of education, but with more proneness to A
# Prediction of votes - three profiles of voters
newdata <- data.frame(age = c(23, 40, 50),
                      education = c("PS", "S", "P"),
                      statusquo = c(-1, 0, 2))
# Probabilities of belonging to each class
predict(mod2, newdata = newdata, type = "probs")
## N A U Y
## 1 0.856057623 0.064885869 0.06343390 0.01562261
## 2 0.208361489 0.148185871 0.40245842 0.24099422
## 3 0.000288924 0.005659661 0.07076828 0.92328313
# Predicted class
predict(mod2, newdata = newdata, type = "class")
## [1] N U Y
## Levels: N A U Y
```
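
To connect the output above with (C.1) and (C.2), the predicted probabilities of the first voter profile can be reconstructed by hand from `coef(mod2)`. The sketch below refits `mod2` so that it runs standalone, assuming `nnet` and `carData` are installed.

```r
library(nnet)
data(Chile, package = "carData")
Chile$vote <- relevel(Chile$vote, ref = "N")
mod2 <- multinom(vote ~ age + education + statusquo, data = Chile,
                 maxit = 1e3, trace = FALSE)
# Rows of coef(mod2) give the coefficients of A, U, Y versus the reference N
B <- coef(mod2)
# First profile: age = 23, education = "PS" (so educationPS = 1,
# educationS = 0), statusquo = -1
x <- c(1, 23, 1, 0, -1) # (intercept, age, educationPS, educationS, statusquo)
eta <- drop(B %*% x)    # linear predictors of A, U, Y versus N
p <- c(1, exp(eta)) / (1 + sum(exp(eta))) # (p_N, p_A, p_U, p_Y) by (C.1)-(C.2)
p # matches the first row of the predict(..., type = "probs") output above
```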

Multinomial logistic regression suffers from numerical instabilities, and its iterative fitting algorithm may even fail to converge, if the levels of the categorical variable are very separated in terms of the predictors (e.g., two clearly separated data clouds, each corresponding to a different level of the categorical variable).
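
A minimal sketch of this failure mode, assuming `nnet` is available: two clearly separated data clouds make the likelihood lack a finite maximizer, so the fitted slope grows very large (the simulated data below are invented for illustration).

```r
library(nnet)
set.seed(42)
# Two clearly separated clouds, one per level of y: (almost) no overlap at x = 0
x <- c(rnorm(50, mean = -3), rnorm(50, mean = 3))
y <- factor(rep(c("a", "b"), each = 50))
mod <- multinom(y ~ x, trace = FALSE)
coef(mod) # very large slope: a symptom of (quasi-)perfect separation
```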

The multinomial model employs \((J-1)(k+1)\) parameters. It is easy to end up with **complex models** – that require a **large sample size** to be fitted properly – if the response has more than a few levels and there are several predictors. For example, with \(J=5\) levels and \(k=8\) predictors we will have \(36\) parameters. Estimating this model with 50–100 observations will likely result in **overfitting**.
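
The parameter count is easy to tabulate; `n_par` is just an illustrative helper name.

```r
# Number of parameters of the multinomial model: (J - 1) * (k + 1)
n_par <- function(J, k) (J - 1) * (k + 1)
n_par(J = 5, k = 8) # 36: 4 intercepts and 32 slopes
```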