A.3 Multinomial logistic regression
The logistic model seen in Section 5.2.1 can be generalized to a categorical variable \(Y\) with more than two possible levels, encoded as \(\{1,\ldots,J\}.\) Given the predictors \(X_1,\ldots,X_p,\) multinomial logistic regression models the probability of each level \(j\) of \(Y\) by
\[\begin{align} p_j(\mathbf{x}):=&\,\mathbb{P}[Y=j|X_1=x_1,\ldots,X_p=x_p]\nonumber\\ =&\,\frac{e^{\beta_{0j}+\beta_{1j}x_1+\cdots+\beta_{pj}x_p}}{1+\sum_{\ell=1}^{J-1}e^{\beta_{0\ell}+\beta_{1\ell}x_1+\cdots+\beta_{p\ell}x_p}} \tag{A.4} \end{align}\]
for \(j=1,\ldots,J-1\) and (for the last level \(J\))
\[\begin{align} p_J(\mathbf{x}):=&\,\mathbb{P}[Y=J|X_1=x_1,\ldots,X_p=x_p]\nonumber\\ =&\,\frac{1}{1+\sum_{\ell=1}^{J-1}e^{\beta_{0\ell}+\beta_{1\ell}x_1+\cdots+\beta_{p\ell}x_p}}. \tag{A.5} \end{align}\]
Note that (A.4) and (A.5) imply that \(\sum_{j=1}^J p_j(\mathbf{x})=1\) and that the model has \((J-1)\times(p+1)\) coefficients244. Also, (A.5) reveals that the last level, \(J,\) receives a different treatment: it is the reference level (any level could play this role, but it is traditional to choose the last one).
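For concreteness, the following minimal sketch evaluates (A.4) and (A.5) for a given coefficient matrix. The function multinom_probs() and the matrix B are hypothetical illustrations, not part of any package.
# Sketch of (A.4)-(A.5): probabilities of each level of Y from a hypothetical
# (J - 1) x (p + 1) coefficient matrix B (one row per non-reference level)
# and a predictor vector x of length p
multinom_probs <- function(B, x) {
  eta <- B %*% c(1, x) # Linear predictors of the J - 1 non-reference levels
  num <- c(exp(eta), 1) # The reference level J contributes e^0 = 1
  num / sum(num) # p_1(x), ..., p_J(x); they add up to one
}
# Example with J = 3 levels and p = 2 predictors
B <- rbind(c(0.5, -1, 2), c(-0.2, 0.3, 1))
multinom_probs(B, x = c(1, 0.5))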
The multinomial logistic model has an interesting interpretation in terms of logistic regressions. Taking the quotient between (A.4) and (A.5) gives
\[\begin{align} \frac{p_j(\mathbf{x})}{p_J(\mathbf{x})}=e^{\beta_{0j}+\beta_{1j}x_1+\cdots+\beta_{pj}x_p}\tag{A.6} \end{align}\]
for \(j=1,\ldots,J-1.\) Therefore, taking logarithms on both sides, we have
\[\begin{align} \log\left(\frac{p_j(\mathbf{x})}{p_J(\mathbf{x})}\right)=\beta_{0j}+\beta_{1j}x_1+\cdots+\beta_{pj}x_p.\tag{A.7} \end{align}\]
This equation is very similar to (5.8). If \(J=2,\) it is the same up to a relabeling of the levels: the logistic regression giving the probability of \(Y=1\) versus \(Y=2.\) On the LHS of (A.7) we have the logarithm of the ratio of two probabilities and on the RHS a linear combination of the predictors. If the probabilities on the LHS were complementary (if they added up to one), we would have a log-odds and hence a logistic regression for \(Y.\) This is not the situation, but it is close: instead of odds and log-odds, we have ratios and log-ratios of non-complementary probabilities. It also gives good insight into what multinomial logistic regression is: a set of \(J-1\) logistic regressions, fitted jointly, for the probability of \(Y=j\) versus the probability of the reference \(Y=J.\)
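As a quick numerical check of the \(J=2\) case (a sketch with simulated data, not from the original analysis), nnet::multinom and a binomial glm recover essentially the same coefficients:
# For J = 2, multinom() reduces to a logistic regression: both fits below
# estimate the log-ratio of P(Y = second level) versus the first level
set.seed(42)
x <- rnorm(100)
y <- factor(rbinom(100, size = 1, prob = 1 / (1 + exp(-x))))
coef(nnet::multinom(y ~ x, trace = FALSE)) # Multinomial fit with J = 2
coef(glm(y ~ x, family = binomial)) # Approximately the same coefficients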
Equation (A.6) also gives an interpretation of the coefficients of the model, since
\[\begin{align*} p_j(\mathbf{x})=e^{\beta_{0j}+\beta_{1j}x_1+\cdots+\beta_{pj}x_p}p_J(\mathbf{x}). \end{align*}\]
Therefore:
- \(e^{\beta_{0j}}\): the ratio \(p_j(\mathbf{0})/p_J(\mathbf{0})\) between the probabilities of \(Y=j\) and \(Y=J\) when \(X_1=\cdots=X_p=0.\) If \(e^{\beta_{0j}}>1\) (equivalently, \(\beta_{0j}>0\)), then \(Y=j\) is more likely than \(Y=J.\) If \(e^{\beta_{0j}}<1\) (\(\beta_{0j}<0\)), then \(Y=j\) is less likely than \(Y=J.\)
- \(e^{\beta_{\ell j}},\) \(\ell\geq1\): the multiplicative increment of the ratio \(p_j(\mathbf{x})/p_J(\mathbf{x})\) for an increment of one unit in \(X_\ell=x_\ell,\) provided that the remaining variables \(X_1,\ldots,X_{\ell-1},X_{\ell+1},\ldots,X_p\) do not change. If \(e^{\beta_{\ell j}}>1\) (equivalently, \(\beta_{\ell j}>0\)), then \(Y=j\) becomes more likely than \(Y=J\) with each increment in \(X_\ell.\) If \(e^{\beta_{\ell j}}<1\) (\(\beta_{\ell j}<0\)), then \(Y=j\) becomes less likely than \(Y=J\) (see the numerical illustration after this list).
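Reusing the hypothetical multinom_probs() and B from the sketch above, the multiplicative-increment interpretation can be checked numerically:
# One-unit increment in X_1: the ratio p_1(x) / p_3(x) gets multiplied by
# e^{beta_{11}}, which is exp(B[1, 2]) in the sketched parametrization
x0 <- c(1, 0.5)
p0 <- multinom_probs(B, x0)
p1 <- multinom_probs(B, x0 + c(1, 0)) # Increment the first predictor
(p1[1] / p1[3]) / (p0[1] / p0[3]) # Equals exp(B[1, 2])
exp(B[1, 2])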
The following code illustrates how to fit a basic multinomial regression employing the nnet package.
# Data from the voting intentions in the 1988 Chilean national plebiscite
data(Chile, package = "carData")
summary(Chile)
##  region   population        sex          age         education      income         statusquo           vote    
##  C :600   Min.   :  3750   F:1379   Min.   :18.00   P   :1107   Min.   :  2500   Min.   :-1.80301   A   :187  
##  M :100   1st Qu.: 25000   M:1321   1st Qu.:26.00   PS  : 462   1st Qu.:  7500   1st Qu.:-1.00223   N   :889  
##  N :322   Median :175000            Median :36.00   S   :1120   Median : 15000   Median :-0.04558   U   :588  
##  S :718   Mean   :152222            Mean   :38.55   NA's:  11   Mean   : 33876   Mean   : 0.00000   Y   :868  
##  SA:960   3rd Qu.:250000            3rd Qu.:49.00               3rd Qu.: 35000   3rd Qu.: 0.96857   NA's:168  
##           Max.   :250000            Max.   :70.00               Max.   :200000   Max.   : 2.04859             
##                                     NA's   :1                   NA's   :98       NA's   :17                   
# vote is a factor with levels A (abstention), N (against Pinochet),
# U (undecided), Y (for Pinochet)
# Fit of the model done by multinom: Response ~ Predictors
# It is an iterative procedure (maxit sets the maximum number of iterations)
# Read the documentation in ?multinom for more information
mod1 <- nnet::multinom(vote ~ age + education + statusquo, data = Chile,
                       maxit = 1e3)
## # weights: 24 (15 variable)
## initial value 3476.826258
## iter 10 value 2310.201176
## iter 20 value 2135.385060
## final value 2132.416452
## converged
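# The "15 variable" reported above are the (J - 1) * (p + 1) = 3 * 5
# coefficients in (A.4): J = 4 levels of vote and p = 4 predictor columns
# (age, educationPS, educationS, statusquo)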
# Each row of coefficients gives the coefficients of the logistic
# regression of a level versus the reference level (A)
summary(mod1)
## Call:
## nnet::multinom(formula = vote ~ age + education + statusquo,
## data = Chile, maxit = 1000)
##
## Coefficients:
## (Intercept) age educationPS educationS statusquo
## N 0.3002851 0.004829029 0.4101765 -0.1526621 -1.7583872
## U 0.8722750 0.020030032 -1.0293079 -0.6743729 0.3261418
## Y 0.5093217 0.016697208 -0.4419826 -0.6909373 1.8752190
##
## Std. Errors:
## (Intercept) age educationPS educationS statusquo
## N 0.3315229 0.006742834 0.2659012 0.2098064 0.1292517
## U 0.3183088 0.006630914 0.2822363 0.2035971 0.1059440
## Y 0.3333254 0.006915012 0.2836015 0.2131728 0.1197440
##
## Residual Deviance: 4264.833
## AIC: 4294.833
# Set a different level as the reference (N) for easier interpretations
Chile$vote <- relevel(Chile$vote, ref = "N")
mod2 <- nnet::multinom(vote ~ age + education + statusquo, data = Chile,
                       maxit = 1e3)
## # weights: 24 (15 variable)
## initial value 3476.826258
## iter 10 value 2393.713801
## iter 20 value 2134.438912
## final value 2132.416452
## converged
summary(mod2)
## Call:
## nnet::multinom(formula = vote ~ age + education + statusquo,
## data = Chile, maxit = 1000)
##
## Coefficients:
## (Intercept) age educationPS educationS statusquo
## A -0.3002035 -0.00482911 -0.4101274 0.1525608 1.758307
## U 0.5720544 0.01519931 -1.4394862 -0.5217093 2.084491
## Y 0.2091397 0.01186576 -0.8521205 -0.5382716 3.633550
##
## Std. Errors:
## (Intercept) age educationPS educationS statusquo
## A 0.3315153 0.006742654 0.2658887 0.2098012 0.1292494
## U 0.2448452 0.004819103 0.2116375 0.1505854 0.1091445
## Y 0.2850655 0.005700894 0.2370881 0.1789293 0.1316567
##
## Residual Deviance: 4264.833
## AIC: 4294.833
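# summary() gives no p-values for a multinom fit; a common workaround (a
# sketch relying on the asymptotic normality of the estimates) is to compute
# Wald tests from the coefficients and their standard errors
z <- summary(mod2)$coefficients / summary(mod2)$standard.errors
2 * pnorm(abs(z), lower.tail = FALSE) # Approximate two-sided p-values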
exp(coef(mod2))
## (Intercept) age educationPS educationS statusquo
## A 0.7406675 0.9951825 0.6635657 1.1648133 5.802607
## U 1.7719034 1.0153154 0.2370495 0.5935052 8.040502
## Y 1.2326171 1.0119364 0.4265095 0.5837564 37.846937
# Some highlights:
# - intercepts have little interpretation by themselves (they correspond to
#   age = 0). A possible solution is to center age at its mean (so that
#   age = 0 represents the mean age)
# - both age and statusquo increase the probability of voting Y, A, or U
#   with respect to voting N -> conservatism increases with age
# - both age and statusquo increase the probability of voting Y and U more
#   than that of A -> the elderly and status quo supporters are more decided
#   to participate
# - a PS level of education increases the probability of voting N. The same
#   happens for an S level of education, but with more proneness to A
# Prediction of votes -- three profiles of voters
newdata <- data.frame(age = c(23, 40, 50),
                      education = c("PS", "S", "P"),
                      statusquo = c(-1, 0, 2))
# Probabilities of belonging to each class
predict(mod2, newdata = newdata, type = "probs")
## N A U Y
## 1 0.856057623 0.064885869 0.06343390 0.01562261
## 2 0.208361489 0.148185871 0.40245842 0.24099422
## 3 0.000288924 0.005659661 0.07076828 0.92328313
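# Sanity check (a sketch): recompute the probabilities in (A.4)-(A.5) by hand
# for the first profile; model.matrix() builds the row of dummy variables in
# the same order as the columns of coef(mod2)
B <- coef(mod2) # (J - 1) x (p + 1) matrix; reference level is N
eta <- B %*% model.matrix(~ age + education + statusquo, data = newdata)[1, ]
c(1, exp(eta)) / (1 + sum(exp(eta))) # Probabilities of N, A, U, Y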
# Predicted class
predict(mod2, newdata = newdata, type = "class")
## [1] N U Y
## Levels: N A U Y
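# The predicted class is the level with the largest probability; a quick
# sketch verifying this equivalence by hand
probs <- predict(mod2, newdata = newdata, type = "probs")
colnames(probs)[apply(probs, 1, which.max)] # Same as type = "class"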
Multinomial logistic regression may suffer from numerical instabilities, and its iterative algorithm may even fail to converge, if the levels of the categorical variable are very separated in terms of the predictors (e.g., two clearly separated data clouds, each corresponding to a different level of the categorical variable).
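The following sketch (with simulated, perfectly separated data, not from the Chile dataset) illustrates the issue: the slope estimate grows without bound and its reported value is essentially meaningless.
# Perfect separation: the two levels of y do not overlap in x, so the
# log-likelihood keeps increasing as the slope grows and the estimate diverges
set.seed(1)
x <- c(rnorm(50, mean = -5), rnorm(50, mean = 5))
y <- factor(rep(c("a", "b"), each = 50))
coef(nnet::multinom(y ~ x, maxit = 1e3, trace = FALSE)) # Huge, unstable slope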
244. \((J-1)\) intercepts and \((J-1)\times p\) slopes.↩︎