C Multinomial logistic regression

The logistic model can be generalized to categorical variables \(Y\) with more than two possible levels, namely \(\{1,\ldots,J\}\). Given the predictors \(X_1,\ldots,X_k\), multinomial logistic regression models the probability of each level \(j\) of \(Y\) by \[\begin{align} p_j(\mathbf{x})=\mathbb{P}[Y=j|X_1=x_1,\ldots,X_k=x_k]=\frac{e^{\beta_{0j}+\beta_{1j}X_1+\cdots+\beta_{kj}X_k}}{1+\sum_{l=1}^{J-1}e^{\beta_{0l}+\beta_{1l}X_1+\cdots+\beta_{kl}X_k}} \tag{C.1} \end{align}\] for \(j=1,\ldots,J-1\) and (for the last level \(J\)) \[\begin{align} p_J(\mathbf{x})=\mathbb{P}[Y=J|X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+\sum_{l=1}^{J-1}e^{\beta_{0l}+\beta_{1l}X_1+\cdots+\beta_{kl}X_k}}. \tag{C.2} \end{align}\] Note that (C.1) and (C.2) imply that \(\sum_{j=1}^J p_j(\mathbf{x})=1\) and that there are \((J-1)\times(k+1)\) coefficients (\(J-1\) intercepts and \((J-1)\times k\) slopes). Also, (C.2) reveals that the last level, \(J\), is given a different treatment. This is because it is the reference level (it could be a different one, but it is traditional to choose the last one).
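To make (C.1) and (C.2) concrete, the following sketch evaluates the \(J\) class probabilities for a given coefficient matrix. The function multinom_probs and the toy coefficients are made up for illustration; they are not part of any package.

# Sketch of (C.1)-(C.2): evaluate the J class probabilities given a
# (J - 1) x (k + 1) coefficient matrix B (intercepts in the first column)
# and a predictor vector x of length k
multinom_probs <- function(B, x) {
  eta <- drop(B %*% c(1, x))  # linear predictors of levels 1, ..., J - 1
  den <- 1 + sum(exp(eta))    # common denominator in (C.1) and (C.2)
  c(exp(eta) / den, 1 / den)  # last entry is the reference level J
}

# Toy example with J = 3 levels and k = 1 predictor
B <- rbind(c(0.3, -1.2), c(-0.5, 0.7))
p <- multinom_probs(B, x = 2)
sum(p)  # 1, as expected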

The multinomial logistic model has an interesting interpretation in terms of logistic regressions. Taking the quotient between (C.1) and (C.2) gives \[\begin{align} \frac{p_j(\mathbf{x})}{p_J(\mathbf{x})}=e^{\beta_{0j}+\beta_{1j}X_1+\cdots+\beta_{kj}X_k}\tag{C.3} \end{align}\] for \(j=1,\ldots,J-1\). Therefore, taking logarithms on both sides we have \[\begin{align} \log\frac{p_j(\mathbf{x})}{p_J(\mathbf{x})}=\beta_{0j}+\beta_{1j}X_1+\cdots+\beta_{kj}X_k.\tag{C.4} \end{align}\] This equation is indeed very similar to (4.7). If \(J=2\), it is the same up to a change in the codes for the levels: the logistic regression giving the probability of \(Y=1\) versus \(Y=2\). On the LHS of (C.4) we have the logarithm of the ratio of two probabilities and on the RHS a linear combination of the predictors. If the probabilities on the LHS were complementary (if they added up to one), then we would have log-odds and hence a logistic regression for \(Y\). This is not the situation, but it is close: instead of odds and log-odds, we have ratios and log-ratios of non-complementary probabilities. This also gives good insight into what the multinomial logistic regression is: a set of \(J-1\) independent “logistic regressions” for the probability of \(Y=j\) versus the probability of the reference \(Y=J\).
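Continuing the toy sketch above, a quick numerical check of (C.4): the log-ratios against the reference level coincide with the linear predictors.

log(p[1:2] / p[3])  # LHS of (C.4) for j = 1, 2
drop(B %*% c(1, 2)) # RHS of (C.4): the linear predictors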

Equation (C.3) also gives an interpretation of the coefficients of the model, since \[ p_j(\mathbf{x})=e^{\beta_{0j}+\beta_{1j}X_1+\cdots+\beta_{kj}X_k}p_J(\mathbf{x}). \] Therefore:

  • \(e^{\beta_{0j}}\): the ratio \(p_j(\mathbf{0})/p_J(\mathbf{0})\) between the probabilities of \(Y=j\) and \(Y=J\) when \(X_1=\cdots=X_k=0\). If \(e^{\beta_{0j}}>1\) (equivalently, \(\beta_{0j}>0\)), then \(Y=j\) is more likely than \(Y=J\). If \(e^{\beta_{0j}}<1\) (\(\beta_{0j}<0\)), then \(Y=j\) is less likely than \(Y=J\).
  • \(e^{\beta_{lj}}\), \(l\geq1\): the multiplicative increment of the ratio \(p_j(\mathbf{x})/p_J(\mathbf{x})\) for an increment of one unit in \(X_l=x_l\), provided that the remaining variables \(X_1,\ldots,X_{l-1},X_{l+1},\ldots,X_k\) do not change (see the derivation right after this list). If \(e^{\beta_{lj}}>1\) (equivalently, \(\beta_{lj}>0\)), then \(Y=j\) becomes more likely than \(Y=J\) for each increment in \(X_l\). If \(e^{\beta_{lj}}<1\) (\(\beta_{lj}<0\)), then \(Y=j\) becomes less likely than \(Y=J\).
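The second interpretation follows directly from (C.3): denoting by \(\mathbf{e}_l\) the \(l\)-th canonical vector, \[\frac{p_j(\mathbf{x}+\mathbf{e}_l)/p_J(\mathbf{x}+\mathbf{e}_l)}{p_j(\mathbf{x})/p_J(\mathbf{x})}=\frac{e^{\beta_{0j}+\beta_{1j}x_1+\cdots+\beta_{lj}(x_l+1)+\cdots+\beta_{kj}x_k}}{e^{\beta_{0j}+\beta_{1j}x_1+\cdots+\beta_{lj}x_l+\cdots+\beta_{kj}x_k}}=e^{\beta_{lj}}.\]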

The following code illustrates how to fit a basic multinomial regression in R.

# Package included in R that implements multinomial regression
library(nnet)

# Data from the voting intentions in the 1988 Chilean national plebiscite,
# available in the carData package
data(Chile, package = "carData")
summary(Chile)
##  region     population     sex           age        education  
##  C :600   Min.   :  3750   F:1379   Min.   :18.00   P   :1107  
##  M :100   1st Qu.: 25000   M:1321   1st Qu.:26.00   PS  : 462  
##  N :322   Median :175000            Median :36.00   S   :1120  
##  S :718   Mean   :152222            Mean   :38.55   NA's:  11  
##  SA:960   3rd Qu.:250000            3rd Qu.:49.00              
##           Max.   :250000            Max.   :70.00              
##                                     NA's   :1                  
##      income         statusquo          vote    
##  Min.   :  2500   Min.   :-1.80301   A   :187  
##  1st Qu.:  7500   1st Qu.:-1.00223   N   :889  
##  Median : 15000   Median :-0.04558   U   :588  
##  Mean   : 33876   Mean   : 0.00000   Y   :868  
##  3rd Qu.: 35000   3rd Qu.: 0.96857   NA's:168  
##  Max.   :200000   Max.   : 2.04859             
##  NA's   :98       NA's   :17
# vote is a factor with levels A (abstention), N (against Pinochet),
# U (undecided), Y (for Pinochet)

# The model is fitted by multinom with the syntax response ~ predictors
# It is an iterative procedure (maxit sets the maximum number of iterations)
# Read the documentation in ?multinom for more information
mod1 <- multinom(vote ~ age + education + statusquo, data = Chile,
                 maxit = 1e3)
## # weights:  24 (15 variable)
## initial  value 3476.826258 
## iter  10 value 2310.201176
## iter  20 value 2135.385060
## final  value 2132.416452 
## converged
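
As a quick sanity check, the fitted probabilities of each observation sum to one, as (C.1) and (C.2) imply:

# Fitted class probabilities for the training sample; each row sums to one
probs <- predict(mod1, type = "probs")
head(rowSums(probs))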

# Each row of coefficients gives the coefficients of the logistic
# regression of a level versus the reference level (A)
summary(mod1)
## Call:
## multinom(formula = vote ~ age + education + statusquo, data = Chile, 
##     maxit = 1000)
## 
## Coefficients:
##   (Intercept)         age educationPS educationS  statusquo
## N   0.3002851 0.004829029   0.4101765 -0.1526621 -1.7583872
## U   0.8722750 0.020030032  -1.0293079 -0.6743729  0.3261418
## Y   0.5093217 0.016697208  -0.4419826 -0.6909373  1.8752190
## 
## Std. Errors:
##   (Intercept)         age educationPS educationS statusquo
## N   0.3315229 0.006742834   0.2659012  0.2098064 0.1292517
## U   0.3183088 0.006630914   0.2822363  0.2035971 0.1059440
## Y   0.3333254 0.006915012   0.2836015  0.2131728 0.1197440
## 
## Residual Deviance: 4264.833 
## AIC: 4294.833
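
Note that summary does not report significance tests for multinom fits. A usual workaround, sketched below and relying on the asymptotic normality of the estimates, is to compute Wald \(z\)-statistics and their two-sided \(p\)-values from the coefficients and standard errors:

# Wald tests for the coefficients: z-statistics and two-sided p-values
z <- summary(mod1)$coefficients / summary(mod1)$standard.errors
p_values <- 2 * pnorm(-abs(z))
p_values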

# Set a different level as the reference (N) to facilitate interpretation
Chile$vote <- relevel(Chile$vote, ref = "N")
mod2 <- multinom(vote ~ age + education + statusquo, data = Chile,
                 maxit = 1e3)
## # weights:  24 (15 variable)
## initial  value 3476.826258 
## iter  10 value 2393.713801
## iter  20 value 2134.438912
## final  value 2132.416452 
## converged
summary(mod2)
## Call:
## multinom(formula = vote ~ age + education + statusquo, data = Chile, 
##     maxit = 1000)
## 
## Coefficients:
##   (Intercept)         age educationPS educationS statusquo
## A  -0.3002035 -0.00482911  -0.4101274  0.1525608  1.758307
## U   0.5720544  0.01519931  -1.4394862 -0.5217093  2.084491
## Y   0.2091397  0.01186576  -0.8521205 -0.5382716  3.633550
## 
## Std. Errors:
##   (Intercept)         age educationPS educationS statusquo
## A   0.3315153 0.006742654   0.2658887  0.2098012 0.1292494
## U   0.2448452 0.004819103   0.2116375  0.1505854 0.1091445
## Y   0.2850655 0.005700894   0.2370881  0.1789293 0.1316567
## 
## Residual Deviance: 4264.833 
## AIC: 4294.833
exp(coef(mod2))
##   (Intercept)       age educationPS educationS statusquo
## A   0.7406675 0.9951825   0.6635657  1.1648133  5.802607
## U   1.7719034 1.0153154   0.2370495  0.5935052  8.040502
## Y   1.2326171 1.0119364   0.4265095  0.5837564 37.846937
# Some highlights:
# - the intercepts do not have much interpretation (they correspond to
#   age = 0). A possible solution is to center age about its mean (so that
#   age = 0 would represent the mean age)
# - both age and statusquo increase the probability of voting Y, A, or U
#   with respect to voting N -> conservativeness increases with age
# - both age and statusquo increase the probability of voting Y and U more
#   than that of A -> the elderly and status quo supporters are more decided
#   to participate
# - a PS level of education increases the probability of voting N. The same
#   happens with an S level of education, but with a higher propensity to A
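
The interpretation of exp(coef(mod2)) can be checked numerically: increasing age by one unit, while keeping the other predictors fixed, multiplies the ratio of the probabilities of Y and N by \(e^{\beta_{\mathrm{age},Y}}\). The two voter profiles below are made up for this check.

# Two identical profiles, except for a one-unit difference in age
prof <- data.frame(age = c(30, 31), education = "S", statusquo = 0)
pr <- predict(mod2, newdata = prof, type = "probs")
(pr[2, "Y"] / pr[2, "N"]) / (pr[1, "Y"] / pr[1, "N"])  # ratio of ratios
exp(coef(mod2)["Y", "age"])  # coincides with the previous value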

# Prediction of votes - three profiles of voters
newdata <- data.frame(age = c(23, 40, 50),
                      education = c("PS", "S", "P"),
                      statusquo = c(-1, 0, 2))

# Probabilities of belonging to each class
predict(mod2, newdata = newdata, type = "probs")
##             N           A          U          Y
## 1 0.856057623 0.064885869 0.06343390 0.01562261
## 2 0.208361489 0.148185871 0.40245842 0.24099422
## 3 0.000288924 0.005659661 0.07076828 0.92328313

# Predicted class
predict(mod2, newdata = newdata, type = "class")
## [1] N U Y
## Levels: N A U Y
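
To get a rough idea of the in-sample performance of mod2 (a sketch, not part of the original analysis), the observed and fitted classes can be cross-tabulated for the complete cases employed in the fit:

# Complete cases in the variables employed in the fit of mod2
used <- complete.cases(Chile[, c("vote", "age", "education", "statusquo")])
tab <- table(observed = Chile$vote[used], fitted = predict(mod2))
tab
sum(diag(tab)) / sum(tab)  # in-sample proportion of correct classification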

Multinomial logistic regression will suffer from numerical instabilities, and its iterative fitting algorithm might even fail to converge, if the levels of the categorical variable are highly separated in the predictor space (e.g., two clearly separated data clouds, each corresponding to a different level of the categorical variable).
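A minimal simulated illustration of this issue (with made-up data): when the two data clouds are perfectly separated, the slope estimate blows up and its reported standard error becomes meaningless.

# Two perfectly separated data clouds, each linked to one level of y
set.seed(42)
x <- c(rnorm(50, mean = -5), rnorm(50, mean = 5))
y <- factor(rep(c("a", "b"), each = 50))
mod_sep <- multinom(y ~ x, maxit = 1e3, trace = FALSE)
coef(mod_sep)  # a huge slope is a symptom of perfect separation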

The multinomial model employs \((J-1)(k+1)\) parameters. It is easy to end up with complex models that require a large sample size to be fitted properly if the response has several levels and there are several predictors. For example, with \(J=5\) levels and \(k=8\) predictors we will have \((5-1)\times(8+1)=36\) parameters. Estimating this model with \(50-100\) observations will probably result in overfitting.