5.2 Multinomial Logistic Regression

The following notes rely on the [PSU STAT 504 course notes](https://online.stat.psu.edu/stat504/node/171/.

Multinomial logistic regression models the odds the multinomial response variable \(Y \sim Mult(n, \pi)\) is in level \(j\) relative to baseline category \(j^*\) for all pairs of categories as a function of \(k\) explanatory variables, \(X = (X_1, X_2, ... X_k)\).

\[\log \left( \frac{\pi_{ij}}{\pi_{ij^*}} \right) = x_i^T \beta_j, \hspace{5mm} j \ne j^2\]

Interpet the \(k^{th}\) element of \(\beta_j\) as the increase in log-odds of falling a response in category \(j\) relative to category \(j^*\) resulting from a one-unit increase in the \(k^{th}\) predictor term, holding the other terms constant.

Multinomial model is a type of GLM.

Here is an example using multinomial logistic regression. A researcher classified the stomach contents of \(n = 219\) alligators according to \(r = 5\) categories (fish, Inv., Rept, Bird, Other) as a function of covariates Lake, Sex, and Size..

gator_dat <- tribble(
  ~profile, ~Gender, ~Size, ~Lake, ~Fish, ~Invertebrate, ~Reptile, ~Bird, ~Other,
  "1", "f", "<2.3", "george",  3, 9, 1, 0, 1,
  "2", "m", "<2.3", "george", 13, 10, 0, 2, 2,
  "3", "f", ">2.3", "george", 8, 1, 0, 0, 1,
  "4", "m", ">2.3", "george", 9, 0, 0, 1, 2,
  "5", "f", "<2.3", "hancock", 16, 3, 2, 2, 3,
  "6", "m", "<2.3", "hancock", 7, 1, 0, 0, 5,
  "7", "f", ">2.3", "hancock", 3, 0, 1, 2, 3,
  "8", "m", ">2.3", "hancock", 4, 0, 0, 1, 2,
  "9", "f", "<2.3", "oklawaha", 3, 9, 1, 0, 2,
  "10", "m", "<2.3", "oklawaha", 2, 2, 0, 0, 1,
  "11", "f", ">2.3", "oklawaha", 0, 1, 0, 1, 0,
  "12", "m", ">2.3", "oklawaha", 13, 7, 6, 0, 0,
  "13", "f", "<2.3", "trafford", 2, 4, 1, 1, 4,
  "14", "m", "<2.3", "trafford", 3, 7, 1, 0, 1,
  "15", "f", ">2.3", "trafford", 0, 1, 0, 0, 0,
  "16", "m", ">2.3", "trafford", 8, 6, 6, 3, 5
)
gator_dat <- gator_dat %>%
  mutate(
    Gender = as_factor(Gender),
    Lake = fct_relevel(Lake, "hancock"),
    Size = as_factor(Size)
  )

There are 4 equations to estimate:

\[\log \left( \frac{\pi_j} {\pi_{j^*}} \right) = \beta X\]

where \(\pi_{j^*}\) is the probability of fish, the baseline category.

Run a multivariate logistic regression model with VGAM::vglm().

library(VGAM)
## Warning: package 'VGAM' was built under R version 4.0.2

vglm() fits 4 logit models.

gator_vglm <- vglm(
  cbind(Bird,Invertebrate,Reptile,Other,Fish) ~ Lake + Size + Gender, 
  data = gator_dat, 
  family = multinomial
)
summary(gator_vglm)
## 
## Call:
## vglm(formula = cbind(Bird, Invertebrate, Reptile, Other, Fish) ~ 
##     Lake + Size + Gender, family = multinomial, data = gator_dat)
## 
## Pearson residuals:
##                       Min     1Q  Median    3Q  Max
## log(mu[,1]/mu[,5]) -1.199 -0.548 -0.2242 0.368 3.48
## log(mu[,2]/mu[,5]) -1.322 -0.461  0.0105 0.381 1.87
## log(mu[,3]/mu[,5]) -0.703 -0.575 -0.3551 0.261 2.06
## log(mu[,4]/mu[,5]) -1.694 -0.289 -0.1081 1.124 1.37
## 
## Coefficients: 
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1    -1.857      0.581   -3.19   0.0014 ** 
## (Intercept):2    -1.611      0.551   -2.93   0.0034 ** 
## (Intercept):3    -2.287      0.657   -3.48   0.0005 ***
## (Intercept):4    -0.664      0.380   -1.75   0.0806 .  
## Lakegeorge:1     -0.575      0.795   -0.72   0.4694    
## Lakegeorge:2      1.781      0.623    2.86   0.0043 ** 
## Lakegeorge:3     -1.129      1.193   -0.95   0.3437    
## Lakegeorge:4     -0.767      0.569   -1.35   0.1776    
## Lakeoklawaha:1   -1.126      1.192   -0.94   0.3451    
## Lakeoklawaha:2    2.694      0.669    4.02 0.000057 ***
## Lakeoklawaha:3    1.401      0.810    1.73   0.0839 .  
## Lakeoklawaha:4   -0.741      0.742   -1.00   0.3184    
## Laketrafford:1    0.662      0.846    0.78   0.4341    
## Laketrafford:2    2.936      0.687    4.27 0.000019 ***
## Laketrafford:3    1.932      0.825    2.34   0.0193 *  
## Laketrafford:4    0.791      0.588    1.35   0.1784    
## Size>2.3:1        0.730      0.652    1.12   0.2629    
## Size>2.3:2       -1.336      0.411   -3.25   0.0012 ** 
## Size>2.3:3        0.557      0.647    0.86   0.3890    
## Size>2.3:4       -0.291      0.460   -0.63   0.5275    
## Genderm:1        -0.606      0.689   -0.88   0.3787    
## Genderm:2        -0.463      0.396   -1.17   0.2418    
## Genderm:3        -0.628      0.685   -0.92   0.3598    
## Genderm:4        -0.253      0.466   -0.54   0.5881    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]), 
## log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
## 
## Residual deviance: 50 on 40 degrees of freedom
## 
## Log-likelihood: -73 on 40 degrees of freedom
## 
## Number of Fisher scoring iterations: 5 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## 
## Reference group is level  5  of the response

The residual deviance is 50.2637 on 40 degrees of freedom. Residual deviance tests the current model fit versus the saturated model. The saturated model, which fits a separate multinomial distribution to each of the 16 profiles (unique combinations of lake, sex and size), has 16 × 4 = 64 parameters. The current model has an intercept, three lake coefficients, one sex coefficient and one size coefficient for each of the four logit equations, for a total of 24 parameters. Therefore, the overall fit statistics have 64 − 24 = 40 degrees of freedom.

E <- data.frame(fitted(gator_vglm) * rowSums(gator_dat[, 5:9]))
O <- gator_dat %>% select(Bird, Invertebrate, Reptile, Other, Fish) + .000001
(g2 <- 2 * sum(O * log(O / E)))
## [1] 50

indicates the model fits okay, but not great. The Residual Deviance of 50.26 with 40 df from the table above output is reasonable, with p-value of 0.1282 and the statistics/df is close to 1 that is 1.256.