## 5.2 Multinomial Logistic Regression

The following notes rely on the [PSU STAT 504 course notes](https://online.stat.psu.edu/stat504/node/171/.

Multinomial logistic regression models the odds the multinomial response variable $$Y \sim Mult(n, \pi)$$ is in level $$j$$ relative to baseline category $$j^*$$ for all pairs of categories as a function of $$k$$ explanatory variables, $$X = (X_1, X_2, ... X_k)$$.

$\log \left( \frac{\pi_{ij}}{\pi_{ij^*}} \right) = x_i^T \beta_j, \hspace{5mm} j \ne j^2$

Interpet the $$k^{th}$$ element of $$\beta_j$$ as the increase in log-odds of falling a response in category $$j$$ relative to category $$j^*$$ resulting from a one-unit increase in the $$k^{th}$$ predictor term, holding the other terms constant.

Multinomial model is a type of GLM.

Here is an example using multinomial logistic regression. A researcher classified the stomach contents of $$n = 219$$ alligators according to $$r = 5$$ categories (fish, Inv., Rept, Bird, Other) as a function of covariates Lake, Sex, and Size..

gator_dat <- tribble(
~profile, ~Gender, ~Size, ~Lake, ~Fish, ~Invertebrate, ~Reptile, ~Bird, ~Other,
"1", "f", "<2.3", "george",  3, 9, 1, 0, 1,
"2", "m", "<2.3", "george", 13, 10, 0, 2, 2,
"3", "f", ">2.3", "george", 8, 1, 0, 0, 1,
"4", "m", ">2.3", "george", 9, 0, 0, 1, 2,
"5", "f", "<2.3", "hancock", 16, 3, 2, 2, 3,
"6", "m", "<2.3", "hancock", 7, 1, 0, 0, 5,
"7", "f", ">2.3", "hancock", 3, 0, 1, 2, 3,
"8", "m", ">2.3", "hancock", 4, 0, 0, 1, 2,
"9", "f", "<2.3", "oklawaha", 3, 9, 1, 0, 2,
"10", "m", "<2.3", "oklawaha", 2, 2, 0, 0, 1,
"11", "f", ">2.3", "oklawaha", 0, 1, 0, 1, 0,
"12", "m", ">2.3", "oklawaha", 13, 7, 6, 0, 0,
"13", "f", "<2.3", "trafford", 2, 4, 1, 1, 4,
"14", "m", "<2.3", "trafford", 3, 7, 1, 0, 1,
"15", "f", ">2.3", "trafford", 0, 1, 0, 0, 0,
"16", "m", ">2.3", "trafford", 8, 6, 6, 3, 5
)
gator_dat <- gator_dat %>%
mutate(
Gender = as_factor(Gender),
Lake = fct_relevel(Lake, "hancock"),
Size = as_factor(Size)
)

There are 4 equations to estimate:

$\log \left( \frac{\pi_j} {\pi_{j^*}} \right) = \beta X$

where $$\pi_{j^*}$$ is the probability of fish, the baseline category.

Run a multivariate logistic regression model with VGAM::vglm().

library(VGAM)
## Warning: package 'VGAM' was built under R version 4.0.2

vglm() fits 4 logit models.

gator_vglm <- vglm(
cbind(Bird,Invertebrate,Reptile,Other,Fish) ~ Lake + Size + Gender,
data = gator_dat,
family = multinomial
)
summary(gator_vglm)
##
## Call:
## vglm(formula = cbind(Bird, Invertebrate, Reptile, Other, Fish) ~
##     Lake + Size + Gender, family = multinomial, data = gator_dat)
##
## Pearson residuals:
##                       Min     1Q  Median    3Q  Max
## log(mu[,1]/mu[,5]) -1.199 -0.548 -0.2242 0.368 3.48
## log(mu[,2]/mu[,5]) -1.322 -0.461  0.0105 0.381 1.87
## log(mu[,3]/mu[,5]) -0.703 -0.575 -0.3551 0.261 2.06
## log(mu[,4]/mu[,5]) -1.694 -0.289 -0.1081 1.124 1.37
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept):1    -1.857      0.581   -3.19   0.0014 **
## (Intercept):2    -1.611      0.551   -2.93   0.0034 **
## (Intercept):3    -2.287      0.657   -3.48   0.0005 ***
## (Intercept):4    -0.664      0.380   -1.75   0.0806 .
## Lakegeorge:1     -0.575      0.795   -0.72   0.4694
## Lakegeorge:2      1.781      0.623    2.86   0.0043 **
## Lakegeorge:3     -1.129      1.193   -0.95   0.3437
## Lakegeorge:4     -0.767      0.569   -1.35   0.1776
## Lakeoklawaha:1   -1.126      1.192   -0.94   0.3451
## Lakeoklawaha:2    2.694      0.669    4.02 0.000057 ***
## Lakeoklawaha:3    1.401      0.810    1.73   0.0839 .
## Lakeoklawaha:4   -0.741      0.742   -1.00   0.3184
## Laketrafford:1    0.662      0.846    0.78   0.4341
## Laketrafford:2    2.936      0.687    4.27 0.000019 ***
## Laketrafford:3    1.932      0.825    2.34   0.0193 *
## Laketrafford:4    0.791      0.588    1.35   0.1784
## Size>2.3:1        0.730      0.652    1.12   0.2629
## Size>2.3:2       -1.336      0.411   -3.25   0.0012 **
## Size>2.3:3        0.557      0.647    0.86   0.3890
## Size>2.3:4       -0.291      0.460   -0.63   0.5275
## Genderm:1        -0.606      0.689   -0.88   0.3787
## Genderm:2        -0.463      0.396   -1.17   0.2418
## Genderm:3        -0.628      0.685   -0.92   0.3598
## Genderm:4        -0.253      0.466   -0.54   0.5881
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]),
## log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
##
## Residual deviance: 50 on 40 degrees of freedom
##
## Log-likelihood: -73 on 40 degrees of freedom
##
## Number of Fisher scoring iterations: 5
##
## No Hauck-Donner effect found in any of the estimates
##
##
## Reference group is level  5  of the response

The residual deviance is 50.2637 on 40 degrees of freedom. Residual deviance tests the current model fit versus the saturated model. The saturated model, which fits a separate multinomial distribution to each of the 16 profiles (unique combinations of lake, sex and size), has 16 × 4 = 64 parameters. The current model has an intercept, three lake coefficients, one sex coefficient and one size coefficient for each of the four logit equations, for a total of 24 parameters. Therefore, the overall fit statistics have 64 − 24 = 40 degrees of freedom.

E <- data.frame(fitted(gator_vglm) * rowSums(gator_dat[, 5:9]))
O <- gator_dat %>% select(Bird, Invertebrate, Reptile, Other, Fish) + .000001
(g2 <- 2 * sum(O * log(O / E)))
##  50

indicates the model fits okay, but not great. The Residual Deviance of 50.26 with 40 df from the table above output is reasonable, with p-value of 0.1282 and the statistics/df is close to 1 that is 1.256.