5.2 Multinomial Logistic Regression
The following notes rely on the [PSU STAT 504 course notes](https://online.stat.psu.edu/stat504/node/171/.
Multinomial logistic regression models the odds the multinomial response variable \(Y \sim Mult(n, \pi)\) is in level \(j\) relative to baseline category \(j^*\) for all pairs of categories as a function of \(k\) explanatory variables, \(X = (X_1, X_2, ... X_k)\).
\[\log \left( \frac{\pi_{ij}}{\pi_{ij^*}} \right) = x_i^T \beta_j, \hspace{5mm} j \ne j^2\]
Interpet the \(k^{th}\) element of \(\beta_j\) as the increase in log-odds of falling a response in category \(j\) relative to category \(j^*\) resulting from a one-unit increase in the \(k^{th}\) predictor term, holding the other terms constant.
Multinomial model is a type of GLM.
Here is an example using multinomial logistic regression. A researcher classified the stomach contents of \(n = 219\) alligators according to \(r = 5\) categories (fish, Inv., Rept, Bird, Other) as a function of covariates Lake, Sex, and Size..
gator_dat <- tribble(
~profile, ~Gender, ~Size, ~Lake, ~Fish, ~Invertebrate, ~Reptile, ~Bird, ~Other,
"1", "f", "<2.3", "george", 3, 9, 1, 0, 1,
"2", "m", "<2.3", "george", 13, 10, 0, 2, 2,
"3", "f", ">2.3", "george", 8, 1, 0, 0, 1,
"4", "m", ">2.3", "george", 9, 0, 0, 1, 2,
"5", "f", "<2.3", "hancock", 16, 3, 2, 2, 3,
"6", "m", "<2.3", "hancock", 7, 1, 0, 0, 5,
"7", "f", ">2.3", "hancock", 3, 0, 1, 2, 3,
"8", "m", ">2.3", "hancock", 4, 0, 0, 1, 2,
"9", "f", "<2.3", "oklawaha", 3, 9, 1, 0, 2,
"10", "m", "<2.3", "oklawaha", 2, 2, 0, 0, 1,
"11", "f", ">2.3", "oklawaha", 0, 1, 0, 1, 0,
"12", "m", ">2.3", "oklawaha", 13, 7, 6, 0, 0,
"13", "f", "<2.3", "trafford", 2, 4, 1, 1, 4,
"14", "m", "<2.3", "trafford", 3, 7, 1, 0, 1,
"15", "f", ">2.3", "trafford", 0, 1, 0, 0, 0,
"16", "m", ">2.3", "trafford", 8, 6, 6, 3, 5
)
gator_dat <- gator_dat %>%
mutate(
Gender = as_factor(Gender),
Lake = fct_relevel(Lake, "hancock"),
Size = as_factor(Size)
)
There are 4 equations to estimate:
\[\log \left( \frac{\pi_j} {\pi_{j^*}} \right) = \beta X\]
where \(\pi_{j^*}\) is the probability of fish, the baseline category.
Run a multivariate logistic regression model with VGAM::vglm()
.
## Warning: package 'VGAM' was built under R version 4.0.2
vglm()
fits 4 logit models.
gator_vglm <- vglm(
cbind(Bird,Invertebrate,Reptile,Other,Fish) ~ Lake + Size + Gender,
data = gator_dat,
family = multinomial
)
summary(gator_vglm)
##
## Call:
## vglm(formula = cbind(Bird, Invertebrate, Reptile, Other, Fish) ~
## Lake + Size + Gender, family = multinomial, data = gator_dat)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## log(mu[,1]/mu[,5]) -1.199 -0.548 -0.2242 0.368 3.48
## log(mu[,2]/mu[,5]) -1.322 -0.461 0.0105 0.381 1.87
## log(mu[,3]/mu[,5]) -0.703 -0.575 -0.3551 0.261 2.06
## log(mu[,4]/mu[,5]) -1.694 -0.289 -0.1081 1.124 1.37
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -1.857 0.581 -3.19 0.0014 **
## (Intercept):2 -1.611 0.551 -2.93 0.0034 **
## (Intercept):3 -2.287 0.657 -3.48 0.0005 ***
## (Intercept):4 -0.664 0.380 -1.75 0.0806 .
## Lakegeorge:1 -0.575 0.795 -0.72 0.4694
## Lakegeorge:2 1.781 0.623 2.86 0.0043 **
## Lakegeorge:3 -1.129 1.193 -0.95 0.3437
## Lakegeorge:4 -0.767 0.569 -1.35 0.1776
## Lakeoklawaha:1 -1.126 1.192 -0.94 0.3451
## Lakeoklawaha:2 2.694 0.669 4.02 0.000057 ***
## Lakeoklawaha:3 1.401 0.810 1.73 0.0839 .
## Lakeoklawaha:4 -0.741 0.742 -1.00 0.3184
## Laketrafford:1 0.662 0.846 0.78 0.4341
## Laketrafford:2 2.936 0.687 4.27 0.000019 ***
## Laketrafford:3 1.932 0.825 2.34 0.0193 *
## Laketrafford:4 0.791 0.588 1.35 0.1784
## Size>2.3:1 0.730 0.652 1.12 0.2629
## Size>2.3:2 -1.336 0.411 -3.25 0.0012 **
## Size>2.3:3 0.557 0.647 0.86 0.3890
## Size>2.3:4 -0.291 0.460 -0.63 0.5275
## Genderm:1 -0.606 0.689 -0.88 0.3787
## Genderm:2 -0.463 0.396 -1.17 0.2418
## Genderm:3 -0.628 0.685 -0.92 0.3598
## Genderm:4 -0.253 0.466 -0.54 0.5881
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]),
## log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
##
## Residual deviance: 50 on 40 degrees of freedom
##
## Log-likelihood: -73 on 40 degrees of freedom
##
## Number of Fisher scoring iterations: 5
##
## No Hauck-Donner effect found in any of the estimates
##
##
## Reference group is level 5 of the response
The residual deviance is 50.2637 on 40 degrees of freedom. Residual deviance tests the current model fit versus the saturated model. The saturated model, which fits a separate multinomial distribution to each of the 16 profiles (unique combinations of lake, sex and size), has 16 × 4 = 64 parameters. The current model has an intercept, three lake coefficients, one sex coefficient and one size coefficient for each of the four logit equations, for a total of 24 parameters. Therefore, the overall fit statistics have 64 − 24 = 40 degrees of freedom.
E <- data.frame(fitted(gator_vglm) * rowSums(gator_dat[, 5:9]))
O <- gator_dat %>% select(Bird, Invertebrate, Reptile, Other, Fish) + .000001
(g2 <- 2 * sum(O * log(O / E)))
## [1] 50
indicates the model fits okay, but not great. The Residual Deviance of 50.26 with 40 df from the table above output is reasonable, with p-value of 0.1282 and the statistics/df is close to 1 that is 1.256.