Chapter 2 Generalized Linear Models (GLM)

The linear regression model, \(E(y|X) = X \beta\), structured as \(y_i = X_i \beta + \epsilon_i\) where \(X_i \beta = \mu_i\), assumes the response is a linear function of the predictors and the residuals are independent random variables normally distributed with zero mean and constant variance, \(\epsilon \sim N \left(0, \sigma^2 \right)\). This implies that given a set of predictors, the response is normally distributed about its expected value, \(y_i \sim N \left(\mu_i, \sigma^2 \right)\). However, there are many situations where this normality assumption does not hold. Generalized linear models (GLMs) are a generalization of the linear regression model that addresses non-normal response distributions.1

The response will not have a normal distribution if the underlying data-generating process is binomial (Section 2.1) or multinomial (Section 2.2), ordinal (Section 2.3), Poisson (counts, Section 2.4), or exponential (time-to-event). In these situations the linear regression model can predict probabilities and proportions outside [0, 1], or negative counts or times. GLMs solve this problem by modeling a function of the expected value of \(y\), \(f(E(y|X)) = X \beta\). There are three components to a GLM: a random component specified by the response variable’s probability distribution (normal, binomial, etc.); a systematic component in the form \(X\beta\); and a link function, \(\eta\), that specifies the link between the random and systematic components and converts the response to a range of \([-\infty, +\infty]\).

Linear regression is a special case of GLM where link function is the identity function, \(f(E(y|X)) = E(y|X)\). For binary logistic regression, the link function is the logit,

\[f(E(y|X)) = \ln \left( \frac{E(y|X)}{1 - E(y|X)} \right) = \log \left( \frac{\pi}{1 - \pi} \right) = \mathrm{logit}(\pi)\]

where \(\pi\) is the response probability.2

For Poisson regression, the link function is

\[f(E(y|X)) = \ln (E(y|X)) = \ln(\lambda)\]

where \(\lambda\) is the expected event rate.

For exponential regression, the link function is

\[f(E(y|X) = -E(y|X) = -\lambda\]

where \(\lambda\) is the expected time to event.

GLM uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.

In R, specify a GLM just like an linear model, but with the glm() function, specifying the distribution with the family parameter.

  • family = "gaussian": linear regression
  • family = "binomial": logistic regression
  • family = "poisson": Poisson regression

Mulinomial logistic is a little different because it models a set of logits. Use nnet::multinom().

2.1 Binomial Logistic Regression

Logistic regression estimates the probability of a particular level of a categorical response variable given a set of predictors. The response levels can be binary, nominal, or ordinal.

The binary logistic regression model is

\[y_i = \mathrm{logit}(\pi_i) = \log \left( \frac{\pi_i}{1 - \pi_i} \right) = X_i \beta\]

where \(\pi_i\) is the response \(i\)’s binary level probability. The model predicts the log odds of the level. Why do this? The range of outcomes need to be between 0 and 1. A sigmoid function, \(f(y) = 1 / \left(1 + e^y \right)\), does that. So if the log odds of the level equals \(X_i\beta\), then the odds of the level equals \(e^{X\beta}\). You can solve for \(\pi_i\) to get \(\pi = \mathrm{odds} / (\mathrm{odds} + 1)\). Substituting,

\[\pi_i = \frac{\exp(y_i)}{1 + \exp(y_i)} = \frac{e^{X_i\beta}}{1 + e^{X_i\beta}}\]

which you can simplify to \(\pi_i = 1 / (1 + e^{-X_i\beta})\), a sigmoid function. The upshot is \(X\beta\) is the functional relationship between the independent variables and a function of the response, not the response itself.

The model parameters are estimated either by iteratively reweighted least squares optimization or by maximum likelihood estimation (MLE). Here is how MLE works3. MLE maximizes the probability produced by a set of parameters \(\beta\) given a data set and probability distribution. In logistic regression the probability distribution is the binomial and each observation is the outcome of a single Bernoulli trial.

\[L(\beta; y, X) = \prod_{i=1}^n \pi_i^{y_i}(1 - \pi_i)^{(1-y_i)} = \prod_{i=1}^n\frac{\exp(y_i X_i \beta)}{1 + \exp(X_i \beta)}.\]

In practice, multiplying many small probabilities can be unstable, so MLE optimizes the log likelihood instead.

\[\begin{align} l(\beta; y, X) &= \sum_{i = 1}^n \left(y_i \log(\pi_i) + (1 - y_i) \log(1 - \pi_i) \right) \\ &= \sum_{i = 1}^n \left(y_i X_i \beta - \log(1 + e^{X_i\beta}) \right) \end{align}\]

Sometimes you will see the cost function optimized. This is the negative of of the log likelihood function.

2.1.1 Case Study

Dataset donner contains n = 45 observations of the survival status (surv) of the members of the Donner party with explanatory variables age and sex.

glimpse(donner)
## Rows: 45
## Columns: 3
## $ age  <dbl> 23, 40, 40, 30, 28, 40, 45, 62, 65, 45, 25, 28, 28, 23, 22, 23, 2…
## $ sex  <fct> M, F, M, M, M, M, F, M, M, F, F, M, M, M, F, F, M, F, F, M, F, M,…
## $ surv <fct> Died, Lived, Lived, Died, Died, Died, Died, Died, Died, Died, Die…

2.1.2 Fit the Model

Fit a logistic regression \(\mathrm{surv} = \mathrm{sex} + \mathrm{age} + \mathrm{sex:age}\). The reference case based on the factor definition is sex = “F”, and age = 28 (the median age).

donner_centered <- donner %>%
  mutate(age = age - median(donner$age))
m <- glm(surv ~ sex*age, data = donner_centered, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = surv ~ sex * age, family = "binomial", data = donner_centered)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2279  -0.9388  -0.5550   0.7794   1.6998  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.81231    1.04269   1.738   0.0822 .
## sexM        -2.40334    1.11677  -2.152   0.0314 *
## age         -0.19407    0.08742  -2.220   0.0264 *
## sexM:age     0.16160    0.09426   1.714   0.0865 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 61.827  on 44  degrees of freedom
## Residual deviance: 47.346  on 41  degrees of freedom
## AIC: 55.346
## 
## Number of Fisher Scoring iterations: 5

The “z value” is the Wald \(z\) statistic, \(z = \hat{\beta} / SE(\hat{\beta})\). Its square, \(z^2\), is the Wald chi-squared statistic. The p-value is the area to the right of \(z^2\) in the \(\chi_1^2\) density curve:

m %>% tidy() %>% 
  mutate(
    z = estimate / std.error, 
    p_z2 = pchisq(z^2, df = 1, lower.tail = FALSE)
  ) %>%
  select(term, z, p_z2)
## # A tibble: 4 × 3
##   term            z   p_z2
##   <chr>       <dbl>  <dbl>
## 1 (Intercept)  1.74 0.0822
## 2 sexM        -2.15 0.0314
## 3 age         -2.22 0.0264
## 4 sexM:age     1.71 0.0865

Below the Coefficients table, the “dispersion parameter” refers to overdispersion, a common issue with GLM. For a logistic regression, the response variable should be distributed \(y_i \sim \mathrm{Bin}(n_i, \pi_i)\) with \(\mu_i = n_i \pi_i\) and \(\sigma^2 = \pi (1 - \pi)\). Overdispersion means the data shows evidence of variance greater than \(\sigma^2\).

“Fisher scoring” is a method for ML estimation. Logistic regression uses interatively reweighted least squares to fit the model, so this section indicates whether the algorithm converged.

The null deviance is the likelihood ratio \(G^2\) = 61.827 of the intercept-only model. The residual deviance is the likelihood ratio \(G^2\) = 47.346 after including the model covariates. \(G^2\) is large, so reject the null hypothesis of no age or sex effects. An ANOVA table shows the change in deviance from successively adding each variable to the model.

anova(m)
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: surv
## 
## Terms added sequentially (first to last)
## 
## 
##         Df Deviance Resid. Df Resid. Dev
## NULL                       44     61.827
## sex      1   4.5403        43     57.286
## age      1   6.0300        42     51.256
## sex:age  1   3.9099        41     47.346

2.1.3 Interpret Results

Plug in values to interpret the model. The log odds of a 24 year-old female surviving is \(\hat{y}\) = 2.59. The log odds of a 24 year-old male surviving is \(\hat{y}\) = -0.46.

new_dat <- expand.grid(sex = factor(c("F", "M")), age = 24 - median(donner$age))
(preds <- predict(m, newdata = new_dat))
##          1          2 
##  2.5886056 -0.4611192

Exponentiate to get the odds.

\[\mathrm{odds}(\hat{y}) = \exp (\hat{y}) = \frac{\pi}{1 - \pi}.\]

exp(preds)
##          1          2 
## 13.3111976  0.6305775

Solve for \(\pi\) to get the probability.

\[\pi = \frac{\exp (\hat{y})}{1 + \exp (\hat{y})}\]

You can do the math, or use predict(type = "response").

(preds <- predict(m, newdata = new_dat, type = "response"))
##         1         2 
## 0.9301246 0.3867204

Interpret the coefficient estimates as the change in the log odds of \(y\) due to a \(\delta = x_1 - x_0\) unit change in \(x\).

\[\log \left(\pi / (1 - \pi) |_{X = X_1} \right) - \log \left(\pi / (1 - \pi) |_{X = X_0} \right) = \log \left( \frac{\pi / (1 - \pi) |_{X = X_1}}{\pi / (1 - \pi) |_{X = X_0}} \right) = \delta \hat{\beta}\]

Better yet, interpret the exponential of the coefficient estimates as the change in the odds of \(y\) due to a \(\delta\) unit change in \(x\).

\[\mathrm{odds}(y) = e^{\delta \hat{\beta}}\]

The reference case is sex = “F” and age = 0 (centered on median age, 28).

  • The survival odds of a 28 year old female are \(\hat{y} = e^{(Intercept)}\) = 6.1 to 1. The expected probability of surviving is 6.1 / (1 + 6.1) = 86%.
  • The survival odds of a 24 year old female are \(\hat{y} = e^{(Intercept) -4 \cdot \mathrm{age}}\) = 13.3 to 1. The expected probability of surviving is 13.3 / (1 + 13.3) = 93%.
  • The survival odds of a 28 year old male are \(\hat{y} = e^{(Intercept) + sexM}\) = 0.6. The expected probability of surviving is 0.6 / (1 + 0.6) = 36%.
  • The survival odds of a 24 year old male are \(\hat{y} = e^{(Intercept) -4 \cdot \mathrm{age} -4 \cdot \mathrm{sexM} \cdot \mathrm{sexM:age}}\) = 0.6. The expected probability of surviving is 0.6 / (1 + 0.6) = 39%.
  • A female’s survival odds are multiplied by a factor of \(e^{(1 \cdot \mathrm{age})}\) = 0.824 per additional year of age (i.e. the odds fall by 17.6%).
  • A male’s survival odds are multiplied by a factor of \(e^{1 \cdot \mathrm{age}} e^{1 \cdot \mathrm{sexM:age}}\) = 0.968 per additional year of age (i.e. the odds fall by 3.2%).
  • A 28 year old female’s survival odds are multiplied by a factor of \(e^{1 \cdot \mathrm{sex}}\) = 0.090 if she were instead a male. (i.e. the odds fall by 91.0%)

oddsratio::or_glm() calculates the odds ratio from an increment in the predictor values.

oddsratio::or_glm(donner_centered, m, incr = list(age = 1))
##   predictor oddsratio ci_low (2.5) ci_high (97.5)          increment
## 1      sexM     0.090        0.005          0.561 Indicator variable
## 2       age     0.824        0.652          0.945                  1
## 3  sexM:age     1.175        1.001          1.496 Indicator variable

Expressing the predicted values as probabilities produces the familiar signmoidal shape of the binary relationship.

augment(m, type.predict = "response") %>%
  ggplot(aes(x = age, color = sex)) +
  geom_point(aes(y = as.integer(surv))) +
  geom_line(aes(y = .fitted+1)) +
  labs(x = "Age (relative to median)",
       y = NULL,
       title = "Binary Fitted Line Plot") +
  scale_y_continuous(breaks = c(1,2), labels = c("Died", "Lived")) +
  theme_light() +
  theme(legend.position = "right")

2.1.4 Assess Model Fit

Evaluate a logistic regression using a Gain curve or ROC curve.

Gain Curve

In the gain curve, the x-axis is the fraction of items seen when sorted by the predicted value, and the y-axis is the cumulatively summed true outcome. The “wizard” curve is the gain curve when the data is sorted by the true outcome. If the model’s gain curve is close to the wizard curve, then the model predicts the response well. The gray area is the “gain” over a random prediction.

20 of the 45 members of the Donner party survived.

  • The gain curve encountered 10 survivors (50%) within the first 12 observations (27%). It encountered all 20 survivors on the 37th observation.
  • The bottom of the grey area is the outcome of a random model. Only half the survivors would be observed within 50% of the observations.
  • The top of the grey area is the outcome of the perfect model, the “wizard curve”. Half the survivors would be observed in 10/45=22% of the observations.
augment(m, type.predict = "response") %>%
  # event_level = "second" sets the second level as success
  yardstick::gain_curve(surv, .fitted, event_level = "second") %>%
  autoplot() +
  labs(title = "Gain Curve")

ROC Curve

The ROC (Receiver Operating Characteristics) curve plots sensitivity vs specificity at different cut-off values for the probability, ranging cut-off from 0 to 1.

augment(m, type.predict = "response") %>%
  # event_level = "second" sets the second level as success
  yardstick::roc_curve(surv, .fitted, event_level = "second") %>%
  autoplot() +
  labs(title = "ROC Curve")

2.2 Multinomial Logistic Regression

The multinomial logistic regression model is

\[y_i = \log \left( \frac{\pi_{ij}}{\pi_{ij^*}} \right) = X_i \beta_j, \hspace{5mm} j \ne j^*\]

where \(j\) is a level of the multinomial response variable. Whereas the binomial logistic regression equation models the log odds of the response level, the multinomial logistic regression models the log relative risk, the probability relative to the baseline \(j^*\) level (as opposed to \(\pi_{ij} / (1 - \pi_{ij}\)).4

As with binary logistic regression, interpret the \(k^{th}\) element of \(\beta_j\) as the increase in log odds (relative risk) of \(Y_i = j\) relative to \(Y_i = j^*\) given a one-unit increase in the \(k^{th}\) element of \(X\), holding the other terms constant.

Assumptions

Multinomial logistic regression applies when the dependent variable is categorical. It presumes a linear relationship between the log odds of the dependent variable and \(X\) with residuals \(\epsilon\) that are independent. It also assumes there is no severe multicollinearity in the predictors, and there is independence of irrelevant alternatives (IIA). IIA means the relative likelihood of being in one category compared to the base category would not change if you added other categories.

2.2.1 Case Study

This case study is based on notes from Sara Worth and dat from UCLA.

download.file(
  "https://stats.idre.ucla.edu/stat/data/hsbdemo.dta",
  "./input/hsbdemo.dta",
  mode = "wb"
)
cs2 <- list()
cs2$dat <- read.dta("./input/hsbdemo.dta")

A study measures the association between a student’s academic program (general, academic, and vocational) and their socioeconomic status (SES) (low, middle, high) and writing score.

cs2$dat %>%
  tbl_summary(
    by = prog,
    include = c(prog, ses, write),
    statistic = list(all_continuous() ~ "{mean}, {sd}")
  ) %>%
  gtsummary::add_overall()
Characteristic Overall, N = 2001 general, N = 451 academic, N = 1051 vocation, N = 501
ses
    low 47 (24%) 16 (36%) 19 (18%) 12 (24%)
    middle 95 (48%) 20 (44%) 44 (42%) 31 (62%)
    high 58 (29%) 9 (20%) 42 (40%) 7 (14%)
write 53, 9 51, 9 56, 8 47, 9
1 n (%); Mean, SD


cs2$dat %>%
  mutate(write_bin = cut(write, breaks = 5, dig.lab = 1, right = FALSE)) %>%
  count(prog, ses, write_bin) %>%
  group_by(ses, write_bin) %>%
  mutate(prob = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(x = write_bin, y = prob, color = ses)) +
  geom_point() +
  geom_line(aes(group = ses)) +
  facet_wrap(facets = vars(prog)) +
  theme(legend.position = "top") +
  labs(title = "Program Probability for Students by SES")

Set academic as the reference level of the dependent variable, then fit the multinomial logistic regression with nnet::multinom().

cs2$dat <- cs2$dat %>% mutate(prog = fct_relevel(prog, "academic")) 

cs2$fit <- nnet::multinom(prog ~ ses + write, data = cs2$dat)
## # weights:  15 (8 variable)
## initial  value 219.722458 
## iter  10 value 179.982880
## final  value 179.981726 
## converged
summary(cs2$fit)
## Call:
## nnet::multinom(formula = prog ~ ses + write, data = cs2$dat)
## 
## Coefficients:
##          (Intercept)  sesmiddle    seshigh      write
## general     2.852198 -0.5332810 -1.1628226 -0.0579287
## vocation    5.218260  0.2913859 -0.9826649 -0.1136037
## 
## Std. Errors:
##          (Intercept) sesmiddle   seshigh      write
## general     1.166441 0.4437323 0.5142196 0.02141097
## vocation    1.163552 0.4763739 0.5955665 0.02221996
## 
## Residual Deviance: 359.9635 
## AIC: 375.9635
tbl_regression(cs2$fit, exponentiate = TRUE, conf.int = FALSE)
## ℹ Multinomial models have a different underlying structure than the models
## gtsummary was designed for. Other gtsummary functions designed to work with
## tbl_regression objects may yield unexpected results.
Characteristic OR1 p-value
general
ses
    low
    middle 0.59 0.2
    high 0.31 0.024
write 0.94 0.007
vocation
ses
    low
    middle 1.34 0.5
    high 0.37 0.10
write 0.89 <0.001
1 OR = Odds Ratio

A one unit increase in the writing score is associated with a 0.058 decrease in the log odds of being in a general vs academic program. The log odds of being in a general vs academic program are 1.163 lower for high SES than low SES.

The exponential of the coefficients is the relative risk (ratio of probabilities) relative to the baseline.

exp(coef(cs2$fit))
##          (Intercept) sesmiddle   seshigh     write
## general     17.32582 0.5866769 0.3126026 0.9437172
## vocation   184.61262 1.3382809 0.3743123 0.8926116

The relative risk ratio of being in a general vs academic program is 0.944 for students with a one unit higher writing score. The relative risk ratio of being in a general vs academic program are 0.313 for high SES relative low SES.

Define a range of interesting values for the continuous predictor and plot the associated predicted values.

summary(cs2$dat$write)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   31.00   45.75   54.00   52.77   60.00   67.00
newdat <- expand.grid(
  ses = cs2$dat$ses,
  write = seq(from = round(min(cs2$dat$write)), to = round(max(cs2$dat$write)), by = 1)
)

bind_cols(
  newdat,
  predict(cs2$fit, newdata = newdat, type = "probs")
) %>%
  pivot_longer(cols = -c(ses, write), names_to = "prog", values_to = "probability") %>%
  ggplot(aes(x = write, y = probability, color = ses)) +
  geom_line() + 
  facet_wrap(facets = vars(prog))

The summary() object does not include the p-values. The tidy() object does.

tidy(cs2$fit)
## # A tibble: 8 × 6
##   y.level  term        estimate std.error statistic     p.value
##   <chr>    <chr>          <dbl>     <dbl>     <dbl>       <dbl>
## 1 general  (Intercept)   2.85      1.17       2.45  0.0145     
## 2 general  sesmiddle    -0.533     0.444     -1.20  0.229      
## 3 general  seshigh      -1.16      0.514     -2.26  0.0237     
## 4 general  write        -0.0579    0.0214    -2.71  0.00682    
## 5 vocation (Intercept)   5.22      1.16       4.48  0.00000730 
## 6 vocation sesmiddle     0.291     0.476      0.612 0.541      
## 7 vocation seshigh      -0.983     0.596     -1.65  0.0989     
## 8 vocation write        -0.114     0.0222    -5.11  0.000000318

Here is an example using multinomial logistic regression. A researcher classified the stomach contents of \(n = 219\) alligators according to \(r = 5\) categories (fish, Inv., Rept, Bird, Other) as a function of covariates Lake, Sex, and Size..

gator_dat <- tribble(
  ~profile, ~Gender, ~Size, ~Lake, ~Fish, ~Invertebrate, ~Reptile, ~Bird, ~Other,
  "1", "f", "<2.3", "george",  3, 9, 1, 0, 1,
  "2", "m", "<2.3", "george", 13, 10, 0, 2, 2,
  "3", "f", ">2.3", "george", 8, 1, 0, 0, 1,
  "4", "m", ">2.3", "george", 9, 0, 0, 1, 2,
  "5", "f", "<2.3", "hancock", 16, 3, 2, 2, 3,
  "6", "m", "<2.3", "hancock", 7, 1, 0, 0, 5,
  "7", "f", ">2.3", "hancock", 3, 0, 1, 2, 3,
  "8", "m", ">2.3", "hancock", 4, 0, 0, 1, 2,
  "9", "f", "<2.3", "oklawaha", 3, 9, 1, 0, 2,
  "10", "m", "<2.3", "oklawaha", 2, 2, 0, 0, 1,
  "11", "f", ">2.3", "oklawaha", 0, 1, 0, 1, 0,
  "12", "m", ">2.3", "oklawaha", 13, 7, 6, 0, 0,
  "13", "f", "<2.3", "trafford", 2, 4, 1, 1, 4,
  "14", "m", "<2.3", "trafford", 3, 7, 1, 0, 1,
  "15", "f", ">2.3", "trafford", 0, 1, 0, 0, 0,
  "16", "m", ">2.3", "trafford", 8, 6, 6, 3, 5
)
gator_dat <- gator_dat %>%
  mutate(
    Gender = as_factor(Gender),
    Lake = fct_relevel(Lake, "hancock"),
    Size = as_factor(Size)
  )

There are 4 equations to estimate:

\[\log \left( \frac{\pi_j} {\pi_{j^*}} \right) = \beta X\]

where \(\pi_{j^*}\) is the probability of fish, the baseline category.

Run a multivariate logistic regression model with VGAM::vglm().

library(VGAM)

vglm() fits 4 logit models.

gator_vglm <- vglm(
  cbind(Bird,Invertebrate,Reptile,Other,Fish) ~ Lake + Size + Gender, 
  data = gator_dat, 
  family = multinomial
)
summary(gator_vglm)
## 
## Call:
## vglm(formula = cbind(Bird, Invertebrate, Reptile, Other, Fish) ~ 
##     Lake + Size + Gender, family = multinomial, data = gator_dat)
## 
## Coefficients: 
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1   -1.8568     0.5813  -3.195 0.001400 ** 
## (Intercept):2   -1.6115     0.5508  -2.926 0.003435 ** 
## (Intercept):3   -2.2866     0.6566  -3.483 0.000497 ***
## (Intercept):4   -0.6642     0.3802  -1.747 0.080639 .  
## Lakegeorge:1    -0.5753     0.7952  -0.723 0.469429    
## Lakegeorge:2     1.7805     0.6232   2.857 0.004277 ** 
## Lakegeorge:3    -1.1295     1.1928  -0.947 0.343687    
## Lakegeorge:4    -0.7666     0.5686  -1.348 0.177563    
## Lakeoklawaha:1  -1.1256     1.1923  -0.944 0.345132    
## Lakeoklawaha:2   2.6937     0.6693   4.025 5.70e-05 ***
## Lakeoklawaha:3   1.4008     0.8105   1.728 0.083926 .  
## Lakeoklawaha:4  -0.7405     0.7421  -0.998 0.318372    
## Laketrafford:1   0.6617     0.8461   0.782 0.434145    
## Laketrafford:2   2.9363     0.6874   4.272 1.94e-05 ***
## Laketrafford:3   1.9316     0.8253   2.340 0.019263 *  
## Laketrafford:4   0.7912     0.5879   1.346 0.178400    
## Size>2.3:1       0.7302     0.6523   1.120 0.262918    
## Size>2.3:2      -1.3363     0.4112  -3.250 0.001155 ** 
## Size>2.3:3       0.5570     0.6466   0.861 0.388977    
## Size>2.3:4      -0.2906     0.4599  -0.632 0.527515    
## Genderm:1       -0.6064     0.6888  -0.880 0.378666    
## Genderm:2       -0.4630     0.3955  -1.171 0.241796    
## Genderm:3       -0.6276     0.6853  -0.916 0.359785    
## Genderm:4       -0.2526     0.4663  -0.542 0.588100    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]), 
## log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
## 
## Residual deviance: 50.2637 on 40 degrees of freedom
## 
## Log-likelihood: -73.3221 on 40 degrees of freedom
## 
## Number of Fisher scoring iterations: 5 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## 
## Reference group is level  5  of the response

The residual deviance is 50.2637 on 40 degrees of freedom. Residual deviance tests the current model fit versus the saturated model. The saturated model, which fits a separate multinomial distribution to each of the 16 profiles (unique combinations of lake, sex and size), has 16 × 4 = 64 parameters. The current model has an intercept, three lake coefficients, one sex coefficient and one size coefficient for each of the four logit equations, for a total of 24 parameters. Therefore, the overall fit statistics have 64 − 24 = 40 degrees of freedom.

# E <- data.frame(fitted(gator_vglm) * rowSums(gator_dat[, 5:9]))
# O <- gator_dat %>% select(Bird, Invertebrate, Reptile, Other, Fish) + .000001
# (g2 <- 2 * sum(O * log(O / E)))

indicates the model fits okay, but not great. The Residual Deviance of 50.26 with 40 df from the table above output is reasonable, with p-value of 0.1282 and the statistics/df is close to 1 that is 1.256.

2.3 Ordinal Logistic Regression

Ordinal logistic regression, also call cumulative link model (CLM), is a generalized linear model (GZLM), an extension of the general linear model (GLM) to non-continuous outcome variables. There are many approaches to ordinal logistic regression, including cumulative, adjacent, and continuation categories, but the most popular is the cumulative odds ordinal logistic regression with proportional odds. These notes rely on UVA, PSU, Laerd, and the CLM package article vignette.

The model for ordinal response random variable \(Y_i\) with \(J\) levels is

\[\gamma_{ij} = F(\eta_{ij}), \hspace{5 mm} \eta_{ij} = \theta_j - x_i^\mathrm{T}\beta, \hspace{5 mm} i = 1, \ldots, n, \hspace{5 mm} j = 1, \ldots, J-1\]

where \(\gamma_{ij} = P(Y_i \le j) = \pi_{i1} + \cdots + \pi_{ij}\). \(\eta_{ij}\) is a linear predictor with \(J-1\) intercepts. \(F\) is the inverse link function. The regression models the logit link function of \(\gamma_{ij}\).

\[\mathrm{logit}(\gamma_{ij}) = \log \left[\frac{P(Y_i \le j)}{P(Y_i \gt j)} \right] = \theta_j - x_i^\mathrm{T}\beta\] The cumulative logit is the log-odds of the cumulative probabilities that the response is in category \(\le j\) versus \(\gt j\). \(\theta_j\) is the log-odds when \(x_i^\mathrm{T}=0\) and \(\beta\) is the increase in the log odds attributed to a one unit increase in \(x_i^\mathrm{T}\). Notice \(\beta\) is the same for all \(j\).

Once you fit the model, you will either generate predicted values or evaluate the coefficient estimators. The predicted value is a log-odds by default (useless), so you will at least take exponential to get the odds. From there you can solve for the probability,

\[P(Y_i \gt j) = \frac{\mathrm{exp}(\hat{y}_i)}{1 + \mathrm{exp}(\hat{y}_i)}.\]

The exponential of \(\beta\) is the odds ratio of \(x_1^\mathrm{T} - x_0^\mathrm{T}\). You can solve for the odds ratio

\[\mathrm{OR} = \frac{\mathrm{exp}(\theta_j - x_1^\mathrm{T}\beta)}{\mathrm{exp}(\theta_j - x_2^\mathrm{T}\beta)} = \mathrm{exp}(\beta(x_1^\mathrm{T} - x_0^\mathrm{T}))\]

If \(x\) is a binary factor factor, then \(exp(\beta)\) is the odds ratio of \(x=1\) vs \(x=0\). Thus the odds-ratio is proportional to the difference between values of \(x\) and \(\beta\) is the constant of proportionality.

The model is estimated with a regularized Newton-Raphson algorithm with step-halving (line search) using analytic expressions for the gradient and Hessian of the negative log-likelihood function. What this means is beyond me right now, but the upshot is that the estimation is an iterative maximization exercise, not a formulaic matrix algebra process. It is possible for the model estimation to fail to converge on a maximum.

You will sometimes encounter discussion about the latent variable. That is just the underlying quality you are trying to measure. If you rate something a 4 on a 5-level likert scale, 4 is the expression of your valuation, the latent variable. Your precise valuation is somewhere between 3 and 5 on a continuous scale. The link function defines the distribution of the latent variable.

There are variations on the ordinal model. Structured thresholds impose restrictions on \(\theta_j\), for example requiring equal distances between levels. Partial proportional odds allow \(\theta_j\) to vary with nominal predictors. You can also use link functions other than logit.

There are two conditions to ordinal logistic regression: (a) no multicollinearity, and (b) proportional odds.

Case Study

192 participants in a study respond to the statement “Taxes are too high” on a 4-level likert scale (tax_too_high, Strongly Disagree, Disagree, Agree, Strongly Agree). Participant attributes include business owner (biz_owner, Y|N), age (age), and political affiliation (politics, Liberal, Conservative, Labor).

tax$gt <- tbl_summary(
  tax$dat %>% select(-age), 
  by = politics, 
  statistic = list(all_continuous() ~ "{mean} ({sd})")
)
tax$gt
Characteristic Lab, N = 621 Lib, N = 541 Con, N = 761
biz_owner 36 (58%) 26 (48%) 47 (62%)
income 36 (8) 35 (6) 37 (7)
tax_too_high
    Strongly Disagree 10 (16%) 10 (19%) 4 (5.3%)
    Disagree 13 (21%) 16 (30%) 9 (12%)
    Agree 28 (45%) 21 (39%) 42 (55%)
    Strongly Agree 11 (18%) 7 (13%) 21 (28%)
1 n (%); Mean (SD)
tax$dat %>%
  mutate(income = case_when(income < quantile(income, .25) ~ "low income",
                            income < quantile(income, .75) ~ "med income",
                            TRUE ~ "high income"),
         income = factor(income, levels = c("low income", "med income", "high income"))) %>%
  count(tax_too_high, biz_owner, politics, income) %>%
  ggplot(aes(x = tax_too_high, y = n, fill = biz_owner)) +
  geom_col(position = position_dodge2(preserve = "single")) +
  facet_grid(rows = vars(income), cols = vars(politics), space = "free") +
  scale_x_discrete(labels = function (x) str_wrap(x, width = 10)) +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(title = "Taxes too high?",
       subtitle = "Reponse count by business owner, income level, and party.")

Fit a cumulative link model for the cumulative probability of the \(i\)th response falling in \(j\)th category or below where \(i\) indexes the (\(n = 192\)) responses, \(j = 1, \ldots, J\) indexes the (\(J = 4\)) response categories, and \(\theta_j\) is the threshold for the \(j\)th cumulative logit.

\[\mathrm{logit}(P(Y_i \le j)) = \theta_j - \beta_1(\mathrm{politics}_i) - \beta_2(\mathrm{biz\_owner}_i) - \beta_3(\mathrm{age}_i)\]

Fit the Model

tax$fmla <- formula(tax_too_high ~ biz_owner + age + politics)
tax$clm <- clm(tax$fmla, data = tax$dat)
summary(tax$clm)
## formula: tax_too_high ~ biz_owner + age + politics
## data:    tax$dat
## 
##  link  threshold nobs logLik  AIC    niter max.grad cond.H 
##  logit flexible  192  -197.62 409.23 6(0)  3.14e-12 3.2e+05
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## biz_ownerYes  0.66462    0.28894   2.300 0.021435 *  
## age           0.24189    0.03260   7.421 1.17e-13 ***
## politicsLib   0.03695    0.36366   0.102 0.919072    
## politicsCon   1.16142    0.34554   3.361 0.000776 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##                            Estimate Std. Error z value
## Strongly Disagree|Disagree    7.026      1.166   6.024
## Disagree|Agree                8.766      1.231   7.119
## Agree|Strongly Agree         11.653      1.357   8.590

The summary table shows two fit statistics at the top: the log-likelihood and the AIC. The log-likelihood is the sum of the likelihoods for each observation that the predicted value correctly predicts the observed value. Its value ranges from \(-\infty\) to \(+\infty\). It’s value increases with observations, additional variables, and fit quality. I think you just use it to compare alternative model formulations. Akaike Information Criterion (AIC) is a measure of the relative quality of a statistical model. Again, I think the value is only useful as a comparison benchmark between alternative model fits. You’d want the model with the lowest AIC.

The Coefficients table is the familiar parameter estimates. E.g., for biz_ownerYes, the coefficient estimate is 0.665 with standard error 0.289. The z-value is the ratio \(z = \hat{\beta} / se =\) 2.300 with p-value equal to \(2 \cdot P(Z>z) =\) 0.021. Some programs (e.g., SPSS) also show the Wald chi-squared statistic, the square of the z statistic, \(z^2 =\), 5.291. The square of a normal variable has a chi-square distribution, so the p value for the Wald chi-squared statistic is the pchisq(z^2, df = 1) \(=\) 0.021.

The Threshold coefficients table are the intercepts, or cut-points. The first cut-point is log-odds of a response level Strongly Disagree (or less) vs greater than Strongly Disagree when all factor variables are at their reference level and the continuous vars are at 0.

There may be interaction effects between biz_owner and politics. Fit a saturate model, then compare their log likelihoods with a likelihood ratio test.

tax$fmla_sat <- formula(tax_too_high ~ biz_owner*politics + age)
tax$clm_sat <- clm(tax$fmla_sat, data = tax$dat)
tax$sat_anova <- anova(tax$clm, tax$clm_sat)
tax$sat_anova
## Likelihood ratio tests of cumulative link models:
##  
##             formula:     link: threshold:
## tax$clm     tax$fmla     logit flexible  
## tax$clm_sat tax$fmla_sat logit flexible  
## 
##             no.par    AIC  logLik LR.stat df Pr(>Chisq)
## tax$clm          7 409.23 -197.62                      
## tax$clm_sat      9 411.75 -196.87  1.4805  2      0.477

The likelihood ratio test indicates the main-effects model fits about the same in comparison to the saturated model (\(\chi^2(\) 2 $) = $ LR = 1.48, p = 0.477)

Verify Assumptions

Cumulative odds ordinal logistic regression with proportional odds models require a) no multicollinearity, and b) proportional odds.

Multicollinearity

Multicollinearity occurs when two or more independent variables are highly correlated so that they do not provide unique or independent information in the regression model. Multicollinearity inflates the variances of the estimated coefficients, resulting in larger confidence intervals. The usual interpretation of a slope coefficient as the change in the mean response per unit increase in the predictor when all the other predictors are held constant breaks down because changing one predictor necessarily changes other predictors.

Test for multicollinearity with variance inflation factors (VIF). The VIF is the inflation percentage of the parameter variance due to multicollinearity. E.g., a VIF of 1.9 means the parameter variance is 90% larger than what it would be if it was not correlated with other predictors.

Predictor \(K\)’s variance, \(Var(\hat{\beta_k})\), is inflated by a factor of

\[VIF_k = \frac{1}{1 - R_k^2}\]

due to collinearity with other predictors, where \(R_k^2\) is the \(R^2\) of a regression of the \(k^{th}\) predictor on the remaining predictors. If there is zero relationship between predictor \(k\) and the other variables, \(R_k^2 = 0\) and \(VIF = 1\) (no variance inflation). If 100% of the variance in predictor \(k\) is explained by the other predictors, then \(R_k^2 = 1\) and \(VIF = \infty\). A good rule of thumb is that \(VIF \le 5\) is acceptable.

# Cannot use CLM model with vif(). Re-express as a linear model.
tax$vif <- lm(as.numeric(tax_too_high) ~ politics + biz_owner + age, dat = tax$dat) %>%
  car::vif()
tax$vif
##               GVIF Df GVIF^(1/(2*Df))
## politics  1.035831  2        1.008840
## biz_owner 1.023642  1        1.011752
## age       1.036491  1        1.018082

The VIFs in column GVIF are all below 5, so this model is not compromised by multicollinearity.

Proportional Odds

The assumption of proportional odds means the independent variable effects are constant across each cumulative split of the ordinal dependent variable. Test for proportional odds using a full likelihood ratio test comparing the proportional odds model with a multinomial logit model, also called an unconstrained baseline logit model. This is also called the test of parallel lines. The multinomial logit model fits a slope to each of the \(J – 1\) levels. The proportional odds model is nested within the multinomial model, so you can use a likelihood ratio test to see if the models are statistically different. Fit the proportional odds model and a multinomial model using VGAM::vglm() and capture the log likelihoods and degrees of freedom. Perform a likelihood ratio test on the differences in log likelihoods, \(D = -2 \mathrm{loglik}(\beta)\).

tax$vglm_ordinal     <- vglm(tax$fmla, propodds,   data = tax$dat)
tax$vglm_multinomial <- vglm(tax$fmla, cumulative, data = tax$dat)
tax$po_lrt <- lrtest(tax$vglm_multinomial, tax$vglm_ordinal)
tax$po_lrt
## Likelihood ratio test
## 
## Model 1: tax_too_high ~ biz_owner + age + politics
## Model 2: tax_too_high ~ biz_owner + age + politics
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1 561 -193.31                     
## 2 569 -197.62  8 8.6197     0.3754

The assumption of proportional odds was met, as assessed by a full likelihood ratio test comparing the fit of the proportional odds model to a model with varying location parameters, \(\chi^2\)(8) = 8.620, p = 0.375.

Another option is the partial proportional odds test. This test locates specific variables causing the rejection of proportional odds.

tax$po_lrt2 <- clm(tax$fmla, data = tax$dat) %>%
  nominal_test()
tax$po_lrt2
## Tests of nominal effects
## 
## formula: tax_too_high ~ biz_owner + age + politics
##           Df  logLik    AIC     LRT Pr(>Chi)
## <none>       -197.62 409.23                 
## biz_owner  2 -197.34 412.67 0.55974   0.7559
## age                                         
## politics   4 -196.20 414.40 2.83415   0.5860

The assumption of proportional odds was met, as assessed by a full likelihood ratio test comparing the fit of the proportional odds model to a model with varying location parameters for business owner, \(\chi^2\)(2) = 0.560, p = 0.756 and politics, \(\chi^2\)(4) = 2.834, p = 0.586.

Assess the Model Fit

There are three ways to assess overall model fit: The Deviance and Pearson goodness-of-fit tests of the overall model fit; the Cox and Snell, Nagelkerke, and McFadden pseudo R measures of explained variance; and the likelihood ratio test comparing the model fit to the intercept-only model.

Deviance and Pearson

However, these tests rely on large frequencies in each cell, that is, each possible combination of predictor values. Overall goodness-of-fit statistics should be treated with suspicion when a continuous independent variable is present and/or there are a large number of cells with zero frequency.The Pearson goodness-of-fit statistic is \(X^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\) where \(i\) is the observation number and \(j\) is the response variable level. It is a summary of the Pearson residuals, the difference between the observed and expected cell counts, \(O_{ij} - E_{ij}\). The deviance goodness-of-fit statistic is the difference in fit between the model and a full model; a full model being a model that fits the data perfectly, \(G^2 = 2 \sum_{ij} O_{ij} \log \left( \frac{O_{ij}}{E_{ij}} \right)\). Neither of these tests are reliable if there are many cells with zero frequencies and/or small expected frequencies and are generally not recommended. Generally, the chi-squared test requires a frequency count of at least 5 per cell.

# Observed combinations of model vars
tax$cell_patterns <- tax$dat %>% count(biz_owner, age, politics, tax_too_high) %>% nrow()

# Observed combinations of predictor vars * levels of response var
tax$covariate_patterns <- tax$dat %>% count(biz_owner, age, politics) %>% nrow()
tax$possible_cells <- tax$covariate_patterns * length(levels(tax$dat$tax_too_high))

# 1 - ratio of observed to possible
tax$pct_freq_zero <- 1 - tax$cell_patterns / tax$possible_cells

There are 137 observed combinations of model variables (predictors), and 372 possible combinations (predictors * outcome levels), so 63.2% of cells have zero frequencies. Ideally, zero frequencies should be less than 20%, so if you were to use the deviance or Pearson tests, you would need to report this. The results below are contradictory and bogus. I think you’d only use this test if you didn’t have continuous predictor variables.

observed <- tax$dat %>% 
  count(biz_owner, age, politics, tax_too_high) %>%
  pivot_wider(names_from = tax_too_high, values_from = n) %>%
  replace_na(list(`Strongly Disagree` = 0, Disagree = 0, Agree = 0, `Strongly Agree` = 0)) %>%
  pivot_longer(cols = `Strongly Disagree`:`Strongly Agree`, names_to = "outcome", values_to = "observed")

expected <- bind_cols(
  tax$dat, 
  predict(tax$clm, subset(tax$dat, select = -tax_too_high))$fit %>% data.frame()
) %>%
  rename(c("Strongly Disagree" = "Strongly.Disagree", "Strongly Agree" = "Strongly.Agree")) %>%
  group_by(biz_owner, age, politics) %>%
  summarize(.groups = "drop", across(`Strongly Disagree`:`Strongly Agree`, sum)) %>%
  pivot_longer(cols = `Strongly Disagree`:`Strongly Agree`, names_to = "outcome", values_to = "expected")

obs_exp <- observed %>%
  inner_join(expected, by = c("politics", "biz_owner", "age", "outcome")) %>%
  mutate(epsilon_sq = (observed - expected)^2,
         chi_sq = epsilon_sq / expected,
         g_sq = 2 * observed * log((observed+.0001) / expected)
  )

tax$chisq <- list()
tax$chisq$X2 = sum(obs_exp$chi_sq)
tax$chisq$G2 = sum(obs_exp$g_sq)
tax$chisq$df = tax$covariate_patterns * (length(levels(tax$dat$tax_too_high)) - 1) - 7
tax$chisq$X2_p.value = pchisq(tax$chisq$X2, df = tax$chisq$df, lower.tail = FALSE)
tax$chisq$G2_p.value = pchisq(tax$chisq$G2, df = tax$chisq$df, lower.tail = FALSE)

The Pearson goodness-of-fit test indicated that the model was not a good fit to the observed data, \(\chi^2\)(272) = 745.4, p < .001$. The deviance goodness-of-fit test indicated that the model was a good fit to the observed data, \(G^2\)(272) = 232.6, p = 0.960.

Pseudo-R2 Measures

There are a number of measures in ordinal regression that attempt to provide a similar “variance explained” measure as that provided in ordinary least-squares linear regression. However, these measures do not have the direct interpretation that they do in ordinary linear regression and are often, therefore, referred to as “pseudo” R2 measures. The three most common measures (Cox and Snell, Nagelkerke, and McFadden) are not particularly good and not universally used. It is presented in the SPSS output, so you might encounter it in published work.

tax$nagelkerke <- rcompanion::nagelkerke(tax$clm)
tax$nagelkerke$Pseudo.R.squared.for.model.vs.null
##                              Pseudo.R.squared
## McFadden                             0.181957
## Cox and Snell (ML)                   0.367369
## Nagelkerke (Cragg and Uhler)         0.399641
Likelihood Ratio Test

The best way to assess model fit is the likelihood ratio test comparing the model to an intercept-only model. The difference in the -2 log likelihood between the models has a \(\chi^2\) distribution with degrees of freedom equal to the difference in the number of parameters.

intercept_only <- clm(tax_too_high ~ 1, data = tax$dat)
tax$lrt <- anova(tax$clm, intercept_only)
tax$lrt
## Likelihood ratio tests of cumulative link models:
##  
##                formula:         link: threshold:
## intercept_only tax_too_high ~ 1 logit flexible  
## tax$clm        tax$fmla         logit flexible  
## 
##                no.par    AIC  logLik LR.stat df Pr(>Chisq)    
## intercept_only      3 489.14 -241.57                          
## tax$clm             7 409.23 -197.62  87.911  4  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table shows the log likelihoods of the two models. LR.stat is the difference between 2 * the logLik values.

The final model statistically significantly predicted the dependent variable over and above the intercept-only model, \(\chi^2(4)\) = 87.9, p = 0.000.

Interpret Results

Return to the model summary.

tidy(tax$clm)
## # A tibble: 7 × 6
##   term                       estimate std.error statistic  p.value coef.type
##   <chr>                         <dbl>     <dbl>     <dbl>    <dbl> <chr>    
## 1 Strongly Disagree|Disagree   7.03      1.17       6.02  1.70e- 9 intercept
## 2 Disagree|Agree               8.77      1.23       7.12  1.08e-12 intercept
## 3 Agree|Strongly Agree        11.7       1.36       8.59  8.72e-18 intercept
## 4 biz_ownerYes                 0.665     0.289      2.30  2.14e- 2 location 
## 5 age                          0.242     0.0326     7.42  1.17e-13 location 
## 6 politicsLib                  0.0369    0.364      0.102 9.19e- 1 location 
## 7 politicsCon                  1.16      0.346      3.36  7.76e- 4 location

The coefficients for biz_owner, age, and politics are positive. Positive parameters increase the likelihood of stronger agreement with the statement. In this case, discontent with taxes are higher for business owners, increase with age, and are higher for Liberal Democrats and Conservatives relative to the Labor Party. The expected cumulative log-odds of declaring \(\le j\) level of agreement with the statement for the baseline group (biz_ownerNo, age = 0, politicsLib) is 7.026 for \(j = 1\) (Strongly Disagree), 8.766 for \(j = 2\) (Disagree), and 11.653 for \(j = 3\) (Agree).

You could solve the logit equation for

\[\pi_j = \frac{\mathrm{exp}(Y_i)} {1 + \mathrm{exp}(Y_i)}\] to get the cumulative probabilities for each level. That’s what predict(type = "cum.prob") does. But it might be more intuitive to work with individual probabilities, the lagged differences to get the individual probabilities for each \(j\). That’s what predict(type = "prob") does. I like to play with predicted values to get a sense of the outcome distributions. In this case, I’ll take the median age, and each combination of biz_owner and politics.

new_data <- tax$dat %>% 
  mutate(age = median(tax$dat$age)) %>% 
  expand(age, politics, biz_owner)

preds <- predict(tax$clm, newdata = new_data, type = "prob")[["fit"]] %>% as.data.frame()

bind_cols(new_data, preds) %>%
  pivot_longer(cols = `Strongly Disagree`:`Strongly Agree`) %>% 
  mutate(name = factor(name, levels = levels(tax$dat$tax_too_high))) %>%
  ggplot(aes(y = politics, x = value, fill = fct_rev(name))) +
  geom_col() +
  geom_text(aes(label = scales::percent(value, accuracy = 1)), 
            size = 3, position = position_stack(vjust=0.5)) +
  facet_grid(~paste("Bus Owner = ", biz_owner)) +
  scale_fill_grey(start = 0.5, end = 0.8) +
  theme_bw() + 
  theme(legend.position = "top",
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  guides(fill = guide_legend(reverse = TRUE)) +
  labs(title = "Taxes too High for Conservative Business Owners?", 
       x = NULL, fill = NULL)

You will want to establish whether politics is statistically significant overall before exploring any specific contrasts. The ANOVA procedure with type I test reports an overall test of significance for each variable entered into the model.

(tax$anovaI <- anova(tax$clm, type = "I"))
## Type I Analysis of Deviance Table with Wald chi-square tests
## 
##           Df  Chisq Pr(>Chisq)    
## biz_owner  1 13.201  0.0002798 ***
## age        1 57.413  3.533e-14 ***
## politics   2 14.636  0.0006635 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The political party last voted for has a statistically significant effect on the prediction of whether tax is thought to be too high, Wald \(\chi^2\)(2) = 14.6, p = 0.001.

The best way to work with the data is with the tidy(exponentiate = TRUE) version.

tax$tidy <- tax$clm %>% tidy(conf.int = TRUE, exponentiate = TRUE)
tax$tidy
## # A tibble: 7 × 8
##   term                  estim…¹ std.e…² stati…³  p.value conf.…⁴ conf.…⁵ coef.…⁶
##   <chr>                   <dbl>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl> <chr>  
## 1 Strongly Disagree|Di…  1.13e3  1.17     6.02  1.70e- 9  NA       NA    interc…
## 2 Disagree|Agree         6.41e3  1.23     7.12  1.08e-12  NA       NA    interc…
## 3 Agree|Strongly Agree   1.15e5  1.36     8.59  8.72e-18  NA       NA    interc…
## 4 biz_ownerYes           1.94e0  0.289    2.30  2.14e- 2   1.11     3.44 locati…
## 5 age                    1.27e0  0.0326   7.42  1.17e-13   1.20     1.36 locati…
## 6 politicsLib            1.04e0  0.364    0.102 9.19e- 1   0.508    2.12 locati…
## 7 politicsCon            3.19e0  0.346    3.36  7.76e- 4   1.63     6.35 locati…
## # … with abbreviated variable names ¹​estimate, ²​std.error, ³​statistic,
## #   ⁴​conf.low, ⁵​conf.high, ⁶​coef.type

Then you can summarize the table in words.

The odds of business owners considering tax to be too high was 1.944 (95% CI, 1.107 to 3.443) times that of non-business owners, a statistically significant effect, z = 2.300, p = 0.021.

The odds of Conservative voters considering tax to be too high was 3.194 (95% CI, 1.635 to 6.351) times that of Labour voters, a statistically significant effect, z = 3.361, p = 0.001. The odds of Liberal Democrat voters considering tax to be too high was similar to that of Labour voters (odds ratio of 1.038 (95% CI, 0.508 to 2.121), p = 0.919.

An increase in age (expressed in years) was associated with an increase in the odds of considering tax too high, with an odds ratio of 1.274 (95% CI, 1.197 to 1.360), z = 7.421, p = 0.000.

Reporting

Here is the complete write-up.

A cumulative odds ordinal logistic regression with proportional odds was run to determine the effect of business ownership, political party voted for, and age, on the belief that taxes are too high. There were proportional odds, as assessed by a full likelihood ratio test comparing the fitted model to a model with varying location parameters, \(\chi^2\)(8) = 8.620, p = 0.375. The final model statistically significantly predicted the dependent variable over and above the intercept-only model, \(\chi^2(4)\) = 87.9, p = 0.000. The odds of business owners considering tax to be too high was 1.944 (95% CI, 1.107 to 3.443) times that of non-business owners, a statistically significant effect, z = 2.300, p = 0.021. The political party last voted for has a statistically significant effect on the prediction of whether tax is thought to be too high, Wald \(\chi^2\)(2) = 14.6, p = 0.001. The odds of Conservative voters considering tax to be too high was 3.194 (95% CI, 1.635 to 6.351) times that of Labour voters, a statistically significant effect, z = 3.361, p = 0.001. The odds of Liberal Democrat voters considering tax to be too high was similar to that of Labour voters (odds ratio of 1.038 (95% CI, 0.508 to 2.121), p = 0.919. An increase in age (expressed in years) was associated with an increase in the odds of considering tax too high, with an odds ratio of 1.274 (95% CI, 1.197 to 1.360), z = 7.421, p = 0.000.

Package gtsummary shows a nice summary table.

tbl_regression(tax$clm)
Characteristic log(OR)1 95% CI1 p-value
biz_owner
    No
    Yes 0.66 0.10, 1.2 0.021
age 0.24 0.18, 0.31 <0.001
politics
    Lab
    Lib 0.04 -0.68, 0.75 >0.9
    Con 1.2 0.49, 1.8 <0.001
1 OR = Odds Ratio, CI = Confidence Interval

2.4 Poisson Regression

Poisson models count data, like “traffic tickets per day”, or “website hits per day”. The response is an expected rate or intensity. For count data, specify the generalized model, this time with family = poisson or family = quasipoisson.

Recall that the probability of achieving a count \(y\) when the expected rate is \(\lambda\) is distributed

\[P(Y = y|\lambda) = \frac{e^{-\lambda} \lambda^y}{y!}.\]

The poisson regression model is

\[\lambda = \exp(X \beta).\]

You can solve this for \(y\) to get

\[y = X\beta = \ln(\lambda).\]

That is, the model predicts the log of the response rate. For a sample of size n, the likelihood function is

\[L(\beta; y, X) = \prod_{i=1}^n \frac{e^{-\exp({X_i\beta})}\exp({X_i\beta})^{y_i}}{y_i!}.\]

The log-likelihood is

\[l(\beta) = \sum_{i=1}^n (y_i X_i \beta - \sum_{i=1}^n\exp(X_i\beta) - \sum_{i=1}^n\log(y_i!).\]

Maximizing the log-likelihood has no closed-form solution, so the coefficient estimates are found through interatively reweighted least squares.

Poisson processes assume the variance of the response variable equals its mean. “Equals” means the mean and variance are of a similar order of magnitude. If that assumption does not hold, use the quasi-poisson. Use Poisson regression for large datasets. If the predicted counts are much greater than zero (>30), the linear regression will work fine. Whereas RMSE is not useful for logistic models, it is a good metric in Poisson.

Dataset fire contains response variable injuries counting the number of injuries during the month and one explanatory variable, the month mo.

fire <- read_csv(file = "C:/Users/mpfol/OneDrive/Documents/Data Science/Data/CivilInjury_0.csv")
## Rows: 300 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (2): ID, Total Injuries
## dttm (1): Injury Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fire <- fire %>% 
  mutate(mo = as.POSIXlt(`Injury Date`)$mon + 1) %>%
  rename(dt = `Injury Date`,
         injuries = `Total Injuries`)
str(fire)
## tibble [300 × 4] (S3: tbl_df/tbl/data.frame)
##  $ ID      : num [1:300] 1 2 3 4 5 6 7 8 9 10 ...
##  $ dt      : POSIXct[1:300], format: "2005-01-10" "2005-01-11" ...
##  $ injuries: num [1:300] 1 1 1 5 2 1 1 1 1 1 ...
##  $ mo      : num [1:300] 1 1 1 1 1 1 2 2 2 4 ...

In a situation like this where there the relationship is bivariate, start with a visualization.

ggplot(fire, aes(x = mo, y = injuries)) +
  geom_jitter() +
  geom_smooth(method = "glm", method.args = list(family = "poisson")) +
  labs(title = "Injuries by Month")
## `geom_smooth()` using formula 'y ~ x'

Fit a poisson regression in R using glm(formula, data, family = poisson). But first, check whether the mean and variance of injuries are the same magnitude? If not, then use family = quasipoisson.

mean(fire$injuries)
## [1] 1.36
var(fire$injuries)
## [1] 1.020468

They are of the same magnitude, so fit the regression with family = poisson.

m2 <- glm(injuries ~ mo, family = poisson, data = fire)
summary(m2)
## 
## Call:
## glm(formula = injuries ~ mo, family = poisson, data = fire)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3987  -0.3473  -0.3034  -0.2502   4.3185  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.22805    0.10482   2.176   0.0296 *
## mo           0.01215    0.01397   0.870   0.3844  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 139.87  on 299  degrees of freedom
## Residual deviance: 139.11  on 298  degrees of freedom
## AIC: 792.08
## 
## Number of Fisher Scoring iterations: 5

The predicted value \(\hat{y}\) is the estimated log of the response variable,

\[\hat{y} = X \hat{\beta} = \ln (\lambda).\]

Suppose mo is January (mo = ), then the log ofinjuries` is \(\hat{y} = 0.323787\). Or, more intuitively, the expected count of injuries is \(\exp(0.323787) = 1.38\)

predict(m2, newdata = data.frame(mo=1))
##         1 
## 0.2401999
predict(m2, newdata = data.frame(mo=1), type = "response")
##        1 
## 1.271503

Here is a plot of the predicted counts in red.

augment(m2, type.predict = "response") %>%
  ggplot(aes(x = mo, y = injuries)) +
  geom_point() +
  geom_point(aes(y = .fitted), color = "red") + 
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Month",
       y = "Injuries",
       title = "Poisson Fitted Line Plot")

Evaluate a logistic model fit with an analysis of deviance.

(perf <- glance(m2))
## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          140.     299  -394.  792.  799.     139.         298   300
(pseudoR2 <- 1 - perf$deviance / perf$null.deviance)
## [1] 0.005413723

The deviance of the null model (no regressors) is 139.9. The deviance of the full model is 132.2. The psuedo-R2 is very low at .05. How about the RMSE?

RMSE(pred = predict(m2, type = "response"), obs = fire$injuries)
## [1] 1.006791

The average prediction error is about 0.99. That’s almost as much as the variance of injuries - i.e., just predicting the mean of injuries would be almost as good! Use the GainCurvePlot() function to plot the gain curve.

augment(m2, type.predict = "response") %>%
  ggplot(aes(x = injuries, y = .fitted)) +
  geom_point() +
  geom_smooth(method ="lm") +
  labs(x = "Actual",
       y = "Predicted",
       title = "Poisson Fitted vs Actual")
## `geom_smooth()` using formula 'y ~ x'

augment(m2) %>% data.frame() %>% 
  GainCurvePlot(xvar = ".fitted", truthVar = "injuries", title = "Poisson Model")

It seems that mo was a poor predictor of injuries.

References


  1. These notes are primarily from PSU’s Analysis of Discrete Data which uses Alan Agresti’s Categorical Data Analysis (Agresti 2013). I also reviewed PSU’s Regression Methods, DataCamp’s Generalized Linear Models in R, DataCamp’s Multiple and Logistic Regression, and Interpretable machine learning (Molnar 2020).↩︎

  2. The related probit regression link function is \(f(E(Y|X)) = \Phi^{-1}(E(Y|X)) = \Phi^{-1}(\pi)\). The difference between the logistic and probit link function is theoretical, and the practical significance is slight. You can safely ignore probit.↩︎

  3. Notes from Machine Learning Mastery↩︎

  4. These notes rely on the course notes from PSU STAT 504, Analysis of Discrete Data, UCLA, and Sara Worth.↩︎