# Chapter 2 Generalized Linear Models (GLM)

The linear regression model, \(E(y|X) = X \beta\), structured as \(y_i = X_i \beta + \epsilon_i\) where \(X_i \beta = \mu_i\), assumes the response is a linear function of the predictors and the residuals are independent random variables normally distributed with zero mean and constant variance, \(\epsilon \sim N \left(0, \sigma^2 \right)\). This implies that given a set of predictors, the response is normally distributed about its expected value, \(y_i \sim N \left(\mu_i, \sigma^2 \right)\). However, there are many situations where this normality assumption does not hold. Generalized linear models (GLMs) are a generalization of the linear regression model that addresses non-normal response distributions.^{1}

The response will not have a normal distribution if the underlying data-generating process is binomial (Section 2.1) or multinomial (Section 2.2), ordinal (Section 2.3), Poisson (counts, Section 2.4), or exponential (time-to-event). In these situations the linear regression model can predict probabilities and proportions outside [0, 1], or negative counts or times. GLMs solve this problem by modeling a *function* of the expected value of \(y\), \(f(E(y|X)) = X \beta\). There are three components to a GLM: a *random component* specified by the response variable’s probability distribution (normal, binomial, etc.); a *systematic component* in the form \(X\beta\); and a *link function*, \(\eta\), that specifies the link between the random and systematic components and converts the response to a range of \([-\infty, +\infty]\).

Linear regression is a special case of GLM where link function is the identity function, \(f(E(y|X)) = E(y|X)\). For binary logistic regression, the link function is the logit,

\[f(E(y|X)) = \ln \left( \frac{E(y|X)}{1 - E(y|X)} \right) = \log \left( \frac{\pi}{1 - \pi} \right) = \mathrm{logit}(\pi)\]

where \(\pi\) is the response probability.^{2}

For Poisson regression, the link function is

\[f(E(y|X)) = \ln (E(y|X)) = \ln(\lambda)\]

where \(\lambda\) is the expected event rate.

For exponential regression, the link function is

\[f(E(y|X) = -E(y|X) = -\lambda\]

where \(\lambda\) is the expected time to event.

GLM uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.

In R, specify a GLM just like an linear model, but with the `glm()`

function, specifying the distribution with the `family`

parameter.

`family = "gaussian"`

: linear regression`family = "binomial"`

: logistic regression`family = "poisson"`

: Poisson regression

Mulinomial logistic is a little different because it models a set of logits. Use `nnet::multinom()`

.

## 2.1 Binomial Logistic Regression

Logistic regression estimates the probability of a particular level of a categorical response variable given a set of predictors. The response levels can be binary, nominal, or ordinal.

The *binary* logistic regression model is

\[y_i = \mathrm{logit}(\pi_i) = \log \left( \frac{\pi_i}{1 - \pi_i} \right) = X_i \beta\]

where \(\pi_i\) is the response \(i\)’s binary level probability. The model predicts the *log odds* of the level. Why do this? The range of outcomes need to be between 0 and 1. A sigmoid function, \(f(y) = 1 / \left(1 + e^y \right)\), does that. So if the *log odds* of the level equals \(X_i\beta\), then the *odds* of the level equals \(e^{X\beta}\). You can solve for \(\pi_i\) to get \(\pi = \mathrm{odds} / (\mathrm{odds} + 1)\). Substituting,

\[\pi_i = \frac{\exp(y_i)}{1 + \exp(y_i)} = \frac{e^{X_i\beta}}{1 + e^{X_i\beta}}\]

which you can simplify to \(\pi_i = 1 / (1 + e^{-X_i\beta})\), a sigmoid function. The upshot is \(X\beta\) is the functional relationship between the independent variables and *a function of the response*, not the response itself.

The model parameters are estimated either by *iteratively reweighted least squares optimization* or by *maximum likelihood estimation* (MLE). Here is how MLE works^{3}. MLE maximizes the probability produced by a set of parameters \(\beta\) given a data set and probability distribution. In logistic regression the probability distribution is the binomial and each observation is the outcome of a single Bernoulli trial.

\[L(\beta; y, X) = \prod_{i=1}^n \pi_i^{y_i}(1 - \pi_i)^{(1-y_i)} = \prod_{i=1}^n\frac{\exp(y_i X_i \beta)}{1 + \exp(X_i \beta)}.\]

In practice, multiplying many small probabilities can be unstable, so MLE optimizes the log likelihood instead.

\[\begin{align} l(\beta; y, X) &= \sum_{i = 1}^n \left(y_i \log(\pi_i) + (1 - y_i) \log(1 - \pi_i) \right) \\ &= \sum_{i = 1}^n \left(y_i X_i \beta - \log(1 + e^{X_i\beta}) \right) \end{align}\]

Sometimes you will see the *cost function* optimized. This is the negative of of the log likelihood function.

### 2.1.1 Case Study

Dataset `donner`

contains *n* = 45 observations of the survival status (`surv`

) of the members of the Donner party with explanatory variables `age`

and `sex`

.

```
glimpse(donner)
## Rows: 45
## Columns: 3
## $ age <dbl> 23, 40, 40, 30, 28, 40, 45, 62, 65, 45, 25, 28, 28, 23, 22, 23, 2…
## $ sex <fct> M, F, M, M, M, M, F, M, M, F, F, M, M, M, F, F, M, F, F, M, F, M,…
## $ surv <fct> Died, Lived, Lived, Died, Died, Died, Died, Died, Died, Died, Die…
```

### 2.1.2 Fit the Model

Fit a logistic regression \(\mathrm{surv} = \mathrm{sex} + \mathrm{age} + \mathrm{sex:age}\). The reference case based on the factor definition is `sex`

= “F”, and `age`

= 28 (the median age).

```
<- donner %>%
donner_centered mutate(age = age - median(donner$age))
<- glm(surv ~ sex*age, data = donner_centered, family = "binomial")
m summary(m)
```

```
##
## Call:
## glm(formula = surv ~ sex * age, family = "binomial", data = donner_centered)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2279 -0.9388 -0.5550 0.7794 1.6998
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.81231 1.04269 1.738 0.0822 .
## sexM -2.40334 1.11677 -2.152 0.0314 *
## age -0.19407 0.08742 -2.220 0.0264 *
## sexM:age 0.16160 0.09426 1.714 0.0865 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 61.827 on 44 degrees of freedom
## Residual deviance: 47.346 on 41 degrees of freedom
## AIC: 55.346
##
## Number of Fisher Scoring iterations: 5
```

The “z value” is the Wald \(z\) statistic, \(z = \hat{\beta} / SE(\hat{\beta})\). Its square, \(z^2\), is the Wald chi-squared statistic. The *p*-value is the area to the right of \(z^2\) in the \(\chi_1^2\) density curve:

```
%>% tidy() %>%
m mutate(
z = estimate / std.error,
p_z2 = pchisq(z^2, df = 1, lower.tail = FALSE)
%>%
) select(term, z, p_z2)
```

```
## # A tibble: 4 × 3
## term z p_z2
## <chr> <dbl> <dbl>
## 1 (Intercept) 1.74 0.0822
## 2 sexM -2.15 0.0314
## 3 age -2.22 0.0264
## 4 sexM:age 1.71 0.0865
```

Below the Coefficients table, the “dispersion parameter” refers to overdispersion, a common issue with GLM. For a logistic regression, the response variable should be distributed \(y_i \sim \mathrm{Bin}(n_i, \pi_i)\) with \(\mu_i = n_i \pi_i\) and \(\sigma^2 = \pi (1 - \pi)\). Overdispersion means the data shows evidence of variance greater than \(\sigma^2\).

“Fisher scoring” is a method for ML estimation. Logistic regression uses interatively reweighted least squares to fit the model, so this section indicates whether the algorithm converged.

The null deviance is the likelihood ratio \(G^2\) = 61.827 of the intercept-only model. The residual deviance is the likelihood ratio \(G^2\) = 47.346 after including the model covariates. \(G^2\) is large, so reject the null hypothesis of no age or sex effects. An ANOVA table shows the change in deviance from successively adding each variable to the model.

`anova(m)`

```
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: surv
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 44 61.827
## sex 1 4.5403 43 57.286
## age 1 6.0300 42 51.256
## sex:age 1 3.9099 41 47.346
```

### 2.1.3 Interpret Results

Plug in values to interpret the model. The log odds of a 24 year-old female surviving is \(\hat{y}\) = 2.59. The log odds of a 24 year-old male surviving is \(\hat{y}\) = -0.46.

```
<- expand.grid(sex = factor(c("F", "M")), age = 24 - median(donner$age))
new_dat <- predict(m, newdata = new_dat))
(preds ## 1 2
## 2.5886056 -0.4611192
```

Exponentiate to get the **odds**.

\[\mathrm{odds}(\hat{y}) = \exp (\hat{y}) = \frac{\pi}{1 - \pi}.\]

```
exp(preds)
## 1 2
## 13.3111976 0.6305775
```

Solve for \(\pi\) to get the **probability**.

\[\pi = \frac{\exp (\hat{y})}{1 + \exp (\hat{y})}\]

You can do the math, or use `predict(type = "response")`

.

```
<- predict(m, newdata = new_dat, type = "response"))
(preds ## 1 2
## 0.9301246 0.3867204
```

Interpret the coefficient estimates as the change in the log odds of \(y\) due to a \(\delta = x_1 - x_0\) unit change in \(x\).

\[\log \left(\pi / (1 - \pi) |_{X = X_1} \right) - \log \left(\pi / (1 - \pi) |_{X = X_0} \right) = \log \left( \frac{\pi / (1 - \pi) |_{X = X_1}}{\pi / (1 - \pi) |_{X = X_0}} \right) = \delta \hat{\beta}\]

Better yet, interpret the exponential of the coefficient estimates as the change in the *odds* of \(y\) due to a \(\delta\) unit change in \(x\).

\[\mathrm{odds}(y) = e^{\delta \hat{\beta}}\]

The reference case is `sex`

= “F” and `age`

= 0 (centered on median age, 28).

- The survival odds of a 28 year old female are \(\hat{y} = e^{(Intercept)}\) = 6.1 to 1. The expected probability of surviving is 6.1 / (1 + 6.1) = 86%.
- The survival odds of a 24 year old female are \(\hat{y} = e^{(Intercept) -4 \cdot \mathrm{age}}\) = 13.3 to 1. The expected probability of surviving is 13.3 / (1 + 13.3) = 93%.
- The survival odds of a 28 year old male are \(\hat{y} = e^{(Intercept) + sexM}\) = 0.6. The expected probability of surviving is 0.6 / (1 + 0.6) = 36%.
- The survival odds of a 24 year old male are \(\hat{y} = e^{(Intercept) -4 \cdot \mathrm{age} -4 \cdot \mathrm{sexM} \cdot \mathrm{sexM:age}}\) = 0.6. The expected probability of surviving is 0.6 / (1 + 0.6) = 39%.
- A female’s survival odds are multiplied by a factor of \(e^{(1 \cdot \mathrm{age})}\) = 0.824 per additional year of age (i.e. the odds
*fall*by 17.6%). - A male’s survival odds are multiplied by a factor of \(e^{1 \cdot \mathrm{age}} e^{1 \cdot \mathrm{sexM:age}}\) = 0.968 per additional year of age (i.e. the odds
*fall*by 3.2%). - A 28 year old female’s survival odds are multiplied by a factor of \(e^{1 \cdot \mathrm{sex}}\) = 0.090 if she were instead a male. (i.e. the odds
*fall*by 91.0%)

`oddsratio::or_glm()`

calculates the odds ratio from an increment in the predictor values.

`::or_glm(donner_centered, m, incr = list(age = 1)) oddsratio`

```
## predictor oddsratio ci_low (2.5) ci_high (97.5) increment
## 1 sexM 0.090 0.005 0.561 Indicator variable
## 2 age 0.824 0.652 0.945 1
## 3 sexM:age 1.175 1.001 1.496 Indicator variable
```

Expressing the predicted values as probabilities produces the familiar signmoidal shape of the binary relationship.

```
augment(m, type.predict = "response") %>%
ggplot(aes(x = age, color = sex)) +
geom_point(aes(y = as.integer(surv))) +
geom_line(aes(y = .fitted+1)) +
labs(x = "Age (relative to median)",
y = NULL,
title = "Binary Fitted Line Plot") +
scale_y_continuous(breaks = c(1,2), labels = c("Died", "Lived")) +
theme_light() +
theme(legend.position = "right")
```

### 2.1.4 Assess Model Fit

Evaluate a logistic regression using a Gain curve or ROC curve.

#### Gain Curve

In the **gain curve**, the x-axis is the fraction of items seen when sorted by the predicted value, and the y-axis is the cumulatively summed true outcome. The “wizard” curve is the gain curve when the data is sorted by the true outcome. If the model’s gain curve is close to the wizard curve, then the model predicts the response well. The gray area is the “gain” over a random prediction.

20 of the 45 members of the Donner party survived.

- The gain curve encountered 10 survivors (50%) within the first 12 observations (27%). It encountered all 20 survivors on the 37th observation.
- The bottom of the grey area is the outcome of a random model. Only half the survivors would be observed within 50% of the observations.

- The top of the grey area is the outcome of the perfect model, the “wizard curve”. Half the survivors would be observed in 10/45=22% of the observations.

```
augment(m, type.predict = "response") %>%
# event_level = "second" sets the second level as success
::gain_curve(surv, .fitted, event_level = "second") %>%
yardstickautoplot() +
labs(title = "Gain Curve")
```

#### ROC Curve

The ROC (Receiver Operating Characteristics) curve plots sensitivity vs specificity at different cut-off values for the probability, ranging cut-off from 0 to 1.

```
augment(m, type.predict = "response") %>%
# event_level = "second" sets the second level as success
::roc_curve(surv, .fitted, event_level = "second") %>%
yardstickautoplot() +
labs(title = "ROC Curve")
```

## 2.2 Multinomial Logistic Regression

The *multinomial* logistic regression model is

\[y_i = \log \left( \frac{\pi_{ij}}{\pi_{ij^*}} \right) = X_i \beta_j, \hspace{5mm} j \ne j^*\]

where \(j\) is a level of the multinomial response variable. Whereas the binomial logistic regression equation models the *log odds* of the response level, the multinomial logistic regression models the *log relative risk*, the probability relative to the baseline \(j^*\) level (as opposed to \(\pi_{ij} / (1 - \pi_{ij}\)).^{4}

As with binary logistic regression, interpret the \(k^{th}\) element of \(\beta_j\) as the increase in log odds (relative risk) of \(Y_i = j\) relative to \(Y_i = j^*\) given a one-unit increase in the \(k^{th}\) element of \(X\), holding the other terms constant.

**Assumptions**

Multinomial logistic regression applies when the dependent variable is categorical. It presumes a linear relationship between the log odds of the dependent variable and \(X\) with residuals \(\epsilon\) that are independent. It also assumes there is no severe multicollinearity in the predictors, and there is independence of irrelevant alternatives (IIA). IIA means the relative likelihood of being in one category compared to the base category would not change if you added other categories.

### 2.2.1 Case Study

This case study is based on notes from Sara Worth and dat from UCLA.

```
download.file(
"https://stats.idre.ucla.edu/stat/data/hsbdemo.dta",
"./input/hsbdemo.dta",
mode = "wb"
)
```

```
<- list()
cs2 $dat <- read.dta("./input/hsbdemo.dta") cs2
```

A study measures the association between a student’s academic program (general, academic, and vocational) and their socioeconomic status (SES) (low, middle, high) and writing score.

```
$dat %>%
cs2tbl_summary(
by = prog,
include = c(prog, ses, write),
statistic = list(all_continuous() ~ "{mean}, {sd}")
%>%
) ::add_overall() gtsummary
```

Characteristic |
Overall, N = 200^{1} |
general, N = 45^{1} |
academic, N = 105^{1} |
vocation, N = 50^{1} |
---|---|---|---|---|

ses | ||||

low | 47 (24%) | 16 (36%) | 19 (18%) | 12 (24%) |

middle | 95 (48%) | 20 (44%) | 44 (42%) | 31 (62%) |

high | 58 (29%) | 9 (20%) | 42 (40%) | 7 (14%) |

write | 53, 9 | 51, 9 | 56, 8 | 47, 9 |

^{1} n (%); Mean, SD |

```
$dat %>%
cs2mutate(write_bin = cut(write, breaks = 5, dig.lab = 1, right = FALSE)) %>%
count(prog, ses, write_bin) %>%
group_by(ses, write_bin) %>%
mutate(prob = n / sum(n)) %>%
ungroup() %>%
ggplot(aes(x = write_bin, y = prob, color = ses)) +
geom_point() +
geom_line(aes(group = ses)) +
facet_wrap(facets = vars(prog)) +
theme(legend.position = "top") +
labs(title = "Program Probability for Students by SES")
```

Set academic as the reference level of the dependent variable, then fit the multinomial logistic regression with `nnet::multinom()`

.

```
$dat <- cs2$dat %>% mutate(prog = fct_relevel(prog, "academic"))
cs2
$fit <- nnet::multinom(prog ~ ses + write, data = cs2$dat) cs2
```

```
## # weights: 15 (8 variable)
## initial value 219.722458
## iter 10 value 179.982880
## final value 179.981726
## converged
```

`summary(cs2$fit)`

```
## Call:
## nnet::multinom(formula = prog ~ ses + write, data = cs2$dat)
##
## Coefficients:
## (Intercept) sesmiddle seshigh write
## general 2.852198 -0.5332810 -1.1628226 -0.0579287
## vocation 5.218260 0.2913859 -0.9826649 -0.1136037
##
## Std. Errors:
## (Intercept) sesmiddle seshigh write
## general 1.166441 0.4437323 0.5142196 0.02141097
## vocation 1.163552 0.4763739 0.5955665 0.02221996
##
## Residual Deviance: 359.9635
## AIC: 375.9635
```

`tbl_regression(cs2$fit, exponentiate = TRUE, conf.int = FALSE)`

```
## ℹ Multinomial models have a different underlying structure than the models
## gtsummary was designed for. Other gtsummary functions designed to work with
## tbl_regression objects may yield unexpected results.
```

Characteristic |
OR^{1} |
p-value |
---|---|---|

general | ||

ses | ||

low | — | |

middle | 0.59 | 0.2 |

high | 0.31 | 0.024 |

write | 0.94 | 0.007 |

vocation | ||

ses | ||

low | — | |

middle | 1.34 | 0.5 |

high | 0.37 | 0.10 |

write | 0.89 | <0.001 |

^{1} OR = Odds Ratio |

A one unit increase in the writing score is associated with a 0.058 decrease in the log odds of being in a general vs academic program. The log odds of being in a general vs academic program are 1.163 lower for high SES than low SES.

The exponential of the coefficients is the relative risk (ratio of probabilities) relative to the baseline.

`exp(coef(cs2$fit))`

```
## (Intercept) sesmiddle seshigh write
## general 17.32582 0.5866769 0.3126026 0.9437172
## vocation 184.61262 1.3382809 0.3743123 0.8926116
```

The relative risk ratio of being in a general vs academic program is 0.944 for students with a one unit higher writing score. The relative risk ratio of being in a general vs academic program are 0.313 for high SES relative low SES.

Define a range of interesting values for the continuous predictor and plot the associated predicted values.

`summary(cs2$dat$write)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31.00 45.75 54.00 52.77 60.00 67.00
```

```
<- expand.grid(
newdat ses = cs2$dat$ses,
write = seq(from = round(min(cs2$dat$write)), to = round(max(cs2$dat$write)), by = 1)
)
bind_cols(
newdat,predict(cs2$fit, newdata = newdat, type = "probs")
%>%
) pivot_longer(cols = -c(ses, write), names_to = "prog", values_to = "probability") %>%
ggplot(aes(x = write, y = probability, color = ses)) +
geom_line() +
facet_wrap(facets = vars(prog))
```

The `summary()`

object does not include the *p*-values. The `tidy()`

object does.

`tidy(cs2$fit)`

```
## # A tibble: 8 × 6
## y.level term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 general (Intercept) 2.85 1.17 2.45 0.0145
## 2 general sesmiddle -0.533 0.444 -1.20 0.229
## 3 general seshigh -1.16 0.514 -2.26 0.0237
## 4 general write -0.0579 0.0214 -2.71 0.00682
## 5 vocation (Intercept) 5.22 1.16 4.48 0.00000730
## 6 vocation sesmiddle 0.291 0.476 0.612 0.541
## 7 vocation seshigh -0.983 0.596 -1.65 0.0989
## 8 vocation write -0.114 0.0222 -5.11 0.000000318
```

Here is an example using multinomial logistic regression. A researcher classified the stomach contents of \(n = 219\) alligators according to \(r = 5\) categories (*fish, Inv., Rept, Bird, Other*) as a function of covariates *Lake*, *Sex*, and *Size*..

```
<- tribble(
gator_dat ~profile, ~Gender, ~Size, ~Lake, ~Fish, ~Invertebrate, ~Reptile, ~Bird, ~Other,
"1", "f", "<2.3", "george", 3, 9, 1, 0, 1,
"2", "m", "<2.3", "george", 13, 10, 0, 2, 2,
"3", "f", ">2.3", "george", 8, 1, 0, 0, 1,
"4", "m", ">2.3", "george", 9, 0, 0, 1, 2,
"5", "f", "<2.3", "hancock", 16, 3, 2, 2, 3,
"6", "m", "<2.3", "hancock", 7, 1, 0, 0, 5,
"7", "f", ">2.3", "hancock", 3, 0, 1, 2, 3,
"8", "m", ">2.3", "hancock", 4, 0, 0, 1, 2,
"9", "f", "<2.3", "oklawaha", 3, 9, 1, 0, 2,
"10", "m", "<2.3", "oklawaha", 2, 2, 0, 0, 1,
"11", "f", ">2.3", "oklawaha", 0, 1, 0, 1, 0,
"12", "m", ">2.3", "oklawaha", 13, 7, 6, 0, 0,
"13", "f", "<2.3", "trafford", 2, 4, 1, 1, 4,
"14", "m", "<2.3", "trafford", 3, 7, 1, 0, 1,
"15", "f", ">2.3", "trafford", 0, 1, 0, 0, 0,
"16", "m", ">2.3", "trafford", 8, 6, 6, 3, 5
)<- gator_dat %>%
gator_dat mutate(
Gender = as_factor(Gender),
Lake = fct_relevel(Lake, "hancock"),
Size = as_factor(Size)
)
```

There are 4 equations to estimate:

\[\log \left( \frac{\pi_j} {\pi_{j^*}} \right) = \beta X\]

where \(\pi_{j^*}\) is the probability of fish, the baseline category.

Run a multivariate logistic regression model with `VGAM::vglm()`

.

`library(VGAM)`

`vglm()`

fits 4 logit models.

```
<- vglm(
gator_vglm cbind(Bird,Invertebrate,Reptile,Other,Fish) ~ Lake + Size + Gender,
data = gator_dat,
family = multinomial
)summary(gator_vglm)
```

```
##
## Call:
## vglm(formula = cbind(Bird, Invertebrate, Reptile, Other, Fish) ~
## Lake + Size + Gender, family = multinomial, data = gator_dat)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -1.8568 0.5813 -3.195 0.001400 **
## (Intercept):2 -1.6115 0.5508 -2.926 0.003435 **
## (Intercept):3 -2.2866 0.6566 -3.483 0.000497 ***
## (Intercept):4 -0.6642 0.3802 -1.747 0.080639 .
## Lakegeorge:1 -0.5753 0.7952 -0.723 0.469429
## Lakegeorge:2 1.7805 0.6232 2.857 0.004277 **
## Lakegeorge:3 -1.1295 1.1928 -0.947 0.343687
## Lakegeorge:4 -0.7666 0.5686 -1.348 0.177563
## Lakeoklawaha:1 -1.1256 1.1923 -0.944 0.345132
## Lakeoklawaha:2 2.6937 0.6693 4.025 5.70e-05 ***
## Lakeoklawaha:3 1.4008 0.8105 1.728 0.083926 .
## Lakeoklawaha:4 -0.7405 0.7421 -0.998 0.318372
## Laketrafford:1 0.6617 0.8461 0.782 0.434145
## Laketrafford:2 2.9363 0.6874 4.272 1.94e-05 ***
## Laketrafford:3 1.9316 0.8253 2.340 0.019263 *
## Laketrafford:4 0.7912 0.5879 1.346 0.178400
## Size>2.3:1 0.7302 0.6523 1.120 0.262918
## Size>2.3:2 -1.3363 0.4112 -3.250 0.001155 **
## Size>2.3:3 0.5570 0.6466 0.861 0.388977
## Size>2.3:4 -0.2906 0.4599 -0.632 0.527515
## Genderm:1 -0.6064 0.6888 -0.880 0.378666
## Genderm:2 -0.4630 0.3955 -1.171 0.241796
## Genderm:3 -0.6276 0.6853 -0.916 0.359785
## Genderm:4 -0.2526 0.4663 -0.542 0.588100
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]),
## log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
##
## Residual deviance: 50.2637 on 40 degrees of freedom
##
## Log-likelihood: -73.3221 on 40 degrees of freedom
##
## Number of Fisher scoring iterations: 5
##
## No Hauck-Donner effect found in any of the estimates
##
##
## Reference group is level 5 of the response
```

The **residual deviance** is 50.2637 on 40 degrees of freedom. Residual deviance tests the current model fit versus the saturated model. The saturated model, which fits a separate multinomial distribution to each of the 16 profiles (unique combinations of lake, sex and size), has 16 × 4 = 64 parameters. The current model has an intercept, three lake coefficients, one sex coefficient and one size coefficient for each of the four logit equations, for a total of 24 parameters. Therefore, the overall fit statistics have 64 − 24 = 40 degrees of freedom.

```
# E <- data.frame(fitted(gator_vglm) * rowSums(gator_dat[, 5:9]))
# O <- gator_dat %>% select(Bird, Invertebrate, Reptile, Other, Fish) + .000001
# (g2 <- 2 * sum(O * log(O / E)))
```

indicates the model fits okay, but not great. The Residual Deviance of 50.26 with 40 df from the table above output is reasonable, with p-value of 0.1282 and the statistics/df is close to 1 that is 1.256.

## 2.3 Ordinal Logistic Regression

Ordinal logistic regression, also call cumulative link model (CLM), is a generalized linear model (GZLM), an extension of the general linear model (GLM) to non-continuous outcome variables. There are many approaches to ordinal logistic regression, including cumulative, adjacent, and continuation categories, but the most popular is the **cumulative odds ordinal logistic regression with proportional odds**. These notes rely on UVA, PSU, Laerd, and the CLM package article vignette.

The model for ordinal response random variable \(Y_i\) with \(J\) levels is

\[\gamma_{ij} = F(\eta_{ij}), \hspace{5 mm} \eta_{ij} = \theta_j - x_i^\mathrm{T}\beta, \hspace{5 mm} i = 1, \ldots, n, \hspace{5 mm} j = 1, \ldots, J-1\]

where \(\gamma_{ij} = P(Y_i \le j) = \pi_{i1} + \cdots + \pi_{ij}\). \(\eta_{ij}\) is a linear predictor with \(J-1\) intercepts. \(F\) is the inverse link function. The regression models the logit link function of \(\gamma_{ij}\).

\[\mathrm{logit}(\gamma_{ij}) = \log \left[\frac{P(Y_i \le j)}{P(Y_i \gt j)} \right] = \theta_j - x_i^\mathrm{T}\beta\] The cumulative logit is the log-odds of the cumulative probabilities that the response is in category \(\le j\) versus \(\gt j\). \(\theta_j\) is the log-odds when \(x_i^\mathrm{T}=0\) and \(\beta\) is the increase in the log odds attributed to a one unit increase in \(x_i^\mathrm{T}\). Notice \(\beta\) is the same for all \(j\).

Once you fit the model, you will either generate predicted values or evaluate the coefficient estimators. The predicted value is a log-odds by default (useless), so you will at least take exponential to get the odds. From there you can solve for the probability,

\[P(Y_i \gt j) = \frac{\mathrm{exp}(\hat{y}_i)}{1 + \mathrm{exp}(\hat{y}_i)}.\]

The exponential of \(\beta\) is the odds ratio of \(x_1^\mathrm{T} - x_0^\mathrm{T}\). You can solve for the odds ratio

\[\mathrm{OR} = \frac{\mathrm{exp}(\theta_j - x_1^\mathrm{T}\beta)}{\mathrm{exp}(\theta_j - x_2^\mathrm{T}\beta)} = \mathrm{exp}(\beta(x_1^\mathrm{T} - x_0^\mathrm{T}))\]

If \(x\) is a binary factor factor, then \(exp(\beta)\) is the odds ratio of \(x=1\) vs \(x=0\). Thus the odds-ratio is *proportional* to the difference between values of \(x\) and \(\beta\) is the constant of proportionality.

The model is estimated with a regularized Newton-Raphson algorithm with step-halving (line search) using analytic expressions for the gradient and Hessian of the negative log-likelihood function. What this means is beyond me right now, but the upshot is that the estimation is an iterative maximization exercise, not a formulaic matrix algebra process. It is possible for the model estimation to fail to converge on a maximum.

You will sometimes encounter discussion about the *latent variable*. That is just the underlying quality you are trying to measure. If you rate something a 4 on a 5-level likert scale, 4 is the expression of your valuation, the latent variable. Your precise valuation is somewhere between 3 and 5 on a continuous scale. The link function defines the distribution of the latent variable.

There are variations on the ordinal model. Structured thresholds impose restrictions on \(\theta_j\), for example requiring equal distances between levels. Partial proportional odds allow \(\theta_j\) to vary with nominal predictors. You can also use link functions other than logit.

There are two conditions to ordinal logistic regression: (a) no multicollinearity, and (b) proportional odds.

### Case Study

192 participants in a study respond to the statement “Taxes are too high” on a 4-level likert scale (`tax_too_high`

, *Strongly Disagree*, *Disagree*, *Agree*, *Strongly Agree*). Participant attributes include business owner (`biz_owner`

, *Y*|*N*), age (`age`

), and political affiliation (`politics`

, *Liberal*, *Conservative*, *Labor*).

```
$gt <- tbl_summary(
tax$dat %>% select(-age),
taxby = politics,
statistic = list(all_continuous() ~ "{mean} ({sd})")
)$gt tax
```

Characteristic |
Lab, N = 62^{1} |
Lib, N = 54^{1} |
Con, N = 76^{1} |
---|---|---|---|

biz_owner | 36 (58%) | 26 (48%) | 47 (62%) |

income | 36 (8) | 35 (6) | 37 (7) |

tax_too_high | |||

Strongly Disagree | 10 (16%) | 10 (19%) | 4 (5.3%) |

Disagree | 13 (21%) | 16 (30%) | 9 (12%) |

Agree | 28 (45%) | 21 (39%) | 42 (55%) |

Strongly Agree | 11 (18%) | 7 (13%) | 21 (28%) |

^{1} n (%); Mean (SD) |

```
$dat %>%
taxmutate(income = case_when(income < quantile(income, .25) ~ "low income",
< quantile(income, .75) ~ "med income",
income TRUE ~ "high income"),
income = factor(income, levels = c("low income", "med income", "high income"))) %>%
count(tax_too_high, biz_owner, politics, income) %>%
ggplot(aes(x = tax_too_high, y = n, fill = biz_owner)) +
geom_col(position = position_dodge2(preserve = "single")) +
facet_grid(rows = vars(income), cols = vars(politics), space = "free") +
scale_x_discrete(labels = function (x) str_wrap(x, width = 10)) +
theme_bw() +
theme(legend.position = "bottom") +
labs(title = "Taxes too high?",
subtitle = "Reponse count by business owner, income level, and party.")
```

Fit a cumulative link model for the cumulative probability of the \(i\)th response falling in \(j\)th category or below where \(i\) indexes the (\(n = 192\)) responses, \(j = 1, \ldots, J\) indexes the (\(J = 4\)) response categories, and \(\theta_j\) is the threshold for the \(j\)th cumulative logit.

\[\mathrm{logit}(P(Y_i \le j)) = \theta_j - \beta_1(\mathrm{politics}_i) - \beta_2(\mathrm{biz\_owner}_i) - \beta_3(\mathrm{age}_i)\]

#### Fit the Model

```
$fmla <- formula(tax_too_high ~ biz_owner + age + politics)
tax$clm <- clm(tax$fmla, data = tax$dat)
taxsummary(tax$clm)
```

```
## formula: tax_too_high ~ biz_owner + age + politics
## data: tax$dat
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 192 -197.62 409.23 6(0) 3.14e-12 3.2e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## biz_ownerYes 0.66462 0.28894 2.300 0.021435 *
## age 0.24189 0.03260 7.421 1.17e-13 ***
## politicsLib 0.03695 0.36366 0.102 0.919072
## politicsCon 1.16142 0.34554 3.361 0.000776 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## Strongly Disagree|Disagree 7.026 1.166 6.024
## Disagree|Agree 8.766 1.231 7.119
## Agree|Strongly Agree 11.653 1.357 8.590
```

The summary table shows two fit statistics at the top: the log-likelihood and the AIC. The log-likelihood is the sum of the likelihoods for each observation that the predicted value correctly predicts the observed value. Its value ranges from \(-\infty\) to \(+\infty\). It’s value increases with observations, additional variables, and fit quality. I think you just use it to compare alternative model formulations. Akaike Information Criterion (AIC) is a measure of the relative quality of a statistical model. Again, I think the value is only useful as a comparison benchmark between alternative model fits. You’d want the model with the lowest AIC.

The Coefficients table is the familiar parameter estimates. E.g., for `biz_ownerYes`

, the coefficient estimate is 0.665 with standard error 0.289. The *z*-value is the ratio \(z = \hat{\beta} / se =\) 2.300 with *p*-value equal to \(2 \cdot P(Z>z) =\) 0.021. Some programs (e.g., SPSS) also show the Wald chi-squared statistic, the square of the *z* statistic, \(z^2 =\), 5.291. The square of a normal variable has a chi-square distribution, so the *p* value for the Wald chi-squared statistic is the `pchisq(z^2, df = 1)`

\(=\) 0.021.

The Threshold coefficients table are the intercepts, or cut-points. The first cut-point is log-odds of a response level *Strongly Disagree* (or less) vs greater than *Strongly Disagree* when all factor variables are at their reference level and the continuous vars are at 0.

There may be interaction effects between `biz_owner`

and `politics`

. Fit a saturate model, then compare their log likelihoods with a likelihood ratio test.

```
$fmla_sat <- formula(tax_too_high ~ biz_owner*politics + age)
tax$clm_sat <- clm(tax$fmla_sat, data = tax$dat)
tax$sat_anova <- anova(tax$clm, tax$clm_sat)
tax$sat_anova tax
```

```
## Likelihood ratio tests of cumulative link models:
##
## formula: link: threshold:
## tax$clm tax$fmla logit flexible
## tax$clm_sat tax$fmla_sat logit flexible
##
## no.par AIC logLik LR.stat df Pr(>Chisq)
## tax$clm 7 409.23 -197.62
## tax$clm_sat 9 411.75 -196.87 1.4805 2 0.477
```

The likelihood ratio test indicates the main-effects model fits about the same in comparison to the saturated model (\(\chi^2(\) 2 $) = $ *LR* = 1.48, *p* = 0.477)

#### Verify Assumptions

Cumulative odds ordinal logistic regression with proportional odds models require a) no multicollinearity, and b) proportional odds.

##### Multicollinearity

Multicollinearity occurs when two or more independent variables are *highly correlated* so that they do not provide unique or independent information in the regression model. Multicollinearity inflates the variances of the estimated coefficients, resulting in larger confidence intervals. The usual interpretation of a slope coefficient as the change in the mean response per unit increase in the predictor when all the other predictors are held constant breaks down because changing one predictor *necessarily* changes other predictors.

Test for multicollinearity with variance inflation factors (VIF). The VIF is the inflation percentage of the parameter variance due to multicollinearity. E.g., a VIF of 1.9 means the parameter variance is 90% larger than what it would be if it was not correlated with other predictors.

Predictor \(K\)’s variance, \(Var(\hat{\beta_k})\), is inflated by a factor of

\[VIF_k = \frac{1}{1 - R_k^2}\]

due to collinearity with other predictors, where \(R_k^2\) is the \(R^2\) of a regression of the \(k^{th}\) predictor on the remaining predictors. If there is zero relationship between predictor \(k\) and the other variables, \(R_k^2 = 0\) and \(VIF = 1\) (no variance inflation). If 100% of the variance in predictor \(k\) is explained by the other predictors, then \(R_k^2 = 1\) and \(VIF = \infty\). A good rule of thumb is that \(VIF \le 5\) is acceptable.

```
# Cannot use CLM model with vif(). Re-express as a linear model.
$vif <- lm(as.numeric(tax_too_high) ~ politics + biz_owner + age, dat = tax$dat) %>%
tax::vif()
car$vif tax
```

```
## GVIF Df GVIF^(1/(2*Df))
## politics 1.035831 2 1.008840
## biz_owner 1.023642 1 1.011752
## age 1.036491 1 1.018082
```

The VIFs in column GVIF are all below 5, so this model is not compromised by multicollinearity.

##### Proportional Odds

The assumption of proportional odds means the independent variable effects are constant across each cumulative split of the ordinal dependent variable. Test for proportional odds using a full likelihood ratio test comparing the proportional odds model with a multinomial logit model, also called an unconstrained baseline logit model. This is also called the *test of parallel lines*. The multinomial logit model fits a slope to each of the \(J – 1\) levels. The proportional odds model is nested within the multinomial model, so you can use a likelihood ratio test to see if the models are statistically different. Fit the proportional odds model and a multinomial model using `VGAM::vglm()`

and capture the log likelihoods and degrees of freedom. Perform a likelihood ratio test on the differences in log likelihoods, \(D = -2 \mathrm{loglik}(\beta)\).

```
$vglm_ordinal <- vglm(tax$fmla, propodds, data = tax$dat)
tax$vglm_multinomial <- vglm(tax$fmla, cumulative, data = tax$dat)
tax$po_lrt <- lrtest(tax$vglm_multinomial, tax$vglm_ordinal)
tax$po_lrt tax
```

```
## Likelihood ratio test
##
## Model 1: tax_too_high ~ biz_owner + age + politics
## Model 2: tax_too_high ~ biz_owner + age + politics
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 561 -193.31
## 2 569 -197.62 8 8.6197 0.3754
```

The assumption of proportional odds was met, as assessed by a full likelihood ratio test comparing the fit of the proportional odds model to a model with varying location parameters, \(\chi^2\)(8) = 8.620,

p= 0.375.

Another option is the partial proportional odds test. This test locates specific variables causing the rejection of proportional odds.

```
$po_lrt2 <- clm(tax$fmla, data = tax$dat) %>%
taxnominal_test()
$po_lrt2 tax
```

```
## Tests of nominal effects
##
## formula: tax_too_high ~ biz_owner + age + politics
## Df logLik AIC LRT Pr(>Chi)
## <none> -197.62 409.23
## biz_owner 2 -197.34 412.67 0.55974 0.7559
## age
## politics 4 -196.20 414.40 2.83415 0.5860
```

The assumption of proportional odds was met, as assessed by a full likelihood ratio test comparing the fit of the proportional odds model to a model with varying location parameters for business owner, \(\chi^2\)(2) = 0.560,

p= 0.756 and politics, \(\chi^2\)(4) = 2.834,p= 0.586.

#### Assess the Model Fit

There are three ways to assess overall model fit: The Deviance and Pearson goodness-of-fit tests of the overall model fit; the Cox and Snell, Nagelkerke, and McFadden pseudo R measures of explained variance; and the likelihood ratio test comparing the model fit to the intercept-only model.

###### Deviance and Pearson

However, these tests rely on large frequencies in each *cell*, that is, each possible combination of predictor values. Overall goodness-of-fit statistics should be treated with suspicion when a continuous independent variable is present and/or there are a large number of cells with zero frequency.The Pearson goodness-of-fit statistic is \(X^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\) where \(i\) is the observation number and \(j\) is the response variable level. It is a summary of the Pearson residuals, the difference between the observed and expected cell counts, \(O_{ij} - E_{ij}\). The deviance goodness-of-fit statistic is the difference in fit between the model and a full model; a full model being a model that fits the data perfectly, \(G^2 = 2 \sum_{ij} O_{ij} \log \left( \frac{O_{ij}}{E_{ij}} \right)\). Neither of these tests are reliable if there are many cells with zero frequencies and/or small expected frequencies and are generally not recommended. Generally, the chi-squared test requires a frequency count of at least 5 per cell.

```
# Observed combinations of model vars
$cell_patterns <- tax$dat %>% count(biz_owner, age, politics, tax_too_high) %>% nrow()
tax
# Observed combinations of predictor vars * levels of response var
$covariate_patterns <- tax$dat %>% count(biz_owner, age, politics) %>% nrow()
tax$possible_cells <- tax$covariate_patterns * length(levels(tax$dat$tax_too_high))
tax
# 1 - ratio of observed to possible
$pct_freq_zero <- 1 - tax$cell_patterns / tax$possible_cells tax
```

There are 137 observed combinations of model variables (predictors), and 372 possible combinations (predictors * outcome levels), so 63.2% of cells have zero frequencies. Ideally, zero frequencies should be less than 20%, so if you were to use the deviance or Pearson tests, you would need to report this. The results below are contradictory and bogus. I think you’d only use this test if you didn’t have continuous predictor variables.

```
<- tax$dat %>%
observed count(biz_owner, age, politics, tax_too_high) %>%
pivot_wider(names_from = tax_too_high, values_from = n) %>%
replace_na(list(`Strongly Disagree` = 0, Disagree = 0, Agree = 0, `Strongly Agree` = 0)) %>%
pivot_longer(cols = `Strongly Disagree`:`Strongly Agree`, names_to = "outcome", values_to = "observed")
<- bind_cols(
expected $dat,
taxpredict(tax$clm, subset(tax$dat, select = -tax_too_high))$fit %>% data.frame()
%>%
) rename(c("Strongly Disagree" = "Strongly.Disagree", "Strongly Agree" = "Strongly.Agree")) %>%
group_by(biz_owner, age, politics) %>%
summarize(.groups = "drop", across(`Strongly Disagree`:`Strongly Agree`, sum)) %>%
pivot_longer(cols = `Strongly Disagree`:`Strongly Agree`, names_to = "outcome", values_to = "expected")
<- observed %>%
obs_exp inner_join(expected, by = c("politics", "biz_owner", "age", "outcome")) %>%
mutate(epsilon_sq = (observed - expected)^2,
chi_sq = epsilon_sq / expected,
g_sq = 2 * observed * log((observed+.0001) / expected)
)
$chisq <- list()
tax$chisq$X2 = sum(obs_exp$chi_sq)
tax$chisq$G2 = sum(obs_exp$g_sq)
tax$chisq$df = tax$covariate_patterns * (length(levels(tax$dat$tax_too_high)) - 1) - 7
tax$chisq$X2_p.value = pchisq(tax$chisq$X2, df = tax$chisq$df, lower.tail = FALSE)
tax$chisq$G2_p.value = pchisq(tax$chisq$G2, df = tax$chisq$df, lower.tail = FALSE) tax
```

The Pearson goodness-of-fit test indicated that the model was

nota good fit to the observed data, \(\chi^2\)(272) = 745.4,p< .001$. The deviance goodness-of-fit test indicated that the model was a good fit to the observed data, \(G^2\)(272) = 232.6,p= 0.960.

###### Pseudo-R2 Measures

There are a number of measures in ordinal regression that attempt to provide a similar “variance explained” measure as that provided in ordinary least-squares linear regression. However, these measures do not have the direct interpretation that they do in ordinary linear regression and are often, therefore, referred to as “pseudo” R2 measures. The three most common measures (Cox and Snell, Nagelkerke, and McFadden) are not particularly good and not universally used. It is presented in the SPSS output, so you might encounter it in published work.

```
$nagelkerke <- rcompanion::nagelkerke(tax$clm)
tax$nagelkerke$Pseudo.R.squared.for.model.vs.null tax
```

```
## Pseudo.R.squared
## McFadden 0.181957
## Cox and Snell (ML) 0.367369
## Nagelkerke (Cragg and Uhler) 0.399641
```

###### Likelihood Ratio Test

The best way to assess model fit is the likelihood ratio test comparing the model to an intercept-only model. The difference in the -2 log likelihood between the models has a \(\chi^2\) distribution with degrees of freedom equal to the difference in the number of parameters.

```
<- clm(tax_too_high ~ 1, data = tax$dat)
intercept_only $lrt <- anova(tax$clm, intercept_only)
tax$lrt tax
```

```
## Likelihood ratio tests of cumulative link models:
##
## formula: link: threshold:
## intercept_only tax_too_high ~ 1 logit flexible
## tax$clm tax$fmla logit flexible
##
## no.par AIC logLik LR.stat df Pr(>Chisq)
## intercept_only 3 489.14 -241.57
## tax$clm 7 409.23 -197.62 87.911 4 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The table shows the log likelihoods of the two models. LR.stat is the difference between 2 * the logLik values.

The final model statistically significantly predicted the dependent variable over and above the intercept-only model, \(\chi^2(4)\) = 87.9,

p= 0.000.

#### Interpret Results

Return to the model summary.

`tidy(tax$clm)`

```
## # A tibble: 7 × 6
## term estimate std.error statistic p.value coef.type
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Strongly Disagree|Disagree 7.03 1.17 6.02 1.70e- 9 intercept
## 2 Disagree|Agree 8.77 1.23 7.12 1.08e-12 intercept
## 3 Agree|Strongly Agree 11.7 1.36 8.59 8.72e-18 intercept
## 4 biz_ownerYes 0.665 0.289 2.30 2.14e- 2 location
## 5 age 0.242 0.0326 7.42 1.17e-13 location
## 6 politicsLib 0.0369 0.364 0.102 9.19e- 1 location
## 7 politicsCon 1.16 0.346 3.36 7.76e- 4 location
```

The coefficients for `biz_owner`

, `age`

, and politics are positive. Positive parameters *increase* the likelihood of stronger agreement with the statement. In this case, discontent with taxes are higher for business owners, increase with age, and are higher for Liberal Democrats and Conservatives relative to the Labor Party. The expected cumulative log-odds of declaring \(\le j\) level of agreement with the statement for the baseline group (`biz_ownerNo`

, `age = 0`

, `politicsLib`

) is 7.026 for \(j = 1\) (Strongly Disagree), 8.766 for \(j = 2\) (Disagree), and 11.653 for \(j = 3\) (Agree).

You could solve the logit equation for

\[\pi_j = \frac{\mathrm{exp}(Y_i)} {1 + \mathrm{exp}(Y_i)}\]
to get the cumulative probabilities for each level. That’s what `predict(type = "cum.prob")`

does. But it might be more intuitive to work with individual probabilities, the lagged differences to get the individual probabilities for each \(j\). That’s what `predict(type = "prob")`

does. I like to play with predicted values to get a sense of the outcome distributions. In this case, I’ll take the median `age`

, and each combination of `biz_owner`

and `politics`

.

```
<- tax$dat %>%
new_data mutate(age = median(tax$dat$age)) %>%
expand(age, politics, biz_owner)
<- predict(tax$clm, newdata = new_data, type = "prob")[["fit"]] %>% as.data.frame()
preds
bind_cols(new_data, preds) %>%
pivot_longer(cols = `Strongly Disagree`:`Strongly Agree`) %>%
mutate(name = factor(name, levels = levels(tax$dat$tax_too_high))) %>%
ggplot(aes(y = politics, x = value, fill = fct_rev(name))) +
geom_col() +
geom_text(aes(label = scales::percent(value, accuracy = 1)),
size = 3, position = position_stack(vjust=0.5)) +
facet_grid(~paste("Bus Owner = ", biz_owner)) +
scale_fill_grey(start = 0.5, end = 0.8) +
theme_bw() +
theme(legend.position = "top",
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
guides(fill = guide_legend(reverse = TRUE)) +
labs(title = "Taxes too High for Conservative Business Owners?",
x = NULL, fill = NULL)
```

You will want to establish whether `politics`

is statistically significant overall before exploring any specific contrasts. The ANOVA procedure with type I test reports an overall test of significance for each variable entered into the model.

`$anovaI <- anova(tax$clm, type = "I")) (tax`

```
## Type I Analysis of Deviance Table with Wald chi-square tests
##
## Df Chisq Pr(>Chisq)
## biz_owner 1 13.201 0.0002798 ***
## age 1 57.413 3.533e-14 ***
## politics 2 14.636 0.0006635 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The political party last voted for has a statistically significant effect on the prediction of whether tax is thought to be too high, Wald \(\chi^2\)(2) = 14.6,

p= 0.001.

The best way to work with the data is with the `tidy(exponentiate = TRUE)`

version.

```
$tidy <- tax$clm %>% tidy(conf.int = TRUE, exponentiate = TRUE)
tax$tidy tax
```

```
## # A tibble: 7 × 8
## term estim…¹ std.e…² stati…³ p.value conf.…⁴ conf.…⁵ coef.…⁶
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Strongly Disagree|Di… 1.13e3 1.17 6.02 1.70e- 9 NA NA interc…
## 2 Disagree|Agree 6.41e3 1.23 7.12 1.08e-12 NA NA interc…
## 3 Agree|Strongly Agree 1.15e5 1.36 8.59 8.72e-18 NA NA interc…
## 4 biz_ownerYes 1.94e0 0.289 2.30 2.14e- 2 1.11 3.44 locati…
## 5 age 1.27e0 0.0326 7.42 1.17e-13 1.20 1.36 locati…
## 6 politicsLib 1.04e0 0.364 0.102 9.19e- 1 0.508 2.12 locati…
## 7 politicsCon 3.19e0 0.346 3.36 7.76e- 4 1.63 6.35 locati…
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic,
## # ⁴conf.low, ⁵conf.high, ⁶coef.type
```

Then you can summarize the table in words.

The odds of business owners considering tax to be too high was 1.944 (95% CI, 1.107 to 3.443) times that of non-business owners, a statistically significant effect,

z= 2.300,p= 0.021.

The odds of Conservative voters considering tax to be too high was 3.194 (95% CI, 1.635 to 6.351) times that of Labour voters, a statistically significant effect,

z= 3.361,p= 0.001. The odds of Liberal Democrat voters considering tax to be too high was similar to that of Labour voters (odds ratio of 1.038 (95% CI, 0.508 to 2.121),p= 0.919.

An increase in age (expressed in years) was associated with an increase in the odds of considering tax too high, with an odds ratio of 1.274 (95% CI, 1.197 to 1.360),

z= 7.421,p= 0.000.

#### Reporting

Here is the complete write-up.

A cumulative odds ordinal logistic regression with proportional odds was run to determine the effect of business ownership, political party voted for, and age, on the belief that taxes are too high. There were proportional odds, as assessed by a full likelihood ratio test comparing the fitted model to a model with varying location parameters, \(\chi^2\)(8) = 8.620,

p= 0.375. The final model statistically significantly predicted the dependent variable over and above the intercept-only model, \(\chi^2(4)\) = 87.9,p= 0.000. The odds of business owners considering tax to be too high was 1.944 (95% CI, 1.107 to 3.443) times that of non-business owners, a statistically significant effect,z= 2.300,p= 0.021. The political party last voted for has a statistically significant effect on the prediction of whether tax is thought to be too high, Wald \(\chi^2\)(2) = 14.6,p= 0.001. The odds of Conservative voters considering tax to be too high was 3.194 (95% CI, 1.635 to 6.351) times that of Labour voters, a statistically significant effect,z= 3.361,p= 0.001. The odds of Liberal Democrat voters considering tax to be too high was similar to that of Labour voters (odds ratio of 1.038 (95% CI, 0.508 to 2.121),p= 0.919. An increase in age (expressed in years) was associated with an increase in the odds of considering tax too high, with an odds ratio of 1.274 (95% CI, 1.197 to 1.360),z= 7.421,p= 0.000.

Package **gtsummary** shows a nice summary table.

`tbl_regression(tax$clm)`

Characteristic |
log(OR)^{1} |
95% CI^{1} |
p-value |
---|---|---|---|

biz_owner | |||

No | — | — | |

Yes | 0.66 | 0.10, 1.2 | 0.021 |

age | 0.24 | 0.18, 0.31 | <0.001 |

politics | |||

Lab | — | — | |

Lib | 0.04 | -0.68, 0.75 | >0.9 |

Con | 1.2 | 0.49, 1.8 | <0.001 |

^{1} OR = Odds Ratio, CI = Confidence Interval |

## 2.4 Poisson Regression

Poisson models count data, like “traffic tickets per day”, or “website hits per day”. The response is an expected *rate* or intensity. For count data, specify the generalized model, this time with `family = poisson`

or `family = quasipoisson`

.

Recall that the probability of achieving a count \(y\) when the expected rate is \(\lambda\) is distributed

\[P(Y = y|\lambda) = \frac{e^{-\lambda} \lambda^y}{y!}.\]

The poisson regression model is

\[\lambda = \exp(X \beta).\]

You can solve this for \(y\) to get

\[y = X\beta = \ln(\lambda).\]

That is, the model predicts the log of the response rate. For a sample of size *n*, the likelihood function is

\[L(\beta; y, X) = \prod_{i=1}^n \frac{e^{-\exp({X_i\beta})}\exp({X_i\beta})^{y_i}}{y_i!}.\]

The log-likelihood is

\[l(\beta) = \sum_{i=1}^n (y_i X_i \beta - \sum_{i=1}^n\exp(X_i\beta) - \sum_{i=1}^n\log(y_i!).\]

Maximizing the log-likelihood has no closed-form solution, so the coefficient estimates are found through interatively reweighted least squares.

Poisson processes assume the variance of the response variable equals its mean. “Equals” means the mean and variance are of a similar order of magnitude. If that assumption does not hold, use the quasi-poisson. Use Poisson regression for large datasets. If the predicted counts are much greater than zero (>30), the linear regression will work fine. Whereas RMSE is not useful for logistic models, it is a good metric in Poisson.

Dataset `fire`

contains response variable `injuries`

counting the number of injuries during the month and one explanatory variable, the month `mo`

.

`<- read_csv(file = "C:/Users/mpfol/OneDrive/Documents/Data Science/Data/CivilInjury_0.csv") fire `

```
## Rows: 300 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): ID, Total Injuries
## dttm (1): Injury Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
<- fire %>%
fire mutate(mo = as.POSIXlt(`Injury Date`)$mon + 1) %>%
rename(dt = `Injury Date`,
injuries = `Total Injuries`)
str(fire)
```

```
## tibble [300 × 4] (S3: tbl_df/tbl/data.frame)
## $ ID : num [1:300] 1 2 3 4 5 6 7 8 9 10 ...
## $ dt : POSIXct[1:300], format: "2005-01-10" "2005-01-11" ...
## $ injuries: num [1:300] 1 1 1 5 2 1 1 1 1 1 ...
## $ mo : num [1:300] 1 1 1 1 1 1 2 2 2 4 ...
```

In a situation like this where there the relationship is bivariate, start with a visualization.

```
ggplot(fire, aes(x = mo, y = injuries)) +
geom_jitter() +
geom_smooth(method = "glm", method.args = list(family = "poisson")) +
labs(title = "Injuries by Month")
```

`## `geom_smooth()` using formula 'y ~ x'`

Fit a poisson regression in R using `glm(formula, data, family = poisson)`

. But first, check whether the mean and variance of `injuries`

are the same magnitude? If not, then use `family = quasipoisson`

.

`mean(fire$injuries)`

`## [1] 1.36`

`var(fire$injuries)`

`## [1] 1.020468`

They are of the same magnitude, so fit the regression with `family = poisson`

.

```
<- glm(injuries ~ mo, family = poisson, data = fire)
m2 summary(m2)
```

```
##
## Call:
## glm(formula = injuries ~ mo, family = poisson, data = fire)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3987 -0.3473 -0.3034 -0.2502 4.3185
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22805 0.10482 2.176 0.0296 *
## mo 0.01215 0.01397 0.870 0.3844
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 139.87 on 299 degrees of freedom
## Residual deviance: 139.11 on 298 degrees of freedom
## AIC: 792.08
##
## Number of Fisher Scoring iterations: 5
```

The *predicted* value \(\hat{y}\) is the estimated **log** of the response variable,

\[\hat{y} = X \hat{\beta} = \ln (\lambda).\]

Suppose `mo`

is January (mo = `), then the log of`

injuries` is \(\hat{y} = 0.323787\). Or, more intuitively, the expected count of injuries is \(\exp(0.323787) = 1.38\)

`predict(m2, newdata = data.frame(mo=1))`

```
## 1
## 0.2401999
```

`predict(m2, newdata = data.frame(mo=1), type = "response")`

```
## 1
## 1.271503
```

Here is a plot of the predicted counts in red.

```
augment(m2, type.predict = "response") %>%
ggplot(aes(x = mo, y = injuries)) +
geom_point() +
geom_point(aes(y = .fitted), color = "red") +
scale_y_continuous(limits = c(0, NA)) +
labs(x = "Month",
y = "Injuries",
title = "Poisson Fitted Line Plot")
```

Evaluate a logistic model fit with an analysis of deviance.

`<- glance(m2)) (perf `

```
## # A tibble: 1 × 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 140. 299 -394. 792. 799. 139. 298 300
```

`<- 1 - perf$deviance / perf$null.deviance) (pseudoR2 `

`## [1] 0.005413723`

The deviance of the null model (no regressors) is 139.9. The deviance of the full model is 132.2. The psuedo-R2 is very low at .05. How about the RMSE?

`RMSE(pred = predict(m2, type = "response"), obs = fire$injuries)`

`## [1] 1.006791`

The average prediction error is about 0.99. That’s almost as much as the variance of `injuries`

- i.e., just predicting the mean of `injuries`

would be almost as good! Use the `GainCurvePlot()`

function to plot the gain curve.

```
augment(m2, type.predict = "response") %>%
ggplot(aes(x = injuries, y = .fitted)) +
geom_point() +
geom_smooth(method ="lm") +
labs(x = "Actual",
y = "Predicted",
title = "Poisson Fitted vs Actual")
```

`## `geom_smooth()` using formula 'y ~ x'`

```
augment(m2) %>% data.frame() %>%
GainCurvePlot(xvar = ".fitted", truthVar = "injuries", title = "Poisson Model")
```

It seems that `mo`

was a poor predictor of `injuries`

.

### References

*Categorical Data Analysis*. 3rd ed. Wiley. https://www.amazon.com/Categorical-Analysis-Wiley-Probability-Statistics-ebook-dp-B00CAYUFM2/dp/B00CAYUFM2/ref=mt_kindle?_encoding=UTF8&me=&qid=.

*Interpretable Machine Learning*. https://christophm.github.io/interpretable-ml-book/.

These notes are primarily from PSU’s Analysis of Discrete Data which uses Alan Agresti’s

**Categorical Data Analysis**(Agresti 2013). I also reviewed PSU’s Regression Methods, DataCamp’s Generalized Linear Models in R, DataCamp’s Multiple and Logistic Regression, and**Interpretable machine learning**(Molnar 2020).↩︎The related probit regression link function is \(f(E(Y|X)) = \Phi^{-1}(E(Y|X)) = \Phi^{-1}(\pi)\). The difference between the logistic and probit link function is theoretical, and the practical significance is slight. You can safely ignore probit.↩︎

Notes from Machine Learning Mastery↩︎

These notes rely on the course notes from PSU STAT 504, Analysis of Discrete Data, UCLA, and Sara Worth.↩︎