Chapter 2 Generalized Linear Models (GLM)
These notes are primarily from PSU’s Analysis of Discrete Data which uses Alan Agresti’s Categorical Data Analysis (Agresti 2013). I also reviewed PSU’s Regression Methods, DataCamp’s Generalized Linear Models in R, DataCamp’s Multiple and Logistic Regression, and Interpretable machine learning (Molnar 2020).
The linear regression model, \(E(yX) = X \beta\), structured as \(y_i = X_i \beta + \epsilon_i\) where \(X_i \beta = \mu_i\), assumes the response is a linear function of the predictors and the residuals are independent random variables normally distributed with zero mean and constant variance, \(\epsilon \sim N \left(0, \sigma^2 \right)\). This implies that given a set of predictors, the response is normally distributed about its expected value, \(y_i \sim N \left(\mu_i, \sigma^2 \right)\). However, there are many situations where this normality assumption does not hold. Generalized linear models (GLMs) are a generalization of the linear regression model that addresses nonnormal response distributions.
The response will not have a normal distribution if the underlying datagenerating process is binomial or multinomial (proportions), Poisson (counts), or exponential (timetoevent). In these situations the linear regression model can predict proportions outside [0, 1], or negative counts and times. GLMs solve this problem by modeling a function of the expected value of \(y\), \(f(E(yX)) = X \beta\). There are three components to a GLM: the random component is the probability distribution of the response variable (normal, binomial, Poisson, etc.); the systematic component is the explanatory variables \(X\beta\); and the link function \(\eta\) specifies the link between random and systematic components, converting the response range to \([\infty, +\infty]\).
Linear regression is a special case of GLM where link function is the identity function, \(f(E(yX)) = E(yX)\). For logistic regression, where the data generating process is binomial, the link function is
\[f(E(yX)) = \ln \left( \frac{E(yX)}{1  E(yX)} \right) = \ln \left( \frac{\pi}{1  \pi} \right) = \mathrm{logit}(\pi)\]
where \(\pi\) is the event probability.^{1}
For Poisson regression, the link function is
\[f(E(yX)) = \ln (E(yX)) = \ln(\lambda)\]
where \(\lambda\) is the expected event rate.
For exponential regression, the link function is
\[f(E(yX) = E(yX) = \lambda\]
where \(\lambda\) is the expected time to event.
GLM uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on largesample approximations.
In R, specify a GLM just like an linear model, but with the glm()
function, specifying the distribution with the family
parameter.
family = "gaussian"
: linear regressionfamily = "binomial"
: logistic regressionfamily = binomial(link = "probit")
: probit regressionfamily = "poisson"
: Poisson regression
2.1 Logistic Regression
Logistic regression estimates the probability of a particular level of a categorical response variable given a set of predictors. The response levels can be binary, nominal, or ordinal.
The binary logistic regression model is
\[y = \mathrm{logit}(\pi) = \ln \left( \frac{\pi}{1  \pi} \right) = X \beta\]
where \(\pi\) is the response level’s probability. The model predicts the log odds of the response level. Why do this? The range of outcomes need to be between 0 and 1. A sigmoid function, \(f(y) = 1 / \left(1 + e^y \right)\), does that. So if the log odds of the response level equals \(X\beta\), then the odds of the response level equals \(e^{X\beta}\). You can solve the odds formula for \(\pi\) to get \(\pi = \mathrm{odds} / (\mathrm{odds} + 1)\). Substituting, \(\pi = e^{X\beta} / (e^{X\beta} + 1)\) which you can simplify to \(\pi = 1 / (1 + e^{X\beta})\), a sigmoid function. The upshot is \(X\beta\) is the functional relationship between the independent variables and a function of the response, not the response itself.
The model parameters are estimated either by iteratively reweighted least squares optimization or by maximum likelihood estimation (MLE). Here is how MLE works^{2}. MLE maximizes the probability produced by a set of parameters \(\beta\) given a data set and probability distribution. In logistic regression the probability distribution is the binomial and each observation is the outcome of a single Bernoulli trial.
\[L(\beta; y, X) = \prod_{i=1}^n \pi_i^{y_i}(1  \pi_i)^{(1y_i)} = \prod_{i=1}^n\frac{\exp(y_i X_i \beta)}{1 + \exp(X_i \beta)}.\]
In practice, multiplying many small probabilities can be unstable, so MLE optimizes the log likelihood instead.
\[\begin{align} l(\beta; y, X) &= \sum_{i = 1}^n \left(y_i \log(\pi_i) + (1  y_i) \log(1  \pi_i) \right) \\ &= \sum_{i = 1}^n \left(y_i X_i \beta  \log(1 + e^{X_i\beta}) \right) \end{align}\]
Sometimes you will see the cost function optimized. This is the negative of of the log likelihood function.
2.1.1 Case Study
Dataset donner
contains n = 45 observations of the survival status (surv
) of the members of the Donner party with explanatory variables age
and sex
.
glimpse(donner)
## Rows: 45
## Columns: 3
## $ age <dbl> 23, 40, 40, 30, 28, 40, 45, 62, 65, 45, 25, 28, 28, 23, 22, 23, 2~
## $ sex <fct> M, F, M, M, M, M, F, M, M, F, F, M, M, M, F, F, M, F, F, M, F, M,~
## $ surv <fct> Died, Lived, Lived, Died, Died, Died, Died, Died, Died, Died, Die~
2.1.2 Fit the Model
Fit a logistic regression \(\mathrm{surv} = \mathrm{sex} + \mathrm{age} + \mathrm{sex:age}\). The reference case based on the factor definition is sex
= “F,” and age
= 28 (the median age).
< donner %>%
donner_centered mutate(age = age  median(donner$age))
< glm(surv ~ sex*age, data = donner_centered, family = "binomial")
m summary(m)
##
## Call:
## glm(formula = surv ~ sex * age, family = "binomial", data = donner_centered)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## 2.2279 0.9388 0.5550 0.7794 1.6998
##
## Coefficients:
## Estimate Std. Error z value Pr(>z)
## (Intercept) 1.81231 1.04269 1.738 0.0822 .
## sexM 2.40334 1.11677 2.152 0.0314 *
## age 0.19407 0.08742 2.220 0.0264 *
## sexM:age 0.16160 0.09426 1.714 0.0865 .
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 61.827 on 44 degrees of freedom
## Residual deviance: 47.346 on 41 degrees of freedom
## AIC: 55.346
##
## Number of Fisher Scoring iterations: 5
The “z value” is the Wald \(z\) statistic, \(z = \hat{\beta} / SE(\hat{\beta})\). Its square, \(z^2\), is the Wald chisquared statistic. The pvalue is the area to the right of \(z^2\) in the \(\chi_1^2\) density curve:
%>% tidy() %>%
m mutate(
z = estimate / std.error,
p_z2 = pchisq(z^2, df = 1, lower.tail = FALSE)
%>%
) select(term, z, p_z2)
## # A tibble: 4 x 3
## term z p_z2
## <chr> <dbl> <dbl>
## 1 (Intercept) 1.74 0.0822
## 2 sexM 2.15 0.0314
## 3 age 2.22 0.0264
## 4 sexM:age 1.71 0.0865
Below the Coefficients table, the “dispersion parameter” refers to overdispersion, a common issue with GLM. For a logistic regression, the response variable should be distributed \(y_i \sim \mathrm{Bin}(n_i, \pi_i)\) with \(\mu_i = n_i \pi_i\) and \(\sigma^2 = \pi (1  \pi)\). Overdispersion means the data shows evidence of variance greater than \(\sigma^2\).
“Fisher scoring” is a method for ML estimation. Logistic regression uses interatively reweighted least squares to fit the model, so this section indicates whether the algorithm converged.
The null deviance is the likelihood ratio \(G^2\) = 61.827 of the interceptonly model. The residual deviance is the likelihood ratio \(G^2\) = 47.346 after including the model covariates. \(G^2\) is large, so reject the null hypothesis of no age or sex effects. An ANOVA table shows the change in deviance from successively adding each variable to the model.
anova(m)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: surv
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 44 61.827
## sex 1 4.5403 43 57.286
## age 1 6.0300 42 51.256
## sex:age 1 3.9099 41 47.346
2.1.3 Interpret Results
Plug in values to interpret the model. The log odds of a 24 yearold female surviving is \(\hat{y}\) = 2.59. The log odds of a 24 yearold male surviving is \(\hat{y}\) = 0.46.
< expand.grid(sex = factor(c("F", "M")), age = 24  median(donner$age))
new_dat < predict(m, newdata = new_dat))
(preds ## 1 2
## 2.5886056 0.4611192
Exponentiate to get the odds.
\[\mathrm{odds}(\hat{y}) = \exp (\hat{y}) = \frac{\pi}{1  \pi}.\]
exp(preds)
## 1 2
## 13.3111976 0.6305775
Solve for \(\pi\) to get the probability.
\[\pi = \frac{\exp (\hat{y})}{1 + \exp (\hat{y})}\]
You can do the math, or use predict(type = "response")
.
< predict(m, newdata = new_dat, type = "response"))
(preds ## 1 2
## 0.9301246 0.3867204
Interpret the coefficient estimates as the change in the log odds of \(y\) due to a \(\delta = x_1  x_0\) unit change in \(x\).
\[\log \left(\pi / (1  \pi) _{X = X_1} \right)  \log \left(\pi / (1  \pi) _{X = X_0} \right) = \log \left( \frac{\pi / (1  \pi) _{X = X_1}}{\pi / (1  \pi) _{X = X_0}} \right) = \delta \hat{\beta}\]
Better yet, interpret the exponential of the coefficient estimates as the change in the odds of \(y\) due to a \(\delta\) unit change in \(x\).
\[\mathrm{odds}(y) = e^{\delta \hat{\beta}}\]
The reference case is sex
= “F” and age
= 0 (centered on median age, 28).
 The survival odds of a 28 year old female are \(\hat{y} = e^{(Intercept)}\) = 6.1 to 1. The expected probability of surviving is 6.1 / (1 + 6.1) = 86%.
 The survival odds of a 24 year old female are \(\hat{y} = e^{(Intercept) 4 \cdot \mathrm{age}}\) = 13.3 to 1. The expected probability of surviving is 13.3 / (1 + 13.3) = 93%.
 The survival odds of a 28 year old male are \(\hat{y} = e^{(Intercept) + sexM}\) = 0.6. The expected probability of surviving is 0.6 / (1 + 0.6) = 36%.
 The survival odds of a 24 year old male are \(\hat{y} = e^{(Intercept) 4 \cdot \mathrm{age} 4 \cdot \mathrm{sexM} \cdot \mathrm{sexM:age}}\) = 0.6. The expected probability of surviving is 0.6 / (1 + 0.6) = 39%.
 A female’s survival odds are multiplied by a factor of \(e^{(1 \cdot \mathrm{age})}\) = 0.824 per additional year of age (i.e. the odds fall by 17.6%).
 A male’s survival odds are multiplied by a factor of \(e^{1 \cdot \mathrm{age}} e^{1 \cdot \mathrm{sexM:age}}\) = 0.968 per additional year of age (i.e. the odds fall by 3.2%).
 A 28 year old female’s survival odds are multiplied by a factor of \(e^{1 \cdot \mathrm{sex}}\) = 0.090 if she were instead a male. (i.e. the odds fall by 91.0%)
oddsratio::or_glm()
calculates the odds ratio from an increment in the predictor values.
::or_glm(donner_centered, m, incr = list(age = 1)) oddsratio
## predictor oddsratio ci_low (2.5) ci_high (97.5) increment
## 1 sexM 0.090 0.005 0.561 Indicator variable
## 2 age 0.824 0.652 0.945 1
## 3 sexM:age 1.175 1.001 1.496 Indicator variable
Expressing the predicted values as probabilities produces the familiar signmoidal shape of the binary relationship.
augment(m, type.predict = "response") %>%
ggplot(aes(x = age, color = sex)) +
geom_point(aes(y = as.integer(surv))) +
geom_line(aes(y = .fitted+1)) +
labs(x = "Age (relative to median)",
y = NULL,
title = "Binary Fitted Line Plot") +
scale_y_continuous(breaks = c(1,2), labels = c("Died", "Lived")) +
theme_light() +
theme(legend.position = "right")
2.1.4 Assess Model Fit
Evaluate a logistic regression using a Gain curve or ROC curve.
Gain Curve
In the gain curve, the xaxis is the fraction of items seen when sorted by the predicted value, and the yaxis is the cumulatively summed true outcome. The “wizard” curve is the gain curve when the data is sorted by the true outcome. If the model’s gain curve is close to the wizard curve, then the model predicts the response well. The gray area is the “gain” over a random prediction.
20 of the 45 members of the Donner party survived.
 The gain curve encountered 10 survivors (50%) within the first 12 observations (27%). It encountered all 20 survivors on the 37th observation.
 The bottom of the grey area is the outcome of a random model. Only half the survivors would be observed within 50% of the observations.
 The top of the grey area is the outcome of the perfect model, the “wizard curve.” Half the survivors would be observed in 10/45=22% of the observations.
augment(m, type.predict = "response") %>%
# event_level = "second" sets the second level as success
::gain_curve(surv, .fitted, event_level = "second") %>%
yardstickautoplot() +
labs(title = "Gain Curve")
ROC Curve
The ROC (Receiver Operating Characteristics) curve plots sensitivity vs specificity at different cutoff values for the probability, ranging cutoff from 0 to 1.
augment(m, type.predict = "response") %>%
# event_level = "second" sets the second level as success
::roc_curve(surv, .fitted, event_level = "second") %>%
yardstickautoplot() +
labs(title = "ROC Curve")
2.2 Multinomial Logistic Regression
The following notes rely on the [PSU STAT 504 course notes](https://online.stat.psu.edu/stat504/node/171/.
Multinomial logistic regression models the odds the multinomial response variable \(y \sim Mult(n, \pi)\) is in level \(j\) relative to baseline category \(j^*\) for all pairs of categories as a function of \(k\) explanatory variables, \(X = (X_1, X_2, ... X_k)\).
\[\log \left( \frac{\pi_{ij}}{\pi_{ij^*}} \right) = x_i^T \beta_j, \hspace{5mm} j \ne j^2\]
Interpet the \(k^{th}\) element of \(\beta_j\) as the increase in logodds of falling a response in category \(j\) relative to category \(j^*\) resulting from a oneunit increase in the \(k^{th}\) predictor term, holding the other terms constant.
Multinomial model is a type of GLM.
Here is an example using multinomial logistic regression. A researcher classified the stomach contents of \(n = 219\) alligators according to \(r = 5\) categories (fish, Inv., Rept, Bird, Other) as a function of covariates Lake, Sex, and Size..
< tribble(
gator_dat ~profile, ~Gender, ~Size, ~Lake, ~Fish, ~Invertebrate, ~Reptile, ~Bird, ~Other,
"1", "f", "<2.3", "george", 3, 9, 1, 0, 1,
"2", "m", "<2.3", "george", 13, 10, 0, 2, 2,
"3", "f", ">2.3", "george", 8, 1, 0, 0, 1,
"4", "m", ">2.3", "george", 9, 0, 0, 1, 2,
"5", "f", "<2.3", "hancock", 16, 3, 2, 2, 3,
"6", "m", "<2.3", "hancock", 7, 1, 0, 0, 5,
"7", "f", ">2.3", "hancock", 3, 0, 1, 2, 3,
"8", "m", ">2.3", "hancock", 4, 0, 0, 1, 2,
"9", "f", "<2.3", "oklawaha", 3, 9, 1, 0, 2,
"10", "m", "<2.3", "oklawaha", 2, 2, 0, 0, 1,
"11", "f", ">2.3", "oklawaha", 0, 1, 0, 1, 0,
"12", "m", ">2.3", "oklawaha", 13, 7, 6, 0, 0,
"13", "f", "<2.3", "trafford", 2, 4, 1, 1, 4,
"14", "m", "<2.3", "trafford", 3, 7, 1, 0, 1,
"15", "f", ">2.3", "trafford", 0, 1, 0, 0, 0,
"16", "m", ">2.3", "trafford", 8, 6, 6, 3, 5
)< gator_dat %>%
gator_dat mutate(
Gender = as_factor(Gender),
Lake = fct_relevel(Lake, "hancock"),
Size = as_factor(Size)
)
There are 4 equations to estimate:
\[\log \left( \frac{\pi_j} {\pi_{j^*}} \right) = \beta X\]
where \(\pi_{j^*}\) is the probability of fish, the baseline category.
Run a multivariate logistic regression model with VGAM::vglm()
.
library(VGAM)
vglm()
fits 4 logit models.
< vglm(
gator_vglm cbind(Bird,Invertebrate,Reptile,Other,Fish) ~ Lake + Size + Gender,
data = gator_dat,
family = multinomial
)summary(gator_vglm)
##
## Call:
## vglm(formula = cbind(Bird, Invertebrate, Reptile, Other, Fish) ~
## Lake + Size + Gender, family = multinomial, data = gator_dat)
##
## Coefficients:
## Estimate Std. Error z value Pr(>z)
## (Intercept):1 1.8568 0.5813 3.195 0.001400 **
## (Intercept):2 1.6115 0.5508 2.926 0.003435 **
## (Intercept):3 2.2866 0.6566 3.483 0.000497 ***
## (Intercept):4 0.6642 0.3802 1.747 0.080639 .
## Lakegeorge:1 0.5753 0.7952 0.723 0.469429
## Lakegeorge:2 1.7805 0.6232 2.857 0.004277 **
## Lakegeorge:3 1.1295 1.1928 0.947 0.343687
## Lakegeorge:4 0.7666 0.5686 1.348 0.177563
## Lakeoklawaha:1 1.1256 1.1923 0.944 0.345132
## Lakeoklawaha:2 2.6937 0.6693 4.025 5.70e05 ***
## Lakeoklawaha:3 1.4008 0.8105 1.728 0.083926 .
## Lakeoklawaha:4 0.7405 0.7421 0.998 0.318372
## Laketrafford:1 0.6617 0.8461 0.782 0.434145
## Laketrafford:2 2.9363 0.6874 4.272 1.94e05 ***
## Laketrafford:3 1.9316 0.8253 2.340 0.019263 *
## Laketrafford:4 0.7912 0.5879 1.346 0.178400
## Size>2.3:1 0.7302 0.6523 1.120 0.262918
## Size>2.3:2 1.3363 0.4112 3.250 0.001155 **
## Size>2.3:3 0.5570 0.6466 0.861 0.388977
## Size>2.3:4 0.2906 0.4599 0.632 0.527515
## Genderm:1 0.6064 0.6888 0.880 0.378666
## Genderm:2 0.4630 0.3955 1.171 0.241796
## Genderm:3 0.6276 0.6853 0.916 0.359785
## Genderm:4 0.2526 0.4663 0.542 0.588100
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]),
## log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
##
## Residual deviance: 50.2637 on 40 degrees of freedom
##
## Loglikelihood: 73.3221 on 40 degrees of freedom
##
## Number of Fisher scoring iterations: 5
##
## No HauckDonner effect found in any of the estimates
##
##
## Reference group is level 5 of the response
The residual deviance is 50.2637 on 40 degrees of freedom. Residual deviance tests the current model fit versus the saturated model. The saturated model, which fits a separate multinomial distribution to each of the 16 profiles (unique combinations of lake, sex and size), has 16 × 4 = 64 parameters. The current model has an intercept, three lake coefficients, one sex coefficient and one size coefficient for each of the four logit equations, for a total of 24 parameters. Therefore, the overall fit statistics have 64 − 24 = 40 degrees of freedom.
# E < data.frame(fitted(gator_vglm) * rowSums(gator_dat[, 5:9]))
# O < gator_dat %>% select(Bird, Invertebrate, Reptile, Other, Fish) + .000001
# (g2 < 2 * sum(O * log(O / E)))
indicates the model fits okay, but not great. The Residual Deviance of 50.26 with 40 df from the table above output is reasonable, with pvalue of 0.1282 and the statistics/df is close to 1 that is 1.256.
2.3 Ordinal Logistic Regression
Ordinal logistic regression, also call cumulative link model (CLM), is a generalized linear model (GZLM), an extension of the general linear model (GLM) to noncontinuous outcome variables. There are many approaches to ordinal logistic regression, including cumulative, adjacent, and continuation categories, but the most popular is the cumulative odds ordinal logistic regression with proportional odds. These notes rely on UVA, PSU, Laerd, and the CLM package article vignette.
The model for ordinal response random variable \(Y_i\) with \(J\) levels is
\[\gamma_{ij} = F(\eta_{ij}), \hspace{5 mm} \eta_{ij} = \theta_j  x_i^\mathrm{T}\beta, \hspace{5 mm} i = 1, \ldots, n, \hspace{5 mm} j = 1, \ldots, J1\]
where \(\gamma_{ij} = P(Y_i \le j) = \pi_{i1} + \cdots + \pi_{ij}\). \(\eta_{ij}\) is a linear predictor with \(J1\) intercepts. \(F\) is the inverse link function. The regression models the logit link function of \(\gamma_{ij}\).
\[\mathrm{logit}(\gamma_{ij}) = \log \left[\frac{P(Y_i \le j)}{P(Y_i \gt j)} \right] = \theta_j  x_i^\mathrm{T}\beta\] The cumulative logit is the logodds of the cumulative probabilities that the response is in category \(\le j\) versus \(\gt j\). \(\theta_j\) is the logodds when \(x_i^\mathrm{T}=0\) and \(\beta\) is the increase in the log odds attributed to a one unit increase in \(x_i^\mathrm{T}\). Notice \(\beta\) is the same for all \(j\).
Once you fit the model, you will either generate predicted values or evaluate the coefficient estimators. The predicted value is a logodds by default (useless), so you will at least take exponential to get the odds. From there you can solve for the probability,
\[P(Y_i \gt j) = \frac{\mathrm{exp}(\hat{y}_i)}{1 + \mathrm{exp}(\hat{y}_i)}.\]
The exponential of \(\beta\) is the odds ratio of \(x_1^\mathrm{T}  x_0^\mathrm{T}\). You can solve for the odds ratio
\[\mathrm{OR} = \frac{\mathrm{exp}(\theta_j  x_1^\mathrm{T}\beta)}{\mathrm{exp}(\theta_j  x_2^\mathrm{T}\beta)} = \mathrm{exp}(\beta(x_1^\mathrm{T}  x_0^\mathrm{T}))\]
If \(x\) is a binary factor factor, then \(exp(\beta)\) is the odds ratio of \(x=1\) vs \(x=0\). Thus the oddsratio is proportional to the difference between values of \(x\) and \(\beta\) is the constant of proportionality.
The model is estimated with a regularized NewtonRaphson algorithm with stephalving (line search) using analytic expressions for the gradient and Hessian of the negative loglikelihood function. What this means is beyond me right now, but the upshot is that the estimation is an iterative maximization exercise, not a formulaic matrix algebra process. It is possible for the model estimation to fail to converge on a maximum.
You will sometimes encounter discussion about the latent variable. That is just the underlying quality you are trying to measure. If you rate something a 4 on a 5level likert scale, 4 is the expression of your valuation, the latent variable. Your precise valuation is somewhere between 3 and 5 on a continuous scale. The link function defines the distribution of the latent variable.
There are variations on the ordinal model. Structured thresholds impose restrictions on \(\theta_j\), for example requiring equal distances between levels. Partial proportional odds allow \(\theta_j\) to vary with nominal predictors. You can also use link functions other than logit.
There are two conditions to ordinal logistic regression: (a) no multicollinearity, and (b) proportional odds.
Case Study
192 participants in a study respond to the statement “Taxes are too high” on a 4level likert scale (tax_too_high
, Strongly Disagree, Disagree, Agree, Strongly Agree). Participant attributes include business owner (biz_owner
, YN), age (age
), and political affiliation (politics
, Liberal, Conservative, Labor).
$gt < tbl_summary(
tax$dat %>% select(age),
taxby = politics,
statistic = list(all_continuous() ~ "{mean} ({sd})")
)$gt tax
Characteristic  Lab, N = 62^{1}  Lib, N = 54^{1}  Con, N = 76^{1} 

biz_owner  36 (58%)  26 (48%)  47 (62%) 
income  36 (8)  35 (6)  37 (7) 
tax_too_high  
Strongly Disagree  10 (16%)  10 (19%)  4 (5.3%) 
Disagree  13 (21%)  16 (30%)  9 (12%) 
Agree  28 (45%)  21 (39%)  42 (55%) 
Strongly Agree  11 (18%)  7 (13%)  21 (28%) 
^{
1
}
n (%); Mean (SD)

$dat %>%
taxmutate(income = case_when(income < quantile(income, .25) ~ "low income",
< quantile(income, .75) ~ "med income",
income TRUE ~ "high income"),
income = factor(income, levels = c("low income", "med income", "high income"))) %>%
count(tax_too_high, biz_owner, politics, income) %>%
ggplot(aes(x = tax_too_high, y = n, fill = biz_owner)) +
geom_col(position = position_dodge2(preserve = "single")) +
facet_grid(rows = vars(income), cols = vars(politics), space = "free") +
scale_x_discrete(labels = function (x) str_wrap(x, width = 10)) +
theme_bw() +
theme(legend.position = "bottom") +
labs(title = "Taxes too high?",
subtitle = "Reponse count by business owner, income level, and party.")
Fit a cumulative link model for the cumulative probability of the \(i\)th response falling in \(j\)th category or below where \(i\) indexes the (\(n = 192\)) responses, \(j = 1, \ldots, J\) indexes the (\(J = 4\)) response categories, and \(\theta_j\) is the threshold for the \(j\)th cumulative logit.
\[\mathrm{logit}(P(Y_i \le j)) = \theta_j  \beta_1(\mathrm{politics}_i)  \beta_2(\mathrm{biz\_owner}_i)  \beta_3(\mathrm{age}_i)\]
Fit the Model
$fmla < formula(tax_too_high ~ biz_owner + age + politics)
tax$clm < clm(tax$fmla, data = tax$dat)
taxsummary(tax$clm)
## formula: tax_too_high ~ biz_owner + age + politics
## data: tax$dat
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 192 197.62 409.23 6(0) 3.14e12 3.2e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>z)
## biz_ownerYes 0.66462 0.28894 2.300 0.021435 *
## age 0.24189 0.03260 7.421 1.17e13 ***
## politicsLib 0.03695 0.36366 0.102 0.919072
## politicsCon 1.16142 0.34554 3.361 0.000776 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## Strongly DisagreeDisagree 7.026 1.166 6.024
## DisagreeAgree 8.766 1.231 7.119
## AgreeStrongly Agree 11.653 1.357 8.590
The summary table shows two fit statistics at the top: the loglikelihood and the AIC. The loglikelihood is the sum of the likelihoods for each observation that the predicted value correctly predicts the observed value. Its value ranges from \(\infty\) to \(+\infty\). It’s value increases with observations, additional variables, and fit quality. I think you just use it to compare alternative model formulations. Akaike Information Criterion (AIC) is a measure of the relative quality of a statistical model. Again, I think the value is only useful as a comparison benchmark between alternative model fits. You’d want the model with the lowest AIC.
The Coefficients table is the familiar parameter estimates. E.g., for biz_ownerYes
, the coefficient estimate is 0.665 with standard error 0.289. The zvalue is the ratio \(z = \hat{\beta} / se =\) 2.300 with pvalue equal to \(2 \cdot P(Z>z) =\) 0.021. Some programs (e.g., SPSS) also show the Wald chisquared statistic, the square of the z statistic, \(z^2 =\), 5.291. The square of a normal variable has a chisquare distribution, so the p value for the Wald chisquared statistic is the pchisq(z^2, df = 1)
\(=\) 0.021.
The Threshold coefficients table are the intercepts, or cutpoints. The first cutpoint is logodds of a response level Strongly Disagree (or less) vs greater than Strongly Disagree when all factor variables are at their reference level and the continuous vars are at 0.
There may be interaction effects between biz_owner
and politics
. Fit a saturate model, then compare their log likelihoods with a likelihood ratio test.
$fmla_sat < formula(tax_too_high ~ biz_owner*politics + age)
tax$clm_sat < clm(tax$fmla_sat, data = tax$dat)
tax$sat_anova < anova(tax$clm, tax$clm_sat)
tax$sat_anova tax
## Likelihood ratio tests of cumulative link models:
##
## formula: link: threshold:
## tax$clm tax$fmla logit flexible
## tax$clm_sat tax$fmla_sat logit flexible
##
## no.par AIC logLik LR.stat df Pr(>Chisq)
## tax$clm 7 409.23 197.62
## tax$clm_sat 9 411.75 196.87 1.4805 2 0.477
The likelihood ratio test indicates the maineffects model fits about the same in comparison to the saturated model (\(\chi^2(\) 2 $) = $ LR = 1.48, p = 0.477)
Verify Assumptions
Cumulative odds ordinal logistic regression with proportional odds models require a) no multicollinearity, and b) proportional odds.
Multicollinearity
Multicollinearity occurs when two or more independent variables are highly correlated so that they do not provide unique or independent information in the regression model. Multicollinearity inflates the variances of the estimated coefficients, resulting in larger confidence intervals. The usual interpretation of a slope coefficient as the change in the mean response per unit increase in the predictor when all the other predictors are held constant breaks down because changing one predictor necessarily changes other predictors.
Test for multicollinearity with variance inflation factors (VIF). The VIF is the inflation percentage of the parameter variance due to multicollinearity. E.g., a VIF of 1.9 means the parameter variance is 90% larger than what it would be if it was not correlated with other predictors.
Predictor \(K\)’s variance, \(Var(\hat{\beta_k})\), is inflated by a factor of
\[VIF_k = \frac{1}{1  R_k^2}\]
due to collinearity with other predictors, where \(R_k^2\) is the \(R^2\) of a regression of the \(k^{th}\) predictor on the remaining predictors. If there is zero relationship between predictor \(k\) and the other variables, \(R_k^2 = 0\) and \(VIF = 1\) (no variance inflation). If 100% of the variance in predictor \(k\) is explained by the other predictors, then \(R_k^2 = 1\) and \(VIF = \infty\). A good rule of thumb is that \(VIF \le 5\) is acceptable.
# Cannot use CLM model with vif(). Reexpress as a linear model.
$vif < lm(as.numeric(tax_too_high) ~ politics + biz_owner + age, dat = tax$dat) %>%
tax::vif()
car$vif tax
## GVIF Df GVIF^(1/(2*Df))
## politics 1.035831 2 1.008840
## biz_owner 1.023642 1 1.011752
## age 1.036491 1 1.018082
The VIFs in column GVIF are all below 5, so this model is not compromised by multicollinearity.
Proportional Odds
The assumption of proportional odds means the independent variable effects are constant across each cumulative split of the ordinal dependent variable. Test for proportional odds using a full likelihood ratio test comparing the proportional odds model with a multinomial logit model, also called an unconstrained baseline logit model. This is also called the test of parallel lines. The multinomial logit model fits a slope to each of the \(J – 1\) levels. The proportional odds model is nested within the multinomial model, so you can use a likelihood ratio test to see if the models are statistically different. Fit the proportional odds model and a multinomial model using VGAM::vglm()
and capture the log likelihoods and degrees of freedom. Perform a likelihood ratio test on the differences in log likelihoods, \(D = 2 \mathrm{loglik}(\beta)\).
$vglm_ordinal < vglm(tax$fmla, propodds, data = tax$dat)
tax$vglm_multinomial < vglm(tax$fmla, cumulative, data = tax$dat)
tax$po_lrt < lrtest(tax$vglm_multinomial, tax$vglm_ordinal)
tax$po_lrt tax
## Likelihood ratio test
##
## Model 1: tax_too_high ~ biz_owner + age + politics
## Model 2: tax_too_high ~ biz_owner + age + politics
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 561 193.31
## 2 569 197.62 8 8.6197 0.3754
The assumption of proportional odds was met, as assessed by a full likelihood ratio test comparing the fit of the proportional odds model to a model with varying location parameters, \(\chi^2\)(8) = 8.620, p = 0.375.
Another option is the partial proportional odds test. This test locates specific variables causing the rejection of proportional odds.
$po_lrt2 < clm(tax$fmla, data = tax$dat) %>%
taxnominal_test()
$po_lrt2 tax
## Tests of nominal effects
##
## formula: tax_too_high ~ biz_owner + age + politics
## Df logLik AIC LRT Pr(>Chi)
## <none> 197.62 409.23
## biz_owner 2 197.34 412.67 0.55974 0.7559
## age
## politics 4 196.20 414.40 2.83415 0.5860
The assumption of proportional odds was met, as assessed by a full likelihood ratio test comparing the fit of the proportional odds model to a model with varying location parameters for business owner, \(\chi^2\)(2) = 0.560, p = 0.756 and politics, \(\chi^2\)(4) = 2.834, p = 0.586.
Assess the Model Fit
There are three ways to assess overall model fit: The Deviance and Pearson goodnessoffit tests of the overall model fit; the Cox and Snell, Nagelkerke, and McFadden pseudo R measures of explained variance; and the likelihood ratio test comparing the model fit to the interceptonly model.
Deviance and Pearson
However, these tests rely on large frequencies in each cell, that is, each possible combination of predictor values. Overall goodnessoffit statistics should be treated with suspicion when a continuous independent variable is present and/or there are a large number of cells with zero frequency.The Pearson goodnessoffit statistic is \(X^2 = \sum \frac{(O_{ij}  E_{ij})^2}{E_{ij}}\) where \(i\) is the observation number and \(j\) is the response variable level. It is a summary of the Pearson residuals, the difference between the observed and expected cell counts, \(O_{ij}  E_{ij}\). The deviance goodnessoffit statistic is the difference in fit between the model and a full model; a full model being a model that fits the data perfectly, \(G^2 = 2 \sum_{ij} O_{ij} \log \left( \frac{O_{ij}}{E_{ij}} \right)\). Neither of these tests are reliable if there are many cells with zero frequencies and/or small expected frequencies and are generally not recommended. Generally, the chisquared test requires a frequency count of at least 5 per cell.
# Observed combinations of model vars
$cell_patterns < tax$dat %>% count(biz_owner, age, politics, tax_too_high) %>% nrow()
tax
# Observed combinations of predictor vars * levels of response var
$covariate_patterns < tax$dat %>% count(biz_owner, age, politics) %>% nrow()
tax$possible_cells < tax$covariate_patterns * length(levels(tax$dat$tax_too_high))
tax
# 1  ratio of observed to possible
$pct_freq_zero < 1  tax$cell_patterns / tax$possible_cells tax
There are 137 observed combinations of model variables (predictors), and 372 possible combinations (predictors * outcome levels), so 63.2% of cells have zero frequencies. Ideally, zero frequencies should be less than 20%, so if you were to use the deviance or Pearson tests, you would need to report this. The results below are contradictory and bogus. I think you’d only use this test if you didn’t have continuous predictor variables.
< tax$dat %>%
observed count(biz_owner, age, politics, tax_too_high) %>%
pivot_wider(names_from = tax_too_high, values_from = n) %>%
replace_na(list(`Strongly Disagree` = 0, Disagree = 0, Agree = 0, `Strongly Agree` = 0)) %>%
pivot_longer(cols = `Strongly Disagree`:`Strongly Agree`, names_to = "outcome", values_to = "observed")
< bind_cols(
expected $dat,
taxpredict(tax$clm, subset(tax$dat, select = tax_too_high))$fit %>% data.frame()
%>%
) rename(c("Strongly Disagree" = "Strongly.Disagree", "Strongly Agree" = "Strongly.Agree")) %>%
group_by(biz_owner, age, politics) %>%
summarize(.groups = "drop", across(`Strongly Disagree`:`Strongly Agree`, sum)) %>%
pivot_longer(cols = `Strongly Disagree`:`Strongly Agree`, names_to = "outcome", values_to = "expected")
< observed %>%
obs_exp inner_join(expected, by = c("politics", "biz_owner", "age", "outcome")) %>%
mutate(epsilon_sq = (observed  expected)^2,
chi_sq = epsilon_sq / expected,
g_sq = 2 * observed * log((observed+.0001) / expected)
)
$chisq < list()
tax$chisq$X2 = sum(obs_exp$chi_sq)
tax$chisq$G2 = sum(obs_exp$g_sq)
tax$chisq$df = tax$covariate_patterns * (length(levels(tax$dat$tax_too_high))  1)  7
tax$chisq$X2_p.value = pchisq(tax$chisq$X2, df = tax$chisq$df, lower.tail = FALSE)
tax$chisq$G2_p.value = pchisq(tax$chisq$G2, df = tax$chisq$df, lower.tail = FALSE) tax
The Pearson goodnessoffit test indicated that the model was not a good fit to the observed data, \(\chi^2\)(272) = 745.4, p < .001$. The deviance goodnessoffit test indicated that the model was a good fit to the observed data, \(G^2\)(272) = 232.6, p = 0.960.
PseudoR2 Measures
There are a number of measures in ordinal regression that attempt to provide a similar “variance explained” measure as that provided in ordinary leastsquares linear regression. However, these measures do not have the direct interpretation that they do in ordinary linear regression and are often, therefore, referred to as “pseudo” R2 measures. The three most common measures (Cox and Snell, Nagelkerke, and McFadden) are not particularly good and not universally used. It is presented in the SPSS output, so you might encounter it in published work.
$nagelkerke < rcompanion::nagelkerke(tax$clm)
tax$nagelkerke$Pseudo.R.squared.for.model.vs.null tax
## Pseudo.R.squared
## McFadden 0.181957
## Cox and Snell (ML) 0.367369
## Nagelkerke (Cragg and Uhler) 0.399641
Likelihood Ratio Test
The best way to assess model fit is the likelihood ratio test comparing the model to an interceptonly model. The difference in the 2 log likelihood between the models has a \(\chi^2\) distribution with degrees of freedom equal to the difference in the number of parameters.
< clm(tax_too_high ~ 1, data = tax$dat)
intercept_only $lrt < anova(tax$clm, intercept_only)
tax$lrt tax
## Likelihood ratio tests of cumulative link models:
##
## formula: link: threshold:
## intercept_only tax_too_high ~ 1 logit flexible
## tax$clm tax$fmla logit flexible
##
## no.par AIC logLik LR.stat df Pr(>Chisq)
## intercept_only 3 489.14 241.57
## tax$clm 7 409.23 197.62 87.911 4 < 2.2e16 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The table shows the log likelihoods of the two models. LR.stat is the difference between 2 * the logLik values.
The final model statistically significantly predicted the dependent variable over and above the interceptonly model, \(\chi^2(4)\) = 87.9, p = 0.000.
Interpret Results
Return to the model summary.
tidy(tax$clm)
## # A tibble: 7 x 6
## term estimate std.error statistic p.value coef.type
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Strongly DisagreeDisagree 7.03 1.17 6.02 1.70e 9 intercept
## 2 DisagreeAgree 8.77 1.23 7.12 1.08e12 intercept
## 3 AgreeStrongly Agree 11.7 1.36 8.59 8.72e18 intercept
## 4 biz_ownerYes 0.665 0.289 2.30 2.14e 2 location
## 5 age 0.242 0.0326 7.42 1.17e13 location
## 6 politicsLib 0.0369 0.364 0.102 9.19e 1 location
## 7 politicsCon 1.16 0.346 3.36 7.76e 4 location
The coefficients for biz_owner
, age
, and politics are positive. Positive parameters increase the likelihood of stronger agreement with the statement. In this case, discontent with taxes are higher for business owners, increase with age, and are higher for Liberal Democrats and Conservatives relative to the Labor Party. The expected cumulative logodds of declaring \(\le j\) level of agreement with the statement for the baseline group (biz_ownerNo
, age = 0
, politicsLib
) is 7.026 for \(j = 1\) (Strongly Disagree), 8.766 for \(j = 2\) (Disagree), and 11.653 for \(j = 3\) (Agree).
You could solve the logit equation for
\[\pi_j = \frac{\mathrm{exp}(Y_i)} {1 + \mathrm{exp}(Y_i)}\]
to get the cumulative probabilities for each level. That’s what predict(type = "cum.prob")
does. But it might be more intuitive to work with individual probabilities, the lagged differences to get the individual probabilities for each \(j\). That’s what predict(type = "prob")
does. I like to play with predicted values to get a sense of the outcome distributions. In this case, I’ll take the median age
, and each combination of biz_owner
and politics
.
< tax$dat %>%
new_data mutate(age = median(tax$dat$age)) %>%
expand(age, politics, biz_owner)
< predict(tax$clm, newdata = new_data, type = "prob")[["fit"]] %>% as.data.frame()
preds
bind_cols(new_data, preds) %>%
pivot_longer(cols = `Strongly Disagree`:`Strongly Agree`) %>%
mutate(name = factor(name, levels = levels(tax$dat$tax_too_high))) %>%
ggplot(aes(y = politics, x = value, fill = fct_rev(name))) +
geom_col() +
geom_text(aes(label = scales::percent(value, accuracy = 1)),
size = 3, position = position_stack(vjust=0.5)) +
facet_grid(~paste("Bus Owner = ", biz_owner)) +
scale_fill_grey(start = 0.5, end = 0.8) +
theme_bw() +
theme(legend.position = "top",
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
guides(fill = guide_legend(reverse = TRUE)) +
labs(title = "Taxes too High for Conservative Business Owners?",
x = NULL, fill = NULL)
You will want to establish whether politics
is statistically significant overall before exploring any specific contrasts. The ANOVA procedure with type I test reports an overall test of significance for each variable entered into the model.
$anovaI < anova(tax$clm, type = "I")) (tax
## Type I Analysis of Deviance Table with Wald chisquare tests
##
## Df Chisq Pr(>Chisq)
## biz_owner 1 13.201 0.0002798 ***
## age 1 57.413 3.533e14 ***
## politics 2 14.636 0.0006635 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The political party last voted for has a statistically significant effect on the prediction of whether tax is thought to be too high, Wald \(\chi^2\)(2) = 14.6, p = 0.001.
The best way to work with the data is with the tidy(exponentiate = TRUE)
version.
$tidy < tax$clm %>% tidy(conf.int = TRUE, exponentiate = TRUE)
tax$tidy tax
## # A tibble: 7 x 8
## term estimate std.error statistic p.value conf.low conf.high coef.type
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Strongly D~ 1.13e3 1.17 6.02 1.70e 9 NA NA intercept
## 2 DisagreeA~ 6.41e3 1.23 7.12 1.08e12 NA NA intercept
## 3 AgreeStro~ 1.15e5 1.36 8.59 8.72e18 NA NA intercept
## 4 biz_ownerY~ 1.94e0 0.289 2.30 2.14e 2 1.11 3.44 location
## 5 age 1.27e0 0.0326 7.42 1.17e13 1.20 1.36 location
## 6 politicsLib 1.04e0 0.364 0.102 9.19e 1 0.508 2.12 location
## 7 politicsCon 3.19e0 0.346 3.36 7.76e 4 1.63 6.35 location
Then you can summarize the table in words.
The odds of business owners considering tax to be too high was 1.944 (95% CI, 1.107 to 3.443) times that of nonbusiness owners, a statistically significant effect, z = 2.300, p = 0.021.
The odds of Conservative voters considering tax to be too high was 3.194 (95% CI, 1.635 to 6.351) times that of Labour voters, a statistically significant effect, z = 3.361, p = 0.001. The odds of Liberal Democrat voters considering tax to be too high was similar to that of Labour voters (odds ratio of 1.038 (95% CI, 0.508 to 2.121), p = 0.919.
An increase in age (expressed in years) was associated with an increase in the odds of considering tax too high, with an odds ratio of 1.274 (95% CI, 1.197 to 1.360), z = 7.421, p = 0.000.
Reporting
Here is the complete writeup.
A cumulative odds ordinal logistic regression with proportional odds was run to determine the effect of business ownership, political party voted for, and age, on the belief that taxes are too high. There were proportional odds, as assessed by a full likelihood ratio test comparing the fitted model to a model with varying location parameters, \(\chi^2\)(8) = 8.620, p = 0.375. The final model statistically significantly predicted the dependent variable over and above the interceptonly model, \(\chi^2(4)\) = 87.9, p = 0.000. The odds of business owners considering tax to be too high was 1.944 (95% CI, 1.107 to 3.443) times that of nonbusiness owners, a statistically significant effect, z = 2.300, p = 0.021. The political party last voted for has a statistically significant effect on the prediction of whether tax is thought to be too high, Wald \(\chi^2\)(2) = 14.6, p = 0.001. The odds of Conservative voters considering tax to be too high was 3.194 (95% CI, 1.635 to 6.351) times that of Labour voters, a statistically significant effect, z = 3.361, p = 0.001. The odds of Liberal Democrat voters considering tax to be too high was similar to that of Labour voters (odds ratio of 1.038 (95% CI, 0.508 to 2.121), p = 0.919. An increase in age (expressed in years) was associated with an increase in the odds of considering tax too high, with an odds ratio of 1.274 (95% CI, 1.197 to 1.360), z = 7.421, p = 0.000.
Package gtsummary shows a nice summary table.
tbl_regression(tax$clm)
Characteristic  log(OR)^{1}  95% CI^{1}  pvalue 

biz_owner  
No  —  —  
Yes  0.66  0.10, 1.2  0.021 
age  0.24  0.18, 0.31  <0.001 
politics  
Lab  —  —  
Lib  0.04  0.68, 0.75  >0.9 
Con  1.2  0.49, 1.8  <0.001 
^{
1
}
OR = Odds Ratio, CI = Confidence Interval

2.4 Poisson Regression
Poisson models count data, like “traffic tickets per day,” or “website hits per day.” The response is an expected rate or intensity. For count data, specify the generalized model, this time with family = poisson
or family = quasipoisson
.
Recall that the probability of achieving a count \(y\) when the expected rate is \(\lambda\) is distributed
\[P(Y = y\lambda) = \frac{e^{\lambda} \lambda^y}{y!}.\]
The poisson regression model is
\[\lambda = \exp(X \beta).\]
You can solve this for \(y\) to get
\[y = X\beta = \ln(\lambda).\]
That is, the model predicts the log of the response rate. For a sample of size n, the likelihood function is
\[L(\beta; y, X) = \prod_{i=1}^n \frac{e^{\exp({X_i\beta})}\exp({X_i\beta})^{y_i}}{y_i!}.\]
The loglikelihood is
\[l(\beta) = \sum_{i=1}^n (y_i X_i \beta  \sum_{i=1}^n\exp(X_i\beta)  \sum_{i=1}^n\log(y_i!).\]
Maximizing the loglikelihood has no closedform solution, so the coefficient estimates are found through interatively reweighted least squares.
Poisson processes assume the variance of the response variable equals its mean. “Equals” means the mean and variance are of a similar order of magnitude. If that assumption does not hold, use the quasipoisson. Use Poisson regression for large datasets. If the predicted counts are much greater than zero (>30), the linear regression will work fine. Whereas RMSE is not useful for logistic models, it is a good metric in Poisson.
Dataset fire
contains response variable injuries
counting the number of injuries during the month and one explanatory variable, the month mo
.
< read_csv(file = "C:/Users/mpfol/OneDrive/Documents/Data Science/Data/CivilInjury_0.csv") fire
## Rows: 300 Columns: 3
##  Column specification 
## Delimiter: ","
## dbl (2): ID, Total Injuries
## dttm (1): Injury Date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
< fire %>%
fire mutate(mo = as.POSIXlt(`Injury Date`)$mon + 1) %>%
rename(dt = `Injury Date`,
injuries = `Total Injuries`)
str(fire)
## spec_tbl_df [300 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ID : num [1:300] 1 2 3 4 5 6 7 8 9 10 ...
## $ dt : POSIXct[1:300], format: "20050110" "20050111" ...
## $ injuries: num [1:300] 1 1 1 5 2 1 1 1 1 1 ...
## $ mo : num [1:300] 1 1 1 1 1 1 2 2 2 4 ...
##  attr(*, "spec")=
## .. cols(
## .. ID = col_double(),
## .. `Injury Date` = col_datetime(format = ""),
## .. `Total Injuries` = col_double()
## .. )
##  attr(*, "problems")=<externalptr>
In a situation like this where there the relationship is bivariate, start with a visualization.
ggplot(fire, aes(x = mo, y = injuries)) +
geom_jitter() +
geom_smooth(method = "glm", method.args = list(family = "poisson")) +
labs(title = "Injuries by Month")
## `geom_smooth()` using formula 'y ~ x'
Fit a poisson regression in R using glm(formula, data, family = poisson)
. But first, check whether the mean and variance of injuries
are the same magnitude? If not, then use family = quasipoisson
.
mean(fire$injuries)
## [1] 1.36
var(fire$injuries)
## [1] 1.020468
They are of the same magnitude, so fit the regression with family = poisson
.
< glm(injuries ~ mo, family = poisson, data = fire)
m2 summary(m2)
##
## Call:
## glm(formula = injuries ~ mo, family = poisson, data = fire)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## 0.3987 0.3473 0.3034 0.2502 4.3185
##
## Coefficients:
## Estimate Std. Error z value Pr(>z)
## (Intercept) 0.22805 0.10482 2.176 0.0296 *
## mo 0.01215 0.01397 0.870 0.3844
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 139.87 on 299 degrees of freedom
## Residual deviance: 139.11 on 298 degrees of freedom
## AIC: 792.08
##
## Number of Fisher Scoring iterations: 5
The predicted value \(\hat{y}\) is the estimated log of the response variable,
\[\hat{y} = X \hat{\beta} = \ln (\lambda).\]
Suppose mo
is January (mo = ), then the log of
injuries` is \(\hat{y} = 0.323787\). Or, more intuitively, the expected count of injuries is \(\exp(0.323787) = 1.38\)
predict(m2, newdata = data.frame(mo=1))
## 1
## 0.2401999
predict(m2, newdata = data.frame(mo=1), type = "response")
## 1
## 1.271503
Here is a plot of the predicted counts in red.
augment(m2, type.predict = "response") %>%
ggplot(aes(x = mo, y = injuries)) +
geom_point() +
geom_point(aes(y = .fitted), color = "red") +
scale_y_continuous(limits = c(0, NA)) +
labs(x = "Month",
y = "Injuries",
title = "Poisson Fitted Line Plot")
Evaluate a logistic model fit with an analysis of deviance.
< glance(m2)) (perf
## # A tibble: 1 x 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 140. 299 394. 792. 799. 139. 298 300
< 1  perf$deviance / perf$null.deviance) (pseudoR2
## [1] 0.005413723
The deviance of the null model (no regressors) is 139.9. The deviance of the full model is 132.2. The psuedoR2 is very low at .05. How about the RMSE?
RMSE(pred = predict(m2, type = "response"), obs = fire$injuries)
## [1] 1.006791
The average prediction error is about 0.99. That’s almost as much as the variance of injuries
 i.e., just predicting the mean of injuries
would be almost as good! Use the GainCurvePlot()
function to plot the gain curve.
augment(m2, type.predict = "response") %>%
ggplot(aes(x = injuries, y = .fitted)) +
geom_point() +
geom_smooth(method ="lm") +
labs(x = "Actual",
y = "Predicted",
title = "Poisson Fitted vs Actual")
## `geom_smooth()` using formula 'y ~ x'
augment(m2) %>% data.frame() %>%
GainCurvePlot(xvar = ".fitted", truthVar = "injuries", title = "Poisson Model")
It seems that mo
was a poor predictor of injuries
.
References
The related probit regression link function is \(f(E(YX)) = \Phi^{1}(E(YX)) = \Phi^{1}(\pi)\). The difference between the logistic and probit link function is theoretical, and the practical significance is slight. You can safely ignore probit.↩︎
Notes from Machine Learning Mastery↩︎