Chapter 5 Generalized Linear Models
These notes are primarily from PSU STAT 504, which uses Alan Agresti’s *Categorical Data Analysis* (Agresti 2013). I also reviewed PSU STAT 501, DataCamp’s Generalized Linear Models in R, DataCamp’s Multiple and Logistic Regression, and *Interpretable Machine Learning* (Molnar 2020).
The linear regression model, \(E(Y|X) = X \beta\), structured as \(y_i = X_i \beta + \epsilon_i\) where \(X_i \beta = \mu_i\), assumes the response is a linear function of the predictors and the residuals are independent random variables, normally distributed with mean zero and constant variance, \(\epsilon_i \sim N \left(0, \sigma^2 \right)\). This implies that, given a set of predictors, the response is normally distributed about its expected value, \(y_i \sim N \left(\mu_i, \sigma^2 \right)\). However, there are many situations where this assumption of normality fails. Generalized linear models (GLMs) generalize the linear regression model to accommodate non-normal response distributions.
The response given a set of predictors will not have a normal distribution if its underlying data-generating process is binomial or multinomial (proportions), Poisson (counts), or exponential (time-to-event). In these situations an ordinary linear regression can predict proportions outside [0, 1], or counts and times that are negative. GLMs solve this problem by modeling a function of the expected value of \(y\), \(f(E(Y|X)) = X \beta\). A GLM has three components: the random component is the probability distribution of the response variable (normal, binomial, Poisson, etc.); the systematic component is the linear predictor \(X\beta\); and the link function specifies the link between the random and systematic components, \(\eta = f(E(Y|X))\), mapping the range of the expected response onto \((-\infty, +\infty)\).
Linear regression is thus a special case of GLM where the link function is the identity function, \(f(E(Y|X)) = E(Y|X)\). For logistic regression, where the data-generating process is binomial, the link function is
\[f(E(Y|X)) = \ln \left( \frac{E(Y|X)}{1 - E(Y|X)} \right) = \ln \left( \frac{\pi}{1 - \pi} \right) = \text{logit}(\pi)\]
where \(\pi\) is the event probability. (As an aside, you have probably heard of the related “probit” regression. The probit link function is \(f(E(Y|X)) = \Phi^{-1}(E(Y|X)) = \Phi^{-1}(\pi)\), where \(\Phi\) is the standard normal CDF. The difference between the logit and probit links is mainly theoretical, and the practical significance is slight; you can usually ignore probit.)
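To see how close the two links are, here is a minimal sketch in base R (my own illustration, not from the notes) comparing the logit and probit transformations:

```r
# Compare the logit and probit transformations on a grid of probabilities.
p <- seq(0.01, 0.99, by = 0.01)
logit  <- qlogis(p)   # log(p / (1 - p))
probit <- qnorm(p)    # Phi^{-1}(p)

# The probit is roughly the logit scaled by 1/1.7 over most of the range,
# which is why the choice between the two rarely matters in practice.
plot(logit, probit, type = "l")
abline(a = 0, b = 1 / 1.7, lty = 2)
```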
For a Poisson regression, the link function is
\[f(E(Y|X)) = \ln (E(Y|X)) = \ln(\lambda)\]
where \(\lambda\) is the expected event rate.
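The log link guarantees positive fitted rates: \(\lambda = e^{X\beta} > 0\) regardless of the sign of the linear predictor. A quick base R illustration (my own sketch):

```r
# With a log link, the implied event rate exp(eta) is always positive,
# even when the linear predictor eta is negative.
eta <- c(-2, 0, 1.5)   # example linear predictor values
lambda <- exp(eta)     # implied Poisson rates
lambda
#> [1] 0.1353353 1.0000000 4.4816891
```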
For an exponential regression, the link function is
\[f(E(Y|X)) = -\frac{1}{E(Y|X)} = -\lambda\]
where \(\lambda\) is the event rate, so the expected time to event is \(E(Y|X) = 1/\lambda\).
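R’s `glm()` has no exponential family, but since the exponential distribution is a Gamma with shape 1, one common workaround (an assumption of mine, not from the notes) is to fit a Gamma GLM and fix the dispersion at 1:

```r
# Simulated time-to-event data: the exponential is a Gamma with shape 1,
# so fit a Gamma GLM and fix the dispersion at 1 when summarizing.
set.seed(1)
x <- runif(100)
y <- rexp(100, rate = exp(-1 + 2 * x))   # true rate depends on x
fit <- glm(y ~ x, family = Gamma(link = "log"))
summary(fit, dispersion = 1)             # exponential => dispersion = 1
```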
GLMs use maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, so inference relies on large-sample approximations.
In R, specify a GLM just like a linear model, but with the `glm()` function, setting the distribution with the `family` parameter:

- `family = "gaussian"`: linear regression
- `family = "binomial"`: logistic regression
- `family = binomial(link = "probit")`: probit regression
- `family = "poisson"`: Poisson regression
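For example, here is a minimal logistic regression sketch on the built-in `mtcars` data (my own illustration; it models the probability of a manual transmission from car weight):

```r
# Logistic regression: P(manual transmission) as a function of weight.
fit <- glm(am ~ wt, data = mtcars, family = "binomial")
summary(fit)

# Coefficients are on the log-odds scale; exponentiate for odds ratios.
exp(coef(fit))

# Predicted probabilities come from predict() with type = "response".
predict(fit, newdata = data.frame(wt = c(2, 3, 4)), type = "response")
```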
References
Agresti, Alan. 2013. *Categorical Data Analysis*. 3rd ed. Wiley.
Molnar, Christoph. 2020. *Interpretable Machine Learning*. https://christophm.github.io/interpretable-ml-book/.