Chapter 5 Generalized Linear Models

These notes are primarily from PSU STAT 504, which uses Alan Agresti's *Categorical Data Analysis* (Agresti 2013). I also reviewed PSU STAT 501, DataCamp's Generalized Linear Models in R, DataCamp's Multiple and Logistic Regression, and *Interpretable Machine Learning* (Molnar 2020).

The linear regression model, \(E(Y|X) = X \beta\), structured as \(y_i = X_i \beta + \epsilon_i\) where \(X_i \beta = \mu_i\), assumes the response is a linear function of the predictors and the residuals are independent random variables normally distributed with mean zero and constant variance, \(\epsilon_i \sim N \left(0, \sigma^2 \right)\). This implies that given some set of predictors, the response is normally distributed about its expected value, \(y_i \sim N \left(\mu_i, \sigma^2 \right)\). However, there are many situations where this assumption of normality fails. Generalized linear models (GLMs) are a generalization of the linear regression model that addresses non-normal response distributions.
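A minimal sketch of this setup, using simulated data (all values and variable names here are illustrative): fit a linear model with `lm()` and check the normality-of-residuals assumption.

```r
# Simulate y_i = X_i beta + eps_i with eps_i ~ N(0, sigma^2).
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, mean = 0, sd = 1)

fit <- lm(y ~ x)
coef(fit)                     # estimates should be near (2, 3)
shapiro.test(residuals(fit))  # formal check that residuals look normal
```

With normally distributed errors, the OLS estimates recover the true coefficients and the residual diagnostics look clean; the rest of the chapter concerns responses where this simulation scheme is the wrong one.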

The response given a set of predictors will not have a normal distribution if its underlying data-generating process is binomial or multinomial (proportions), Poisson (counts), or exponential (time-to-event). In these situations a regular linear regression can predict proportions outside [0, 1] or counts or times that are negative. GLMs solve this problem by modeling a function of the expected value of \(y\), \(f(E(Y|X)) = X \beta\). There are three components to a GLM: the random component is the probability distribution of the response variable (normal, binomial, Poisson, etc.); the systematic component is the explanatory variables \(X\beta\); and the link function \(\eta\) specifies the link between the random and systematic components, converting the response range to \((-\infty, +\infty)\).

Linear regression is thus a special case of GLM where link function is the identity function, \(f(E(Y|X)) = E(Y|X)\). For a logistic regression, where the data generating process is binomial, the link function is

\[f(E(Y|X)) = \ln \left( \frac{E(Y|X)}{1 - E(Y|X)} \right) = \ln \left( \frac{\pi}{1 - \pi} \right) = \text{logit}(\pi)\]

where \(\pi\) is the event probability. (As an aside, you have probably heard of the related "probit" regression. The probit regression link function is \(f(E(Y|X)) = \Phi^{-1}(E(Y|X)) = \Phi^{-1}(\pi)\), where \(\Phi\) is the standard normal CDF. The difference between the logit and probit link functions is theoretical, and the practical difference is slight. You can probably safely ignore probit.)
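A quick numeric check of how the logit link maps probabilities in (0, 1) onto the whole real line; base R's `qlogis()` and `plogis()` are the logit and inverse logit.

```r
# Hand-rolled logit, compared against base R's qlogis().
logit <- function(pi) log(pi / (1 - pi))

p <- c(0.1, 0.5, 0.9)
logit(p)                        # approx. -2.197, 0, 2.197
all.equal(logit(p), qlogis(p))  # TRUE: identical to base R's logit
```

Note the symmetry: probabilities equidistant from 0.5 map to log-odds of equal magnitude and opposite sign.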

For a Poisson regression, the link function is

\[f(E(Y|X)) = \ln (E(Y|X)) = \ln(\lambda)\]

where \(\lambda\) is the expected event rate.
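To see the log link in action, a short sketch with simulated counts (the coefficients 0.5 and 1.2 are arbitrary choices for illustration): generate Poisson responses whose log-rate is linear in \(x\), then recover the coefficients with `glm()`.

```r
# Simulate counts with log(lambda) = 0.5 + 1.2 x, then fit a Poisson GLM.
set.seed(1)
x <- runif(500)
lambda <- exp(0.5 + 1.2 * x)   # the log link inverted
y <- rpois(500, lambda)

fit <- glm(y ~ x, family = poisson)
coef(fit)                      # close to the true values (0.5, 1.2)
```

Because the model is fit on the log scale, `exp(coef(fit))` gives multiplicative effects on the expected count, a common way to report Poisson regression results.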

For an exponential regression, the link function is

\[f(E(Y|X)) = -E(Y|X) = -\lambda\]

where \(\lambda\) is the expected time to event.

GLMs use maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus rely on large-sample approximations.

In R, specify a GLM just like a linear model, but with the glm() function, specifying the distribution with the family argument.

  • family = "gaussian": linear regression
  • family = "binomial": logistic regression
  • family = binomial(link = "probit"): probit regression
  • family = "poisson": Poisson regression