# Chapter 5 Generalized Linear Models

These notes are primarily from PSU STAT 504, which uses Alan Agresti's Categorical Data Analysis (Agresti 2013). I also reviewed PSU STAT 501, DataCamp's Generalized Linear Models in R, DataCamp's Multiple and Logistic Regression, and *Interpretable Machine Learning* (Molnar 2020).

The linear regression model, $$E(Y|X) = X \beta$$, structured as $$y_i = X_i \beta + \epsilon_i$$ where $$X_i \beta = \mu_i$$, assumes the response is a linear function of the predictors and the residuals are independent random variables normally distributed with mean zero and constant variance, $$\epsilon \sim N \left(0, \sigma^2 \right)$$. This implies that given some set of predictors, the response is normally distributed about its expected value, $$y_i \sim N \left(\mu_i, \sigma^2 \right)$$. However, there are many situations where this assumption of normality fails. Generalized linear models (GLMs) are a generalization of the linear regression model that addresses non-normal response distributions.

The response given a set of predictors will not have a normal distribution if its underlying data-generating process is binomial or multinomial (proportions), Poisson (counts), or exponential (time-to-event). In these situations a regular linear regression can predict proportions outside [0, 1] or counts or times that are negative. GLMs solve this problem by modeling a function of the expected value of $$y$$, $$f(E(Y|X)) = X \beta$$. There are three components to a GLM: the random component is the probability distribution of the response variable (normal, binomial, Poisson, etc.); the systematic component is the linear predictor formed from the explanatory variables, $$X\beta$$; and the link function $$f$$ connects the random and systematic components, $$f(E(Y|X)) = X\beta$$, mapping the range of the expected response onto $$(-\infty, +\infty)$$.

Linear regression is thus a special case of the GLM where the link function is the identity function, $$f(E(Y|X)) = E(Y|X)$$. For a logistic regression, where the data-generating process is binomial, the link function is

$f(E(Y|X)) = \ln \left( \frac{E(Y|X)}{1 - E(Y|X)} \right) = \ln \left( \frac{\pi}{1 - \pi} \right) = \operatorname{logit}(\pi)$

where $$\pi$$ is the event probability. (As an aside, you have probably heard of the related "probit" regression. The probit regression link function is $$f(E(Y|X)) = \Phi^{-1}(E(Y|X)) = \Phi^{-1}(\pi)$$. The difference between the logistic and probit link functions is mostly theoretical, and the practical difference is slight. You can probably safely ignore probit.)
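A minimal base-R sketch makes the logit link concrete: it maps probabilities in $$(0, 1)$$ onto the whole real line, and the logistic function inverts it.

```r
# The logit link and its inverse (the logistic function), base R only.
p <- c(0.1, 0.5, 0.9)          # event probabilities in (0, 1)
eta <- log(p / (1 - p))        # logit link: maps (0, 1) onto (-Inf, Inf)
p_back <- 1 / (1 + exp(-eta))  # inverse link recovers the probabilities
all.equal(p, p_back)           # TRUE
```

Note that $$\eta = 0$$ corresponds to $$\pi = 0.5$$, which is why a logistic regression's linear predictor crossing zero marks the 50% probability boundary.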

For a Poisson regression, the link function is

$f(E(Y|X)) = \ln (E(Y|X)) = \ln(\lambda)$

where $$\lambda$$ is the expected event rate.
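The log link works the same way for rates, as this small base-R sketch shows: it maps a positive rate onto the real line, and exponentiating the linear predictor recovers the rate, which is why it can never be negative.

```r
# The log link and its inverse for Poisson rates, base R only.
lambda <- c(0.5, 2, 10)   # expected event rates; must be positive
eta <- log(lambda)        # log link: maps (0, Inf) onto (-Inf, Inf)
exp(eta)                  # inverse link recovers the rates: 0.5 2 10
```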

For an exponential regression, the (canonical) link function is the negative inverse,

$f(E(Y|X)) = -E(Y|X)^{-1} = -\lambda$

where $$\lambda$$ is the expected event rate, the reciprocal of the expected time to event $$E(Y|X) = 1 / \lambda$$.

GLM uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.

In R, specify a GLM just like a linear model, but with the glm() function, specifying the distribution with the family parameter.

- family = "gaussian": linear regression
- family = "binomial": logistic regression
- family = binomial(link = "probit"): probit regression
- family = "poisson": Poisson regression
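Putting the pieces together, here is a hedged sketch of a logistic fit on simulated data (the variable names and true coefficients below are invented purely for illustration):

```r
# Simulated example: binomial response, logit link. Names are illustrative.
set.seed(1)
x <- rnorm(100)
pi_true <- 1 / (1 + exp(-(0.5 + 2 * x)))    # true event probabilities
y <- rbinom(100, size = 1, prob = pi_true)  # binomial (Bernoulli) response

fit <- glm(y ~ x, family = "binomial")      # logistic regression via MLE
coef(fit)                                   # estimates roughly near 0.5 and 2
head(predict(fit, type = "response"))       # fitted probabilities, always in (0, 1)
```

Note that predict() with type = "response" applies the inverse link for you; the default type = "link" returns the linear predictor $$X\hat\beta$$ on the $$(-\infty, +\infty)$$ scale.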