Chapter 5 Generalized linear models: The general case
The same idea we used in logistic or Poisson regression, namely transforming the conditional expectation of \(Y\) into something that can be modeled by a linear model (that is, a quantity that lives in \(\mathbb{R}\)), can be generalized. This gives rise to the family of generalized linear models, which extend the linear model to different kinds of response variables and provide a convenient parametric framework. The first ingredient is a link function \(g\), monotonic and differentiable, which produces a transformed expectation to be modeled by a linear combination of the predictors: \[\begin{align*} g\left(\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k]\right)=\eta \end{align*}\] or, equivalently, \[\begin{align*} \mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k]&=g^{-1}(\eta), \end{align*}\] where \(\eta:=\beta_0+\beta_1x_1+\ldots+\beta_kx_k\) is the linear predictor.
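As an illustration of why a link is needed, consider a count response: its conditional mean is positive, so equating it directly to \(\eta\in\mathbb{R}\) could yield invalid (negative) means. The log link avoids this, since \[\begin{align*} g(\mu)=\log(\mu)\in\mathbb{R},\quad g^{-1}(\eta)=e^{\eta}>0\quad\text{for all }\eta\in\mathbb{R}, \end{align*}\] so the model \(\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k]=e^{\eta}\) always delivers a valid mean.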
The second ingredient of generalized linear models is a distribution for \(Y|(X_1,\ldots,X_k)\), just as the linear model assumes normality or the logistic model assumes a Bernoulli random variable. Thus, we have two generalizations with respect to the usual linear model:
- The conditional mean may be modeled by a transformation \(g^{-1}\) of the linear predictor \(\eta\).
- The distribution of \(Y|(X_1,\ldots,X_k)\) may be different from the Normal.
Generalized linear models are intimately related to the exponential family1, which is the family of distributions with pdf expressible as \[\begin{align} f(y;\theta,\phi)=\exp\left\{\frac{y\theta-b(\theta)}{a(\phi)}+c(y,\phi)\right\},\tag{5.1} \end{align}\] where \(a(\cdot)\), \(b(\cdot)\), and \(c(\cdot,\cdot)\) are specific functions. If \(Y\) has the pdf (5.1), then we write \(Y\sim\mathrm{E}(\theta,\phi,a,b,c)\). If the scale parameter \(\phi\) is known, this is an exponential family with canonical parameter \(\theta\) (if \(\phi\) is unknown, then it may or may not be a two-parameter exponential family). Distributions from the exponential family have some nice properties. Importantly, if \(Y\sim\mathrm{E}(\theta,\phi,a,b,c)\), then \[\begin{align} \mu:=\mathbb{E}[Y]=b'(\theta),\quad \sigma^2:=\mathbb{V}\mathrm{ar}[Y]=b''(\theta)a(\phi).\tag{5.2} \end{align}\]
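For instance, the Poisson distribution fits (5.1): writing its pmf as \[\begin{align*} f(y;\lambda)=\frac{e^{-\lambda}\lambda^y}{y!}=\exp\left\{y\log\lambda-\lambda-\log y!\right\} \end{align*}\] shows that \(\mathrm{Pois}(\lambda)=\mathrm{E}(\theta,\phi,a,b,c)\) with \(\theta=\log\lambda\), \(\phi=1\), \(a(\phi)=1\), \(b(\theta)=e^{\theta}\), and \(c(y,\phi)=-\log y!\). In agreement with (5.2), \(b'(\theta)=e^{\theta}=\lambda=\mathbb{E}[Y]\) and \(b''(\theta)a(\phi)=e^{\theta}=\lambda=\mathbb{V}\mathrm{ar}[Y]\).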
The canonical link function is the one that transforms \(\mu=b'(\theta)\) into the canonical parameter \(\theta\). For \(\mathrm{E}(\theta,\phi,a,b,c)\), this happens if \[\begin{align} \theta=g(\mu)=\eta \tag{5.3} \end{align}\] or, equivalently by (5.2), if \[\begin{align} g(\mu)=(b')^{-1}(\mu). \tag{5.4} \end{align}\] When the canonical link function is employed, (5.3) gives \(\theta=\eta\) and the one-line summary of the generalized linear model is (independence is implicit) \[\begin{align} Y|(X_1=x_1,\ldots,X_k=x_k)\sim\mathrm{E}(\eta,\phi,a,b,c).\tag{5.5} \end{align}\] The following table lists some useful generalized linear models.
Support of \(Y\) | Distribution | Link \(g(\mu)\) | \(g^{-1}(\eta)\) | \(\phi\) | \(Y\vert(X_1=x_1,\ldots,X_k=x_k)\) |
---|---|---|---|---|---|
\(\mathbb{R}\) | \(\mathcal{N}(\mu,\sigma^2)\) | \(\mu\) | \(\eta\) | \(\sigma^2\) | \(\mathcal{N}(\eta,\sigma^2)\) |
\(\{0,1\}\) | \(\mathrm{B}(1,p)\) | \(\mathrm{logit}(\mu)\) | \(\mathrm{logistic}(\eta)\) | \(1\) | \(\mathrm{B}\left(1,\mathrm{logistic}(\eta)\right)\) |
\(\{0,1,2,\ldots\}\) | \(\mathrm{Pois}(\lambda)\) | \(\log(\mu)\) | \(e^\eta\) | \(1\) | \(\mathrm{Pois}(e^{\eta})\) |
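The canonical links in the table follow from (5.4). For example, for the Bernoulli distribution, \[\begin{align*} f(y;p)=p^y(1-p)^{1-y}=\exp\left\{y\,\mathrm{logit}(p)+\log(1-p)\right\}, \end{align*}\] so \(\theta=\mathrm{logit}(p)\), \(b(\theta)=\log(1+e^\theta)\), and \(a(\phi)=1\). Then \(\mu=b'(\theta)=\mathrm{logistic}(\theta)\), and the canonical link is \(g=(b')^{-1}=\mathrm{logit}\), precisely the link employed in logistic regression.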
5.1 Estimation
The estimation of \(\boldsymbol{\beta}\) by MLE can be done in a unified framework for all generalized linear models thanks to (5.1). Given \((\mathbf{X}_{1},Y_1),\ldots,(\mathbf{X}_{n},Y_n)\) and employing the canonical link function (5.4), we have that \[ Y_i|(X_1=x_{i1},\ldots,X_k=x_{ik})\sim\mathrm{E}(\theta_i,\phi,a,b,c),\quad i=1,\ldots,n, \] where \[\begin{align*} \theta_i&:=\eta_i=\beta_0+\beta_1x_{i1}+\ldots+\beta_kx_{ik},\\ \mu_i&:=\mathbb{E}[Y_i|X_1=x_{i1},\ldots,X_k=x_{ik}]=g^{-1}(\eta_i). \end{align*}\] Then, the log-likelihood is \[\begin{align} \ell(\boldsymbol{\beta})=\sum_{i=1}^n\left(\frac{Y_i\theta_i-b(\theta_i)}{a(\phi)}+c(Y_i,\phi)\right).\tag{5.6} \end{align}\] Differentiating with respect to \(\boldsymbol{\beta}\) gives \[\begin{align*} \frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}=\sum_{i=1}^{n}\frac{\left(Y_i-b'(\theta_i)\right)}{a(\phi)}\frac{\partial \theta_i}{\partial \boldsymbol{\beta}}, \end{align*}\] which, exploiting the properties of the exponential family (in particular, \(b'(\theta_i)=\mu_i\)), can be reduced to \[\begin{align*} \frac{\partial \ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}=\sum_{i=1}^{n}\frac{(Y_i-\mu_i)}{g'(\mu_i)V_i}\mathbf{x}_i, \end{align*}\] where \(\mathbf{x}_i\) is the \(i\)-th row of the design matrix \(\mathbf{X}\) and \(V_i:=\mathbb{V}\mathrm{ar}[Y_i]=a(\phi)b''(\theta_i)\). Explicitly solving the system of equations \(\frac{\partial \ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}=\mathbf{0}\) is not possible in general, so a numerical procedure is required. Newton-Raphson is usually employed, which is based on obtaining \(\boldsymbol{\beta}_\mathrm{new}\) from the linear system2 \[\begin{align} \left.\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \partial\boldsymbol{\beta}'}\right |_{\boldsymbol{\beta}=\boldsymbol{\beta}_\mathrm{old}}(\boldsymbol{\beta}_\mathrm{new} -\boldsymbol{\beta}_\mathrm{old})=-\left.\frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \right |_{\boldsymbol{\beta}=\boldsymbol{\beta}_\mathrm{old}}.\tag{5.7} \end{align}\] A simplifying trick is to consider the expectation of \(\left.\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \partial\boldsymbol{\beta}'}\right |_{\boldsymbol{\beta}=\boldsymbol{\beta}_\mathrm{old}}\) in (5.7) rather than its actual value. By doing so, we arrive at a neat iterative algorithm called Iterative Reweighted Least Squares (IRLS). To that aim, we use the following well-known property of the Fisher information matrix from MLE theory: \[ \mathbb{E}\left[\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \partial\boldsymbol{\beta}'}\right]=-\mathbb{E}\left[\frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\left(\frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\right)'\right]. \] Then, it can be seen that3 \[\begin{align} \mathbb{E}\left[\left.\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \partial\boldsymbol{\beta}'}\right|_{\boldsymbol{\beta}=\boldsymbol{\beta}_\mathrm{old}} \right]= -\sum_{i=1}^{n} w_i \mathbf{x}_i \mathbf{x}_i'=-\mathbf{X}' \mathbf{W} \mathbf{X},\tag{5.8} \end{align}\] where \(w_i:=\frac{1}{V_i(g'(\mu_i))^2}\) and \(\mathbf{W}:=\mathrm{diag}(w_1,\ldots,w_n)\). Using this notation, \[\begin{align} \left.\frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \right|_{\boldsymbol{\beta}=\boldsymbol{\beta}_\mathrm{old}}= \mathbf{X}'\mathbf{W}(\mathbf{Y}-\boldsymbol{\mu}_\mathrm{old})\mathbf{g}'(\boldsymbol{\mu}_\mathrm{old}),\tag{5.9} \end{align}\] where the product \((\mathbf{Y}-\boldsymbol{\mu}_\mathrm{old})\mathbf{g}'(\boldsymbol{\mu}_\mathrm{old})\) is understood entrywise, with \(\mathbf{g}'(\boldsymbol{\mu}_\mathrm{old}):=\left(g'(\mu_{\mathrm{old},1}),\ldots,g'(\mu_{\mathrm{old},n})\right)'\). Substituting (5.8) and (5.9) in (5.7), we have: \[\begin{align} \boldsymbol{\beta}_\mathrm{new}&=\boldsymbol{\beta}_\mathrm{old}-\mathbb{E}\left [\left.\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \partial\boldsymbol{\beta}'}\right|_{\boldsymbol{\beta}=\boldsymbol{\beta}_\mathrm{old}} \right]^{-1}\left. \frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \right|_{\boldsymbol{\beta}=\boldsymbol{\beta}_\mathrm{old}}\nonumber\\ &=\boldsymbol{\beta}_\mathrm{old}+(\mathbf{X}' \mathbf{W} \mathbf{X})^{-1}\mathbf{X}'\mathbf{W}(\mathbf{Y}-\boldsymbol{\mu}_\mathrm{old})\mathbf{g}'(\boldsymbol{\mu}_\mathrm{old})\nonumber\\ &=(\mathbf{X}' \mathbf{W} \mathbf{X})^{-1}\mathbf{X}'\mathbf{W} \mathbf{z},\tag{5.10} \end{align}\] where \(\mathbf{z}:=\mathbf{X}\boldsymbol{\beta}_\mathrm{old}+(\mathbf{Y}-\boldsymbol{\mu}_\mathrm{old})\mathbf{g}'(\boldsymbol{\mu}_\mathrm{old})\) is the working vector.
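To make the previous expressions concrete, consider the Poisson model with its canonical log link, for which \(a(\phi)=1\), \(\mu_i=e^{\eta_i}\), \(g'(\mu_i)=1/\mu_i\), and \(V_i=\mu_i\). The score then reduces to \[\begin{align*} \frac{\partial \ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}=\sum_{i=1}^{n}\left(Y_i-e^{\mathbf{x}_i'\boldsymbol{\beta}}\right)\mathbf{x}_i \end{align*}\] and the weights in (5.8) become \(w_i=\mu_i=e^{\mathbf{x}_i'\boldsymbol{\beta}_\mathrm{old}}\), which depend on \(\boldsymbol{\beta}_\mathrm{old}\) and hence must be updated at each application of (5.10).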
As a consequence of (5.10), fitting a generalized linear model by IRLS amounts to performing a series of weighted least squares fits with changing weights and with responses given by the working vector. IRLS can be summarized as follows (a code sketch is given after the list):
1. Set \(\boldsymbol{\beta}_\mathrm{old}\) to some initial estimate.
2. Compute \(\boldsymbol{\mu}_\mathrm{old}\), \(\mathbf{W}\), and \(\mathbf{z}\).
3. Compute \(\boldsymbol{\beta}_\mathrm{new}\) using (5.10).
4. Set \(\boldsymbol{\beta}_\mathrm{old}=\boldsymbol{\beta}_\mathrm{new}\) and iterate steps 2–3 until convergence, then set \(\hat{\boldsymbol{\beta}}=\boldsymbol{\beta}_\mathrm{new}\).
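The steps above can be coded in a few lines. The following Python/NumPy snippet is a minimal sketch (the language choice, the function name `irls_poisson`, and the simulated data are illustrative assumptions, not part of the original notes) of IRLS for the Poisson model with canonical log link of the table, for which \(w_i=\mu_i\) and \(\mathbf{z}=\boldsymbol{\eta}_\mathrm{old}+(\mathbf{Y}-\boldsymbol{\mu}_\mathrm{old})/\boldsymbol{\mu}_\mathrm{old}\):

```python
import numpy as np

def irls_poisson(X, y, max_iter=25, tol=1e-8):
    """IRLS for Poisson regression with the canonical log link.

    X: (n, p) design matrix whose first column is ones (intercept).
    y: (n,) vector of counts.
    Returns the estimated coefficient vector beta.
    """
    n, p = X.shape
    beta = np.zeros(p)               # step 1: initial estimate
    for _ in range(max_iter):
        eta = X @ beta               # linear predictor
        mu = np.exp(eta)             # mu = g^{-1}(eta) for the log link
        # For the Poisson with log link: V_i = mu_i and g'(mu_i) = 1 / mu_i,
        # so the weights are w_i = 1 / (V_i * g'(mu_i)^2) = mu_i.
        w = mu
        # Step 2: working vector z = eta + (y - mu) * g'(mu)
        z = eta + (y - mu) / mu
        # Step 3: weighted least squares update (5.10),
        # beta_new = (X' W X)^{-1} X' W z
        XtW = X.T * w                # equivalent to X.T @ diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new              # step 4: iterate
    return beta

# Illustration on simulated data
rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
y = rng.poisson(np.exp(0.5 + 0.8 * x1))
print(irls_poisson(X, y))  # estimates should be close to (0.5, 0.8)
```

Note that each iteration is exactly a weighted least squares fit with weights \(w_i\) and responses \(z_i\), in line with the summary above.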
Not to be confused with the exponential distribution \(\mathrm{Exp}(\lambda)\), which is a member of the exponential family.↩︎
The system stems from a first-order Taylor expansion around the root.↩︎
Recall that \(\mathbb{E}[(Y_i-\mu_i)(Y_j-\mu_j)]=\mathrm{Cov}[Y_i,Y_j]=\begin{cases}V_i,&i=j,\\0,&i\neq j,\end{cases}\) because of independence.↩︎