Chapter 5 Generalized linear models: The general case

The same idea we used in logistic or Poisson regression, namely transforming the conditional expectation of Y into something that can be modeled by a linear model (that is, a quantity that lives in \mathbb{R}), can be generalized. This gives rise to the family of generalized linear models, which extend the linear model to different kinds of response variables and provide a convenient parametric framework. The first ingredient is a link function g, monotone and differentiable, that produces a transformed expectation to be modeled by a linear combination of the predictors: g(\mathbb{E}[Y|X_1=x_1,\ldots,X_p=x_p])=\eta or, equivalently, \mathbb{E}[Y|X_1=x_1,\ldots,X_p=x_p]=g^{-1}(\eta), where \eta:=\beta_0+\beta_1x_1+\cdots+\beta_px_p is the linear predictor.
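For instance, for a binary response the conditional expectation p(x_1,\ldots,x_p):=\mathbb{E}[Y|X_1=x_1,\ldots,X_p=x_p] is a probability living in (0,1), and the logit link g(p)=\log\frac{p}{1-p} maps it onto \mathbb{R}. Setting \mathrm{logit}(p(x_1,\ldots,x_p))=\eta, or equivalently p(x_1,\ldots,x_p)=\mathrm{logistic}(\eta)=\frac{1}{1+e^{-\eta}}, recovers precisely the logistic regression model.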

The second ingredient of generalized linear models is a distribution for Y|(X_1,\ldots,X_p), just as the linear model assumes normality and the logistic model assumes a Bernoulli random variable. Thus, we have two generalizations with respect to the usual linear model:

  1. The conditional mean may be modeled by a transformation g^{-1} of the linear predictor \eta.
  2. The distribution of Y|(X_1,\ldots,X_p) may be different from the normal.

Generalized linear models are intimately related to the exponential family¹, which is the family of distributions with pdf expressible as \begin{align} f(y;\theta,\phi)=\exp\left\{\frac{y\theta-b(\theta)}{a(\phi)}+c(y,\phi)\right\},\tag{5.1} \end{align} where a(\cdot), b(\cdot), and c(\cdot,\cdot) are specific functions. If Y has the pdf (5.1), then we write Y\sim\mathrm{E}(\theta,\phi,a,b,c). If the scale parameter \phi is known, this is an exponential family with canonical parameter \theta (if \phi is unknown, then it may or may not be a two-parameter exponential family). Distributions from the exponential family have some nice properties. Importantly, if Y\sim\mathrm{E}(\theta,\phi,a,b,c), then \begin{align} \mu:=\mathbb{E}[Y]=b'(\theta),\quad \sigma^2:=\mathbb{V}\mathrm{ar}[Y]=a(\phi)b''(\theta).\tag{5.2} \end{align}
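To illustrate (5.1) and (5.2), consider Y\sim\mathrm{Pois}(\lambda). Its pmf can be rewritten as \begin{align*} f(y;\lambda)=\frac{e^{-\lambda}\lambda^{y}}{y!}=\exp\left\{y\log\lambda-\lambda-\log y!\right\}, \end{align*} so taking \theta=\log\lambda, b(\theta)=e^{\theta}, a(\phi)=1, and c(y,\phi)=-\log y! shows that \mathrm{Pois}(\lambda)=\mathrm{E}(\log\lambda,1,a,b,c). Then (5.2) gives \mathbb{E}[Y]=b'(\theta)=e^{\theta}=\lambda and \mathbb{V}\mathrm{ar}[Y]=a(\phi)b''(\theta)=e^{\theta}=\lambda, in agreement with the well-known mean and variance of the Poisson distribution.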

The canonical link function is the one that transforms \mu=b'(\theta) into the canonical parameter \theta. For \mathrm{E}(\theta,\phi,a,b,c), this happens if \begin{align} \theta=g(\mu)=\eta \tag{5.3} \end{align} or, equivalently by (5.2), if \begin{align} g(\mu)=(b')^{-1}(\mu). \tag{5.4} \end{align} In the case of the canonical link function, the one-line summary of the generalized linear model is (independence is implicit) \begin{align} Y|(X_1=x_1,\ldots,X_p=x_p)\sim\mathrm{E}(\eta,\phi,a,b,c),\tag{5.5} \end{align} since, by (5.3), the canonical parameter equals the linear predictor. The following table lists some useful generalized linear models.

| Support of Y | Distribution | Link g(\mu) | g^{-1}(\eta) | \phi | Y\vert(X_1=x_1,\ldots,X_p=x_p) |
|---|---|---|---|---|---|
| \mathbb{R} | \mathcal{N}(\mu,\sigma^2) | \mu | \eta | \sigma^2 | \mathcal{N}(\eta,\sigma^2) |
| \{0,1\} | \mathrm{B}(1,p) | \mathrm{logit}(\mu) | \mathrm{logistic}(\eta) | 1 | \mathrm{B}\left(1,\mathrm{anti-logit}(\eta)\right) |
| \{0,1,2,\ldots\} | \mathrm{Pois}(\lambda) | \log(\mu) | e^\eta | 1 | \mathrm{Pois}(e^{\eta}) |
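As a check of the second row of the table, for Y\sim\mathrm{B}(1,p) the pmf can be written as \begin{align*} f(y;p)=p^{y}(1-p)^{1-y}=\exp\left\{y\,\mathrm{logit}(p)+\log(1-p)\right\}, \end{align*} so \theta=\mathrm{logit}(p), b(\theta)=\log(1+e^{\theta}), a(\phi)=1, and c(y,\phi)=0. Since \mu=b'(\theta)=\frac{e^{\theta}}{1+e^{\theta}}=p, the canonical link given by (5.4) is g(\mu)=(b')^{-1}(\mu)=\mathrm{logit}(\mu), precisely the link employed in logistic regression.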

5.1 Estimation

The estimation of \boldsymbol{\beta} by MLE can be done in a unified framework for all generalized linear models thanks to (5.1). Given (\mathbf{X}_{1},Y_1),\ldots,(\mathbf{X}_{n},Y_n) and employing the canonical link function (5.4), we have that Y_i|(X_1=x_{i1},\ldots,X_p=x_{ip})\sim\mathrm{E}(\theta_i,\phi,a,b,c),\quad i=1,\ldots,n, where \begin{align*} \theta_i&:=g(\mu_i)=\eta_i=\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip},\\ \mu_i&:=\mathbb{E}[Y_i|X_1=x_{i1},\ldots,X_p=x_{ip}]=g^{-1}(\eta_i). \end{align*} Then, the log-likelihood is \begin{align} \ell(\boldsymbol{\beta})=\sum_{i=1}^n\left(\frac{Y_i\theta_i-b(\theta_i)}{a(\phi)}+c(Y_i,\phi)\right).\tag{5.6} \end{align}

Differentiating with respect to \boldsymbol{\beta} gives \begin{align*} \frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}=\sum_{i=1}^{n}\frac{\left(Y_i-b'(\theta_i)\right)}{a(\phi)}\frac{\partial \theta_i}{\partial \boldsymbol{\beta}}, \end{align*} which, exploiting the properties of the exponential family, can be reduced to \begin{align*} \frac{\partial \ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}=\sum_{i=1}^{n}\frac{(Y_i-\mu_i)}{g'(\mu_i)V_i}\mathbf{x}_i, \end{align*} where \mathbf{x}_i is the i-th row of the design matrix \mathbf{X} and V_i:=\mathbb{V}\mathrm{ar}[Y_i]=a(\phi)b''(\theta_i). Solving the system of equations \frac{\partial \ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}=\mathbf{0} explicitly is not possible in general and a numerical procedure is required. Newton-Raphson is usually employed, which is based on obtaining \boldsymbol{\beta}_\mathrm{new} from the linear system² \begin{align} \left.\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\,\partial\boldsymbol{\beta}'}\right|_{\boldsymbol{\beta}=\boldsymbol{\beta}_\mathrm{old}}(\boldsymbol{\beta}_\mathrm{new}-\boldsymbol{\beta}_\mathrm{old})=-\left.\frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\right|_{\boldsymbol{\beta}=\boldsymbol{\beta}_\mathrm{old}}.\tag{5.7} \end{align}

A simplifying trick is to consider the expectation of \left.\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\,\partial\boldsymbol{\beta}'}\right|_{\boldsymbol{\beta}=\boldsymbol{\beta}_\mathrm{old}} in (5.7) rather than its actual value. By doing so, we can arrive at a neat iterative algorithm called Iterative Reweighted Least Squares (IRLS). To that aim, we use the following well-known property of the Fisher information matrix from MLE theory: \begin{align*} \mathbb{E}\left[\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\,\partial\boldsymbol{\beta}'}\right]=-\mathbb{E}\left[\frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\left(\frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\right)'\right]. \end{align*} Then, it can be seen that³ \begin{align} \mathbb{E}\left[\left.\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\,\partial\boldsymbol{\beta}'}\right|_{\boldsymbol{\beta}=\boldsymbol{\beta}_\mathrm{old}}\right]=-\sum_{i=1}^{n} w_i \mathbf{x}_i \mathbf{x}_i'=-\mathbf{X}'\mathbf{W}\mathbf{X},\tag{5.8} \end{align} where w_i:=\frac{1}{V_i(g'(\mu_i))^2} and \mathbf{W}:=\mathrm{diag}(w_1,\ldots,w_n). Using this notation, \begin{align} \left.\frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\right|_{\boldsymbol{\beta}=\boldsymbol{\beta}_\mathrm{old}}=\mathbf{X}'\mathbf{W}\left\{(\mathbf{Y}-\boldsymbol{\mu}_\mathrm{old})\odot\mathbf{g}'(\boldsymbol{\mu}_\mathrm{old})\right\},\tag{5.9} \end{align} where \mathbf{g}'(\boldsymbol{\mu}_\mathrm{old}):=(g'(\mu_{\mathrm{old},1}),\ldots,g'(\mu_{\mathrm{old},n}))' and \odot denotes the entrywise product. Substituting (5.8) and (5.9) in (5.7), we have \begin{align} \boldsymbol{\beta}_\mathrm{new}&=\boldsymbol{\beta}_\mathrm{old}-\mathbb{E}\left[\left.\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\,\partial\boldsymbol{\beta}'}\right|_{\boldsymbol{\beta}=\boldsymbol{\beta}_\mathrm{old}}\right]^{-1}\left.\frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\right|_{\boldsymbol{\beta}=\boldsymbol{\beta}_\mathrm{old}}\nonumber\\ &=\boldsymbol{\beta}_\mathrm{old}+(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\left\{(\mathbf{Y}-\boldsymbol{\mu}_\mathrm{old})\odot\mathbf{g}'(\boldsymbol{\mu}_\mathrm{old})\right\}\nonumber\\ &=(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{z},\tag{5.10} \end{align} where \mathbf{z}:=\mathbf{X}\boldsymbol{\beta}_\mathrm{old}+(\mathbf{Y}-\boldsymbol{\mu}_\mathrm{old})\odot\mathbf{g}'(\boldsymbol{\mu}_\mathrm{old}) is the working vector.
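As a sanity check of (5.10), consider the Gaussian case with the identity link: g(\mu)=\mu, so g'(\mu_i)=1 and V_i=\sigma^2. Then \mathbf{W}=\sigma^{-2}\mathbf{I} and \mathbf{z}=\mathbf{X}\boldsymbol{\beta}_\mathrm{old}+(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}_\mathrm{old})=\mathbf{Y}, hence \begin{align*} \boldsymbol{\beta}_\mathrm{new}=(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{z}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}, \end{align*} the least squares estimate, irrespective of \boldsymbol{\beta}_\mathrm{old}. In other words, IRLS converges in a single iteration for the linear model.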

As a consequence, fitting a generalized linear model by IRLS amounts to performing a series of weighted least squares fits with changing weights and with responses given by the working vector. IRLS can be summarized as follows (a minimal code sketch is given after the list):

  1. Set \boldsymbol{\beta}_\mathrm{old} with some initial estimation.
  2. Compute \boldsymbol{\mu}_\mathrm{old}, \mathbf{W}, and \mathbf{z}.
  3. Compute \boldsymbol{\beta}_\mathrm{new} using (5.10).
  4. Iterate steps 2–3 until convergence, then set \hat{\boldsymbol{\beta}}=\boldsymbol{\beta}_\mathrm{new}.
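To make the algorithm concrete, below is a minimal Python sketch of IRLS for the Poisson model with the canonical log link (so g^{-1}(\eta)=e^{\eta}, g'(\mu)=1/\mu, and V_i=\mu_i, hence w_i=\mu_i and z_i=\eta_i+(Y_i-\mu_i)/\mu_i). It is only an illustration of (5.10) under these assumptions, not a production implementation; the function irls_poisson and the simulated data are made up for this example and do not come from any particular library.

```python
import numpy as np

def irls_poisson(X, y, tol=1e-8, max_iter=25):
    """IRLS for a Poisson GLM with the canonical log link.

    X: (n, p + 1) design matrix whose first column is ones.
    y: (n,) vector of observed counts.
    """
    beta = np.zeros(X.shape[1])            # step 1: initial estimate
    for _ in range(max_iter):
        eta = X @ beta                     # linear predictor
        mu = np.exp(eta)                   # mu = g^{-1}(eta)
        # For the Poisson/log-link case g'(mu) = 1/mu and V = mu, so
        # w = 1 / (V * g'(mu)^2) = mu and z = eta + (y - mu) * g'(mu).
        w = mu                             # step 2: weights (diagonal of W)
        z = eta + (y - mu) / mu            # step 2: working vector
        XtW = X.T * w                      # X'W without forming diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # step 3: update (5.10)
        if np.max(np.abs(beta_new - beta)) < tol:     # step 4: convergence
            beta = beta_new
            break
        beta = beta_new
    return beta

# Illustrative usage on simulated data (the true coefficients 0.5 and 0.8
# are arbitrary choices for this example).
rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
y = rng.poisson(np.exp(0.5 + 0.8 * x1))
print(irls_poisson(X, y))  # estimates should be close to (0.5, 0.8)
```

The same loop applies to any other family in the table by replacing the computations of mu, w, and z with the corresponding g^{-1}, weights, and working responses.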

  1. Not to be confused with the exponential distribution \mathrm{Exp}(\lambda), which is a member of the exponential family.

  2. The system stems from a first-order Taylor expansion around the root.

  3. Recall that \mathbb{E}[(Y_i-\mu_i)(Y_j-\mu_j)]=\mathrm{Cov}[Y_i,Y_j]=\begin{cases}V_i,&i=j,\\0,&i\neq j,\end{cases} because of independence.