Chapter 27 A Generalized Linear Model for Binomial Response Data

For all \(i = 1, \ldots, n\), \(y_i\sim \text{Binomial}(m_i, \pi_i)\), where \(m_i\) is the known number of trials for observation \(i\), \(\pi_i = \frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}\), and \(y_1, \ldots, y_n\) are independent. The binomial log likelihood is \[ \ell(\beta\mid y) = \sum_{i = 1}^n [y_ix_i'\beta - m_i\log(1 + \exp(x_i'\beta))] + \text{constant}. \] We can compare the fit of the logistic regression model to that of a saturated model, which has one free probability parameter per observation. The MLE of \(\pi_i\) under the logistic regression model is \(\hat\pi_i = \frac{\exp(x_i'\hat\beta)}{1 + \exp(x_i'\hat\beta)}\), and the MLE of \(\pi_i\) under the saturated model is \(y_i/m_i\). Then the likelihood ratio statistic for testing the logistic regression model as the reduced model vs. the saturated model as the full model is \[ 2 \sum_{i=1}^{n}\left[y_{i} \log \left(\frac{y_{i} / m_{i}}{\hat{\pi}_{i}}\right)+\left(m_{i}-y_{i}\right) \log \left(\frac{1-y_{i} / m_{i}}{1-\hat{\pi}_{i}}\right)\right], \] which is called the Deviance Statistic, the Residual Deviance, or simply the Deviance.
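As a check on these definitions, here is a minimal R sketch using simulated data (the names n, m, x, y, and fit are illustrative, not part of the chapter's example):

set.seed(1)
n <- 20; m <- rep(30, n)                  # known numbers of trials
x <- runif(n)
y <- rbinom(n, size = m, prob = plogis(-1 + 2 * x))
fit <- glm(cbind(y, m - y) ~ x, family = binomial(link = logit))
pi_hat <- fitted(fit)                     # MLEs of pi_i under the logistic model
pi_sat <- y / m                           # MLEs of pi_i under the saturated model
# Deviance from the likelihood ratio formula
# (assumes 0 < y_i < m_i; boundary counts need the 0 * log(0) = 0 convention)
dev <- 2 * sum(y * log(pi_sat / pi_hat) +
               (m - y) * log((1 - pi_sat) / (1 - pi_hat)))
all.equal(dev, deviance(fit))             # matches R's reported residual deviance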

A Lack-of-Fit Test: when \(n\) is large, and/or \(m_1, \ldots, m_n\) are each suitably large, the Deviance Statistic is approximately \(\chi_{n-p}^2\) if the logistic regression model is correct, where \(p\) is the number of regression parameters.
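Continuing the sketch above, the lack-of-fit test compares the deviance to its approximate \(\chi_{n-p}^2\) reference distribution:

pchisq(deviance(fit), df = df.residual(fit), lower.tail = FALSE)
# small p-values indicate lack of fit (trustworthy only for large m_i)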

Deviance Residual: \[ d_{i} \equiv \operatorname{sign}\left(y_{i} / m_{i}-\hat{\pi}_{i}\right) \sqrt{2\left[y_{i} \log \left(\frac{y_{i}}{m_{i} \hat{\pi}_{i}}\right)+\left(m_{i}-y_{i}\right) \log \left(\frac{m_{i}-y_{i}}{m_{i}-m_{i} \hat{\pi}_{i}}\right)\right]} \] The residual deviance statistic is the sum of the squared deviance residuals \((\sum_{i = 1}^n d_i^2)\).
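In the same sketch, the hand-computed deviance residuals agree with R's built-in residuals, and their squares sum to the deviance:

d_i <- sign(y / m - pi_hat) *
  sqrt(2 * (y * log(y / (m * pi_hat)) +
            (m - y) * log((m - y) / (m - m * pi_hat))))
all.equal(d_i, residuals(fit, type = "deviance"), check.attributes = FALSE)
all.equal(sum(d_i^2), deviance(fit))      # sum of squares equals the deviance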

Pearson’s Chi-Square Statistic: Another lack-of-fit statistic that is approximately \(\chi_{n-p}^2\) under the null is Pearson’s Chi-Square Statistic: \[ X^{2} = \sum_{i=1}^{n}\left(\frac{y_{i}-\widehat{E}\left(y_{i}\right)}{\sqrt{\widehat{\operatorname{Var}}\left(y_{i}\right)}}\right)^{2} = \sum_{i=1}^{n}\left(\frac{y_{i}-m_{i} \hat{\pi}_{i}}{\sqrt{m_{i} \hat{\pi}_{i}\left(1-\hat{\pi}_{i}\right)}}\right)^{2} . \] The term \(r_i = \frac{y_{i}-m_{i} \hat{\pi}_{i}}{\sqrt{m_{i} \hat{\pi}_{i}\left(1-\hat{\pi}_{i}\right)}}\) is known as the Pearson residual.
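The Pearson residuals and statistic can be checked the same way (still using the simulated fit above):

r_i <- (y - m * pi_hat) / sqrt(m * pi_hat * (1 - pi_hat))
all.equal(r_i, residuals(fit, type = "pearson"), check.attributes = FALSE)
X2 <- sum(r_i^2)
pchisq(X2, df = df.residual(fit), lower.tail = FALSE)   # Pearson lack-of-fit p-value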

Residual Diagnostics: For large \(m_i\) values, both \(d_i\) and \(r_i\) should be approximately distributed as standard normal random variables if the logistic regression model is correct.
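One informal way to use this fact (a sketch, not a formal test) is a normal Q-Q plot of the deviance residuals:

qqnorm(residuals(fit, type = "deviance"))
qqline(residuals(fit, type = "deviance"))   # points near the line support the model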

# Fit the binomial (logistic) GLM: successes and failures combined via cbind
o = glm(cbind(tumor, total - tumor) ~ dose,
        family = binomial(link = logit), data = d)
summary(o)

Overdispersion: in the GLM framework, it is often the case that \(\text{Var}(y_i)\) is a function of \(E(y_i)\). That is the case for logistic regression, where \(\text{Var}(y_i) = m_i\pi_i(1-\pi_i) = m_i\pi_i - (m_i\pi_i)^2/m_i = E(y_i) - [E(y_i)]^2/m_i\). If the variability of our response is greater than we should expect based on our estimates of the mean, we say that there is overdispersion.
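A quick informal screen (assuming a fitted binomial glm such as o above) is to compare the residual deviance to its degrees of freedom, since their ratio should be near 1 when the model is correct and the \(m_i\) are large:

deviance(o) / df.residual(o)   # values much larger than 1 suggest overdispersion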

Quasi-likelihood Inference: in the binomial case, we make all the same assumptions as before except that we assume \(\text{Var}(y_i) = \phi\, m_i\pi_i(1-\pi_i)\) for some unknown dispersion parameter \(\phi\), with \(\phi > 1\) corresponding to overdispersion. The dispersion parameter can be estimated by \(\hat\phi = \sum_{i=1}^n d_i^2/(n-p)\) or \(\hat\phi = \sum_{i=1}^n r_i^2/(n-p)\). Inferences are then adjusted as follows:

  • The estimated variance of \(\hat\beta\) is multiplied by \(\hat\phi\).
  • For Wald-type inferences, the standard normal null distribution is replaced by \(t\) with \(n - p\) degrees of freedom.
  • Any test statistic \(T\) that was assumed \(\chi_q^2\) under \(H_0\) is replaced with \(T/(q\hat\phi)\) and compared to an \(F\) distribution with \(q\) and \(n-p\) degrees of freedom.
# Quasibinomial fit: same mean model, variance inflated by the dispersion phi
oq = glm(cbind(tumor, total - tumor) ~ dosef,
         family = quasibinomial(link = logit), data = d)
summary(oq)
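As a sketch of how these pieces fit together (using the fit oq above), the Pearson-based estimate of \(\phi\) reproduces the dispersion reported by summary(), and anova() with test = "F" carries out the F-based comparison described in the last bullet:

phi_hat <- sum(residuals(oq, type = "pearson")^2) / df.residual(oq)
phi_hat
summary(oq)$dispersion        # summary.glm reports the Pearson-based estimate
anova(oq, test = "F")         # F tests in place of the chi-square tests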