# Chapter 27 A Generalized Linear Model for Binomial Response Data

For all \(i = 1, \ldots, n\), \(y_i\sim \text{Binomial}(m_i, \pi_i)\), where \(m_i\) is the known number of trials for observation \(i\), \(\pi_i = \frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}\), and \(y_1, \ldots, y_n\) are independent. The Binomial log likelihood is
\[
\ell(\beta\mid y) = \sum_{i = 1}^n [y_ix_i'\beta - m_i\log(1 + \exp(x_i'\beta))] + \text{constant}
\]
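
This log likelihood is easy to evaluate directly. A minimal numerical sketch (in Python with NumPy rather than the R used elsewhere in this chapter; the design matrix and counts below are made up for illustration):

```python
import numpy as np

def binomial_loglik(beta, X, y, m):
    """Binomial log likelihood for logistic regression, up to the constant."""
    eta = X @ beta                           # linear predictor x_i' beta
    return np.sum(y * eta - m * np.log1p(np.exp(eta)))

# hypothetical data: intercept plus one covariate, m_i = 10 trials each
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
m = np.array([10, 10, 10])
y = np.array([2, 5, 8])
print(binomial_loglik(np.array([-1.0, 1.0]), X, y, m))
```

Maximizing this function over \(\beta\) (e.g., with a Newton-type optimizer) gives the MLE \(\hat\beta\).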
We can compare the fit of the logistic regression model to that of a model known as the **saturated model**, which estimates a separate success probability for each observation. The MLE of \(\pi_i\) under the logistic regression model is \(\hat\pi_i = \frac{\exp(x_i'\hat\beta)}{1 + \exp(x_i'\hat\beta)}\), and the MLE of \(\pi_i\) under the saturated model is \(y_i/m_i\). Then the *likelihood ratio statistic* for testing the logistic regression model as the reduced model vs. the saturated model as the full model is
\[
2 \sum_{i=1}^{n}\left[y_{i} \log \left(\frac{y_{i} / m_{i}}{\hat{\pi}_{i}}\right)+\left(m_{i}-y_{i}\right) \log \left(\frac{1-y_{i} / m_{i}}{1-\hat{\pi}_{i}}\right)\right]
\]
which is called the **Deviance Statistic**, the **Residual Deviance**, or just the **Deviance**.

**A Lack-of-fit Test**: when \(n\) is large, and/or \(m_1, \ldots, m_n\) are each suitably large, the Deviance Statistic is approximately \(\chi_{n-p}^2\) if the logistic regression model is correct.
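
The deviance and its \(\chi^2_{n-p}\) lack-of-fit p-value can be computed directly from the fitted probabilities. A small Python/SciPy sketch, using hypothetical \(\hat\pi_i\) values standing in for output of a fitted model with \(p = 2\) parameters:

```python
import numpy as np
from scipy import stats

def deviance(y, m, pihat):
    """Residual deviance; terms with y_i = 0 or y_i = m_i contribute 0 (0*log 0 = 0)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(y > 0, y * np.log((y / m) / pihat), 0.0)
        t2 = np.where(y < m, (m - y) * np.log((1 - y / m) / (1 - pihat)), 0.0)
    return 2 * np.sum(t1 + t2)

y = np.array([2, 5, 8])
m = np.array([10, 10, 10])
pihat = np.array([0.25, 0.50, 0.75])   # hypothetical fitted probabilities

D = deviance(y, m, pihat)
pval = stats.chi2.sf(D, df=len(y) - 2)  # compare to chi^2_{n-p}
```

A large deviance (small p-value) suggests lack of fit of the logistic regression model.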

**Deviance Residual**:
\[
d_{i} \equiv \operatorname{sign}\left(y_{i} / m_{i}-\hat{\pi}_{i}\right) \sqrt{2\left[y_{i} \log \left(\frac{y_{i}}{m_{i} \hat{\pi}_{i}}\right)+\left(m_{i}-y_{i}\right) \log \left(\frac{m_{i}-y_{i}}{m_{i}-m_{i} \hat{\pi}_{i}}\right)\right]}
\]
The residual deviance statistic is the sum of the squared deviance residuals \((\sum_{i = 1}^n d_i^2)\).
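
That identity is easy to verify numerically. A Python sketch with hypothetical fitted probabilities (not the R workflow used in this chapter):

```python
import numpy as np

y = np.array([2, 5, 8])
m = np.array([10, 10, 10])
pihat = np.array([0.25, 0.50, 0.75])   # hypothetical fitted probabilities

# deviance residuals d_i; zero-count terms contribute 0 via the conventions below
with np.errstate(divide="ignore", invalid="ignore"):
    t1 = np.where(y > 0, y * np.log(y / (m * pihat)), 0.0)
    t2 = np.where(y < m, (m - y) * np.log((m - y) / (m - m * pihat)), 0.0)
d = np.sign(y / m - pihat) * np.sqrt(2 * (t1 + t2))

print(np.sum(d**2))   # equals the residual deviance statistic
```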

**Pearson’s Chi-Square Statistic**: Another lack-of-fit statistic that is approximately \(\chi_{n-p}^2\) under the null is Pearson’s Chi-Square Statistic:
\[
\begin{aligned}
X^{2} = \sum_{i=1}^{n}\left(\frac{y_{i}-\widehat{E}\left(y_{i}\right)}{\sqrt{\widehat{\operatorname{Var}}\left(y_{i}\right)}}\right)^{2} = \sum_{i=1}^{n}\left(\frac{y_{i}-m_{i} \hat{\pi}_{i}}{\sqrt{m_{i} \hat{\pi}_{i}\left(1-\hat{\pi}_{i}\right)}}\right)^{2} .
\end{aligned}
\]
The term \(r_i = \frac{y_{i}-m_{i} \hat{\pi}_{i}}{\sqrt{m_{i} \hat{\pi}_{i}\left(1-\hat{\pi}_{i}\right)}}\) is known as the *Pearson residual*.
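
The Pearson residuals and \(X^2\) can be computed in a few lines. A Python sketch with hypothetical fitted probabilities for a model with \(p = 2\) parameters:

```python
import numpy as np
from scipy import stats

y = np.array([2, 5, 8])
m = np.array([10, 10, 10])
pihat = np.array([0.25, 0.50, 0.75])   # hypothetical fitted probabilities

r = (y - m * pihat) / np.sqrt(m * pihat * (1 - pihat))  # Pearson residuals
X2 = np.sum(r**2)                                       # Pearson's chi-square statistic
pval = stats.chi2.sf(X2, df=len(y) - 2)                 # compare to chi^2_{n-p}
```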

**Residual Diagnostics**: For large \(m_i\) values, both \(d_i\) and \(r_i\) should be approximately distributed as standard normal random variables if the logistic regression model is correct.

```
o = glm(cbind(tumor, total - tumor) ~ dose,
        family = binomial(link = logit), data = d)
summary(o)
```

**Overdispersion**: in the GLM framework, it is often the case that \(Var(y_i)\) is a function of \(E(y_i)\). That is the case for logistic regression, where \(Var(y_i) = m_i\pi_i(1-\pi_i) = m_i\pi_i - (m_i\pi_i)^2/m_i = E(y_i) - [E(y_i)]^2/m_i\). If the variability of our response is greater than we would expect based on our estimates of the mean, we say that there is **overdispersion**.
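
The mean–variance identity above can be checked numerically for any choice of \(m\) and \(\pi\) (the values below are arbitrary):

```python
import numpy as np

# check Var(y) = E(y) - E(y)^2 / m for y ~ Binomial(m, pi), hypothetical m and pi
m, pi = 20, 0.3
Ey = m * pi                      # E(y) = m * pi
var_binom = m * pi * (1 - pi)    # Var(y) from the binomial formula
var_from_mean = Ey - Ey**2 / m   # Var(y) written as a function of E(y)
print(var_binom, var_from_mean)  # both 4.2
```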

**Quasi-likelihood Inference**: in the binomial case, we make all the same assumptions as before except that we assume \(Var(y_i) = \phi m_i\pi_i(1-\pi_i)\) for some unknown dispersion parameter \(\phi > 1\). The dispersion parameter can be estimated by \(\hat\phi = \sum_{i=1}^n d_i^2/(n-p)\) or \(\hat\phi = \sum_{i=1}^n r_i^2/(n-p)\).

- The estimated variance of \(\hat\beta\) is multiplied by \(\hat\phi\).
- For Wald type inferences, the standard normal null distribution is replaced by \(t\) with \(n - p\) degrees of freedom.
- Any test statistic \(T\) that was assumed \(\chi_q^2\) under \(H_0\) is replaced with \(T/(q\hat\phi)\) and compared to an \(F\) distribution with \(q\) and \(n-p\) degrees of freedom.
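
These adjustments can be sketched numerically. A Python illustration with hypothetical counts, fitted probabilities, and a hypothetical coefficient standard error `se_beta` (none of these come from a real fit):

```python
import numpy as np
from scipy import stats

# hypothetical data and fitted probabilities from a model with p = 2 parameters
y = np.array([12, 5, 18, 9])
m = np.array([30, 30, 30, 30])
pihat = np.array([0.30, 0.25, 0.55, 0.35])
n, p = len(y), 2

# Pearson-based dispersion estimate: phi_hat = sum(r_i^2) / (n - p)
r = (y - m * pihat) / np.sqrt(m * pihat * (1 - pihat))
phi_hat = np.sum(r**2) / (n - p)

se_beta = 0.12                          # hypothetical binomial-model SE for a coefficient
se_quasi = se_beta * np.sqrt(phi_hat)   # variance multiplied by phi_hat, so SE by sqrt
crit = stats.t.ppf(0.975, df=n - p)     # t_{n-p} critical value replaces N(0,1)
```

With overdispersion (\(\hat\phi > 1\)), `se_quasi` exceeds `se_beta`, widening the resulting intervals.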

```
oq = glm(cbind(tumor, total - tumor) ~ dosef,
         family = quasibinomial(link = logit), data = d)
summary(oq)
```