7 Generalized Linear Models
Even though we call it generalized linear model, it is still under the paradigm of nonlinear regression, because the form of the regression model is nonlinear. The name generalized linear model derived from the fact that we have \(\mathbf{x'_i \beta}\) (which is linear form) in the model.
7.1 Logistic Regression
\[ p_i = f(\mathbf{x}_i ; \beta) = \frac{exp(\mathbf{x_i'\beta})}{1 + exp(\mathbf{x_i'\beta})} \]
Equivalently,
\[ logit(p_i) = log(\frac{p_i}{1+p_i}) = \mathbf{x_i'\beta} \]
where \(\frac{p_i}{1+p_i}\)is the odds.
In this form, the model is specified such that a function of the mean response is linear. Hence, Generalized Linear Models
The likelihood function
\[ L(p_i) = \prod_{i=1}^{n} p_i^{Y_i}(1p_i)^{1Y_i} \]
where \(p_i = \frac{\mathbf{x'_i \beta}}{1+\mathbf{x'_i \beta}}\) and \(1p_i = (1+ exp(\mathbf{x'_i \beta}))^{1}\)
Hence, our objective function is
\[ Q(\beta) = log(L(\beta)) = \sum_{i=1}^n Y_i \mathbf{x'_i \beta}  \sum_{i=1}^n log(1+ exp(\mathbf{x'_i \beta})) \]
we could maximize this function numerically using the optimization method above, which allows us to find numerical MLE for \(\hat{\beta}\). Then we can use the standard asymptotic properties of MLEs to make inference.
Property of MLEs is that parameters are asymptotically unbiased with sample variancecovariance matrix given by the inverse Fisher information matrix
\[ \hat{\beta} \dot{\sim} AN(\beta,[\mathbf{I}(\beta)]^{1}) \]
where the Fisher Information matrix, \(\mathbf{I}(\beta)\) is
\[ \begin{aligned} \mathbf{I}(\beta) &= E[\frac{\partial \log(L(\beta))}{\partial (\beta)}\frac{\partial \log(L(\beta))}{\partial \beta'}] \\ &= E[(\frac{\partial \log(L(\beta))}{\partial \beta_i} \frac{\partial \log(L(\beta))}{\partial \beta_j})_{ij}] \end{aligned} \]
Under regularity conditions, this is equivalent to the negative of the expected value of the Hessian Matrix
\[ \begin{aligned} \mathbf{I}(\beta) &= E[\frac{\partial^2 \log(L(\beta))}{\partial \beta \partial \beta'}] \\ &= E[(\frac{\partial^2 \log(L(\beta))}{\partial \beta_i \partial \beta_j})_{ij}] \end{aligned} \]
Example:
\[ x_i' \beta = \beta_0 + \beta_1 x_i \]
\[ \begin{aligned}  \frac{\partial^2 \ln(L(\beta))}{\partial \beta^2_0} &= \sum_{i=1}^n \frac{\exp(x'_i \beta)}{1 + \exp(x'_i \beta)}  [\frac{\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n p_i (1p_i) \\  \frac{\partial^2 \ln(L(\beta))}{\partial \beta^2_1} &= \sum_{i=1}^n \frac{x_i^2\exp(x'_i \beta)}{1 + \exp(x'_i \beta)}  [\frac{x_i\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n x_i^2p_i (1p_i) \\  \frac{\partial^2 \ln(L(\beta))}{\partial \beta_0 \partial \beta_1} &= \sum_{i=1}^n \frac{x_i\exp(x'_i \beta)}{1 + \exp(x'_i \beta)}  x_i[\frac{\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n x_ip_i (1p_i) \end{aligned} \]
Hence,
\[ \mathbf{I} (\beta) = \left[ \begin{array} {cc} \sum_i p_i(1p_i) & \sum_i x_i p_i(1p_i) \\ \sum_i x_i p_i(1p_i) & \sum_i x_i^2 p_i(1p_i) \end{array} \right] \]
Inference
Likelihood Ratio Tests
To formulate the test, let \(\beta = [\beta_1', \beta_2']'\). If you are interested in testing a hypothesis about \(\beta_1\), then we leave \(\beta_2\) unspecified (called nuisance parameters). \(\beta_1\) and \(\beta_2\) can either a vector or scalar, or \(\beta_2\) can be null.
Example: \(H_0: \beta_1 = \beta_{1,0}\) (where \(\beta_{1,0}\) is specified) and \(\hat{\beta}_{2,0}\) be the MLE of \(\beta_2\) under the restriction that \(\beta_1 = \beta_{1,0}\). The likelihood ratio test statistic is
\[ 2\log\Lambda = 2[\log(L(\beta_{1,0},\hat{\beta}_{2,0}))  \log(L(\hat{\beta}_1,\hat{\beta}_2))] \]
where
 the first term is the value fo the likelihood for the fitted restricted model
 the second term is the likelihood value of the fitted unrestricted model
Under the null,
\[ 2 \log \Lambda \sim \chi^2_{\upsilon} \]
where \(\upsilon\) is the dimension of \(\beta_1\)
We reject the null when \(2\log \Lambda > \chi_{\upsilon,1\alpha}^2\)
Wald Statistics
Based on
\[ \hat{\beta} \sim AN (\beta, [\mathbf{I}(\beta)^{1}]) \]
\[ H_0: \mathbf{L}\hat{\beta} = 0 \]
where \(\mathbf{L}\) is a \(q \times p\) matrix with \(q\) linearly independent rows. Then
\[ W = (\mathbf{L\hat{\beta}})'(\mathbf{L[I(\hat{\beta})]^{1}L'})^{1}(\mathbf{L\hat{\beta}}) \]
under the null hypothesis
Confidence interval
\[ \hat{\beta}_i \pm 1.96 \hat{s}_{ii}^2 \]
where \(\hat{s}_{ii}^2\) is the ith diagonal of \(\mathbf{[I(\hat{\beta})]}^{1}\)
If you have
 large sample size, the likelihood ratio and Wald tests have similar results.
 small sample size, the likelihood ratio test is better.
Logistic Regression: Interpretation of \(\beta\)
For single regressor, the model is
\[ logit\{\hat{p}_{x_i}\} \equiv logit (\hat{p}_i) = \log(\frac{\hat{p}_i}{1  \hat{p}_i}) = \hat{\beta}_0 + \hat{\beta}_1 x_i \]
When \(x= x_i + 1\)
\[ logit\{\hat{p}_{x_i +1}\} = \hat{\beta}_0 + \hat{\beta}(x_i + 1) = logit\{\hat{p}_{x_i}\} + \hat{\beta}_1 \]
Then,
\[ \begin{aligned} logit\{\hat{p}_{x_i +1}\}  logit\{\hat{p}_{x_i}\} &= log\{odds[\hat{p}_{x_i +1}]\}  log\{odds[\hat{p}_{x_i}]\} \\ &= log(\frac{odds[\hat{p}_{x_i + 1}]}{odds[\hat{p}_{x_i}]}) = \hat{\beta}_1 \end{aligned} \]
and
\[ exp(\hat{\beta}_1) = \frac{odds[\hat{p}_{x_i + 1}]}{odds[\hat{p}_{x_i}]} \]
the estimated odds ratio
the estimated odds ratio, when there is a difference of c units in the regressor x, is \(exp(c\hat{\beta}_1)\). When there are multiple covariates, \(exp(\hat{\beta}_k)\) is the estimated odds ratio for the variable \(x_k\), assuming that all of the other variables are held constant.
Inference on the Mean Response
Let \(x_h = (1, x_{h1}, ...,x_{h,p1})'\). Then
\[ \hat{p}_h = \frac{exp(\mathbf{x'_h \hat{\beta}})}{1 + exp(\mathbf{x'_h \hat{\beta}})} \]
and \(s^2(\hat{p}_h) = \mathbf{x'_h[I(\hat{\beta})]^{1}x_h}\)
For new observation, we can have a cutoff point to decide whether y = 0 or 1.
7.1.1 Application
library(kableExtra)
library(dplyr)
library(pscl)
library(ggplot2)
library(faraway)
library(nnet)
library(agridat)
library(nlstools)
Logistic Regression
\(x \sim Unif(0.5,2.5)\). Then \(\eta = 0.5 + 0.75 x\)
set.seed(23) #set seed for reproducibility
x < runif(1000, min = 0.5, max = 2.5)
eta1 < 0.5 + 0.75 * x
Passing \(\eta\)’s into the inverselogit function, we get
\[ p = \frac{\exp(\eta)}{1+ \exp(\eta)} \]
where \(p \in [0,1]\)
Then, we generate \(y \sim Bernoulli(p)\)
p < exp(eta1) / (1 + exp(eta1))
y < rbinom(1000, 1, p)
BinData < data.frame(X = x, Y = y)
Model Fit
Logistic_Model < glm(formula = Y ~ X,
family = binomial, # family = specifies the response distribution
data = BinData)
summary(Logistic_Model)
#>
#> Call:
#> glm(formula = Y ~ X, family = binomial, data = BinData)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> 2.2317 0.4153 0.5574 0.7922 1.1469
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>z)
#> (Intercept) 0.46205 0.10201 4.530 5.91e06 ***
#> X 0.78527 0.09296 8.447 < 2e16 ***
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1106.7 on 999 degrees of freedom
#> Residual deviance: 1027.4 on 998 degrees of freedom
#> AIC: 1031.4
#>
#> Number of Fisher Scoring iterations: 4
nlstools::confint2(Logistic_Model)
#> 2.5 % 97.5 %
#> (Intercept) 0.2618709 0.6622204
#> X 0.6028433 0.9676934
OddsRatio < coef(Logistic_Model) %>% exp
OddsRatio
#> (Intercept) X
#> 1.587318 2.192995
Based on the odds ratio, when
 \(x = 0\) , the odds of success of 1.59
 \(x = 1\), the odds of success increase by a factor of 2.19 (i.e., 119.29% increase).
Deviance Tests
 \(H_0\): No variables are related to the response (i.e., model with just the intercept)
 \(H_1\): At least one variable is related to the response
Test_Dev < Logistic_Model$null.deviance  Logistic_Model$deviance
p_val_dev < 1  pchisq(q = Test_Dev, df = 1)
Since we see the pvalue of 0, we reject the null that no variables are related to the response
Deviance residuals
Logistic_Resids < residuals(Logistic_Model, type = "deviance")
plot(
y = Logistic_Resids,
x = BinData$X,
xlab = 'X',
ylab = 'Deviance Resids'
)
However, this plot is not informative. Hence, we can can see the residuals plots that are grouped into bins based on prediction values.
plot_bin < function(Y,
X,
bins = 100,
return.DF = FALSE) {
Y_Name < deparse(substitute(Y))
X_Name < deparse(substitute(X))
Binned_Plot < data.frame(Plot_Y = Y, Plot_X = X)
Binned_Plot$bin <
cut(Binned_Plot$Plot_X, breaks = bins) %>% as.numeric
Binned_Plot_summary < Binned_Plot %>%
group_by(bin) %>%
summarise(
Y_ave = mean(Plot_Y),
X_ave = mean(Plot_X),
Count = n()
) %>% as.data.frame
plot(
y = Binned_Plot_summary$Y_ave,
x = Binned_Plot_summary$X_ave,
ylab = Y_Name,
xlab = X_Name
)
if (return.DF)
return(Binned_Plot_summary)
}
plot_bin(Y = Logistic_Resids,
X = BinData$X,
bins = 100)
We can also see the predicted value against the residuals.
Logistic_Predictions < predict(Logistic_Model, type = "response")
plot_bin(Y = Logistic_Resids, X = Logistic_Predictions, bins = 100)
We can also look at a binned plot of the logistic prediction versus the true category
NumBins < 10
Binned_Data < plot_bin(
Y = BinData$Y,
X = Logistic_Predictions,
bins = NumBins,
return.DF = TRUE
)
Binned_Data
#> bin Y_ave X_ave Count
#> 1 1 0.5833333 0.5382095 72
#> 2 2 0.5200000 0.5795887 75
#> 3 3 0.6567164 0.6156540 67
#> 4 4 0.7014925 0.6579674 67
#> 5 5 0.6373626 0.6984765 91
#> 6 6 0.7500000 0.7373341 72
#> 7 7 0.7096774 0.7786747 93
#> 8 8 0.8503937 0.8203819 127
#> 9 9 0.8947368 0.8601232 133
#> 10 10 0.8916256 0.9004734 203
abline(0, 1, lty = 2, col = 'blue')
Formal deviance test
HosmerLemeshow test
Null hypothesis: the observed events match the expected evens
\[ X^2_{HL} = \sum_{j=1}^{J} \frac{(y_j  m_j \hat{p}_j)^2}{m_j \hat{p}_j(1\hat{p}_j)} \]
where
 within the jth bin, \(y_j\) is the number of successes
 \(m_j\) = number of observations
 \(\hat{p}_j\) = predicted probability
Under the null hypothesis, \(X^2_{HLL} \sim \chi^2_{J1}\)
HL_BinVals <
(Binned_Data$Count * Binned_Data$Y_ave  Binned_Data$Count * Binned_Data$X_ave) ^ 2 / Binned_Data$Count * Binned_Data$X_ave * (1  Binned_Data$X_ave)
HLpval < pchisq(q = sum(HL_BinVals),
df = NumBins,
lower.tail = FALSE)
HLpval
#> [1] 0.9999989
Since \(p\)value = 0.99, we do not reject the null hypothesis (i.e., the model is fitting well).
7.2 Probit Regression
\[ E(Y_i) = p_i = \Phi(\mathbf{x_i'\theta}) \]
where \(\Phi()\) is the CDF of a \(N(0,1)\) random variable.
Other models (e..g, t–distribution; loglog; I complimentary loglog)
We let \(Y_i = 1\) success, \(Y_i =0\) no success.
assume \(Y \sim Ber\) and \(p_i = P(Y_i =1)\), the success probability.
consider a logistic regression with the response function \(logit(p_i) = x'_i \beta\)
Confusion matrix
Predicted  

Truth  1  0 
1  True Positive (TP)  False Negative (FN) 
0  False Positive (FP)  True Negative (TN) 
Sensitivity: ability to identify positive results
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]
Specificity: ability to identify negative results
\[ \text{Specificity} = \frac{TN}{TN + FP} \]
False positive rate: Type I error (1 specificity)
\[ \text{ False Positive Rate} = \frac{FP}{TN+ FP} \]
False Negative Rate: Type II error (1sensitivity)
\[ \text{False Negative Rate} = \frac{FN}{TP + FN} \]
Predicted  

Truth  1  0 
1  Sensitivity  False Negative Rate 
0  False Positive Rate  Specificity 
7.3 Binomial Regression
Binomial
Here, cancer case = successes, and control case = failures.
data("esoph")
head(esoph, n = 3)
#> agegp alcgp tobgp ncases ncontrols
#> 1 2534 039g/day 09g/day 0 40
#> 2 2534 039g/day 1019 0 10
#> 3 2534 039g/day 2029 0 6
plot(
esoph$ncases / (esoph$ncases + esoph$ncontrols) ~ esoph$alcgp,
ylab = "Proportion",
xlab = 'Alcohol consumption',
main = 'Esophageal Cancer data'
)
# only the alcohol consumption as a predictor
model < glm(cbind(ncases, ncontrols) ~ alcgp, data = esoph, family = binomial)
summary(model)
#>
#> Call:
#> glm(formula = cbind(ncases, ncontrols) ~ alcgp, family = binomial,
#> data = esoph)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> 4.0759 1.2037 0.0183 1.0928 3.7336
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>z)
#> (Intercept) 2.5885 0.1925 13.444 < 2e16 ***
#> alcgp4079 1.2712 0.2323 5.472 4.46e08 ***
#> alcgp80119 2.0545 0.2611 7.868 3.59e15 ***
#> alcgp120+ 3.3042 0.3237 10.209 < 2e16 ***
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 367.95 on 87 degrees of freedom
#> Residual deviance: 221.46 on 84 degrees of freedom
#> AIC: 344.51
#>
#> Number of Fisher Scoring iterations: 5
#Coefficient Odds
coefficients(model) %>% exp
#> (Intercept) alcgp4079 alcgp80119 alcgp120+
#> 0.07512953 3.56527094 7.80261593 27.22570533
deviance(model)/df.residual(model)
#> [1] 2.63638
model$aic
#> [1] 344.5109
# alcohol consumption and age as predictors
better_model <
glm(cbind(ncases, ncontrols) ~ agegp + alcgp,
data = esoph,
family = binomial)
summary(better_model)
#>
#> Call:
#> glm(formula = cbind(ncases, ncontrols) ~ agegp + alcgp, family = binomial,
#> data = esoph)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> 2.2395 0.7186 0.2324 0.7930 3.3538
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>z)
#> (Intercept) 6.1472 1.0419 5.900 3.63e09 ***
#> agegp3544 1.6311 1.0800 1.510 0.130973
#> agegp4554 3.4258 1.0389 3.297 0.000976 ***
#> agegp5564 3.9435 1.0346 3.811 0.000138 ***
#> agegp6574 4.3568 1.0413 4.184 2.87e05 ***
#> agegp75+ 4.4242 1.0914 4.054 5.04e05 ***
#> alcgp4079 1.4343 0.2448 5.859 4.64e09 ***
#> alcgp80119 2.0071 0.2776 7.230 4.84e13 ***
#> alcgp120+ 3.6800 0.3763 9.778 < 2e16 ***
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 367.95 on 87 degrees of freedom
#> Residual deviance: 105.88 on 79 degrees of freedom
#> AIC: 238.94
#>
#> Number of Fisher Scoring iterations: 6
better_model$aic #smaller AIC is better
#> [1] 238.9361
coefficients(better_model) %>% exp
#> (Intercept) agegp3544 agegp4554 agegp5564 agegp6574 agegp75+
#> 0.002139482 5.109601844 30.748594216 51.596634690 78.005283850 83.448437749
#> alcgp4079 alcgp80119 alcgp120+
#> 4.196747169 7.441782227 39.646885126
pchisq(
q = model$deviance  better_model$deviance,
df = model$df.residual  better_model$df.residual,
lower = FALSE
)
#> [1] 2.713923e23
# specify link function as probit
Prob_better_model < glm(
cbind(ncases, ncontrols) ~ agegp + alcgp,
data = esoph,
family = binomial(link = probit)
)
summary(Prob_better_model)
#>
#> Call:
#> glm(formula = cbind(ncases, ncontrols) ~ agegp + alcgp, family = binomial(link = probit),
#> data = esoph)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> 2.1325 0.6877 0.1661 0.7654 3.3258
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>z)
#> (Intercept) 3.3741 0.4922 6.855 7.13e12 ***
#> agegp3544 0.8562 0.5081 1.685 0.092003 .
#> agegp4554 1.7829 0.4904 3.636 0.000277 ***
#> agegp5564 2.1034 0.4876 4.314 1.61e05 ***
#> agegp6574 2.3374 0.4930 4.741 2.13e06 ***
#> agegp75+ 2.3694 0.5275 4.491 7.08e06 ***
#> alcgp4079 0.8080 0.1330 6.076 1.23e09 ***
#> alcgp80119 1.1399 0.1558 7.318 2.52e13 ***
#> alcgp120+ 2.1204 0.2060 10.295 < 2e16 ***
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 367.95 on 87 degrees of freedom
#> Residual deviance: 104.48 on 79 degrees of freedom
#> AIC: 237.53
#>
#> Number of Fisher Scoring iterations: 6
7.4 Poisson Regression
From the Poisson distribution
\[ \begin{aligned} f(Y_i) &= \frac{\mu_i^{Y_i}exp(\mu_i)}{Y_i!}, Y_i = 0,1,.. \\ E(Y_i) &= \mu_i \\ var(Y_i) &= \mu_i \end{aligned} \]
which is a natural distribution for counts. We can see that the variance is a function of the mean. If we let \(\mu_i = f(\mathbf{x_i; \theta})\), it would be similar to Logistic Regression since we can choose \(f()\) as \(\mu_i = \mathbf{x_i'\theta}, \mu_i = \exp(\mathbf{x_i'\theta}), \mu_i = \log(\mathbf{x_i'\theta})\)
7.4.1 Application
Count Data and Poisson regression
data(bioChemists, package = "pscl")
bioChemists < bioChemists %>%
rename(
Num_Article = art, #articles in last 3 years of PhD
Sex = fem, #coded 1 if female
Married = mar, #coded 1 if married
Num_Kid5 = kid5, #number of childeren under age 6
PhD_Quality = phd, #prestige of PhD program
Num_MentArticle = ment #articles by mentor in last 3 years
)
hist(bioChemists$Num_Article, breaks = 25, main = 'Number of Articles')
Poisson_Mod < glm(Num_Article ~ ., family=poisson, bioChemists)
summary(Poisson_Mod)
#>
#> Call:
#> glm(formula = Num_Article ~ ., family = poisson, data = bioChemists)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> 3.5672 1.5398 0.3660 0.5722 5.4467
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>z)
#> (Intercept) 0.304617 0.102981 2.958 0.0031 **
#> SexWomen 0.224594 0.054613 4.112 3.92e05 ***
#> MarriedMarried 0.155243 0.061374 2.529 0.0114 *
#> Num_Kid5 0.184883 0.040127 4.607 4.08e06 ***
#> PhD_Quality 0.012823 0.026397 0.486 0.6271
#> Num_MentArticle 0.025543 0.002006 12.733 < 2e16 ***
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 1817.4 on 914 degrees of freedom
#> Residual deviance: 1634.4 on 909 degrees of freedom
#> AIC: 3314.1
#>
#> Number of Fisher Scoring iterations: 5
Residual of 1634 with 909 df isn’t great.
We see Pearson \(\chi^2\)
Predicted_Means < predict(Poisson_Mod,type = "response")
X2 < sum((bioChemists$Num_Article  Predicted_Means)^2/Predicted_Means)
X2
#> [1] 1662.547
pchisq(X2,Poisson_Mod$df.residual, lower.tail = FALSE)
#> [1] 7.849882e47
With interaction terms, there are some improvements
Poisson_Mod_All2way < glm(Num_Article ~ .^2, family=poisson, bioChemists)
Poisson_Mod_All3way < glm(Num_Article ~ .^3, family=poisson, bioChemists)
Consider the \(\hat{\phi} = \frac{\text{deviance}}{df}\)
Poisson_Mod$deviance / Poisson_Mod$df.residual
#> [1] 1.797988
This is evidence for overdispersion. Likely cause is missing variables. And remedies could either be to include more variables or consider random effects.
A quick fix is to force the Poisson Regression to include this value of \(\phi\), and this model is called “QuasiPoisson”.
phi_hat = Poisson_Mod$deviance/Poisson_Mod$df.residual
summary(Poisson_Mod,dispersion = phi_hat)
#>
#> Call:
#> glm(formula = Num_Article ~ ., family = poisson, data = bioChemists)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> 3.5672 1.5398 0.3660 0.5722 5.4467
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>z)
#> (Intercept) 0.30462 0.13809 2.206 0.02739 *
#> SexWomen 0.22459 0.07323 3.067 0.00216 **
#> MarriedMarried 0.15524 0.08230 1.886 0.05924 .
#> Num_Kid5 0.18488 0.05381 3.436 0.00059 ***
#> PhD_Quality 0.01282 0.03540 0.362 0.71715
#> Num_MentArticle 0.02554 0.00269 9.496 < 2e16 ***
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1.797988)
#>
#> Null deviance: 1817.4 on 914 degrees of freedom
#> Residual deviance: 1634.4 on 909 degrees of freedom
#> AIC: 3314.1
#>
#> Number of Fisher Scoring iterations: 5
Or directly rerun the model as
quasiPoisson_Mod < glm(Num_Article ~ ., family=quasipoisson, bioChemists)
QuasiPoisson is not recommended, but Negative Binomial Regression that has an extra parameter to account for overdispersion is.
7.5 Negative Binomial Regression
library(MASS)
NegBinom_Mod < MASS::glm.nb(Num_Article ~ .,bioChemists)
summary(NegBinom_Mod)
#>
#> Call:
#> MASS::glm.nb(formula = Num_Article ~ ., data = bioChemists, init.theta = 2.264387695,
#> link = log)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> 2.1678 1.3617 0.2806 0.4476 3.4524
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>z)
#> (Intercept) 0.256144 0.137348 1.865 0.062191 .
#> SexWomen 0.216418 0.072636 2.979 0.002887 **
#> MarriedMarried 0.150489 0.082097 1.833 0.066791 .
#> Num_Kid5 0.176415 0.052813 3.340 0.000837 ***
#> PhD_Quality 0.015271 0.035873 0.426 0.670326
#> Num_MentArticle 0.029082 0.003214 9.048 < 2e16 ***
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for Negative Binomial(2.2644) family taken to be 1)
#>
#> Null deviance: 1109.0 on 914 degrees of freedom
#> Residual deviance: 1004.3 on 909 degrees of freedom
#> AIC: 3135.9
#>
#> Number of Fisher Scoring iterations: 1
#>
#>
#> Theta: 2.264
#> Std. Err.: 0.271
#>
#> 2 x loglikelihood: 3121.917
We can see the dispersion is 2.264 with SE = 0.271, which is significantly different from 1, indicating overdispersion. Check OverDispersion for more detail
7.6 Multinomial
If we have more than two categories or groups that we want to model relative to covariates (e.g., we have observations \(i = 1,…,n\) and groups/ covariates \(j = 1,2,…,J\)), multinomial is our candidate model
Let
 \(p_{ij}\) be the probability that the ith observation belongs to the jth group
 \(Y_{ij}\) be the number of observations for individual i in group j; An individual will have observations \(Y_{i1},Y_{i2},…Y_{iJ}\)
 assume the probability of observing this response is given by a multinomial distribution in terms of probabilities \(p_{ij}\), where \(\sum_{j = 1}^J p_{ij} = 1\) . For interpretation, we have a baseline category \(p_{i1} = 1  \sum_{j = 2}^J p_{ij}\)
The link between the mean response (probability) \(p_{ij}\) and a linear function of the covariates
\[ \eta_{ij} = \mathbf{x'_i \beta_j} = \log \frac{p_{ij}}{p_{i1}}, j = 2,..,J \]
We compare \(p_{ij}\) to the baseline \(p_{i1}\), suggesting
\[ p_{ij} = \frac{\exp(\eta_{ij})}{1 + \sum_{i=2}^J \exp(\eta_{ij})} \]
which is known as multinomial logistic model.
Note:
 Softmax coding for multinomial logistic regression: rather than selecting a baseline class, we treat all \(K\) class symmetrically  equally important (no baseline).
\[ P(Y = k  X = x) = \frac{exp(\beta_{k1} + \dots + \beta_{k_p x_p})}{\sum_{l = 1}^K exp(\beta_{l0} + \dots + \beta_{l_p x_p})} \]
then the log odds ratio between kth and k’th classes is
\[ \log (\frac{P(Y=kX=x)}{P(Y = k'  X=x)}) = (\beta_{k0}  \beta_{k'0}) + \dots + (\beta_{kp}  \beta_{k'p}) x_p \]
library(faraway)
library(dplyr)
data(nes96, package="faraway")
head(nes96,3)
#> popul TVnews selfLR ClinLR DoleLR PID age educ income vote
#> 1 0 7 extCon extLib Con strRep 36 HS $3Kminus Dole
#> 2 190 1 sliLib sliLib sliCon weakDem 20 Coll $3Kminus Clinton
#> 3 31 7 Lib Lib Con weakDem 24 BAdeg $3Kminus Clinton
We try to understand their political strength
table(nes96$PID)
#>
#> strDem weakDem indDem indind indRep weakRep strRep
#> 200 180 108 37 94 150 175
nes96$Political_Strength < NA
nes96$Political_Strength[nes96$PID %in% c("strDem", "strRep")] <
"Strong"
nes96$Political_Strength[nes96$PID %in% c("weakDem", "weakRep")] <
"Weak"
nes96$Political_Strength[nes96$PID %in% c("indDem", "indind", "indRep")] <
"Neutral"
nes96 %>% group_by(Political_Strength) %>% summarise(Count = n())
#> # A tibble: 3 × 2
#> Political_Strength Count
#> <chr> <int>
#> 1 Neutral 239
#> 2 Strong 375
#> 3 Weak 330
visualize the political strength variable
library(ggplot2)
Plot_DF < nes96 %>%
mutate(Age_Grp = cut_number(age, 4)) %>%
group_by(Age_Grp, Political_Strength) %>%
summarise(count = n()) %>%
group_by(Age_Grp) %>%
mutate(etotal = sum(count), proportion = count / etotal)
Age_Plot < ggplot(
Plot_DF,
aes(
x = Age_Grp,
y = proportion,
group = Political_Strength,
linetype = Political_Strength,
color = Political_Strength
)
) +
geom_line(size = 2)
Age_Plot
Fit the multinomial logistic model:
model political strength as a function of age and education
library(nnet)
Multinomial_Model <
multinom(Political_Strength ~ age + educ, nes96, trace = F)
summary(Multinomial_Model)
#> Call:
#> multinom(formula = Political_Strength ~ age + educ, data = nes96,
#> trace = F)
#>
#> Coefficients:
#> (Intercept) age educ.L educ.Q educ.C educ^4
#> Strong 0.08788729 0.010700364 0.1098951 0.2016197 0.1757739 0.02116307
#> Weak 0.51976285 0.004868771 0.1431104 0.2405395 0.2411795 0.18353634
#> educ^5 educ^6
#> Strong 0.1664377 0.1359449
#> Weak 0.1489030 0.2173144
#>
#> Std. Errors:
#> (Intercept) age educ.L educ.Q educ.C educ^4
#> Strong 0.3017034 0.005280743 0.4586041 0.4318830 0.3628837 0.2964776
#> Weak 0.3097923 0.005537561 0.4920736 0.4616446 0.3881003 0.3169149
#> educ^5 educ^6
#> Strong 0.2515012 0.2166774
#> Weak 0.2643747 0.2199186
#>
#> Residual Deviance: 2024.596
#> AIC: 2056.596
Alternatively, stepwise model selection based AIC
Multinomial_Step < step(Multinomial_Model,trace = 0)
#> trying  age
#> trying  educ
#> trying  age
Multinomial_Step
#> Call:
#> multinom(formula = Political_Strength ~ age, data = nes96, trace = F)
#>
#> Coefficients:
#> (Intercept) age
#> Strong 0.01988977 0.009832916
#> Weak 0.59497046 0.005954348
#>
#> Residual Deviance: 2030.756
#> AIC: 2038.756
compare the best model to the full model based on deviance
pchisq(q = deviance(Multinomial_Step)  deviance(Multinomial_Model),
df = Multinomial_Model$edfMultinomial_Step$edf,lower=F)
#> [1] 0.9078172
We see no significant difference
Plot of the fitted model
PlotData < data.frame(age = seq(from = 19, to = 91))
Preds <
PlotData %>% bind_cols(data.frame(predict(
object = Multinomial_Step,
PlotData, type = "probs"
)))
plot(
x = Preds$age,
y = Preds$Neutral,
type = "l",
ylim = c(0.2, 0.6),
col = "black",
ylab = "Proportion",
xlab = "Age"
)
lines(x = Preds$age,
y = Preds$Weak,
col = "blue")
lines(x = Preds$age,
y = Preds$Strong,
col = "red")
legend(
'topleft',
legend = c('Neutral', 'Weak', 'Strong'),
col = c('black', 'blue', 'red'),
lty = 1
)
predict(Multinomial_Step,data.frame(age = 34)) # predicted result (categoriy of political strength) of 34 year old
#> [1] Weak
#> Levels: Neutral Strong Weak
predict(Multinomial_Step,data.frame(age = c(34,35)),type="probs") # predicted result of the probabilities of each level of political strength for a 34 and 35
#> Neutral Strong Weak
#> 1 0.2597275 0.3556910 0.3845815
#> 2 0.2594080 0.3587639 0.3818281
If categories are ordered (i.e., ordinal data), we must use another approach (still multinomial, but use cumulative probabilities).
Another example
library(agridat)
dat < agridat::streibig.competition
# See Schaberger and Pierce, pages 370+
# Consider only the monospecies barley data (no competition from Sinapis)
gammaDat < subset(dat, sseeds < 1)
gammaDat <
transform(gammaDat,
x = bseeds,
y = bdwt,
block = factor(block))
# Inverse yield looks like it will be a good fit for Gamma's inverse link
ggplot(gammaDat, aes(x = x, y = 1 / y)) +
geom_point(aes(color = block, shape = block)) +
xlab('Seeding Rate') +
ylab('Inverse yield') +
ggtitle('Streibig Competion  Barley only')
\[ Y \sim Gamma \]
because Gamma is nonnegative as opposed to Normal. The canonical Gamma link function is the inverse (or reciprocal) link
\[ \begin{aligned} \eta_{ij} &= \beta_{0j} + \beta_{1j}x_{ij} + \beta_2x_{ij}^2 \\ Y_{ij} &= \eta_{ij}^{1} \end{aligned} \]
The linear predictor is a quadratic model fit to each of the jth blocks. A different model (not fitted) could be one with common slopes: glm(y ~ x + I(x^2),…)
# linear predictor is quadratic, with separate intercept and slope per block
m1 <
glm(y ~ block + block * x + block * I(x ^ 2),
data = gammaDat,
family = Gamma(link = "inverse"))
summary(m1)
#>
#> Call:
#> glm(formula = y ~ block + block * x + block * I(x^2), family = Gamma(link = "inverse"),
#> data = gammaDat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> 1.21708 0.44148 0.02479 0.17999 0.80745
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>t)
#> (Intercept) 1.115e01 2.870e02 3.886 0.000854 ***
#> blockB2 1.208e02 3.880e02 0.311 0.758630
#> blockB3 2.386e02 3.683e02 0.648 0.524029
#> x 2.075e03 1.099e03 1.888 0.072884 .
#> I(x^2) 1.372e05 9.109e06 1.506 0.146849
#> blockB2:x 5.198e04 1.468e03 0.354 0.726814
#> blockB3:x 7.475e04 1.393e03 0.537 0.597103
#> blockB2:I(x^2) 5.076e06 1.184e05 0.429 0.672475
#> blockB3:I(x^2) 6.651e06 1.123e05 0.592 0.560012
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for Gamma family taken to be 0.3232083)
#>
#> Null deviance: 13.1677 on 29 degrees of freedom
#> Residual deviance: 7.8605 on 21 degrees of freedom
#> AIC: 225.32
#>
#> Number of Fisher Scoring iterations: 5
For predict new value of \(x\)
newdf <
expand.grid(x = seq(0, 120, length = 50), block = factor(c('B1', 'B2', 'B3')))
newdf$pred < predict(m1, new = newdf, type = 'response')
ggplot(gammaDat, aes(x = x, y = y)) +
geom_point(aes(color = block, shape = block)) +
xlab('Seeding Rate') + ylab('Inverse yield') +
ggtitle('Streibig Competion  Barley only Predictions') +
geom_line(data = newdf, aes(
x = x,
y = pred,
color = block,
linetype = block
))
7.7 Generalization
We can see that Poisson regression looks similar to logistic regression. Hence, we can generalize to a class of modeling. Thanks to Nelder and Wedderburn (1972), we have the generalized linear models (GLMs). Estimation is generalize in these models.
Exponential Family
The theory of GLMs is developed for data with distribution given y the exponential family.
The form of the data distribution that is useful for GLMs is
\[ f(y;\theta, \phi) = \exp(\frac{\theta y  b(\theta)}{a(\phi)} + c(y, \phi)) \]
where
 \(\theta\) is called the natural parameter
 \(\phi\) is called the dispersion parameter
Note:
This family includes the Gamma, Normal, Poisson, and other. For all parameterization of the exponential family, check this link
Example
if we have \(Y \sim N(\mu, \sigma^2)\)
\[ \begin{aligned} f(y; \mu, \sigma^2) &= \frac{1}{(2\pi \sigma^2)^{1/2}}\exp(\frac{1}{2\sigma^2}(y \mu)^2) \\ &= \exp(\frac{1}{2\sigma^2}(y^2  2y \mu +\mu^2) \frac{1}{2}\log(2\pi \sigma^2)) \\ &= \exp(\frac{y \mu  \mu^2/2}{\sigma^2}  \frac{y^2}{2\sigma^2}  \frac{1}{2}\log(2\pi \sigma^2)) \\ &= \exp(\frac{\theta y  b(\theta)}{a(\phi)} + c(y , \phi)) \end{aligned} \]
where
 \(\theta = \mu\)
 \(b(\theta) = \frac{\mu^2}{2}\)
 \(a(\phi) = \sigma^2 = \phi\)
 \(c(y , \phi) =  \frac{1}{2}(\frac{y^2}{\phi}+\log(2\pi \sigma^2))\)
Properties of GLM exponential families
\(E(Y) = b' (\theta)\) where \(b'(\theta) = \frac{\partial b(\theta)}{\partial \theta}\) (here
'
is “prime”, not transpose)
\(var(Y) = a(\phi)b''(\theta)= a(\phi)V(\mu)\).
 \(V(\mu)\) is the variance function; however, it is only the variance in the case that \(a(\phi) =1\)
If \(a(), b(), c()\) are identifiable, we will derive expected value and variance of Y.
Example
Normal distribution
\[ \begin{aligned} b'(\theta) &= \frac{\partial b(\mu^2/2)}{\partial \mu} = \mu \\ V(\mu) &= \frac{\partial^2 (\mu^2/2)}{\partial \mu^2} = 1 \\ \to var(Y) &= a(\phi) = \sigma^2 \end{aligned} \]
Poisson distribution
\[ \begin{aligned} f(y, \theta, \phi) &= \frac{\mu^y \exp(\mu)}{y!} \\ &= \exp(y\log(\mu)  \mu  \log(y!)) \\ &= \exp(y\theta  \exp(\theta)  \log(y!)) \end{aligned} \]
where
 \(\theta = \log(\mu)\)
 \(a(\phi) = 1\)
 \(b(\theta) = \exp(\theta)\)
 \(c(y, \phi) = \log(y!)\)
Hence,
\[ \begin{aligned} E(Y) = \frac{\partial b(\theta)}{\partial \theta} = \exp(\theta) &= \mu \\ var(Y) = \frac{\partial^2 b(\theta)}{\partial \theta^2} &= \mu \end{aligned} \]
Since \(\mu = E(Y) = b'(\theta)\)
In GLM, we take some monotone function (typically nonlinear) of \(\mu\) to be linear in the set of covariates
\[ g(\mu) = g(b'(\theta)) = \mathbf{x'\beta} \]
Equivalently,
\[ \mu = g^{1}(\mathbf{x'\beta}) \]
where \(g(.)\) is the link function since it links mean response (\(\mu = E(Y)\)) and a linear expression of the covariates
Some people use \(\eta = \mathbf{x'\beta}\) where \(\eta\) = the “linear predictor”
GLM is composed of 2 components
The random component:
is the distribution chosen to model the response variables \(Y_1,...,Y_n\)
is specified by the choice fo \(a(), b(), c()\) in the exponential form

Notation:
 Assume that there are n independent response variables \(Y_1,...,Y_n\) with densities
\[ f(y_i ; \theta_i, \phi) = \exp(\frac{\theta_i y_i  b(\theta_i)}{a(\phi)}+ c(y_i, \phi)) \] notice each observation might have different densities  Assume that \(\phi\) is constant for all \(i = 1,...,n\), but \(\theta_i\) will vary. \(\mu_i = E(Y_i)\) for all i.
 Assume that there are n independent response variables \(Y_1,...,Y_n\) with densities
The systematic component
is the portion of the model that gives the relation between \(\mu\) and the covariates \(\mathbf{x}\)

consists of 2 parts:
 the link function, \(g(.)\)
 the linear predictor, \(\eta = \mathbf{x'\beta}\)

Notation:
 assume \(g(\mu_i) = \mathbf{x'\beta} = \eta_i\) where \(\mathbf{\beta} = (\beta_1,..., \beta_p)'\)
 The parameters to be estimated are \(\beta_1,...\beta_p , \phi\)
The Canonical Link
To choose \(g(.)\), we can use canonical link function (Remember: Canonical link is just a special case of the link function)
If the link function \(g(.)\) is such \(g(\mu_i) = \eta_i = \theta_i\), the natural parameter, then \(g(.)\) is the canonical link.
 \(b(\theta)\) = cumulant moment generating function
 \(g(\mu)\) is the link function, which relates the linear predictor to the mean and is required to be monotone increasing, continuously differentiable and invertible.
Equivalently, we can think of canonical link function as
\[ \gamma^{1} \circ g^{1} = I \] which is the identity. Hence,
\[ \theta = \eta \]
The inverse link
\(g^{1}(.)\) is also known as the mean function, take linear predictor output (ranging from \(\infty\) to \(\infty\)) and transform it into a different scale.
Exponential: converts \(\mathbf{\beta X}\) into a curve that is restricted between 0 and \(\infty\) (which you can see that is useful in case you want to convert a linear predictor into a nonnegative value). \(\lambda = \exp(y) = \mathbf{\beta X}\)

Inverse Logit (also known as logistic): converts \(\mathbf{\beta X}\) into a curve that is restricted between 0 and 1, which is useful in case you want to convert a linear predictor to a probability. \(\theta = \frac{1}{1 + \exp(y)} = \frac{1}{1 + \exp( \mathbf{\beta X})}\)
 \(y\) = linear predictor value
 \(\theta\) = transformed value
The identity link is that
\[ \begin{aligned} \eta_i &= g(\mu_i) = \mu_i \\ \mu_i &= g^{1}(\eta_i) = \eta_i \end{aligned} \]
Table 15.1 Generalized Linear Models 15.1 the Structure of Generalized Linear Models
More example on the link functions and their inverses can be found on page 380
Example
Normal random component
Mean Response: \(\mu_i = \theta_i\)
Canonical Link: \(g( \mu_i) = \mu_i\) (the identity link)
Binomial random component
Mean Response: \(\mu_i = \frac{n_i \exp( \theta)}{1+\exp (\theta_i)}\) and \(\theta(\mu_i) = \log(\frac{p_i }{1p_i}) = \log (\frac{\mu_i} {n_i  \mu_i})\)
Canonical link: \(g(\mu_i) = \log(\frac{\mu_i} {n_i  \mu_i})\) (logit link)
Poisson random component
Mean Response: \(\mu_i = \exp(\theta_i)\)
Canonical Link: \(g(\mu_i) = \log(\mu_i)\)
Gamma random component:
Mean response: \(\mu_i = \frac{1}{\theta_i}\) and \(\theta(\mu_i) =  \mu_i^{1}\)
Canonical Link: \(g(\mu\_i) =  \frac{1}{\mu_i}\)
Inverse Gaussian random
 Canonical Link: \(g(\mu_i) = \frac{1}{\mu_i^2}\)
7.7.1 Estimation
 MLE for parameters of the systematic component (\(\beta\))
 Unification of derivation and computation (thanks to the exponential forms)
 No unification for estimation of the dispersion parameter (\(\phi\))
7.7.1.1 Estimation of \(\beta\)
We have
\[ \begin{aligned} f(y_i ; \theta_i, \phi) &= \exp(\frac{\theta_i y_i  b(\theta_i)}{a(\phi)}+ c(y_i, \phi)) \\ E(Y_i) &= \mu_i = b'(\theta) \\ var(Y_i) &= b''(\theta)a(\phi) = V(\mu_i)a(\phi) \\ g(\mu_i) &= \mathbf{x}_i'\beta = \eta_i \end{aligned} \]
If the loglikelihood for a single observation is \(l_i (\beta,\phi)\). The loglikelihood for all n observations is
\[ \begin{aligned} l(\beta,\phi) &= \sum_{i=1}^n l_i (\beta,\phi) \\ &= \sum_{i=1}^n (\frac{\theta_i y_i  b(\theta_i)}{a(\phi)}+ c(y_i, \phi)) \end{aligned} \]
Using MLE to find \(\beta\), we use the chain rule to get the derivatives
\[ \begin{aligned} \frac{\partial l_i (\beta,\phi)}{\partial \beta_j} &= \frac{\partial l_i (\beta, \phi)}{\partial \theta_i} \times \frac{\partial \theta_i}{\partial \mu_i} \times \frac{\partial \mu_i}{\partial \eta_i}\times \frac{\partial \eta_i}{\partial \beta_j} \\ &= \sum_{i=1}^{n}(\frac{ y_i  \mu_i}{a(\phi)} \times \frac{1}{V(\mu_i)} \times \frac{\partial \mu_i}{\partial \eta_i} \times x_{ij}) \end{aligned} \]
If we let
\[ w_i \equiv ((\frac{\partial \eta_i}{\partial \mu_i})^2 V(\mu_i))^{1} \]
Then,
\[ \frac{\partial l_i (\beta,\phi)}{\partial \beta_j} = \sum_{i=1}^n (\frac{y_i \mu_i}{a(\phi)} \times w_i \times \frac{\partial \eta_i}{\partial \mu_i} \times x_{ij}) \]
We can also get the second derivatives using the chain rule.
Example:
For the \[NewtonRaphson\] algorithm, we need
\[  E(\frac{\partial^2 l(\beta,\phi)}{\partial \beta_j \partial \beta_k}) \]
where \((j,k)\)th element of the Fisher information matrix \(\mathbf{I}(\beta)\)
Hence,
\[  E(\frac{\partial^2 l(\beta,\phi)}{\partial \beta_j \partial \beta_k}) = \sum_{i=1}^n \frac{w_i}{a(\phi)}x_{ij}x_{ik} \]
for the (j,k)th element
If Bernoulli model with logit link function (which is the canonical link)
\[ \begin{aligned} b(\theta) &= \log(1 + \exp(\theta)) = \log(1 + \exp(\mathbf{x'\beta})) \\ a(\phi) &= 1 \\ c(y_i, \phi) &= 0 \\ E(Y) = b'(\theta) &= \frac{\exp(\theta)}{1 + \exp(\theta)} = \mu = p \\ \eta = g(\mu) &= \log(\frac{\mu}{1\mu}) = \theta = \log(\frac{p}{1p}) = \mathbf{x'\beta} \end{aligned} \]
For \(Y_i\), i = 1,.., the loglikelihood is
\[ l_i (\beta, \phi) = \frac{y_i \theta_i  b(\theta_i)}{a(\phi)} + c(y_i, \phi) = y_i \mathbf{x}'_i \beta  \log(1+ \exp(\mathbf{x'\beta})) \]
Additionally,
\[ \begin{aligned} V(\mu_i) &= \mu_i(1\mu_i)= p_i (1p_i) \\ \frac{\partial \mu_i}{\partial \eta_i} &= p_i(1p_i) \end{aligned} \]
Hence,
\[ \begin{aligned} \frac{\partial l(\beta, \phi)}{\partial \beta_j} &= \sum_{i=1}^n[\frac{y_i  \mu_i}{a(\phi)} \times \frac{1}{V(\mu_i)}\times \frac{\partial \mu_i}{\partial \eta_i} \times x_{ij}] \\ &= \sum_{i=1}^n (y_i  p_i) \times \frac{1}{p_i(1p_i)} \times p_i(1p_i) \times x_{ij} \\ &= \sum_{i=1}^n (y_i  p_i) x_{ij} \\ &= \sum_{i=1}^n (y_i  \frac{\exp(\mathbf{x'_i\beta})}{1+ \exp(\mathbf{x'_i\beta})})x_{ij} \end{aligned} \]
then
\[ w_i = ((\frac{\partial \eta_i}{\partial \mu_i})^2 V(\mu_i))^{1} = p_i (1p_i) \]
\[ \mathbf{I}_{jk}(\mathbf{\beta}) = \sum_{i=1}^n \frac{w_i}{a(\phi)} x_{ij}x_{ik} = \sum_{i=1}^n p_i (1p_i)x_{ij}x_{ik} \]
The Fisherscoring algorithm for the MLE of \(\mathbf{\beta}\) is
\[ \left( \begin{array} {c} \beta_1 \\ \beta_2 \\ . \\ . \\ . \\ \beta_p \\ \end{array} \right)^{(m+1)} = \left( \begin{array} {c} \beta_1 \\ \beta_2 \\ . \\ . \\ . \\ \beta_p \\ \end{array} \right)^{(m)} + \mathbf{I}^{1}(\mathbf{\beta}) \left( \begin{array} {c} \frac{\partial l (\beta, \phi)}{\partial \beta_1} \\ \frac{\partial l (\beta, \phi)}{\partial \beta_2} \\ . \\ . \\ . \\ \frac{\partial l (\beta, \phi)}{\partial \beta_p} \\ \end{array} \right)_{\beta = \beta^{(m)}} \]
Similar to \[NewtonRaphson\] expect the matrix of second derivatives by the expected value of the second derivative matrix.
In matrix notation,
\[ \begin{aligned} \frac{\partial l }{\partial \beta} &= \frac{1}{a(\phi)}\mathbf{X'W\Delta(y  \mu)} \\ &= \frac{1}{a(\phi)}\mathbf{F'V^{1}(y  \mu)} \\ \end{aligned} \]
\[ \mathbf{I}(\beta) = \frac{1}{a(\phi)}\mathbf{X'WX} = \frac{1}{a(\phi)}\mathbf{F'V^{1}F} \]
where
 \(\mathbf{X}\) is an \(n \times p\) matrix of covariates
 \(\mathbf{W}\) is an \(n \times n\) diagonal matrix with \((i,i)\)th element given by \(w_i\)
 \(\mathbf{\Delta}\) an \(n \times n\) diagonal matrix with \((i,i)\)th element given by \(\frac{\partial \eta_i}{\partial \mu_i}\)
 \(\mathbf{F} = \mathbf{\frac{\partial \mu}{\partial \beta}}\) an \(n \times p\) matrix with \(i\)th row \(\frac{\partial \mu_i}{\partial \beta} = (\frac{\partial \mu_i}{\partial \eta_i})\mathbf{x}'_i\)
 \(\mathbf{V}\) an \(n \times n\) diagonal matrix with \((i,i)\)th element given by \(V(\mu_i)\)
Setting the derivative of the loglikelihood equal to 0, ML estimating equations are
\[ \mathbf{F'V^{1}y= F'V^{1}\mu} \]
where all components of this equation expect y depends on the parameters \(\beta\)
Special Cases
If one has a canonical link, the estimating equations reduce to
\[ \mathbf{X'y= X'\mu} \]
If one has an identity link, then
\[ \mathbf{X'V^{1}y = X'V^{1}X\hat{\beta}} \]
which gives the generalized least squares estimator
Generally, we can rewrite the Fisherscoring algorithm as
\[ \beta^{(m+1)} = \beta^{(m)} + \mathbf{(\hat{F}'\hat{V}^{1}\hat{F})^{1}\hat{F}'\hat{V}^{1}(y \hat{\mu})} \]
Since \(\hat{F},\hat{V}, \hat{\mu}\) depend on \(\beta\), we evaluate at \(\beta^{(m)}\)
From starting values \(\beta^{(0)}\), we can iterate until convergence.
Notes:
 if \(a(\phi)\) is a constant or of the form \(m_i \phi\) with known \(m_i\), then \(\phi\) cancels.
7.7.1.2 Estimation of \(\phi\)
2 approaches:
 MLE
\[ \frac{\partial l_i}{\partial \phi} = \frac{(\theta_i y_i  b(\theta_i)a'(\phi))}{a^2(\phi)} + \frac{\partial c(y_i,\phi)}{\partial \phi} \]
the MLE of \(\phi\) solves
\[ \frac{a^2(\phi)}{a'(\phi)}\sum_{i=1}^n \frac{\partial c(y_i, \phi)}{\partial \phi} = \sum_{i=1}^n(\theta_i y_i  b(\theta_i)) \]
Situation others than normal error case, expression for \(\frac{\partial c(y,\phi)}{\partial \phi}\) are not simple
Even for the canonical link and \(a(\phi)\) constant, there is no nice general expression for \(E(\frac{\partial^2 l}{\partial \phi^2})\), so the unification GLMs provide for estimation of \(\beta\) breaks down for \(\phi\)

Moment Estimation (“Bias Corrected \(\chi^2\)”)
 The MLE is not conventional approach to estimation of \(\phi\) in GLMS.
 For the exponential family \(var(Y) =V(\mu)a(\phi)\). This implies
\[ \begin{aligned} a(\phi) &= \frac{var(Y)}{V(\mu)} = \frac{E(Y \mu)^2}{V(\mu)} \\ a(\hat{\phi}) &= \frac{1}{np} \sum_{i=1}^n \frac{(y_i \hat{\mu}_i)^2}{V(\hat{\mu})} \end{aligned} \] where \(p\) is the dimension of \(\beta\)  GLM with canonical link function \(g(.)= (b'(.))^{1}\)
\[ \begin{aligned} g(\mu) &= \theta = \eta = \mathbf{x'\beta} \\ \mu &= g^{1}(\eta)= b'(\eta) \end{aligned} \]  so the method estimator for \(a(\phi)=\phi\) is
\[ \hat{\phi} = \frac{1}{np} \sum_{i=1}^n \frac{(y_i  g^{1}(\hat{\eta}_i))^2}{V(g^{1}(\hat{\eta}_i))} \]
7.7.2 Inference
We have
\[ \hat{var}(\beta) = a(\phi)(\mathbf{\hat{F}'\hat{V}\hat{F}})^{1} \]
where
 \(\mathbf{V}\) is an \(n \times n\) diagonal matrix with diagonal elements given by \(V(\mu_i)\)
 \(\mathbf{F}\) is an \(n \times p\) matrix given by \(\mathbf{F} = \frac{\partial \mu}{\partial \beta}\)
 Both \(\mathbf{V,F}\) are dependent on the mean \(\mu\), and thus \(\beta\). Hence, their estimates (\(\mathbf{\hat{V},\hat{F}}\)) depend on \(\hat{\beta}\).
\[ H_0: \mathbf{L\beta = d} \]
where \(\mathbf{L}\) is a q x p matrix with a Wald test
\[ W = \mathbf{(L \hat{\beta}d)'(a(\phi)L(\hat{F}'\hat{V}^{1}\hat{F})L')^{1}(L \hat{\beta}d)} \]
which follows \(\chi_q^2\) distribution (asymptotically), where \(q\) is the rank of \(\mathbf{L}\)
In the simple case \(H_0: \beta_j = 0\) gives \(W = \frac{\hat{\beta}^2_j}{\hat{var}(\hat{\beta}_j)} \sim \chi^2_1\) asymptotically
Likelihood ratio test
\[ \Lambda = 2 (l(\hat{\beta}_f)l(\hat{\beta}_r)) \sim \chi^2_q \]
where
 \(q\) is the number of constraints used to fit the reduced model \(\hat{\beta}_r\), and \(\hat{\beta}_r\) is the fit under the full model.
Wald test is easier to implement, but likelihood ratio test is better (especially for small samples).
7.7.3 Deviance
Deviance is necessary for goodness of fit, inference and for alternative estimation of the dispersion parameter. We define and consider Deviance from a likelihood ratio perspective.
Assume that \(\phi\) is known. Let \(\tilde{\theta}\) denote the full and \(\hat{\theta}\) denote the reduced model MLEs. Then, the likelihood ratio (2 times the difference in loglikelihoods) is \[ 2\sum_{i=1}^{n} \frac{y_i (\tilde{\theta}_i \hat{\theta}_i)b(\tilde{\theta}_i) + b(\hat{\theta}_i)}{a_i(\phi)} \]
For exponential families, \(\mu = E(y) = b'(\theta)\), so the natural parameter is a function of \(\mu: \theta = \theta(\mu) = b'^{1}(\mu)\), and the likelihood ratio turns into
\[ 2 \sum_{i=1}^m \frac{y_i\{\theta(\tilde{\mu}_i  \theta(\hat{\mu}_i)\}  b(\theta(\tilde{\mu}_i)) + b(\theta(\hat{\mu}_i))}{a_i(\phi)} \]Comparing a fitted model to “the fullest possible model”, which is the saturated model: \(\tilde{\mu}_i = y_i\), i = 1,..,n. If \(\tilde{\theta}_i^* = \theta(y_i), \hat{\theta}_i^* = \theta (\hat{\mu})\), the likelihood ratio is
\[ 2 \sum_{i=1}^{n} \frac{y_i (\tilde{\theta}_i^*  \hat{\theta}_i^* + b(\hat{\theta}_i^*))}{a_i(\phi)} \](McCullagh 2019) specify \(a(\phi) = \phi\), then the likelihood ratio can be written as
\[ D^*(\mathbf{y, \hat{\mu}}) = \frac{2}{\phi}\sum_{i=1}^n\{y_i (\tilde{\theta}_i^* \hat{\theta}_i^*) b(\tilde{\theta}_i^*) +b(\hat{\theta}_i^*) \} \] where\(D^*(\mathbf{y, \hat{\mu}})\) = scaled deviance
\(D(\mathbf{y, \hat{\mu}}) = \phi D^*(\mathbf{y, \hat{\mu}})\) = deviance
Note:

in some random component distributions, we can write \(a_i(\phi) = \phi m_i\), where

\(m_i\) is some known scalar that may change with the observations. Then, the scaled deviance components are divided by \(m_i\):
\[ D^*(\mathbf{y, \hat{\mu}}) \equiv 2\sum_{i=1}^n\{y_i (\tilde{\theta}_i^* \hat{\theta}_i^*) b(\tilde{\theta}_i^*) +b(\hat{\theta}_i^*)\} / (\phi m_i) \]

\(m_i\) is some known scalar that may change with the observations. Then, the scaled deviance components are divided by \(m_i\):
\(D^*(\mathbf{y, \hat{\mu}}) = \sum_{i=1}^n d_i\)m where \(d_i\) is the deviance contribution from the \(i\)th observation.
\(D\) is used in model selection
\(D^*\) is used in goodness of fit tests (as it is a likelihood ratio statistic). \[ D^*(\mathbf{y, \hat{\mu}}) = 2\{l(\mathbf{y,\tilde{\mu}})l(\mathbf{y,\hat{\mu}})\} \]
\(d_i\) are used to form deviance residuals
Normal
We have
\[ \begin{aligned} \theta &= \mu \\ \phi &= \sigma^2 \\ b(\theta) &= \frac{1}{2} \theta^2 \\ a(\phi) &= \phi \end{aligned} \]
Hence,
\[ \begin{aligned} \tilde{\theta}_i &= y_i \\ \hat{\theta}_i &= \hat{\mu}_i = g^{1}(\hat{\eta}_i) \end{aligned} \]
And
\[ \begin{aligned} D &= 2 \sum_{1=1}^n Y^2_i  y_i \hat{\mu}_i  \frac{1}{2}y^2_i + \frac{1}{2} \hat{\mu}_i^2 \\ &= \sum_{i=1}^n y_i^2  2y_i \hat{\mu}_i + \hat{\mu}_i^2 \\ &= \sum_{i=1}^n (y_i  \hat{\mu}_i)^2 \end{aligned} \]
which is the residual sum of squares
Poisson
\[ \begin{aligned} f(y) &= \exp\{y\log(\mu)  \mu  \log(y!)\} \\ \theta &= \log(\mu) \\ b(\theta) &= \exp(\theta) \\ a(\phi) &= 1 \\ \tilde{\theta}_i &= \log(y_i) \\ \hat{\theta}_i &= \log(\hat{\mu}_i) \\ \hat{\mu}_i &= g^{1}(\hat{\eta}_i) \end{aligned} \]
Then,
\[ \begin{aligned} D &= 2 \sum_{i = 1}^n y_i \log(y_i)  y_i \log(\hat{\mu}_i)  y_i + \hat{\mu}_i \\ &= 2 \sum_{i = 1}^n y_i \log(\frac{y_i}{\hat{\mu}_i})  (y_i  \hat{\mu}_i) \end{aligned} \]
and
\[ d_i = 2\{y_i \log(\frac{y_i}{\hat{\mu}}) (y_i  \hat{\mu}_i)\} \]
7.7.3.1 Analysis of Deviance
The difference in deviance between a reduced and full model, where q is the difference in the number of free parameters, has an asymptotic \(\chi^2_q\). The likelihood ratio test
\[ D^*(\mathbf{y;\hat{\mu}_r})  D^*(\mathbf{y;\hat{\mu}_f}) = 2\{l(\mathbf{y;\hat{\mu}_f})l(\mathbf{y;\hat{\mu}_r})\} \]
this comparison of models is Analysis of Deviance. GLM uses this analysis for model selection.
An estimation of \(\phi\) is
\[ \hat{\phi} = \frac{D(\mathbf{y, \hat{\mu}})}{n  p} \]
where \(p\) = number of parameters fit.
Excessive use of \(\chi^2\) test could be problematic since it is asymptotic (McCullagh 2019)
7.7.3.2 Deviance Residuals
We have \(D = \sum_{i=1}^{n}d_i\). Then, we define deviance residuals
\[ r_{D_i} = \text{sign}(y_i \hat{\mu}_i)\sqrt{d_i} \]
Standardized version of deviance residuals is
\[ r_{s,i} = \frac{y_i \hat{\mu}}{\hat{\sigma}(1h_{ii})^{1/2}} \]
Let \(\mathbf{H^{GLM} = W^{1/2}X(X'WX)^{1}X'W^{1/2}}\), where \(\mathbf{W}\) is an \(n \times n\) diagonal matrix with \((i,i)\)th element given by \(w_i\) (see Estimation of \(\beta\)). Then Standardized deviance residuals is equivalently
\[ r_{s, D_i} = \frac{r_{D_i}}{\{\hat{\phi}(1h_{ii}^{glm}\}^{1/2}} \]
where \(h_{ii}^{glm}\) is the \(i\)th diagonal of \(\mathbf{H}^{GLM}\)
7.7.3.3 Pearson Chisquare Residuals
Another \(\chi^2\) statistic is Pearson \(\chi^2\) statistics: (assume \(m_i = 1\))
\[ X^2 = \sum_{i=1}^{n} \frac{(y_i  \hat{\mu}_i)^2}{V(\hat{\mu}_i)} \]
where \(\hat{\mu}_i\) is the fitted mean response fo the model of interest.
The Scaled Pearson \(\chi^2\) statistic is given by \(\frac{X^2}{\phi} \sim \chi^2_{np}\) where p is the number of parameters estimated. Hence, the Pearson \(\chi^2\) residuals are
\[ X^2_i = \frac{(y_i  \hat{\mu}_i)^2}{V(\hat{\mu}_i)} \]
If we have the following assumptions:
 Independent samples
 No overdispersion: If \(\phi = 1\), \(\frac{D(\mathbf{y;\hat{\mu}})}{np}\) and \(\frac{X^2}{np}\) have a value substantially larger 1 indicates improperly specified model or overdispersion
 Multiple groups
then \(\frac{X^2}{\phi}\) and \(D^*(\mathbf{y; \hat{\mu}})\) both follow \(\chi^2_{np}\)
7.7.4 Diagnostic Plots

Standardized residual Plots:
 plot(\(r_{s, D_i}\), \(\hat{\mu}_i\)) or plot(\(r_{s, D_i}\), \(T(\hat{\mu}_i)\)) where \(T(\hat{\mu}_i)\) is transformation(\(\hat{\mu}_i\)) called constant information scale:
 plot(\(r_{s, D_i}\), \(\hat{\eta}_i\))
 plot(\(r_{s, D_i}\), \(\hat{\mu}_i\)) or plot(\(r_{s, D_i}\), \(T(\hat{\mu}_i)\)) where \(T(\hat{\mu}_i)\) is transformation(\(\hat{\mu}_i\)) called constant information scale:
Random Component  \(T(\hat{\mu}_i)\) 

Normal  \(\hat{\mu}\) 
Poisson  \(2\sqrt{\mu}\) 
Binomial  \(2 \sin^{1}(\sqrt{\hat{\mu}})\) 
Gamma  \(2 \log(\hat{\mu})\) 
Inverse Gaussian  \(2\hat{\mu}^{1/2}\) 

If we see:
 Trend, it means we might have a wrong link function, or choice of scale
 Systematic change in range of residuals with a change in \(T(\hat{\mu})\) (incorrect random component) (systematic \(\neq\) random)
 Trend, it means we might have a wrong link function, or choice of scale
plot(\(r_{D_i},\hat{\mu}_i\)) to check Variance Function.
7.7.5 Goodness of Fit
To assess goodness of fit, we can use
In nested model, we could use likelihoodbased information measures:
\[ \begin{aligned} AIC &= 2l(\mathbf{\hat{\mu}}) + 2p \\ AICC &= 2l(\mathbf{\hat{\mu}}) + 2p(\frac{n}{np1}) \\ BIC &= 2l(\hat{\mu}) + p \log(n) \end{aligned} \]
where
 \(l(\hat{\mu})\) is the loglikelihood evaluated at the parameter estimates
 \(p\) is the number of parameters
 \(n\) is the number of observations.
Note: you have to use the same data with the same model (i.e., same link function, same random underlying random distribution). but you can have different number of parameters.
Even though statisticians try to come up with measures that are similar to \(R^2\), in practice, it is not so appropriate. For example, they compare the loglikelihood of the fitted model against the that of a model with just the intercept:
\[ R^2_p = 1  \frac{l(\hat{\mu})}{l(\hat{\mu}_0)} \]
For certain specific random components such as binary response model, we have rescaled generalized \(R^2\)
\[ \bar{R}^2 = \frac{R^2_*}{\max(R^2_*)} = \frac{1\exp\{\frac{2}{n}(l(\hat{\mu})  l(\hat{\mu}_0) \}}{1  \exp\{\frac{2}{n}l(\hat{\mu}_0)\}} \]
7.7.6 OverDispersion
Random Components  \(var(Y)\)  \(V(\mu)\) 

Binomial  \(var(Y) = n \mu (1 \mu)\)  \(V(\mu) = \phi n \mu(1 \mu)\) where \(m_i =n\) 
Poisson  \(var(Y) = \mu\)  \(V(\mu) = \phi \mu\) 
In both cases \(\phi = 1\). Recall \(b''(\theta)= V(\mu)\) check Estimation of \(\phi\).
If we find
 \(\phi >1\): overdispersion (i.e., too much variation for an independent binomial or Poisson distribution).
 \(\phi<1\): underdispersion (i.e., too little variation for an independent binomial or Poisson distribution).
If we have either over or underdispersion, it means we might have unspecified random component, we could
 Select a different random component distribution that can accommodate over or underdispersion (e.g., negative binomial, ConwayMaxwell Poisson)
 use Nonlinear and Generalized Linear Mixed Models to handle random effects in generalized linear models.