## 7.4 Poisson Regression

From the Poisson distribution

$f(Y_i) = \frac{\mu_i^{Y_i}exp(-\mu_i)}{Y_i!}, Y_i = 0,1,.. \\ E(Y_i) = \mu_i \\ var(Y_i) = \mu_i$

which is a natural distribution for counts. We can see that the variance is a function of the mean. If we let $$\mu_i = f(\mathbf{x_i; \theta})$$, it would be similar to Logistic Regression since we can choose $$f()$$ as $$\mu_i = \mathbf{x_i'\theta}, \mu_i = \exp(\mathbf{x_i'\theta}), \mu_i = \log(\mathbf{x_i'\theta})$$

### 7.4.1 Application

Count Data and Poisson regression

data(bioChemists, package = "pscl")
bioChemists <- bioChemists %>%
rename(
Num_Article = art, #articles in last 3 years of PhD
Sex = fem, #coded 1 if female
Married = mar, #coded 1 if married
Num_Kid5 = kid5, #number of childeren under age 6
PhD_Quality = phd, #prestige of PhD program
Num_MentArticle = ment #articles by mentor in last 3 years
)
hist(bioChemists$Num_Article, breaks = 25, main = 'Number of Articles') Poisson_Mod <- glm(Num_Article ~ ., family=poisson, bioChemists) summary(Poisson_Mod) ## ## Call: ## glm(formula = Num_Article ~ ., family = poisson, data = bioChemists) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -3.5672 -1.5398 -0.3660 0.5722 5.4467 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.304617 0.102981 2.958 0.0031 ** ## SexWomen -0.224594 0.054613 -4.112 3.92e-05 *** ## MarriedMarried 0.155243 0.061374 2.529 0.0114 * ## Num_Kid5 -0.184883 0.040127 -4.607 4.08e-06 *** ## PhD_Quality 0.012823 0.026397 0.486 0.6271 ## Num_MentArticle 0.025543 0.002006 12.733 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 1817.4 on 914 degrees of freedom ## Residual deviance: 1634.4 on 909 degrees of freedom ## AIC: 3314.1 ## ## Number of Fisher Scoring iterations: 5 Residual of 1634 with 909 df isn’t great. We see Pearson $$\chi^2$$ Predicted_Means <- predict(Poisson_Mod,type = "response") X2 <- sum((bioChemists$Num_Article - Predicted_Means)^2/Predicted_Means)
X2
##  1662.547
pchisq(X2,Poisson_Mod$df.residual, lower.tail = FALSE) ##  7.849882e-47 With interaction terms, there are some improvements Poisson_Mod_All2way <- glm(Num_Article ~ .^2, family=poisson, bioChemists) Poisson_Mod_All3way <- glm(Num_Article ~ .^3, family=poisson, bioChemists) Consider the $$\hat{\phi} = \frac{\text{deviance}}{df}$$ Poisson_Mod$deviance / Poisson_Mod$df.residual ##  1.797988 This is evidence for over-dispersion. Likely cause is missing variables. And remedies could either be to include more variables or consider random effects. A quick fix is to force the Poisson Regression to include this value of $$\phi$$, and this model is called “Quasi-Poisson.” phi_hat = Poisson_Mod$deviance/Poisson_Mod\$df.residual
summary(Poisson_Mod,dispersion = phi_hat)
##
## Call:
## glm(formula = Num_Article ~ ., family = poisson, data = bioChemists)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -3.5672  -1.5398  -0.3660   0.5722   5.4467
##
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)      0.30462    0.13809   2.206  0.02739 *
## SexWomen        -0.22459    0.07323  -3.067  0.00216 **
## MarriedMarried   0.15524    0.08230   1.886  0.05924 .
## Num_Kid5        -0.18488    0.05381  -3.436  0.00059 ***
## PhD_Quality      0.01282    0.03540   0.362  0.71715
## Num_MentArticle  0.02554    0.00269   9.496  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1.797988)
##
##     Null deviance: 1817.4  on 914  degrees of freedom
## Residual deviance: 1634.4  on 909  degrees of freedom
## AIC: 3314.1
##
## Number of Fisher Scoring iterations: 5

Or directly rerun the model as

quasiPoisson_Mod <- glm(Num_Article ~ ., family=quasipoisson, bioChemists)

Quasi-Poisson is not recommended, but Negative Binomial Regression that has an extra parameter to account for over-dispersion is.