7.4 Poisson Regression

From the Poisson distribution

\[ \begin{aligned} f(Y_i) &= \frac{\mu_i^{Y_i}exp(-\mu_i)}{Y_i!}, Y_i = 0,1,.. \\ E(Y_i) &= \mu_i \\ var(Y_i) &= \mu_i \end{aligned} \]

which is a natural distribution for counts. We can see that the variance is a function of the mean. If we let \(\mu_i = f(\mathbf{x_i; \theta})\), it would be similar to Logistic Regression since we can choose \(f()\) as \(\mu_i = \mathbf{x_i'\theta}, \mu_i = \exp(\mathbf{x_i'\theta}), \mu_i = \log(\mathbf{x_i'\theta})\)

7.4.1 Application

Count Data and Poisson regression

data(bioChemists, package = "pscl")
bioChemists <- bioChemists %>%
    rename(
        Num_Article = art, #articles in last 3 years of PhD
        Sex = fem, #coded 1 if female
        Married = mar, #coded 1 if married
        Num_Kid5 = kid5, #number of childeren under age 6
        PhD_Quality = phd, #prestige of PhD program
        Num_MentArticle = ment #articles by mentor in last 3 years
    )
hist(bioChemists$Num_Article, breaks = 25, main = 'Number of Articles')

Poisson_Mod <- glm(Num_Article ~ ., family=poisson, bioChemists)
summary(Poisson_Mod)
#> 
#> Call:
#> glm(formula = Num_Article ~ ., family = poisson, data = bioChemists)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -3.5672  -1.5398  -0.3660   0.5722   5.4467  
#> 
#> Coefficients:
#>                  Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)      0.304617   0.102981   2.958   0.0031 ** 
#> SexWomen        -0.224594   0.054613  -4.112 3.92e-05 ***
#> MarriedMarried   0.155243   0.061374   2.529   0.0114 *  
#> Num_Kid5        -0.184883   0.040127  -4.607 4.08e-06 ***
#> PhD_Quality      0.012823   0.026397   0.486   0.6271    
#> Num_MentArticle  0.025543   0.002006  12.733  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for poisson family taken to be 1)
#> 
#>     Null deviance: 1817.4  on 914  degrees of freedom
#> Residual deviance: 1634.4  on 909  degrees of freedom
#> AIC: 3314.1
#> 
#> Number of Fisher Scoring iterations: 5

Residual of 1634 with 909 df isn’t great.

We see Pearson \(\chi^2\)

Predicted_Means <- predict(Poisson_Mod,type = "response")
X2 <- sum((bioChemists$Num_Article - Predicted_Means)^2/Predicted_Means)
X2
#> [1] 1662.547
pchisq(X2,Poisson_Mod$df.residual, lower.tail = FALSE)
#> [1] 7.849882e-47

With interaction terms, there are some improvements

Poisson_Mod_All2way <- glm(Num_Article ~ .^2, family=poisson, bioChemists)
Poisson_Mod_All3way <- glm(Num_Article ~ .^3, family=poisson, bioChemists)

Consider the \(\hat{\phi} = \frac{\text{deviance}}{df}\)

Poisson_Mod$deviance / Poisson_Mod$df.residual
#> [1] 1.797988

This is evidence for over-dispersion. Likely cause is missing variables. And remedies could either be to include more variables or consider random effects.

A quick fix is to force the Poisson Regression to include this value of \(\phi\), and this model is called “Quasi-Poisson”.

phi_hat = Poisson_Mod$deviance/Poisson_Mod$df.residual
summary(Poisson_Mod,dispersion = phi_hat)
#> 
#> Call:
#> glm(formula = Num_Article ~ ., family = poisson, data = bioChemists)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -3.5672  -1.5398  -0.3660   0.5722   5.4467  
#> 
#> Coefficients:
#>                 Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)      0.30462    0.13809   2.206  0.02739 *  
#> SexWomen        -0.22459    0.07323  -3.067  0.00216 ** 
#> MarriedMarried   0.15524    0.08230   1.886  0.05924 .  
#> Num_Kid5        -0.18488    0.05381  -3.436  0.00059 ***
#> PhD_Quality      0.01282    0.03540   0.362  0.71715    
#> Num_MentArticle  0.02554    0.00269   9.496  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for poisson family taken to be 1.797988)
#> 
#>     Null deviance: 1817.4  on 914  degrees of freedom
#> Residual deviance: 1634.4  on 909  degrees of freedom
#> AIC: 3314.1
#> 
#> Number of Fisher Scoring iterations: 5

Or directly rerun the model as

quasiPoisson_Mod <- glm(Num_Article ~ ., family=quasipoisson, bioChemists)

Quasi-Poisson is not recommended, but Negative Binomial Regression that has an extra parameter to account for over-dispersion is.