Chapter 4 Poisson Regression

4.1 Learning Objectives

After finishing this chapter, you should be able to:

  • Describe why simple linear regression is not ideal for Poisson data.
  • Write out a Poisson regression model and identify the assumptions for inference.
  • Write out the likelihood for a Poisson regression and describe how it could be used to estimate coefficients for a model.
  • Interpret estimated coefficients from a Poisson regression and construct confidence intervals for them.
  • Use deviances for Poisson regression models to compare and assess models.
  • Use an offset to account for varying effort in data collection.
  • Fit and use a zero-inflated Poisson (ZIP) model.
# Packages required for Chapter 4
library(gridExtra)
library(knitr)
library(kableExtra)
library(mosaic)
library(xtable)
library(pscl) 
library(multcomp)
library(pander)
library(MASS)
library(tidyverse)

4.2 Introduction to Poisson Regression

Consider the following questions:

  1. Are the number of motorcycle deaths in a given year related to a state’s helmet laws?
  2. Does the number of employers conducting on-campus interviews during a year differ for public and private colleges?
  3. Does the daily number of asthma-related visits to an Emergency Room differ depending on air pollution indices?
  4. Has the number of deformed fish in randomly selected Minnesota lakes been affected by changes in trace minerals in the water over the last decade?

Each example involves predicting a response using one or more explanatory variables, although these examples have response variables that are counts per some unit of time or space. A Poisson random variable is often used to model counts; see Chapter 3 for properties of the Poisson distribution. Since a Poisson random variable is a count, its minimum value is zero and, in theory, the maximum is unbounded. We’d like to model our main parameter \(\lambda\), the average number of occurrences per unit of time or space, as a function of one or more covariates. For example, in the first question above, \(\lambda_i\) represents the average number of motorcycle deaths in a year for state \(i\), and we hope to show that state-to-state variability in \(\lambda_i\) can be explained by state helmet laws.

For a linear least squares regression model, the parameter of interest is the average response, \(\mu_i\), for subject \(i\), and \(\mu_i\) is modeled as a line in the case of one explanatory variable. By analogy, it might seem reasonable to try to model the Poisson parameter \(\lambda_i\) as a linear function of an explanatory variable, but there are some problems with this approach. In fact, a model like \(\lambda_i=\beta_0+\beta_1x_i\) doesn’t work well for Poisson data. A line is certain to yield negative values for certain \(x_i\), but \(\lambda_i\) can only take on values from 0 to \(\infty\). In addition, the equal variance assumption in linear regression inference is violated because as the mean rate for a Poisson variable increases, the variance also increases (recall from Chapter 3 that if \(Y\) is the observed count, then \(E(Y)=Var(Y)=\lambda\)).

One way to avoid these problems is to model log(\(\lambda_i\)) instead of \(\lambda_i\) as a function of the covariates. The log(\(\lambda_i\)) takes on values from \(-\infty\) to \(\infty\). We can also take into account the increase in the variance with an increasing mean using this approach. (Note that throughout Beyond Multiple Linear Regression we use log to represent the natural logarithm.) Thus, we will consider the Poisson regression model:

\[\begin{equation*} log(\lambda_i)=\beta_0+\beta_1 x_i \end{equation*}\] where the observed values \(Y_i \sim\) Poisson with \(\lambda=\lambda_i\) for a given \(x_i\). For example, each state \(i\) can potentially have a different \(\lambda\) depending on its value of \(x_i\), where \(x_i\) could represent presence or absence of a particular helmet law. Note that the Poisson regression model contains no separate error term like the \(\epsilon\) we see in linear regression, because \(\lambda\) determines both the mean and the variance of a Poisson random variable.

4.2.1 Poisson Regression Assumptions

Much like linear least squares regression (LLSR), using Poisson regression to make inferences requires model assumptions.

  1. Poisson Response The response variable is a count per unit of time or space, described by a Poisson distribution.
  2. Independence The observations must be independent of one another.
  3. Mean=Variance By definition, the mean of a Poisson random variable must be equal to its variance.
  4. Linearity The log of the mean rate, log(\(\lambda\)), must be a linear function of x.

4.2.2 A Graphical Look at Poisson Regression

Regression models: Linear regression (left) and Poisson regression (right).

Figure 4.1: Regression models: Linear regression (left) and Poisson regression (right).

Figure 4.1 illustrates a comparison of the LLSR model for inference to Poisson regression using a log function of \(\lambda\).

  1. The graphic displaying the LLSR inferential model appears in the left panel of Figure 4.1. It shows that, for each level of X, the responses are approximately normal. The panel on the right side of Figure 4.1 depicts what a Poisson regression model looks like. For each level of X, the responses follow a Poisson distribution (Assumption 1). For Poisson regression, small values of \(\lambda\) are associated with a distribution that is noticeably skewed with lots of small values and only a few larger ones. As \(\lambda\) increases the distribution of the responses begins to look more and more like a normal distribution.
  2. In the LLSR model, the variation in \(Y\) at each level of X, \(\sigma^2\), is the same. For Poisson regression the responses at each level of X become more variable with increasing means, where variance=mean (Assumption 3).
  3. In the case of LLSR, the mean responses for each level of X, \(\mu_{Y|X}\), fall on a line. In the case of the Poisson model, the mean values of \(Y\) at each level of \(X\), \(\lambda_{Y|X}\), fall on a curve, not a line, although the logs of the means should follow a line (Assumption 4).

4.3 Case Studies Overview

We take a look at the Poisson regression model in the context of three case studies. Each case study is based on real data and real questions. Modeling household size in the Philippines introduces the idea of regression with a Poisson response along with its assumptions. A quadratic term is added to a model to determine an optimal size per household, and methods of model comparison are introduced. The campus crime case study introduces two big ideas in Poisson regression modeling: offsets, to account for sampling effort, and overdispersion, when actual variability exceeds what is expected by the model. Finally, the weekend drinking example uses a modification of a Poisson model to account for more zeros than would be expected for a Poisson random variable. These three case studies also provide context for some of the familiar concepts related to modeling such as exploratory data analysis (EDA), estimation, and residual plots.

4.4 Case Study: Household Size in the Philippines

How many other people live with you in your home? The number of people sharing a house differs from country to country and often from region to region. International agencies use household size when determining needs of populations, and the household sizes determine the magnitude of the household needs.

The Philippine Statistics Authority (PSA) spearheads the Family Income and Expenditure Survey (FIES) nationwide. The survey, which is undertaken every three years, is aimed at providing data on family income and expenditure, including levels of consumption by item of expenditure. Our data, from the 2015 FIES, is a subset of 1500 of the 40,000 observations (Philippine Statistics Authority 2015). Our data set focuses on five regions: Central Luzon, Metro Manila, Ilocos, Davao, and Visayas (see Figure 4.2).

Regions of the Philippines.

Figure 4.2: Regions of the Philippines.

At what age are heads of households in the Philippines most likely to find the largest number of people in their household? Is this association similar for poorer households (measured by the presence of a roof made from predominantly light/salvaged materials)? We begin by explicitly defining our response, \(Y=\) number of household members other than the head of the household. We then define the explanatory variables: age of the head of the household, type of roof (predominantly light/salvaged material or predominantly strong material), and location (Central Luzon, Davao Region, Ilocos Region, Metro Manila, or Visayas). Note that predominantly light/salvaged materials are a combination of light material, mixed but predominantly light material, and mixed but predominantly salvaged material, and salvaged matrial. Our response is a count, so we consider a Poisson regression where the parameter of interest is \(\lambda\), the average number of people, other than the head, per household. We will primarily examine the relationship between household size and age of the head of household, controlling for location and income.

4.4.1 Data Organization

The first five rows from our data set fHH1.csv are illustrated in Table 4.1. Each line of the data file refers to a household at the time of the survey:

  • location = where the house is located (Central Luzon, Davao Region, Ilocos Region, Metro Manila, or Visayas)
  • age = the age of the head of household
  • total = the number of people in the household other than the head
  • numLT5 = the number in the household under 5 years of age
  • roof = the type of roof in the household (either Predominantly Light/Salvaged Material, or Predominantly Strong Material, where stronger material can sometimes be used as a proxy for greater wealth)
Table 4.1: The first five observations from the Philippines Household case study.
X1 location age total numLT5 roof
1 CentralLuzon 65 0 0 Predominantly Strong Material
2 MetroManila 75 3 0 Predominantly Strong Material
3 DavaoRegion 54 4 0 Predominantly Strong Material
4 Visayas 49 3 0 Predominantly Strong Material
5 MetroManila 74 3 0 Predominantly Strong Material

4.4.2 Exploratory Data Analyses

For the rest of this case study, we will refer to the number of people in a household as the total number of people in that specific household besides the head of household. The average number of people in a household is 3.68 (Var = 5.53), and there are anywhere from 0 to 16 people in the houses. Over 11.1% of these households are made from predominantly light and salvaged material. The mean number of people in a house for houses with a roof made from predominantly strong material is 3.69 (Var=5.55), whereas houses with a roof made from predominantly light/salvaged material average 3.64 people (Var=5.41). Of the various locations, Visayas has the largest household size, on average, with a mean of 3.90 in the household, and the Davao Region has the smallest with a mean of 3.39.

Distribution of household size in 5 Philippine regions.

Figure 4.3: Distribution of household size in 5 Philippine regions.

Figure 4.3 reveals a fair amount of variability in the number in each house; responses range from 0 to 16 with many of the respondents reporting between 1 and 5 people in the house. Like many Poisson distributions, this graph is right skewed. It clearly does not suggest that the number of people in a household is a normally distributed response.

Distribution of household sizes by age group of the household head.

Figure 4.4: Distribution of household sizes by age group of the household head.

Figure 4.4 further shows that responses can be reasonably modeled with a Poisson distribution when grouped by a key explanatory variable: age of the household head. These last two plots together suggest that Assumption 1 (Poisson Response) is satisfactory in this case study.

For Poisson random variables, the variance of \(Y\) (i.e., the square of the standard deviation of \(Y\)), is equal to its mean, where \(Y\) represents the size of an individual household. As the mean increases, the variance increases. So, if the response is a count and the mean and variance are approximately equal for each group of \(X\), a Poisson regression model may be a good choice. In Table 4.2 we display age groups by 5-year increments, to check to see if the empirical means and variances of the number in the house are approximately equal for each age group. This provides us one way in which to check the Poisson Assumption 3 (mean = variance).

Table 4.2: Compare mean and variance of household size within each age group.
Age Groups Mean Variance n
(15,20] 1.667 0.6667 6
(20,25] 2.167 1.5588 18
(25,30] 2.918 1.4099 49
(30,35] 3.444 2.1931 108
(35,40] 3.842 3.5735 158
(40,45] 4.234 4.4448 175
(45,50] 4.490 6.3963 194
(50,55] 4.011 5.2512 188
(55,60] 3.807 6.5319 145
(60,65] 3.706 6.1958 153
(65,70] 3.339 7.9980 115
NA 2.550 5.5436 191

If there is a problem with this assumption, most often we see variances much larger than means. Here, as expected, we see more variability as age increases. However, it appears that the variance is smaller than the mean for lower ages, while the variance is greater than the mean for higher ages. Thus, there is some evidence of a violation of the mean=variance assumption (Assumption 3), although any violations are modest.

The Poisson regression model also implies that log(\(\lambda_i\)), not the mean household size \(\lambda_i\), is a linear function of age; i.e., \(log(\lambda_i)=\beta_0+\beta_1\textrm{age}_i\). Therefore, to check the linearity assumption (Assumption 4) for Poisson regression, we would like to plot log(\(\lambda_i\)) by age. Unfortunately, \(\lambda_i\) is unknown. Our best guess of \(\lambda_i\) is the observed mean number in the household for each age (level of \(X\)). Because these means are computed for observed data, they are referred to as empirical means. Taking the logs of the empirical means and plotting by age provides a way to assess the linearity assumption. The smoothed curve added to Figure 4.5 suggests that there is a curvilinear relationship between age and the log of the mean household size, implying that adding a quadratic term should be considered. This finding is consistent with the researchers’ hypothesis that there is an age at which a maximum household size occurs. It is worth noting that we are not modeling the log of the empirical means, rather it is the log of the true rate that is modeled. Looking at empirical means, however, does provide an idea of the form of the relationship between log(\(\lambda)\) and \(x_i\).

The log of the mean household sizes, besides the head of household, by age of the head of household, with loess smoother.

Figure 4.5: The log of the mean household sizes, besides the head of household, by age of the head of household, with loess smoother.

We can extend Figure 4.5 by fitting separate curves for each region (see Figure 4.6). This allows us to see if the relationship between mean household size and age is consistent across region. In this case, the relationships are pretty similar; if they weren’t, we could consider adding an age-by-region interaction to our eventual Poisson regression model.

Empirical log of the mean household sizes vs. age of the head of household, with loess smoother by region.

Figure 4.6: Empirical log of the mean household sizes vs. age of the head of household, with loess smoother by region.

Finally, the independence assumption (Assumption 2) can be assessed using knowledge of the study design and the data collection process. In this case, we do not have enough information to assess the independence assumption with the information we are given. If each household was not selected individually in a random manner, but rather groups of households were selected from different regions with differing customs about living arrangements, the independence assumption would be violated. If this were the case, we could use a multilevel model like those discussed in later chapters with a village term.

4.4.3 Estimation and Inference

We first consider a model for which log(\(\lambda\)) is linear in age. We then will determine whether a model with a quadratic term in age provides a significant improvement based on trends we observed in the exploratory data analysis.

R reports an estimated regression equation for the linear Poisson model as:

\[\begin{equation*} log(\hat{\lambda}) = 1.55 - 0.0047 \textrm{age} \end{equation*}\]

modela = glm(total ~ age, family = poisson, data = fHH1)
##              Estimate Std. Error z value   Pr(>|z|)
## (Intercept)  1.549942  0.0502754  30.829 1.070e-208
## age         -0.004706  0.0009363  -5.026  5.013e-07
##  Residual deviance =  2337  on  1498 df 
##  Dispersion parameter =  1

How can the coefficient estimates be interpreted in terms of this example? As done when interpreting slopes in the LLSR models, we consider how the estimated mean number in the house, \(\lambda\), changes as the age of the household head increases by an additional year. But in place of looking at change in the mean number in the house, with a Poisson regression we consider the log of the mean number in the house and then convert back to original units. For example, consider a comparison of two models—one for a given age (\(x\)) and one after increasing age by 1 (\(x+1\)):

\[\begin{equation} \begin{split} log(\lambda_X) &= \beta_0 + \beta_1X \\ log(\lambda_{X+1}) &= \beta_0 + \beta_1(X+1) \\ log(\lambda_{X+1})-log(\lambda_X) &= \beta_1 \\ log \left(\frac{\lambda_{X+1}}{\lambda_X}\right) &= \beta_1\\ \frac{\lambda_{X+1}}{\lambda_X} &= e^{\beta_1} \end{split} \tag{4.1} \end{equation}\]

These results suggest that by exponentiating the coefficient on age we obtain the multiplicative factor by which the mean count changes. In this case, the mean number in the house changes by a factor of \(e^{-0.0047}=0.995\) or decreases by 0.5% (since \(1-.995 = .005\)) with each additional year older the household head is; or, we predict a 0.47% increase in mean household size for a 1-year decrease in age of the household head (since \(1/.995=1.0047\)). The quantity on the left-hand side of Equation (4.1) is referred to as a rate ratio or relative risk, and it represents a percent change in the response for a unit change in X. In fact, for regression models in general, whenever a variable (response or explanatory) is logged, we make interpretations about multiplicative effects on that variable, while with unlogged variables we can reach our usual interpretations about additive effects.

Typically, the standard errors for the estimated coefficients are included in Poisson regression output. Here the standard error for the estimated coefficient for age is 0.00094. We can use the standard error to construct a confidence interval for \(\beta_1\). A 95% CI provides a range of plausible values for the age coefficient and can be constructed:

\[(\hat\beta_1-Z^*\cdot SE(\hat\beta_1), \quad \hat\beta_1+Z^*\cdot SE(\hat\beta_1))\] \[(-0.0047-1.96*0.00094, \quad -0.0047+1.96*0.00094)\] \[ (-0.0065, -0.0029). \]

Exponentiating the endpoints yields a confidence interval for the relative risk; i.e., the percent change in household size for each additional year older. Thus \((e^{-0.0065},e^{-0.0029})=(0.993,0.997)\) suggests that we are 95% confident that the mean number in the house decreases between 0.7% and 0.3% for each additional year older the head of household is. It is best to construct a confidence interval for the coefficient and then exponentiate the endpoints because the estimated coefficients more closely follow a normal distribution than the exponentiated coefficients. There are other approaches to constructing intervals in these circumstances, including profile likelihood, the delta method, and bootstrapping, and we will discuss some of those approaches later. In this case, for instance, the profile likelihood interval is nearly identical to the Wald-type (normal theory) confidence interval above.

# CI for betas using profile likelihood
confint(modela)
##                 2.5 %    97.5 %
## (Intercept)  1.451170  1.648249
## age         -0.006543 -0.002873
exp(confint(modela))
##              2.5 % 97.5 %
## (Intercept) 4.2681 5.1979
## age         0.9935 0.9971

If there is no association between age and household size, there is no change in household size for each additional year, so \(\lambda_X\) is equal to \(\lambda_{X+1}\) and the ratio \(\lambda_{X+1}/\lambda_X\) is 1. In other words, if there is no association between age and household size, then \(\beta_1=0\) and \(e^{\beta_1}=1\). Note that our interval for \(e^{\beta_1}\), (0.993,0.997), does not include 1, so the model with age is preferred to a model without age; i.e., age is significantly associated with household size. Note that we could have similarly confirmed that our interval for \(\beta_1\) does not include 0 to show the significance of age as a predictor of household size.

Another way to test the significance of the age term is to calculate a Wald-type statistic. A Wald-type test statistic is the estimated coefficient divided by its standard error. When the true coefficient is 0, this test statistic follows a standard normal distribution for sufficiently large \(n\). The estimated coefficient associated with the linear term in age is \({\hat{\beta}_1}=-0.0047\) with standard error \(SE(\hat{\beta}_1)=0.00094\). The value for the Wald test statistic is then \(Z=\hat{\beta}_1/SE(\hat{\beta}_1)=-5.026\), where \(Z\) follows a standard normal distribution if \(\beta_1=0\). In this case, the two-sided p-value based on the standard normal distribution for testing \(H_0:\beta_1=0\) is almost 0 (\(p=0.000000501\)). Therefore, we have statistically significant evidence (Z = -5.026, p < .001) that average household size decreases as age of the head of household increases.

4.4.4 Using Deviances to Compare Models

There is another way in which to assess how useful age is in our model. A deviance is a way in which to measure how the observed data deviates from the model predictions; it will be defined more precisely in Section 4.4.8, but it is similar to sum of squared errors (unexplained variability in the response) in LLSR regression. Because we want models that minimize deviance, we calculate the drop-in-deviance when adding age to the model with no covariates (the null model). The deviances for the null model and the model with age can be found in the model output. A residual deviance for the model with age is reported as 2337.1 with 1498 df. The output also includes the deviance and degrees of freedom for the null model (2362.5 with 1499 df). The drop-in-deviance is 25.4 (2362.5 - 2337.1) with a difference of only 1 df, so that the addition of one extra term (age) reduced unexplained variability by 25.4. If the null model were true, we would expect the drop-in-deviance to follow a \(\chi^2\) distribution with 1 df. Therefore, the p-value for comparing the null model to the model with age is found by determining the probability that the value for a \(\chi^2\) random variable with one degree of freedom exceeds 25.4, which is essentially 0. Once again, we can conclude that we have statistically significant evidence (\(\chi^2_{\text{df} =1}=25.4\), \(p < .001\)) that average household size decreases as age of the head of household increases.

# model0 is the null/reduced model
model0 <- glm(total ~ 1, family = poisson, data = fHH1)
drop_in_dev <- anova(model0, modela, test = "Chisq")
  ResidDF ResidDev Deviance Df      pval
1    1499     2362       NA NA        NA
2    1498     2337     25.4  1 4.661e-07

More formally, we are testing:

\[\textrm{Null (reduced) Model}: \log(\lambda) = \beta_0 \textrm{ or } \beta_1=0\] \[\textrm{Larger (full) Model}: \log(\lambda) = \beta_0 + \beta_1\textrm{age} \textrm{ or } \beta_1 \neq 0 \]

In order to use the drop-in-deviance test, the models being compared must be nested; e.g., all the terms in the smaller model must appear in the larger model. Here the smaller model is the null model with the single term \(\beta_0\) and the larger model has \(\beta_0\) and \(\beta_1\), so the two models are indeed nested. For nested models, we can compare the models’ residual deviances to determine whether the larger model provides a significant improvement.

Here, then, is a summary of these two approaches to hypothesis testing about terms in Poisson regression models:

Drop-in-deviance test to compare models

  • Compute the deviance for each model, then calculate: drop-in-deviance = residual deviance for reduced model – residual deviance for the larger model.
  • When the reduced model is true, the drop-in-deviance \(\sim \chi^2_d\) where d= the difference in the degrees of freedom associated with the two models (that is, the difference in the number of terms/coefficients).
  • A large drop-in-deviance favors the larger model.

Wald test for a single coefficient

  • Wald-type statistic = estimated coefficient / standard error
  • When the true coefficient is 0, for sufficiently large \(n\), the test statistic \(\sim\) N(0,1).
  • If the magnitude of the test statistic is large, there is evidence that the true coefficient is not 0.

The drop-in-deviance and the Wald-type tests usually provide consistent results; however, if there is a discrepancy, the drop-in-deviance is preferred. Not only does the drop-in-deviance test perform better in more cases, but it’s also more flexible. If two models differ by one term, then the drop-in-deviance test essentially tests if a single coefficient is 0 like the Wald test does, while if two models differ by more than one term, the Wald test is no longer appropriate.

4.4.5 Using Likelihoods to Fit Models (optional)

Before continuing with model building, we take a short detour to see how coefficient estimates are determined in a Poisson regression model. The least squares approach requires a linear relationship between the parameter, \(\lambda_i\) (the expected or mean response for observation \(i\)), and \(x_i\) (the age for observation \(i\)). However, it is log\((\lambda_i)\), not \(\lambda_i\), that is linearly related to X with the Poisson model. The assumptions of equal variance and normality also do not hold for Poisson regression. Thus, the method of least squares will not be helpful for inference in Poisson Regression. Instead of least squares, we employ the likelihood principle to find estimates of our model coefficients. We look for those coefficient estimates for which the likelihood of our data is maximized; these are the maximum likelihood estimates.

The likelihood for n independent observations is the product of the probabilities. For example, if we observe five households with household sizes of 4, 2, 8, 6, and 1 person beyond the head, the likelihood is:

\[ Likelihood = P(Y_1=4)*P(Y_2=2)*P(Y_3=8)*P(Y_4=6)*P(Y_5=1)\] Recall that the probability of a Poisson response can be written

\[P(Y=y)=\frac{e^{-\lambda}\lambda^y}{y!}\] for \(y = 0, 1, 2, ...\) So, the likelihood can be written as

\[\begin{align*} Likelihood&= \frac{ e^{-\lambda_1}\lambda_1^4 }{ 4! }* \frac{ e^{-\lambda_2}\lambda_2^2 }{ 2! } *\frac{e^{-\lambda_3}\lambda_3^8}{8!}* \frac{e^{-\lambda_4}\lambda_4^6}{6!}*\frac{e^{-\lambda_5}\lambda_5^1}{1!} \end{align*}\] where each \(\lambda_i\) can differ for each household depending on a particular \(x_i\). As in Chapter 2, it will be easier to find a maximum if we take the log of the likelihood and ignore the constant term resulting from the sum of the factorials:

\[\begin{align} -logL& \propto \lambda_{1}-4log(\lambda_{1})+\lambda_{2}-2log(\lambda_{2}) \nonumber \\ & +\lambda_{3}-8log(\lambda_{3})+\lambda_{4}-6log(\lambda_{4}) \nonumber \\ & +\lambda_{5}-log(\lambda_{5}) \tag{4.2} \end{align}\]

Now if we had the age of the head of the household for each house (\(x_i\)), we consider the Poisson regression model:

\[log(\lambda_i)=\beta_0+\beta_1x_i \]

This implies that \(\lambda\) differs for each age and can be determined using

\[\lambda_i=e^{\beta_0+\beta_1x_i.}\]

If the ages are \(X=c(32,21,55,44,28)\) years, our loglikelihood can be written:

\[\begin{align} logL & \propto [-e^{\beta_0+\beta_132}+4({\beta_0+\beta_132})]+ [-e^{\beta_0+\beta_121}+2({\beta_0+\beta_121})]+ \nonumber \\ & [-e^{\beta_0+\beta_155}+8({\beta_0+\beta_155})]+ [-e^{\beta_0+\beta_144}+6({\beta_0+\beta_144})]+ \nonumber \\ & [-e^{\beta_0+\beta_128}+({\beta_0+\beta_128})] \tag{4.3} \end{align}\]

To see this, match the terms in Equation (4.2) with those in Equation (4.3), noting that \(\lambda_i\) has been replaced with \(e^{\beta_0+\beta_1x_i}\). It is Equation (4.3) that will be used to estimate the coefficients \(\beta_0\) and \(\beta_1\). Although this looks a little more complicated than the loglikelihoods we saw in Chapter 2, the fundamental ideas are the same. In theory, we try out different possible values of \(\beta_0\) and \(\beta_1\) until we find the two for which the loglikelihood is largest. Most statistical software packages have automated search algorithms to find those values for \(\beta_0\) and \(\beta_1\) that maximize the loglikelihood.

4.4.6 Second Order Model

In Section 4.4.4, the Wald-type test and drop-in-deviance test both suggest that a linear term in age is useful. But our exploratory data analysis in Section 4.4.2 suggests that a quadratic model might be more appropriate. A quadratic model would allow us to see if there exists an age where the number in the house is, on average, a maximum. The output for a quadratic model appears below.

fHH1 <- fHH1 %>% mutate(age2 = age*age)
modela2 = glm(total ~ age + age2, family = poisson, 
              data = fHH1)
##               Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -0.3325296  1.788e-01  -1.859 6.297e-02
## age          0.0708868  6.890e-03  10.288 8.007e-25
## age2        -0.0007083  6.406e-05 -11.058 2.009e-28
##  Residual deviance =  2201  on  1497 df 
##  Dispersion parameter =  1

We can assess the importance of the quadratic term in two ways. First, the p-value for the Wald-type statistic for age\(^2\) is statistically significant (Z = -11.058, p < 0.001). Another approach is to perform a drop-in-deviance test.

drop_in_dev <- anova(modela, modela2, test = "Chisq")
  ResidDF ResidDev Deviance Df      pval
1    1498     2337       NA NA        NA
2    1497     2201    136.1  1 1.854e-31

\(H_0\): log(\(\lambda\))=\(\beta_0+\beta_1 \textrm{age}\) (reduced model)

\(H_A:\) log(\(\lambda\))=\(\beta_0+\beta_1 \textrm{age} + \beta_2 \textrm{age}^2\) (larger model)

The first order model has a residual deviance of 2337.1 with 1498 df and the second order model, the quadratic model, has a residual deviance of 2200.9 with 1497 df. The drop-in-deviance by adding the quadratic term to the linear model is 2337.1 - 2200.9 = 136.2 which can be compared to a \(\chi^2\) distribution with one degree of freedom. The p-value is essentially 0, so the observed drop of 136.2 again provides significant support for including the quadratic term.

We now have an equation in age which yields the estimated log(mean number in the house).

\[\textrm{log(mean numHouse)} = -0.333 + 0.071 \textrm{age} - 0.00071 \textrm{age}^2\]

As shown in the following, with calculus we can determine that the maximum estimated additional number in the house is \(e^{1.441} = 4.225\) when the head of the household is 50.04 years old.

\[\begin{align*} \textrm{log(total)} & = -0.333 + 0.071\textrm{age} - 0.00071 \textrm{age}^2 \\ \frac{d}{d\textrm{age}}\textrm{log(total)} & = 0 + 0.071 - 0.0014 \textrm{age} = 0 \\ \textrm{age} & = 50.04 \\ \textrm{max[log(total)]} & = -0.333 + 0.071 \times 50.04 - 0.00071 \times (50.04)^2 = 1.441 \end{align*}\]

4.4.7 Adding a Covariate

We should consider other covariates that may be related to household size. By controlling for important covariates, we can obtain more precise estimates of the relationship between age and household size. In addition, we may discover that the relationship between age and household size may differ by levels of a covariate. One important covariate to consider is location. As described earlier in the case study, there are 5 different regions that are associated with the Location variable: Central Luzon, Metro Manila, Visayas, Davao Region, and Ilocos Region. Assessing the utility of including the covariate Location is, in essence, comparing two nested models; here the quadratic model is compared to the quadratic model plus terms for Location. Results from the fitted model appears below; note that Central Luzon is the reference region that all other regions are compared to.

modela2L = glm(total ~ age + age2 + location, 
               family = poisson, data = fHH1)
##                        Estimate Std. Error  z value
## (Intercept)          -0.3843338  0.1820919  -2.1107
## age                   0.0703628  0.0069051  10.1900
## age2                 -0.0007026  0.0000642 -10.9437
## locationDavaoRegion  -0.0193872  0.0537827  -0.3605
## locationIlocosRegion  0.0609820  0.0526598   1.1580
## locationMetroManila   0.0544801  0.0472012   1.1542
## locationVisayas       0.1121092  0.0417496   2.6853
##                       Pr(>|z|)
## (Intercept)          3.480e-02
## age                  2.197e-24
## age2                 7.126e-28
## locationDavaoRegion  7.185e-01
## locationIlocosRegion 2.468e-01
## locationMetroManila  2.484e-01
## locationVisayas      7.247e-03
##  Residual deviance =  2188  on  1493 df 
##  Dispersion parameter =  1

Our Poisson regression model now looks like:

\[\begin{align*} \textrm{log(total)} & = -0.384 + 0.070 \cdot \textrm{age} - 0.00070 \cdot \textrm{age}^2 +0.061 \cdot \textrm{IlocosRegion} + \\ & 0.054 \cdot\textrm{MetroManila} +0.112 \cdot\textrm{Visayas} - 0.019 \cdot \textrm{DavaoRegion} \end{align*}\] Notice that because there are 5 different locations, we must represent the effects of different locations through 4 indicator variables. For example, \(\hat{\beta}_6=-0.0194\) indicates that, after controlling for the age of the head of household, the log mean household size is 0.0194 lower for households in the Davao Region than for households in the reference location of Central Luzon. In more interpretable terms, mean household size is \(e^{-0.0194}=0.98\) times “higher” (i.e., 2% lower) in the Davao Region than in Central Luzon, when holding age constant.

drop_in_dev <- anova(modela2, modela2L, test = "Chisq")
  ResidDF ResidDev Deviance Df    pval
1    1497     2201       NA NA      NA
2    1493     2188    13.14  4 0.01059

To test if the mean household size significantly differs by location, we must use a drop-in-deviance test, rather than a Wald-type test, because four terms (instead of just one) are added when including the location variable. From the Analysis of Deviance table above, adding the four terms corresponding to location to the quadratic model with age produces a statistically significant improvement \((\chi^2=13.144, df = 4, p=0.0106)\), so there is significant evidence that mean household size differs by location, after controlling for age of the head of household. Further modeling (not shown) shows that after controlling for location and age of the head of household, mean household size did not differ between the two types of roofing material.

4.4.8 Residuals for Poisson Models (optional)

Residual plots may provide some insight into Poisson regression models, especially linearity and outliers, although the plots are not quite as useful here as they are for linear least squares regression. There are a few options for computing residuals and predicted values. Residuals may have the form of residuals for LLSR models or the form of deviance residuals which, when squared, sum to the total deviance for the model. Predicted values can be estimates of the counts, \(e^{\beta_0+\beta_1X}\), or log counts, \(\beta_0+\beta_1X\). We will typically use the deviance residuals and predicted counts.

The residuals for linear least squares regression have the form:

\[\begin{align} \textrm{LLSR residual}_i &= \textrm{obs}_i - \textrm{fit}_i \nonumber \\ &={Y_i-\hat{\mu}_i} \nonumber \\ &= Y_i-(\hat{\beta}_0 +\hat{\beta}_1 X_i) \tag{4.4} \end{align}\] Residual sum of squares (RSS) are formed by squaring and adding these residuals, and we generally seek to minimize RSS in model building. We have several options for creating residuals for Poisson regression models. One is to create residuals in much the same way as we do in LLSR. For Poisson residuals, the predicted values are denoted by \(\hat{\lambda}_i\) (in place of \(\hat{\mu}_i\) in Equation (4.4)); they are then standardized by dividing by the standard error, \(\sqrt{\hat{\lambda}_i}\). These kinds of residuals are referred to as Pearson residuals.

\[\begin{equation*} \textrm{Pearson residual}_i = \frac{Y_i-\hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}} \end{equation*}\]

Pearson residuals have the advantage that you are probably familiar with their meaning and the kinds of values you would expect. For example, after standardizing we expect most Pearson residuals to fall between -2 and 2. However, deviance residuals have some useful properties that make them a better choice for Poisson regression.

First, we define a deviance residual for an observation from a Poisson regression:

\[\begin{equation*} \textrm{deviance residual}_i = \textrm{sign}(Y_i-\hat{\lambda}_i) \sqrt{ 2 \left[Y_i log\left(\frac{Y_i}{\hat{\lambda}_i}\right) -(Y_i - \hat{\lambda}_i) \right]} \end{equation*}\] where \(\textrm{sign}(x)\) is defined such that:

\[ \textrm{sign}(x) = \begin{cases} 1 & \textrm{if }\ x > 0 \\ -1 & \textrm{if }\ x < 0 \\ 0 & \textrm{if }\ x = 0\end{cases}\]

As its name implies, a deviance residual describes how the observed data deviates from the fitted model. Squaring and summing the deviances for all observations produces the residual deviance \(=\sum (\textrm{deviance residual})^2_i\). Relatively speaking, observations for good fitting models will have small deviances; that is, the predicted values will deviate little from the observed. However, you can see that the deviance for an observation does not easily translate to a difference in observed and predicted responses as is the case with LLSR models.

A careful inspection of the deviance formula reveals several places where the deviance compares \(Y\) to \(\hat{\lambda}\): the sign of the deviance is based on the difference between \(Y\) and \(\hat{\lambda}\), and under the radical sign we see the ratio \(Y/\hat{\lambda}\) and the difference \(Y -\hat{\lambda}\). When \(Y = \hat{\lambda}\), that is, when the model fits perfectly, the difference will be 0 and the ratio will be 1 (so that its log will be 0). So like the residuals in LLSR, an observation that fits perfectly will not contribute to the sum of the squared deviances. This definition of a deviance depends on the likelihood for Poisson models. Other models will have different forms for the deviance depending on their likelihood.

Residual plot for the Poisson model of household size by age of the household head.

Figure 4.7: Residual plot for the Poisson model of household size by age of the household head.

A plot (Figure 4.7) of the deviance residuals versus predicted responses for the first order model exhibits curvature, supporting the idea that the model may improved by adding a quadratic term. Other details related to residual plots can be found in a variety of sources including McCullagh and Nelder (1989).

4.4.9 Goodness-of-Fit

The model residual deviance can be used to assess the degree to which the predicted values differ from the observed. When a model is true, we can expect the residual deviance to be distributed as a \(\chi^2\) random variable with degrees of freedom equal to the model’s residual degrees of freedom. Our model thus far, the quadratic terms for age plus the indicators for location, has a residual deviance of 2187.8 with 1493 df. The probability of observing a deviance this large if the model fits is esentially 0, saying that there is significant evidence of lack-of-fit.

1-pchisq(modela2$deviance, modela2$df.residual)  # GOF test
[1] 0

There are several reasons why lack-of-fit may be observed. (1) We may be missing important covariates or interactions; a more comprehensive data set may be needed. (2) There may be extreme observations that may cause the deviance to be larger than expected; however, our residual plots did not reveal any unusual points. (3) Lastly, there may be a problem with the Poisson model. In particular, the Poisson model has only a single parameter, \(\lambda\), for each combination of the levels of the predictors which must describe both the mean and the variance. This limitation can become manifest when the variance appears to be larger than the corresponding means. In that case, the response is more variable than the Poisson model would imply, and the response is considered to be overdispersed.

4.5 Linear Least Squares vs. Poisson Regression

\[\begin{gather*} \underline{\textrm{Response}} \\ \mathbf{LLSR:}\textrm{ Normal} \\ \mathbf{Poisson Regression:}\textrm{ Counts} \\ \textrm{ } \\ \underline{\textrm{Variance}} \\ \mathbf{LLSR:}\textrm{ Equal for each level of X} \\ \mathbf{Poisson Regression:}\textrm{ Equal to the mean for each level of X} \\ \textrm{ } \\ \underline{\textrm{Model Fitting}} \\ \mathbf{LLSR:}\ \mu=\beta_0+\beta_1x \textrm{ using Least Squares}\\ \mathbf{Poisson Regression:}\ log(\lambda)=\beta_0+\beta_1x \textrm{ using Maximum Likelihood}\\ \end{gather*}\]

\[\begin{gather*} \underline{\textrm{EDA}} \\ \mathbf{LLSR:}\textrm{ Plot X vs. Y; add line} \\ \mathbf{Poisson Regression:}\textrm{ Find }log(\bar{y})\textrm{ for several subgroups; plot vs. X} \\ \textrm{ } \\ \underline{\textrm{Comparing Models}} \\ \mathbf{LLSR:}\textrm{ Extra sum of squares F-tests; AIC/BIC} \\ \mathbf{Poisson Regression:}\textrm{ Drop in Deviance tests; AIC/BIC} \\ \textrm{ } \\ \underline{\textrm{Interpreting Coefficients}} \\ \mathbf{LLSR:}\ \beta_1=\textrm{ change in }\mu_y\textrm{ for unit change in X} \\ \mathbf{Poisson Regression:}\ e^{\beta_1}=\textrm{ percent change in }\lambda\textrm{ for unit change in X} \end{gather*}\]

4.6 Case Study: Campus Crime

Students want to feel safe and secure when attending a college or university. In response to legislation, the US Department of Education seeks to provide data and reassurances to students and parents alike. All postsecondary institutions that participate in federal student aid programs are required by the Jeanne Clery Disclosure of Campus Security Policy and Campus Crime Statistics Act and the Higher Education Opportunity Act to collect and report data on crime occurring on campus to the Department of Education. In turn, this data is publicly available on the website of the Office of Postsecondary Education. We are interested in looking at whether there are regional differences in violent crime on campus, controlling for differences in the type of school.

4.6.1 Data Organization

Each row of c_data.csv contains crime information from a post secondary institution, either a college or university. The variables include:

  • Enrollment = enrollment at the school
  • type = college (C) or university (U)
  • nv = the number of violent crimes for that institution for the given year
  • nvrate = number of violent crimes per 1000 students
  • enroll1000 = enrollment at the school, in thousands
  • region = region of the country (C = Central, MW = Midwest, NE = Northeast, SE = Southeast, SW = Southwest, and W = West)
# A tibble: 10 x 6
   Enrollment type     nv nvrate enroll1000 region
        <dbl> <chr> <dbl>  <dbl>      <dbl> <chr> 
 1       5590 U        30 5.37         5.59 SE    
 2        540 C         0 0            0.54 SE    
 3      35747 U        23 0.643       35.7  W     
 4      28176 C         1 0.0355      28.2  W     
 5      10568 U         1 0.0946      10.6  SW    
 6       3127 U         0 0            3.13 SW    
 7      20675 U         7 0.339       20.7  W     
 8      12548 C         0 0           12.5  W     
 9      30063 U        19 0.632       30.1  C     
10       4429 C         4 0.903        4.43 C     

4.6.2 Exploratory Data Analysis

Histogram of number of violent crimes by institution.

Figure 4.8: Histogram of number of violent crimes by institution.

A graph of the number of violent crimes, Figure 4.8, reveals the pattern often found with distributions of counts of rare events. Many schools reported no violent crimes or very few crimes. A few schools have a large number of crimes making for a distribution that appears to be far from normal. Therefore, Poisson regression should be used to model our data; Poisson random variables are often used to represent counts (e.g., number of violent crimes) per unit of time or space (e.g., one year).

Let’s take a look at two covariates of interest for these schools: type of institution and region. In our data, the majority of institutions are universities (65% of the 81 schools) and only 35% are colleges. Interest centers on whether the different regions tend to have different crime rates. Table 4.3 contains the name of each region and each column represents the percentage of schools in that region which are colleges or universities. The proportion of colleges varies from a low of 20% in the Southwest (SW) to a high of 50% in the West (W).

Table 4.3: Proportion of colleges and universities within region in the campus crime data set.
C MW NE SE SW W
C 0.294 0.3 0.381 0.4 0.2 0.5
U 0.706 0.7 0.619 0.6 0.8 0.5

While a Poisson regression model is a good first choice because the responses are counts per year, it is important to note that the counts are not directly comparable because they come from different size schools. This issue sometimes is referred to as the need to account for sampling effort; in other words, we expect schools with more students to have more reports of violent crime since there are more students who could be affected. We cannot directly compare the 30 violent crimes from the first school in the data set to no violent crimes for the second school when their enrollments are vastly different: 5,590 for school 1 versus 540 for school 2. We can take the differences in enrollments into account by including an offset in our model, which we will discuss in the next section. For the remainder of the EDA, we examine the violent crime counts in terms of the rate per 1,000 enrolled (\(\frac{\textrm{number of violent crimes}}{\textrm{number enrolled}} \cdot 1000\)).

Note that there is a noticeable outlier for a Southeastern school (5.4 violent crimes per 1000 students), and there is an observed rate of 0 for the Southwestern colleges which can lead to some computational issues. We therefore combined the SW and SE to form a single category of the South, and we also removed the extreme observation from the data set.

Table 4.4: The mean and variance of the violent crime rate by region and type of institution.
region type MeanCount VarCount MeanRate VarRate n
C C 1.6000 3.3000 0.3980 0.2781 5
C U 4.7500 30.9318 0.2219 0.0349 12
MW C 0.3333 0.3333 0.0163 0.0008 3
MW U 8.7143 30.9048 0.4019 0.0621 7
NE C 6.0000 32.8571 1.1250 1.1821 8
NE U 5.9231 79.2436 0.4359 0.3850 13
S C 1.1250 5.8393 0.1866 0.1047 8
S U 8.6250 68.2500 0.5713 0.2778 16
W C 0.5000 0.3333 0.0680 0.0129 4
W U 12.5000 57.0000 0.4679 0.0247 4
Boxplot of violent crime rate by region and type of institution (colleges (C) on the left, and universities (U) on the right).

Figure 4.9: Boxplot of violent crime rate by region and type of institution (colleges (C) on the left, and universities (U) on the right).

Table 4.4 and Figure 4.9 display mean violent crime rates that are generally lower at the colleges within a region (with the exception of the Northeast). In addition, the regional pattern of rates at universities appears to differ from that of the colleges.

4.6.3 Accounting for Enrollment

Although working with the observed rates (per 1000 students) is useful during the exploratory data analysis, we do not use these rates explicitly in the model. The counts (per year) are the Poisson responses when modeling, so we must take into account the enrollment in a different way. Our approach is to include a term on the right side of the model called an offset, which is the log of the enrollment, in thousands. There is an intuitive heuristic for the form of the offset. If we think of \(\lambda\) as the mean number of violent crimes per year, then \(\lambda/\textrm{enroll1000}\) represents the number per 1000 students, so that the yearly count is adjusted to be comparable across schools of different sizes. Adjusting the yearly count by enrollment is equivalent to adding \(log(\textrm{enroll1000})\) to the right-hand side of the Poisson regression equation—essentially adding a predictor with a fixed coefficient of 1:

\[\begin{align*} log(\frac{\lambda}{\textrm{enroll1000}} )= \beta_0 + \beta_1(\textrm{type}) \nonumber \\ log(\lambda)-log(\textrm{enroll1000}) = \beta_0 + \beta_1(\textrm{type}) \nonumber \\ log(\lambda) = \beta_0 + \beta_1(\textrm{type}) + log(\textrm{enroll1000}) \end{align*}\]

While this heuristic is helpful, it is important to note that it is not \(\frac{\lambda}{ \textrm{enroll1000}}\) that we are modeling. We are still modeling \(log(\lambda)\), but we’re adding an offset to adjust for differing enrollments, where the offset has the unusual feature that the coefficient is fixed at 1.0. As a result, no estimated coefficient for enroll1000 or \(log(\textrm{enroll1000})\) will appear in the output. As this heuristic illustrates, modeling \(log(\lambda)\) and adding an offset is equivalent to modeling rates, and coefficients can be interpreted that way.

4.7 Modeling Assumptions

In Table 4.4, we see that the variances are greatly higher than the mean counts in almost every group. Thus, we have reason to question the Poisson regression assumption of variability equal to the mean; we will have to return to this issue after some initial modeling. The fact that the variance of the rate of violent crimes per 1000 students tends to be on the same scale as the mean tells us that adjusting for enrollment may provide some help, although that may not completely solve our issues with excessive variance.

As far as other model assumptions, linearity with respect to \(log(\lambda)\) is difficult to discern without continuous predictors, and it is not possible to assess independence without knowing how the schools were selected.

4.8 Initial Models

We are interested primarily in differences in violent crime between institutional types controlling for difference in regions, so we fit a model with region, institutional type, and our offset. Note that the central region is the reference level in our model.

modeltr <- glm(nv ~ type + region, family = poisson,
               offset = log(enroll1000), data = c.data)
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -1.54780     0.1711 -9.0439 1.512e-19
## typeU        0.27956     0.1331  2.0997 3.576e-02
## regionMW     0.09912     0.1775  0.5583 5.766e-01
## regionNE     0.77813     0.1531  5.0836 3.703e-07
## regionS      0.58238     0.1490  3.9098 9.238e-05
## regionW      0.26275     0.1875  1.4011 1.612e-01
##  Residual deviance =  348.7  on  74 df 
##  Dispersion parameter =  1

From our model, the Northeast and the South differ significantly from the Central region (p= 0.00000037 and p=0.0000924, respectively). The estimated coefficient of 0.778 means that the violent crime rate per 1,000 in the Northeast is nearly 2.2 (\(e^{0.778}\)) times that of the Central region controlling for the type of school. A Wald-type confidence interval for this factor can be constructed by first calculating a CI for the coefficient (0.778 \(\pm\) \(1.96 \cdot 0.153\)) and then exponentiating (1.61 to 2.94).

4.8.1 Tukey’s Honestly Significant Differences

Comparisons to regions other than the Central region can be accomplished by changing the reference region. If many comparisons are made, it would be best to adjust for multiple comparisons using a method such as Tukey’s Honestly Significant Differences, which considers all pairwise comparisons among regions. This method helps control the large number of false positives that we would see if we ran multiple t-tests comparing groups. The honestly significant difference compares a standardized mean difference between two groups to a critical value from a studentized range distribution.

mult_comp <- summary(glht(modeltr, mcp(region="Tukey")))
## # A tibble: 10 x 5
##    comparison estimate    SE z_value    p_value
##    <chr>         <dbl> <dbl>   <dbl>      <dbl>
##  1 MW - C       0.0991 0.178   0.558 0.980     
##  2 NE - C       0.778  0.153   5.08  0.00000349
##  3 S - C        0.582  0.149   3.91  0.000828  
##  4 W - C        0.263  0.188   1.40  0.621     
##  5 NE - MW      0.679  0.155   4.37  0.000109  
##  6 S - MW       0.483  0.151   3.19  0.0121    
##  7 W - MW       0.164  0.189   0.864 0.908     
##  8 S - NE      -0.196  0.122  -1.61  0.486     
##  9 W - NE      -0.515  0.166  -3.11  0.0157    
## 10 W - S       -0.320  0.163  -1.96  0.280

In our case, Tukey’s Honestly Significant Differences simultaneously evaluates all 10 mean differences between pairs of regions. We find that the Northeast has significantly higher rates of violent crimes than the Central, Midwest, and Western regions, while the South has significantly higher rates of violent crimes than the Central and the Midwest, controlling for the type of institution. In the primary model, the University indicator is significant and, after exponentiating the coefficient, can be interpreted as an approximately (\(e^{0.280}\)) 32% increase in violent crime rate over colleges after controlling for region.

These results certainly suggest significant differences in regions and type of institution. However, the EDA findings suggest the effect of the type of institution may vary depending upon the region, so we consider a model with an interaction between region and type.

modeli <- glm(nv ~ type + region + region:type, 
              family = poisson,
              offset = log(enroll1000), data = c.data)
##                Estimate Std. Error z value  Pr(>|z|)
## (Intercept)     -1.4741     0.3536 -4.1694 3.054e-05
## typeU            0.1959     0.3775  0.5190 6.038e-01
## regionMW        -1.9765     1.0607 -1.8635 6.239e-02
## regionNE         1.5529     0.3819  4.0664 4.775e-05
## regionS         -0.1562     0.4859 -0.3216 7.478e-01
## regionW         -1.8337     0.7906 -2.3194 2.037e-02
## typeU:regionMW   2.1965     1.0765  2.0403 4.132e-02
## typeU:regionNE  -1.0698     0.4200 -2.5473 1.086e-02
## typeU:regionS    0.8121     0.5108  1.5899 1.118e-01
## typeU:regionW    2.4106     0.8140  2.9616 3.061e-03
##  Residual deviance =  276.7  on  70 df 
##  Dispersion parameter =  1

These results provide convincing evidence of an interaction between the effect of region and the type of institution. A drop-in-deviance test like the one we carried out in the previous case study confirms the significance of the contribution of the interaction to this model. We have statistically significant evidence (\(\chi^2=71.98, df=4, p<.001\)) that the difference between colleges and universities in violent crime rate differs by region. For example, our model estimates that violent crime rates are 13.6 (\(e^{.196+2.411}\)) times higher in universities in the West compared to colleges, while in the Northeast we estimate that violent crime rates are 2.4 (\(\frac{1}{e^{.196-1.070}}\)) times higher in colleges.

drop_in_dev <- anova(modeltr, modeli, test = "Chisq")
  ResidDF ResidDev Deviance Df      pval
1      74    348.7       NA NA        NA
2      70    276.7    71.98  4 8.664e-15

The residual deviance (276.70 with 70 df) suggests significant lack-of-fit in the interaction model (p < .001). One possibility is that there are other important covariates that could be used to describe the differences in the violent crime rates. Without additional covariates to consider, we look for extreme observations, but we have already eliminated the most extreme of the observations.

In the absence of other covariates or extreme observations, we consider overdispersion as a possible explanation of the significant lack-of-fit.

4.9 Overdispersion

4.9.1 Dispersion Parameter Adjustment

Overdispersion suggests that there is more variation in the response than the model implies. Under a Poisson model, we would expect the means and variances of the response to be about the same in various groups. Without adjusting for overdispersion, we use incorrect, artificially small standard errors leading to artificially small p-values for model coefficients. We may also end up with artificially complex models.

We can take overdispersion into account in several different ways. The simplest is to use an estimated dispersion factor to inflate standard errors. Another way is to use a negative-binomial regression model. We begin with using an estimate of the dispersion parameter.

We can estimate a dispersion parameter, \(\phi\), by dividing the model deviance by its corresponding degrees of freedom; i.e., \(\hat\phi=\frac{\sum(\textrm{Pearson residuals})^2}{n-p}\) where \(p\) is the number of model parameters. It follows from what we know about the \(\chi^2\) distribution that if there is no overdispersion, this estimate should be close to one. It will be larger than one in the presence of overdispersion. We inflate the standard errors by multiplying the variance by \(\phi\), so that the standard errors are larger than the likelihood approach would imply; i.e., \(SE_Q(\hat\beta)=\sqrt{\hat\phi}*SE(\hat\beta)\), where \(Q\) stands for “quasi-Poisson” since multiplying variances by \(\phi\) is an ad-hoc solution. Our process for model building and comparison is called quasilikelihood—similar to likelihood but without exact underlying distributions. If we choose to use a dispersion parameter with our model, we refer to the approach as quasilikelihood. The following output illustrates a quasi-Poisson approach to the interaction model:

modeliq <- glm(nv ~ type + region + region:type, 
               family = quasipoisson,
               offset = log(enroll1000), data = c.data)
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -1.4741     0.7455 -1.9773  0.05195
## typeU            0.1959     0.7961  0.2461  0.80631
## regionMW        -1.9765     2.2366 -0.8837  0.37987
## regionNE         1.5529     0.8053  1.9284  0.05786
## regionS         -0.1562     1.0246 -0.1525  0.87924
## regionW         -1.8337     1.6671 -1.0999  0.27513
## typeU:regionMW   2.1965     2.2701  0.9676  0.33659
## typeU:regionNE  -1.0698     0.8856 -1.2080  0.23111
## typeU:regionS    0.8121     1.0771  0.7540  0.45338
## typeU:regionW    2.4106     1.7164  1.4045  0.16460
##  Residual deviance =  276.7  on  70 df 
##  Dispersion parameter =  4.447

In the absence of overdispersion, we expect the dispersion parameter estimate to be 1.0. The estimated dispersion parameter here is much larger than 1.0 (4.447) indicating overdispersion (extra variance) that should be accounted for. The larger estimated standard errors in the quasi-Poisson model reflect the adjustment. For example, the standard error for the West region term from a likelihood based approach is 0.7906, whereas the quasilikelihood standard error is \(\sqrt{4.47}*0.7906\) or 1.6671. This term is no longer significant under the quasi-Poisson model. In fact, after adjusting for overdispersion (extra variation), none of the model coefficients in the quasi-Poisson model are significant at the .05 level! This is because standard errors were all increased by a factor of 2.1 (\(\sqrt{\hat\phi}=\sqrt{4.447}=2.1\)), while estimated coefficients remain unchanged.

Note that tests for individual parameters are now based on the t-distribution rather than a standard normal distribution, with test statistic \(t=\frac{\hat\beta}{SE_Q(\hat\beta)}\) following an (approximate) t-distribution with \(n-p\) degrees of freedom if the null hypothesis is true (\(H_O:\beta=0\)). Drop-in-deviance tests can be similarly adjusted for overdispersion in the quasi-Poisson model. In this case, you can divide the test statistic (per degree of freedom) by the estimated dispersion parameter and compare the result to an F-distribution with the difference in the model degrees of freedom for the numerator and the degrees of freedom for the larger model in the denominator. That is, \(F=\frac{\textrm{drop in deviance}}{\textrm{difference in df}} / {\hat\phi}\) follows an (approximate) F-distribution when the null hypothesis is true (\(H_0\): reduced model sufficient). The output below tests for an interaction between region and type of institution after adjusting for overdispersion (extra variance):

modeltrq <- glm(nv ~ type + region, family = quasipoisson,
               offset = log(enroll1000), data = c.data)
drop_in_dev <- anova(modeltrq, modeliq, test = "F")
  ResidDF ResidDev     F Df     pval
1      74    348.7    NA NA       NA
2      70    276.7 4.047  4 0.005213

Here, even after adjusting for overdispersion, we still have statistically significant evidence (\(F=4.05, p=.0052\)) that the difference between colleges and universities in violent crime rate differs by region.

4.9.2 No Dispersion vs. Overdispersion

Table 4.5 summarizes the comparison between Poisson inference (tests and confidence intervals assuming no overdispersion) and quasi-Poisson inference (tests and confidence intervals after accounting for overdispersion).

Table 4.5: Comparison of Poisson and quasi-Poisson inference.
Poisson quasi-Poisson
Estimate \(\hat{\beta}\) \(\hat{\beta}\)
Std error \(SE(\hat{\beta})\) \(SE_Q(\hat{\beta}) = \sqrt{\hat{\phi}} SE(\hat{\beta})\)
Wald-type test stat \(Z = \hat{\beta} / SE(\hat{\beta})\) \(t = \hat{\beta} / SE_Q(\hat{\beta})\)
Confidence interval \(\hat{\beta} \pm z^{'} SE(\hat{\beta})\) \(\hat{\beta} \pm t^{'} SE_Q(\hat{\beta})\)
Drop in deviance test \(\chi^2 = \textrm{resid dev(reduced) - resid dev(full)}\) \(F = (\chi^2 / \textrm{difference in df}) / \hat{\phi}\)

4.9.3 Negative Binomial Modeling

Another approach to dealing with overdispersion is to model the response using a negative binomial instead of a Poisson distribution. An advantage of this approach is that it introduces another parameter in addition to \(\lambda\), which gives the model more flexibility and, as opposed to the quasi-Poisson model, the negative binomial model assumes an explicit likelihood model. You may recall that negative binomial random variables take on non-negative integer values, which is consistent with modeling counts. This model posits selecting a \(\lambda\) for each institution and then generating a count using a Poisson random variable with the selected \(\lambda\). With this approach, the counts will be more dispersed than would be expected for observations based on a single Poisson variable with rate \(\lambda\). (See Guided Exercises on the Gamma-Poisson mixture in Chapter 3.)

Mathematically, you can think of the negative binomial model as a Poisson model where \(\lambda\) is also random, following a gamma distribution. Specifically, if \(Y|\lambda \sim \textrm{Poisson}(\lambda)\) and \(\lambda \sim \textrm{gamma}(r,\frac{1-p}{p})\), then \(Y \sim \textrm{NegBinom}(r,p)\) where \(E(Y)=\frac{pr}{1-p}=\mu\) and \(Var(Y)=\frac{pr}{(1-p)^2}=\mu+\frac{\mu^2}{r}\). The overdispersion in this case is given by \(\frac{\mu^2}{r}\), which approaches 0 as \(r\) increases (so smaller values of \(r\) indicate greater overdispersion).

Here is what happens if we apply a negative binomial regression model to the interaction model, which we’ve already established suffers from overdispersion issues under regular Poisson regression:

# Account for overdispersion with negative binomial model
modelinb <- glm.nb(nv ~ type + region + region:type, 
               offset(log(enroll1000)), data = c.data2)
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)      0.4904     0.4281  1.1455 0.252008
## typeU            1.2174     0.4608  2.6422 0.008237
## regionMW        -1.0953     0.8075 -1.3563 0.175004
## regionNE         1.3966     0.5053  2.7641 0.005709
## regionS          0.1461     0.5559  0.2627 0.792752
## regionW         -1.1858     0.6870 -1.7260 0.084347
## typeU:regionMW   1.6342     0.8498  1.9231 0.054469
## typeU:regionNE  -1.1259     0.5601 -2.0102 0.044411
## typeU:regionS    0.4513     0.5995  0.7527 0.451638
## typeU:regionW    2.0387     0.7527  2.7086 0.006758
##  Residual deviance =  199.6  on  69 df 
##  Dispersion parameter (theta) =  1.313

These results differ from the quasi-Poisson model. Several effects are now statistically significant at the .05 level: the effect of type of institution for the Central region (\(Z=2.64, p=.008\)), the difference between Northeast and Central regions for colleges (\(Z=2.76, p=.006\)), the difference between Northeast and Central regions in type effect (\(Z=-2.01, p=.044\)), and the difference between West and Central regions in type effect (\(Z=2.71, p=.007\)). In this case, compared to the quasi-Poisson model, negative binomial coefficient estimates are generally in the same direction and similar in size, but negative binomial standard errors are somewhat smaller.

In summary, we explored the possibility of differences in the violent crime rate between colleges and universities, controlling for region. Our initial efforts seemed to suggest that there are indeed differences between colleges and universities, and the pattern of those differences depends upon the region. However, this model exhibited significant lack-of-fit which remained after the removal of an extreme observation. In the absence of additional covariates, we accounted for the lack-of-fit by using a quasilikelihood approach and a negative binomial regression, which provided slightly different conclusions. We may want to look for additional covariates and/or more data.

4.10 Case Study: Weekend Drinking

Sometimes when analyzing Poisson data, you may see many more zeros in your data set than you would expect for a Poisson random variable. For example, an informal survey of students in an introductory statistics course included the question, “How many alcoholic drinks did you consume last weekend?”. This survey was conducted on a dry campus where no alcohol is officially allowed, even among students of drinking age, so we expect that some portion of the respondents never drink. The non-drinkers would thus always report zero drinks. However, there will also be students who are drinkers reporting zero drinks because they just did not happen to drink during the past weekend. Our zeros, then, are a mixture of responses from non-drinkers and drinkers who abstained during the past weekend. Ideally, we’d like to sort out the non-drinkers and drinkers when performing our analysis.

4.10.1 Research Question

The purpose of this survey is to explore factors related to drinking behavior on a dry campus. What proportion of students on this dry campus never drink? What factors, such as off-campus living and sex, are related to whether students drink? Among those who do drink, to what extent is moving off campus associated with the number of drinks in a weekend? It is commonly assumed that males’ alcohol consumption is greater than females’; is this true on this campus? Answering these questions would be a simple matter if we knew who was and was not a drinker in our sample. Unfortunately, the non-drinkers did not identify themselves as such, so we will need to use the data available with a model that allows us to estimate the proportion of drinkers and non-drinkers.

4.10.2 Data Organization

Each line of weekendDrinks.csv contains data provided by a student in an introductory statistics course. In this analysis, the response of interest is the respondent’s report of the number of alcoholic drinks they consumed the previous weekend, whether the student lives off.campus, and sex. We will also consider whether a student is likely a firstYear student based on the dorm they live in. Here is a sample of observations from this data set:

head(zip.data[2:5])
  drinks sex off.campus firstYear
1      0   f          0      TRUE
2      5   f          0     FALSE
3     10   m          0     FALSE
4      0   f          0     FALSE
5      0   m          0     FALSE
6      3   f          0     FALSE

4.10.3 Exploratory Data Analysis

As always we take stock of the amount of data; here there are 77 observations. Large sample sizes are preferred for the type of model we will consider, and n=77 is on the small side. We proceed with that in mind.

A premise of this analysis is that we believe that those responding zero drinks are coming from a mixture of non-drinkers and drinkers who abstained the weekend of the survey.

  • Non-drinkers: respondents who never drink and would always reply with zero.
  • Drinkers: obviously this includes those responding with one or more drinks, but it also includes people who are drinkers but did not happen to imbibe the past weekend. These people reply zero but are not considered non-drinkers.
Observed (a) versus modeled (b) number of drinks.

Figure 4.10: Observed (a) versus modeled (b) number of drinks.

Beginning the EDA with the response, number of drinks, we find that over 46% of the students reported no drinks during the past weekend. Figure 4.10a portrays the observed number of drinks reported by the students. The mean number of drinks reported the past weekend is 2.013. Our sample consists of 74% females and 26% males, only 9% of whom live off campus.

Because our response is a count, it is natural to consider a Poisson regression model. You may recall that a Poisson distribution has only one parameter, \(\lambda\), for its mean and variance. Here we will include an additional parameter, \(\alpha\). We define \(\alpha\) to be the true proportion of non-drinkers in the population.

The next step in the EDA is especially helpful if you suspect your data contains excess zeros. Figure 4.10b is what we might expect to see under a Poisson model. Bars represent the probabilities for a Poisson distribution (using the Poisson probability formula) with \(\lambda\) equal to the mean observed number of drinks, 2.013 drinks per weekend. Comparing this Poisson distribution to what we observed (Figure 4.10a), it is clear that many more zeros have been reported by the students than you would expect to see if the survey observations were coming from a Poisson distribution. This doesn’t surprise us because we had expected a subset of the survey respondents to be non-drinkers; i.e., they would not be included in this Poisson process. This circumstance actually arises in many Poisson regression settings. We will define \(\lambda\) to be the mean number of drinks among those who drink, and \(\alpha\) to be the proportion of non-drinkers (“true zeros”). Then, we will attempt to model \(\lambda\) and \(\alpha\) (or functions of \(\lambda\) and \(\alpha\)) simultaneously using covariates like sex, first-year status, and off-campus residence. This type of model is referred to as a zero-inflated Poisson model or ZIP model.

4.10.4 Modeling

We first fit a simple Poisson model with the covariates off.campus and sex.

pois.m1 <- glm(drinks ~ off.campus + sex, family = poisson,
               data = zip.data)
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept)   0.1293     0.1241   1.041 2.976e-01
## off.campus    0.8976     0.2008   4.470 7.830e-06
## sexm          1.1154     0.1611   6.925 4.361e-12
##  Residual deviance =  230.5  on  74 df 
##  Dispersion parameter =  1
# Exponentiated coefficients
exp(coef(pois.m1))
## (Intercept)  off.campus        sexm 
##       1.138       2.454       3.051
# Goodness-of-fit test
gof.pvalue = 1 - pchisq(pois.m1$deviance, pois.m1$df.residual)
gof.pvalue
## [1] 0

Both covariates are statistically significant, but a goodness-of-fit test reveals that there remains significant lack-of-fit (residual deviance: 230.54 with only 74 df; p<.001 based on \(\chi^2\) test with 74 df). In the absence of important missing covariates or extreme observations, this lack-of-fit may be explained by the presence of a group of non-drinkers.

A zero-inflated Poisson regression model to take non-drinkers into account consists of two parts:

  • One part models the association, among drinkers, between number of drinks and the predictors of sex and off-campus residence.
  • The other part uses a predictor for first-year status to obtain an estimate of the proportion of non-drinkers based on the reported zeros.

The form for each part of the model follows. The first part looks like an ordinary Poisson regression model:

\[ log(\lambda)=\beta_0+\beta_1\textrm{off.campus}+ \beta_2\textrm{sex} \] where \(\lambda\) is the mean number of drinks in a weekend among those who drink. The second part has the form

\[ logit(\alpha)=\beta_0+\beta_1\textrm{firstYear} \] where \(\alpha\) is the probability of being in the non-drinkers group and \(logit(\alpha) = log( \alpha/(1-\alpha))\). We’ll provide more detail on the logit in Chapter 6. There are many ways in which to structure this model; here we use different predictors in the two pieces, athough it would have been perfectly fine to use the same predictors for both pieces, or even no predictors for one of the pieces.

4.10.5 Fitting a ZIP Model

How is it possible to fit such a model? We cannot observe whether a respondent is a drinker or not (which probably would’ve been good to ask). The ZIP model is a special case of a more general type of statistical model referred to as a latent variable model. More specifically, it is a type of a mixture model where observations for one or more groups occur together and the group membership is unknown. Zero-inflated models are a particularly common example of a mixture model, but the response does not need to follow a Poisson distribution. Likelihood methods are at the core of this methodology, but fitting is an iterative process where it is necessary to start out with some guesses (or starting values). In general, it is important to know that models like ZIP exist, although we’ll only explore interpretations and fitting options for a single case study here.

Here is the general idea of how ZIP models are fit. Imagine that the graph of the Poisson distribution in Figure 4.10b is removed from the observed data distribution in Figure 4.10a. Some zero responses will remain. These would correspond to non-drinkers, and the proportion of all observations these zeros constitute might make a reasonable estimate for \(\alpha\), the proportion of non-drinkers. The likelihood is used and some iterating in the fitting process is involved because the Poisson distribution in Figure 4.10b is based on the mean of the observed data, which means it is the average among all students, not only among drinkers. Furthermore, the likelihood incorporates the predictors, sex and off.campus. So there is a little more to it than computing the proportion of zeros, but this heuristic should provide you a general idea of how these kinds of models are fit. We will use the R function zeroinfl from the package pscl to fit a ZIP model.

zip.m2 <- zeroinfl(drinks ~ off.campus + sex | firstYear, 
                   data = zip.data)
## $count
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept)   0.7543     0.1440   5.238 1.624e-07
## off.campus    0.4159     0.2059   2.020 4.333e-02
## sexm          1.0209     0.1752   5.827 5.634e-09
## 
## $zero
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -0.6036     0.3114  -1.938  0.05261
## firstYearTRUE   1.1364     0.6095   1.864  0.06226
##  Log likelihood =  -140.8
exp(coef(zip.m2))   # exponentiated coefficients
##  count_(Intercept)   count_off.campus 
##             2.1261             1.5158 
##         count_sexm   zero_(Intercept) 
##             2.7757             0.5468 
## zero_firstYearTRUE 
##             3.1155

Our model uses firstYear to distinguish drinkers and non-drinkers (“Zero-inflation model coefficients”) and off.campus and sex to help explain the differences in the number of drinks among drinkers (“Count model coefficients”). Again, we could have used the same covariates for the two pieces of a ZIP model, but neither off.campus nor sex proved to be a useful predictor of drinkers vs. non-drinkers after we accounted for first-year status.

We’ll first consider the “Count model coefficients,” which provide information on how the sex and off-campus status of a student who is a drinker are related to the number of drinks reported by that student over a weekend. As we have done with previous Poisson regression models, we exponentiate each coefficient for ease of interpretation. Thus, for those who drink, the average number of drinks for males is \(e^{1.0209}\) or 2.76 times the number for females (Z = 5.827, p < 0.001) given that you are comparing people who live in comparable settings, i.e., either both on or both off campus. Among drinkers, the mean number of drinks for students living off campus is \(e^{0.4159}=1.52\) times that of students living on campus for those of the same sex (Z = 2.021, p = 0.0433).

The “Zero-inflation model coefficients” refer to separating drinkers from non-drinkers. An important consideration in separating drinkers from non-drinkers may be whether this is their first year, where firstYear is a 0/1 indicator variable.

We have \[ log(\alpha/(1-\alpha)) =-0.6036+1.1364\textrm{firstYear} \]

However, we are interested in \(\alpha\), the proportion of non-drinkers. Exponentiating the coefficient for the first-year term for this model yields 3.12. Here it is interpreted as the odds (\(\frac{\alpha}{1-\alpha}\)) that a first-year student is a non-drinker is 3.12 times the odds that an upper-class student is a non-drinker. Furthermore, with a little algebra (solving the equation with \(log(\alpha/(1-\alpha)\)) for \(\alpha\)), we have

\[ \hat{\alpha} = \frac{e^ {-0.6036+1.1364(\textrm{firstYear})}} {1+e^{ -0.6036+1.1364(\textrm{firstYear}) } }. \]

The estimated chance that a first-year student is a non-drinker is

\[ \frac{e^{0.533}}{1+e^{0.533}} = 0.630 \] or 63.0%, while for non-first-year students, the estimated probability of being a non-drinker is 0.354. If you have seen logistic regression, you’ll recognize that this transformation is what is used to estimate a probability. More on this in Chapter 6.

4.10.6 The Vuong Test (optional)

Moving from ordinary Poisson to zero-inflated Poisson has helped us address additional research questions: What proportion of students are non-drinkers, and what factors are associated with whether or not a student is a non-drinker? While a ZIP model seems more faithful to the nature and structure of this data, can we quantitatively show that a zero-inflated Poisson is better than an ordinary Poisson model?

We cannot use the drop-in-deviance test we discussed earlier because these two models are not nested within one another. Vuong (1989) devised a test to make this comparison for the special case of comparing a zero-inflated model and ordinary regression model. Essentially, the Vuong Test is able to compare predicted probabilities of non-nested models.

vuong(pois.m1, zip.m2)
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A
Raw                      -2.689 model2 > model1
AIC-corrected            -2.534 model2 > model1
BIC-corrected            -2.353 model2 > model1
              p-value
Raw            0.0036
AIC-corrected  0.0056
BIC-corrected  0.0093

Here, we have structured the Vuong Test to compare Model 1: Ordinary Poisson Model to Model 2: Zero-inflation Model. If the two models do not differ, the test statistic for Vuong would be asymptotically standard Normal and the p-value would be relatively large. Here the first line of the output table indicates that the zero-inflation model is better (\(Z=-2.69,p=.0036\)). Note that the test depends upon sufficiently large n for the Normal approximation, so since our sample size (n=77) is somewhat small, we need to interpret this result with caution. More research is underway to address statistical issues related to these comparisons.

4.10.7 Residual Plot

Fitted values (\(\hat{y}\)) and residuals (\(y-\hat{y}\)) can be computed for zero-inflation models and plotted. Figure 4.11 reveals that one observation appears to be extreme (Y=22 drinks during the past weekend). Is this a legitimate observation or was there a transcribing error? Without the original respondents, we cannot settle this question. It might be worthwhile to get a sense of how influential this extreme observation is by removing Y=22 and refitting the model.

Residuals by fitted counts for ZIP model.

Figure 4.11: Residuals by fitted counts for ZIP model.

4.10.8 Limitations

Given that you have progressed this far in your statistical education, the weekend drinking survey question should raise some red flags. What time period constitutes the “weekend”? Will some students be thinking of only Saturday night, while others include Friday night or possibly Sunday evening? What constitutes a drink—a bottle of beer? How many drinks will a respondent report for a bottle of wine? Precise definitions would vastly improve the quality of this data. There is also an issue related to confidentiality. If the data is collected in class, will the teacher be able to identify the respondent? Will respondents worry that a particular response will affect their grade in the class or lead to repercussions on a dry campus?

In addition to these concerns, there are a number of other limitations that should be noted. Following the concern of whether this data represents a random sample of any population (it doesn’t), we also must be concerned with the size of this data set. ZIP models are not appropriate for small samples and this data set is not impressively large.

At times, a mixture of zeros occurs naturally. It may not come about because of neglecting to ask a critical question on a survey, but the information about the subpopulation may simply not be ascertainable. For example, visitors from a state park were asked as they departed how many fish they caught, but those who report 0 could be either non-fishers or fishers who had bad luck. These kinds of circumstances occur often enough that ZIP models are becoming increasingly common.

Actually, applications which extend beyond ordinary Poisson regression applications—ZIPs and other Poisson modeling approaches such as hurdle models and quasi-Poisson applications—are becoming increasingly common. So it is worth taking a look at these variations of Poisson regression models. Here we have only skimmed the surface of zero-inflated models, but we want you to be aware of models of this type. ZIP models demonstrate that modeling can be flexible and creative—a theme we hope you will see throughout this book.

4.11 Exercises

4.11.1 Conceptual Exercises

Exercises 1-4 involve predicting a response using one or more explanatory variables, where these examples have response variables that are counts per some unit of time or space. List the response (both what is being counted and over what unit of time or space) and relevant explanatory variables.

  1. Are the number of motorcycle deaths in a given year related to a state’s helmet laws?
  2. Does the number of employers conducting on-campus interviews during a year differ for public and private colleges?
  3. Does the daily number of asthma-related visits to an Emergency Room differ depending on air pollution indices?
  4. Has the number of deformed fish in randomly selected Minnesota lakes been affected by changes in trace minerals in the water over the last decade?
  5. Models of the form \(Y_i=\beta_0+\beta_1X_i+\epsilon_i, \epsilon_i \sim iidN(0,\sigma)\) are fit using the method of least squares. What method is used to fit Poisson regression models?
  6. What should be done before adjusting for overdispersion?
  7. Why are quasi-Poisson models used, and how do the results typically compare for corresponding models using regular Poisson regression?
  8. Why is the log of mean counts, log(\(\bar{Y}\)), not \(\bar{Y}\), plotted against X when assessing the assumptions for Poisson regression?
  9. How can the assumption of mean=variance be checked for Poisson regression? What if there are not many repeated observations at each level of X?
  10. Is it possible that a predictor is significant for a model fit using Poisson regression, but not for a model for the same data fit using quasi-Poisson regression? Explain.

Complete (a)-(d) in the context of the study for Exercises 11-13.

  1. Define the response.
  2. What are the possible values for the response?
  3. What does \(\lambda\) represent?
  4. Would a zero-inflated model be considered here? If so, what would be a “true zero”?
  1. Fish (or, as they say in French, poisson). A state wildlife biologist collected data from 250 park visitors as they left at the end of their stay. Each was asked to report the number of fish they caught during their one-week stay. On average, visitors caught 21.5 fish per week.

  2. Methadone program recidivism. Program facilitators keep track of the number of times their program’s patients relapse within five years of initial treatment. Data on 100 patients yielded a mean number of 2.8 relapses per patient within the five years of initial treatment.

  3. Clutch size. Thirty nests were located and the number of eggs in each nest were counted at the start of a season. Later in the season following a particularly destructive storm, the mean clutch size of the 30 nests was only 1.7 eggs per nest.

  4. Credit card use. A survey of 1,000 consumers asked respondents how many credit cards they use. Interest centers on the relationship between credit card use and income, in $10,000. The estimated coefficient for income is 2.1.

    • Identify the predictor and interpret the estimated coefficient for the predictor in this context.
    • Describe how the assumption of linearity can be assessed in this example.
    • Suggest a way in which to assess the equal mean and variance assumption.
Table 4.6: Sample data for Exercise 15.
Age Time Online Number of Dates Arranged Online
19 35 3
29 20 5
38 15 0
55 10 0
  1. Dating online. Researchers are interested in the number of dates respondents arranged online and whether the rates differ by age group. Questions which elicit responses similar to this can be found in the Pew Survey concerning dating online and relationships (Smith and Duggan 2013). Each survey respondent was asked how many dates they have arranged online in the past 3 months as well as the typical amount of time, \(t\), in hours, they spend online weekly. Some rows of data appear in Table 4.6.

    • Identify the response, predictor, and offset in this context. Does using an offset make sense?
    • Write out a model for this data. As part of your model description, define the parameter, \(\lambda\).
    • Consider a zero-inflated Poisson model for this data. Describe what the `true zeros’ would be in this setting.
Table 4.7: Data from Scotto et al. (1974) on the number of cases of non-melanoma skin cancer for women by age group in two metropolitan areas (Minneapolis-St. Paul and Dallas-Ft. Worth); the year is unknown.
Number of Cases Population Age Group City
1 172675 15-24 1
16 123065 25-34 1
226 29007 75-84 2
65 7538 85+ 2
The columns contain: number of cases, population size, age group, and city (1=Minneapolis-St. Paul, 2=Dallas-Ft. Worth).
  1. Poisson approximation: rare events. For rare diseases, the probability of a case occurring, \(p\), in a very large population, \(n\), is small. With a small \(p\) and large \(n\), the random variable \(Y\)= the number of cases out of \(n\) people can be approximated using a Poisson random variable with \(\lambda = np\). If the count of those with the disease is observed in several different populations independently of one another, the \(Y_i\) represents the number of cases in the \(i^{th}\) population and can be approximated using a Poisson random variable with \(\lambda_i=n_ip_i\) where \(p_i\) is the probability of a case for the \(i^{th}\) population. Poisson regression can take into account the differences in the population sizes, \(n_i\), using as an offset log(\(n_i\)) as well as differences in a population characteristic like \(x_i\). The coefficient of the offset is set at one; it is not estimated like the other coefficients. Thus the model statement has the form: \(log(\lambda_i) = \beta_0+\beta_1x_i + log(n_i)\), where \(Y_i \sim\) Poisson(\(\lambda_i = n_i p_i\)). Note that \(\lambda_i\) depends on \(x_i\) which may differ for the different populations.

    Scotto, Kopf, and Urbach (1974) wondered if skin cancer rates by age group differ by city. Based on their data in Table 4.7, identify and describe the following quantities which appear in the description of the Poisson approximation for rare events:

    • A case,
    • The population size, \(n_i\),
    • Probability, \(p_i\),
    • Poisson parameter, \(\lambda_i\),
    • Poisson random variables, \(Y_i\), and
    • The predictors, \(X_i\).

4.11.2 Guided Exercises

  1. College burglaries. We wish to build a regression model to describe the number of burglaries on a college campus in a year. Our population of interest will be U.S. liberal arts colleges.
    1. Describe why the response variable (\(Y\) = # burglaries on campus in a year) could be modeled by a Poisson distribution.
    2. Describe explanatory variables which might explain differences in \(\lambda_i\) = mean number of burglaries per year on campus \(i\).
    3. Consider a campus with an average of 5 burglaries per year. Use dpois() to sketch a plot of the distribution of \(Y\) for this campus. Use rpois() to verify that both the mean and variance of \(Y\) are given by \(\lambda=5\).
    4. Consider a campus with an average of 20 burglaries per year and repeat (c).
  2. Elephant mating. How does age affect male elephant mating patterns? An article by Poole (1989) investigated whether mating success in male elephants increases with age and whether there is a peak age for mating success. To address this question, the research team followed 41 elephants for one year and recorded both their ages and their number of matings. The data (Ramsey and Schafer 2002) is found in elephant.csv, and the variables are:
    • MATINGS = the number of matings in a given year
    • AGE = the age of the elephant in years.
    1. Create a histogram of MATINGS. Is there preliminary evidence that number of matings could be modeled as a Poisson response? Explain.
    2. Plot MATINGS by AGE. Add a least squares line. Is there evidence that modeling matings using a linear regression with age might not be appropriate? Explain. (Hints: fit a smoother; check residual plots).
    3. For each age, calculate the mean number of matings. Take the log of each mean and plot it by AGE.
      1. What assumption can be assessed with this plot?
      2. Is there evidence of a quadratic trend on this plot?
    4. Fit a Poisson regression model with a linear term for AGE. Exponentiate and then interpret the coefficient for AGE.
    5. Construct a 95% confidence interval for the slope and interpret in context (you may want to exponentiate endpoints).
    6. Are the number of matings significantly related to age? Test with
      1. a Wald test and
      2. a drop in deviance test.
    7. Add a quadratic term in AGE to determine whether there is a maximum age for the number of matings for elephants. Is a quadratic model preferred to a linear model? To investigate this question, use
      1. a Wald test and
      2. a drop in deviance test.
    8. What can we say about the goodness-of-fit of the model with age as the sole predictor? Compare the residual deviance for the linear model to a \(\chi^2\) distribution with the residual model degrees of freedom.
    9. Fit the linear model using quasi-Poisson regression. (Why?)
      1. How do the estimated coefficients change?
      2. How do the standard errors change?
      3. What is the estimated dispersion parameter?
      4. An estimated dispersion parameter greater than 1 suggests overdispersion. When adjusting for overdispersion, are you more or less likely to obtain a significant result when testing coefficients? Why?
Table 4.8: A small subset of hypothetical data on Minnesota workplace rules on smoking.
Subject X (location) Y (cigarettes)
1 0 3
2 1 0
3 1 0
4 1 1
5 0 2
6 0 1
X is 0 for home and 1 for work.
Y is number of cigaretttes in a 2-hour period.
  1. Smoking at work and home. An earlier study examined the effect of workplace rules in Minnesota which require smokers to smoke cigarettes outside. The number of cigarettes smoked by smokers in a 2-hour period was recorded, along with whether the smoker was at home or at work. A (very) small subset of the data appears in Table 4.8.

    • Model 1: Assume that \(Y \sim \textrm{Poisson}(\lambda)\); there is no difference between home and work.
    • Model 2: Assume that \(Y \sim \textrm{Poisson}(\lambda_W)\) when the smoker is at work, and \(Y \sim \textrm{Poisson}(\lambda_H)\) when the smoker is at home.
    • Model 3: Assume that \(Y \sim \textrm{Poisson}(\lambda)\) and \(log(\lambda)=\beta_0+\beta_1X\).
    1. Write out the likelihood \(L(\lambda)\) and the log-likelihood \(logL(\lambda)\) in Model 1. Use the data values above, and simplify where possible.
    2. Intuitively, what would be a reasonable estimate for \(\lambda\) based on this data? Why?
    3. Find the maximum likelihood estimator for \(\lambda\) in Model 1 using an optimization routine in R (but not the glm() function). Use R to produce a plot of the likelihood function \(L(\lambda)\).
    4. Write out the log-likelihood function \(logL(\lambda_W, \lambda_H)\) in Model 2. Use the data values above, and simplify where possible.
    5. Intuitively, what would be reasonable estimates for \(\lambda_W\) and \(\lambda_H\) based on this data? Why?
    6. Find the maximum likelihood estimators for \(\lambda_W\) and \(\lambda_H\) in Model 2 using an optimization routine in R (but not the glm() function).
  2. Smoking at work and home (continued). We will use the same data set in this question as we used in Question 3.

    1. Write out the log-likelihood function \(logL(\beta_0, \beta_1)\) in Model 3. Again, use the data values above, and simplify where possible.
    2. Find the maximum likelihood estimators for \(\beta_0\) and \(\beta_1\) in Model 3 using an optimization routine in R (but not the glm() function). Use R to produce a 3D plot of the log-likelihood function.
    3. Confirm your estimates for Model 1 and Model 3 using glm(). Then show that the MLEs for Model 3 agree with the MLEs for Model 2.

    For the remaining questions, we will focus exclusively on Model 3.

    1. State a (one-sided) hypothesis for \(\beta_1\) in the context of the problem (i.e., explain how your hypothesis relates to smoking at home and at work). Note: we will nevertheless use two-sided tests and intervals in the following questions.
    2. Do we need to include an offset in our Poisson regression model? Why or why not?
    3. Give estimates of \(\beta_0\) and \(\beta_1\), and provide interpretations for both in the context of the problem.
    4. Provide and interpret a 95% confidence interval for \(\beta_1\).
    5. Provide two different significance tests for \(\beta_1\), in each case providing a test statistic and a p-value and a conclusion in the context of the problem.
    6. Provide a goodness-of-fit test for Model 3, again providing a test statistic, p-value, and conclusion in context.
    7. Can we generalize results of this study to all Minnesota smokers? Why or why not?
    8. Can we claim that rules restricting smoking in the workplace have caused lower levels of smoking at work? Explain.
    9. Give two ways in which this study might be improved (besides simply “bigger sample size”).
  3. Campus crime. The data set campuscrime09.csv contains the number of burglaries reported at a collection of 47 U.S. public universities with over 10,000 students in the year 2009. In addition, covariates are included which may explain differences in crime rates, including total number of students, percentage of men, average SAT and ACT test scores, and tuition.

    1. Perform an exploratory data analysis. Support your analysis with plots and summary statistics.
      1. Analyze whether number of burglaries could be reasonably modeled with a Poisson distribution.
      2. Analyze which covariates you expect to be the best predictors of burglaries.
    2. Consider a model with 4 predictors: act.comp + tuition + pct.male + total. Try fitting a linear regression with burg09 as the response. Are there any concerns with this linear regression model?
    3. Run a Poisson regression model with the 4 predictors from (b). Interpret the coefficients for tuition and pct.male.
    4. Replace tuition with tuition in thousands in your model from (c) – i.e., tuition.thous=tuition/1000. How does your new model compare to your model in (c)? Interpret the coefficient for tuition.thous.
    5. We will consider the possibility of including the total number of students at a university as an offset.
      1. Explain why we might consider total as an offset.
      2. Refit your model from (d) with total (actually, log(total)) as an offset rather than as a predictor. Does this new model appear to fit better or worse than the model from (d)?
      3. Refit your model from (d) with log(total) rather than total – so log(total) is a predictor and not an offset. If total were a good candidate for an offset, what would we expect the coefficient of log(total) to be? Does a 95% confidence interval for that coefficient contain the value you expected?
    6. Run the following model, then interpret the coefficients for tuition.thous and the interaction between tuition.thous and act.comp.
crime <- mutate(crime, total.thous = total/1000)
fit3 <- glm(burg09 ~ act.comp + tuition.thous + 
            total.thous + act.comp:tuition.thous +
            act.comp:total.thous, family = poisson, 
            data = crime)
  1. U.S. National Medical Expenditure Survey. The data set NMES1988 in the AER package contains a sample of individuals over 65 who are covered by Medicare in order to assess the demand for health care through physician office visits, outpatient visits, ER visits, hospital stays, etc. The data can be accessed by installing and loading the AER package and then running data(NMES1988). More background information and references about the NMES1988 data can be found in help pages for the AER package.

    1. Show through graphical means that there are more respondents with 0 visits than might be expected under a Poisson model.
    2. Fit a ZIP model for the number of physician office visits using chronic, health, and insurance as predictors for the Poisson count, and chronic and insurance as the predictors for the binary part of the model. Then, provide interpretations in context for the following model parameters:
    • chronic in the Poisson part of the model
    • poor health in the Poisson part of the model
    • the Intercept in the logistic part of the model
    • insurance in the logistic part of the model
    1. Is there significant evidence that the ZIP model is an improvement over a simple Poisson regression model?
  2. Going vague: ambiguity in political issue statements. In the following exercise, you will use a hurdle model to analyze the data. A hurdle model is similar to a zero-inflated Poisson model, but instead of assuming that “zeros” are comprised of two distinct groups—those who would always be 0 and those who happen to be 0 on this occasion (e.g., non-drinkers and drinkers who had zero drinks over the weekend in Case Study 4.10)—the hurdle model assumes that “zeros” are a single entity. Therefore, in a hurdle model, cases are classified as either “zeros” or “non-zeros”, where “non-zeros” hurdle the 0 threshold—they must always have counts of 1 or above. We will use the pscl package and the hurdle function in it to analyze a hurdle model. Note that coefficients in the “zero hurdle model” section of the output relate predictors to the log-odds of being a non-zero (i.e., having at least one issue statement), which is opposite of the ZIP model.

    In a 2018 study, Chapp et al. (2018) scraped every issue statement from webpages of candidates for the U.S. House of Representatives, counting the number of issues candidates commented on and scoring the level of ambiguity of each statement. We will focus on the issue counts, and determining which attributes (of both the district as a whole and the candidates themselves) are associated with candidate silence (commenting on 0 issues) and a willingness to comment on a greater number of issues. The data set ambiguity.csv contains the following variables:

    • name : candidate name
    • distID : unique identification number for Congressional district
    • ideology : candidate left-right orientation
    • democrat : 1 if Democrat, 0 if Republican
    • mismatch : disagreement between candidate ideology and district voter ideology
    • incumbent : 1 if incumbent, 0 if not
    • demHeterogeneity : how much voters in a district differ according to race, education, occupation, etc.
    • attHeterogeneity : how much voters in a district differ according to political ideology
    • distLean : overall ideological lean in a district
    • totalIssuePages : number of issues candidates commented on (response)
    1. Create a frequency plot of totalIssuePages. Why might we consider using a hurdle model compared to a Poisson model? Why can’t we use a zero-inflated Poisson model?
    2. Create a plot of the empirical log odds of having at least one issue statement by ideology. You may want to group ideology values first. What can you conclude from this plot? (See Chapter 6 for more details.)
    3. Create a scatterplot that shows the log of the mean number of issues vs. ideology group by party, among candidates with at least one issue statement. What can we conclude from this plot?
    4. Create a hurdle model with ideology and democrat as predictors in both parts. Interpret ideology in both parts of the model.
    5. Repeat (d), but include an interaction in both parts. Interpret the interaction in the zero hurdle part of the model.
    6. Find the best model you can to determine totalIssuePages. Write a short paragraph discussing implications of your model.

4.11.3 Open-Ended Exercises

  1. Airbnb in NYC. Awad, Lebo, and Linden (2017) scraped 40628 Airbnb listings from New York City in March 2017 and put together the data set NYCairbnb.csv. Key variables include:

    • id = unique ID number for each unit
    • last_scraped = date when information scraped
    • host_since = date when host first listed the unit on Airbnb
    • days = last_scraped - host_since = number of days the unit has been listed
    • room_type = Entire home/apt., Private room, or Shared room
    • bathrooms = number of bathrooms
    • bedrooms = number of bedrooms
    • price = price per night (dollars)
    • number_of_reviews = number of reviews for the unit on Airbnb
    • review_scores_cleanliness = cleanliness score from reviews (1-10)
    • review_scores_location = location score from reviews (1-10)
    • review_scores_value = value score from reviews (1-10)
    • instant_bookable = “t” if instantly bookable, “f” if not

    Perform an EDA, build a model, and interpret model coefficients to describe variation in the number of reviews (a proxy for the number of rentals, which is not available) as a function of the variables provided. Don’t forget to consider an offset, if needed.

  2. Crab satellites. Brockmann (1996) carried out a study of nesting female horseshoe crabs. Female horseshoe crabs often have male crabs attached to a female’s nest known as satellites. One objective of the study was to determine which characteristics of the female were associated with the number of satellites. Of particular interest is the relationship between the width of the female carapace and satellites.

    The data can be found in crab.csv. It includes:

    • NumSat = number of satellites
    • Width = carapace width (cm)
    • Wt = weight (kg)
    • Sp = spine condition (1 = both good, 2 = one worn or broken, 3 = both worn or broken)
    • C = color (1 = light medium, 2 = medium, 3 = dark medium, 4 = dark)

    Use Poisson regression to investigate the research question. Be sure you work to obtain an appropriate model before considering overdispersion. Should a hurdle model be considered here? If so, fit a hurdle model and interpret in context.

  3. Doctor visits. Data was collected on doctor visits from a sample of 5,190 people in the 1977/1978 Australian Health Survey. Cameron and Trivedi (1986) sought to explain the variation in doctor visits using one or more explanatory variables. The data can be found in an R data set from library(AER) accessible with the command data("DoctorVisits"). Variable descriptions can be found under help("DoctorVisits").

    Explore the use of a zero-inflated model for this data. Begin with a histogram of the number of visits, complete an EDA, and then fit several models. Summarize your results.

  4. More fish. The number of fish caught (count), persons in the party (persons), the number of children in the party (child), whether or not they brought a camper into the park (camper), and the length of stay (LOS) were recorded for 250 camping parties. The data can be found in fish2.csv (source: UCLA Statistical Consulting Group (2018)). Create and assess a model for the number of fish caught.

References

Awad, Annika, Evan Lebo, and Anna Linden. 2017. “Intercontinental Comparative Analysis of Airbnb Booking Factors.”

Brockmann, H. Jane. 1996. “Satellite Male Groups in Horseshoe Crabs, Limulus Polyphemus.” Ethology 102 (1): 1–21. http://dx.doi.org/doi:10.1111/j.1439-0310.1996.tb01099.x.

Cameron, A. C., and P. K. Trivedi. 1986. “Econometric Models Based on Count Data: Comparisons and Applications of Some Estimators and Tests.” Journal of Applied Econometrics 1: 29–53.

Chapp, Christopher, Paul Roback, Kendra Jo Johnson-Tesch, Adrian Rossing, and Jack Werner. 2018. “Going Vague: Ambiguity and Avoidance in Online Political Messaging.” Social Science Computer Review, August. https://doi.org/10.1177/0894439318791168.

McCullagh, Peter, and John Ashworth Nelder. 1989. Generalized Linear Models. 2nd ed. Boca Raton, Florida: Chapman & Hall/ CRC.

Philippine Statistics Authority. 2015. “Family Income and Expenditure Survey.” https://www.kaggle.com/grosvenpaul/family-income-and-expenditure.

Poole, Joyce H. 1989. “Mate Guarding, Reproductive Success and Female Choice in African Elephants.” Animal Behaviour 37: 842–49. http://www.sciencedirect.com/science/article/pii/0003347289900687.

Ramsey, Fred, and Daniel Schafer. 2002. The Statistical Sleuth: A Course in Methods of Data Analysis. 2nd ed. Boston, Massachusetts: Brooks/Cole Cengage.

Scotto, Joseph, Alfred W. Kopf, and Fredrick Urbach. 1974. “Non-Melanoma Skin Cancer Among Caucasians in Four Areas of the United States.” Cancer 34 (4): 1333–8. https://doi.org/10.1002/1097-0142(197410)34:4<1333::AID-CNCR2820340447>3.0.CO;2-A.

Smith, Aaron, and Maeve Duggan. 2013. “Online Dating and Relationships.” Pew Research Center. https://www.issuelab.org/resources/15934/15934.pdf.

UCLA Statistical Consulting Group. 2018. “Zero-Inflated Negative Binomial Regression: R Data Analysis Examples.” https://stats.idre.ucla.edu/r/dae/zinb/.

Vuong, Q. H. 1989. “Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses.” Econometrica 57: 307–33. https://doi.org/10.2307/1912557.