## 2.4 Poisson

Poisson distributions model the number of events occurring in a fixed period of time. In a Poisson process, events occur continuously and independently at a constant average rate. This blog by Aerin Kim demonstrates how the Poisson distribution is related to the binomial distribution. The key is that the binomial distribution expresses the probability of $$X = k$$ events in $$n$$ trials when the probability of an event in one trial is $$p$$. Sometimes you don’t know $$p$$, but instead know that $$\lambda$$ events occur per time interval. You can back into a binomial setup by dividing the interval into $$n$$ subintervals so small that each contains at most one event. The probability of an event in one subinterval is then $$p = \lambda / n$$, and as $$n \rightarrow \infty$$, $$p \rightarrow 0$$. Follow the math through the limit and you wind up with the Poisson distribution!

$\begin{eqnarray}
P(X = k) &=& \lim_{n \rightarrow \infty} {n \choose k}p^k(1-p)^{n-k}\\
&=& \lim_{n \rightarrow \infty} \frac{n!}{(n-k)!k!} \left(\frac{\lambda}{n}\right)^k \left(1-\frac{\lambda}{n}\right)^{n-k}\\
&=& \lim_{n \rightarrow \infty} \frac{n!}{(n-k)!k!} \left(\frac{\lambda}{n}\right)^k \left(1-\frac{\lambda}{n}\right)^n\left(1-\frac{\lambda}{n}\right)^{-k}\\
&=& \lim_{n \rightarrow \infty} \frac{n!}{(n-k)!k!} \left(\frac{\lambda}{n}\right)^k \left(e^{-\lambda}\right)(1)\\
&=& \lim_{n \rightarrow \infty} \frac{n!}{(n-k)!\,n^k}\, e^{-\lambda}\frac{\lambda^k}{k!} \\
&=& e^{-\lambda}\frac{\lambda^k}{k!}
\end{eqnarray}$

The last step holds because $$\frac{n!}{(n-k)!\,n^k} = \frac{n(n-1)\cdots(n-k+1)}{n^k}$$ is a ratio of $$k$$ factors in the numerator to $$n^k$$ in the denominator, and each factor divided by $$n$$ approaches 1 as $$n \rightarrow \infty$$.
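You can also check the limit numerically: with $$\lambda$$ fixed and $$p = \lambda/n$$, the binomial probabilities settle onto the Poisson pmf as $$n$$ grows. A quick sketch (the choices $$\lambda = 2$$ and $$k = 3$$ are arbitrary):

```r
# Binomial(n, lambda/n) probabilities converge to the Poisson(lambda) pmf.
# lam = 2 and k = 3 are arbitrary illustrative choices.
lam <- 2
k <- 3
sapply(c(10, 100, 1000, 10000), function(n) dbinom(k, size = n, prob = lam / n))
dpois(k, lambda = lam)  # limiting value, about 0.180
```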

If $$X$$ is the number of successes in $$n$$ (many) trials when the probability of success $$\lambda / n$$ is small, then $$X$$ is a random variable with a Poisson distribution, $$X \sim \mathrm{Pois}(\lambda)$$,

$f(k; \lambda) = \frac{e^{-\lambda} \lambda^k}{k!} \hspace{1cm} k \in \{0, 1, \ldots\}, \hspace{2mm} \lambda > 0$

with $$E(X)=\lambda$$ and $$\mathrm{Var}(X) = \lambda$$.
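Both moments are easy to confirm by simulation (the sample size and rate here are arbitrary):

```r
# Simulated Poisson draws should have sample mean and variance near lambda.
set.seed(123)
x <- rpois(n = 1e5, lambda = 4)
mean(x)  # close to 4
var(x)   # also close to 4
```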

The Poisson distribution is also related to the exponential distribution. If the number of events per unit time follows a Poisson distribution, then the amount of time between events follows the exponential distribution.
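One way to see the connection is to simulate a Poisson process from its inter-arrival times and count events per unit interval. This is an illustrative sketch (the rate and number of events are arbitrary):

```r
# Exponential inter-arrival times with rate lam imply Poisson(lam) counts
# per unit time interval.
set.seed(123)
lam <- 2
arrival_times <- cumsum(rexp(n = 1e4, rate = lam))
# Count events in (i-1, i]; drop the partial last interval.
counts <- head(tabulate(ceiling(arrival_times)), -1)
mean(counts)  # close to lam
var(counts)   # also close to lam
```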

The Poisson likelihood function is

$L(\lambda; k) = \prod_{i=1}^n f(k_i; \lambda) = \prod_{i=1}^n \frac{e^{-\lambda} \lambda^{k_i}}{k_i!} = \frac{e^{-n \lambda} \lambda^{\sum_i k_i}}{\prod_i k_i!}.$

The Poisson loglikelihood function, dropping the constant term $$-\sum_i \log k_i!$$ that does not involve $$\lambda$$, is

$l(\lambda; k) = \sum_{i=1}^n k_i \log \lambda - n \lambda.$

The loglikelihood function is maximized at

$\hat{\lambda} = \frac{1}{n} \sum_{i=1}^n k_i.$

Thus, for a Poisson sample, the MLE for $$\lambda$$ is just the sample mean.

Consider data set dat_pois containing frequency counts of goals scored per match during a soccer tournament. Totals ranged from 0 to 8.

##   goals freq
## 1     0   23
## 2     1   37
## 3     2   20
## 4     3   11
## 5     4    2
## 6     5    1
## 7     6    0
## 8     7    0
## 9     8    1

The MLE of $$\lambda$$ is the frequency-weighted sample mean.

(lambda <- weighted.mean(dat_pois$goals, dat_pois$freq))
## [1] 1.378947

The 95% CI is $$\hat{\lambda} \pm z_{.05/2} \sqrt{\hat{\lambda} / n}$$.

n <- sum(dat_pois$freq)
z <- qnorm(0.975)
se <- sqrt(lambda / n)
lambda + c(-z*se, +z*se)
## [1] 1.142812 1.615082

The expected probability of scoring 2 goals in a match is $$\frac{e^{-1.38} 1.38^2}{2!} = 0.239$$.

dpois(x = 2, lambda = lambda)
## [1] 0.2394397

The expected probability of scoring 2 to 4 goals in a match is

ppois(q = 4, lambda = lambda) -
ppois(q = 1, lambda = lambda)
## [1] 0.3874391

data.frame(events = 0:10,
           pmf = dpois(0:10, lambda),
           cdf = ppois(0:10, lambda, lower.tail = TRUE)) %>%
  ggplot() +
  geom_col(aes(x = factor(events), y = pmf)) +
  geom_text(aes(x = factor(events), label = round(pmf, 3), y = pmf + 0.01),
            position = position_dodge(0.9), size = 3, vjust = 0) +
  geom_line(aes(x = events, y = cdf / 2), size = 1) +
  scale_y_continuous(limits = c(0, .5),
                     sec.axis = sec_axis(~ . * 2, name = "Cum Prob")) +
  labs(title = "PMF and CDF of Poisson Distribution",
       subtitle = paste0("Pois(", scales::comma(lambda, accuracy = .01), ")"),
       x = "X = k", y = "Density")

How well does the Poisson distribution fit the dat_pois data set?

dat_pois %>%
  mutate(pred = n * dpois(x = goals, lambda = lambda)) %>%
  rename(obs = freq) %>%
  pivot_longer(cols = -goals) %>%
  ggplot(aes(x = goals, y = value, color = name)) +
  geom_point() +
  # geom_smooth(se = FALSE) +
  theme(legend.position = "top") +
  labs(
    title = "Poisson Dist: Observed vs Expected",
    color = "",
    y = "frequency"
  )

It seems to fit pretty well! Run a chi-square goodness-of-fit test to confirm. The test requires an expected frequency of at least 5 per cell. Cells 4 through 8 fall below 5, so lump them together into a single "4 or more" cell.

x <- dat_pois %>%
  group_by(goals = if_else(goals >= 4, 4, goals)) %>%
  summarize(.groups = "drop", freq = sum(freq)) %>%
  pull(freq)
p <- dat_pois %>%
  mutate(p = dpois(x = goals, lambda = lambda)) %>%
  group_by(goals = if_else(goals >= 4, 4, goals)) %>%
  summarize(.groups = "drop", p = sum(p)) %>%
  pull(p)
(chisq_test <- chisq.test(x = x, p = p, rescale.p = TRUE))
##
##  Chi-squared test for given probabilities
##
## data:  x
## X-squared = 1.0414, df = 4, p-value = 0.9035

A chi-square goodness-of-fit test was conducted to determine whether the matches had the same proportion of goals as the theoretical Poisson distribution. The minimum expected frequency was 5. The chi-square goodness-of-fit test indicated that the number of matches with 0, 1, 2, 3, or 4 or more goals was not statistically significantly different from the proportions expected in the theoretical Poisson distribution ($$\chi^2$$(4) = 1.041, p = 0.903).

As demonstrated at the beginning of this section, $$\mathrm{Bin}(n, \pi)$$ approaches $$\mathrm{Pois}(\lambda)$$ with $$\lambda = n\pi$$ as $$n \rightarrow \infty$$ and $$\pi \rightarrow 0$$. The Poisson is a useful (computationally inexpensive) approximation to the binomial when $$n \ge 20$$ and $$\pi \le 0.05$$. For example, suppose a baseball player has a $$\pi = .03$$ chance of hitting a home run in any at-bat. What is the probability of hitting more than 20 home runs in $$n = 500$$ at-bats? This is a binomial process because the number of trials is fixed.

pbinom(q = 20, size = 500, prob = 0.03, lower.tail = FALSE)
## [1] 0.07979678

But $$n$$ is large and $$\pi$$ is small, so the Poisson distribution works well too.

ppois(q = 20, lambda = 0.03 * 500, lower.tail = FALSE)
## [1] 0.08297091

n <- 500
p <- 0.03
x <- 0:30
data.frame(
  events = x,
  Poisson = dpois(x = x, lambda = p * n),
  Binomial = dbinom(x = x, size = n, prob = p)
) %>%
  pivot_longer(cols = -events) %>%
  ggplot(aes(x = events, y = value, color = name)) +
  geom_line() +
  theme(legend.position = "top") +
  labs(title = "Pois(.03*500) vs. Bin(500, .03)",
       subtitle = "Poisson approximation to binomial.",
       x = "Events",
       y = "Density",
       color = "")

When the observed variance is greater than the mean $$\lambda$$ (overdispersion), the negative binomial distribution can be used in place of the Poisson.
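A quick sketch of what overdispersion looks like, using simulated counts (the parameters are arbitrary): a negative binomial with mean $$\mu$$ and size $$\theta$$ has variance $$\mu + \mu^2/\theta$$, which always exceeds the mean.

```r
# Negative binomial counts: the variance exceeds the mean (overdispersion),
# unlike the Poisson, where the variance equals the mean.
set.seed(123)
x <- rnbinom(n = 1e5, size = 2, mu = 3)
mean(x)  # close to 3
var(x)   # close to 3 + 3^2/2 = 7.5, well above the mean
```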