2.2 Binomial

If \(X\) is the count of successes in \(n\) independent and identical Bernoulli trials, each with success probability \(\pi\), then \(X\) is a random variable with a binomial distribution, \(X \sim Bin(n,\pi)\)

\[f(x; n, \pi) = \frac{n!}{x!(n-x)!} \pi^x (1-\pi)^{n-x} \hspace{1cm} x \in \{0, 1, \ldots, n\}, \hspace{2mm} \pi \in [0, 1]\]

with \(E(X)=n\pi\) and \(Var(X) = n\pi(1-\pi)\).

Binomial sampling is used to model counts of one level of a categorical variable over a fixed number of trials. Here is a simple analysis of data from a binomial process. Data set dat contains frequencies of high-risk vs non-high-risk drinkers from a college survey.

## 
##  No Yes 
## 685 630
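
These counts presumably come from tabulating the response column (named high_risk, as used in the code below):

table(dat$high_risk)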

The MLE of \(\pi\) from the binomial distribution is the sample proportion, \(p = x/n\).

x <- sum(dat$high_risk == "Yes")
n <- nrow(dat)
p <- x / n
print(p)
## [1] 0.4790875
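
As a numerical sanity check, maximizing the binomial log-likelihood over \(\pi\) with optimize() recovers the same value (a sketch; any one-dimensional optimizer would do).

# maximize l(pi) = log f(x; n, pi) over pi in (0, 1)
optimize(function(pi) dbinom(x = x, size = n, prob = pi, log = TRUE),
         interval = c(0, 1), maximum = TRUE)$maximum
# returns approximately 0.479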

Here is the binomial distribution \(f(x; n, \pi)\) for \(x \in [550, 700]\).

# PMF and CDF evaluated at 20 evenly spaced counts between 550 and 700
events <- round(seq(from = 550, to = 700, length = 20), 0)
density <- dbinom(x = events, prob = p, size = n)
prob <- pbinom(q = events, prob = p, size = n, lower.tail = TRUE)
df <- data.frame(events, density, prob)
ggplot(df, aes(x = factor(events))) +
  geom_col(aes(y = density), fill = mf_pal()(1), alpha = 0.8) +
  geom_text(
    aes(label = round(density, 3), y = density + 0.001),
    position = position_dodge(0.9),
    size = 3,
    vjust = 0
  ) +
  # overlay the CDF, scaled to fit the PMF axis; sec_axis() undoes the scaling
  geom_line(
    data = df, 
    aes(x = as.numeric(factor(events)), y = prob / 40), 
    color = mf_pal()(1), 
    size = 1) +
  scale_y_continuous(sec.axis = sec_axis(~ . * 40, name = "Cum Prob")) +
  theme_mf() +
  labs(title = "PMF and CDF of Binomial Distribution",
       subtitle = "Bin(1315, 0.479).",
       x = "Events (x)",
       y = "Density")

There are several ways to calculate a confidence interval for \(\pi\). One method is the normal approximation (Wald) interval.

\[p \pm z_{\alpha /2} \sqrt{\frac{p (1 - p)}{n}}\]

alpha <- .05
z <- qnorm(1 - alpha / 2)
se <- sqrt(p * (1 - p) / n)
p + c(-z*se, z*se)
## [1] 0.4520868 0.5060882

This method is easy to understand and calculate by hand, but its accuracy suffers when \(np < 5\) or \(n(1-p) < 5\), and it does not work at all when \(p = 0\) or \(p = 1\) (the interval has zero width). The second option is the Wilson (score) interval.

\[\frac{p + \frac{z^2}{2n}}{1 + \frac{z^2}{n}} \pm \frac{z}{1 + \frac{z^2}{n}} \sqrt{\frac{p(1 - p)}{n} + \frac{z^2}{4n^2}}\]

est <- (p + (z^2)/(2*n)) / (1 + (z^2) / n)
pm <- z / (1 + (z^2)/n) * sqrt(p*(1-p)/n + (z^2) / (4*(n^2)))
est + c(-pm, pm)
## [1] 0.4521869 0.5061098

This is what prop.test() does when you set correct = FALSE.

prop.test(x = x, n = n, correct = FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  x out of n, null probability 0.5
## X-squared = 2.3004, df = 1, p-value = 0.1293
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.4521869 0.5061098
## sample estimates:
##         p 
## 0.4790875

There is a second version of the Wilson interval that applies a “continuity correction”, aligning the “minimum coverage probability”, rather than the “average coverage probability”, with the nominal value. I’ll need to learn what’s inside those quotation marks at some point.

prop.test(x = x, n = n)
## 
##  1-sample proportions test with continuity correction
## 
## data:  x out of n, null probability 0.5
## X-squared = 2.2175, df = 1, p-value = 0.1365
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.4518087 0.5064898
## sample estimates:
##         p 
## 0.4790875

Finally, there is the Clopper-Pearson exact confidence interval. Clopper-Pearson inverts two single-tailed binomial tests at the desired alpha. This is a non-trivial calculation, so there is no easy formula to crank through. Just use the binom.test() function and pray no one asks for an explanation.

binom.test(x = x, n = n)
## 
##  Exact binomial test
## 
## data:  x and n
## number of successes = 630, number of trials = 1315, p-value = 0.1364
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.4517790 0.5064896
## sample estimates:
## probability of success 
##              0.4790875
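
If someone does ask: the Clopper-Pearson bounds are quantiles of beta distributions, a standard identity, so the interval can be reproduced directly with qbeta().

# lower bound: alpha/2 quantile of Beta(x, n - x + 1)
# upper bound: 1 - alpha/2 quantile of Beta(x + 1, n - x)
c(qbeta(alpha / 2, x, n - x + 1),
  qbeta(1 - alpha / 2, x + 1, n - x))
## [1] 0.4517790 0.5064896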

The probability that no one in the sample is a high-risk drinker is \(f(0; 1315, 0.479) = \frac{1315!}{0!(1315-0)!} 0.479^0 (1-0.479)^{1315-0} \approx 0\).

dbinom(x = 0, size = n, prob = p)
## [1] 0
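
The probability is not exactly zero: \(0.521^{1315}\) is on the order of \(10^{-373}\), which underflows double precision. Computing on the log scale shows the magnitude.

# log-probability is 1315 * log(1 - p), about -858 (i.e., ~1e-373)
dbinom(x = 0, size = n, prob = p, log = TRUE)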

The probability that at least half the sample are high-risk drinkers, \(P(X \geq 658)\), is impractical to write out term by term, so let pbinom() sum the upper tail.

# P(X > 657.5) = P(X >= 658)
pbinom(q = .5 * n, size = n, prob = p, lower.tail = FALSE)
## [1] 0.06455096

As \(n\) increases for fixed \(\pi\), the binomial distribution approaches the normal distribution \(N(n\pi, n\pi(1-\pi))\); equivalently, the sample proportion is approximately \(N(\pi, \pi(1-\pi)/n)\), which is the scale used below. The normal distribution is a good approximation when \(n\) is large.

pnorm(q = 0.5, mean = p, sd = se, lower.tail = FALSE)
## [1] 0.06450357
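
The same approximation on the count scale, using the \(N(n\pi, n\pi(1-\pi))\) form quoted above, gives an identical answer; it is just a change of units.

pnorm(q = .5 * n, mean = n * p, sd = sqrt(n * p * (1 - p)), lower.tail = FALSE)
## [1] 0.06450357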

Here are some more examples using smaller sample sizes. The probability that 2 of 10 coin flips are heads when the probability of heads is 0.3:

dbinom(x = 2, size = 10, prob = 0.3)
## [1] 0.2334744

Here is a simulation from n = 10,000 random samples of size 10. rbinom() generates random draws from the binomial distribution.

data.frame(cnt = rbinom(n = 10000, size = 10, prob = 0.3)) %>%
  count(cnt) %>%
  mutate(pct = n / sum(n),
         X_eq_x = cnt == 2) %>%
  ggplot(aes(x = as.factor(cnt), y = n, fill = X_eq_x)) +
  geom_col(alpha = 0.8) +
  scale_fill_mf() +
  geom_label(aes(label = round(pct, 2)), size = 3, alpha = .6) +
  theme_mf() +
  theme(legend.position = "none") +
  labs(title = "Binomial Distribution", 
       subtitle = paste0(
         "P(X=2) successes in 10 trials when p = 0.3 is ", 
         round(dbinom(2, 10, 0.3), 4), "."
       ),
       x = "Successes",
       y = "Count",
       caption = "Simulation from n = 10,000 binomial random samples.") 

What is the probability of \(\le 2\) heads in 10 coin flips when the probability of heads is 0.3? The cumulative probability is the sum of the first three bars in the simulation above. Function pbinom() calculates the cumulative binomial probability.

pbinom(q = 2, size = 10, prob = 0.3, lower.tail = TRUE)
## [1] 0.3827828
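
Equivalently, summing the pmf over 0 through 2 (the three bars) returns the same number.

sum(dbinom(x = 0:2, size = 10, prob = 0.3))
## [1] 0.3827828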

What is the expected number of heads in 25 coin flips if the probability of heads is 0.3?

The expected value, \(\mu = np\), is 7.5. Here’s an empirical test from 10,000 samples.

mean(rbinom(n = 10000, size = 25, prob = .3))
## [1] 7.4738

The variance, \(\sigma^2 = np (1 - p)\), is 5.25. Here’s an empirical test.

var(rbinom(n = 10000, size = 25, prob = .3))
## [1] 5.162316

Suppose X and Y are independent random variables distributed \(X \sim Bin(10, .6)\) and \(Y \sim Bin(10, .7)\). What is the probability that either variable is \(\le 4\)?

Let \(P(A) = P(X \le 4)\) and \(P(B) = P(Y \le 4)\). Then \(P(A \cup B) = P(A) + P(B) - P(AB)\), and because the events are independent, \(P(AB) = P(A)P(B)\).

p_a <- pbinom(q = 4, size = 10, prob = 0.6, lower.tail = TRUE)
p_b <- pbinom(q = 4, size = 10, prob = 0.7, lower.tail = TRUE)
p_a + p_b - (p_a * p_b)
## [1] 0.2057164

Here’s an empirical test.

df <- data.frame(
  x = rbinom(10000, 10, 0.6),
  y = rbinom(10000, 10, 0.7)
  )
mean(if_else(df$x <= 4 | df$y <= 4, 1, 0))
## [1] 0.2019

A couple other points to remember:

  • The Bernoulli distribution is a special case of the binomial with \(n = 1\).
  • The binomial distribution assumes independent trials. If you sample without replacement from a finite population, use the hypergeometric distribution instead (see the sketch below).
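
Here is a minimal sketch of both points, with made-up numbers: a Bernoulli trial as \(Bin(1, \pi)\), and a population of 20 containing 8 successes, sampled 5 at a time.

# Bernoulli is binomial with size = 1
dbinom(x = 1, size = 1, prob = 0.3)
## [1] 0.3

# sampling without replacement: hypergeometric
# m = successes in population, n = failures in population, k = draws
dhyper(x = 2, m = 8, n = 12, k = 5)
## [1] 0.3973168

# sampling with replacement (or a large population): binomial with pi = 8/20
dbinom(x = 2, size = 5, prob = 8 / 20)
## [1] 0.3456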