2.3 Hypergeometric
If \(X\) is the number of successes in a sample of size \(n\) drawn without replacement from a population of size \(N\) containing \(K\) successes, then \(X\) is a random variable with a hypergeometric distribution
\[f_X(k|N, K, n) = \frac{{{K}\choose{k}}{{N-K}\choose{n-k}}}{{N}\choose{n}},\]
with \(E(X)=n\frac{K}{N}\) and \(Var(X) = n\frac{K}{N}\cdot\frac{N-K}{N}\cdot\frac{N-n}{N-1}\).
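The formula follows from counting the possible outcomes, summarized in the frequency table below: there are \({{K}\choose{k}}\) ways to pick the \(k\) sampled successes, \({{N-K}\choose{n-k}}\) ways to pick the \(n-k\) sampled non-successes, and \({{N}\choose{n}}\) equally likely samples in total.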
| | Sampled | Not Sampled | Total |
|---|---|---|---|
| success | k | K-k | K |
| non-success | n-k | (N-K)-(n-k) | N-K |
| Total | n | N-n | N |
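The counting argument behind the table can be checked by brute force on a small population. The sketch below assumes a hypothetical population of 6 items, 3 of which are successes, and enumerates every possible sample of size 3 with combn(). The names pop, n_small, samples, counts, enumerated, and by_formula are illustrative and do not come from the text above.

```r
# Hypothetical small population: 3 successes (1) and 3 non-successes (0).
pop <- c(1, 1, 1, 0, 0, 0)
n_small <- 3

# Enumerate all choose(6, 3) = 20 equally likely samples (one per column).
samples <- combn(length(pop), n_small)

# Count the successes in each sample.
counts <- apply(samples, 2, function(idx) sum(pop[idx]))

# Proportion of samples with 0, 1, 2, 3 successes, found by enumeration.
enumerated <- as.numeric(table(factor(counts, levels = 0:n_small))) / ncol(samples)

# The same probabilities from the hypergeometric formula.
by_formula <- choose(3, 0:n_small) * choose(3, n_small - 0:n_small) / choose(6, n_small)

all.equal(enumerated, by_formula)
```

Enumeration is only feasible for tiny populations, but it makes the roles of the three binomial coefficients concrete.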
Here is a simple analysis of data from a hypergeometric process. What is the probability of selecting \(k = 14\) red marbles from a sample of \(n = 20\) taken from an urn containing \(K = 70\) red marbles and \(N-K = 30\) green marbles?
The function choose() returns the binomial coefficient \({{n}\choose{k}} = \frac{n!}{k!(n-k)!}\).
k <- 14; n <- 20; N <- 100; K <- 70
choose(K, k) * choose(N-K, n-k) / choose(N, n)
## [1] 0.2140911
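In practice there is no need to compute this by hand: dhyper() returns the same probability. Its arguments are x (successes in the sample), m (successes in the population), n (non-successes in the population), and k (sample size), so the names do not match the notation used above.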
dhyper(x = k, m = K, n = N-K, k = n)
## [1] 0.2140911
The expected value is \(20 \cdot 0.7 = 14\) and the variance is \(20 \cdot 0.7 \cdot 0.3 \cdot \frac{80}{99} = 3.3939394\).
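As a rough check, the mean and variance can be computed two ways: from the closed-form expressions \(n\frac{K}{N}\) and \(n\frac{K}{N}\cdot\frac{N-K}{N}\cdot\frac{N-n}{N-1}\), and directly from the probabilities returned by dhyper(). This sketch re-declares the marble example's values so that it runs on its own. The names support, pmf, and the *_formula and *_pmf variables are illustrative.

```r
# Values from the marble example.
n <- 20; N <- 100; K <- 70

# Closed-form mean and variance.
mean_formula <- n * K / N
var_formula <- n * (K / N) * ((N - K) / N) * ((N - n) / (N - 1))

# Mean and variance computed from the probability mass function.
support <- 0:n
pmf <- dhyper(x = support, m = K, n = N - K, k = n)
mean_pmf <- sum(support * pmf)
var_pmf <- sum((support - mean_pmf)^2 * pmf)

c(mean_formula = mean_formula, mean_pmf = mean_pmf)
c(var_formula = var_formula, var_pmf = var_pmf)
```

The two approaches should agree up to floating-point error. A disagreement would point to a mistake in the formula or in the argument order passed to dhyper().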
The hypergeometric random variable is similar to the binomial random variable, except that it applies to sampling without replacement from a finite population. As the population size increases with the proportion of successes held fixed, sampling without replacement behaves more and more like sampling with replacement, and the hypergeometric distribution converges to the binomial. What if the total population size had been 250 or 500 instead of 100? The plot below overlays the hypergeometric probabilities for each population size (lines) on the binomial probabilities (columns).
library(dplyr)
library(tidyr)
library(ggplot2)

# Binomial reference distribution, drawn as columns.
p <- data.frame(x = 1:20) %>%
  mutate(density = dbinom(x = 1:20, size = n, prob = K / N)) %>%
  ggplot() +
  geom_col(aes(x = x, y = density))

# Hypergeometric probabilities with the population scaled to N = 100, 250, and 500.
hyper <- data.frame(
  x = 1:20,
  N_100 = dhyper(x = 1:20, m = K*1.0, n = (N-K)*1.0, k = n),
  N_250 = dhyper(x = 1:20, m = K*2.5, n = (N-K)*2.5, k = n),
  N_500 = dhyper(x = 1:20, m = K*5.0, n = (N-K)*5.0, k = n)
) %>% pivot_longer(-x)

# Overlay the hypergeometric curves on the binomial columns.
p + geom_line(data = hyper, aes(x = x, y = value, color = name)) +
  theme_minimal() +
  theme(legend.position = "top") +
  labs(title = "P(X = k) when X ~ Hypergeometric(N, .7N, 20)", color = "", x = "k")
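The convergence can also be summarized numerically instead of being read off a plot. This is a minimal sketch, assuming the same sample size of 20 and success proportion of 0.7 as the marble example. The helper max_gap() and the variable prop are hypothetical names introduced here. The helper returns the largest absolute difference between the hypergeometric and binomial probabilities over \(k = 0, \dots, n\).

```r
n <- 20      # sample size from the marble example
prop <- 0.7  # proportion of successes in the population

# Largest absolute gap between dhyper() and dbinom() for a population of size N.
max_gap <- function(N) {
  K <- round(prop * N)
  x <- 0:n
  max(abs(dhyper(x, m = K, n = N - K, k = n) -
          dbinom(x, size = n, prob = K / N)))
}

sizes <- c(100, 250, 500, 1000)
gaps <- sapply(sizes, max_gap)
names(gaps) <- paste0("N_", sizes)
gaps
```

The gap should shrink as the population grows, which is the numerical counterpart of the hypergeometric lines closing in on the binomial columns.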