6.1 Hypergeometric distributions

Example 6.1 Capture-recapture sampling is a technique often used to estimate the size of a population. Suppose you want to estimate \(N\), the number of monarch butterflies in Pismo Beach. (Assume that \(N\) is a fixed but unknown number; the population size doesn’t change over time.) You first capture a sample of \(N_1\) butterflies, selected randomly without replacement, tag them, and release them. At a later date, you then capture a second sample of \(n\) butterflies, selected randomly without replacement. Let \(X\) be the number of butterflies in the second sample that have tags (because they were also caught in the first sample). (Assume that tagging has no effect on behavior, so that being selected in the first sample is independent of being selected in the second sample.)

In practice, \(N\) is unknown. But let’s start with a simpler, but unrealistic, example where there are \(N=52\) butterflies, \(N_1 = 13\) are tagged in the first sample, and \(n=5\) is the size of the second sample.

  1. Are the five individual selections independent?
  2. What are the possible values of \(X\)?
  3. Describe in detail how you could use simulation to approximate the distribution of \(X\).
  4. Find \(\textrm{P}(X = 0)\) in two ways.
  5. Find the probability that in the second sample the first butterfly selected is tagged but the rest are not.
  6. Find the probability that in the second sample the first four butterflies selected are not tagged but the fifth is.
  7. Find \(\textrm{P}(X = 1)\) in two ways.
  8. Find \(\textrm{P}(X = 2)\) in two ways.
  9. Suggest a formula for the probability mass function of \(X\).
  10. Find \(\textrm{E}(X)\) and suggest a simple shortcut formula.
  11. It can be shown that \(\textrm{Var}(X) = 0.864\). Would the variance be greater, less, or the same if the sampling was with replacement rather than without?

Solution to Example 6.1.

  1. No. When sampling without replacement the individual selections are not independent. For example, the conditional probability that the second butterfly is tagged given that the first is tagged is 12/51, but the conditional probability that the second butterfly is tagged given that the first is not tagged is 13/51.

  2. \(X\) can take values 0, 1, 2, 3, 4, 5.

  3. Write 1 to represent “tagged” on 13 cards and 0 to represent “not tagged” on 39 cards. Shuffle the 52 cards and deal 5, and let \(X\) be the number of cards in the 5 dealt that are labeled 1 (“tagged”). Repeat many times to simulate many values of \(X\). Approximate \(\textrm{P}(X = x)\) with the simulated relative frequency for \(x = 0, 1, \ldots, 5\). For example, count the number of repetitions in which \(X\) is 2 and divide by the total number of repetitions to approximate \(\textrm{P}(X = 2)\).

  4. We can use the partitioning strategy from the previous section. \[ \textrm{P}(X = 0) = \frac{\binom{13}{0}\binom{39}{5}}{\binom{52}{5}} = 0.2215 \] We can also use the multiplication rule. The probability that the first butterfly selected is not tagged is 39/52. Given that the first is not tagged, the conditional probability that the second butterfly selected is not tagged is \(38/51\). Given that the first two butterflies selected are not tagged, the conditional probability that the third butterfly selected is not tagged is \(37/50\). And so on. The probability that none of the butterflies selected are tagged is \[ \textrm{P}(X = 0) = \left(\frac{39}{52}\right)\left(\frac{38}{51}\right)\left(\frac{37}{50}\right)\left(\frac{36}{49}\right)\left(\frac{35}{48}\right) = 0.2215 \] The two methods are equivalent.

  5. We can use the multiplication rule. The probability that the first butterfly selected is tagged is 13/52. Given that the first is tagged, the conditional probability that the second butterfly selected is not tagged is \(39/51\). Given that the first is tagged and the second is not, the conditional probability that the third butterfly selected is not tagged is \(38/50\). And so on. The probability in question is \[ \left(\frac{13}{52}\right)\left(\frac{39}{51}\right)\left(\frac{38}{50}\right)\left(\frac{37}{49}\right)\left(\frac{36}{48}\right) = 0.0823 \]

  6. We can use the multiplication rule. The probability that the first butterfly selected is not tagged is 39/52. Given that the first is not tagged, the conditional probability that the second butterfly selected is not tagged is \(38/51\). And so on. Given that the first four selected are not tagged, the probability that the fifth butterfly selected is tagged is 13/48. The probability in question is \[ \left(\frac{39}{52}\right)\left(\frac{38}{51}\right)\left(\frac{37}{50}\right)\left(\frac{36}{49}\right)\left(\frac{13}{48}\right) = 0.0823 \] This is the same probability in the previous part.

  7. Continuing with the two previous parts, the probability of any particular sequence with exactly one tagged butterfly is 0.0823. There are \(\binom{5}{1}=5\) such sequences, since there are 5 “spots” where the one tagged butterfly could be (first selected through fifth selected). Therefore \[ \textrm{P}(X = 1) = \binom{5}{1}\left(\frac{13}{52}\right)\left(\frac{39}{51}\right)\left(\frac{38}{50}\right)\left(\frac{37}{49}\right)\left(\frac{36}{48}\right) = 0.4114. \] We can also use the partitioning strategy. \[ \frac{\binom{13}{1}\binom{39}{4}}{\binom{52}{5}} = 0.4114. \] The two methods are equivalent.

  8. This is similar to the previous part. Multiplication rule: \[ \textrm{P}(X = 2) = \binom{5}{2}\left(\frac{13}{52}\right)\left(\frac{12}{51}\right)\left(\frac{39}{50}\right)\left(\frac{38}{49}\right)\left(\frac{37}{48}\right) = 0.2743. \] Partitioning: \[ \textrm{P}(X = 2) = \frac{\binom{13}{2}\binom{39}{3}}{\binom{52}{5}} = 0.2743. \]

  9. The partitioning method provides a more compact expression. In order to have a sample of size 5 with exactly \(x\) tagged butterflies, we need to select \(x\) butterflies from the 13 tagged butterflies in the population, and the remaining \(5-x\) butterflies from the 39 untagged butterflies in the population. \[ p_X(x) = \frac{\binom{13}{x}\binom{39}{5-x}}{\binom{52}{5}}, \qquad x = 0, 1, 2, 3, 4, 5. \] (A short numerical check of this formula, and of its agreement with the multiplication rule calculations above, appears after this solution.)

  10. We could find the distribution and use the definition of expected value. However, we can also write \(X\) as a sum of indicators and use linearity of expected value. For \(i=1, \ldots, 5\), let \(X_i\) be 1 if the \(i\)th butterfly selected is tagged, and let \(X_i\) be 0 otherwise. Then \(X=X_1+\cdots+X_5\). \(\textrm{E}(X_i)\) is the unconditional probability that the \(i\)th butterfly selected is tagged, which is 13/52. Therefore, \(\textrm{E}(X) = 5(13/52) = 1.25\). This makes sense: if 1/4 of the butterflies in the population are tagged, we would also expect 1/4 of the butterflies in a randomly selected sample to be tagged.

  11. The variance would be greater if the sampling was with replacement. When sampling without replacement, each selection restricts the remaining possibilities, which allows for less variability in the number of successes in the sample. For example, sampling with replacement yields a slightly larger probability for \(\{X=5\}\) than sampling without replacement does.

    As a more extreme example, suppose instead that the sample size was \(n=20\). Then, the largest possible value of \(X\) is 13 when sampling without replacement but 20 when sampling with replacement.
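
The calculations in parts 4 through 10 can be checked numerically. Below is a minimal Python sketch, using only math.comb and math.perm from the standard library (Python 3.8+), that computes each \(\textrm{P}(X=x)\) both by the multiplication rule (probability of one ordered sequence times the number of sequences) and by partitioning, and then computes \(\textrm{E}(X)\) and \(\textrm{Var}(X)\) from the pmf.

from math import comb, perm

N1, N0, n = 13, 39, 5     # tagged, untagged, sample size
N = N1 + N0

for x in range(n + 1):
    # Partitioning: choose x of the N1 tagged and n - x of the N0 untagged
    p_partition = comb(N1, x) * comb(N0, n - x) / comb(N, n)
    # Multiplication rule: probability of one ordered sequence with x tagged
    # butterflies, times the number of such sequences
    p_sequence = comb(n, x) * perm(N1, x) * perm(N0, n - x) / perm(N, n)
    print(x, round(p_partition, 4), round(p_sequence, 4))

pmf = [comb(N1, x) * comb(N0, n - x) / comb(N, n) for x in range(n + 1)]
ev = sum(x * p for x, p in enumerate(pmf))               # should be 1.25
var = sum((x - ev) ** 2 * p for x, p in enumerate(pmf))  # should be about 0.864
print(ev, var)

Both methods give the same probabilities (0.2215, 0.4114, 0.2743, and so on), and the mean and variance agree with the values 1.25 and 0.864 above.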


from symbulate import *
import matplotlib.pyplot as plt

# Box model: 13 tickets labeled 1 ("tagged"), 39 labeled 0 ("not tagged");
# select 5 at random without replacement
P = BoxModel({1: 13, 0: 39}, size = 5, replace = False)
# X = number of tagged butterflies in the sample
X = RV(P, count_eq(1))
x = X.sim(10000)

# Compare the simulated distribution with the Hypergeometric pmf
x.plot()
Hypergeometric(N1 = 13, N0 = 39, n = 5).plot()
plt.show()


# Simulated relative frequency of {X = 2} versus theoretical P(X = 2)
x.count_eq(2) / 10000, Hypergeometric(N1 = 13, N0 = 39, n = 5).pmf(2)
## (0.2772, 0.27427971188475375)

# Simulated mean versus theoretical E(X) = 1.25
x.mean(), Hypergeometric(N1 = 13, N0 = 39, n = 5).mean()
## (1.2626, 1.25)

# Simulated variance versus theoretical Var(X) = 0.864
x.var(), Hypergeometric(N1 = 13, N0 = 39, n = 5).var()
## (0.8772412399999999, 0.8639705882352942)
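
As a follow-up to part 11, the same Symbulate pattern can be used to simulate sampling with replacement. The simulated variance should be close to the Binomial variance \(np(1-p) = 5(1/4)(3/4) = 0.9375\), which is larger than the 0.864 obtained without replacement. (This sketch assumes Symbulate's Binomial distribution object, used in the same way as Hypergeometric above.)

# Same box model, but now sampling with replacement (part 11)
P_repl = BoxModel({1: 13, 0: 39}, size = 5, replace = True)
Y = RV(P_repl, count_eq(1))
y = Y.sim(10000)

y.var(), Binomial(n = 5, p = 13/52).var()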

Definition 6.1 A discrete random variable \(X\) has a Hypergeometric distribution with parameters \(n\), \(N_0\), and \(N_1\), all nonnegative integers, with \(N = N_0+N_1\) and \(p=N_1/N\), if its probability mass function is

\[\begin{align*} p_{X}(x) & = \frac{\binom{N_1}{x}\binom{N_0}{n-x}}{\binom{N_0+N_1}{n}},\quad x \in \{\max(n-N_0,0),\ldots,\min(n,N_1)\} \end{align*}\] If \(X\) has a Hypergeometric(\(n\), \(N_0\), \(N_1\)) distribution, then \[\begin{align*} \textrm{E}(X) & = np\\ \textrm{Var}(X) & = np(1-p)\left(\frac{N-n}{N-1}\right) \end{align*}\]
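
As a quick sanity check on these formulas, the sketch below compares them with scipy.stats.hypergeom for the butterfly example (this assumes SciPy is available; SciPy parameterizes the distribution by the population size, the number of successes in the population, and the sample size, in that order).

from math import comb
from scipy.stats import hypergeom

N1, N0, n = 13, 39, 5
N = N1 + N0
p = N1 / N

rv = hypergeom(N, N1, n)   # (population size, successes in population, sample size)

# pmf formula versus SciPy, over the support
for x in range(max(n - N0, 0), min(n, N1) + 1):
    print(x, comb(N1, x) * comb(N0, n - x) / comb(N, n), rv.pmf(x))

# mean and variance formulas versus SciPy
print(n * p, rv.mean())
print(n * p * (1 - p) * (N - n) / (N - 1), rv.var())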

Imagine a box containing \(N=N_1+N_0\) tickets, \(N_1\) of which are labeled 1 (“success”) and \(N_0\) of which are labeled 0 (“failure”). Randomly select \(n\) tickets from the box without replacement and let \(X\) be the number of tickets in the sample that are labeled 1. Then \(X\) has a Hypergeometric(\(n\), \(N_0\), \(N_1\)) distribution. Since the tickets are labeled 1 and 0, the random variable \(X\), which counts the number of successes, is equal to the sum of the 1/0 values on the tickets.

The population size is \(N\) and the sample size is \(n\). The population proportion of success is \(p=N_1/N\). The random variable \(X\) counts the number of successes in the sample, and so \(X/n\) is the sample proportion of success. However, since the selections are made without replacement, the draws are not independent, and it is not enough to just specify \(p\) to determine the distribution of \(X\) (or \(X/n\)).
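
To illustrate this point, the short sketch below (with made-up population sizes) compares two populations that both have success proportion \(p = 1/4\) but different sizes; sampling \(n=5\) without replacement from each yields different distributions for \(X\), so \(p\) and \(n\) alone do not determine the distribution.

from math import comb

def hyper_pmf(x, n, N1, N0):
    # P(X = x) when sampling n without replacement from N1 successes and N0 failures
    return comb(N1, x) * comb(N0, n - x) / comb(N1 + N0, n)

n = 5
# Same p = 1/4, different population sizes
for N1, N0 in [(13, 39), (130, 390)]:
    print(N1 + N0, [round(hyper_pmf(x, n, N1, N0), 4) for x in range(n + 1)])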

The largest possible value of \(X\) is \(\min(n, N_1)\), since there can’t be more successes in the sample than in the population. The smallest possible value of \(X\) is \(\max(0, n-N_0)\), since there can’t be more failures in the sample than in the population (that is, \(n-X\le N_0\)). Often \(N_0\) and \(N_1\) are large relative to \(n\), in which case \(X\) takes values \(0, 1,\ldots, n\).

The quantity \(\frac{N-n}{N-1}<1\), which appears in the variance formula, is called the finite population correction. We will investigate this factor in more detail later.
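
For the butterfly example, here is a small sketch of how this factor behaves as the sample size grows: the Hypergeometric variance equals the “with replacement” variance \(np(1-p)\) multiplied by \(\frac{N-n}{N-1}\), which shrinks toward 0 as \(n\) approaches \(N\) (when the sample is the whole population, there is no variability at all).

N1, N0 = 13, 39
N = N1 + N0
p = N1 / N

# Compare np(1-p) with the Hypergeometric variance for increasing sample sizes
for n in [5, 20, 40, 52]:
    fpc = (N - n) / (N - 1)
    print(n, round(n * p * (1 - p), 4), round(fpc, 4), round(n * p * (1 - p) * fpc, 4))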