Chapter 4 Unequal probability sampling

4.1 The Horvitz-Thompson estimator

Suppose we seek to estimate the population total of a study variable \(y\):

\[\begin{equation} \theta = Y = \sum_{k \in U} y_k \tag{4.1} \end{equation}\]

Horvitz and Thompson (1952) proposed the following expression as an estimator of \(\theta\), where \(\pi_k\) denotes the inclusion probability of \(k\), that is, the probability that \(k\) is selected in the sample:

\[\begin{equation} \hat{Y}_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in s} d_k y_k \tag{4.2} \end{equation}\]

\(d_k = 1 / \pi_k\) is the design weight of \(k\).

In the case of simple random sampling of size \(n\), the inclusion probability of every unit \(k\) is \(\pi_k = n/N\), and the Horvitz-Thompson estimator reduces to the familiar expansion estimator based on the sample mean: \[\begin{equation} \hat{Y}_{HT} = \sum_{k \in s} \frac{y_k}{n/N} = N\bar{y} \end{equation}\]
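As a minimal illustration (a sketch in Python, not part of the original text, using made-up data), the estimator (4.2) only requires the sampled \(y\)-values and their inclusion probabilities; under simple random sampling it reduces to \(N\bar{y}\):

```python
# Illustrative sketch: Horvitz-Thompson estimator (4.2) from a sample of
# y-values and their inclusion probabilities (all data below are hypothetical).
import numpy as np

def horvitz_thompson(y_sample, pi_sample):
    """Estimate the population total Y as the sum of y_k / pi_k over the sample."""
    y = np.asarray(y_sample, dtype=float)
    pi = np.asarray(pi_sample, dtype=float)
    return np.sum(y / pi)

# Under simple random sampling of size n from a population of size N,
# pi_k = n / N for every unit, and the estimator reduces to N * ybar.
N, n = 1000, 50
rng = np.random.default_rng(42)
y_s = rng.normal(10.0, 2.0, size=n)               # hypothetical sampled y-values
print(horvitz_thompson(y_s, np.full(n, n / N)))   # same value as N * y_s.mean()
```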

Lemma 1: \(\sum_{i \in U} \pi_i = E\left(n_S\right)\)

where \(n_S\) is the (random) sample size. If the sample size is fixed, then we have \(\sum_{i \in U} \pi_i = n\)

Let us introduce the following sample membership indicator: \(\delta_i = 1\) if \(i \in S\) and \(\delta_i = 0\) otherwise. \(\delta_i\) is a random variable whose expectation is given by: \[E\left(\delta_i\right) = Pr\left(i \in S\right) = \pi_i\]

Hence, we have: \(\sum_{i \in U} \pi_i = \sum_{i \in U} E\left(\delta_i\right) = E\left(\sum_{i \in U} \delta_i\right) = E\left(n_S\right)\)
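As a quick numerical illustration (my own sketch; Poisson sampling, where each unit is included independently with probability \(\pi_i\), is just one convenient example of a design with random sample size):

```python
# Sketch: check Lemma 1 by simulation under Poisson sampling, where each unit
# enters the sample independently with probability pi_i (arbitrary values here).
import numpy as np

rng = np.random.default_rng(1)
pi = rng.uniform(0.05, 0.5, size=200)            # arbitrary inclusion probabilities
sizes = [(rng.uniform(size=pi.size) < pi).sum() for _ in range(20000)]
print(np.mean(sizes), pi.sum())                  # the two values should be close
```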

Lemma 2: \(\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = V\left(n_S\right) + E\left(n_S\right)\left[E\left(n_S\right)-1\right]\)

If the sample size is fixed, then we have \(\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = n\left(n-1\right)\)

\(\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = \sum_{i \in U} \sum_{j \in U, j \neq i} E\left(\delta_i \delta_j\right)= \sum_{i \in U} E\left(\delta_i \sum_{j \in U, j \neq i}\delta_j\right)= \sum_{i \in U} E\left[\delta_i \left(n_S -\delta_i\right)\right]\)

\(= \sum_{i \in U} E\left(n_S\delta_i\right) - \sum_{i \in U} E\left( \delta^2_i\right)= E\left(n_S \sum_{i \in U} \delta_i\right) - \sum_{i \in U} E\left( \delta_i\right)= E\left(n^2_S\right) - E\left(n_S\right)\), where we used \(\delta^2_i = \delta_i\) and \(\sum_{i \in U} \delta_i = n_S\).

Then we have: \(\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = E\left(n^2_S\right) - E\left(n_S\right) = E\left(n^2_S\right) - E^2\left(n_S\right) + E^2\left(n_S\right) - E\left(n_S\right)\)

\(= V\left(n_S\right) + E\left(n_S\right)\left[E\left(n_S\right)-1\right]\)

Lemma 3: \(\sum_{j \in U} \pi_{ij} = E\left(n_S\delta_i\right)\)

If the sample size is fixed, then we have \(\sum_{j \in U} \pi_{ij} = n E\left(\delta_i\right) = n \pi_i\)

The proof is similar to that for Lemma 2.

Result 1: Assuming \(\pi_k>0~~\forall k \in U\), \(\hat{Y}_{HT}\) is an unbiased estimator of \(Y\)

The proof is straightforward once the HT estimator is rewritten as a sum over all population elements using the indicators \(\delta_k\):

\[\hat{Y}_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in U} \frac{y_k}{\pi_k} \delta_k\]

Then the expectation is given by:

\[E\left(\hat{Y}_{HT}\right) = \sum_{k \in U} \frac{y_k}{\pi_k} E\left(\delta_k\right) = \sum_{k \in U} \frac{y_k}{\pi_k} \pi_k = Y\]
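A Monte Carlo check of this unbiasedness can be sketched as follows (an illustrative example with made-up data, again using Poisson sampling so that the \(\pi_k\) are easy to control):

```python
# Sketch: the average of the HT estimates over many replications should be
# close to the true total Y (unbiasedness, Result 1). Data are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
N = 500
x = rng.uniform(1, 10, size=N)                   # auxiliary "size" variable
y = 3 * x + rng.normal(0, 1, size=N)             # study variable
pi = 50 * x / x.sum()                            # unequal inclusion probabilities
Y = y.sum()

estimates = []
for _ in range(5000):
    s = rng.uniform(size=N) < pi                 # Poisson sampling: delta_k draws
    estimates.append(np.sum(y[s] / pi[s]))
print(np.mean(estimates), Y)                     # nearly equal
```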

Result 2: The variance of \(\hat{Y}_{HT}\) is given by:

\[\begin{equation} V\left(\hat{Y}_{HT} \right) = \sum_{i \in U}\sum_{j \in U} \Delta_{ij}\frac{y_i}{\pi_i}\frac{y_j}{\pi_j} \tag{4.3} \end{equation}\]

where:

  • \(\pi_{ij}\) is the joint (second-order) inclusion probability of \(i\) and \(j\): \(\pi_{ij} = Pr\left(i,j \in S\right)\)
  • \(\Delta_{ij} = \pi_{ij}-\pi_i \pi_j\)

The proof is straightforward: \[V\left(\hat{Y}_{HT}\right) = \sum_{i \in U} \frac{y^2_i}{\pi^2_i} V\left(\delta_i\right) + \sum_{i \in U} \sum_{j \in U, j \neq i} \frac{y_i}{\pi_i} \frac{y_j}{\pi_j} Cov\left(\delta_i , \delta_j \right) \]

The final result follows since \(Cov\left(\delta_i , \delta_j \right) = \pi_{ij}-\pi_i \pi_j = \Delta_{ij}\) for \(i \neq j\), while the diagonal terms satisfy \(V\left(\delta_i\right) = \pi_i\left(1-\pi_i\right) = \Delta_{ii}\) (since \(\pi_{ii} = \pi_i\)).

Let us see what we obtain in the case of simple random sampling. When \(i \neq j\), we have: \[\begin{equation} \begin{array}{rcl} \Delta_{ij} & = & \pi_{ij} - \pi_i \pi_j \\ & = & \displaystyle{\frac{n\left(n-1\right)}{N\left(N-1\right)}-\frac{n}{N}\frac{n}{N}} \\ & = & \displaystyle{\frac{n}{N}\frac{N\left(n-1\right)-n\left(N-1\right)}{N\left(N-1\right)}} \\ & = & \displaystyle{-\frac{n}{N\left(N-1\right)}\frac{N-n}{N}} \end{array} \end{equation}\]

Hence, we get: \[\begin{equation} \begin{array}{rcl} \displaystyle{\frac{\Delta_{ij}}{\pi_i \pi_j}} & = & \displaystyle{-\frac{1}{n}\frac{N}{N-1}\frac{N-n}{N}} = A \end{array} \end{equation}\]

When \(i = j\), we get: \[\begin{equation} \displaystyle{\frac{\Delta_{ii}}{\pi_i \pi_i}} = \displaystyle{\frac{\pi_i\left(1-\pi_i\right)}{\pi^2_i}} = \displaystyle{\frac{1-\pi_i}{\pi_i}} = \frac{N}{n} - 1 = B = - \left(N - 1\right)A \end{equation}\]

Hence the variance of the estimator for the total is: \[\begin{equation} \begin{array}{rcl} V\left(N\bar{y}\right) & = & \sum_{i \in U} \sum_{j \in U, j \neq i} A y_i y_j + \sum_{i \in U} B y^2_i \\ & = & \sum_{i \in U} \sum_{j \in U} A y_i y_j - \sum_{i \in U} A y^2_i - \sum_{i \in U} \left(N-1\right) A y^2_i \\ & = & A \left[\left(\sum_{i \in U} y_i \right)^2 - \sum_{i \in U} y^2_i - (N-1)\sum_{i \in U} y^2_i\right] \\ & = & A \left[\left(\sum_{i \in U} y_i \right)^2 - N\sum_{i \in U} y^2_i\right] \\ & = & A \left[-N^2 S_y^2 \displaystyle{\frac{N-1}{N}} \right] \\ & = & N^2 \left(1-\displaystyle{\frac{n}{N}}\right) \displaystyle{\frac{S_y^2}{n}} \end{array} \end{equation}\]

This is the well-known result that was established in the chapter on simple random sampling.
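As a sanity check (an illustrative sketch with made-up population values; simple random sampling is chosen because \(\pi_i\) and \(\pi_{ij}\) are then known in closed form), evaluating (4.3) numerically gives the same value as the classical formula above:

```python
# Sketch: evaluate the general variance (4.3) with the SRS probabilities and
# compare it with N^2 (1 - n/N) S_y^2 / n.
import numpy as np

def ht_variance(y, pi, pi2):
    """Theoretical variance (4.3): sum_ij Delta_ij * (y_i/pi_i) * (y_j/pi_j)."""
    a = np.asarray(y) / np.asarray(pi)
    delta = pi2 - np.outer(pi, pi)
    return a @ delta @ a

N, n = 100, 20
rng = np.random.default_rng(3)
y = rng.normal(10, 2, size=N)                    # hypothetical population values
pi = np.full(N, n / N)
pi2 = np.full((N, N), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi2, n / N)                     # pi_ii = pi_i
print(ht_variance(y, pi, pi2))
print(N**2 * (1 - n / N) * y.var(ddof=1) / n)    # same result
```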

Result 3: Assuming \(\pi_{ij}>0~~\forall i,j \in U\), the variance of \(\hat{Y}_{HT}\) is estimated by:

\[\begin{equation} \hat{V}\left(\hat{Y}_{HT} \right) = \sum_{i \in s}\sum_{j \in s} \frac{\Delta_{ij}}{\pi_{ij}}\frac{y_i}{\pi_i}\frac{y_j}{\pi_j} \tag{4.4} \end{equation}\]
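A sketch of estimator (4.4) might look as follows (my own illustration; it assumes the joint probabilities \(\pi_{ij}\) of the sampled units are available, which is the case for simple designs such as SRS):

```python
# Sketch: HT variance estimator (4.4) from the sampled y-values, their
# first-order probabilities and the matrix of joint probabilities pi_ij.
import numpy as np

def ht_variance_estimate(y_s, pi_s, pi2_s):
    """Estimator (4.4): sum over sampled pairs of (Delta_ij/pi_ij)(y_i/pi_i)(y_j/pi_j)."""
    a = np.asarray(y_s) / np.asarray(pi_s)
    delta = pi2_s - np.outer(pi_s, pi_s)
    return a @ (delta / pi2_s) @ a

# Example under SRS, where pi_ij = n(n-1)/(N(N-1)) and pi_ii = n/N:
N, n = 100, 20
rng = np.random.default_rng(4)
y_s = rng.normal(10, 2, size=n)                  # hypothetical sampled values
pi_s = np.full(n, n / N)
pi2_s = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi2_s, n / N)
print(ht_variance_estimate(y_s, pi_s, pi2_s))
```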

Result 4 (Sen-Yates-Grundy): Assuming the sampling design is of fixed size, the variance of \(\hat{Y}_{HT}\) can be alternatively written as:

\[\begin{equation} {V}_{SYG}\left(\hat{Y}\right) = -\frac{1}{2}\sum_{i \in U}\sum_{j \in U} \Delta_{ij}\left(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\right)^2 \tag{4.5} \end{equation}\]

\[\begin{equation} \hat{V}_{SYG}\left(\hat{Y}\right) = -\frac{1}{2}\sum_{i \in s}\sum_{j \in s} \frac{\Delta_{ij}}{\pi_{ij}}\left(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\right)^2 \tag{4.6} \end{equation}\]

The proof stems from the expansion of the square term in (4.5), using the symmetry \(\Delta_{ij} = \Delta_{ji}\): \[-\frac{1}{2}\sum_{i \in U}\sum_{j \in U} \Delta_{ij}\left(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\right)^2 = -\sum_{i \in U}\sum_{j \in U} \Delta_{ij} \frac{y^2_i}{\pi^2_i} + \sum_{i \in U}\sum_{j \in U} \Delta_{ij}\frac{y_i}{\pi_i}\frac{y_j}{\pi_j}\]

Assuming fixed sample size and using Lemmas 1 and 3, we have for all \(i \in U\): \(\sum_{j \in U} \Delta_{ij} = \sum_{j \in U} \pi_{ij} - \sum_{j \in U} \pi_i \pi_j = n \pi_i - \pi_i \sum_{j \in U} \pi_j = 0\)

Thus the result is proved.
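The Sen-Yates-Grundy estimator (4.6) can be sketched in the same spirit (again assuming the \(\pi_{ij}\) of the sampled units are known; this is only an illustration, not production code):

```python
# Sketch: Sen-Yates-Grundy variance estimator (4.6) for a fixed-size design.
import numpy as np

def syg_variance_estimate(y_s, pi_s, pi2_s):
    a = np.asarray(y_s) / np.asarray(pi_s)          # y_k / pi_k for sampled units
    delta = pi2_s - np.outer(pi_s, pi_s)            # Delta_ij on the sample
    diff2 = (a[:, None] - a[None, :]) ** 2          # (y_i/pi_i - y_j/pi_j)^2
    return -0.5 * np.sum((delta / pi2_s) * diff2)   # diagonal terms vanish
```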

Consequently, when the inclusion probabilities \(\pi_i\) are chosen proportional to the study variable \(y_i\), the variance (4.5) is equal to 0. In practice, as the study variable \(y\) is unknown, the inclusion probabilities are taken proportional to an auxiliary variable \(x\) which is assumed to be roughly proportional to \(y\): \(\pi_i \propto x_i\). This is known as probability proportional to size (pps) sampling.
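For illustration, here is a possible sketch of how such probability proportional to size inclusion probabilities can be computed (the capping rule for very large units is a common practical fix, stated here as an assumption rather than as part of the text):

```python
# Sketch: inclusion probabilities proportional to a size variable x,
# pi_i = n * x_i / sum(x), with units whose probability exceeds 1 taken
# with certainty and the remaining sample size reallocated.
import numpy as np

def pps_inclusion_probabilities(x, n):
    x = np.asarray(x, dtype=float)
    pi = n * x / x.sum()
    while np.any(pi > 1):                        # take large units with certainty
        certain = pi >= 1
        pi[certain] = 1.0
        remaining = n - certain.sum()
        pi[~certain] = remaining * x[~certain] / x[~certain].sum()
    return pi

print(pps_inclusion_probabilities([1, 2, 3, 50, 4], n=3))   # sums to n = 3
```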

In conclusion, the previous results show that unequal probability sampling can lead to estimators with higher precision than simple random sampling or other equal probability designs. Although this may seem counterintuitive to people unfamiliar with survey sampling, the result is important: it emphasises the value of using so-called ‘auxiliary’ information as a way of increasing sampling precision.

However, it is important to note that the “optimal” inclusion probabilities are variable-specific, in the sense that they are tied to a particular study variable. What is optimal for one variable may therefore be far from optimal for other variables. In multi-purpose surveys, this is a major problem that generally prevents the use of unequal probability sampling. Instead, survey statisticians are more likely to rely on stratification, which almost always improves precision, whatever the variable.

4.2 The Hansen-Hurwitz estimator

The Hansen-Hurwitz estimator was proposed for sampling with replacement. Let us consider an ordered sample of \(m\) independent draws \(\left(k_1, k_2, \cdots, k_m\right)\). At each draw, the probability of selecting individual \(k\) is \(p_k\), with \(\sum_{k \in U} p_k = 1\).

The Hansen-Hurwitz estimator is expressed as:

\[\begin{equation} \hat{Y}_{HH} = \frac{1}{m} \sum_{i=1}^m \frac{y_{k_i}}{p_{k_i}} \tag{4.7} \end{equation}\]
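A minimal sketch of (4.7) with made-up data (the draw probabilities \(p_k\) are taken proportional to a hypothetical size variable):

```python
# Sketch: Hansen-Hurwitz estimator (4.7) from m independent draws with
# replacement, unit k being selected with probability p_k at each draw.
import numpy as np

rng = np.random.default_rng(5)
N, m = 500, 40
x = rng.uniform(1, 10, size=N)                   # hypothetical size variable
y = 3 * x + rng.normal(0, 1, size=N)             # study variable
p = x / x.sum()                                  # draw probabilities p_k

draws = rng.choice(N, size=m, replace=True, p=p) # ordered sample (k_1, ..., k_m)
y_hh = np.mean(y[draws] / p[draws])              # estimator (4.7)
print(y_hh, y.sum())                             # estimate vs. true total
```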

Result 1: Assuming \(p_k>0~~\forall k \in U\), \(\hat{Y}_{HH}\) is an unbiased estimator of \(Y\)

Result 2: The variance of \(\hat{Y}_{HH}\) is given by:

\[\begin{equation} V\left(\hat{Y}_{HH} \right) = \frac{V_1}{m} \tag{4.8} \end{equation}\]

where:

\(\displaystyle{V_1 = \sum_{i \in U} p_i \left(\frac{y_i}{p_i} - Y\right)^2}\)

Result 3: The variance of \(\hat{Y}_{HH}\) is estimated by:

\[\begin{equation} \hat{V}\left(\hat{Y}_{HH} \right) = \displaystyle{\frac{1}{m\left(m-1\right)}\sum_{i =1}^m \left(\frac{y_{k_i}}{p_{k_i}} - \hat{Y}_{HH}\right)^2} \tag{4.9} \end{equation}\]

This with-replacement variance formula is very easy to program. For this reason, the estimator (4.9) is often used as an approximation to (4.4), even when the sample is actually drawn without replacement.
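To illustrate how little is needed, here is a possible sketch of (4.7) and (4.9) computed together (my own illustration): only the \(y\)-values of the draws and their draw probabilities are required.

```python
# Sketch: Hansen-Hurwitz point estimate (4.7) and variance estimate (4.9)
# from the sampled y-values and their draw probabilities.
import numpy as np

def hansen_hurwitz(y_draws, p_draws):
    z = np.asarray(y_draws) / np.asarray(p_draws)   # z_i = y_{k_i} / p_{k_i}
    m = z.size
    y_hh = z.mean()                                  # estimator (4.7)
    v_hh = np.sum((z - y_hh) ** 2) / (m * (m - 1))   # estimator (4.9)
    return y_hh, v_hh
```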

Furthermore, if we assume that the population size \(N\) is large enough and that the sampling fraction is negligible (\(n/N \rightarrow 0\)), then the inclusion probability of \(i\) is approximately \(\pi_i \approx m p_i\). Hence (4.9) can be rewritten as:

\[\begin{equation} \hat{V}\left(\hat{Y}_{HH} \right) = \displaystyle{\frac{m}{m-1}\sum_{i =1}^m \left(\frac{y_{k_i}}{\pi_{k_i}} - \frac{\hat{Y}_{HT}}{m}\right)^2} \tag{4.10} \end{equation}\]

Formula (4.10) is frequently used; its great advantage lies in its simplicity. It is implemented in most traditional statistical software tools such as SAS, Stata, SPSS and R.
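A sketch of (4.10) (illustrative only; it assumes the approximate inclusion probabilities \(\pi_{k_i} \approx m p_{k_i}\) of the drawn units are supplied directly):

```python
# Sketch: the with-replacement approximation (4.10), computed from the
# y-values of the m draws and their (approximate) inclusion probabilities.
import numpy as np

def approx_variance_410(y_draws, pi_draws):
    a = np.asarray(y_draws) / np.asarray(pi_draws)  # y_{k_i} / pi_{k_i}
    m = a.size
    ht = a.sum()                                    # Horvitz-Thompson estimate
    return m / (m - 1) * np.sum((a - ht / m) ** 2)
```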