Chapter 4 Unequal probability sampling

4.1 The Horvitz-Thompson estimator

Suppose we seek to estimate the population total of a study variable \(y\):

\[\begin{equation} \theta = Y = \sum_{k \in U} y_k \tag{4.1} \end{equation}\]

Horvitz and Thompson (1952) proposed the following expression as an estimator of \(\theta\), where \(\pi_k\) designates the inclusion probability of \(k\) that is, the probability for \(k\) to be selected:

\[\begin{equation} \hat{Y}_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in s} d_k y_k \tag{4.2} \end{equation}\]

\(d_k = 1 / \pi_k\) is the design weight of \(k\).

In case of simple random sampling of size \(n\), the inclusion probability of \(i\) is \(\pi_i = n/N\) and then the Horvitz-Thompson estimator is the traditional sample mean estimator: \[\begin{equation} \hat{Y}_{HT} = \sum_{k \in s} \frac{y_k}{n/N} = N\bar{y} \end{equation}\]

Lemma 1: \(\sum_{i \in U} \pi_i = E\left(n_S\right)\)

where \(n_S\) is the (random) sample size. If the sample size is fixed, then we have \(\sum_{i \in U} \pi_i = n\)

Let introduce the following dummy variable \(\delta_i = 1\) if \(i \in S\) and \(\delta_i = 0\) otherwise. \(\delta_i\) is a random variable whose expectation is given by: \[E\left(\delta_i\right) = Pr\left(i \in S\right) = \pi_i\]

Hence, we have: \(\sum_{i \in U} \pi_i = \sum_{i \in U} E\left(\delta_i\right) = E\left(\sum_{i \in U} \delta_i\right) = E\left(n_S\right)\)

Lemma 2: \(\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = V\left(n_S\right) + E\left(n_S\right)\left[E\left(n_S\right)-1\right]\)

If the sample size is fixed, then we have \(\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = n\left(n-1\right)\)

\(\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = \sum_{i \in U} \sum_{j \in U, j \neq i} E\left(\delta_i \delta_j\right)= \sum_{i \in U} E\left(\delta_i \sum_{j \in U, j \neq i}\delta_j\right)= \sum_{i \in U} E\left[\delta_i \left(n_S -\delta_i\right)\right]\)

\(= \sum_{i \in U} E\left(n_S\delta_i\right) - \sum_{i \in U} E\left( \delta^2_i\right)= E\left(n_S \sum_{i \in U} \delta_i\right) - \sum_{i \in U} E\left( \delta_i\right)= E\left(n^2_S\right) - E\left(n_S\right)\)

Then we have: \(\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = E\left(n^2_S\right) - E\left(n_S\right) = E\left(n^2_S\right) - E^2\left(n_S\right) + E^2\left(n_S\right) - E\left(n_S\right)\)

\(= V\left(n_S\right) + E\left(n_S\right)\left[E\left(n_S\right)-1\right]\)

Lemma 3: \(\sum_{j \in U} \pi_{ij} = E\left(n_S\delta_i\right)\)

If the sample size is fixed, then we have \(\sum_{j \in U} \pi_{ij} = n E\left(\delta_i\right) = n \pi_i\)

The proof is similar to that for Lemma 2.

Result 1: Assuming \(\pi_k>0~~\forall k \in U\), \(\hat{Y}_{HT}\) is an unbiased estimator of \(Y\)

The proof is easy after the HT is rewritten as a sum over all population elements using the \(\delta_i\) dummies:

\[\hat{Y}_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in U} \frac{y_k}{\pi_k} \delta_k\]

Then the expectation is given by:

\[E\left(\hat{Y}_{HT}\right) = \sum_{k \in U} \frac{y_k}{\pi_k} E\left(\delta_k\right) = \sum_{k \in U} \frac{y_k}{\pi_k} \pi_k = Y\]

Result 2: The variance of \(\hat{Y}_{HT}\) is given by:

\[\begin{equation} V\left(\hat{Y}_{HT} \right) = \sum_{i \in U}\sum_{j \in U} \Delta_{ij}\frac{y_i}{\pi_i}\frac{y_j}{\pi_j} \tag{4.3} \end{equation}\]

where:

  • \(\pi_{ij}\) is the double inclusion probability of \(i\) and \(j\): \(\pi_{ij} = Pr\left(i,j \in S\right)\)
  • \(\Delta_{ij} = \pi_{ij}-\pi_i \pi_j\)

The proof is straightforward: \[V\left(\hat{Y}_{HT}\right) = \sum_{i \in U} \frac{y^2_i}{\pi^2_i} V\left(\delta_i\right) + \sum_{i \in U} \sum_{j \in U, j \neq i} \frac{y_i}{\pi_i} \frac{y_j}{\pi_j} Cov\left(\delta_i , \delta_j \right) \]

The final result is obtained as \(\Delta_{ij} = Cov\left(\delta_i , \delta_j \right)\)

Let’s see what we have in case of simple random sampling. When \(i \neq j\), we have: \[\begin{equation} \begin{array}{rcl} \Delta_{ij} & = & \pi_{ij} - \pi_i \pi_j \\ & = & \displaystyle{\frac{n\left(n-1\right)}{N\left(N-1\right)}-\frac{n}{N}\frac{n}{N}} \\ & = & \displaystyle{\frac{n}{N}\frac{N\left(n-1\right)-n\left(N-1\right)}{N\left(N-1\right)}} \\ & = & \displaystyle{-\frac{n}{N\left(N-1\right)}\frac{N-n}{N}} \end{array} \end{equation}\]

Hence, we got: \[\begin{equation} \begin{array}{rcl} \displaystyle{\frac{\Delta_{ij}}{\pi_i \pi_j}} & = & \displaystyle{-\frac{1}{n}\frac{N}{N-1}\frac{N-n}{N}} = A \end{array} \end{equation}\]

When \(i = j\), we got: \[\begin{equation} \displaystyle{\frac{\Delta_{ii}}{\pi_i \pi_i}} = \displaystyle{\frac{\pi_i\left(1-\pi_i\right)}{\pi^2_i}} = \displaystyle{\frac{1-\pi_i}{\pi_i}} = \frac{N}{n} - 1 = B = - \left(N - 1\right)A \end{equation}\]

Hence the variance of the estimator for the total is: \[\begin{equation} \begin{array}{rcl} V\left(N\bar{y}\right) & = & \sum_{i \in U} \sum_{j \in U, j \neq i} A y_i y_j + \sum_{i \in U} B y^2_i \\ & = & \sum_{i \in U} \sum_{j \in U} A y_i y_j - \sum_{i \in U} A y^2_i - \sum_{i \in U} \left(N-1\right) A y^2_i \\ & = & A \left[\left(\sum_{i \in U} y_i \right)^2 - \sum_{i \in U} y^2_i - (N-1)\sum_{i \in U} y^2_i\right] \\ & = & A \left[\left(\sum_{i \in U} y_i \right)^2 - N\sum_{i \in U} y^2_i\right] \\ & = & A \left[-N^2 S_y^2 \displaystyle{\frac{N-1}{N}} \right] \\ & = & N^2 \left(1-\displaystyle{\frac{n}{N}}\right) \displaystyle{\frac{S_y^2}{n}} \end{array} \end{equation}\]

This is the weel-known result that was established in the chapter on simple random sampling.

Result 3: Assuming \(\pi_{ij}>0~~\forall i,j \in U\), the variance of \(\hat{Y}_{HT}\) is estimated by:

\[\begin{equation} \hat{V}\left(\hat{Y}_{HT} \right) = \sum_{i \in s}\sum_{j \in s} \frac{\Delta_{ij}}{\pi_{ij}}\frac{y_i}{\pi_i}\frac{y_j}{\pi_j} \tag{4.4} \end{equation}\]

Result 4 (Sen-Yates-Grundy): Assuming the sampling design is of fixed size, the variance of \(\hat{Y}_{HT}\) can be alternatively written as:

\[\begin{equation} {V}_{SYG}\left(\hat{Y}\right) = -\frac{1}{2}\sum_{i \in U}\sum_{j \in U} \Delta_{ij}\left(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\right)^2 \tag{4.5} \end{equation}\]

\[\begin{equation} \hat{V}_{SYG}\left(\hat{Y}\right) = -\frac{1}{2}\sum_{i \in s}\sum_{j \in s} \frac{\Delta_{ij}}{\pi_{ij}}\left(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\right)^2 \tag{4.6} \end{equation}\]

The proof stems from the expansion of the square term in (4.5) : \[-\frac{1}{2}\sum_{i \in U}\sum_{j \in U} \Delta_{ij}\left(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\right)^2 = -\sum_{i \in U}\sum_{j \in U} \Delta_{ij} \frac{y^2_i}{\pi^2_i} + \sum_{i \in U}\sum_{j \in U} \Delta_{ij}\frac{y_i}{\pi_i}\frac{y_j}{\pi_j}\]

Assuming fixed sample size and using Lemma 1 and 3, we have for all \(i \in U\): \(\sum_{j \in U} \Delta_{ij} = \sum_{j \in U} \pi_{ij} - \sum_{j \in U} \pi_i \pi_j = n \pi_i - \pi_i \sum_{j \in U} \pi_j = 0\)

Thus the result is proved.

Consequently, when the inclusion probabilities \(\pi_i\) are chosen proportional to the study variable \(y_i\) then the variance (4.5) is equal to 0. In practice, as the study variable \(y\) is unknown, the inclusion probabilities should be taken proportional to an auxiliary variable \(x\) assumed to have a linear relationship with \(y\): \(\pi \propto x\) (probability proportional to size sampling).

As a conclusion, all the previous results show that unequal probability sampling can result in estimators having higher precision than when simple random sampling or other equal probability designs are used. Though it may seem counter-intuitive to people who are not familiar with survey sampling, this result is important and emphasizes the importance of utilizing so-called “auxiliary” information as a way to boost sampling precision.

However, it must be noted that optimality in terms of inclusion probabilities is variable-specific, in the sense that “optimal” inclusion probabilities are related to the study variable. Therefore what is optimal with respect to one variable may be far from optimal with other variables. In case of multi-purpose surveys, this is a major problem which generally prevents from using unequal probability sampling. Alternatively, survey statisticians rather use stratification as we know it always make accuracy better no matter the variable.

4.2 The Hansen-Hurwitz estimator

The Hansen-Hurwitz estimator has been proposed in case of sampling with replacement. Let consider an ordered sample of \(m\) independent draws \(\left(k_1,k_2 \cdots k_m\right)\). At each draw, the probability of selecting an individual \(k\) is \(p_k\).

The Hansen-Hurwitz estimator is expressed as:

\[\begin{equation} \hat{Y}_{HH} = \frac{1}{m} \sum_{i=1}^m \frac{y_{k_i}}{p_{k_i}} \tag{4.7} \end{equation}\]

Result 1: Assuming \(p_k>0~~\forall k \in U\), \(\hat{Y}_{HH}\) is an unbiased estimator of \(Y\)

Result 2: The variance of \(\hat{Y}_{HH}\) is given by:

\[\begin{equation} V\left(\hat{Y}_{HH} \right) = \frac{V_1}{m} \tag{4.8} \end{equation}\]

where:

\(\displaystyle{V_1 = \sum_{i \in U} p_i \left(\frac{y_i}{p_i} - Y\right)^2}\)

Result 3: The variance of \(\hat{Y}_{HH}\) is estimated by:

\[\begin{equation} \hat{V}\left(\hat{Y}_{HH} \right) = \displaystyle{\frac{1}{m\left(m-1\right)}\sum_{i =1}^m \left(\frac{y_i}{p_i} - \hat{Y}_{HH}\right)^2} \tag{4.9} \end{equation}\]

The above formula under with replacement sampling is pretty easy to program. Therefore, the estimator (4.9) can be used as an approximation to (4.4).

Furthermore, when we assume the size \(N\) of the population is large enough and \(n/N \rightarrow 0\), then the inclusion probability of \(i\) is \(\pi_i \approx m p_i\). Therefore (4.9) can be rewritten as:

\[\begin{equation} \hat{V}\left(\hat{Y}_{HH} \right) = \displaystyle{\frac{m}{m-1}\sum_{i =1}^m \left(\frac{y_i}{\pi_i} - \frac{\hat{Y}_{HT}}{m}\right)^2} \tag{4.10} \end{equation}\]

(4.10) is an often used formulae, whose great advantage lies in its simplicity. It is implemented in most of the traditionnal statistical software tools such as SAS, Stata, SPSS or R.