Chapter 4 Unequal probability sampling

4.1 The Horvitz-Thompson estimator

Suppose we want to estimate the population total of a study variable y:

$$\theta = Y = \sum_{k \in U} y_k$$

Horvitz and Thompson (1952) proposed the following expression as an estimator of $\theta$, where $\pi_k$ designates the inclusion probability of $k$, that is, the probability for $k$ to be selected:

$$\hat{Y}_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in s} d_k y_k$$

where $d_k = 1/\pi_k$ is the design weight of $k$.

In the case of a simple random sample of size $n$, the inclusion probability of $i$ is $\pi_i = n/N$, and the Horvitz-Thompson estimator is the traditional estimator based on the sample mean:

$$\hat{Y}_{HT} = \sum_{k \in s} \frac{y_k}{n/N} = N\bar{y}$$
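As a quick numerical illustration, the sketch below (in Python, with a made-up population; the function name is illustrative) computes the Horvitz-Thompson estimate for an SRS sample and checks that it equals $N\bar{y}$:

```python
import random

def horvitz_thompson(y_sample, pi_sample):
    """Horvitz-Thompson estimate of a total: sum of y_k / pi_k over the sample."""
    return sum(y / pi for y, pi in zip(y_sample, pi_sample))

# Hypothetical population; simple random sample of size n, so pi_k = n / N.
population = [3.0, 7.0, 1.0, 9.0, 4.0, 6.0, 2.0, 8.0]
N, n = len(population), 4
random.seed(1)
sample = random.sample(population, n)

ht = horvitz_thompson(sample, [n / N] * n)
# Under SRS the HT estimator reduces to N times the sample mean.
assert abs(ht - N * sum(sample) / n) < 1e-9
```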

Lemma 1: $\sum_{i \in U} \pi_i = E(n_S)$

where $n_S$ is the (random) sample size. If the sample size is fixed, then we have $\sum_{i \in U} \pi_i = n$.

Let us introduce the following dummy variable: $\delta_i = 1$ if $i \in S$ and $\delta_i = 0$ otherwise. $\delta_i$ is a random variable whose expectation is given by: $E(\delta_i) = \Pr(i \in S) = \pi_i$

Hence, we have:

$$\sum_{i \in U} \pi_i = \sum_{i \in U} E(\delta_i) = E\left(\sum_{i \in U} \delta_i\right) = E(n_S)$$

Lemma 2: $\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = V(n_S) + E(n_S)\left[E(n_S) - 1\right]$

If the sample size is fixed, then we have $\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = n(n-1)$.

$$\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = \sum_{i \in U} \sum_{j \in U, j \neq i} E(\delta_i \delta_j) = \sum_{i \in U} E\left(\delta_i \sum_{j \in U, j \neq i} \delta_j\right) = \sum_{i \in U} E\left[\delta_i (n_S - \delta_i)\right]$$

$$= \sum_{i \in U} E(n_S \delta_i) - \sum_{i \in U} E(\delta_i^2) = E\left(n_S \sum_{i \in U} \delta_i\right) - \sum_{i \in U} E(\delta_i) = E(n_S^2) - E(n_S)$$

Then we have:

$$\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = E(n_S^2) - E(n_S) = E(n_S^2) - E^2(n_S) + E^2(n_S) - E(n_S)$$

$$= V(n_S) + E(n_S)\left[E(n_S) - 1\right]$$

Lemma 3: $\sum_{j \in U} \pi_{ij} = E(n_S \delta_i)$

If the sample size is fixed, then we have $\sum_{j \in U} \pi_{ij} = n E(\delta_i) = n \pi_i$.

The proof is similar to that for Lemma 2.

Result 1: Assuming $\pi_k > 0 \;\; \forall k \in U$, $\hat{Y}_{HT}$ is an unbiased estimator of $Y$.

The proof is easy once the Horvitz-Thompson estimator is rewritten as a sum over all population elements using the $\delta_k$ dummies:

$$\hat{Y}_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in U} \frac{y_k}{\pi_k} \delta_k$$

Then the expectation is given by:

$$E(\hat{Y}_{HT}) = \sum_{k \in U} \frac{y_k}{\pi_k} E(\delta_k) = \sum_{k \in U} \frac{y_k}{\pi_k} \pi_k = Y$$
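Result 1 can also be verified by simulation. The sketch below (Python, made-up data) uses Poisson sampling, where each unit enters the sample independently with its own probability $\pi_k$; this design is not discussed above, but the HT estimator is equally unbiased under it, and it is trivial to simulate. The Monte Carlo average of $\hat{Y}_{HT}$ should stay close to $Y$:

```python
import random

random.seed(42)
y = [2.0, 5.0, 1.0, 8.0, 4.0]
pi = [0.2, 0.5, 0.3, 0.8, 0.4]   # made-up unequal inclusion probabilities
Y = sum(y)                        # true population total

reps, total = 20_000, 0.0
for _ in range(reps):
    # Poisson sampling: each unit enters the sample independently with prob pi_k
    ht = sum(yk / pk for yk, pk in zip(y, pi) if random.random() < pk)
    total += ht

assert abs(total / reps - Y) / Y < 0.02   # Monte Carlo mean close to Y
```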

Result 2: The variance of $\hat{Y}_{HT}$ is given by:

$$V(\hat{Y}_{HT}) = \sum_{i \in U} \sum_{j \in U} \Delta_{ij} \frac{y_i}{\pi_i} \frac{y_j}{\pi_j}$$

where:

  • $\pi_{ij}$ is the double inclusion probability of $i$ and $j$: $\pi_{ij} = \Pr(i, j \in S)$, with the convention $\pi_{ii} = \pi_i$
  • $\Delta_{ij} = \pi_{ij} - \pi_i \pi_j$

The proof is straightforward:

$$V(\hat{Y}_{HT}) = \sum_{i \in U} \frac{y_i^2}{\pi_i^2} V(\delta_i) + \sum_{i \in U} \sum_{j \in U, j \neq i} \frac{y_i}{\pi_i} \frac{y_j}{\pi_j} \mathrm{Cov}(\delta_i, \delta_j)$$

The final result is obtained as $\Delta_{ij} = \mathrm{Cov}(\delta_i, \delta_j)$, with $\Delta_{ii} = V(\delta_i)$.

Let’s see what we have in the case of simple random sampling. When $i \neq j$, we have:

$$\Delta_{ij} = \pi_{ij} - \pi_i \pi_j = \frac{n(n-1)}{N(N-1)} - \frac{n}{N} \frac{n}{N} = \frac{n}{N} \cdot \frac{N(n-1) - n(N-1)}{N(N-1)} = -\frac{n}{N(N-1)} \cdot \frac{N-n}{N}$$

Hence, we get:

$$\frac{\Delta_{ij}}{\pi_i \pi_j} = -\frac{1}{n} \cdot \frac{N}{N-1} \cdot \frac{N-n}{N} = A$$

When $i = j$, we get:

$$\frac{\Delta_{ii}}{\pi_i \pi_i} = \frac{\pi_i (1 - \pi_i)}{\pi_i^2} = \frac{1 - \pi_i}{\pi_i} = \frac{N}{n} - 1 = B = -(N-1)A$$

Hence the variance of the estimator for the total is:

$$V(N\bar{y}) = \sum_{i \in U} \sum_{j \in U, j \neq i} A\, y_i y_j + \sum_{i \in U} B\, y_i^2 = \sum_{i \in U} \sum_{j \in U} A\, y_i y_j - \sum_{i \in U} A\, y_i^2 - \sum_{i \in U} (N-1) A\, y_i^2$$

$$= A \left[\left(\sum_{i \in U} y_i\right)^2 - \sum_{i \in U} y_i^2 - (N-1) \sum_{i \in U} y_i^2\right] = A \left[\left(\sum_{i \in U} y_i\right)^2 - N \sum_{i \in U} y_i^2\right]$$

$$= A \left[-N^2 S_y^2 \frac{N-1}{N}\right] = N^2 \left(1 - \frac{n}{N}\right) \frac{S_y^2}{n}$$

This is the well-known result that was established in the chapter on simple random sampling.
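The closed form above can be checked by brute force on a toy population: enumerating every possible SRS sample gives the exact design variance of $N\bar{y}$, which should match $N^2(1-n/N)S_y^2/n$. A Python sketch with made-up data:

```python
from itertools import combinations
from statistics import variance, mean

y = [3.0, 7.0, 1.0, 9.0, 4.0]   # tiny population so all samples can be listed
N, n = len(y), 2

# Exact variance of N * ybar over all C(N, n) equally likely samples
estimates = [N * mean(s) for s in combinations(y, n)]
exact_var = mean(e ** 2 for e in estimates) - mean(estimates) ** 2

S2 = variance(y)                 # population variance with divisor N - 1
formula = N ** 2 * (1 - n / N) * S2 / n
assert abs(exact_var - formula) < 1e-9
```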

Result 3: Assuming $\pi_{ij} > 0 \;\; \forall i, j \in U$, the variance of $\hat{Y}_{HT}$ is estimated by:

$$\hat{V}(\hat{Y}_{HT}) = \sum_{i \in s} \sum_{j \in s} \frac{\Delta_{ij}}{\pi_{ij}} \frac{y_i}{\pi_i} \frac{y_j}{\pi_j} \tag{4.4}$$
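Result 3 translates directly into code. A minimal Python sketch (function and variable names are illustrative), checked against the SRS closed form $N^2(1-n/N)s^2/n$, to which this estimator reduces under simple random sampling:

```python
from statistics import variance

def ht_var_estimate(y, pi, pi2):
    """Horvitz-Thompson variance estimator (Result 3): sum over sampled
    pairs of (pi_ij - pi_i * pi_j) / pi_ij * (y_i / pi_i) * (y_j / pi_j),
    with the convention pi_ii = pi_i."""
    v = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            delta = pi2[i][j] - pi[i] * pi[j]
            v += delta / pi2[i][j] * (y[i] / pi[i]) * (y[j] / pi[j])
    return v

# Sanity check on a hypothetical SRS sample of size n from a population of N
N, n = 6, 3
ys = [3.0, 7.0, 9.0]
pi = [n / N] * n
pij = [[n / N if i == j else n * (n - 1) / (N * (N - 1))
        for j in range(n)] for i in range(n)]
assert abs(ht_var_estimate(ys, pi, pij)
           - N ** 2 * (1 - n / N) * variance(ys) / n) < 1e-6
```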

Result 4 (Sen-Yates-Grundy): Assuming the sampling design is of fixed size, the variance of $\hat{Y}_{HT}$ can alternatively be written as:

$$V_{SYG}(\hat{Y}) = -\frac{1}{2} \sum_{i \in U} \sum_{j \in U} \Delta_{ij} \left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2 \tag{4.5}$$

and estimated by:

$$\hat{V}_{SYG}(\hat{Y}) = -\frac{1}{2} \sum_{i \in s} \sum_{j \in s} \frac{\Delta_{ij}}{\pi_{ij}} \left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2$$

The proof stems from the expansion of the square term in (4.5):

$$-\frac{1}{2} \sum_{i \in U} \sum_{j \in U} \Delta_{ij} \left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2 = -\sum_{i \in U} \sum_{j \in U} \Delta_{ij} \frac{y_i^2}{\pi_i^2} + \sum_{i \in U} \sum_{j \in U} \Delta_{ij} \frac{y_i}{\pi_i} \frac{y_j}{\pi_j}$$

Assuming a fixed sample size and using Lemmas 1 and 3, we have for all $i \in U$:

$$\sum_{j \in U} \Delta_{ij} = \sum_{j \in U} \pi_{ij} - \sum_{j \in U} \pi_i \pi_j = n\pi_i - \pi_i \sum_{j \in U} \pi_j = n\pi_i - n\pi_i = 0$$

The first term of the expansion then vanishes, and what remains is the variance expression of Result 2. Thus the result is proved.
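The Sen-Yates-Grundy estimator can be sketched the same way; under simple random sampling (a fixed-size design) it too reduces to $N^2(1-n/N)s^2/n$, which gives a handy check. Names and data below are illustrative:

```python
from statistics import variance

def syg_var_estimate(y, pi, pi2):
    """Sen-Yates-Grundy variance estimator for a fixed-size design:
    -(1/2) * sum over sampled pairs of
    (pi_ij - pi_i * pi_j) / pi_ij * (y_i / pi_i - y_j / pi_j) ** 2."""
    v = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            delta = pi2[i][j] - pi[i] * pi[j]
            v -= 0.5 * delta / pi2[i][j] * (y[i] / pi[i] - y[j] / pi[j]) ** 2
    return v

# Hypothetical SRS sample: same setup as for the Result 3 check
N, n = 6, 3
ys = [3.0, 7.0, 9.0]
pi = [n / N] * n
pij = [[n / N if i == j else n * (n - 1) / (N * (N - 1))
        for j in range(n)] for i in range(n)]
assert abs(syg_var_estimate(ys, pi, pij)
           - N ** 2 * (1 - n / N) * variance(ys) / n) < 1e-6
```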

Consequently, when the inclusion probabilities $\pi_i$ are chosen proportional to the study variable $y_i$, the variance (4.5) is equal to 0. In practice, as the study variable $y$ is unknown, the inclusion probabilities should be taken proportional to an auxiliary variable $x$ which is assumed to have a linear relationship with $y$: $\pi_k \propto x_k$ (probability proportional to size sampling).
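As an illustration, inclusion probabilities proportional to a size variable $x$ for a target sample size $n$ might be computed as follows. The iterative fix-up for units whose scaled probability would exceed 1 (take them with certainty and rescale the rest) is a standard device; the function name and data are hypothetical:

```python
def pps_inclusion_probs(x, n):
    """Inclusion probabilities proportional to size x, summing to n.
    Units whose scaled probability would exceed 1 are included with
    certainty and the remainder is rescaled."""
    pi = [0.0] * len(x)
    certain = set()
    while True:
        rest = [i for i in range(len(x)) if i not in certain]
        scale = (n - len(certain)) / sum(x[i] for i in rest)
        over = [i for i in rest if x[i] * scale >= 1.0]
        if not over:
            for i in rest:
                pi[i] = x[i] * scale
            for i in certain:
                pi[i] = 1.0
            return pi
        certain.update(over)

x = [1.0, 2.0, 3.0, 10.0]        # made-up size variable
pi = pps_inclusion_probs(x, 2)
assert abs(sum(pi) - 2) < 1e-9   # probabilities sum to the target size
assert pi[3] == 1.0              # the very large unit is taken with certainty
```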

In conclusion, all the previous results show that unequal probability sampling can lead to estimators with higher precision than when simple random sampling or other equal probability designs are used. Although it may seem counterintuitive to people unfamiliar with survey sampling, this result is important and emphasises the importance of using so-called ‘auxiliary’ information as a way of increasing sampling precision.

However, it is important to note that the “optimal” inclusion probabilities are variable-specific, in the sense that they are related to the study variable. Therefore, what is optimal for one variable may be far from optimal for other variables. In the case of multi-purpose surveys, this is a major problem that generally prevents the use of unequal probability sampling. Instead, survey statisticians are more likely to use stratification, as it generally improves accuracy, whatever the variable.

4.2 The Hansen-Hurwitz estimator

The Hansen-Hurwitz estimator has been proposed for sampling with replacement. Consider an ordered sample of $m$ independent draws $(k_1, k_2, \ldots, k_m)$. At each draw, the probability of selecting individual $k$ is $p_k$, with $\sum_{k \in U} p_k = 1$.

The Hansen-Hurwitz estimator is expressed as:

$$\hat{Y}_{HH} = \frac{1}{m} \sum_{i=1}^{m} \frac{y_{k_i}}{p_{k_i}}$$

Result 1: Assuming $p_k > 0 \;\; \forall k \in U$, $\hat{Y}_{HH}$ is an unbiased estimator of $Y$.

Result 2: The variance of $\hat{Y}_{HH}$ is given by:

$$V(\hat{Y}_{HH}) = \frac{V_1}{m}$$

where:

$$V_1 = \sum_{i \in U} p_i \left(\frac{y_i}{p_i} - Y\right)^2$$

Result 3: The variance of $\hat{Y}_{HH}$ is estimated by:

$$\hat{V}(\hat{Y}_{HH}) = \frac{1}{m(m-1)} \sum_{i=1}^{m} \left(\frac{y_{k_i}}{p_{k_i}} - \hat{Y}_{HH}\right)^2 \tag{4.9}$$

This with-replacement formula is easy to program. Therefore, the estimator (4.9) can be used as an approximation to (4.4).
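Indeed, a minimal Python sketch of the Hansen-Hurwitz estimator and of (4.9) (names illustrative, draws made up):

```python
def hansen_hurwitz(draws):
    """Hansen-Hurwitz point estimate from m with-replacement draws,
    each draw being a (y_k, p_k) pair: mean of y_k / p_k."""
    return sum(yk / pk for yk, pk in draws) / len(draws)

def hh_var_estimate(draws):
    """Variance estimator (4.9): squared deviations of y_k / p_k around
    the point estimate, divided by m * (m - 1)."""
    m = len(draws)
    z = [yk / pk for yk, pk in draws]
    zbar = sum(z) / m
    return sum((zi - zbar) ** 2 for zi in z) / (m * (m - 1))

# Four hypothetical draws; one unit was drawn twice, which with-replacement allows
draws = [(2.0, 0.2), (5.0, 0.3), (8.0, 0.5), (5.0, 0.3)]
assert abs(hansen_hurwitz(draws) - (10 + 5 / 0.3 * 2 + 16) / 4) < 1e-9
assert hh_var_estimate(draws) > 0
```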

Furthermore, when we assume the size $N$ of the population is large enough and $n/N \to 0$, the inclusion probability of $i$ is $\pi_i \approx m p_i$. Therefore (4.9) can be rewritten as:

$$\hat{V}(\hat{Y}_{HH}) = \frac{m}{m-1} \sum_{i=1}^{m} \left(\frac{y_{k_i}}{\pi_{k_i}} - \frac{\hat{Y}_{HT}}{m}\right)^2 \tag{4.10}$$

Formula (4.10) is often used, and its great advantage lies in its simplicity. It is implemented in most of the traditional statistical software tools such as SAS, Stata, SPSS or R.
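For illustration, (4.10) can be sketched as follows. When $\pi_i = m p_i$ holds exactly and $\hat{Y}_{HT}$ is computed from the same draws, (4.10) coincides with (4.9), which the final assertion checks; all names and data are made up:

```python
def var_with_replacement_approx(y_sample, pi_sample, ht_total):
    """Formula (4.10): m / (m - 1) * sum of (y_i / pi_i - Y_HT / m) ** 2,
    relying on the with-replacement approximation pi_i ~ m * p_i."""
    m = len(y_sample)
    return m / (m - 1) * sum((yi / pii - ht_total / m) ** 2
                             for yi, pii in zip(y_sample, pi_sample))

y = [4.0, 9.0, 6.0]
p = [0.2, 0.3, 0.5]               # made-up draw probabilities
m = len(y)
pi = [m * pk for pk in p]         # the approximation taken as exact here
ht = sum(yi / pii for yi, pii in zip(y, pi))

# Estimator (4.9) computed directly on the same draws
z = [yi / pk for yi, pk in zip(y, p)]
zbar = sum(z) / m
hh_var = sum((zi - zbar) ** 2 for zi in z) / (m * (m - 1))

assert abs(var_with_replacement_approx(y, pi, ht) - hh_var) < 1e-9
```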