Chapter 4 Unequal probability sampling

4.1 The Horvitz-Thompson estimator

Suppose we seek to estimate the population total of a study variable $y$:

$$\theta = Y = \sum_{k \in U} y_k$$

Horvitz and Thompson (1952) proposed the following expression as an estimator of $\theta$, where $\pi_k$ designates the inclusion probability of $k$, that is, the probability that $k$ is selected:

$$\hat{Y}_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in s} d_k\, y_k$$

$d_k = 1/\pi_k$ is the design weight of $k$.

In the case of simple random sampling of size $n$, the inclusion probability of $i$ is $\pi_i = n/N$, and the Horvitz-Thompson estimator reduces to the traditional estimator based on the sample mean: $$\hat{Y}_{HT} = \sum_{k \in s} \frac{y_k}{n/N} = N\bar{y}$$
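To make the definition concrete, here is a minimal Python sketch (the population values and all names are hypothetical) showing that under simple random sampling the Horvitz-Thompson estimator coincides with $N\bar{y}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of N values; n units drawn by SRS.
N, n = 1000, 100
y = rng.gamma(shape=2.0, scale=50.0, size=N)

def ht_estimate(y_sample, pi_sample):
    """Horvitz-Thompson estimate of the total: sum of y_k / pi_k."""
    return np.sum(y_sample / pi_sample)

s = rng.choice(N, size=n, replace=False)   # simple random sample
pi = np.full(n, n / N)                     # pi_k = n/N for every unit
print(ht_estimate(y[s], pi))               # HT estimate...
print(N * y[s].mean())                     # ...equals N * ybar
```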

Lemma 1: $$\sum_{i \in U} \pi_i = E(n_S)$$

where $n_S$ is the (random) sample size. If the sample size is fixed, then we have $\sum_{i \in U} \pi_i = n$

Let us introduce the dummy variable $\delta_i = 1$ if $i \in S$ and $\delta_i = 0$ otherwise. $\delta_i$ is a random variable whose expectation is given by: $E(\delta_i) = \Pr(i \in S) = \pi_i$

Hence, we have: $$\sum_{i \in U} \pi_i = \sum_{i \in U} E(\delta_i) = E\left(\sum_{i \in U} \delta_i\right) = E(n_S)$$
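A quick simulation illustrates the lemma for a design with random size. The sketch below uses Poisson sampling (each unit selected independently with probability $\pi_i$); the probabilities are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inclusion probabilities; Poisson sampling makes n_S random.
pi = rng.uniform(0.05, 0.5, size=200)

# Monte Carlo approximation of E(n_S): average the realized sample sizes.
sizes = [(rng.uniform(size=pi.size) < pi).sum() for _ in range(100_000)]
print(np.mean(sizes))  # close to...
print(pi.sum())        # ...the sum of the pi_i, as Lemma 1 states
```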

Lemma 2: $$\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = V(n_S) + E(n_S)\left[E(n_S) - 1\right]$$

If the sample size is fixed, then we have $\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = n(n-1)$

$$\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = \sum_{i \in U} \sum_{j \in U, j \neq i} E(\delta_i \delta_j) = \sum_{i \in U} E\left(\delta_i \sum_{j \in U, j \neq i} \delta_j\right) = \sum_{i \in U} E\left[\delta_i (n_S - \delta_i)\right]$$

$$= \sum_{i \in U} E(n_S \delta_i) - \sum_{i \in U} E(\delta_i^2) = E\left(n_S \sum_{i \in U} \delta_i\right) - \sum_{i \in U} E(\delta_i) = E(n_S^2) - E(n_S)$$

Then we have: $$\sum_{i \in U} \sum_{j \in U, j \neq i} \pi_{ij} = E(n_S^2) - E(n_S) = E(n_S^2) - E^2(n_S) + E^2(n_S) - E(n_S)$$

$$= V(n_S) + E(n_S)\left[E(n_S) - 1\right]$$

Lemma 3: $$\sum_{j \in U} \pi_{ij} = E(n_S \delta_i)$$

with the convention $\pi_{ii} = \pi_i$. If the sample size is fixed, then we have $\sum_{j \in U} \pi_{ij} = n E(\delta_i) = n \pi_i$

The proof is similar to that for Lemma 2.
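For simple random sampling the lemmas can be checked exactly, since $\pi_i = n/N$ and $\pi_{ij} = n(n-1)/[N(N-1)]$ for $i \neq j$. A short numerical check (the parameters are arbitrary):

```python
# Exact check of Lemmas 2 and 3 under simple random sampling,
# using the convention pi_ii = pi_i.
N, n = 20, 5
pi_i = n / N
pi_ij = n * (n - 1) / (N * (N - 1))   # double inclusion prob., i != j

# Lemma 2: the double sum over i != j equals n(n-1).
print(N * (N - 1) * pi_ij, n * (n - 1))

# Lemma 3: the sum over all j (including j = i) equals n * pi_i.
print(pi_i + (N - 1) * pi_ij, n * pi_i)
```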

Result 1: Assuming $\pi_k > 0 \ \forall k \in U$, $\hat{Y}_{HT}$ is an unbiased estimator of $Y$

The proof is easy once the HT estimator is rewritten as a sum over all population elements using the dummies $\delta_k$:

$$\hat{Y}_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in U} \frac{y_k}{\pi_k} \delta_k$$

Then the expectation is given by:

$$E(\hat{Y}_{HT}) = \sum_{k \in U} \frac{y_k}{\pi_k} E(\delta_k) = \sum_{k \in U} \frac{y_k}{\pi_k} \pi_k = Y$$
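Unbiasedness is easy to verify by simulation. The sketch below uses Poisson sampling with hypothetical unequal probabilities; the Monte Carlo mean of the HT estimates approaches the true total:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population and unequal inclusion probabilities.
N = 500
y = rng.gamma(2.0, 50.0, size=N)
pi = rng.uniform(0.05, 0.6, size=N)

Y = y.sum()
estimates = []
for _ in range(50_000):
    delta = rng.uniform(size=N) < pi            # selection indicators
    estimates.append(np.sum(y[delta] / pi[delta]))

print(Y, np.mean(estimates))  # the two values are close
```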

Result 2: The variance of $\hat{Y}_{HT}$ is given by:

$$V(\hat{Y}_{HT}) = \sum_{i \in U} \sum_{j \in U} \Delta_{ij} \frac{y_i}{\pi_i} \frac{y_j}{\pi_j}$$

where:

  • $\pi_{ij}$ is the double inclusion probability of $i$ and $j$: $\pi_{ij} = \Pr(i, j \in S)$
  • $\Delta_{ij} = \pi_{ij} - \pi_i \pi_j$

The proof is straightforward: $$V(\hat{Y}_{HT}) = \sum_{i \in U} \frac{y_i^2}{\pi_i^2} V(\delta_i) + \sum_{i \in U} \sum_{j \in U, j \neq i} \frac{y_i}{\pi_i} \frac{y_j}{\pi_j} \mathrm{Cov}(\delta_i, \delta_j)$$

The final result is obtained by noting that $\Delta_{ij} = \mathrm{Cov}(\delta_i, \delta_j)$; in particular, $\Delta_{ii} = \pi_i(1 - \pi_i) = V(\delta_i)$.

Let’s see what we have in case of simple random sampling. When ij, we have: Δij=πijπiπj=n(n1)N(N1)nNnN=nNN(n1)n(N1)N(N1)=nN(N1)NnN

Hence, we got: Δijπiπj=1nNN1NnN=A

When i=j, we got: Δiiπiπi=πi(1πi)π2i=1πiπi=Nn1=B=(N1)A

Hence the variance of the estimator for the total is: $$V(N\bar{y}) = \sum_{i \in U} \sum_{j \in U, j \neq i} A\, y_i y_j + \sum_{i \in U} B\, y_i^2 = \sum_{i \in U} \sum_{j \in U} A\, y_i y_j - \sum_{i \in U} A\, y_i^2 - \sum_{i \in U} (N-1) A\, y_i^2$$

$$= A\left[\left(\sum_{i \in U} y_i\right)^2 - \sum_{i \in U} y_i^2 - (N-1)\sum_{i \in U} y_i^2\right] = A\left[\left(\sum_{i \in U} y_i\right)^2 - N \sum_{i \in U} y_i^2\right]$$

$$= A\left[-N^2 S_y^2 \frac{N-1}{N}\right] = N^2\left(1 - \frac{n}{N}\right)\frac{S_y^2}{n}$$

This is the well-known result that was established in the chapter on simple random sampling.
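This closed form can be checked numerically against the double-sum formula of Result 2. A sketch for a small hypothetical population (kept small so the $N \times N$ double sum stays cheap):

```python
import numpy as np

rng = np.random.default_rng(3)

N, n = 30, 8
y = rng.gamma(2.0, 50.0, size=N)   # hypothetical population values

pi = np.full(N, n / N)
pi_ij = np.full((N, N), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi_ij, n / N)     # convention pi_ii = pi_i

delta = pi_ij - np.outer(pi, pi)   # Delta_ij
z = y / pi
print(z @ delta @ z)                           # double-sum variance
print(N**2 * (1 - n / N) * y.var(ddof=1) / n)  # SRS closed form
```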

Result 3: Assuming $\pi_{ij} > 0 \ \forall i, j \in U$, the variance of $\hat{Y}_{HT}$ is estimated by:

$$\hat{V}(\hat{Y}_{HT}) = \sum_{i \in s} \sum_{j \in s} \frac{\Delta_{ij}}{\pi_{ij}} \frac{y_i}{\pi_i} \frac{y_j}{\pi_j} \tag{4.4}$$
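Provided the double inclusion probabilities are available for the design, the estimator is a direct translation of the formula. A minimal Python sketch (the function name and argument layout are ours):

```python
import numpy as np

def ht_variance_estimate(y_s, pi_s, pi_ij_s):
    """Horvitz-Thompson variance estimator (4.4).

    y_s     : study values of the sampled units
    pi_s    : their first-order inclusion probabilities
    pi_ij_s : their double inclusion probabilities, with
              pi_ij_s[i, i] = pi_s[i]
    """
    delta = pi_ij_s - np.outer(pi_s, pi_s)
    z = y_s / pi_s
    return z @ (delta / pi_ij_s) @ z
```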

Result 4 (Sen-Yates-Grundy): Assuming the sampling design is of fixed size, the variance of $\hat{Y}_{HT}$ can alternatively be written as:

$$V_{SYG}(\hat{Y}) = -\frac{1}{2} \sum_{i \in U} \sum_{j \in U} \Delta_{ij} \left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2 \tag{4.5}$$

$$\hat{V}_{SYG}(\hat{Y}) = -\frac{1}{2} \sum_{i \in s} \sum_{j \in s} \frac{\Delta_{ij}}{\pi_{ij}} \left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2$$

The proof stems from the expansion of the square term in (4.5): $$-\frac{1}{2} \sum_{i \in U} \sum_{j \in U} \Delta_{ij} \left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2 = -\sum_{i \in U} \sum_{j \in U} \Delta_{ij} \frac{y_i^2}{\pi_i^2} + \sum_{i \in U} \sum_{j \in U} \Delta_{ij} \frac{y_i}{\pi_i} \frac{y_j}{\pi_j}$$

Assuming fixed sample size and using Lemmas 1 and 3, we have for all $i \in U$: $$\sum_{j \in U} \Delta_{ij} = \sum_{j \in U} \pi_{ij} - \sum_{j \in U} \pi_i \pi_j = n \pi_i - \pi_i \sum_{j \in U} \pi_j = n\pi_i - n\pi_i = 0$$

The first term therefore vanishes, and the expansion reduces to the variance given in Result 2. Thus the result is proved.
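The Sen-Yates-Grundy estimator is implemented in the same spirit as (4.4); remember that it is only valid for fixed-size designs. A minimal sketch (names are ours):

```python
import numpy as np

def syg_variance_estimate(y_s, pi_s, pi_ij_s):
    """Sen-Yates-Grundy variance estimator (fixed-size designs only)."""
    delta = pi_ij_s - np.outer(pi_s, pi_s)
    z = y_s / pi_s
    squares = (z[:, None] - z[None, :]) ** 2   # (y_i/pi_i - y_j/pi_j)^2
    return -0.5 * np.sum(delta / pi_ij_s * squares)
```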

Consequently, when the inclusion probabilities $\pi_i$ are chosen proportional to the study variable $y_i$, the variance (4.5) is equal to 0. In practice, as the study variable $y$ is unknown, the inclusion probabilities should be taken proportional to an auxiliary variable $x$ assumed to have a linear relationship with $y$: $\pi_k \propto x_k$ (probability proportional to size sampling).
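In practice the probabilities $n x_k / \sum_U x_k$ can exceed 1 for very large units; a common fix is to take those units with certainty and rescale the remaining probabilities. A sketch of this computation (the capping rule shown is one standard choice, not the only one):

```python
import numpy as np

def pps_inclusion_probs(x, n):
    """Inclusion probabilities proportional to a size variable x,
    capping at 1 and rescaling until all probabilities are valid."""
    x = np.asarray(x, dtype=float)
    pi = n * x / x.sum()
    while (pi > 1).any():
        certain = pi >= 1
        pi[certain] = 1.0
        m = n - certain.sum()                  # draws left for the rest
        pi[~certain] = m * x[~certain] / x[~certain].sum()
    return pi
```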

As a conclusion, all the previous results show that unequal probability sampling can yield estimators with higher precision than simple random sampling or other equal probability designs. Though it may seem counter-intuitive to people who are not familiar with survey sampling, this result is important and emphasizes the value of so-called “auxiliary” information as a way to boost sampling precision.

However, it must be noted that optimality in terms of inclusion probabilities is variable-specific, in the sense that “optimal” inclusion probabilities are tied to a given study variable. Therefore, what is optimal with respect to one variable may be far from optimal for other variables. In multi-purpose surveys, this is a major problem which generally prevents the use of unequal probability sampling. Instead, survey statisticians rather use stratification, which is known to improve accuracy whatever the variable.

4.2 The Hansen-Hurwitz estimator

The Hansen-Hurwitz estimator was proposed for sampling with replacement. Let us consider an ordered sample of $m$ independent draws $(k_1, k_2, \ldots, k_m)$. At each draw, the probability of selecting individual $k$ is $p_k$.

The Hansen-Hurwitz estimator is expressed as:

$$\hat{Y}_{HH} = \frac{1}{m} \sum_{i=1}^{m} \frac{y_{k_i}}{p_{k_i}}$$
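With-replacement sampling is straightforward to simulate, which is one reason this estimator is popular. A minimal sketch with a hypothetical population whose draw probabilities are proportional to a size variable:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical population; p_k proportional to a size variable x.
N, m = 1000, 50
x = rng.gamma(2.0, 10.0, size=N)
y = 5 * x + rng.normal(0.0, 10.0, size=N)
p = x / x.sum()

draws = rng.choice(N, size=m, replace=True, p=p)  # m independent draws
y_hh = np.mean(y[draws] / p[draws])               # Hansen-Hurwitz estimate
print(y_hh, y.sum())
```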

Result 1: Assuming $p_k > 0 \ \forall k \in U$, $\hat{Y}_{HH}$ is an unbiased estimator of $Y$

Result 2: The variance of $\hat{Y}_{HH}$ is given by:

$$V(\hat{Y}_{HH}) = \frac{V_1}{m}$$

where:

$$V_1 = \sum_{i \in U} p_i \left(\frac{y_i}{p_i} - Y\right)^2$$

Result 3: The variance of $\hat{Y}_{HH}$ is estimated by:

$$\hat{V}(\hat{Y}_{HH}) = \frac{1}{m(m-1)} \sum_{i=1}^{m} \left(\frac{y_{k_i}}{p_{k_i}} - \hat{Y}_{HH}\right)^2 \tag{4.9}$$

The above formula for with-replacement sampling is easy to program, as the sketch below illustrates. Therefore, the estimator (4.9) can be used as an approximation to (4.4).
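A minimal sketch of (4.9) (the function name is ours); note that the mean of the $y_{k_i}/p_{k_i}$ is exactly $\hat{Y}_{HH}$:

```python
import numpy as np

def hh_variance_estimate(y_draws, p_draws):
    """Variance estimator (4.9) of the Hansen-Hurwitz estimator."""
    z = y_draws / p_draws        # y_{k_i} / p_{k_i} for the m draws
    m = z.size
    return np.sum((z - z.mean()) ** 2) / (m * (m - 1))
```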

Furthermore, when we assume the size $N$ of the population is large enough and $n/N \to 0$, the inclusion probability of $i$ is $\pi_i \approx m\, p_i$. Therefore (4.9) can be rewritten as:

$$\hat{V}(\hat{Y}_{HH}) = \frac{m}{m-1} \sum_{i=1}^{m} \left(\frac{y_{k_i}}{\pi_{k_i}} - \frac{\hat{Y}_{HT}}{m}\right)^2 \tag{4.10}$$

Formula (4.10) is often used; its great advantage lies in its simplicity. It is implemented in most traditional statistical software tools such as SAS, Stata, SPSS or R.
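As an illustration of that simplicity, (4.10) only requires the first-order inclusion probabilities. A minimal sketch (the function name is ours, not the API of any of the packages above):

```python
import numpy as np

def with_replacement_variance_approx(y_s, pi_s):
    """With-replacement approximation (4.10) of the HT variance,
    using only first-order inclusion probabilities."""
    z = y_s / pi_s               # y_i / pi_i for the m sampled units
    m = z.size
    y_ht = z.sum()               # Horvitz-Thompson estimate of the total
    return m / (m - 1) * np.sum((z - y_ht / m) ** 2)
```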