Chapter 5 Multistage sampling
5.1 Introduction and notations
The sampling designs presented so far are single-stage designs; that is, a sampling frame is available from which the elements can be selected directly. However, there exist many practical situations where no frame is available to identify the elements of a population. In such cases, multistage sampling is an alternative option:
- At first-stage sampling, a sample of Primary Sampling Units (PSU) is selected using a probabilistic design (e.g. simple random sampling or other, with or without stratification)
- At second-stage sampling, a sub-sample of Secondary Sampling Units (SSU) is selected within each PSU selected at the first stage. The selection of SSUs is assumed to be independent from one PSU to another.

Figure 5.1: Principle of multistage sampling (Ardilly, 2006)
The process can be continued: at third-stage sampling, a sample of Tertiary Sampling Units can be selected within each of the SSUs selected at the second stage.
For example, in the absence of any frame of individuals, we may consider selecting a sample of municipalities (first-stage sampling), a sample of neighbourhoods (second-stage sampling) within each selected municipality, a sample of households (third-stage sampling) within each of the neighbourhoods selected at the second stage, and finally a sample of individuals (fourth-stage sampling) within each household.
5.2 Estimating a population total
We wish to estimate the total $Y$ of a study variable $y$:
$$Y = \sum_{i \in U_I} \sum_{j \in U_{IIi}} y_{ij} = \sum_{i \in U_I} Y_i$$
where
- $y_{ij}$ is the value taken by $y$ on SSU $j$ of PSU $i$
- $Y_i = \sum_{j \in U_{IIi}} y_{ij}$ is the subtotal of $y$ on PSU $i$
According to the Horvitz-Thompson principle, $Y$ should be estimated by the sum $\sum_{i \in s_I} Y_i / \pi_{Ii}$, where $\pi_{Ii}$ is the inclusion probability of PSU $i$.
As the subtotal $Y_i$ is unknown, it must in turn be estimated using the same Horvitz-Thompson principle. As a result, in two-stage sampling, the total $Y$ of $y$ is estimated by the following two-stage Horvitz-Thompson estimator:
$$\hat{Y}_{HT,2st} = \sum_{i \in s_I} \frac{\hat{Y}_{i,HT}}{\pi_{Ii}} = \sum_{i \in s_I} \sum_{j \in s_{IIi}} \frac{y_{ij}}{\pi_{Ii}\,\pi_{IIj}} \tag{5.1}$$
where $\pi_{IIj}$ is the conditional inclusion probability of SSU $j$ given that PSU $i$ has been selected at the first stage.
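The double sum in the two-stage estimator can be illustrated with a minimal Python sketch. The function name, the data layout (one tuple per sampled PSU) and all numbers are hypothetical and serve only as an illustration; the first-stage and conditional second-stage inclusion probabilities are assumed to be known.

```python
def two_stage_ht_total(sample):
    """Two-stage Horvitz-Thompson estimate of a population total.

    `sample` is a list with one entry per sampled PSU:
    (pi_I, [(y, pi_II), ...]) where pi_I is the first-stage inclusion
    probability of the PSU and pi_II is the conditional inclusion
    probability of each sampled SSU within that PSU.
    """
    total = 0.0
    for pi_psu, ssus in sample:
        # Horvitz-Thompson estimate of the PSU subtotal Y_i
        y_i_hat = sum(y / pi_ssu for y, pi_ssu in ssus)
        # Weight the estimated subtotal by the first-stage probability
        total += y_i_hat / pi_psu
    return total


# Hypothetical sample: 2 PSUs, each drawn with probability 0.5
sample = [
    (0.5, [(10.0, 0.5), (14.0, 0.5)]),  # 2 SSUs, each with pi_II = 0.5
    (0.5, [(3.0, 0.25)]),               # 1 SSU with pi_II = 0.25
]
estimate = two_stage_ht_total(sample)
print(estimate)
```

Each observation ends up with the weight $1/(\pi_{Ii}\pi_{IIj})$, which is why the estimator can equivalently be computed as a single weighted sum over all sampled SSUs.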
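The variance of $\hat{Y}_{HT,2st}$ is given by:
$$V(\hat{Y}_{HT,2st}) = \sum_{i \in U_I} \sum_{i' \in U_I} \Delta_{Iii'} \frac{Y_i}{\pi_{Ii}} \frac{Y_{i'}}{\pi_{Ii'}} + \sum_{i \in U_I} \frac{V(\hat{Y}_{i,HT})}{\pi_{Ii}} \tag{5.2}$$
where $\Delta_{Iii'} = \pi_{Iii'} - \pi_{Ii}\pi_{Ii'}$, $\pi_{Iii'}$ being the joint inclusion probability of PSUs $i$ and $i'$, and
- $V_I = \sum_{i \in U_I} \sum_{i' \in U_I} \Delta_{Iii'} \frac{Y_i}{\pi_{Ii}} \frac{Y_{i'}}{\pi_{Ii'}}$ is the first-stage variance
- $V_{II} = \sum_{i \in U_I} \frac{V(\hat{Y}_{i,HT})}{\pi_{Ii}}$ is the second-stage variance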
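Proof: the variance is decomposed into first-stage variance and second-stage variance using conditional expectation and conditional variance:
$$\begin{aligned}
V(\hat{Y}_{HT,2st}) &= V_I\left[E_{II}(\hat{Y}_{HT,2st} \mid I)\right] + E_I\left[V_{II}(\hat{Y}_{HT,2st} \mid I)\right] \\
&= V_I\left(\sum_{i \in s_I} \frac{Y_i}{\pi_{Ii}}\right) + E_I\left[\sum_{i \in s_I} \frac{V(\hat{Y}_{i,HT})}{\pi_{Ii}^2}\right] \\
&= \sum_{i \in U_I} \sum_{i' \in U_I} \Delta_{Iii'} \frac{Y_i}{\pi_{Ii}} \frac{Y_{i'}}{\pi_{Ii'}} + \sum_{i \in U_I} \frac{V(\hat{Y}_{i,HT})}{\pi_{Ii}}
\end{aligned}$$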
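The variance (5.2) is unbiasedly estimated by:
$$\hat{V}(\hat{Y}_{HT,2st}) = \sum_{i \in s_I} \sum_{i' \in s_I} \frac{\Delta_{Iii'}}{\pi_{Iii'}} \frac{\hat{Y}_{i,HT}}{\pi_{Ii}} \frac{\hat{Y}_{i',HT}}{\pi_{Ii'}} + \sum_{i \in s_I} \frac{\hat{V}(\hat{Y}_{i,HT})}{\pi_{Ii}} \tag{5.3}$$
where $\hat{V}(\hat{Y}_{i,HT})$ is an unbiased estimator of the variance $V(\hat{Y}_{i,HT})$.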
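Proof:
$$\begin{aligned}
E_{II}\left(\sum_{i \in s_I} \sum_{i' \in s_I} \frac{\Delta_{Iii'}}{\pi_{Iii'}} \frac{\hat{Y}_{i,HT}}{\pi_{Ii}} \frac{\hat{Y}_{i',HT}}{\pi_{Ii'}} \,\Big|\, I\right)
&= \sum_{i \in s_I} \sum_{\substack{i' \in s_I \\ i' \neq i}} \frac{\Delta_{Iii'}}{\pi_{Iii'}} \frac{E_{II}(\hat{Y}_{i,HT}\hat{Y}_{i',HT} \mid I)}{\pi_{Ii}\pi_{Ii'}} + \sum_{i \in s_I} \frac{\Delta_{Ii}}{\pi_{Ii}} \frac{E_{II}(\hat{Y}_{i,HT}^2 \mid I)}{\pi_{Ii}^2} \\
&= \sum_{i \in s_I} \sum_{\substack{i' \in s_I \\ i' \neq i}} \frac{\Delta_{Iii'}}{\pi_{Iii'}} \frac{E_{II}(\hat{Y}_{i,HT} \mid I)\,E_{II}(\hat{Y}_{i',HT} \mid I)}{\pi_{Ii}\pi_{Ii'}} + \sum_{i \in s_I} \frac{\Delta_{Ii}}{\pi_{Ii}} \frac{E_{II}^2(\hat{Y}_{i,HT} \mid I) + V(\hat{Y}_{i,HT})}{\pi_{Ii}^2} \\
&= \sum_{i \in s_I} \sum_{i' \in s_I} \frac{\Delta_{Iii'}}{\pi_{Iii'}} \frac{E_{II}(\hat{Y}_{i,HT} \mid I)\,E_{II}(\hat{Y}_{i',HT} \mid I)}{\pi_{Ii}\pi_{Ii'}} + \sum_{i \in s_I} \frac{\Delta_{Ii}}{\pi_{Ii}} \frac{V(\hat{Y}_{i,HT})}{\pi_{Ii}^2} \\
&= \sum_{i \in s_I} \sum_{i' \in s_I} \frac{\Delta_{Iii'}}{\pi_{Iii'}} \frac{Y_i Y_{i'}}{\pi_{Ii}\pi_{Ii'}} + \sum_{i \in s_I} \frac{1 - \pi_{Ii}}{\pi_{Ii}^2} V(\hat{Y}_{i,HT})
\end{aligned}$$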
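Taking the expectation with respect to the first stage, we get:
$$\begin{aligned}
E\left(\sum_{i \in s_I} \sum_{i' \in s_I} \frac{\Delta_{Iii'}}{\pi_{Iii'}} \frac{\hat{Y}_{i,HT}}{\pi_{Ii}} \frac{\hat{Y}_{i',HT}}{\pi_{Ii'}}\right)
&= \sum_{i \in U_I} \sum_{i' \in U_I} \Delta_{Iii'} \frac{Y_i}{\pi_{Ii}} \frac{Y_{i'}}{\pi_{Ii'}} + \sum_{i \in U_I} \frac{1 - \pi_{Ii}}{\pi_{Ii}} V(\hat{Y}_{i,HT}) \\
&= \sum_{i \in U_I} \sum_{i' \in U_I} \Delta_{Iii'} \frac{Y_i}{\pi_{Ii}} \frac{Y_{i'}}{\pi_{Ii'}} + \sum_{i \in U_I} \frac{V(\hat{Y}_{i,HT})}{\pi_{Ii}} - \sum_{i \in U_I} V(\hat{Y}_{i,HT}) \\
&= V(\hat{Y}_{HT,2st}) - \sum_{i \in U_I} V(\hat{Y}_{i,HT})
\end{aligned}$$
Since we also have
$$E\left(\sum_{i \in s_I} \frac{\hat{V}(\hat{Y}_{i,HT})}{\pi_{Ii}}\right) = \sum_{i \in U_I} V(\hat{Y}_{i,HT})$$
we can conclude that (5.3) is an unbiased estimator of the variance.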
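The first term of (5.3), known as the "Ultimate Cluster" estimator, is often considered a more tractable option for variance estimation under multistage sampling, even though it is slightly biased: its expectation falls short of the true variance by $\sum_{i \in U_I} V(\hat{Y}_{i,HT})$.
$$\hat{V}_{UC}(\hat{Y}_{HT,2st}) = \sum_{i \in s_I} \sum_{i' \in s_I} \frac{\Delta_{Iii'}}{\pi_{Iii'}} \frac{\hat{Y}_{i,HT}}{\pi_{Ii}} \frac{\hat{Y}_{i',HT}}{\pi_{Ii'}} \tag{5.4}$$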
However, the bias should be negligible as long as the first-stage inclusion probabilities $\pi_{Ii}$ are small. In that case, treating the first-stage selection as sampling with replacement, the Hansen-Hurwitz estimator can be used as a sensible estimator of the variance under multistage sampling:
$$\hat{V}_{UC}(\hat{Y}_{HT,2st}) = \frac{1}{m} \frac{1}{m-1} \sum_{i \in s_I} \left(\frac{\hat{Y}_{i,HT}}{p_{Ii}} - \hat{Y}_{HT,2st}\right)^2 \tag{5.5}$$
where $m$ is the number of PSUs drawn at the first stage and $p_{Ii} = \pi_{Ii}/m$.
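Formula (5.5) is general enough to accommodate most of the multistage sampling designs used in practice. It holds under mild assumptions, the main one being that the sampling fraction at the first stage is close to zero.
If this is not the case, the variance formula (5.5) can be adjusted using the finite population correction factor at the first stage:
$$\hat{V}_{UC}(\hat{Y}_{HT,2st}) = (1 - f) \frac{1}{m} \frac{1}{m-1} \sum_{i \in s_I} \left(\frac{\hat{Y}_{i,HT}}{p_{Ii}} - \hat{Y}_{HT,2st}\right)^2 \tag{5.6}$$
where $1 - f = 1 - \frac{m}{M}$, $M$ being the number of PSUs in the population.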
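As an illustration, the Hansen-Hurwitz variance estimator and its finite-population-corrected variant can be sketched in a few lines of Python. The function name, argument names and numbers below are hypothetical; the inputs are assumed to be the estimated PSU subtotals $\hat{Y}_{i,HT}$ and the first-stage inclusion probabilities $\pi_{Ii}$ of the sampled PSUs.

```python
def hansen_hurwitz_variance(psu_total_estimates, psu_inclusion_probs,
                            n_psu_population=None):
    """Return (estimated total, estimated variance) from PSU-level inputs.

    If `n_psu_population` (M) is given, the first-stage finite
    population correction 1 - m/M is applied to the variance.
    """
    m = len(psu_total_estimates)
    if m < 2:
        raise ValueError("at least two sampled PSUs are needed")
    # With p_Ii = pi_Ii / m, each z_i is on its own an estimate of Y
    z = [y_hat / (pi / m)
         for y_hat, pi in zip(psu_total_estimates, psu_inclusion_probs)]
    total = sum(z) / m  # same value as the sum of Y_i_hat / pi_Ii
    variance = sum((z_i - total) ** 2 for z_i in z) / (m * (m - 1))
    if n_psu_population is not None:
        variance *= 1 - m / n_psu_population
    return total, variance


# Hypothetical toy data: 4 sampled PSUs
y_hat = [25.0, 15.0, 30.0, 50.0]
pi = [0.25, 0.125, 0.25, 0.5]
total, var_wr = hansen_hurwitz_variance(y_hat, pi)
_, var_fpc = hansen_hurwitz_variance(y_hat, pi, n_psu_population=40)
print(total, var_wr, var_fpc)
```

Only PSU-level quantities enter the computation, which is what makes this estimator convenient: the sub-sampling design within PSUs does not need to be described explicitly.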
5.3 Case of simple random sampling at each stage
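Let us assume simple random sampling of $m$ PSUs out of $M$ at the first stage and, within each PSU $i$, simple random sampling of $n_i$ SSUs out of $N_i$. Then the variance (5.2) can be written as:
$$V(\hat{Y}_{HT,2st}) = M^2 \left(1 - \frac{m}{M}\right) \frac{S_Y^2}{m} + \frac{M}{m} \left[\sum_{i \in U_I} N_i^2 \left(1 - \frac{n_i}{N_i}\right) \frac{S_i^2}{n_i}\right] \tag{5.7}$$
where $S_Y^2$ is the dispersion among the PSU totals $Y_i$ and $S_i^2$ is the dispersion of $y$ within PSU $i$.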
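The two components of the variance under simple random sampling at both stages can be evaluated numerically when the whole population is known, for instance in a simulation study. The sketch below is only an illustration: the function name and the toy population are hypothetical, and the dispersions are assumed to use the usual $M-1$ and $N_i-1$ divisors (which is what `statistics.variance` computes).

```python
from statistics import variance


def two_stage_srs_variance(population, m, n):
    """First- and second-stage variance components under two-stage SRS.

    `population` is a list of PSUs, each a list of all y-values in it;
    `m` is the number of PSUs drawn; `n[i]` is the number of SSUs that
    would be drawn in PSU i. Each PSU needs at least two elements.
    """
    M = len(population)
    totals = [sum(psu) for psu in population]  # PSU totals Y_i
    first_stage = M**2 * (1 - m / M) * variance(totals) / m
    second_stage = (M / m) * sum(
        len(psu)**2 * (1 - n_i / len(psu)) * variance(psu) / n_i
        for psu, n_i in zip(population, n)
    )
    return first_stage, second_stage


# Hypothetical population of 3 PSUs with 4 elements each
population = [[1, 2, 3, 4], [2, 4, 6, 8], [5, 5, 5, 5]]
v1, v2 = two_stage_srs_variance(population, m=2, n=[2, 2, 2])
# Cluster sampling (n_i = N_i): the second-stage component vanishes
v1_cluster, v2_cluster = two_stage_srs_variance(population, m=2, n=[4, 4, 4])
print(v1, v2, v2_cluster)
```

Separating the two components makes it easy to see which stage dominates the total variance for a given allocation.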
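When we assume $n_i = N_i \ \forall i$ (cluster sampling) and $N_i = \bar{N} \ \forall i$, the variance (5.7) can be written as:
$$V(\hat{Y}_{HT,2st}) = M^2 \left(1 - \frac{m}{M}\right) \frac{S_Y^2}{m} \approx N^2 \left(1 - \frac{n}{N}\right) \frac{S^2}{n} \left[1 + \rho(\bar{N} - 1)\right] \tag{5.8}$$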
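where $\rho$ denotes the intra-cluster correlation coefficient: the more homogeneous the PSUs with regard to the characteristics of interest, the higher the coefficient.
$$\rho = \frac{1}{\bar{N} - 1} \, \frac{\sum_{i \in U_I} \sum_{j \in U_{IIi}} \sum_{\substack{k \in U_{IIi} \\ k \neq j}} (y_{ij} - \bar{Y})(y_{ik} - \bar{Y})}{\sum_{i \in U_I} \sum_{j \in U_{IIi}} (y_{ij} - \bar{Y})^2} \approx \frac{S_B^2}{S^2} - \frac{1}{\bar{N}} \tag{5.9}$$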
The factor $[1 + \rho(\bar{N} - 1)]$ measures the gain or loss in accuracy due to clustering as compared to simple random sampling of the same size. This coefficient is known as the design effect (Deff). If $\rho > 0$, clustering leads to a loss in accuracy as compared to simple random sampling, while $\rho < 0$ means that clustering has a positive effect on accuracy.
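To make the role of $\rho$ concrete, the sketch below computes the intra-cluster correlation coefficient from its definition for equal-sized clusters, then the corresponding design effect. Function names and the two toy populations are hypothetical and chosen as extreme cases: perfectly homogeneous clusters and clusters that all have the same composition.

```python
def intra_cluster_correlation(clusters):
    """Intra-cluster correlation for equal-sized clusters (size >= 2)."""
    n_bar = len(clusters[0])
    if n_bar < 2 or any(len(c) != n_bar for c in clusters):
        raise ValueError("clusters must share a common size of at least 2")
    values = [y for cluster in clusters for y in cluster]
    y_bar = sum(values) / len(values)
    # Sum of cross-products over distinct pairs within each cluster
    cross = sum(
        (c[j] - y_bar) * (c[k] - y_bar)
        for c in clusters
        for j in range(n_bar)
        for k in range(n_bar)
        if k != j
    )
    total_ss = sum((y - y_bar) ** 2 for y in values)
    return cross / ((n_bar - 1) * total_ss)


def design_effect(rho, n_bar):
    """Design effect factor 1 + rho * (n_bar - 1)."""
    return 1 + rho * (n_bar - 1)


homogeneous = [[1, 1], [3, 3]]    # elements identical within clusters
heterogeneous = [[1, 3], [1, 3]]  # every cluster mirrors the population
rho_hom = intra_cluster_correlation(homogeneous)
rho_het = intra_cluster_correlation(heterogeneous)
print(rho_hom, design_effect(rho_hom, 2))
print(rho_het, design_effect(rho_het, 2))
```

In the second population every cluster has the same total, so drawing clusters carries no sampling error at all, which is consistent with a design effect of zero.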
In short, in order to reduce the negative impact of clustering on accuracy, PSUs should be constructed so as to be as heterogeneous as possible with respect to the target characteristics of the study, and as small as possible. This logic is the opposite of that of stratification, where strata must be as homogeneous as possible in order to be efficient in terms of variance reduction. In practice, stratification can be combined with multistage selection in order to offset the loss of accuracy due to clustering.
5.4 Design optimality
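The result (5.8) can be extended beyond cluster sampling to two-stage sample selection where the second-stage sample size $\bar{n}$ is the same within each PSU. In that case, with $n = m\bar{n}$, the variance is:
$$V(\hat{Y}_{HT,2st}) \approx N^2 \frac{S^2}{n} \left[1 + \rho(\bar{n} - 1)\right] = N^2 S^2 \left[\frac{1 - \rho}{m\bar{n}} + \frac{\rho}{m}\right] \tag{5.10}$$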
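The above formula can be used to determine the optimal sample sizes $m$ and $\bar{n}$ at each sampling stage so as to minimise the variance (5.10) under the cost constraint $c_1 m + c_2 m \bar{n} = C$, where $c_1$ is the average cost of surveying a PSU and $c_2$ the cost of surveying an SSU. We get the following result:
$$\bar{n} = \sqrt{\frac{c_1}{c_2} \, \frac{1 - \rho}{\rho}} \qquad m = \frac{C}{c_1 + \sqrt{c_1 c_2 \, \frac{1 - \rho}{\rho}}} \tag{5.11}$$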
Therefore, when the average cost $c_1$ of surveying a PSU is low compared to the cost $c_2$ of surveying an SSU, the optimal $\bar{n}$ is small and the effort should be put on the first-stage sample size $m$ in order to reach optimal accuracy under the cost constraint.
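The optimal allocation can be computed directly from the cost parameters and the intra-cluster correlation. The following Python sketch is a minimal illustration under the assumptions of this section (equal second-stage sample sizes, $0 < \rho < 1$); the function name and the numerical values are hypothetical, and in practice both results would be rounded to integers.

```python
from math import sqrt


def optimal_two_stage_sizes(c1, c2, rho, budget):
    """Return (m, n_bar) minimising the variance under c1*m + c2*m*n_bar = budget.

    c1: average cost of surveying a PSU; c2: cost of surveying an SSU;
    rho: intra-cluster correlation, assumed strictly between 0 and 1.
    """
    if not 0 < rho < 1:
        raise ValueError("rho must lie strictly between 0 and 1")
    n_bar = sqrt((c1 / c2) * (1 - rho) / rho)  # SSUs per sampled PSU
    m = budget / (c1 + c2 * n_bar)             # number of PSUs affordable
    return m, n_bar


# Hypothetical costs: a PSU costs 100, an SSU costs 4, total budget 10000
m, n_bar = optimal_two_stage_sizes(c1=100, c2=4, rho=0.2, budget=10000)
print(round(m, 2), round(n_bar, 2))  # roughly 71 PSUs with 10 SSUs each

# A cheaper PSU shifts the effort towards the first stage
m_cheap, n_bar_cheap = optimal_two_stage_sizes(c1=25, c2=4, rho=0.2, budget=10000)
print(round(m_cheap, 2), round(n_bar_cheap, 2))
```

Note that $\bar{n}$ does not depend on the budget $C$: the budget only determines how many PSUs can be afforded once the number of SSUs per PSU is fixed.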