Chapter 3 Stratification

3.1 What is it?

Stratification is a commonly used technique in survey sampling. It consists of dividing the target population $U$ into $H$ non-overlapping sub-populations $U_1, U_2, \dots, U_H$, called strata, which together make up the whole population $U$.

A sample $s_h$ of size $n_h$ is drawn within each stratum $h$, independently from one stratum to another. Let $n = \sum_{h=1}^{H} n_h$ denote the overall sample size.

Figure 3.1: Stratification

For example, a population of individuals can be divided according to age, gender or city of residence. A population of businesses can be stratified by characteristics such as company size or sector of activity, as set out by the NACE classification.

Stratification is an example of a sampling technique in which auxiliary information is used to improve sampling efficiency. This auxiliary information consists of the stratification characteristics, whose values are assumed to be known a priori for every unit in the target population. As we’ll see, stratification improves the accuracy of the estimates when the stratification variables are strongly related to the study variables of the survey.

Overall, stratification is driven either by statistical considerations (the need for more accurate results at global or local level) or by organisational considerations (for example, a stratum corresponds to a regional office in charge of data collection within its region).

3.2 Total, mean and proportion estimators

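Let $N_h$ be the size of stratum $h$, so that $N = \sum_{h=1}^{H} N_h$ is the size of the whole population. In stratification, the size $N_h$ is assumed to be known for every stratum $h$. The total $Y$ of a study variable $y$ can then be written as:

$$Y = \sum_{i \in U} y_i = \sum_{h=1}^{H} \sum_{i \in U_h} y_i = \sum_{h=1}^{H} Y_h = \sum_{h=1}^{H} N_h \bar{Y}_h$$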

As to the population mean $\bar{Y}$ of $y$, it can be written as the weighted average of the stratum means $\bar{Y}_h$, using the ratios $W_h = N_h/N$ as weights:

$$\bar{Y} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{Y}_h = \sum_{h=1}^{H} W_h \bar{Y}_h$$

Similarly, the proportion $P$ can be written as a weighted average of the stratum proportions $P_h$:

$$P = \sum_{h=1}^{H} \frac{N_h}{N} P_h = \sum_{h=1}^{H} W_h P_h$$

Assuming simple random sampling within each stratum (stratified simple random sampling), unbiased estimators of population totals, means and proportions are given by:

$$\hat{Y}_{STSRS} = \sum_{h=1}^{H} N_h \bar{y}_h$$

$$\hat{\bar{Y}}_{STSRS} = \sum_{h=1}^{H} W_h \bar{y}_h$$

$$\hat{P}_{STSRS} = \sum_{h=1}^{H} W_h p_h$$

In stratified simple random sampling, the design weights are equal within each stratum: $d_i = N_h/n_h \quad \forall i \in s_h$
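As a minimal sketch of these estimators, the Python snippet below computes the estimated total and mean from per-stratum samples, and checks that the weighted sum of the sampled values with design weights $d_i = N_h/n_h$ gives the same total. The function name, the stratum labels and the data are hypothetical and serve only as an illustration.

```python
from statistics import mean

def stratified_estimates(stratum_sizes, samples):
    """Estimate the population total and mean under stratified SRS.

    stratum_sizes: dict mapping stratum label -> N_h (known population size)
    samples:       dict mapping stratum label -> list of sampled y values
    """
    N = sum(stratum_sizes.values())
    # Total estimator: sum over strata of N_h times the stratum sample mean
    total_hat = sum(N_h * mean(samples[h]) for h, N_h in stratum_sizes.items())
    # Mean estimator: estimated total divided by N (same as sum of W_h * ybar_h)
    mean_hat = total_hat / N
    return total_hat, mean_hat

# Hypothetical data: two strata with known sizes
stratum_sizes = {"small": 800, "large": 200}
samples = {"small": [10, 12, 14], "large": [50, 70]}

total_hat, mean_hat = stratified_estimates(stratum_sizes, samples)

# Same total obtained from the design weights d_i = N_h / n_h
weighted_total = sum(
    stratum_sizes[h] / len(ys) * y for h, ys in samples.items() for y in ys
)
```

Both routes should agree up to floating-point rounding, since weighting each sampled value by $N_h/n_h$ is the same as multiplying the stratum sample mean by $N_h$.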

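Using the results for simple random sampling, the variances of the three estimators above are given by:

$$V(\hat{Y}_{STSRS}) = \sum_{h=1}^{H} N_h^2 (1-f_h) \frac{S_h^2}{n_h}$$

$$V(\hat{\bar{Y}}_{STSRS}) = \sum_{h=1}^{H} W_h^2 (1-f_h) \frac{S_h^2}{n_h} \tag{3.8}$$

$$V(\hat{P}_{STSRS}) \approx \sum_{h=1}^{H} W_h^2 (1-f_h) \frac{P_h(1-P_h)}{n_h}$$

where $f_h = n_h/N_h$ is the sampling fraction and $S_h^2$ the dispersion of $y$ in stratum $h$.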
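The variance of the stratified mean estimator can be evaluated numerically once the stratum sizes, sample sizes and dispersions are given. The sketch below is one way to do it; the function name and the figures are hypothetical. In practice the dispersions $S_h^2$ are unknown and are usually replaced by the sample variances computed within each stratum.

```python
def var_stratified_mean(N_h, n_h, S2_h):
    """Variance of the stratified mean estimator under stratified SRS.

    N_h:  list of stratum population sizes
    n_h:  list of stratum sample sizes
    S2_h: list of stratum dispersions S_h^2
    """
    N = sum(N_h)
    return sum(
        (Nh / N) ** 2 * (1 - nh / Nh) * s2 / nh   # W_h^2 (1 - f_h) S_h^2 / n_h
        for Nh, nh, s2 in zip(N_h, n_h, S2_h)
    )

# Hypothetical example: two strata sampled at the same rate of 10%
N_h = [800, 200]
n_h = [80, 20]
S2_h = [4.0, 100.0]

v_mean = var_stratified_mean(N_h, n_h, S2_h)
# Variance of the total estimator: N^2 times the variance of the mean estimator
v_total = sum(N_h) ** 2 * v_mean
```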

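Assume the sampling fraction $f_h$ is the same in each stratum: $f_h = n_h/N_h = n/N = f \quad \forall h$. The variance (3.8) can then be rewritten as:

$$V(\hat{\bar{Y}}_{STSRS}) = \sum_{h=1}^{H} W_h^2 (1-f_h) \frac{S_h^2}{n_h} = (1-f) \sum_{h=1}^{H} W_h^2 \frac{S_h^2}{n_h} = (1-f) \sum_{h=1}^{H} W_h \frac{S_h^2}{n} = (1-f) \frac{S_w^2}{n}$$

where $S_w^2 = \sum_{h=1}^{H} W_h S_h^2$ is the within-stratum dispersion. The third equality uses $n_h = n W_h$, which follows from the common sampling fraction.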

As $S_w^2$ is no greater than the total population dispersion $S^2$ (up to terms that are negligible when the strata are large), it can be shown that:

$$V(\hat{\bar{Y}}_{STSRS}) = (1-f) \frac{S_w^2}{n} \le (1-f) \frac{S^2}{n} = V(\hat{\bar{Y}}_{SRS})$$

Therefore, with a common sampling fraction, stratification leads to mean estimators that are at least as accurate as those obtained under simple random sampling. The gain in accuracy is measured directly by the ratio $\rho = S_w^2/S^2$: the more homogeneous the strata with respect to the study variable, the greater the gain. For instance, in a survey about housing rents, stratification should be done according to characteristics such as geographical region, dwelling type or living area. In a business survey, relevant stratification criteria are company size or sector of activity (NACE).
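To make the ratio $\rho$ concrete, the following sketch computes $S_w^2$, $S^2$ and $\rho$ on a tiny made-up population split into two very homogeneous strata. The data and the stratum labels are purely illustrative assumptions.

```python
from statistics import variance

# Hypothetical population: values of y grouped by stratum
strata = {"A": [1, 2, 3, 4], "B": [11, 12, 13, 14]}

population = [y for values in strata.values() for y in values]
N = len(population)

# Total dispersion S^2 (variance with divisor N - 1)
S2 = variance(population)

# Within-stratum dispersion S_w^2 = sum of W_h * S_h^2
S2_w = sum(len(values) / N * variance(values) for values in strata.values())

# Ratio rho: the smaller it is, the larger the gain from stratification
rho = S2_w / S2
```

Here the two strata differ mainly in their means, so almost all of the total dispersion lies between strata and $\rho$ is close to zero.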

3.3 Sample allocation

Let us assume the overall sample size $n$ has been fixed (generally on budgetary grounds). We seek to determine the sample size $n_h$ to be drawn from each stratum in order to achieve statistical optimality under cost considerations. To that end, different allocation schemes have been proposed in the literature.

3.3.1 Equal allocation

In equal allocation, the sample size $n_h$ is the same in every stratum: $\forall h \quad n_h^{eq} = n/H$

This scheme ensures a minimum level of precision in each stratum. It is therefore driven by local considerations. On the other hand, equal allocation performs poorly when it comes to national estimation, especially when the dispersions $S_h^2$ differ from one stratum to another.
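The local benefit of equal allocation can be seen by comparing the standard error of each stratum mean under equal and proportional allocation. The sketch below does so for three hypothetical strata of very different sizes, assuming a common dispersion; all names and figures are illustrative.

```python
from math import sqrt

def stratum_mean_se(N_h, n_h, S2_h):
    """Standard error of the sample mean within one stratum under SRS."""
    return sqrt((1 - n_h / N_h) * S2_h / n_h)

# Hypothetical strata sizes, with a common dispersion S_h^2 = 25
sizes = [9000, 700, 300]
S2 = 25.0
n = 300
H = len(sizes)
N = sum(sizes)

equal = [n / H for _ in sizes]                 # 100 units in every stratum
proportional = [n * Nh / N for Nh in sizes]    # 270, 21 and 9 units

se_equal = [stratum_mean_se(Nh, nh, S2) for Nh, nh in zip(sizes, equal)]
se_prop = [stratum_mean_se(Nh, nh, S2) for Nh, nh in zip(sizes, proportional)]
```

With these figures, the worst stratum-level standard error is much smaller under equal allocation, because the smallest stratum receives 100 units instead of 9.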

3.3.2 Proportional allocation

Proportional allocation consists of selecting a sample in each stratum in proportion to the size $N_h$ of the stratum population: $\forall h \quad n_h^{prop} = n N_h/N = n W_h$

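As shown above, proportional allocation, which amounts to using the same sampling fraction $f = n/N$ in every stratum, is at least as efficient in terms of accuracy as simple random sampling of the same size $n$. Its variance is:

$$V(\hat{\bar{Y}}_{prop}) = (1-f) \frac{S_w^2}{n} = \frac{\sum_{h=1}^{H} W_h S_h^2}{n} - \frac{\sum_{h=1}^{H} W_h S_h^2}{N}$$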

This result justifies the use of stratification as a powerful variance reduction technique.
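In practice the quantities $n W_h$ are rarely integers. The sketch below computes the proportional allocation and then rounds it with a largest-remainder rule so that the stratum sample sizes still add up to $n$. The rounding rule is an assumption made for this example, not something prescribed in this chapter, and the function names and figures are hypothetical.

```python
from math import floor

def proportional_allocation(n, stratum_sizes):
    """Real-valued proportional allocation n_h = n * N_h / N."""
    N = sum(stratum_sizes)
    return [n * Nh / N for Nh in stratum_sizes]

def round_preserving_total(alloc):
    """Round to integers with the largest-remainder rule (illustrative choice)."""
    base = [floor(a) for a in alloc]
    shortfall = round(sum(alloc)) - sum(base)
    # Give the remaining units to the strata with the largest fractional parts
    order = sorted(range(len(alloc)), key=lambda i: alloc[i] - base[i], reverse=True)
    for i in order[:shortfall]:
        base[i] += 1
    return base

# Hypothetical strata
stratum_sizes = [523, 311, 166]
n = 100

exact = proportional_allocation(n, stratum_sizes)   # about 52.3, 31.1, 16.6
rounded = round_preserving_total(exact)
```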

3.3.3 Optimal allocation

Optimal allocation (also called Neyman allocation) seeks to minimize the variance (3.8) under the cost constraint $\sum_{h=1}^{H} c_h n_h = C_0$, where $C_0$ is the overall budget available and $c_h$ the average survey cost for an individual in stratum $h$.

The solution to this problem is given by:

$$\forall h \quad n_h^{opt} = \frac{N_h S_h}{\sqrt{c_h}} \cdot \frac{C_0}{\sum_{h=1}^{H} N_h S_h \sqrt{c_h}}$$
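The cost-constrained solution can be sketched in a few lines of Python. The function name, the stratum figures and the unit costs below are hypothetical. The result is real-valued and would still need rounding, and a computed $n_h$ larger than $N_h$ would have to be capped at $N_h$, a case this sketch does not handle.

```python
from math import sqrt

def optimal_allocation(budget, N_h, S_h, c_h):
    """Allocation minimising the variance of the stratified mean under a budget.

    n_h = budget * (N_h * S_h / sqrt(c_h)) / sum_k(N_k * S_k * sqrt(c_k))
    """
    denom = sum(N * S * sqrt(c) for N, S, c in zip(N_h, S_h, c_h))
    return [budget * (N * S / sqrt(c)) / denom for N, S, c in zip(N_h, S_h, c_h)]

# Hypothetical strata: sizes, standard deviations and unit costs
N_h = [800, 200]
S_h = [2.0, 10.0]
c_h = [1.0, 4.0]
budget = 100.0

alloc = optimal_allocation(budget, N_h, S_h, c_h)

# The allocation spends the whole budget
spent = sum(c * n for c, n in zip(c_h, alloc))

# With unit costs everywhere, the budget is simply the overall sample size
alloc_unit = optimal_allocation(100.0, N_h, S_h, [1.0, 1.0])
```

Compared with the unit-cost case, the second stratum receives fewer units here because each of its interviews is assumed to cost four times as much.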

If the unit cost is the same in every stratum (e.g. $c_h = 1 \quad \forall h$, so that $C_0 = n$), then the sample size in stratum $h$ is:

$$n_h^{opt} = n \frac{N_h S_h}{\sum_{h=1}^{H} N_h S_h}$$

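and the variance (3.8) under Neyman allocation (3.15) is:

$$\begin{aligned}
V(\hat{\bar{Y}}_{opt}) &= \frac{\left(\sum_{h=1}^{H} W_h S_h\right)^2}{n} - \frac{\sum_{h=1}^{H} W_h S_h^2}{N} \\
&= V(\hat{\bar{Y}}_{prop}) - \frac{1}{n} \sum_{h} W_h (S_h - \bar{S})^2 \\
&\approx V(\hat{\bar{Y}}_{SRS}) - \frac{1}{n} \sum_{h} W_h (\bar{Y}_h - \bar{Y})^2 - \frac{1}{n} \sum_{h} W_h (S_h - \bar{S})^2
\end{aligned}$$

where $\bar{S} = \sum_{h} W_h S_h$ is the weighted average of the stratum standard deviations. The last line is an approximation that neglects the sampling fraction $f$ and terms of order $1/N_h$.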
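The relation between the variances under proportional and Neyman allocation can be checked numerically. The sketch below uses two hypothetical strata with unit costs and verifies that the difference between the two variances equals $\frac{1}{n} \sum_h W_h (S_h - \bar{S})^2$. All names and figures are assumptions made for the illustration.

```python
def var_stratified_mean(N_h, n_h, S_h):
    """Variance of the stratified mean: sum of W_h^2 (1 - f_h) S_h^2 / n_h."""
    N = sum(N_h)
    return sum(
        (Nh / N) ** 2 * (1 - nh / Nh) * Sh ** 2 / nh
        for Nh, nh, Sh in zip(N_h, n_h, S_h)
    )

# Hypothetical strata: sizes and standard deviations
N_h = [800, 200]
S_h = [2.0, 10.0]
n = 100

N = sum(N_h)
W = [Nh / N for Nh in N_h]
S_bar = sum(w * s for w, s in zip(W, S_h))        # weighted mean of the S_h

n_prop = [n * w for w in W]                       # proportional allocation
n_opt = [n * w * s / S_bar for w, s in zip(W, S_h)]  # Neyman allocation

v_prop = var_stratified_mean(N_h, n_prop, S_h)
v_opt = var_stratified_mean(N_h, n_opt, S_h)

# Gain of Neyman over proportional allocation predicted by the formula
gain = sum(w * (s - S_bar) ** 2 for w, s in zip(W, S_h)) / n
```

In this made-up case the stratum standard deviations differ by a factor of five, so the gain is substantial. It shrinks to zero as the $S_h$ become equal.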

Contrary to proportional allocation, the Neyman allocation is variable-specific: optimality is defined with respect to one study variable, and what is optimal with respect to one variable may be far from optimal with respect to another.

In addition, one can prove (see e.g. Cochran (1977)) that the gain in accuracy compared to proportional allocation is often small unless the stratum standard deviations $S_h$ differ markedly. That’s why in practice proportional allocation is often preferred to optimal allocation.

3.3.4 Balanced allocation

Both proportional and Neyman allocations increase accuracy at global level, but they may perform very poorly for regional estimates. For instance, proportional allocation of a global sample of 5000 individuals yields a sample of 500 individuals in a stratum whose weight $W_h$ is 10% of the whole population, but only 100 individuals in a stratum whose weight is 2%.

In order to reconcile local and global considerations, a balanced approach can be adopted: a subsample of size $\tilde{n} \le n$ may be allocated equally among the strata in order to ensure a minimum level of precision in each group, while the rest of the sample, of size $n - \tilde{n}$, may be allocated using either proportional or optimal allocation in order to optimize accuracy at global level. With proportional allocation for the second part, this gives:

$$\forall h \quad n_h^{bal} = \frac{\tilde{n}}{H} + (n - \tilde{n}) W_h$$
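As an illustration, the sketch below applies the balanced allocation to a sample of 5000 individuals, with the equal part set to 1000. The stratum weights and the choice of 1000 are hypothetical; only the overall sample size and the 10% and 2% weights echo the example given earlier.

```python
def balanced_allocation(n, n_tilde, weights):
    """Balanced allocation: equal share of n_tilde plus proportional share of the rest."""
    H = len(weights)
    return [n_tilde / H + (n - n_tilde) * w for w in weights]

# Hypothetical stratum weights W_h (they sum to 1)
weights = [0.50, 0.38, 0.10, 0.02]
n = 5000
n_tilde = 1000   # part of the sample spread equally over the strata

balanced = balanced_allocation(n, n_tilde, weights)
proportional = [n * w for w in weights]
```

With these figures, the smallest stratum gets about 330 individuals instead of 100 under pure proportional allocation, at the cost of a smaller sample in the largest stratum.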

In conclusion, stratification is a well-established technique which serves both statistical and organisational purposes. That’s why it is so commonly used in survey practice, especially by National Statistical Institutes such as STATEC.