Chapter 6 Dealing with non-linear indicators: the linearisation technique
So far, the statistics we’ve been dealing with are linear statistics, that is, population totals, means or proportions. However, many indicators, particularly those used in the field of social statistics, are non-linear. For example, a mean or a proportion is non-linear when its denominator is unknown: it must then be regarded as a ratio between two linear indicators. Another example is the Gini coefficient, a reference indicator of income inequality, whose definition relies on rank statistics. Distributional aspects can also be measured through percentiles such as the median, quartiles, quintiles or deciles. All these indicators are complex, and calculating their variance requires specific techniques.
6.1 The linearisation technique
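Let us assume that an indicator $\theta$ can be expressed as a function of $p$ totals $Y_1, Y_2, \dots, Y_p$:

$$\theta = f(Y_1, Y_2, \dots, Y_p) \tag{6.1}$$

where $Y_i$ is the total of the variable $(y_{ik})$ over the population $U$: $Y_i = \sum_{k \in U} y_{ik}$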
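For example, an unemployment rate can be regarded as a ratio between the total number of unemployed persons in the labour force population, $Y = \sum_{i \in U} \mathbf{1}_i^{\mathrm{UNEMP}}$, and the total number of individuals in the labour force, $X = \sum_{i \in U} \mathbf{1}_i^{\mathrm{LF}}$:

$$R_{\mathrm{UNEMP}} = \frac{\sum_{i \in U} \mathbf{1}_i^{\mathrm{UNEMP}}}{\sum_{i \in U} \mathbf{1}_i^{\mathrm{LF}}} = \frac{Y}{X} = f(Y, X) \tag{6.2}$$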
A complex parameter such as (6.1) is traditionally estimated by substituting an estimator $\hat{Y}_i$ for each of the $p$ totals $Y_1, Y_2, \dots, Y_p$:

$$\hat{\theta} = f(\hat{Y}_1, \hat{Y}_2, \dots, \hat{Y}_p) \tag{6.3}$$
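Thus, the unemployment rate can be estimated by taking the ratio between the Horvitz-Thompson estimators for the numerator and the denominator:

$$\hat{R}_{\mathrm{UNEMP}} = f(\hat{Y}, \hat{X}) = \frac{\sum_{i \in s} \mathbf{1}_i^{\mathrm{UNEMP}} / \pi_i}{\sum_{i \in s} \mathbf{1}_i^{\mathrm{LF}} / \pi_i}$$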
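The substitution estimator above can be sketched in a few lines of Python. This is only an illustration: the function names `ht_total` and `unemployment_rate`, and the toy sample with its inclusion probabilities, are hypothetical and not taken from the text.

```python
import numpy as np

def ht_total(values, pi):
    """Horvitz-Thompson estimator of a total: sum of y_k / pi_k over the sample."""
    values = np.asarray(values, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return float(np.sum(values / pi))

def unemployment_rate(unemp, lf, pi):
    """Ratio of the two Horvitz-Thompson estimators (numerator / denominator)."""
    return ht_total(unemp, pi) / ht_total(lf, pi)

# Hypothetical sample of six individuals (made-up values for illustration).
unemp = [1, 0, 0, 1, 0, 0]             # 1 if unemployed
lf = [1, 1, 0, 1, 1, 1]                # 1 if in the labour force
pi = [0.1, 0.1, 0.2, 0.2, 0.5, 0.5]    # inclusion probabilities

rate = unemployment_rate(unemp, lf, pi)
```

Note that both the numerator and the denominator are random, which is why the estimator is non-linear even though each total is estimated linearly.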
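Assuming the function $f$ is “regular” (of class $C^1$, that is, differentiable with continuous partial derivatives), the linearisation technique consists of approximating the complex estimator (6.3) by a linear estimator through a first-order Taylor expansion:

$$
\begin{aligned}
\hat{\theta} = f(\hat{Y}_1, \hat{Y}_2, \dots, \hat{Y}_p)
&= f(Y_1, Y_2, \dots, Y_p) + \sum_{i=1}^{p} \frac{\partial f}{\partial v_i}(Y_1, Y_2, \dots, Y_p) \times (\hat{Y}_i - Y_i) + K_n \\
&= f(Y_1, Y_2, \dots, Y_p) + \sum_{i=1}^{p} d_i \times (\hat{Y}_i - Y_i) + K_n \\
&= C + \sum_{i=1}^{p} d_i \hat{Y}_i + K_n
\end{aligned}
$$

where $d_i = \frac{\partial f}{\partial v_i}(Y_1, Y_2, \dots, Y_p)$, $C = f(Y_1, Y_2, \dots, Y_p) - \sum_{i=1}^{p} d_i Y_i$ is a constant, and $K_n$ is a random variable satisfying $K_n = O_P\left(\frac{1}{n}\right)$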
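Finally, based on this first-order expansion, one can prove that the variance of the complex estimator $\hat{\theta}$ is equal to the variance of the linear part $\sum_{i=1}^{p} d_i \hat{Y}_i$ plus a remainder term of order $\frac{1}{n^{3/2}}$:

$$V(\hat{\theta}) = V\left(\sum_{i=1}^{p} d_i \hat{Y}_i\right) + O\left(\frac{1}{n^{3/2}}\right)$$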
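Thus, provided the sample size is “large” enough, the variance of $\hat{\theta}$ is asymptotically equal to that of its linear part:

$$V(\hat{\theta}) \approx V\left(\sum_{i=1}^{p} d_i \hat{Y}_i\right) = V(\hat{Z})$$

where $\hat{Z}$ is a (linear) estimator of the total $Z$ of the variable $z_k = \sum_{i=1}^{p} d_i y_{ik}$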
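As the partial derivatives $d_i = \frac{\partial f}{\partial v_i}(Y_1, Y_2, \dots, Y_p)$ are unknown, the variance of $\hat{\theta}$ is estimated by:

$$\hat{V}_L(\hat{\theta}) = \hat{V}\left(\sum_{i=1}^{p} \tilde{d}_i \hat{Y}_i\right) = \hat{V}(\hat{\tilde{Z}})$$

where $\tilde{d}_i = \frac{\partial f}{\partial v_i}(\hat{Y}_1, \hat{Y}_2, \dots, \hat{Y}_p)$ and $\tilde{z}_k = \sum_{i=1}^{p} \tilde{d}_i y_{ik}$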
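As a rough illustration of how the estimated linearised variable $\tilde{z}_k$ might be built in practice, the sketch below evaluates the partial derivatives at the estimated totals. Two things are assumptions of this sketch and not part of the text: the helper name `linearised_variable`, and the use of central finite differences to approximate the gradient (an analytic gradient could be used instead). The toy data reuse the unemployment-rate example with equal, made-up weights.

```python
import numpy as np

def linearised_variable(f, y, weights, rel_step=1e-6):
    """Return (z, totals, grad), where z_k = sum_i d_i * y_ik.

    f        : function mapping a length-p vector of totals to a scalar
    y        : (n, p) array; column i holds variable y_i for the sampled units
    weights  : (n,) array of design weights 1 / pi_k
    The d_i are approximated by central finite differences taken at the
    estimated totals (an assumption of this sketch).
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    totals = w @ y                       # estimated totals, one per column
    grad = np.empty_like(totals)
    for i in range(totals.size):
        step = rel_step * max(1.0, abs(totals[i]))
        up, down = totals.copy(), totals.copy()
        up[i] += step
        down[i] -= step
        grad[i] = (f(up) - f(down)) / (2.0 * step)
    return y @ grad, totals, grad

# Hypothetical sample: columns are (unemployed indicator, labour-force indicator).
data = np.array([[1, 1], [0, 1], [0, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
weights = np.full(6, 10.0)               # made-up equal weights

def ratio(t):
    return t[0] / t[1]

z, totals, grad = linearised_variable(ratio, data, weights)
```

Once `z` is available, the variance of the complex estimator is estimated by applying to `z` whichever variance estimator for a total suits the sampling design.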
6.2 Examples
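- Case of a ratio between two totals

$$\hat{R} = f(\hat{Y}, \hat{X}) = \frac{\hat{Y}}{\hat{X}}$$

The linearised variable is given by:

$$\tilde{z}_k = \frac{\partial f}{\partial y}(\hat{Y}, \hat{X})\, y_k + \frac{\partial f}{\partial x}(\hat{Y}, \hat{X})\, x_k = \frac{1}{\hat{X}} y_k - \frac{\hat{Y}}{\hat{X}^2} x_k = \frac{1}{\hat{X}}\left(y_k - \hat{R} x_k\right)$$

Then, assuming simple random sampling, the estimator of the variance of $\hat{R}$ is given by:

$$\hat{V}_L(\hat{R}) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_z^2}{n}$$

where $s_z^2$ is the sample variance of the linearised variable $\tilde{z}_k$.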
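The ratio case can be sketched numerically as follows, assuming simple random sampling without replacement of $n$ units out of $N$. The function name `ratio_variance_srs`, the sample values and the population size are hypothetical and serve only as an illustration.

```python
import numpy as np

def ratio_variance_srs(y, x, N):
    """Linearisation variance estimate of R_hat = Y_hat / X_hat under SRS.

    y, x : sample values of the numerator and denominator variables
    N    : population size (assumed known)
    Returns (R_hat, estimated variance of R_hat).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = y.size
    Y_hat = N / n * y.sum()              # Horvitz-Thompson totals under SRS
    X_hat = N / n * x.sum()
    R_hat = Y_hat / X_hat
    z = (y - R_hat * x) / X_hat          # estimated linearised variable
    s2_z = z.var(ddof=1)                 # sample variance of z
    var_hat = N**2 * (1.0 - n / N) * s2_z / n
    return R_hat, var_hat

# Hypothetical sample of n = 6 units from a population of N = 60.
y = [1, 0, 0, 1, 0, 0]
x = [1, 1, 0, 1, 1, 1]
R_hat, var_hat = ratio_variance_srs(y, x, N=60)
std_error = var_hat ** 0.5
```

With such a small sample the large-sample approximation is of course not reliable; the toy figures only show the mechanics of the calculation.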
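- Case of the dispersion of a variable

$$S^2 = \frac{1}{N} \sum_{i \in U} (y_i - \bar{Y})^2 = \frac{1}{N} \sum_{i \in U} y_i^2 - \frac{\left(\sum_{i \in U} y_i\right)^2}{N^2} = f\left(N, \sum_{i \in U} y_i, \sum_{i \in U} y_i^2\right)$$

where $f(x, y, z) = \frac{z}{x} - \frac{y^2}{x^2}$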
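Thus, we have

$$d_1 = \frac{\partial f}{\partial x}\left(N, \sum_{i \in U} y_i, \sum_{i \in U} y_i^2\right) = -\frac{\sum_{i \in U} y_i^2}{N^2} + \frac{2\left(\sum_{i \in U} y_i\right)^2}{N^3} = -\frac{1}{N}\left(\frac{\sum_{i \in U} y_i^2}{N} - \frac{2\left(\sum_{i \in U} y_i\right)^2}{N^2}\right) = -\frac{1}{N}\left(S^2 - \bar{Y}^2\right)$$

$$d_2 = \frac{\partial f}{\partial y}\left(N, \sum_{i \in U} y_i, \sum_{i \in U} y_i^2\right) = -\frac{2 \sum_{i \in U} y_i}{N^2} = -\frac{2\bar{Y}}{N}$$

$$d_3 = \frac{\partial f}{\partial z}\left(N, \sum_{i \in U} y_i, \sum_{i \in U} y_i^2\right) = \frac{1}{N}$$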
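Therefore the linearised variable for the dispersion $S^2$ of $y$ is:

$$z_k = d_1 + d_2 y_k + d_3 y_k^2 = -\frac{1}{N}\left(S^2 - \bar{Y}^2 + 2\bar{Y} y_k - y_k^2\right) = \frac{1}{N}\left[(y_k - \bar{Y})^2 - S^2\right]$$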
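The algebra above can be checked numerically. The following sketch uses a small made-up population (the values are hypothetical) and compares the linearised variable built from the three partial derivatives with its closed form.

```python
import numpy as np

# Hypothetical population of N = 6 values (illustration only).
y = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 8.0])
N = y.size
y_bar = y.mean()
S2 = np.mean((y - y_bar) ** 2)           # dispersion with divisor N, as in the text

# Partial derivatives of f(x, y, z) = z/x - y^2/x^2 at (N, sum y, sum y^2).
d1 = -(S2 - y_bar**2) / N
d2 = -2.0 * y_bar / N
d3 = 1.0 / N

# The three totals are built from the variables 1, y_k and y_k^2.
z_from_derivatives = d1 * 1.0 + d2 * y + d3 * y**2
z_closed_form = ((y - y_bar) ** 2 - S2) / N
```

Both arrays coincide, and the linearised variable sums to zero over the population, as expected from the definition of $S^2$.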