Chapter 9 Variance estimation for non-linear indicators: the linearisation technique

The statistics considered so far are linear statistics: population totals, means or proportions. However, many indicators, especially those used in social statistics, are non-linear. For example, a mean or proportion is non-linear when its denominator is unknown and must be estimated, so that the indicator is a ratio of two linear statistics. Another reference indicator of income inequality is the Gini coefficient, which is defined using rank statistics. Distributional aspects can also be measured by calculating percentiles such as medians, quartiles, quintiles or deciles. All these indicators are complex, and estimating their variance requires specific techniques.

9.1 Seminal approach

Consider an indicator \theta expressed as a function of the p totals Y_1, Y_2 \cdots Y_p:

\begin{equation} \theta = f\left(Y_1,Y_2 \cdots Y_p\right) \tag{9.1} \end{equation}

where Y_i is the total of the variable \left(y_{ik}\right) over U: Y_i = \sum_{k \in U} y_{ik}

For example, the unemployment rate can be regarded as a ratio between the total number of unemployed persons Y = \sum_{i \in U} 1_i^{UNEMP} and the total number of individuals in the labour force X = \sum_{i \in U} 1_i^{LF}:

\begin{equation} R^{UNEMP} = \displaystyle{\frac{\sum_{i \in U} 1_i^{UNEMP}}{\sum_{i \in U} 1_i^{LF}}} = \displaystyle{\frac{Y}{X}} = f\left(Y,X\right) \tag{9.2} \end{equation}

A complex parameter such as (9.1) is traditionally estimated by substituting an estimator \hat{Y}_i for each of the p totals Y_1, Y_2 \cdots Y_p:

\begin{equation} \hat{\theta} = f\left(\hat{Y}_1,\hat{Y}_2 \cdots \hat{Y}_p\right) \tag{9.3} \end{equation}

Thus, the unemployment rate can be estimated by taking the ratio between the weighted estimators of the numerator and the denominator, respectively:

\begin{equation} \hat{R}^{UNEMP} = f\left(\hat{Y},\hat{X}\right) = \displaystyle{\frac{\sum_{i \in s} {\omega}_i 1_i^{UNEMP}}{\sum_{i \in s} {\omega}_i 1_i^{LF}}} \tag{9.4} \end{equation}
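As a toy illustration, the substitution estimator (9.4) is simply a ratio of two weighted sums. A minimal sketch in Python/NumPy, with entirely hypothetical weights and indicator values:

```python
import numpy as np

# Hypothetical microdata: design weights and 0/1 indicators for
# "unemployed" and "in the labour force" (all values illustrative).
w     = np.array([120., 80., 150., 95., 110.])  # weights omega_i
unemp = np.array([1, 0, 0, 1, 0])               # 1_i^UNEMP
lf    = np.array([1, 1, 1, 1, 0])               # 1_i^LF

# Substitution estimator (9.4): ratio of two weighted totals.
r_hat = np.sum(w * unemp) / np.sum(w * lf)
print(round(r_hat, 4))  # 215 / 445
```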

Assuming the function f is “regular” (of class C^1, i.e. differentiable with a continuous derivative), the linearisation technique consists of approximating the complex estimator (9.3) by a linear estimator through a first-order Taylor expansion:

\begin{equation} \begin{array}{rcl} \hat{\theta} = f\left(\hat{Y}_1,\hat{Y}_2 \cdots \hat{Y}_p\right) & = & f\left(Y_1,Y_2 \cdots Y_p\right) + \displaystyle{\sum_{i=1}^p \frac{\partial f}{\partial v_i}\left(Y_1,Y_2 \cdots Y_p\right) \times \left(\hat{Y}_i - Y_i\right)} + K_n \\ & = & f\left(Y_1,Y_2 \cdots Y_p\right) + \displaystyle{\sum_{i=1}^p d_i \times \left(\hat{Y}_i - Y_i\right)} + K_n \\ & = & C + \displaystyle{\sum_{i=1}^p d_i \hat{Y}_i} + K_n \end{array} \tag{9.5} \end{equation}

where K_n is a random variable satisfying K_n = O_P\left(\displaystyle{\frac{1}{n}}\right), that is: for any \epsilon > 0, there exists M > 0 such that, with probability \geq 1 - \epsilon, \left|K_n\right| \leq \displaystyle{\frac{M}{n}} for n large enough.

Finally, based on this first-order expansion, it can be proved that the variance of the complex estimator \hat{\theta} is equal to the variance of the linear part \sum_{i=1}^p d_i \hat{Y}_i plus a remainder term of order \displaystyle{\frac{1}{n^{3/2}}}:

\begin{equation} V\left(\hat{\theta}\right) = V\left(\sum_{i=1}^p d_i \hat{Y}_i\right) + \displaystyle{O\left(\frac{1}{n^{3/2}}\right)} \tag{9.6} \end{equation}

Thus, provided the sample size is “large” enough, the variance of \hat{\theta} is asymptotically equal to that of its linear part:

\begin{equation} V\left(\hat{\theta}\right) \approx V\left(\sum_{i=1}^p d_i \hat{Y}_i\right) = V\left(\hat{Z}\right) \tag{9.7} \end{equation}

where \hat{Z} is a (linear) estimator of the total Z of z_k = \sum_{i=1}^p d_i {y}_{ik}

As the partial derivatives d_i = \displaystyle{\frac{\partial f}{\partial v_i} \left({Y}_1,{Y}_2 \cdots {Y}_p\right)} are unknown, the variance of \hat{\theta} is estimated by:

\begin{equation} \hat{V}_L\left(\hat{\theta}\right) = \hat{V}\left(\sum_{i=1}^p \tilde{d}_i \hat{Y}_i\right) = \hat{V}\left(\hat{\tilde{Z}}\right) \tag{9.8} \end{equation}

where \tilde{d}_i = \displaystyle{\frac{\partial f}{\partial v_i} \left(\hat{Y}_1,\hat{Y}_2 \cdots \hat{Y}_p\right)} and \tilde{z}_k = \sum_{i=1}^p \tilde{d}_i {y}_{ik}
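The whole recipe (9.8) can be sketched numerically: estimate the totals, obtain the \tilde{d}_i by differentiating f at the estimated totals (finite differences are used here, so any smooth f can be plugged in), build \tilde{z}_k, and apply a variance formula for a total. A minimal sketch assuming simple random sampling without replacement, on simulated data; the function name `linearised_variance` is mine:

```python
import numpy as np

# Sketch of the plug-in linearisation recipe (9.8), assuming simple random
# sampling without replacement from a population of size N. The partial
# derivatives d_i are approximated by central finite differences.
def linearised_variance(f, y_mat, w, N):
    """y_mat: n x p matrix of the y_ik; w: weights; f: R^p -> R."""
    Y_hat = y_mat.T @ w                        # plug-in totals (Y_1..Y_p)
    eps = 1e-6 * np.maximum(np.abs(Y_hat), 1.0)
    d = np.empty(len(Y_hat))
    for i in range(len(Y_hat)):                # d_i = df/dv_i at (Y_1..Y_p)
        e = np.zeros_like(Y_hat); e[i] = eps[i]
        d[i] = (f(Y_hat + e) - f(Y_hat - e)) / (2 * eps[i])
    z = y_mat @ d                              # z_k = sum_i d_i y_ik
    n = len(w)
    s2 = np.var(z, ddof=1)                     # sample variance of z
    return N**2 * (1 - n / N) * s2 / n         # SRS variance of the total

# Example: ratio Y_hat / X_hat (same f as the unemployment rate).
rng = np.random.default_rng(0)
N, n = 10_000, 200
y = rng.gamma(2.0, 1.0, n); x = rng.gamma(3.0, 1.0, n)
w = np.full(n, N / n)                          # SRS weights N/n
v = linearised_variance(lambda t: t[0] / t[1], np.column_stack([y, x]), w, N)
print(v > 0)
```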

In the case of domain analysis, the linearisation technique also applies. First, we linearise the estimator over the domain, treating the domain itself as a population. Then we set the linearised variable to 0 for the observations outside the domain.

9.2 Alternative approaches

The seminal linearisation approach outlined in the previous section applies to non-linear but regular indicators, in the sense that the function f of the p totals Y_1, Y_2 \cdots Y_p must be continuously differentiable. However, many complex indicators do not meet this requirement, such as the quantiles of a distribution (e.g. median income or quantile ratios) or concentration measures such as the Gini coefficient, which relies on rank statistics. Alternative linearisation frameworks have been developed to deal with these types of indicators.

  • Kovačević and Binder (1997) developed a framework for dealing with estimators \hat{\theta} that are expressed as the solution of an estimation equation \sum_{i \in s} \omega_i \times u\left(y_i,\hat{\theta}\right) = 0. In this case, the linearised variable is given by:

\begin{equation*} \tilde{z}_k = \displaystyle{-\left(\sum_{i \in s} {\omega}_i \left.\frac{\partial u\left(y_i , \theta \right)}{\partial \theta}\right|_{\theta = \hat{\theta}}\right)^{-1} \times u\left(y_k,\hat{\theta}\right)} \end{equation*}

Such a framework is suitable for dealing with indicators such as regression coefficients (e.g. in linear or logistic regression models), which are expressed as the solution of systems of estimating equations.
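As an illustrative check of this framework, take the ratio once more: with the score u(y_i, \theta) = y_i - \theta x_i, the solution of \sum_{i \in s} \omega_i u(y_i, \hat{\theta}) = 0 is \hat{\theta} = \hat{Y}/\hat{X}, and the formula above reduces to the ratio linearised variable \left(y_k - \hat{\theta} x_k\right)/\hat{X}. A small sketch with simulated data:

```python
import numpy as np

# Estimating-equation linearisation checked on the ratio: the score is
# u(y_i, theta) = y_i - theta * x_i, whose weighted root is Y_hat / X_hat.
rng = np.random.default_rng(1)
n = 50
y, x = rng.gamma(2.0, 1.0, n), rng.gamma(3.0, 1.0, n)
w = np.full(n, 20.0)                           # illustrative uniform weights

theta_hat = np.sum(w * y) / np.sum(w * x)      # solves the estimating equation
du_dtheta = -x                                 # du/dtheta = -x_i
z_binder = -(np.sum(w * du_dtheta)) ** -1 * (y - theta_hat * x)
z_ratio = (y - theta_hat * x) / np.sum(w * x)  # direct ratio formula
print(np.allclose(z_binder, z_ratio))
```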

  • The framework proposed by Demnati and Rao (2004) is based on the concept of “influence function” that is used in robust statistics. In practice, the linearised variable is defined by the partial derivative function with respect to the weight variable:

\begin{equation*} \tilde{z}_k = \displaystyle{\frac{\partial f\left({\omega}_k , k \in s\right)}{\partial {\omega}_k}} \end{equation*}
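In practice, this derivative can be approximated by perturbing each weight in turn. A numerical sketch on simulated data, checked against the analytic linearised variable of the ratio, \left(y_k - \hat{R} x_k\right)/\hat{X}:

```python
import numpy as np

# Numerical Demnati-Rao sketch: differentiate the estimator with respect to
# each weight omega_k by central finite differences, and compare with the
# analytic linearised variable of the ratio.
rng = np.random.default_rng(2)
n = 30
y, x = rng.gamma(2.0, 1.0, n), rng.gamma(3.0, 1.0, n)
w = np.full(n, 50.0)                 # illustrative uniform weights

def f(weights):                      # the estimator as a function of weights
    return np.sum(weights * y) / np.sum(weights * x)

eps = 1e-5
z_dr = np.array([
    (f(w + eps * np.eye(n)[k]) - f(w - eps * np.eye(n)[k])) / (2 * eps)
    for k in range(n)
])
R_hat = f(w)
z_analytic = (y - R_hat * x) / np.sum(w * x)
print(np.allclose(z_dr, z_analytic, atol=1e-8))
```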

9.3 Examples

  • Case of a ratio between totals

\begin{equation} \hat{R} = f\left(\hat{Y},\hat{X}\right) = \displaystyle{\frac{\hat{Y}}{\hat{X}}} \tag{9.9} \end{equation}

The linearised variable is given by: \begin{equation} \begin{array}{rcl} \tilde{z}_k & = & \displaystyle{\frac{\partial f}{\partial y} \left(\hat{Y},\hat{X}\right)} y_k + \displaystyle{\frac{\partial f}{\partial x} \left(\hat{Y},\hat{X}\right)} x_k \\ & = & \displaystyle{\frac{1}{\hat{X}} y_k -\frac{\hat{Y}}{\hat{X}^2} x_k} \\ & = & \displaystyle{\frac{1}{\hat{X}} \left(y_k - \hat{R} x_k \right)} \end{array} \tag{9.10} \end{equation}

Then, assuming simple random sampling, the estimator of the variance of \hat{R} is given by: \begin{equation} \hat{V}_{L}\left(\hat{R}\right) = N^2 \left(1-\displaystyle{\frac{n}{N}}\right)\frac{s^2_z}{n} \tag{9.11} \end{equation}

In the case of a ratio over a subpopulation U_D \subseteq U, we obtain: \begin{equation} \hat{R}_D = \displaystyle{\frac{\sum_{i \in s \cap U_D} \omega_i y_i}{\sum_{i \in s \cap U_D} \omega_i x_i} = \frac{\sum_{i \in s} \omega_i y_i 1_i^D}{\sum_{i \in s} \omega_i x_i 1_i^D} = \frac{\hat{Y}_D}{\hat{X}_D}} \tag{9.12} \end{equation}

and \tilde{z}_k = \displaystyle{\frac{1^D_k}{\hat{X}_D} \left(y_k - \hat{R}_D x_k \right)}, where 1^D is the dummy membership indicator variable of U_D \subseteq U: 1^D_k = 1 if k \in U_D, 0 otherwise.
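The two cases above can be sketched as follows, assuming simple random sampling with weights N/n; the data and domain indicator are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 5_000, 100
y, x = rng.gamma(2.0, 1.0, n), rng.gamma(3.0, 1.0, n)
w = np.full(n, N / n)                        # SRS weights N/n

# Full-population ratio: linearised variable (9.10), SRS variance (9.11).
X_hat = np.sum(w * x)
R_hat = np.sum(w * y) / X_hat
z = (y - R_hat * x) / X_hat
var_R = N**2 * (1 - n / N) * np.var(z, ddof=1) / n

# Domain version (9.12): multiply by the membership indicator so that
# out-of-domain units contribute 0 to the linearised variable.
d = (rng.random(n) < 0.4).astype(float)      # hypothetical domain indicator
X_D = np.sum(w * x * d)
R_D = np.sum(w * y * d) / X_D
z_D = d * (y - R_D * x) / X_D
var_RD = N**2 * (1 - n / N) * np.var(z_D, ddof=1) / n
print(var_R >= 0 and np.all(z_D[d == 0] == 0))
```

Note that the weighted total of the linearised variable vanishes by construction, \sum_{i \in s} \omega_i \tilde{z}_i = 0, which is a convenient check on an implementation.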

  • Case of a population mean

This is a particular case of a ratio where x_k=1:

\begin{equation} \hat{\bar{Y}} = f\left(\hat{Y},\hat{N}\right) = \displaystyle{\frac{\hat{Y}}{\hat{N}}} \tag{9.13} \end{equation}

The linearised variable is given by: \begin{equation} \tilde{z}_k = \displaystyle{\frac{1}{\hat{N}} \left(y_k - \hat{\bar{Y}} \right)} \tag{9.14} \end{equation}
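Under simple random sampling with weights N/n, the resulting linearised variance of this (Hájek-type) mean reduces to the familiar \left(1-\frac{n}{N}\right)\frac{s^2_y}{n}, which the following sketch verifies on simulated data:

```python
import numpy as np

# Mean as a ratio with x_k = 1: linearised variable (9.14), then the SRS
# variance formula applied to the total of z. With weights N/n this must
# coincide with (1 - n/N) * s2_y / n.
rng = np.random.default_rng(5)
N, n = 2_000, 80
y = rng.gamma(2.0, 1.0, n)
w = np.full(n, N / n)

N_hat = np.sum(w)
ybar_hat = np.sum(w * y) / N_hat
z = (y - ybar_hat) / N_hat                   # linearised variable (9.14)
var_mean = N**2 * (1 - n / N) * np.var(z, ddof=1) / n
print(abs(var_mean - (1 - n / N) * np.var(y, ddof=1) / n) < 1e-12)
```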

  • Case of the dispersion

\begin{equation} \begin{array}{rcl} S^2 = \displaystyle{\frac{1}{N}}\sum_{i \in U} \left(y_i - \bar{Y}\right)^2 & = & \displaystyle{\frac{1}{N}}\sum_{i \in U} y^2_i - \displaystyle{\frac{\left(\sum_{i \in U} y_i\right)^2}{N^2}} \\ & = & f\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right) \end{array} \tag{9.15} \end{equation}

where f\left(x,y,z\right) = \displaystyle{\frac{z}{x} -\frac{y^2}{x^2}}

Thus, we have

\begin{equation} \begin{array}{rcl} {d}_1 = \displaystyle{\frac{\partial{f}}{\partial{x}}\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right)} & = & \displaystyle{-\frac{\sum_{i \in U} y^2_i}{N^2} + 2\frac{\left(\sum_{i \in U} y_i\right)^2}{N^3}} \\ & = & \displaystyle{-\frac{1}{N}\left(\frac{\sum_{i \in U} y^2_i}{N} - 2\frac{\left(\sum_{i \in U} y_i\right)^2}{N^2}\right)} \\ & = & \displaystyle{-\frac{1}{N}\left(S^2-\bar{Y}^2\right)} \end{array} \end{equation}

\begin{equation} {d}_2 = \displaystyle{\frac{\partial{f}}{\partial{y}}\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right) = -2\frac{\sum_{i \in U} y_i}{N^2}= -2\frac{\bar{Y}}{N}} \end{equation}

\begin{equation} {d}_3 = \displaystyle{\frac{\partial{f}}{\partial{z}}\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right) = \frac{1}{N}} \end{equation}

Therefore the (exact) linearised variable for the dispersion S^2 of y is given by: \begin{equation} \begin{array}{rcl} z_k & = & d_1 + d_2 y_k + d_3 y^2_k \\ & = & \displaystyle{-\frac{1}{N}\left(S^2-\bar{Y}^2 + 2\bar{Y}y_k - y^2_k\right)} \\ & = & \displaystyle{\frac{1}{N}\left[\left(y_k-\bar{Y}\right)^2 - S^2\right]} \end{array} \tag{9.16} \end{equation}

Hence the empirical linearised variable is: \begin{equation} \displaystyle{\tilde{z}_k = \frac{1}{\hat{N}}\left[\left(y_k-\hat{\bar{Y}}\right)^2 - \hat{S}^2\right]} \tag{9.17} \end{equation}

Using the approach of Demnati and Rao (2004) as an alternative, we get the same result: \begin{equation} \displaystyle{\tilde{z}_k = \frac{\hat{N}y^{2}_k-\sum_{i \in s} \omega_i y^{2}_i }{\hat{N}^2} - 2 \left(\frac{\sum_{i \in s} \omega_i y_i}{\hat{N}}\right)\left(\frac{y_k\hat{N}-\sum_{i \in s} \omega_i y_i}{\hat{N}^2}\right)} \tag{9.18} \end{equation}

Hence, we obtain: \begin{equation} \displaystyle{\tilde{z}_k = \frac{1}{\hat{N}}\left[y^{2}_k- \frac{\sum_{i \in s}\omega_i y^{2}_i}{\hat{N}} - 2 \left(\frac{\sum_{i \in s}\omega_i y_i}{\hat{N}}\right)\left(y_k - \frac{\sum_{i \in s}\omega_i y_i}{\hat{N}}\right)\right]} \tag{9.19} \end{equation}

\begin{equation} \displaystyle{= \frac{1}{\hat{N}}\left[y^{2}_k- \hat{S}^2 - \hat{\bar{Y}}^2 - 2\hat{\bar{Y}}\left(y_k - \hat{\bar{Y}}\right)\right] = \frac{1}{\hat{N}}\left(y^{2}_k- \hat{S}^2 + \hat{\bar{Y}}^2 - 2 \hat{\bar{Y}} y_k\right)} \tag{9.20} \end{equation}

\begin{equation} \displaystyle{= \frac{1}{\hat{N}}\left[\left(y_k-\hat{\bar{Y}}\right)^2 - \hat{S}^2\right]} \tag{9.21} \end{equation}
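Both routes lead to the same linearised variable, which can be double-checked numerically by differentiating \hat{S}^2 with respect to each weight (the Demnati-Rao derivative) and comparing with (9.17). A sketch on simulated data:

```python
import numpy as np

# Check of (9.17): linearised variable of the dispersion, compared with a
# finite-difference Demnati-Rao derivative of S2_hat w.r.t. each weight.
rng = np.random.default_rng(6)
n = 40
y = rng.gamma(2.0, 1.0, n)
w = np.full(n, 25.0)                 # illustrative uniform weights

def S2_hat(weights):
    m = np.sum(weights * y) / np.sum(weights)
    return np.sum(weights * (y - m)**2) / np.sum(weights)

N_hat = np.sum(w)
ybar = np.sum(w * y) / N_hat
z_formula = ((y - ybar)**2 - S2_hat(w)) / N_hat      # formula (9.17)

eps = 1e-5
z_fd = np.array([
    (S2_hat(w + eps * np.eye(n)[k]) - S2_hat(w - eps * np.eye(n)[k]))
    / (2 * eps)
    for k in range(n)
])
print(np.allclose(z_formula, z_fd, atol=1e-8))
```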

  • Case of the Theil index

The Theil index is a commonly used measure of inequality of a distribution, which belongs to the class of entropy measures:

\begin{equation} \begin{array}{rcl} T = \displaystyle{\frac{1}{N}}\sum_{i \in U} \frac{y_i}{\bar{Y}} Log\left(\frac{y_i}{\bar{Y}}\right) & = & \displaystyle{\frac{1}{N\bar{Y}}}\sum_{i \in U} \left[y_i Log\left(y_i\right) - y_i Log\left(\bar{Y}\right)\right] \\ & = & \displaystyle{\frac{1}{Y}}\sum_{i \in U} \left[y_i Log\left(y_i\right) - y_i Log\left(\sum_{i \in U} y_i\right) + y_i Log\left(N\right) \right] \\ & = & \displaystyle{\frac{\sum_{i \in U} y_i Log\left(y_i\right)}{\sum_{i \in U} y_i}- Log\left(\sum_{i \in U} y_i\right) + Log\left(N\right)} \end{array} \tag{9.22} \end{equation}

It is estimated by:

\begin{equation} \begin{array}{rcl} \hat{T} & = & \displaystyle{\frac{\sum_{i \in s} {\omega}_i y_i Log\left(y_i\right)}{\sum_{i \in s} {\omega}_i y_i}- Log\left(\sum_{i \in s} {\omega}_i y_i\right) + Log\left(\sum_{i \in s} {\omega}_i\right)} \end{array} \tag{9.23} \end{equation}

Using the Demnati-Rao framework, the linearised variable is given by: \begin{equation} \begin{array}{rcl} \tilde{z}_k & = & \displaystyle{\frac{y_k Log\left(y_k\right)\left(\sum_{i \in s} {\omega}_i y_i\right) - y_k\left[\sum_{i \in s} {\omega}_i y_i Log\left(y_i\right)\right]} {\left(\sum_{i \in s} {\omega}_i y_i\right)^2} - \frac{y_k}{\sum_{i \in s} {\omega}_i y_i} + \frac{1}{\sum_{i \in s} {\omega}_i}} \\ & = & \displaystyle{\frac{y_k}{\sum_{i \in s} {\omega}_i y_i}\left[Log\left(y_k\right) - \frac{\sum_{i \in s} {\omega}_i y_i Log\left(y_i\right)}{\sum_{i \in s} {\omega}_i y_i} - 1\right] + \frac{1}{\sum_{i \in s} {\omega}_i}} \\ & = & \displaystyle{\frac{y_k}{\sum_{i \in s} {\omega}_i y_i}\left[Log\left(y_k\right) - \hat{T} - Log\left(\sum_{i \in s} {\omega}_i y_i\right) + Log\left(\sum_{i \in s} {\omega}_i\right) - 1\right] + \frac{1}{\sum_{i \in s} {\omega}_i}} \end{array} \tag{9.24} \end{equation}

Regarding the Theil index T_D for a subpopulation U_D \subseteq U, the linearised variable is obtained by introducing the dummy membership indicator variable 1^D: \begin{equation} \tilde{z}_k = \displaystyle{ \frac{y_k 1_k^D}{\sum_{i \in s} {\omega}_i y_i 1_i^D} \left[ Log\left(y_k\right) - \hat{T}_D - Log\left(\sum_{i \in s} {\omega}_i y_i 1_i^D\right) + Log\left(\sum_{i \in s} {\omega}_i 1_i^D\right) - 1\right] + \frac{1_k^D}{\sum_{i \in s} {\omega}_i 1_i^D}} \tag{9.25} \end{equation}
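As a numerical sanity check for the Theil index, the Demnati-Rao linearised variable can be obtained by finite differences on the weights and compared with the closed form found by differentiating the estimator (9.23) directly. A sketch with illustrative lognormal incomes and uniform weights N/n:

```python
import numpy as np

# Demnati-Rao linearisation of the Theil index: finite-difference derivative
# of T_hat w.r.t. each weight, against the closed form from differentiating
# (9.23): z_k = (y_k/B)[log y_k - T - log B + log N_hat - 1] + 1/N_hat,
# with B = sum of w*y and N_hat = sum of w.
rng = np.random.default_rng(7)
N, n = 5_000, 100
y = rng.lognormal(0.0, 0.7, n)       # incomes, strictly positive
w = np.full(n, N / n)

def theil(weights):
    B = np.sum(weights * y)
    return (np.sum(weights * y * np.log(y)) / B
            - np.log(B) + np.log(np.sum(weights)))

eps = 1e-4
z_fd = np.array([
    (theil(w + eps * np.eye(n)[k]) - theil(w - eps * np.eye(n)[k]))
    / (2 * eps)
    for k in range(n)
])
B, Nh, T = np.sum(w * y), np.sum(w), theil(w)
z_closed = (y / B) * (np.log(y) - T - np.log(B) + np.log(Nh) - 1) + 1 / Nh
var_T = N**2 * (1 - n / N) * np.var(z_closed, ddof=1) / n
print(np.allclose(z_fd, z_closed, atol=1e-8))
```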