Chapter 9 Variance estimation for non-linear indicators: the linearisation technique

The statistics considered so far are linear statistics: population totals, means or proportions. However, many indicators, especially those used in social statistics, are non-linear. For example, a mean or proportion is non-linear when its denominator is unknown and must be estimated, so that the indicator is a ratio of two linear statistics. Another reference indicator of income inequality is the Gini coefficient, which is defined using rank statistics. Distributional aspects can also be measured by calculating percentiles such as medians, quartiles, quintiles or deciles. All these indicators are complex, and estimating their variance requires specific techniques.

9.1 Seminal approach

Consider an indicator \theta expressed as a function of the p totals Y_1, Y_2 \cdots Y_p:

\begin{equation} \theta = f\left(Y_1,Y_2 \cdots Y_p\right) \tag{9.1} \end{equation}

where Y_i is the total of the variable \left(y_{ik}\right) over U: Y_i = \sum_{k \in U} y_{ik}

For example, the unemployment rate can be regarded as a ratio between the total number of unemployed persons Y = \sum_{i \in U} 1_i^{UNEMP} and the total number of individuals in the labour force X = \sum_{i \in U} 1_i^{LF}:

\begin{equation} R^{UNEMP} = \displaystyle{\frac{\sum_{i \in U} 1_i^{UNEMP}}{\sum_{i \in U} 1_i^{LF}}} = \displaystyle{\frac{Y}{X}} = f\left(Y,X\right) \tag{9.2} \end{equation}

A complex parameter such as (9.1) is traditionally estimated by substituting an estimator \hat{Y}_i for each of the p totals Y_1, Y_2 \cdots Y_p:

\begin{equation} \hat{\theta} = f\left(\hat{Y}_1,\hat{Y}_2 \cdots \hat{Y}_p\right) \tag{9.3} \end{equation}

Thus, the unemployment rate can be estimated by taking the ratio between the weighted estimators of the numerator and the denominator, respectively:

\begin{equation} \hat{R}^{UNEMP} = f\left(\hat{Y},\hat{X}\right) = \displaystyle{\frac{\sum_{i \in s} {\omega}_i 1_i^{UNEMP}}{\sum_{i \in s} {\omega}_i 1_i^{LF}}} \tag{9.4} \end{equation}
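As a toy illustration, the substitution estimator (9.4) is simply a ratio of two weighted sums. A minimal sketch in Python/NumPy, with entirely hypothetical weights and indicator values:

```python
import numpy as np

# Hypothetical microdata: design weights and 0/1 indicators for
# "unemployed" and "in the labour force" (all values illustrative).
w     = np.array([120., 80., 150., 95., 110.])  # weights omega_i
unemp = np.array([1, 0, 0, 1, 0])               # 1_i^UNEMP
lf    = np.array([1, 1, 1, 1, 0])               # 1_i^LF

# Substitution estimator (9.4): ratio of two weighted totals.
r_hat = np.sum(w * unemp) / np.sum(w * lf)
print(round(r_hat, 4))  # 215 / 445
```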

Assuming the function f is “regular” (of class C^1, i.e. differentiable with a continuous derivative), the linearisation technique consists of approximating the complex estimator (9.3) by a linear estimator through a first-order Taylor expansion:

\begin{equation} \begin{array}{rcl} \hat{\theta} = f\left(\hat{Y}_1,\hat{Y}_2 \cdots \hat{Y}_p\right) & = & f\left(Y_1,Y_2 \cdots Y_p\right) + \displaystyle{\sum_{i=1}^p \frac{\partial f}{\partial v_i}\left(Y_1,Y_2 \cdots Y_p\right) \times \left(\hat{Y}_i - Y_i\right)} + K_n \\ & = & f\left(Y_1,Y_2 \cdots Y_p\right) + \displaystyle{\sum_{i=1}^p d_i \times \left(\hat{Y}_i - Y_i\right)} + K_n \\ & = & C + \displaystyle{\sum_{i=1}^p d_i \hat{Y}_i} + K_n \end{array} \tag{9.5} \end{equation}

where K_n is a random variable satisfying K_n = O_P\left(\displaystyle{\frac{1}{n}}\right), that is: for any \epsilon > 0, there exists M > 0 such that, with probability \geq 1 - \epsilon, \left|K_n\right| \leq \displaystyle{\frac{M}{n}} for n large enough.

Finally, based on this first-order expansion, it can be proved that the variance of the complex estimator \hat{\theta} is equal to the variance of the linear part \sum_{i=1}^p d_i \hat{Y}_i plus a remainder term of order \displaystyle{\frac{1}{n^{3/2}}}:

\begin{equation} V\left(\hat{\theta}\right) = V\left(\sum_{i=1}^p d_i \hat{Y}_i\right) + \displaystyle{O\left(\frac{1}{n^{3/2}}\right)} \tag{9.6} \end{equation}

Thus, provided the sample size is “large” enough, the variance of \hat{\theta} is asymptotically equal to that of its linear part:

\begin{equation} V\left(\hat{\theta}\right) \approx V\left(\sum_{i=1}^p d_i \hat{Y}_i\right) = V\left(\hat{Z}\right) \tag{9.7} \end{equation}

where \hat{Z} is a (linear) estimator of the total Z of z_k = \sum_{i=1}^p d_i {y}_{ik}

As the partial derivatives d_i = \displaystyle{\frac{\partial f}{\partial v_i} \left({Y}_1,{Y}_2 \cdots {Y}_p\right)} are unknown, the variance of \hat{\theta} is estimated by:

\begin{equation} \hat{V}_L\left(\hat{\theta}\right) = \hat{V}\left(\sum_{i=1}^p \tilde{d}_i \hat{Y}_i\right) = \hat{V}\left(\hat{\tilde{Z}}\right) \tag{9.8} \end{equation}

where \tilde{d}_i = \displaystyle{\frac{\partial f}{\partial v_i} \left(\hat{Y}_1,\hat{Y}_2 \cdots \hat{Y}_p\right)} and \tilde{z}_k = \sum_{i=1}^p \tilde{d}_i {y}_{ik}
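The whole recipe (9.8) can be sketched numerically: estimate the totals, obtain the \tilde{d}_i by differentiating f at the estimated totals (finite differences are used here, so any smooth f can be plugged in), build \tilde{z}_k, and apply a variance formula for a total. A minimal sketch assuming simple random sampling without replacement, on simulated data; the function name `linearised_variance` is mine:

```python
import numpy as np

# Sketch of the plug-in linearisation recipe (9.8), assuming simple random
# sampling without replacement from a population of size N. The partial
# derivatives d_i are approximated by central finite differences.
def linearised_variance(f, y_mat, w, N):
    """y_mat: n x p matrix of the y_ik; w: weights; f: R^p -> R."""
    Y_hat = y_mat.T @ w                        # plug-in totals (Y_1..Y_p)
    eps = 1e-6 * np.maximum(np.abs(Y_hat), 1.0)
    d = np.empty(len(Y_hat))
    for i in range(len(Y_hat)):                # d_i = df/dv_i at (Y_1..Y_p)
        e = np.zeros_like(Y_hat); e[i] = eps[i]
        d[i] = (f(Y_hat + e) - f(Y_hat - e)) / (2 * eps[i])
    z = y_mat @ d                              # z_k = sum_i d_i y_ik
    n = len(w)
    s2 = np.var(z, ddof=1)                     # sample variance of z
    return N**2 * (1 - n / N) * s2 / n         # SRS variance of the total

# Example: ratio Y_hat / X_hat (same f as the unemployment rate).
rng = np.random.default_rng(0)
N, n = 10_000, 200
y = rng.gamma(2.0, 1.0, n); x = rng.gamma(3.0, 1.0, n)
w = np.full(n, N / n)                          # SRS weights N/n
v = linearised_variance(lambda t: t[0] / t[1], np.column_stack([y, x]), w, N)
print(v > 0)
```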

In the case of domain analysis, the linearisation technique also applies. First, we linearise the estimator over the domain, treating the domain itself as a population. Then we set the linearised variable to 0 for the observations outside the domain.

9.2 Alternative approaches

The seminal linearisation approach outlined in the previous section applies to non-linear but regular indicators, in the sense that the function f of the p totals Y_1, Y_2 \cdots Y_p must be continuously differentiable. However, many complex indicators do not meet this requirement, such as the quantiles of a distribution (e.g. median income or quantile ratios) or concentration measures such as the Gini coefficient, which relies on rank statistics. Alternative linearisation frameworks have been developed to deal with these types of indicators.

  • Kovačević and Binder (1997) developed a framework for dealing with estimators \hat{\theta} that are expressed as the solution of an estimation equation \sum_{i \in s} \omega_i \times u\left(y_i,\hat{\theta}\right) = 0. In this case, the linearised variable is given by:

\begin{equation*} \tilde{z}_k = \displaystyle{-\left(\sum_{i \in s} {\omega}_i \left.\frac{\partial u\left(y_i , \theta \right)}{\partial \theta}\right|_{\theta = \hat{\theta}}\right)^{-1} \times u\left(y_k,\hat{\theta}\right)} \end{equation*}

Such a framework is suitable for dealing with indicators such as regression coefficients (e.g. in linear or logistic regression models), which are expressed as the solution of systems of estimating equations.
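As an illustrative check of this framework, take the ratio once more: with the score u(y_i, \theta) = y_i - \theta x_i, the solution of \sum_{i \in s} \omega_i u(y_i, \hat{\theta}) = 0 is \hat{\theta} = \hat{Y}/\hat{X}, and the formula above reduces to the ratio linearised variable \left(y_k - \hat{\theta} x_k\right)/\hat{X}. A small sketch with simulated data:

```python
import numpy as np

# Estimating-equation linearisation checked on the ratio: the score is
# u(y_i, theta) = y_i - theta * x_i, whose weighted root is Y_hat / X_hat.
rng = np.random.default_rng(1)
n = 50
y, x = rng.gamma(2.0, 1.0, n), rng.gamma(3.0, 1.0, n)
w = np.full(n, 20.0)                           # illustrative uniform weights

theta_hat = np.sum(w * y) / np.sum(w * x)      # solves the estimating equation
du_dtheta = -x                                 # du/dtheta = -x_i
z_binder = -(np.sum(w * du_dtheta)) ** -1 * (y - theta_hat * x)
z_ratio = (y - theta_hat * x) / np.sum(w * x)  # direct ratio formula
print(np.allclose(z_binder, z_ratio))
```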

  • The framework proposed by Demnati and Rao (2004) is based on the concept of “influence function” that is used in robust statistics. In practice, the linearised variable is defined by the partial derivative function with respect to the weight variable:

\begin{equation*} \tilde{z}_k = \displaystyle{\frac{\partial f\left({\omega}_k , k \in s\right)}{\partial {\omega}_k}} \end{equation*}
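In practice, this derivative can be approximated by perturbing each weight in turn. A numerical sketch on simulated data, checked against the analytic linearised variable of the ratio, \left(y_k - \hat{R} x_k\right)/\hat{X}:

```python
import numpy as np

# Numerical Demnati-Rao sketch: differentiate the estimator with respect to
# each weight omega_k by central finite differences, and compare with the
# analytic linearised variable of the ratio.
rng = np.random.default_rng(2)
n = 30
y, x = rng.gamma(2.0, 1.0, n), rng.gamma(3.0, 1.0, n)
w = np.full(n, 50.0)                 # illustrative uniform weights

def f(weights):                      # the estimator as a function of weights
    return np.sum(weights * y) / np.sum(weights * x)

eps = 1e-5
z_dr = np.array([
    (f(w + eps * np.eye(n)[k]) - f(w - eps * np.eye(n)[k])) / (2 * eps)
    for k in range(n)
])
R_hat = f(w)
z_analytic = (y - R_hat * x) / np.sum(w * x)
print(np.allclose(z_dr, z_analytic, atol=1e-8))
```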

9.3 Examples

  • Case of a ratio between totals

\begin{equation} \hat{R} = f\left(\hat{Y},\hat{X}\right) = \displaystyle{\frac{\hat{Y}}{\hat{X}}} \tag{9.9} \end{equation}

The linearised variable is given by: \begin{equation} \begin{array}{rcl} \tilde{z}_k & = & \displaystyle{\frac{\partial f}{\partial y} \left(\hat{Y},\hat{X}\right)} y_k + \displaystyle{\frac{\partial f}{\partial x} \left(\hat{Y},\hat{X}\right)} x_k \\ & = & \displaystyle{\frac{1}{\hat{X}} y_k -\frac{\hat{Y}}{\hat{X}^2} x_k} \\ & = & \displaystyle{\frac{1}{\hat{X}} \left(y_k - \hat{R} x_k \right)} \end{array} \tag{9.10} \end{equation}

Then, assuming simple random sampling, the estimator of the variance of \hat{R} is given by: \begin{equation} \hat{V}_{L}\left(\hat{R}\right) = N^2 \left(1-\displaystyle{\frac{n}{N}}\right)\frac{s^2_z}{n} \tag{9.11} \end{equation}

In the case of a ratio over a subpopulation U_D \subseteq U, we obtain: \begin{equation} \hat{R}_D = \displaystyle{\frac{\sum_{i \in s \cap U_D} \omega_i y_i}{\sum_{i \in s \cap U_D} \omega_i x_i} = \frac{\sum_{i \in s} \omega_i y_i 1_i^D}{\sum_{i \in s} \omega_i x_i 1_i^D} = \frac{\hat{Y}_D}{\hat{X}_D}} \tag{9.12} \end{equation}

and \tilde{z}_k = \displaystyle{\frac{1^D_k}{\hat{X}_D} \left(y_k - \hat{R}_D x_k \right)}, where 1^D is the dummy membership indicator variable of U_D \subseteq U: 1^D_k = 1 if k \in U_D, 0 otherwise.
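The two cases above can be sketched as follows, assuming simple random sampling with weights N/n; the data and domain indicator are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 5_000, 100
y, x = rng.gamma(2.0, 1.0, n), rng.gamma(3.0, 1.0, n)
w = np.full(n, N / n)                        # SRS weights N/n

# Full-population ratio: linearised variable (9.10), SRS variance (9.11).
X_hat = np.sum(w * x)
R_hat = np.sum(w * y) / X_hat
z = (y - R_hat * x) / X_hat
var_R = N**2 * (1 - n / N) * np.var(z, ddof=1) / n

# Domain version (9.12): multiply by the membership indicator so that
# out-of-domain units contribute 0 to the linearised variable.
d = (rng.random(n) < 0.4).astype(float)      # hypothetical domain indicator
X_D = np.sum(w * x * d)
R_D = np.sum(w * y * d) / X_D
z_D = d * (y - R_D * x) / X_D
var_RD = N**2 * (1 - n / N) * np.var(z_D, ddof=1) / n
print(var_R >= 0 and np.all(z_D[d == 0] == 0))
```

Note that the weighted total of the linearised variable vanishes by construction, \sum_{i \in s} \omega_i \tilde{z}_i = 0, which is a convenient check on an implementation.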

  • Case of a population mean

This is a particular case of a ratio where x_k=1:

\begin{equation} \hat{\bar{Y}} = f\left(\hat{Y},\hat{N}\right) = \displaystyle{\frac{\hat{Y}}{\hat{N}}} \tag{9.13} \end{equation}

The linearised variable is given by: \begin{equation} \tilde{z}_k = \displaystyle{\frac{1}{\hat{N}} \left(y_k - \hat{\bar{Y}} \right)} \tag{9.14} \end{equation}
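Under simple random sampling with weights N/n, the resulting linearised variance of this (Hájek-type) mean reduces to the familiar \left(1-\frac{n}{N}\right)\frac{s^2_y}{n}, which the following sketch verifies on simulated data:

```python
import numpy as np

# Mean as a ratio with x_k = 1: linearised variable (9.14), then the SRS
# variance formula applied to the total of z. With weights N/n this must
# coincide with (1 - n/N) * s2_y / n.
rng = np.random.default_rng(5)
N, n = 2_000, 80
y = rng.gamma(2.0, 1.0, n)
w = np.full(n, N / n)

N_hat = np.sum(w)
ybar_hat = np.sum(w * y) / N_hat
z = (y - ybar_hat) / N_hat                   # linearised variable (9.14)
var_mean = N**2 * (1 - n / N) * np.var(z, ddof=1) / n
print(abs(var_mean - (1 - n / N) * np.var(y, ddof=1) / n) < 1e-12)
```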

  • Case of the dispersion

\begin{equation} \begin{array}{rcl} S^2 = \displaystyle{\frac{1}{N}}\sum_{i \in U} \left(y_i - \bar{Y}\right)^2 & = & \displaystyle{\frac{1}{N}}\sum_{i \in U} y^2_i - \displaystyle{\frac{\left(\sum_{i \in U} y_i\right)^2}{N^2}} \\ & = & f\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right) \end{array} \tag{9.15} \end{equation}

where f\left(x,y,z\right) = \displaystyle{\frac{z}{x} -\frac{y^2}{x^2}}

Thus, we have

\begin{equation} \begin{array}{rcl} {d}_1 = \displaystyle{\frac{\partial{f}}{\partial{x}}\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right)} & = & \displaystyle{-\frac{\sum_{i \in U} y^2_i}{N^2} + 2\frac{\left(\sum_{i \in U} y_i\right)^2}{N^3}} \\ & = & \displaystyle{-\frac{1}{N}\left(\frac{\sum_{i \in U} y^2_i}{N} - 2\frac{\left(\sum_{i \in U} y_i\right)^2}{N^2}\right)} \\ & = & \displaystyle{-\frac{1}{N}\left(S^2-\bar{Y}^2\right)} \end{array} \end{equation}

\begin{equation} {d}_2 = \displaystyle{\frac{\partial{f}}{\partial{y}}\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right) = -2\frac{\sum_{i \in U} y_i}{N^2}= -2\frac{\bar{Y}}{N}} \end{equation}

\begin{equation} {d}_3 = \displaystyle{\frac{\partial{f}}{\partial{z}}\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right) = \frac{1}{N}} \end{equation}

Therefore the (exact) linearised variable for the dispersion S^2 of y is given by: \begin{equation} \begin{array}{rcl} z_k & = & d_1 + d_2 y_k + d_3 y^2_k \\ & = & \displaystyle{-\frac{1}{N}\left(S^2-\bar{Y}^2 + 2\bar{Y}y_k - y^2_k\right)} \\ & = & \displaystyle{\frac{1}{N}\left[\left(y_k-\bar{Y}\right)^2 - S^2\right]} \end{array} \tag{9.16} \end{equation}

Hence the empirical linearised variable is: \begin{equation} \displaystyle{\tilde{z}_k = \frac{1}{\hat{N}}\left[\left(y_k-\hat{\bar{Y}}\right)^2 - \hat{S}^2\right]} \tag{9.17} \end{equation}

Using the approach of Demnati and Rao (2004) as an alternative, we get the same result: \begin{equation} \displaystyle{\tilde{z}_k = \frac{\hat{N}y^{2}_k-\sum_{i \in s} \omega_i y^{2}_i }{\hat{N}^2} - 2 \left(\frac{\sum_{i \in s} \omega_i y_i}{\hat{N}}\right)\left(\frac{y_k\hat{N}-\sum_{i \in s} \omega_i y_i}{\hat{N}^2}\right)} \tag{9.18} \end{equation}

Hence, we obtain: \begin{equation} \displaystyle{\tilde{z}_k = \frac{1}{\hat{N}}\left[y^{2}_k- \frac{\sum_{i \in s}\omega_i y^{2}_i}{\hat{N}} - 2 \left(\frac{\sum_{i \in s}\omega_i y_i}{\hat{N}}\right)\left(y_k - \frac{\sum_{i \in s}\omega_i y_i}{\hat{N}}\right)\right]} \tag{9.19} \end{equation}

\begin{equation} \displaystyle{= \frac{1}{\hat{N}}\left[y^{2}_k- \hat{S}^2 - \hat{\bar{Y}}^2 - 2\hat{\bar{Y}}\left(y_k - \hat{\bar{Y}}\right)\right] = \frac{1}{\hat{N}}\left(y^{2}_k- \hat{S}^2 + \hat{\bar{Y}}^2 - 2 \hat{\bar{Y}} y_k\right)} \tag{9.20} \end{equation}

\begin{equation} \displaystyle{= \frac{1}{\hat{N}}\left[\left(y_k-\hat{\bar{Y}}\right)^2 - \hat{S}^2\right]} \tag{9.21} \end{equation}
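Both routes lead to the same linearised variable, which can be double-checked numerically by differentiating \hat{S}^2 with respect to each weight (the Demnati-Rao derivative) and comparing with (9.17). A sketch on simulated data:

```python
import numpy as np

# Check of (9.17): linearised variable of the dispersion, compared with a
# finite-difference Demnati-Rao derivative of S2_hat w.r.t. each weight.
rng = np.random.default_rng(6)
n = 40
y = rng.gamma(2.0, 1.0, n)
w = np.full(n, 25.0)                 # illustrative uniform weights

def S2_hat(weights):
    m = np.sum(weights * y) / np.sum(weights)
    return np.sum(weights * (y - m)**2) / np.sum(weights)

N_hat = np.sum(w)
ybar = np.sum(w * y) / N_hat
z_formula = ((y - ybar)**2 - S2_hat(w)) / N_hat      # formula (9.17)

eps = 1e-5
z_fd = np.array([
    (S2_hat(w + eps * np.eye(n)[k]) - S2_hat(w - eps * np.eye(n)[k]))
    / (2 * eps)
    for k in range(n)
])
print(np.allclose(z_formula, z_fd, atol=1e-8))
```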

  • Case of the Theil index

The Theil index is a commonly used measure of inequality of a distribution, which belongs to the class of entropy measures:

\begin{equation} \begin{array}{rcl} T = \displaystyle{\frac{1}{N}}\sum_{i \in U} \frac{y_i}{\bar{Y}} Log\left(\frac{y_i}{\bar{Y}}\right) & = & \displaystyle{\frac{1}{N\bar{Y}}}\sum_{i \in U} \left[y_i Log\left(y_i\right) - y_i Log\left(\bar{Y}\right)\right] \\ & = & \displaystyle{\frac{1}{Y}}\sum_{i \in U} \left[y_i Log\left(y_i\right) - y_i Log\left(\sum_{i \in U} y_i\right) + y_i Log\left(N\right) \right] \\ & = & \displaystyle{\frac{\sum_{i \in U} y_i Log\left(y_i\right)}{\sum_{i \in U} y_i}- Log\left(\sum_{i \in U} y_i\right) + Log\left(N\right)} \end{array} \tag{9.22} \end{equation}

It is estimated by:

\begin{equation} \begin{array}{rcl} \hat{T} & = & \displaystyle{\frac{\sum_{i \in s} {\omega}_i y_i Log\left(y_i\right)}{\sum_{i \in s} {\omega}_i y_i}- Log\left(\sum_{i \in s} {\omega}_i y_i\right) + Log\left(\sum_{i \in s} {\omega}_i\right)} \end{array} \tag{9.23} \end{equation}

Using the Demnati-Rao framework, the linearised variable is given by: \begin{equation} \begin{array}{rcl} \tilde{z}_k & = & \displaystyle{\frac{y_k Log\left(y_k\right)\left(\sum_{i \in s} {\omega}_i y_i\right) - y_k\left[\sum_{i \in s} {\omega}_i y_i Log\left(y_i\right)\right]} {\left(\sum_{i \in s} {\omega}_i y_i\right)^2} - \frac{y_k}{\sum_{i \in s} {\omega}_i y_i} + \frac{1}{\sum_{i \in s} {\omega}_i}} \\ & = & \displaystyle{\frac{y_k}{\sum_{i \in s} {\omega}_i y_i}\left[Log\left(y_k\right) - \frac{\sum_{i \in s} {\omega}_i y_i Log\left(y_i\right)}{\sum_{i \in s} {\omega}_i y_i} - 1\right] + \frac{1}{\sum_{i \in s} {\omega}_i}} \\ & = & \displaystyle{\frac{y_k}{\sum_{i \in s} {\omega}_i y_i}\left[Log\left(y_k\right) - \hat{T} - Log\left(\sum_{i \in s} {\omega}_i y_i\right) + Log\left(\sum_{i \in s} {\omega}_i\right) - 1\right] + \frac{1}{\sum_{i \in s} {\omega}_i}} \end{array} \tag{9.24} \end{equation}

Regarding the Theil index T_D for a subpopulation U_D \subseteq U, the linearised variable is obtained by introducing the dummy membership indicator variable 1^D: \begin{equation} \tilde{z}_k = \displaystyle{ \frac{y_k 1_k^D}{\sum_{i \in s} {\omega}_i y_i 1_i^D} \left[ Log\left(y_k\right) - \hat{T}_D - Log\left(\sum_{i \in s} {\omega}_i y_i 1_i^D\right) + Log\left(\sum_{i \in s} {\omega}_i 1_i^D\right) - 1\right] + \frac{1_k^D}{\sum_{i \in s} {\omega}_i 1_i^D}} \tag{9.25} \end{equation}
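As a numerical sanity check for the Theil index, the Demnati-Rao linearised variable can be obtained by finite differences on the weights and compared with the closed form found by differentiating the estimator (9.23) directly. A sketch with illustrative lognormal incomes and uniform weights N/n:

```python
import numpy as np

# Demnati-Rao linearisation of the Theil index: finite-difference derivative
# of T_hat w.r.t. each weight, against the closed form from differentiating
# (9.23): z_k = (y_k/B)[log y_k - T - log B + log N_hat - 1] + 1/N_hat,
# with B = sum of w*y and N_hat = sum of w.
rng = np.random.default_rng(7)
N, n = 5_000, 100
y = rng.lognormal(0.0, 0.7, n)       # incomes, strictly positive
w = np.full(n, N / n)

def theil(weights):
    B = np.sum(weights * y)
    return (np.sum(weights * y * np.log(y)) / B
            - np.log(B) + np.log(np.sum(weights)))

eps = 1e-4
z_fd = np.array([
    (theil(w + eps * np.eye(n)[k]) - theil(w - eps * np.eye(n)[k]))
    / (2 * eps)
    for k in range(n)
])
B, Nh, T = np.sum(w * y), np.sum(w), theil(w)
z_closed = (y / B) * (np.log(y) - T - np.log(B) + np.log(Nh) - 1) + 1 / Nh
var_T = N**2 * (1 - n / N) * np.var(z_closed, ddof=1) / n
print(np.allclose(z_fd, z_closed, atol=1e-8))
```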