Chapter 6 Dealing with non-linear indicators: the linearisation technique

So far, the statistics we have dealt with are linear statistics, that is, population totals, means or proportions. However, many indicators, particularly those used in social statistics, are non-linear. For example, a mean or a proportion is non-linear when its denominator (the population size) is unknown: it must then be regarded as a ratio between two linear indicators. Furthermore, a reference indicator of income inequality is the Gini coefficient, whose definition uses rank statistics. Distributional aspects can also be measured by calculating percentiles such as the median, quartiles, quintiles or deciles. All these indicators are complex ones, for which variance calculation requires specific techniques.

6.1 The linearisation technique

Let us assume that an indicator $\theta$ can be expressed as a function of $p$ totals $Y_1, Y_2 \cdots Y_p$ :

$\begin{equation} \theta = f\left(Y_1,Y_2 \cdots Y_p\right) \tag{6.1} \end{equation}$

where $Y_i$ is the total of variable $\left(y_{ik}\right)$ over $U$ : $Y_i = \sum_{k \in U} y_{ik}$

For example, an unemployment rate can be regarded as a ratio between the total number of unemployed persons in the labour force population $Y = \sum_{i \in U} 1^{UNEMP}_i$ and the total number of individuals in the labour force $X = \sum_{i \in U} 1^{LF}_i$

$\begin{equation} R_{UNEMP} = \frac{\sum_{i \in U} 1^{UNEMP}_i}{\sum_{i \in U} 1^{LF}_i} = \frac{Y}{X} = f\left(Y,X\right) \tag{6.2} \end{equation}$

A complex parameter such as (6.1) is traditionally estimated by substituting an estimator $\hat{Y}_i$ for each of the $p$ totals $Y_i$ ($i = 1, 2 \cdots p$):

$\begin{equation} \hat{\theta} = f\left(\hat{Y}_1,\hat{Y}_2 \cdots \hat{Y}_p\right) \tag{6.3} \end{equation}$

Thus, the unemployment rate can be estimated by taking the ratio between the Horvitz-Thompson estimators for the numerator and the denominator:

$\begin{equation} \hat{R}_{UNEMP} = f\left(\hat{Y},\hat{X}\right) = \displaystyle{\frac{\sum_{i \in s} \displaystyle{\frac{1^{UNEMP}_i}{\pi_i}}}{\sum_{i \in s} \displaystyle{\frac{1^{LF}_i}{\pi_i}}}} \tag{6.4} \end{equation}$
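The plug-in estimator (6.4) can be sketched in a few lines of Python. This is only an illustration: the function names `ht_total` and `ht_ratio` and the toy sample are hypothetical, and the inclusion probabilities $\pi_i$ are assumed to be known for every sampled unit.

```python
from typing import Sequence


def ht_total(values: Sequence[float], pi: Sequence[float]) -> float:
    """Horvitz-Thompson estimate of a total: sum of y_i / pi_i over the sample."""
    return sum(v / p for v, p in zip(values, pi))


def ht_ratio(numerator: Sequence[float], denominator: Sequence[float],
             pi: Sequence[float]) -> float:
    """Ratio of two Horvitz-Thompson totals, as in (6.4)."""
    x_hat = ht_total(denominator, pi)
    if x_hat == 0:
        raise ValueError("estimated denominator total is zero")
    return ht_total(numerator, pi) / x_hat


# Toy sample (made-up values): indicator variables and inclusion probabilities.
unemployed = [1, 0, 0, 1, 0]      # 1 if the person is unemployed
labour_force = [1, 1, 0, 1, 1]    # 1 if the person is in the labour force
pi = [0.1, 0.2, 0.1, 0.25, 0.2]   # assumed inclusion probabilities

# Estimated totals are 14 and 24, so the estimated rate is 14 / 24.
rate = ht_ratio(unemployed, labour_force, pi)
```

The estimator is non-linear because the sample enters both the numerator and the denominator, which is why its variance cannot be read off directly from the variance formulas for totals.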

Assuming the function $f$ is “regular” (of class $C^1$, that is, differentiable with continuous partial derivatives), the linearisation technique consists of approximating the complex estimator (6.3) by a linear estimator through a first-order Taylor expansion:

$\begin{equation} \begin{array}{rcl} \hat{\theta} & = & f\left(\hat{Y}_1,\hat{Y}_2 \cdots \hat{Y}_p\right) \\ & = & f\left({Y}_1,{Y}_2 \cdots {Y}_p\right) + \sum_{i=1}^p \displaystyle{\frac{\partial f}{\partial v_i} \left({Y}_1,{Y}_2 \cdots {Y}_p\right)} \times \left(\hat{Y}_i - {Y}_i \right) + K_n \\ & = & f\left({Y}_1,{Y}_2 \cdots {Y}_p\right) + \sum_{i=1}^p d_i \times \left(\hat{Y}_i - {Y}_i \right) + K_n \\ & = & C + \sum_{i=1}^p d_i \hat{Y}_i + K_n \end{array} \tag{6.5} \end{equation}$

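where $v_i$ denotes the $i$-th argument of $f$, $d_i = \displaystyle{\frac{\partial f}{\partial v_i} \left({Y}_1,{Y}_2 \cdots {Y}_p\right)}$, $C = f\left({Y}_1,{Y}_2 \cdots {Y}_p\right) - \sum_{i=1}^p d_i Y_i$ is a constant (non-random) term, and $K_n$ is a random variable satisfying: $K_n = O_P\left(\displaystyle{\frac{1}{n}}\right)$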

Finally, based on this first-order expansion, one can prove that the variance of the complex estimator $\hat{\theta}$ is equal to the variance of its linear part $\sum_{i=1}^p d_i \hat{Y}_i$ plus a remainder term of order $\displaystyle{\frac{1}{n^{3/2}}}$ (the constant $C$ does not contribute to the variance):

$\begin{equation} V\left(\hat{\theta}\right) = V\left(\sum_{i=1}^p d_i \hat{Y}_i\right) + \displaystyle{O\left(\frac{1}{n^{3/2}}\right)} \tag{6.6} \end{equation}$

Thus, provided the sample size is “large” enough, the variance of $\hat{\theta}$ is asymptotically equal to that of its linear part:

$\begin{equation} V\left(\hat{\theta}\right) \approx V\left(\sum_{i=1}^p d_i \hat{Y}_i\right) = V\left(\hat{Z}\right) \tag{6.7} \end{equation}$

where $\hat{Z}$ is a (linear) estimator of the total $Z$ of $z_k = \sum_{i=1}^p d_i {y}_{ik}$
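To see why the linear part is itself the estimator of a single total, here is a short sketch. It assumes, for illustration only, that each $\hat{Y}_i$ is a Horvitz-Thompson estimator with inclusion probabilities $\pi_k$; the same exchange of sums works for any estimator that is a weighted sum over the sample $s$ with weights that do not depend on $i$.

$\begin{equation} \sum_{i=1}^p d_i \hat{Y}_i = \sum_{i=1}^p d_i \sum_{k \in s} \frac{y_{ik}}{\pi_k} = \sum_{k \in s} \frac{1}{\pi_k} \sum_{i=1}^p d_i y_{ik} = \sum_{k \in s} \frac{z_k}{\pi_k} = \hat{Z} \end{equation}$

The variance problem for $\hat{\theta}$ is thereby reduced to the variance of an estimated total of the single variable $z_k$, to which the usual formulas for linear statistics apply.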

As the partial derivatives $d_i = \displaystyle{\frac{\partial f}{\partial v_i} \left({Y}_1,{Y}_2 \cdots {Y}_p\right)}$ depend on the unknown totals, they are replaced by their estimates, and the variance of $\hat{\theta}$ is estimated by:

$\begin{equation} \hat{V}_L\left(\hat{\theta}\right) = \hat{V}\left(\sum_{i=1}^p \tilde{d}_i \hat{Y}_i\right) = \hat{V}\left(\hat{\tilde{Z}}\right) \tag{6.8} \end{equation}$

where $\tilde{d}_i = \displaystyle{\frac{\partial f}{\partial v_i} \left(\hat{Y}_1,\hat{Y}_2 \cdots \hat{Y}_p\right)}$ and $\tilde{z}_k = \sum_{i=1}^p \tilde{d}_i {y}_{ik}$
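The construction of the estimated linearised variable $\tilde{z}_k$ can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the totals are estimated with Horvitz-Thompson estimators, the partial derivatives are approximated by central finite differences rather than derived analytically, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence


def ht_totals(y: Sequence[Sequence[float]], pi: Sequence[float]) -> List[float]:
    """Horvitz-Thompson estimates of the p totals; y[k] holds the p values of unit k."""
    p = len(y[0])
    return [sum(row[i] / pk for row, pk in zip(y, pi)) for i in range(p)]


def numerical_gradient(f: Callable[..., float], point: Sequence[float],
                       rel_step: float = 1e-6) -> List[float]:
    """Central finite-difference approximation of the partial derivatives of f."""
    grad = []
    for i, v in enumerate(point):
        h = rel_step * max(abs(v), 1.0)
        up = list(point)
        down = list(point)
        up[i] = v + h
        down[i] = v - h
        grad.append((f(*up) - f(*down)) / (2.0 * h))
    return grad


def linearised_variable(f: Callable[..., float], y: Sequence[Sequence[float]],
                        pi: Sequence[float]) -> List[float]:
    """z_k = sum_i d_i * y_ik, with d_i evaluated at the estimated totals."""
    d = numerical_gradient(f, ht_totals(y, pi))
    return [sum(di * yik for di, yik in zip(d, row)) for row in y]


# Toy sample (made-up values): each row is (unemployed, in labour force).
sample = [(1, 1), (0, 1), (0, 0), (1, 1), (0, 1)]
pi = [0.1, 0.2, 0.1, 0.25, 0.2]
z = linearised_variable(lambda y_tot, x_tot: y_tot / x_tot, sample, pi)
```

Once $\tilde{z}_k$ is available, any variance estimator for a total that suits the sampling design can be applied to it. When $f$ has a simple closed form, analytical derivatives are preferable to the numerical approximation used here.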

6.2 Examples

Case of a ratio between two totals

$\begin{equation} \hat{R} = f\left(\hat{Y},\hat{X}\right) = \displaystyle{\frac{\hat{Y}}{\hat{X}}} \tag{6.9} \end{equation}$

The linearised variable is given by: $\begin{equation} \begin{array}{rcl} \tilde{z}_k & = & \displaystyle{\frac{\partial f}{\partial y} \left(\hat{Y},\hat{X}\right)} y_k + \displaystyle{\frac{\partial f}{\partial x} \left(\hat{Y},\hat{X}\right)} x_k \\ & = & \displaystyle{\frac{1}{\hat{X}} y_k -\frac{\hat{Y}}{\hat{X}^2} x_k} \\ & = & \displaystyle{\frac{1}{\hat{X}} \left(y_k - \hat{R} x_k \right)} \end{array} \tag{6.10} \end{equation}$

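Then, assuming simple random sampling without replacement, the estimator of the variance of $\hat{R}$ is given by: $\begin{equation} \hat{V}_{L}\left(\hat{R}\right) = N^2 \left(1-\displaystyle{\frac{n}{N}}\right)\frac{s^2_z}{n} \tag{6.11} \end{equation}$

where $n$ is the sample size, $N$ the population size and $s^2_z = \displaystyle{\frac{1}{n-1}\sum_{k \in s} \left(\tilde{z}_k - \bar{\tilde{z}}\right)^2}$ the sample variance of the linearised variable $\tilde{z}_k$.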
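The following sketch puts (6.10) and (6.11) together for a simple random sample without replacement. The function name `ratio_variance_srs` and the toy data are hypothetical, and the sample variance of the linearised variable is assumed to use the usual $n-1$ denominator.

```python
from typing import Sequence, Tuple


def ratio_variance_srs(y: Sequence[float], x: Sequence[float],
                       N: int) -> Tuple[float, float]:
    """Return (R_hat, linearised variance estimate) under simple random sampling."""
    n = len(y)
    if n < 2:
        raise ValueError("at least two sampled units are needed")
    # Under simple random sampling every unit has weight N / n.
    y_hat = N / n * sum(y)
    x_hat = N / n * sum(x)
    r_hat = y_hat / x_hat
    # Linearised variable (6.10): z_k = (y_k - R_hat * x_k) / X_hat.
    z = [(yk - r_hat * xk) / x_hat for yk, xk in zip(y, x)]
    z_bar = sum(z) / n
    s2_z = sum((zk - z_bar) ** 2 for zk in z) / (n - 1)
    # Variance estimator (6.11), including the finite population correction.
    variance = N ** 2 * (1 - n / N) * s2_z / n
    return r_hat, variance


# Toy sample of n = 5 units drawn from a population of N = 100 (made-up values).
y_sample = [1, 0, 0, 1, 0]
x_sample = [1, 1, 0, 1, 1]
r_hat, var_hat = ratio_variance_srs(y_sample, x_sample, N=100)
# Here r_hat = 0.5 and var_hat = 0.07421875.
```

With so few units the linear approximation is not reliable; the example only shows the mechanics of the calculation.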

Case of the dispersion of a variable

$\begin{equation} \begin{array}{rcl} S^2 = \displaystyle{\frac{1}{N}}\sum_{i \in U} \left(y_i - \bar{Y}\right)^2 & = & \displaystyle{\frac{1}{N}}\sum_{i \in U} y^2_i - \displaystyle{\frac{\left(\sum_{i \in U} y_i\right)^2}{N^2}} \\ & = & f\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right) \end{array} \tag{6.12} \end{equation}$

where $f\left(x,y,z\right) = \displaystyle{\frac{z}{x} -\frac{y^2}{x^2}}$

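Since $N = \sum_{i \in U} 1$ is itself a total (that of the constant variable equal to 1), $S^2$ is a function of three totals, and its partial derivatives are: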

$\begin{equation} \begin{array}{rcl} {d}_1 = \displaystyle{\frac{\partial{f}}{\partial{x}}\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right)} & = & \displaystyle{-\frac{\sum_{i \in U} y^2_i}{N^2} + 2\frac{\left(\sum_{i \in U} y_i\right)^2}{N^3}} \\ & = & \displaystyle{-\frac{1}{N}\left(\frac{\sum_{i \in U} y^2_i}{N} - 2\frac{\left(\sum_{i \in U} y_i\right)^2}{N^2}\right)} \\ & = & \displaystyle{-\frac{1}{N}\left(S^2-\bar{Y}^2\right)} \end{array} \end{equation}$

$\begin{equation} {d}_2 = \displaystyle{\frac{\partial{f}}{\partial{y}}\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right) = -2\frac{\sum_{i \in U} y_i}{N^2}= -2\frac{\bar{Y}}{N}} \end{equation}$

$\begin{equation} {d}_3 = \displaystyle{\frac{\partial{f}}{\partial{z}}\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right) = \frac{1}{N}} \end{equation}$

Therefore the linearised variable for the dispersion $S^2$ of $y$ is: $\begin{equation} \begin{array}{rcl} z_k & = & d_1 + d_2 y_k + d_3 y^2_k \\ & = & \displaystyle{-\frac{1}{N}\left(S^2-\bar{Y}^2 + 2\bar{Y}y_k - y^2_k\right)} \\ & = & \displaystyle{\frac{1}{N}\left[\left(y_k-\bar{Y}\right)^2 - S^2\right]} \end{array} \tag{6.13} \end{equation}$
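As a quick numerical check of (6.13), the sketch below computes the linearised variable in two ways on a small made-up population: from the closed form, and term by term from $d_1$, $d_2$ and $d_3$. The function name `dispersion_linearised` is hypothetical, and the calculation uses the true population values, whereas in practice $N$, $\bar{Y}$ and $S^2$ would be replaced by their estimates.

```python
from typing import List, Sequence, Tuple


def dispersion_linearised(y: Sequence[float]) -> Tuple[float, List[float], List[float]]:
    """Return S^2 and the linearised variable computed in two equivalent ways."""
    N = len(y)
    y_bar = sum(y) / N
    s2 = sum((v - y_bar) ** 2 for v in y) / N
    # Closed form (6.13): z_k = [(y_k - Y_bar)^2 - S^2] / N.
    z_closed = [((v - y_bar) ** 2 - s2) / N for v in y]
    # Term by term: z_k = d_1 * 1 + d_2 * y_k + d_3 * y_k^2.
    d1 = -(s2 - y_bar ** 2) / N
    d2 = -2.0 * y_bar / N
    d3 = 1.0 / N
    z_terms = [d1 + d2 * v + d3 * v ** 2 for v in y]
    return s2, z_closed, z_terms


# Made-up population of four units: mean 4, dispersion 2.
population = [2.0, 4.0, 4.0, 6.0]
s2, z_closed, z_terms = dispersion_linearised(population)
# z_closed and z_terms both equal [0.5, -0.5, -0.5, 0.5].
```

The values of $z_k$ sum to zero over the population, which follows directly from (6.13) because $\sum_{k \in U} \left(y_k - \bar{Y}\right)^2 = N S^2$.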