Chapter 9 Variance estimation for non-linear indicators: the linearisation technique
The statistics considered so far are linear statistics: population totals, means or proportions. However, many indicators, especially those used in social statistics, are non-linear. For example, a mean or a proportion is non-linear when its denominator is unknown and must itself be estimated: the indicator is then a ratio between two linear statistics. A reference indicator of income inequality, the Gini coefficient, is defined using rank statistics. Distributional aspects can also be measured by calculating percentiles such as medians, quartiles, quintiles or deciles. All these indicators are complex and require specific techniques for variance estimation.
9.1 Seminal approach
Let us assume an indicator \(\theta\) can be expressed as a function of the \(p\) totals \(Y_1, Y_2 \cdots Y_p\):
\[\begin{equation} \theta = f\left(Y_1,Y_2 \cdots Y_p\right) \tag{9.1} \end{equation}\]
where \(Y_i\) is the total of the variable \(y_{ik}\) over \(U\): \(Y_i = \sum_{k \in U} y_{ik}\).
For example, an unemployment rate can be regarded as the ratio between the total number of unemployed persons \(Y = \sum_{i \in U} 1^{UNEMP}_i\) and the total number of individuals in the labour force \(X = \sum_{i \in U} 1^{LF}_i\):
\[\begin{equation} R_{UNEMP} = \frac{\sum_{i \in U} 1^{UNEMP}_i}{\sum_{i \in U} 1^{LF}_i} = \frac{Y}{X} = f\left(Y,X\right) \tag{9.2} \end{equation}\]
A complex parameter such as (9.1) is traditionally estimated by substituting an estimator \(\hat{Y}_i\) for each of the \(p\) totals \(Y_1, Y_2 \cdots Y_p\):
\[\begin{equation} \hat{\theta} = f\left(\hat{Y}_1,\hat{Y}_2 \cdots \hat{Y}_p\right) \tag{9.3} \end{equation}\]
Thus, the unemployment rate can be estimated by taking the ratio between the Horvitz-Thompson estimators for the numerator and the denominator:
\[\begin{equation} \hat{R}_{UNEMP} = f\left(\hat{Y},\hat{X}\right) = \displaystyle{\frac{\sum_{i \in s} \displaystyle{\frac{1^{UNEMP}_i}{\pi_i}}}{\sum_{i \in s} \displaystyle{\frac{1^{LF}_i}{\pi_i}}}} \tag{9.4} \end{equation}\]
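As a minimal numerical sketch of (9.4), using made-up data (the names `pi`, `in_lf` and `unemp` are illustrative, not from the text), the plug-in ratio can be computed from the design weights \(\omega_i = 1/\pi_i\):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 500                                  # sample size (made up)
pi = rng.uniform(0.01, 0.05, size=n)     # first-order inclusion probabilities
w = 1.0 / pi                             # design weights omega_i = 1 / pi_i

in_lf = rng.random(n) < 0.6              # indicator 1^LF_i: in the labour force
unemp = in_lf & (rng.random(n) < 0.08)   # indicator 1^UNEMP_i: unemployed (subset of LF)

Y_hat = np.sum(w * unemp)                # Horvitz-Thompson estimate of the number of unemployed
X_hat = np.sum(w * in_lf)                # Horvitz-Thompson estimate of the labour force size
R_hat = Y_hat / X_hat                    # plug-in unemployment rate, as in (9.4)

print(f"estimated unemployment rate: {R_hat:.4f}")
```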
Assuming the function \(f\) is “regular” (of class \(C^1\), i.e. differentiable with continuous derivatives), the linearisation technique consists of approximating the complex estimator (9.3) by a linear estimator through a first-order Taylor expansion:
\[\begin{equation} \begin{array}{rcl} \hat{\theta} & = & f\left(\hat{Y}_1,\hat{Y}_2 \cdots \hat{Y}_p\right) \\ & = & f\left({Y}_1,{Y}_2 \cdots {Y}_p\right) + \sum_{i=1}^p \displaystyle{\frac{\partial f}{\partial v_i} \left({Y}_1,{Y}_2 \cdots {Y}_p\right)} \times \left(\hat{Y}_i - {Y}_i \right) + K_n \\ & = & f\left({Y}_1,{Y}_2 \cdots {Y}_p\right) + \sum_{i=1}^p d_i \times \left(\hat{Y}_i - {Y}_i \right) + K_n \\ & = & C + \sum_{i=1}^p d_i \hat{Y}_i + K_n \end{array} \tag{9.5} \end{equation}\]
where \(K_n\) is a random variable satisfying \(K_n = O_P\left(\displaystyle{\frac{1}{n}}\right)\), that is, \(n K_n\) remains bounded in probability as the sample size grows.
Finally, based on this first-order expansion, one can prove that the variance of the complex estimator \(\hat{\theta}\) is equal to the variance of the linear part \(\sum_{i=1}^p d_i \hat{Y}_i\) plus a remainder term of order \(\displaystyle{\frac{1}{n^{3/2}}}\):
\[\begin{equation} V\left(\hat{\theta}\right) = V\left(\sum_{i=1}^p d_i \hat{Y}_i\right) + \displaystyle{O\left(\frac{1}{n^{3/2}}\right)} \tag{9.6} \end{equation}\]
Thus, provided the sample size is “large” enough, the variance of \(\hat{\theta}\) is asymptotically equal to that of its linear part:
\[\begin{equation} V\left(\hat{\theta}\right) \approx V\left(\sum_{i=1}^p d_i \hat{Y}_i\right) = V\left(\hat{Z}\right) \tag{9.7} \end{equation}\]
where \(\hat{Z}\) is a (linear) estimator of the total \(Z\) of the linearised variable \(z_k = \sum_{i=1}^p d_i {y}_{ik}\).
As the partial derivatives \(d_i = \displaystyle{\frac{\partial f}{\partial v_i} \left({Y}_1,{Y}_2 \cdots {Y}_p\right)}\) are unknown, the variance of \(\hat{\theta}\) is estimated by:
\[\begin{equation} \hat{V}_L\left(\hat{\theta}\right) = \hat{V}\left(\sum_{i=1}^p \tilde{d}_i \hat{Y}_i\right) = \hat{V}\left(\hat{\tilde{Z}}\right) \tag{9.8} \end{equation}\]
where \(\tilde{d}_i = \displaystyle{\frac{\partial f}{\partial v_i} \left(\hat{Y}_1,\hat{Y}_2 \cdots \hat{Y}_p\right)}\) and \(\tilde{z}_k = \sum_{i=1}^p \tilde{d}_i {y}_{ik}\) is the empirical linearised variable.
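The recipe in (9.8) is mechanical enough to sketch generically in code: evaluate the partial derivatives \(\tilde{d}_i\) at the estimated totals (numerically, if \(f\) is tedious to differentiate by hand), form \(\tilde{z}_k\), and feed it to a variance estimator for a total. Below is a minimal Python sketch; it assumes simple random sampling for the final variance step, and all function names are illustrative:

```python
import numpy as np

def linearised_variable(f, y, w, eps=1e-6):
    """Empirical linearised variable z~_k for theta^ = f(Y^_1, ..., Y^_p).

    f : function of a length-p vector of totals
    y : (n, p) array of the p variables y_ik for the sampled units
    w : (n,) array of design weights 1/pi_k

    The partial derivatives d~_i are approximated by central differences
    evaluated at the estimated totals Y^_i, as in (9.8).
    """
    Y_hat = w @ y                          # estimated totals, shape (p,)
    d = np.empty_like(Y_hat)
    for i in range(Y_hat.size):
        h = eps * max(1.0, abs(Y_hat[i]))  # step scaled to the total's magnitude
        up, lo = Y_hat.copy(), Y_hat.copy()
        up[i] += h
        lo[i] -= h
        d[i] = (f(up) - f(lo)) / (2.0 * h)
    return y @ d                           # z~_k = sum_i d~_i y_ik, shape (n,)

def var_total_srs(z, n, N):
    """Variance estimator of a total under SRS: N^2 (1 - n/N) s_z^2 / n."""
    return N**2 * (1.0 - n / N) * np.var(z, ddof=1) / n

# usage for a ratio R^ = Y^/X^ (y_var, x_var, w, pop_size assumed given):
# z = linearised_variable(lambda t: t[0] / t[1], np.column_stack([y_var, x_var]), w)
# v = var_total_srs(z, n=len(z), N=pop_size)
```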
9.2 Additional approaches
The seminal linearisation approach outlined in the previous section applies to indicators that are non-linear but “regular”, in the sense that the function \(f\) of the \(p\) totals \(Y_1, Y_2 \cdots Y_p\) must be differentiable with continuous derivatives. However, many complex indicators do not meet this requirement, such as the quantiles of a distribution (e.g. median income or quantile ratios) or concentration measures such as the Gini coefficient, which relies on rank statistics. Alternative linearisation frameworks have been developed to deal with these types of indicators.
- Kovačević and Binder (1997) developed a framework for estimators \(\hat{\theta}\) that are expressed as the solution of an estimating equation \(\sum_{i \in s} \omega_i \times u\left(y_i,\hat{\theta}\right) = 0\). In this case, the linearised variable is given by:
\[\begin{equation*} \tilde{z}_k = \displaystyle{-\left(\sum_{i \in s} {\omega}_i \left.\frac{\partial u\left(y_i , \theta \right)}{\partial \theta}\right|_{\theta = \hat{\theta}}\right)^{-1} \times u\left(y_k,\hat{\theta}\right)} \end{equation*}\]
Such a framework is suitable for indicators such as regression coefficients (e.g. in linear or logistic regression models), which are expressed as solutions of systems of estimating equations.
- The framework proposed by Demnati and Rao (2004) is based on the concept of the “influence function” used in robust statistics. In practice, the linearised variable is obtained by differentiating the estimator, expressed as a function \(f\left({\omega}_k , k \in s\right)\) of the sampling weights, with respect to the weight of unit \(k\):
\[\begin{equation*} \tilde{z}_k = \displaystyle{\frac{\partial f\left({\omega}_k , k \in s\right)}{\partial {\omega}_k}} \end{equation*}\]
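This definition lends itself directly to a brute-force numerical approximation: perturb each weight \(\omega_k\) in turn and measure the effect on the estimator. A sketch of that idea follows (an illustration with \(O(n)\) re-evaluations of \(f\); the function names are assumptions of this sketch, not notation from Demnati and Rao):

```python
import numpy as np

def dr_linearised(f, w, eps=1e-4):
    """Demnati-Rao linearised variable z~_k = d f(omega) / d omega_k,
    approximated by a forward difference on each weight in turn."""
    base = f(w)
    z = np.empty_like(w)
    for k in range(w.size):
        w_pert = w.copy()
        w_pert[k] += eps
        z[k] = (f(w_pert) - base) / eps
    return z

# example: a weighted mean written as a function of the weights
# (y is an (n,) data vector assumed given)
# z = dr_linearised(lambda w_: np.sum(w_ * y) / np.sum(w_), w)
```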
9.3 Examples
- Case of a ratio between totals
\[\begin{equation} \hat{R} = f\left(\hat{Y},\hat{X}\right) = \displaystyle{\frac{\hat{Y}}{\hat{X}}} \tag{9.9} \end{equation}\]
The linearised variable is given by: \[\begin{equation} \begin{array}{rcl} \tilde{z}_k & = & \displaystyle{\frac{\partial f}{\partial y} \left(\hat{Y},\hat{X}\right)} y_k + \displaystyle{\frac{\partial f}{\partial x} \left(\hat{Y},\hat{X}\right)} x_k \\ & = & \displaystyle{\frac{1}{\hat{X}} y_k -\frac{\hat{Y}}{\hat{X}^2} x_k} \\ & = & \displaystyle{\frac{1}{\hat{X}} \left(y_k - \hat{R} x_k \right)} \end{array} \tag{9.10} \end{equation}\]
Then, assuming simple random sampling, the estimator of the variance of \(\hat{R}\) is given by: \[\begin{equation} \hat{V}_{L}\left(\hat{R}\right) = N^2 \left(1-\displaystyle{\frac{n}{N}}\right)\frac{s^2_z}{n} \tag{9.11} \end{equation}\] where \(s^2_z\) is the sample dispersion of the empirical linearised variable \(\tilde{z}_k\).
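A compact sketch of (9.10) and (9.11) in code (the function and argument names are illustrative):

```python
import numpy as np

def var_ratio_srs(y, x, n, N):
    """Linearisation variance estimator (9.11) of R^ = Y^/X^ under SRS."""
    w = np.full(n, N / n)                # SRS design weights N/n
    X_hat = np.sum(w * x)
    R_hat = np.sum(w * y) / X_hat
    z = (y - R_hat * x) / X_hat          # empirical linearised variable (9.10)
    return N**2 * (1.0 - n / N) * np.var(z, ddof=1) / n
```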
- Case of the dispersion (population variance)
\[\begin{equation} \begin{array}{rcl} S^2 = \displaystyle{\frac{1}{N}}\sum_{i \in U} \left(y_i - \bar{Y}\right)^2 & = & \displaystyle{\frac{1}{N}}\sum_{i \in U} y^2_i - \displaystyle{\frac{\left(\sum_{i \in U} y_i\right)^2}{N^2}} \\ & = & f\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right) \end{array} \tag{9.12} \end{equation}\]
where \(f\left(x,y,z\right) = \displaystyle{\frac{z}{x} -\frac{y^2}{x^2}}\)
Thus, we have:
\[\begin{equation} \begin{array}{rcl} {d}_1 = \displaystyle{\frac{\partial{f}}{\partial{x}}\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right)} & = & \displaystyle{-\frac{\sum_{i \in U} y^2_i}{N^2} + 2\frac{\left(\sum_{i \in U} y_i\right)^2}{N^3}} \\ & = & \displaystyle{-\frac{1}{N}\left(\frac{\sum_{i \in U} y^2_i}{N} - 2\frac{\left(\sum_{i \in U} y_i\right)^2}{N^2}\right)} \\ & = & \displaystyle{-\frac{1}{N}\left(S^2-\bar{Y}^2\right)} \end{array} \end{equation}\]
\[\begin{equation} {d}_2 = \displaystyle{\frac{\partial{f}}{\partial{y}}\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right) = -2\frac{\sum_{i \in U} y_i}{N^2}= -2\frac{\bar{Y}}{N}} \end{equation}\]
\[\begin{equation} {d}_3 = \displaystyle{\frac{\partial{f}}{\partial{z}}\left(N,\sum_{i \in U} y_i,\sum_{i \in U} y^2_i\right) = \frac{1}{N}} \end{equation}\]
Therefore the (exact) linearised variable for the dispersion \(S^2\) of \(y\) is given by: \[\begin{equation} \begin{array}{rcl} z_k & = & d_1 + d_2 y_k + d_3 y^2_k \\ & = & \displaystyle{-\frac{1}{N}\left(S^2-\bar{Y}^2 + 2\bar{Y}y_k - y^2_k\right)} \\ & = & \displaystyle{\frac{1}{N}\left[\left(y_k-\bar{Y}\right)^2 - S^2\right]} \end{array} \tag{9.13} \end{equation}\]
Hence the empirical linearised variable is: \[\begin{equation} \displaystyle{\tilde{z}_k = \frac{1}{\hat{N}}\left[\left(y_k-\hat{\bar{Y}}\right)^2 - \hat{S}^2\right]} \tag{9.14} \end{equation}\]
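In code, (9.14) reads as follows (a sketch, with \(\hat{N} = \sum_{i \in s} \omega_i\); names are illustrative):

```python
import numpy as np

def dispersion_linearised(y, w):
    """Empirical linearised variable (9.14) for the dispersion S^2."""
    N_hat = np.sum(w)                                  # N^ = sum of weights
    ybar_hat = np.sum(w * y) / N_hat                   # estimated mean
    S2_hat = np.sum(w * (y - ybar_hat) ** 2) / N_hat   # plug-in dispersion
    return ((y - ybar_hat) ** 2 - S2_hat) / N_hat
```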
Using the approach of Demnati and Rao (2004) as an alternative, we obtain the same result. Writing the plug-in estimator as a function of the weights, \(f\left(\omega\right) = \sum_{i \in s} \omega_i y^2_i / \hat{N} - \left(\sum_{i \in s} \omega_i y_i / \hat{N}\right)^2\) with \(\hat{N} = \sum_{i \in s} \omega_i\), and differentiating with respect to \(\omega_k\) yields: \[\begin{equation} \displaystyle{\tilde{z}_k = \frac{\hat{N}y^{2}_k-\sum_{i \in s} \omega_i y^{2}_i }{\hat{N}^2} - 2 \left(\frac{\sum_{i \in s} \omega_i y_i}{\hat{N}}\right)\left(\frac{y_k\hat{N}-\sum_{i \in s} \omega_i y_i}{\hat{N}^2}\right)} \tag{9.15} \end{equation}\]
Hence, we obtain: \[\begin{equation} \begin{array}{rcl} \tilde{z}_k & = & \displaystyle{\frac{1}{\hat{N}}\left[y^{2}_k- \frac{\sum_{i \in s}\omega_i y^{2}_i}{\hat{N}} - 2 \left(\frac{\sum_{i \in s}\omega_i y_i}{\hat{N}}\right)\left(y_k - \frac{\sum_{i \in s}\omega_i y_i}{\hat{N}}\right)\right]} \\ & = & \displaystyle{\frac{1}{\hat{N}}\left[y^{2}_k- \hat{S}^2 - \hat{\bar{Y}}^2 - 2\hat{\bar{Y}}\left(y_k - \hat{\bar{Y}}\right)\right] = \frac{1}{\hat{N}}\left(y^{2}_k- \hat{S}^2 + \hat{\bar{Y}}^2 - 2 \hat{\bar{Y}} y_k\right)} \\ & = & \displaystyle{\frac{1}{\hat{N}}\left[\left(y_k-\hat{\bar{Y}}\right)^2 - \hat{S}^2\right]} \end{array} \tag{9.16} \end{equation}\]
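The agreement between the two approaches can also be verified numerically, comparing the analytic formula (9.14) with a finite-difference approximation of the Demnati-Rao derivative. A self-contained sketch with made-up data (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(50.0, 10.0, size=200)     # study variable (made up)
w = rng.uniform(10.0, 30.0, size=200)    # design weights (made up)

def S2_hat(w_):
    """Plug-in dispersion estimator written as a function of the weights."""
    N_hat = np.sum(w_)
    ybar = np.sum(w_ * y) / N_hat
    return np.sum(w_ * (y - ybar) ** 2) / N_hat

# analytic linearised variable, as in (9.14)
N_hat = np.sum(w)
ybar = np.sum(w * y) / N_hat
z_analytic = ((y - ybar) ** 2 - S2_hat(w)) / N_hat

# Demnati-Rao derivative, approximated by a forward difference on each weight
eps = 1e-4
z_numeric = np.array([
    (S2_hat(np.where(np.arange(w.size) == k, w + eps, w)) - S2_hat(w)) / eps
    for k in range(w.size)
])

# the two variables coincide up to finite-difference error
print(np.max(np.abs(z_analytic - z_numeric)))
```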