Chapter 7 Incorporating auxiliary information to increase sampling precision
It is long-established practice in sample surveys to incorporate auxiliary information into estimation formulas in order to reduce variance, thereby making survey estimators more precise. For example, when a survey leads to an estimated proportion of males/females in the population equal to 52/48, and when an external source (e.g. a population census) gives a proportion of 50/50, we seek to adjust the estimator so that the estimated proportion of males/females is equal to 50/50. This adjustment is expected to reduce sampling variance when the study variables of the survey are related to gender.
More generally, auxiliary information can be regarded as a vector of \(K\) variables \(x=\left(x^1,x^2 \cdots x^K\right)\) whose population totals \(X=\sum_{j \in U}x_j=\sum_{j \in U} \left(x_j^1,x_j^2 \cdots x_j^K\right)=\left(X^1,X^2 \cdots X^K\right)\) are assumed to be known from external sources. Population censuses, business registers and administrative databases are typical examples of such sources.
7.1 Examples of adjusted estimators
7.1.1 Difference estimator
\[\begin{equation} \displaystyle{\hat{Y}_{D} = cX + \sum_{i \in s} \frac{y_i - c\,x_i}{\pi_i} = \hat{Y}_{HT} + c\left(X-\hat{X}_{HT}\right)} \tag{7.1} \end{equation}\]
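Here \(c\) is a constant chosen in advance. Under simple random sampling of size \(n\), the variance of the difference estimator is given by: \(\displaystyle{V\left(\hat{Y}_{D}\right)=N^2\left(1-f\right)\frac{S^2_d}{n}}\) where \(f = n/N\) is the sampling rate and \(S^2_d\) is the dispersion of \(d_i = y_i - c\,x_i\). The variance is therefore small when \(y_i\) is close to \(c\,x_i\) for most units of the population.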
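The two forms of (7.1) and the variance formula can be checked numerically. The following is a minimal sketch on a simulated population, not a reference implementation: the population, the seed, the value of `c` and all variable names are assumptions made for illustration, and the dispersion is taken with an \(N-1\) denominator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: y is roughly 2 * x plus noise (illustrative only).
N, n = 1000, 100
x = rng.uniform(10, 50, size=N)
y = 2.0 * x + rng.normal(0, 5, size=N)
X_total = x.sum()                      # population total of x, assumed known

# Simple random sample without replacement: pi_i = n / N for every unit.
s = rng.choice(N, size=n, replace=False)
pi = n / N
c = 2.0                                # constant fixed in advance (assumption)

Y_ht = (y[s] / pi).sum()               # Horvitz-Thompson estimator of Y
X_ht = (x[s] / pi).sum()               # Horvitz-Thompson estimator of X

# The two equivalent forms of the difference estimator (7.1).
Y_d_form1 = c * X_total + ((y[s] - c * x[s]) / pi).sum()
Y_d_form2 = Y_ht + c * (X_total - X_ht)

# Variance under SRS, N^2 (1 - f) S^2 / n, for the difference estimator
# (dispersion of d_i = y_i - c x_i) and for the HT estimator (dispersion of y_i).
f = n / N
var_d = N**2 * (1 - f) * (y - c * x).var(ddof=1) / n
var_ht = N**2 * (1 - f) * y.var(ddof=1) / n

print(Y_d_form1, Y_d_form2)
print(var_d / var_ht)                  # well below 1 when y is close to c * x
```

In this simulated setting the gain comes entirely from the choice of `c`: a value far from the true slope would shrink or even cancel the variance reduction.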
7.1.2 Ratio estimator
\[\begin{equation} \hat{Y}_{R} = X \times \sum_{i \in s} \frac{y_i}{\pi_i} / \sum_{i \in s} \frac{x_i}{\pi_i} = X \times \hat{Y}_{HT} / \hat{X}_{HT} \tag{7.2} \end{equation}\]
When the sample size is large enough, the variance of the ratio estimator under simple random sampling is approximately: \(\displaystyle{V\left(\hat{Y}_{R}\right) \approx N^2\left(1-f\right)\frac{S^2_u}{n}}\) where \(S^2_u\) is the dispersion of the residuals \(u_i = y_i - \frac{Y}{X}x_i\).
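As an illustration, the ratio estimator (7.2) can be sketched in a few lines. The function names, the toy sample and the population size below are hypothetical. The variance function is a plug-in version of the approximation above, in which the unknown ratio \(Y/X\) is replaced by its sample estimate; this is one common choice and is an assumption here, not something stated in the text.

```python
import numpy as np

def ratio_estimator(y_s, x_s, pi_s, X_total):
    """Ratio estimator (7.2): X * Y_HT / X_HT."""
    y_ht = np.sum(y_s / pi_s)
    x_ht = np.sum(x_s / pi_s)
    return X_total * y_ht / x_ht

def srs_ratio_variance_estimate(y_s, x_s, N):
    """Plug-in variance estimate under SRS: N^2 (1 - f) s_u^2 / n,
    with residuals u_i = y_i - R_hat * x_i computed on the sample."""
    n = len(y_s)
    r_hat = y_s.sum() / x_s.sum()          # estimated ratio Y / X under SRS
    u = y_s - r_hat * x_s
    return N**2 * (1 - n / N) * u.var(ddof=1) / n

# Toy sample of n = 4 units drawn by SRS from a population of N = 40 units.
N = 40
y_s = np.array([12.0, 20.0, 31.0, 39.0])
x_s = np.array([6.0, 10.0, 15.0, 20.0])
pi_s = np.full(len(y_s), len(y_s) / N)     # equal inclusion probabilities n / N
X_total = 500.0                            # known total of x (hypothetical)

Y_r = ratio_estimator(y_s, x_s, pi_s, X_total)
v_r = srs_ratio_variance_estimate(y_s, x_s, N)
print(Y_r, v_r)
```

Applying `ratio_estimator` to the auxiliary variable itself returns `X_total`, which is the sense in which the ratio estimator is adjusted to the known benchmark.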
7.1.3 Regression estimator
\[\begin{equation} \displaystyle{\hat{Y}_{Reg} = \hat{Y}_{HT} + \hat{B}\left(X-\hat{X}_{HT}\right)} \tag{7.3} \end{equation}\]
where \(\hat{B}\) is the vector of the estimated regression coefficients of the study variable \(y\) on the auxiliary variables \(x\).
When the sample size is large enough, the variance of the regression estimator under simple random sampling is approximately: \(\displaystyle{V\left(\hat{Y}_{Reg}\right) \approx N^2\left(1-f\right)\frac{S^2_e}{n}}\) where \(S^2_e\) is the dispersion of the regression residuals \(e_i = y_i - \hat{B}x_i\).
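The regression estimator (7.3) can be sketched as follows. The text does not say how \(\hat{B}\) is computed; the sketch assumes weighted least squares with the sampling weights \(1/\pi_i\), which is a common choice. The function name and the simulated population are hypothetical.

```python
import numpy as np

def regression_estimator(y_s, x_s, d_s, X_totals):
    """Regression estimator (7.3).

    y_s: (n,) study variable; x_s: (n, K) auxiliary variables;
    d_s: (n,) sampling weights 1 / pi_i; X_totals: (K,) known totals.
    B_hat is obtained by weighted least squares (an assumption of this sketch).
    """
    xw = x_s * d_s[:, None]
    B_hat = np.linalg.solve(xw.T @ x_s, xw.T @ y_s)
    Y_ht = d_s @ y_s
    X_ht = d_s @ x_s
    return Y_ht + (X_totals - X_ht) @ B_hat, B_hat

# Hypothetical population in which y is an exact linear function of x.
rng = np.random.default_rng(7)
N, n = 500, 50
x_pop = np.column_stack([np.ones(N), rng.uniform(0, 10, size=N)])
y_pop = 3.0 + 2.0 * x_pop[:, 1]

s = rng.choice(N, size=n, replace=False)   # simple random sample
d_s = np.full(n, N / n)
Y_reg, B_hat = regression_estimator(y_pop[s], x_pop[s], d_s, x_pop.sum(axis=0))
print(Y_reg, y_pop.sum(), B_hat)
```

Because the residuals are all zero in this artificial population, the estimator reproduces the true total whatever sample is drawn, which is the limiting case of the variance formula above.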
7.2 Unified calibration theory
A unified approach for incorporating auxiliary information was proposed in Deville and Särndal (1992) and Särndal and Lundström (2005). This approach consists of “calibrating” the sampling weights \(d_j=1/\pi_j\) so as to match the \(K\) population benchmarks \(\left(X^1,X^2 \cdots X^K\right)\).
More precisely, we seek new weights \(\left(\omega_j\right)_{j \in s}\) which are as close as possible to the initial weights \(\left(d_j\right)_{j \in s}\) and such that the following \(K\) calibration equations based on the auxiliary variables \(x=\left(x^1,x^2 \cdots x^K\right)\) are satisfied:
\[\begin{equation} \sum_{j \in s} \omega_j x_j = \sum_{j \in s} \omega_j \left(x_j^1,x_j^2 \cdots x_j^K\right)=\left(X^1,X^2 \cdots X^K\right)=X \tag{7.4} \end{equation}\]
The solution to this problem is given by: \(\omega_j = d_j F\left(x_j \lambda\right)\), where \(\lambda\) is a vector of \(K\) parameters that does not depend on the unit \(j\), and \(F\) is a function which depends on the calibration method used. For example, \(F\left(u\right)=1+u\) for the linear method and \(F\left(u\right)=\exp\left(u\right)\) for the “raking ratio” method. The parameter \(\lambda\) is determined by solving the system of \(K\) calibration equations (7.4).
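To make the role of \(\lambda\) concrete, the sketch below solves the calibration equations (7.4) by Newton's method for the linear and raking ratio methods. It is a minimal illustration under stated assumptions, not the algorithm of any particular software: the function `calibrate`, the simulated sample and the benchmark values are all hypothetical.

```python
import numpy as np

def calibrate(d, x, X_totals, method="linear", max_iter=50, tol=1e-10):
    """Calibrated weights w_j = d_j * F(x_j . lam) satisfying (7.4).

    d: (n,) initial weights; x: (n, K) auxiliary values; X_totals: (K,).
    F(u) = 1 + u for "linear", F(u) = exp(u) for "raking".
    """
    F = {"linear": lambda u: 1.0 + u, "raking": np.exp}[method]
    dF = {"linear": lambda u: np.ones_like(u), "raking": np.exp}[method]
    lam = np.zeros(x.shape[1])
    for _ in range(max_iter):
        u = x @ lam
        gap = X_totals - (d * F(u)) @ x            # calibration error
        if np.max(np.abs(gap)) <= tol * np.max(np.abs(X_totals)):
            break
        J = (x * (d * dF(u))[:, None]).T @ x       # Jacobian of sum_j w_j x_j
        lam = lam + np.linalg.solve(J, gap)        # Newton step
    return d * F(x @ lam)

# Hypothetical sample: sex indicators and age, with equal initial weights.
rng = np.random.default_rng(1)
n, N = 200, 2000
male = rng.integers(0, 2, size=n)
age = rng.uniform(18, 80, size=n)
x = np.column_stack([male, 1 - male, age]).astype(float)
d = np.full(n, N / n)

# Hypothetical benchmarks: 1000 males, 1000 females, total age 98000.
X_totals = np.array([1000.0, 1000.0, 98000.0])

w_lin = calibrate(d, x, X_totals, method="linear")
w_rak = calibrate(d, x, X_totals, method="raking")
print(w_lin @ x, w_rak @ x)                        # both reproduce X_totals
```

With the linear method a single Newton step is exact, since the equations are linear in \(\lambda\). The raking ratio method needs a few iterations but always yields positive weights, whereas linear weights can become negative when the sample is far from the benchmarks.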
Calibration weighting is a very powerful technique to reduce sampling variance when the auxiliary variables are related to the target characteristics of the survey. A key result in calibration theory was established by Deville and Särndal (1992): provided the sample size is large enough, the variance of the estimator \(\hat{Y}_{Calib} = \sum_{j \in s} \omega_j y_j\) based on the “calibrated” weights is approximately equal to the variance of the estimator based on the initial weights \(\left(d_j\right)_{j \in s}\) and using the regression residuals as the variable of interest:
\[\begin{equation} \displaystyle{V\left(\hat{Y}_{Calib}\right) \approx V\left(\hat{E}_{HT}\right)} \tag{7.5} \end{equation}\]
where \(\displaystyle{\hat{E}_{HT}=\sum_{i \in s} \frac{y_i - \hat{B}x_i}{\pi_i}=\sum_{i \in s} d_i \left(y_i - \hat{B}x_i\right)}\).
Thus, when the variables \(x_i\) are strongly related to the study variable \(y\), the regression residuals are close to zero and so is the variance (7.5) of the calibration estimator.
Furthermore, the calibration estimator is asymptotically unbiased.
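Result (7.5) and the asymptotic unbiasedness can be illustrated by a small Monte Carlo experiment. The sketch below is an illustration under stated assumptions rather than a proof: the population, the sample size, the number of replications and all names are hypothetical, sampling is simple random sampling without replacement, and calibration uses the linear method.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population with a strong linear link between y and x.
N, n = 5000, 100
x1 = rng.gamma(shape=4.0, scale=5.0, size=N)
x = np.column_stack([np.ones(N), x1])          # intercept + one auxiliary variable
y = 10.0 + 3.0 * x1 + rng.normal(0, 4, size=N)
X_totals = x.sum(axis=0)                       # benchmarks, assumed known
d0 = N / n                                     # SRS weight 1 / pi

def estimates(idx):
    """Return (linear calibration estimate, HT estimate) for one sample."""
    xs, ys = x[idx], y[idx]
    d = np.full(len(idx), d0)
    T = (xs * d[:, None]).T @ xs
    lam = np.linalg.solve(T, X_totals - d @ xs)
    w = d * (1.0 + xs @ lam)                   # w_j = d_j F(x_j lam), F(u) = 1 + u
    return w @ ys, d @ ys

reps = 2000
est = np.array([estimates(rng.choice(N, size=n, replace=False))
                for _ in range(reps)])
var_calib_mc, var_ht_mc = est.var(axis=0, ddof=1)

# Right-hand side of (7.5) under SRS, using population regression residuals.
B = np.linalg.lstsq(x, y, rcond=None)[0]
e = y - x @ B
var_approx = N**2 * (1 - n / N) * e.var(ddof=1) / n

rel_bias = (est[:, 0].mean() - y.sum()) / y.sum()
print(var_calib_mc / var_approx)               # expected to be close to 1
print(var_calib_mc / var_ht_mc)                # expected to be far below 1
print(rel_bias)                                # expected to be close to 0
```

The first ratio compares the simulated variance of the calibration estimator with the residual-based approximation, and the second shows the gain over the estimator based on the initial weights.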
7.3 Software implementation
Calibration is a commonly used technique in survey sampling to increase the precision of survey estimates. Several software tools are now available to compute calibrated weights from a set of population benchmarks:
- The SAS macro CALMAR developed by France’s National Statistical Institute (INSEE)
- The Stata command Calibrate
- The R package Sampling
- The R Shiny app Calif developed by Slovakia’s National Statistical Institute
As a weighting adjustment technique, calibration modifies the distribution of sampling weights. It is therefore important to check for extreme values in the weight distribution in order to keep the impact of calibration under control. For instance, “bounded” methods, which keep the ratios between final and initial weights between two predefined limits (a lower and an upper bound), are often recommended.
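A simple diagnostic along these lines is to inspect the ratios between final and initial weights. The sketch below is only an illustration: the function name, the toy weights and the bounds 0.5 and 2.0 are hypothetical and would need to be set according to the survey at hand.

```python
import numpy as np

def weight_ratio_check(w, d, lower=0.5, upper=2.0):
    """Summarise the ratios w_j / d_j and flag those outside [lower, upper].

    The default bounds are illustrative, not recommended values.
    """
    g = np.asarray(w, dtype=float) / np.asarray(d, dtype=float)
    outside = (g < lower) | (g > upper)
    return {
        "min_ratio": g.min(),
        "max_ratio": g.max(),
        "n_outside": int(outside.sum()),
        "outside_idx": np.flatnonzero(outside),   # positions of flagged units
    }

# Toy example: initial weights d and calibrated weights w for five units.
d = np.array([10.0, 10.0, 10.0, 10.0, 10.0])
w = np.array([9.0, 12.0, 4.0, 25.0, 10.5])

report = weight_ratio_check(w, d)
print(report)
```

Units flagged by such a check would typically prompt a rerun of the calibration with a bounded method or with a reduced set of auxiliary variables.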