Chapter 8 Incorporating auxiliary information to increase sampling precision

8.1 Main idea

It is a long-established practice in sample surveys to incorporate auxiliary information into estimation formulae in order to reduce variance and thereby make the survey estimators more accurate. For example, if a survey results in an estimated proportion of men/women in the population of 52/48, and an external source (e.g. a census) gives a proportion of 50/50, we seek to adjust the estimator so that the estimated proportion of men/women is 50/50. This adjustment is expected to reduce the sampling variance if the study variables of the survey are related to gender.
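As a toy illustration of this idea (all population values below are hypothetical), the following sketch compares the raw mean of a study variable in a sample that over-represents men (52/48) with a post-stratified mean that forces the known 50/50 census proportions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 50/50 men/women, known from a census.
N = 10_000
sex = np.array(["M"] * (N // 2) + ["F"] * (N // 2))
# Illustrative study variable related to sex (men ~30, women ~20).
y = np.where(sex == "M", 30.0, 20.0) + rng.normal(0, 5, N)

# A sample that happens to contain 52% men and 48% women.
idx = np.concatenate([
    rng.choice(np.where(sex == "M")[0], 52, replace=False),
    rng.choice(np.where(sex == "F")[0], 48, replace=False),
])
ys, sexs = y[idx], sex[idx]

# Unadjusted sample mean vs. a mean re-weighted to the 50/50 benchmark.
mean_raw = ys.mean()
mean_adj = 0.5 * ys[sexs == "M"].mean() + 0.5 * ys[sexs == "F"].mean()
print(mean_raw, mean_adj)
```

Because the adjusted estimator fixes the sex proportions at their census values, its error no longer depends on the (random) sample composition by sex, which is exactly where the variance reduction comes from.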

More generally, the auxiliary information can be thought of as a vector of \(K\) variables \(x=\left(x^1,x^2 \cdots x^K\right)\) whose population totals \(X=\sum_{j \in U}x_j=\sum_{j \in U} \left(x_j^1,x_j^2 \cdots x_j^K\right)=\left(X^1,X^2 \cdots X^K\right)\) are assumed to be known from external sources, such as population censuses, business registers or administrative databases.

8.2 Examples of adjusted estimators

8.2.1 Difference estimator

The difference estimator relies on a constant \(c\), fixed in advance, which expresses the assumed relationship between the study variable \(y\) and the auxiliary variable \(x\):

\[\begin{equation} \displaystyle{\hat{Y}_{D} = cX + \sum_{i \in s} \frac{\left(y_i - c\,x_i\right)}{\pi_i}} = \hat{Y}_{HT} + c\left(X-\hat{X}_{HT}\right) \tag{8.1} \end{equation}\]

Under simple random sampling of size \(n\), the variance of the difference estimator is given by: \(\displaystyle{V\left(\hat{Y}_{D}\right)=N^2\left(1-f\right)\frac{S^2_d}{n}}\) where \(S^2_d\) is the dispersion of the differences \(d_i = y_i - c\,x_i\).
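A small numerical sketch of (8.1), on a hypothetical population where \(y\) is approximately \(c\) times \(x\), so that the differences \(d_i\) are small:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population where y ~ c*x plus noise.
N, n, c = 5_000, 200, 2.0
x = rng.uniform(10, 50, N)
y = c * x + rng.normal(0, 4, N)
X = x.sum()                                  # known population total of x

s = rng.choice(N, n, replace=False)          # SRS without replacement
w = N / n                                    # design weight 1/pi_i

Y_ht = w * y[s].sum()                        # Horvitz-Thompson estimator
Y_d = c * X + w * (y[s] - c * x[s]).sum()    # difference estimator (8.1)
print(Y_ht, Y_d, y.sum())
```

Since the residuals \(y_i - c\,x_i\) have a much smaller dispersion than \(y_i\) itself, \(\hat{Y}_{D}\) is typically far closer to the true total than \(\hat{Y}_{HT}\).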

8.2.2 Ratio estimator

\[\begin{equation} \hat{Y}_{R} = X \times \sum_{i \in s} \frac{y_i}{\pi_i} / \sum_{i \in s} \frac{x_i}{\pi_i} = X \times \hat{Y}_{HT} / \hat{X}_{HT} \tag{8.2} \end{equation}\]

When the sample size is large enough, the variance of the ratio estimator is: \(\displaystyle{V\left(\hat{Y}_{R}\right) \approx N^2\left(1-f\right)\frac{S^2_u}{n}}\) where \(S^2_u\) is the dispersion of the residuals \(u_i = y_i - \frac{Y}{X}\,x_i\).
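The same kind of sketch for the ratio estimator (8.2), on a hypothetical population where \(y\) is roughly proportional to \(x\):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population where y is roughly proportional to x.
N, n = 5_000, 200
x = rng.uniform(10, 50, N)
y = 3.0 * x * rng.normal(1.0, 0.05, N)   # small relative noise around R = 3
X = x.sum()                              # known population total of x

s = rng.choice(N, n, replace=False)      # SRS without replacement
w = N / n
Y_ht = w * y[s].sum()
X_ht = w * x[s].sum()
Y_r = X * Y_ht / X_ht                    # ratio estimator (8.2)
print(Y_r, y.sum())
```

Unlike the difference estimator, no constant \(c\) has to be chosen in advance: the ratio \(\hat{Y}_{HT}/\hat{X}_{HT}\) is estimated from the sample itself.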

8.2.3 Regression estimator

\[\begin{equation} \displaystyle{\hat{Y}_{Reg} = \hat{Y}_{HT} + \hat{B}\left(X-\hat{X}_{HT}\right)} \tag{8.3} \end{equation}\]

where \(\hat{B}\) is the vector of the estimated regression coefficients of the study variable \(y\) on the auxiliary variables \(x\).

When the sample size is large enough, the variance of the regression estimator is: \(\displaystyle{V\left(\hat{Y}_{Reg}\right) \approx N^2\left(1-f\right)\frac{S^2_e}{n}}\) where \(S^2_e\) is the dispersion of the regression residuals \(e_i = y_i - \hat{B}x_i\).
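A sketch of the regression estimator (8.3) with a single auxiliary variable; for simplicity only the slope is estimated here, whereas a full treatment would include an intercept term among the auxiliary variables:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population with a linear relationship between y and x.
N, n = 5_000, 200
x = rng.uniform(10, 50, N)
y = 100 + 2.0 * x + rng.normal(0, 5, N)
X = x.sum()                                  # known population total of x

s = rng.choice(N, n, replace=False)          # SRS without replacement
w = N / n
B = np.polyfit(x[s], y[s], 1)[0]             # estimated slope B_hat
Y_ht, X_ht = w * y[s].sum(), w * x[s].sum()
Y_reg = Y_ht + B * (X - X_ht)                # regression estimator (8.3)
print(Y_reg, y.sum())
```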

8.3 Unified calibration theory

A unified approach to incorporating auxiliary information has been proposed in Deville and Särndal (1992) and Särndal and Lundström (2005). This approach consists of “calibrating” the sample weights \(d_j=1/\pi_j\) so that they match the \(K\) population benchmarks \(\left(X^1,X^2 \cdots X^K\right)\).

More precisely, we are looking for new weights \(\left(\omega_j\right)_{j \in s}\) that are “as close as possible” to the initial weights \(\left(d_j\right)_{j \in s}\) and such that the following \(K\) calibration equations are satisfied, based on the auxiliary variables \(x=\left(x^1,x^2 \cdots x^K\right)\):

\[\begin{equation} \sum_{j \in s} \omega_j x_j = \sum_{j \in s} \omega_j \left(x_j^1,x_j^2 \cdots x_j^K\right)=\left(X^1,X^2 \cdots X^K\right)=X \tag{8.4} \end{equation}\]

The solution to this problem (8.4) is given by \(\omega_j = d_j F\left(x_j \lambda\right)\), where \(\lambda\) is a vector of \(K\) parameters and \(F\) is a function that depends on the calibration method used. For example, \(F\left(u\right)=1+u\) for the linear method and \(F\left(u\right)=\exp\left(u\right)\) for the raking ratio method. The parameter \(\lambda\) is determined by solving the system of \(K\) calibration equations (8.4).
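For the linear method \(F(u)=1+u\), the calibration equations reduce to a \(K \times K\) linear system in \(\lambda\). A minimal sketch, on hypothetical data with two auxiliary variables (a constant and a numeric variable):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical setup: two auxiliary variables with known population totals.
N, n = 5_000, 200
x = np.column_stack([np.ones(N), rng.uniform(10, 50, N)])  # x^1 = 1, x^2
X = x.sum(axis=0)                        # known benchmarks (N, total of x^2)

s = rng.choice(N, n, replace=False)
d = np.full(n, N / n)                    # initial weights d_j = 1/pi_j
xs = x[s]

# Linear method F(u) = 1 + u: lambda solves the K calibration equations
#   sum_j d_j (1 + x_j' lambda) x_j = X,  i.e.  T lambda = X - sum_j d_j x_j
T = (d[:, None] * xs).T @ xs             # T = sum_j d_j x_j x_j'
lam = np.linalg.solve(T, X - d @ xs)
w = d * (1.0 + xs @ lam)                 # calibrated weights

# The calibrated weights reproduce the benchmarks exactly.
print(np.allclose(w @ xs, X))            # True
```

With the linear method the solution is available in closed form; for other choices of \(F\) (such as raking) the system is solved iteratively, e.g. by Newton's method.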

Calibration weighting is a very powerful technique for reducing sampling variance when the auxiliary variables are related to the target characteristics of the survey. A key result in calibration theory was found by Deville and Särndal (1992): provided the sample size is large enough, the variance of the estimator \(\hat{Y}_{Calib} = \sum_{j \in s} \omega_j y_j\) based on the “calibrated” weights is asymptotically equal to the variance of the estimator based on the initial weights \(\left(d_j\right)_{j \in s}\), with the regression residuals as the variable of interest:

\[\begin{equation} \displaystyle{V\left(\hat{Y}_{Calib}\right) \approx V\left(\hat{E}_{HT}\right)} \tag{8.5} \end{equation}\]

where \(\displaystyle{\hat{E}_{HT}=\sum_{i \in s} \frac{y_i - \hat{B}x_i}{\pi_i}=\sum_{i \in s} d_i \left(y_i - \hat{B}x_i\right)}\).

Thus, if the variables \(x_i\) are well related to the study variable \(y\), then the regression residuals are close to zero and so is the approximate variance (8.5) of the calibration estimator.

Another interesting result is that the calibration estimator is asymptotically unbiased.

8.4 Software Implementation

Calibration is a commonly used technique in survey sampling to increase sampling precision. Many software tools are now available to compute calibrated weights from a set of population benchmarks:

  • The SAS macro CALMAR, developed by the French National Statistical Institute (INSEE)
  • The Stata Calibrate command
  • The R package sampling
  • The R Shiny application Calif, developed by the National Statistical Institute of Slovakia

As a weight adjustment technique, calibration modifies the distribution of the sample weights. It is therefore important to check for extreme values in the final weight distribution in order to keep the effects of calibration under control and avoid unexpected results. This is why “bounded” methods are often recommended in practice, in which the ratios between final and initial weights are kept within two predefined limits.
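A minimal sketch of such a diagnostic check (the function name and the bounds 0.3 and 3.0 are illustrative placeholders, not taken from any particular package):

```python
import numpy as np

def flag_extreme_ratios(d, w, low=0.3, high=3.0):
    """Flag units whose weight ratio w_j / d_j falls outside [low, high]."""
    r = np.asarray(w) / np.asarray(d)
    return (r < low) | (r > high)

# Hypothetical initial and calibrated weights for four sample units.
d = np.array([10.0, 10.0, 10.0, 10.0])
w = np.array([ 9.0, 12.0, 45.0,  2.0])   # ratios 0.9, 1.2, 4.5, 0.2

print(flag_extreme_ratios(d, w))         # [False False  True  True]
```

Units flagged in this way would either be handled by re-running the calibration with a bounded method, or inspected individually before the final weights are accepted.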