Chapter 7 Unit non-response

7.1 The issue of unit non-response

All the bias and variance formulae we have seen so far assume that survey information is collected from all sample units without error. This is not the case in practice. In fact, all surveys suffer from unit non-response, which occurs when a proportion of the initially selected sample units are not interviewed. There are many possible reasons for this: strong refusal to participate, long-term absence, health problems, language barriers, difficulties in accessing the dwelling, etc.

There is also item non-response, where a unit participates in the survey but part of the response is missing (e.g. the interviewee may refuse to answer certain questions that are considered too intrusive or may not know the answer). In this chapter only the issue of unit non-response is discussed.

Unit non-response is a key issue for the quality of survey data. It is a potential source of bias if non-responding units differ from responding units on key survey characteristics. In addition, unit non-response makes survey estimates more volatile because the sample size achieved is actually smaller than the size of the sample originally drawn.

7.2 Dealing with unit non-response

The traditional approach to reducing the bias caused by unit non-response is to modify the weights of sample units to take account of their ‘propensities’ to respond. Sample weights are inherent in survey data in the sense that they are needed to extrapolate from sample observations to estimates for the entire target population of the survey. In effect, sample weights aim to reduce bias due to

  • unequal inclusion probabilities between sample units
  • coverage errors, i.e. part of the target population not being included in the sampling frame
  • differential unit non-response

Following the Horvitz-Thompson approach, the design weight of each unit is defined as the inverse of its inclusion probability. For \(i \in U\) we have \[ d_i = \displaystyle{\frac{1}{\pi_i}}\]

For example, in the case of simple random sampling without replacement of size \(n\), the inclusion probability of \(i\) is \(\pi_i = n/N\) and the design weight is therefore equal to \(N/n\) for all units \(i \in U\). Similarly, in the case of stratified simple random sampling, we have \(d_i = N_{h_i} / n_{h_i}\), where \(h_i\) is the stratum to which the unit \(i\) belongs.
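As a minimal numerical sketch of stratified design weights (the stratum sizes below are illustrative figures, not from the text):

```python
import numpy as np

# Illustrative example: a population of N = 10 000 split into three strata,
# from which a stratified simple random sample is drawn.
N_h = np.array([5_000, 3_000, 2_000])   # stratum population sizes
n_h = np.array([100, 150, 100])          # stratum sample sizes

# Design weight of a unit in stratum h: d_i = N_h / n_h,
# i.e. the inverse of the inclusion probability pi_i = n_h / N_h
d_h = N_h / n_h
print(d_h)                # -> [50. 20. 20.]

# Sanity check: the design weights of the sampled units sum to N
print((d_h * n_h).sum())  # -> 10000.0
```

Summing the weights over the sample reproduces the population size, which is exactly the extrapolation role of the weights described above.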

According to Horvitz-Thompson theory, assuming full response and no measurement error in the data, design weights lead to unbiased estimators for linear parameters. In the presence of unit non-response, the weights must be adjusted if they are to remain valid for statistical inference.

The traditional approach to dealing with unit non-response is to estimate response ‘propensities’ for each responding unit using logistic modelling, and then use the estimated propensities to adjust the design weights.

Let \(X_i\) be a vector of response predictors for unit \(i\), which are assumed to be known for both respondents and non-respondents. They usually come from auxiliary sources such as sampling frames, population censuses, administrative databases or registers (population or business). Using \(X_i\) as explanatory variables, the estimated response propensity of \(i\) is given by \[\hat{p}_i = \displaystyle{\frac{e^{\hat{A}^{\top}X_i}}{1+e^{\hat{A}^{\top}X_i}}}\]

where \(\hat{A}\) is the estimated vector of model parameters. Estimation in logistic regression is usually done by maximum likelihood.

Therefore, the non-response adjusted design weights are \[d^{adjNR}_i =\displaystyle{\frac{d_i}{\hat{p}_i}}\]
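The two steps above, fitting the logistic propensity model by maximum likelihood and then dividing the design weights by the fitted propensities, can be sketched as follows. The data are simulated and the simple Newton-Raphson fit is one possible implementation, not the text's prescribed one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample of n units with one predictor (plus an intercept)
# known for both respondents and non-respondents, e.g. from the frame.
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
d = np.full(n, 50.0)                       # design weights d_i = 1/pi_i

# Simulated 0/1 response indicator R under an assumed logistic mechanism
true_A = np.array([0.5, 1.0])
R = rng.binomial(1, 1 / (1 + np.exp(-X @ true_A)))

# Maximum-likelihood fit of the logistic model by Newton-Raphson
A = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ A))
    grad = X.T @ (R - p)                             # score vector
    hess = X.T @ (X * (p * (1 - p))[:, None])        # observed information
    A += np.linalg.solve(hess, grad)

p_hat = 1 / (1 + np.exp(-X @ A))                     # estimated propensities

# Non-response adjusted weights for the responding units only:
# d_i^adjNR = d_i / p_hat_i, never smaller than d_i since p_hat_i < 1
d_adj = d[R == 1] / p_hat[R == 1]
```

In practice the model would be fitted with a standard routine (e.g. a GLM library) rather than hand-coded, but the adjustment step \(d^{adjNR}_i = d_i/\hat{p}_i\) is the same.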

7.3 Operational aspects

Adjustment requires auxiliary information, available for both respondents and non-respondents, to describe the response mechanism. To be effective in reducing non-response bias, the auxiliary information used must be related to both the survey response and the target characteristics of the survey. Indeed, the availability of such information is an important practical consideration that often limits the scope for non-response correction. Examples of external sources include the sampling frame, which often contains additional information for each unit in the population, population censuses or administrative databases.

The important result is that estimators based on the weights \(d^{adjNR}_i\) adjusted for unit non-response remain approximately unbiased, provided the estimated response propensities \(\hat{p}_i\) are accurate. The trade-off is an increase in sampling variance: the achieved sample size is smaller than that of the initial sample, and the corrected weights may take extreme values, which makes the weight distribution more volatile and the survey estimators less stable. To avoid extreme values, an alternative approach to estimating response propensities is to construct so-called ‘response homogeneity groups’, within which each unit is assumed to have the same propensity to respond; this common propensity is estimated by the mean response rate within the group.
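The response-homogeneity-group adjustment can be sketched as follows (the group labels, response indicators and weights are hypothetical illustrations):

```python
import numpy as np

# Hypothetical sample of 10 units, each belonging to one of two response
# homogeneity groups (e.g. defined by crossing region and age band).
group = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])   # group membership
R     = np.array([1, 1, 0, 1, 1, 0, 0, 1, 1, 0])   # 0/1 response indicator
d     = np.full(10, 100.0)                         # design weights

# Estimated propensity = mean response rate within the unit's group
rates = np.array([R[group == g].mean() for g in np.unique(group)])
p_hat = rates[group]          # group 0: 3/4 = 0.75, group 1: 3/6 = 0.5

# Adjusted weights for the respondents: d_i / p_hat_i
d_adj = d[R == 1] / p_hat[R == 1]

# With equal design weights within groups, the respondents' adjusted
# weights sum back to the total design weight of the full sample.
print(d_adj.sum())  # -> 1000.0, equal to d.sum()
```

Because propensities are averaged over whole groups, they are bounded away from very small values as long as each group contains enough respondents, which is what keeps the adjusted weights from becoming extreme.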