Chapter 10 Survey data and statistical modelling

The use of survey data in statistical modelling, such as regression analysis, is a common and valuable approach in the fields of economics and finance in order to examine the relationship between one or more independent variables and a dependent variable. Traditional modelling approaches typically assume that the observations are independent and identically distributed. However, when working with survey data these assumptions are often violated and we have to take into account design features, such as unequal inclusion probabilities, stratification, clustering, re-weighting for unit non-response and calibration to external data sources.

10.1 The case of linear regression

In the traditional approach, we assume the following linear relationship \(Y_i = X_i B + \epsilon_i\) between a response variable \(Y_i\) and a vector of \(p\) explanatory variables \(X_i=\left(X^1_i, X^2_i, \cdots, X^p_i\right)\).

\(B = \left(B_1 , B_2, \dots, B_p \right)^T\) is the vector of the \(p\) model parameters and \(\epsilon_i~~\left(i = 1, \cdots, n\right)\) are the model residuals, assumed to be independent and identically distributed realisations of a normal distribution with mean 0 and variance \(\sigma^2\).

The vector \(B\) of model parameters is estimated using the least-squares criterion \(\min_{B} \sum_{i=1}^n \left(Y_i - X_i B\right)^2\):

\[ \hat{B}_{LS} = \left(\sum_{i=1}^n X_i^T X_i\right)^{-1} \left( \begin{array}{c} \sum_{i=1}^n X^1_i Y_i \\ \\ \sum_{i=1}^n X^2_i Y_i \\ \cdots \\ \sum_{i=1}^n X^p_i Y_i \end{array} \right) \]

In survey settings, the model coefficients \(B\) are considered as a vector of finite population parameters, defined as the least-squares fit that would be obtained if the whole population \(U\) were observed:

\[ B = \left(\sum_{i \in U} X_i^T X_i\right)^{-1} \left( \begin{array}{c} \sum_{i \in U} X^1_i Y_i \\ \\ \sum_{i \in U} X^2_i Y_i \\ \cdots \\ \sum_{i \in U} X^p_i Y_i \end{array} \right) \]

which is estimated from the sample \(s\), using the survey weights \(\omega_i\), by:

\[ \hat{B} = \left(\sum_{i \in s} \omega_i X_i^T X_i\right)^{-1} \left( \begin{array}{c} \sum_{i \in s} \omega_i X^1_i Y_i \\ \\ \sum_{i \in s} \omega_i X^2_i Y_i \\ \cdots \\ \sum_{i \in s} \omega_i X^p_i Y_i \end{array} \right) \]
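As an illustration, the weighted estimator can be sketched in plain Python for the special case of an intercept and a single explanatory variable, that is \(X_i = (1, x_i)\). In that case the weighted normal equations have a closed-form solution, which is equivalent to the matrix expression above. The function name and the sample values are hypothetical and serve only as an example.

```python
def weighted_least_squares(x, y, w):
    """Weighted estimates (intercept, slope) for the model y = b0 + b1 * x.

    x, y: values observed on the sampled units.
    w:    survey weights (the omega_i of the text).
    """
    n_hat = sum(w)  # weighted estimate of the population size
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / n_hat
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / n_hat
    # Weighted cross-products around the weighted means
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical sample lying exactly on y = 1 + 2x, with unequal weights
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
w = [10.0, 20.0, 5.0, 15.0]
b0, b1 = weighted_least_squares(x, y, w)
```

Because the hypothetical points lie exactly on a line, any set of positive weights recovers the same intercept and slope; with noisy data the weighted and unweighted fits generally differ.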

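The variance of \(\hat{B}\) can be estimated using the linearisation technique. In the case of a single explanatory variable with an intercept, the linearised variable of the slope estimator is:

\[ \widetilde{z}_k = \displaystyle{\frac{1}{\hat{N}\hat{\sigma}^2_x}}\left(X_k - \displaystyle{\frac{\hat{X}}{\hat{N}}}\right) \left[\left(Y_k - \displaystyle{\frac{\hat{Y}}{\hat{N}}}\right) - \hat{B} \left(X_k -\displaystyle{\frac{\hat{X}}{\hat{N}}}\right)\right] \]

where \(\hat{N}\), \(\hat{X}\) and \(\hat{Y}\) are the weighted estimates of the population size and of the totals of \(X\) and \(Y\), and \(\hat{\sigma}^2_x\) is the estimated population variance of \(X\). The variance of \(\hat{B}\) is then approximated by the variance, under the sampling design, of the estimated total \(\sum_{k \in s} \omega_k \widetilde{z}_k\).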
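The linearisation step can be sketched as follows for a slope with an intercept. The sketch takes \(\hat{N}\), \(\hat{X}\) and \(\hat{Y}\) to be weighted totals and \(\hat{\sigma}^2_x\) to be the weighted variance of \(X\). The final variance formula is an assumption made for the example: it is the usual with-replacement estimator of the variance of a total, whereas a real survey should use the estimator that matches its own design (strata, clusters, calibration). Function name and data are hypothetical.

```python
def linearised_slope_variance(x, y, w):
    """Return (slope, linearised values z_k, variance estimate of the slope).

    ASSUMPTION: the variance of sum(w_k * z_k) is estimated with the
    with-replacement formula n / (n - 1) * sum((w_k * z_k - mean)^2).
    """
    n = len(x)
    n_hat = sum(w)                                # estimated population size
    x_hat = sum(wk * xk for wk, xk in zip(w, x))  # estimated total of X
    y_hat = sum(wk * yk for wk, yk in zip(w, y))  # estimated total of Y
    x_bar, y_bar = x_hat / n_hat, y_hat / n_hat
    var_x = sum(wk * (xk - x_bar) ** 2 for wk, xk in zip(w, x)) / n_hat
    cov_xy = sum(wk * (xk - x_bar) * (yk - y_bar)
                 for wk, xk, yk in zip(w, x, y)) / n_hat
    slope = cov_xy / var_x
    # Linearised variable z_k of the slope, one value per sampled unit
    z = [(xk - x_bar) * ((yk - y_bar) - slope * (xk - x_bar)) / (n_hat * var_x)
         for xk, yk in zip(x, y)]
    # With-replacement variance estimator of the total sum(w_k * z_k)
    wz = [wk * zk for wk, zk in zip(w, z)]
    mean_wz = sum(wz) / n
    variance = n / (n - 1) * sum((v - mean_wz) ** 2 for v in wz)
    return slope, z, variance


# Hypothetical sample with unequal weights
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
w = [10.0, 20.0, 5.0, 15.0, 10.0]
slope, z, variance = linearised_slope_variance(x, y, w)
standard_error = variance ** 0.5
```

A useful sanity check is that the weighted sum of the \(\widetilde{z}_k\) is zero up to rounding, since the slope estimate satisfies the weighted normal equations.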

10.2 The case of logistic regression

Binary logistic models assume the following relationship between the probability \(p_i\) of a response category and a vector of predictor variables \(X_i\):

\[p_i = \displaystyle{\frac{e^{AX_i}}{1+e^{AX_i}}}\]

where \(A\) is the vector of model parameters. Estimation in logistic regression is usually done by maximising the likelihood function:

\[ \displaystyle{\hat{A} = argmax_{A} \prod_{i/y_i=1}\left(\frac{e^{AX_i}}{1+e^{AX_i}}\right)\prod_{j/y_j=0}\left(1-\frac{e^{AX_j}}{1+e^{AX_j}}\right)}\]

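or, equivalently, by finding the zeros of the derivative of the log-likelihood function. The corresponding weighted estimator of \(A\) maximises the pseudo-likelihood, in which the contribution of each sampled unit is raised to the power of its weight (multiplying each factor by its weight would leave the maximiser unchanged):

\[ \displaystyle{\hat{A} = argmax_{A} \prod_{i/y_i=1} \left(\frac{e^{AX_i}}{1+e^{AX_i}}\right)^{\omega_i}\prod_{j/y_j=0} \left(1-\frac{e^{AX_j}}{1+e^{AX_j}}\right)^{\omega_j}}\]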
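A minimal sketch of weighted logistic estimation is given below for an intercept and one predictor. It assumes that each unit's log-likelihood contribution is multiplied by its weight \(\omega_i\), and it finds the zeros of the weighted score with Newton-Raphson iterations. The function name, starting values, tolerance and data are hypothetical choices for the example, and the data are deliberately not perfectly separable so that a finite maximum exists.

```python
import math


def weighted_logistic(x, y, w, max_iter=50, tol=1e-10):
    """Fit p = 1 / (1 + exp(-(a0 + a1 * x))) by weighted Newton-Raphson.

    Solves the weighted score equations sum_i w_i * (y_i - p_i) * X_i = 0
    with X_i = (1, x_i).
    """
    a0, a1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0          # weighted score (gradient of the log-likelihood)
        h00 = h01 = h11 = 0.0  # weighted information matrix (2 x 2, symmetric)
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(a0 + a1 * xi)))
            r = wi * (yi - p)
            v = wi * p * (1.0 - p)
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        # Newton step: solve the 2 x 2 system H * d = g by Cramer's rule
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        a0 += d0
        a1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return a0, a1


# Hypothetical sample: binary response, one predictor, integer weights
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 1, 0, 1, 1]
w = [1.0, 2.0, 1.0, 3.0, 1.0, 2.0]
a0, a1 = weighted_logistic(x, y, w)
```

With integer weights, the result coincides with an unweighted fit on a data set in which each unit is repeated \(\omega_i\) times, which is one way to read the role of the weights in the estimator.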

10.3 To weight or not to weight?

Whether to use a traditional model-based or a design-based approach to regression analysis is a recurring question to which there is no simple answer.

Weighting protects against bias, so weights should be used when features of the design or of the data collection (unequal inclusion probabilities, stratification, clustering, high non-response, calibration) create a high risk of biased estimates. The price to pay for using weights, however, is an increase in sampling variance.
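One rough way to gauge this variance cost, which is not part of the text above, is Kish's approximate design effect due to unequal weighting, \(n \sum_{i \in s} \omega_i^2 / \left(\sum_{i \in s} \omega_i\right)^2\). It is only an approximation, derived under the assumption that the weights are unrelated to the study variable. The function name and the weights below are hypothetical.

```python
def kish_design_effect(w):
    """Kish's approximate variance inflation factor due to unequal weights.

    ASSUMPTION: weights are unrelated to the study variable; the value is
    a rough indicator, not a design-based variance estimate.
    """
    n = len(w)
    return n * sum(wi ** 2 for wi in w) / sum(w) ** 2


equal = kish_design_effect([2.0, 2.0, 2.0, 2.0])    # 4 * 16 / 64
unequal = kish_design_effect([1.0, 2.0, 3.0, 4.0])  # 4 * 30 / 100
```

Equal weights give a factor of 1 (no inflation), and the factor grows as the weights become more dispersed.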

When using models with survey data, it is recommended to test the effect of including weights on the regression results. If the coefficient estimates change little while their standard errors increase, then the weights have no significant effect on the estimates and mainly add variance. Whatever the data used, it is essential that the model is well specified.
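The comparison can be sketched as follows for a simple line fit: the same model is estimated with and without the weights and the relative change in the slope is reported. The function names, the dictionary keys and the data are hypothetical. Standard errors are left out of the sketch because they require a variance estimator that matches the sampling design.

```python
def fit_line(x, y, w):
    """Weighted least-squares (intercept, slope) for y = b0 + b1 * x."""
    n_hat = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / n_hat
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / n_hat
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def compare_weighted_unweighted(x, y, w):
    """Fit the same line with and without weights and report the change."""
    _, unweighted = fit_line(x, y, [1.0] * len(x))  # unit weights = unweighted
    _, weighted = fit_line(x, y, w)
    return {
        "unweighted_slope": unweighted,
        "weighted_slope": weighted,
        "relative_change": (weighted - unweighted) / abs(unweighted),
    }


# Hypothetical sample in which the heavily weighted units follow a steeper trend
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 2.1, 2.9, 4.2, 7.5, 9.1]
w = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0]
result = compare_weighted_unweighted(x, y, w)
```

A large relative change suggests that the weights carry information the unweighted model misses, which may also point to a misspecified model.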

In summary, the use of survey data in regression analysis requires careful consideration of variable types, sampling methods and potential problems associated with the data. Properly addressing these considerations will increase the reliability and validity of your regression results.