Chapter 11 Survey data and statistical modeling
The use of survey data in statistical modeling, such as regression analysis, is a common and valuable approach in economics and finance for examining the relationship between one or more independent variables and a dependent variable. Traditional modeling approaches typically assume that the observations are independent and identically distributed. When working with survey data, however, these assumptions are often violated, and the analysis must account for complex design features such as unequal inclusion probabilities, stratification, clustering, reweighting for unit non-response and calibration to external data sources.
11.1 The case of linear regression
In the traditional approach, we assume the following linear relationship \(Y_i = X_i \beta + \epsilon_i\) between a response variable \(Y_i\) and a vector of \(p\) explanatory variables \(X_i=\left(X^1_i, X^2_i, \dots, X^p_i\right)\).
\(\beta = \left({\beta}_1 , {\beta}_2, \dots, {\beta}_p \right)^T\) is the vector of the \(p\) model parameters and \(\epsilon_i~~\left(i = 1, \dots, n\right)\) are the model residuals, which are assumed to be independent and identically distributed realizations of a normal distribution with mean 0 and variance \(\sigma^2\).
The vector \(\beta\) of model parameters is estimated using the least-squares criterion \(\min_{\beta} \sum_{i=1}^n \left(Y_i - X_i \beta\right)^2\):
\[ \hat{\beta}_{LS} = \left(\sum_{i=1}^n X_i^T X_i\right)^{-1} \left( \begin{array}{c} \sum_{i=1}^n X^1_i Y_i \\ \sum_{i=1}^n X^2_i Y_i \\ \vdots \\ \sum_{i=1}^n X^p_i Y_i \end{array} \right) \]
Under the model assumptions, \(\hat{\beta}_{LS}\) is the Best Linear Unbiased Estimator (BLUE) of \(\beta\) among all possible linear combinations of the observations.
In the survey setting, the parameter \(\beta\) is estimated by incorporating the sampling weights \(\left( {\omega}_i , i \in s \right)\) into the formula:
\[ \hat{\beta}_{\omega} = \left(\sum_{i \in s} \omega_i X_i^T X_i\right)^{-1} \left( \begin{array}{c} \sum_{i \in s} \omega_i X^1_i Y_i \\ \sum_{i \in s} \omega_i X^2_i Y_i \\ \vdots \\ \sum_{i \in s} \omega_i X^p_i Y_i \end{array} \right) \]
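Equivalently, the weighted estimator can be written in compact matrix form (with notation introduced here for convenience: \(X_s\) is the matrix whose rows are the \(X_i\), \(Y_s\) the vector of sampled responses and \(W_s = \mathrm{diag}\left(\omega_i, i \in s\right)\)):
\[ \hat{\beta}_{\omega} = \left(X_s^T W_s X_s\right)^{-1} X_s^T W_s Y_s \]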
In this case, \(\hat{\beta}_{\omega}\) is an unbiased estimator of the population regression coefficient \(B\) under the survey sampling design:
\[ B = \left(\sum_{i \in U} X_i^T X_i\right)^{-1} \left( \begin{array}{c} \sum_{i \in U} X^1_i Y_i \\ \sum_{i \in U} X^2_i Y_i \\ \vdots \\ \sum_{i \in U} X^p_i Y_i \end{array} \right) \]
Unlike \(\beta\), which is a model parameter, \(B\) is a finite population quantity. Hence \(\hat{\beta}_{\omega}\) is a biased estimator of \(\beta\), whose bias is of order \(1/N\), where \(N\) is the size of the population: \(\hat{\beta}_{\omega}\) is asymptotically unbiased.
The variance of \(\hat{\beta}_{\omega}\) can be estimated using the linearisation technique. For instance, in the case of a single explanatory variable, the linearised variable of the estimated coefficient \(\hat{B}\) is:
\[ \widetilde{z}_k = \displaystyle{\frac{1}{\hat{N}\hat{\sigma}^2_x}}\left(X_k - \displaystyle{\frac{\hat{X}}{\hat{N}}}\right) \left[\left(Y_k - \displaystyle{\frac{\hat{Y}}{\hat{N}}}\right) - \hat{B} \left(X_k -\displaystyle{\frac{\hat{X}}{\hat{N}}}\right)\right] \]
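The design variance of \(\hat{B}\) is then approximated by the variance of the weighted total of the linearised variable, which can be estimated with the usual variance estimator for a total under the given design (for example the Horvitz-Thompson variance estimator); this is a sketch of the standard linearisation step:
\[ \hat{V}\left(\hat{B}\right) \approx \hat{V}\left(\sum_{k \in s} \omega_k \widetilde{z}_k\right) \]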
All design features must be taken into account when fitting the model to the data in order to obtain reliable and accurate variance estimators. The regression commands in STATA allow this, as long as the svy: prefix is specified at the beginning of the regression command.
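A minimal sketch of the workflow, assuming hypothetical variable names (wgt for the sampling weight, stratum and psu for the stratum and cluster identifiers, y, x1 and x2 for the model variables):

* Declare the survey design once: clusters, strata and sampling weights
svyset psu [pweight=wgt], strata(stratum)

* Design-based linear regression of y on x1 and x2
svy: regress y x1 x2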
The output of the regression is similar to that of the traditional i.i.d. regression command. However, there are small differences, such as the number of degrees of freedom, which is based on the numbers of sampled clusters and strata rather than on the number of observations, or the way the \(R^2\) coefficient is calculated.
11.2 The case of logistic regression
Binary logistic models assume the following relationship between the probability of a response category and a vector of predictor variables:
\[p_i = \displaystyle{\frac{e^{AX_i}}{1+e^{AX_i}}}\]
where \(A\) is the vector of model parameters. Estimation in logistic regression is usually done by maximizing the likelihood function:
\[ \hat{A} = \operatorname{argmax}_{A} \prod_{i:\, y_i=1}\left(\frac{e^{AX_i}}{1+e^{AX_i}}\right)\prod_{j:\, y_j=0}\left(1-\frac{e^{AX_j}}{1+e^{AX_j}}\right)\]
or, equivalently, by finding the zeros of the derivative of the log-likelihood function.
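Setting this derivative to zero yields the familiar score equations, written here with the model probabilities made explicit:
\[ \sum_{i=1}^n \left(y_i - \frac{e^{AX_i}}{1+e^{AX_i}}\right) X_i^T = 0 \]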
In the survey setting, the corresponding weighted (pseudo-maximum-likelihood) estimator of \(A\) maximizes the weighted likelihood, in which each sample contribution is raised to the power of its sampling weight:
\[ \hat{A}_{\omega} = \operatorname{argmax}_{A} \prod_{i:\, y_i=1} \left(\frac{e^{AX_i}}{1+e^{AX_i}}\right)^{\omega_i}\prod_{j:\, y_j=0} \left(1-\frac{e^{AX_j}}{1+e^{AX_j}}\right)^{\omega_j}\]
In general, the survey commands available in STATA are flexible enough to accommodate most models encountered in practice.
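For instance, a design-based logistic regression follows the same pattern as the linear case; a minimal sketch, assuming the svyset declaration above and a binary outcome y:

svy: logit y x1 x2

* Report odds ratios instead of coefficients, if preferred
svy: logit y x1 x2, or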
11.3 To weight or not to weight?
Whether to use a traditional model-based or survey-based approach to regression analysis is a recurring question to which there is no simple answer.
Weighting is clearly a protection against selection bias, so weights should be used whenever there is a high risk that unweighted estimates are biased, for example when the design involves unequal inclusion probabilities, stratification, clustering, high non-response or calibration. However, the cost of using weights is an increase in sampling variance.
When using models with survey data, it is recommended to test the effect of including the weights on the regression results, as sketched below. If the weighted and unweighted point estimates are close and the main effect of the weights is to increase the standard errors of the regression coefficients, then the weights have no substantial effect. On the other hand, if the difference between unweighted and weighted estimates is large, then the data should be examined further. Regardless of the data used, it is important that the model is well specified.
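A minimal sketch of such a comparison in STATA, again with the hypothetical variables used above and the design already declared with svyset:

* Unweighted, model-based fit
regress y x1 x2
estimates store ols

* Design-based fit using the sampling weights
svy: regress y x1 x2
estimates store svyreg

* Compare coefficients and standard errors side by side
estimates table ols svyreg, se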
In summary, the use of survey data in regression analysis requires careful consideration of variable types, sampling methods, and potential problems associated with the data. Properly addressing these considerations will increase the reliability and validity of your regression results.