Chapter 11 Survey data and statistical modelling
Survey data are commonly used in statistical modelling, such as regression analysis, in economics and finance to examine the relationship between a dependent variable and one or more independent variables. Traditional modelling approaches typically assume that the observations are independent and identically distributed. With survey data, however, these assumptions are often violated, and the analysis has to take into account design features such as unequal inclusion probabilities, stratification, clustering, re-weighting for unit non-response and calibration to external data sources.
11.1 The case of linear regression
In the traditional approach, we assume the following linear relationship $Y_i = X_i B + \epsilon_i$ between a response variable $Y_i$ and a vector of $p$ explanatory variables $X_i = (X_{1i}, X_{2i}, \dots, X_{pi})$.
Here $B = (B_1, B_2, \dots, B_p)^T$ is the vector of the $p$ model parameters and the $\epsilon_i$ ($i = 1, \dots, n$) are the model residuals, assumed to be independent and identically distributed realisations of a normal distribution with mean 0 and variance $\sigma^2$.
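The vector $B$ of model parameters is estimated using the least-squares criterion $\min_B \sum_{i=1}^{n} (Y_i - X_i B)^2$:
$$\hat{B}_{LS} = \left(\sum_{i=1}^{n} X_i^T X_i\right)^{-1} \begin{pmatrix} \sum_{i=1}^{n} X_{1i} Y_i \\ \sum_{i=1}^{n} X_{2i} Y_i \\ \vdots \\ \sum_{i=1}^{n} X_{pi} Y_i \end{pmatrix}$$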
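As a minimal sketch of the least-squares estimator above, the normal equations can be solved directly with NumPy. The function name `ols_normal_equations` and the toy data are illustrative assumptions, not taken from the text; a column of ones is added to the design matrix to obtain an intercept.

```python
import numpy as np


def ols_normal_equations(X, y):
    """Least-squares estimate B_LS = (sum_i X_i^T X_i)^{-1} (sum_i X_i^T Y_i).

    X: (n, p) array with one row X_i per observation.
    y: (n,) array of responses Y_i.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    xtx = X.T @ X  # sum_i X_i^T X_i, shape (p, p)
    xty = X.T @ y  # vector of sum_i X_ji Y_i, shape (p,)
    # Solving the linear system is numerically safer than forming the inverse.
    return np.linalg.solve(xtx, xty)


# Toy data (made up for illustration): Y = 1 + 2 X exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
X_toy = np.column_stack([np.ones_like(x), x])  # intercept column + x
y_toy = 1.0 + 2.0 * x
B_ls = ols_normal_equations(X_toy, y_toy)
print(B_ls)
```

Because the toy responses lie exactly on a line, the estimate recovers the intercept 1 and the slope 2 up to floating-point rounding.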
In survey settings, the model coefficients B are considered as a vector of finite population parameters:
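$$B = \left(\sum_{i \in U} \omega_i X_i^T X_i\right)^{-1} \begin{pmatrix} \sum_{i \in U} \omega_i X_{1i} Y_i \\ \sum_{i \in U} \omega_i X_{2i} Y_i \\ \vdots \\ \sum_{i \in U} \omega_i X_{pi} Y_i \end{pmatrix}$$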
which is estimated by:
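$$\hat{B} = \left(\sum_{i \in s} \omega_i X_i^T X_i\right)^{-1} \begin{pmatrix} \sum_{i \in s} \omega_i X_{1i} Y_i \\ \sum_{i \in s} \omega_i X_{2i} Y_i \\ \vdots \\ \sum_{i \in s} \omega_i X_{pi} Y_i \end{pmatrix}$$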
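The weighted estimator computed on the sample can be sketched in the same way, with each unit's contribution multiplied by its weight. The function name `weighted_ls`, the toy responses and the toy weights are assumptions made for illustration only.

```python
import numpy as np


def weighted_ls(X, y, w):
    """Survey-weighted estimate (sum_s w_i X_i^T X_i)^{-1} (sum_s w_i X_i^T Y_i).

    X: (n, p) array of sampled rows X_i; y: (n,) responses; w: (n,) weights.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    xtwx = X.T @ (w[:, None] * X)  # sum_s w_i X_i^T X_i
    xtwy = X.T @ (w * y)           # vector of sum_s w_i X_ji Y_i
    return np.linalg.solve(xtwx, xtwy)


# Toy sample (made up): intercept column + one explanatory variable.
x = np.array([0.0, 1.0, 2.0, 3.0])
X_toy = np.column_stack([np.ones_like(x), x])
y_toy = np.array([1.0, 2.5, 5.5, 7.0])
w_toy = np.array([1.0, 2.0, 1.0, 3.0])  # hypothetical survey weights
B_w = weighted_ls(X_toy, y_toy, w_toy)
print(B_w)
```

With integer weights, the result coincides with an unweighted fit on a data set in which each unit is repeated as many times as its weight, which is a convenient way to check the point estimate. It says nothing about the variance, which depends on the sampling design.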
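The variance of $\hat{B}$ can be estimated using the linearisation technique. For a single explanatory variable, the linearised variable is:
$$\tilde{z}_k = \frac{1}{\hat{N}\hat{\sigma}_x^2}\left(X_k - \frac{\hat{X}}{\hat{N}}\right)\left[\left(Y_k - \frac{\hat{Y}}{\hat{N}}\right) - \hat{B}\left(X_k - \frac{\hat{X}}{\hat{N}}\right)\right]$$
where $\hat{N}$, $\hat{X}$ and $\hat{Y}$ are the weighted estimates of the population size and of the totals of $X$ and $Y$, and $\hat{\sigma}_x^2$ is the estimated variance of $X$.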
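The linearised variable can be computed unit by unit, and the variance of the slope is then approximated by the variance of the weighted total of that variable. The sketch below assumes a single explanatory variable and, for the last step only, a very simple design (units drawn with replacement, no strata or clusters); a real survey would call for the variance formula matching its actual design. All names and data are hypothetical.

```python
import numpy as np


def linearised_slope(x, y, w):
    """Weighted slope B_hat and its linearised variable z_k (one regressor)."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    N_hat = w.sum()                                 # estimated population size
    x_bar = (w * x).sum() / N_hat                   # X_hat / N_hat
    y_bar = (w * y).sum() / N_hat                   # Y_hat / N_hat
    var_x = (w * (x - x_bar) ** 2).sum() / N_hat    # estimated variance of X
    cov_xy = (w * (x - x_bar) * (y - y_bar)).sum() / N_hat
    B_hat = cov_xy / var_x
    z = (x - x_bar) * ((y - y_bar) - B_hat * (x - x_bar)) / (N_hat * var_x)
    return B_hat, z


def var_weighted_total_wr(z, w):
    """Variance estimate of sum_k w_k z_k.

    ASSUMPTION: single-stage sampling with replacement, no strata or clusters.
    """
    t = np.asarray(w, dtype=float) * np.asarray(z, dtype=float)
    n = t.size
    return n / (n - 1) * ((t - t.mean()) ** 2).sum()


# Toy sample (made up for illustration).
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
w_toy = np.array([10.0, 20.0, 10.0, 30.0, 30.0])

B_hat, z = linearised_slope(x_toy, y_toy, w_toy)
var_B = var_weighted_total_wr(z, w_toy)
print(B_hat, np.sqrt(var_B))  # slope and its approximate standard error
```

By construction the weighted total of the linearised variable is zero, so only its variability across units contributes to the variance estimate.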
11.2 The case of logistic regression
Binary logistic models assume the following relationship between the propensity of a response category and a vector of predictor variables:
$$\hat{p}_i = \frac{e^{A X_i}}{1 + e^{A X_i}}$$
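where $A$ is the estimated vector of model parameters. Estimation in logistic regression is usually done by maximising the likelihood function:
$$\hat{A} = \arg\max_A \prod_{i \,:\, y_i = 1} \left(\frac{e^{A X_i}}{1 + e^{A X_i}}\right) \prod_{j \,:\, y_j = 0} \left(1 - \frac{e^{A X_j}}{1 + e^{A X_j}}\right)$$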
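or, equivalently, by finding the zeros of the derivative of the log-likelihood function. The corresponding weighted estimator of $A$ raises each unit's contribution to the power of its weight, which amounts to weighting each term of the log-likelihood:
$$\hat{A} = \arg\max_A \prod_{i \,:\, y_i = 1} \left(\frac{e^{A X_i}}{1 + e^{A X_i}}\right)^{\omega_i} \prod_{j \,:\, y_j = 0} \left(1 - \frac{e^{A X_j}}{1 + e^{A X_j}}\right)^{\omega_j}$$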
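A weighted logistic fit can be sketched by solving the weighted score equations, that is, by setting the derivative of the weighted log-likelihood to zero, with a few Newton-Raphson iterations. This is an illustrative implementation under simple assumptions (no separation in the data, a small number of predictors); the function name `weighted_logit` and the toy data are hypothetical.

```python
import numpy as np


def weighted_logit(X, y, w=None, n_iter=25, tol=1e-10):
    """Maximise sum_i w_i [y_i log p_i + (1 - y_i) log(1 - p_i)] by Newton-Raphson.

    X: (n, p) predictors (include a column of ones for an intercept),
    y: (n,) array of 0/1 responses, w: (n,) weights (all ones if omitted).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    A = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ A)))              # e^{AX} / (1 + e^{AX})
        score = X.T @ (w * (y - p))                     # weighted score vector
        info = X.T @ ((w * p * (1.0 - p))[:, None] * X)  # minus the Hessian
        step = np.linalg.solve(info, score)
        A = A + step
        if np.max(np.abs(step)) < tol:
            break
    return A


# Toy sample (made up); the two outcome groups overlap, so the maximum exists.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X_toy = np.column_stack([np.ones_like(x), x])
y_toy = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
w_toy = np.array([1.0, 2.0, 1.0, 3.0, 1.0, 2.0])  # hypothetical survey weights

A_w = weighted_logit(X_toy, y_toy, w_toy)
print(A_w)
```

Calling the function without weights gives the ordinary maximum likelihood fit, so the same code covers both estimators.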
11.3 To weight or not to weight?
Whether to use a traditional model-based or a design-based approach to regression analysis is a recurring question to which there is no simple answer.
Weighting protects against bias, so weights should be used when there is a high risk that unweighted estimates are biased (unequal inclusion probabilities, stratification, clustering, high non-response, calibration). The trade-off for using weights, however, is an increase in sampling variance.
When using models with survey data, it is recommended to test the effect of including weights on the regression results. If the estimated coefficients change little when weights are included, while their standard errors increase, then the weights have no significant effect. Regardless of the data used, it is important that the model is well specified.
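One simple way to look at the effect of weights is to fit the same linear model with and without them and compare the coefficients side by side. The sketch below covers only the point estimates; the comparison of standard errors would additionally need a variance estimator suited to the sampling design. The helper names and the toy data are hypothetical.

```python
import numpy as np


def fit_ls(X, y, w=None):
    """Least-squares fit; weighted if w is given, unweighted otherwise."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones(len(y)) if w is None else np.asarray(w, dtype=float)
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))


def compare_weighting(X, y, w):
    """Return unweighted and weighted coefficients and their difference."""
    b_unw = fit_ls(X, y)
    b_w = fit_ls(X, y, w)
    return {"unweighted": b_unw, "weighted": b_w, "difference": b_w - b_unw}


# Toy sample (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X_toy = np.column_stack([np.ones_like(x), x])
y_toy = np.array([1.2, 1.9, 3.4, 3.8, 5.3, 5.9])
w_toy = np.array([5.0, 5.0, 20.0, 20.0, 50.0, 50.0])  # hypothetical weights

result = compare_weighting(X_toy, y_toy, w_toy)
for name, coef in result.items():
    print(name, coef)
```

A small difference suggests that the weights matter little for this particular model; a large one points to a relationship between the weights and the response that the model does not capture.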
In summary, the use of survey data in regression analysis requires careful consideration of variable types, sampling methods and potential problems associated with the data. Properly addressing these considerations will increase the reliability and validity of your regression results.