2 Import data and data management

library(RCurl)
library(MASS)
library(glmnet)
library(caret)
library(survey)
library(readxl)
library(stringr)
library(forcats)
library(foreign)
library(magrittr)
library(tidyverse)
options(scipen = 9999)
options(dplyr.width = Inf)


set.seed(456162)

We first need to import data into R. In this guide we will use UK data from the 7th round of the European Social Survey (ESS). The advantage of this data is that the ESS is a well-documented, high-quality probability survey. Its documentation lets us understand how responses were collected and provides some useful information about non-respondents. In addition, the 7th ESS was weighted by expert statisticians, and the two phases of weighting they applied are explained on their website. This will allow us to compare our own weights and results with those already computed by their team of experts. Focusing on the UK sample will allow us to narrow down the analysis and speed up computation by reducing the amount of data used in each step.

For this guide we will use the following 7th ESS datafiles in SPSS (‘.sav’) format:

sample data (SDDF), edition 1.1, which contains the probability of being sampled for all respondents and non-respondents invited to the survey;
the data from contact forms, edition 2.1, which provides information about the process of data collection (e.g. number of times the person was approached for a response, ID of the interviewer in each approach, conditions of the house/area where the potential respondent lived). We will call this data the ‘paradata’ of the survey;
the integrated interviewer data file, edition 2.1. These are the responses to the survey.

The following sections explain data import, selection, merging and recoding. Readers who are not interested in technical details about the datasets can skip them and jump directly to the exploration and presentation of the data.

2.0.1 Import data

The following chunk of code loads the data sets from a data folder in the working directory. The sample data file is stored in the sample.data ‘data_frame’ object. The contact forms information is stored in the paradata object. Survey responses from the integrated interviewer file are saved in the responses object. We also store the weight variables included in the integrated interviewer file in the original.weights data_frame.

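# Sample design data file (SDDF), UK rows only
sample.data <- read.spss("data/ESS7SDDFe1_1.sav", to.data.frame = TRUE) %>%
  filter(cntry == "United Kingdom")

# Contact forms (paradata), UK rows only
paradata <- read.spss("data/ess7CFe02_1.sav", to.data.frame = TRUE) %>%
  filter(cntry == "United Kingdom")

# Survey responses (integrated file), UK rows only
responses <- read.spss("data/ESS7e02_1.sav", to.data.frame = TRUE) %>%
  filter(cntry == "United Kingdom")

# Weights computed by the ESS team, kept for later comparison
original.weights <- responses %>% select(idno, dweight, pspwght, pweight)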
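Before running the import it can help to confirm that the three files are where the code expects them. The following is a minimal sketch that reuses the paths from the chunk above; the object names files and missing.files are introduced here for illustration only.

```r
# Paths exactly as used in the import chunk
files <- c(sample.data = "data/ESS7SDDFe1_1.sav",
           paradata    = "data/ess7CFe02_1.sav",
           responses   = "data/ESS7e02_1.sav")

# Keep only the paths that cannot be found from the current working directory
missing.files <- files[!file.exists(files)]

if (length(missing.files) > 0) {
  message("Missing data files: ", paste(missing.files, collapse = ", "))
} else {
  message("All three data files were found.")
}
```

If any file is reported as missing, check the working directory with getwd() before calling read.spss().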

2.0.2 Select variables

Once the data has been read into R, we select the variables we are going to use in our analysis. Selecting variables is good practice, as the ESS files contain much more information than we need for this example. It will allow us to easily find and inspect the data that matters for this guide. Here we just write the names of the variables we intend to use; we will explain their content in more substantive terms later.

vars.sample.data <- c("idno", "psu", "prob")

vars.paradata <- c("idno", "typesamp", "interva", "telnum", 
                   "agea_1", "gendera1", "type", "access", 
                   "physa", "littera", "vandaa")

resp.id <- c("idno")

resp.y <- c("cgtsmke", "cgtsday",
         "alcfreq", "alcwkdy", "alcwknd")

resp.x <- c("vote", "prtvtbgb",
            "prtclbgb", "prtdgcl",
            "ctzcntr", "ctzshipc",
         "brncntr","cntbrthc",
         "gndr", "agea", "hhmmb","eisced", "region",
         "pdwrk", "edctn", "uempla", "uempli", "rtrd",
         "wrkctra", "hinctnta")

We will also keep the variable labels from the SPSS (.sav) file, although these are not so common in R.

selected.labels.sample.data <- attributes(sample.data)$variable.labels[which(names(sample.data) %in% vars.sample.data)]

selected.labels.paradata <- attributes(paradata)$variable.labels[which(names(paradata) %in% vars.paradata)]

selected.labels.responses <- attributes(responses)$variable.labels[which(names(responses) %in% c(resp.y, resp.x))] 

attributes(responses)$variable.labels %>% 
  cbind(names(responses),.) %>% 
  as_data_frame %>% 
  write_csv("interim_output/variable_labels.csv")

Now we select the variables from the three data sets, using the vectors of variable names defined above.

sample.data %<>% 
  .[vars.sample.data]

paradata %<>%
  .[vars.paradata]

responses %<>%
  .[which(names(responses) %in% c(resp.id, resp.y, resp.x))]
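The two selection styles above behave differently when a name is misspelled: indexing a data frame by a vector of names stops with an "undefined columns selected" error, whereas the which(... %in% ...) form silently drops the name. A quick check before subsetting makes either case visible. The sketch below uses a small made-up data frame (toy) and a deliberately absent variable ("stratum"), both hypothetical.

```r
# Toy data frame standing in for one of the ESS files (made-up values)
toy <- data.frame(idno = 1:3,
                  psu = c(10, 11, 12),
                  prob = c(0.1, 0.2, 0.1),
                  extra = c("a", "b", "c"))

# Requested variables; "stratum" does not exist in toy
wanted <- c("idno", "psu", "prob", "stratum")

# Names that were requested but are not columns of the data
not.found <- setdiff(wanted, names(toy))
print(not.found)

# Select only the requested columns that do exist, in the requested order
toy.selected <- toy[intersect(wanted, names(toy))]
print(names(toy.selected))
```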

2.0.3 Merging datafiles

After selecting the variables for the analysis, we merge the ‘paradata’ file containing all sampled units (respondents and non-respondents) with the ‘survey responses’ file, containing interview responses (only for respondents). The resulting data_frame is the ‘data’ object. It contains the ‘paradata’ information for all sampled individuals and responses for those that were interviewed successfully.

In a real situation where we collect the data ourselves, we would also have a ‘survey frame’. This ‘survey frame’ would ideally include all units from the population and some of their characteristics, such as stratification variables. A survey frame would include sampled units (respondents and non-respondents) as well as non-sampled units.

data <- paradata %>%
  left_join(sample.data, by = "idno") %>%
  left_join(responses, by = "idno") %>%
  arrange(interva) 

rm(paradata,
   sample.data,
   responses)
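The reason a left join is used is that it keeps every row of the left-hand table (here, all sampled units) and fills the response columns with NA where no interview exists. The sketch below illustrates this on made-up data. It uses base R's merge(..., all.x = TRUE) as a stand-in for left_join() so that it runs without extra packages; the toy objects and their values are hypothetical.

```r
# Toy paradata: three sampled units, one of them a non-respondent
toy.paradata <- data.frame(idno = c(1, 2, 3),
                           interva = c("Complete", "No interview", "Complete"))

# Toy responses: only the two respondents appear here
toy.responses <- data.frame(idno = c(1, 3),
                            agea = c(49, 76))

# all.x = TRUE keeps every row of the first table, like a left join
toy.merged <- merge(toy.paradata, toy.responses, by = "idno", all.x = TRUE)
print(toy.merged)

# The non-respondent is kept, with a missing value in the response column
print(is.na(toy.merged$agea))
```

The merged table has as many rows as the paradata, which is the property the later non-response steps rely on.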

Here we attach the variable labels that we kept earlier to the merged data set.

attributes(data)$variable.labels <- c(selected.labels.paradata, selected.labels.sample.data[!names(selected.labels.sample.data) %in% "idno"],
                                      selected.labels.responses)

2.0.4 Recoding

Here we will recode our two dependent variables: cigarette and alcohol consumption. All respondents who don’t smoke should have a 0 in the cigarettes smoked per day variable. To calculate the alcohol consumption of respondents, we first calculate what their daily consumption would be if they drank every day, and then scale it down by their stated frequency of alcohol consumption.

data$cgtsday[data$cgtsmke %in% c("I have never smoked",
                                 "I don't smoke now but I used to",
                                 "I have only smoked a few times")] <- 0

# Daily grams if the respondent drank every day
# (weekday intake weighted by 5, weekend intake by 2)
data$alcohol_day <- (data$alcwkdy * 5 + data$alcwknd * 2) / 7

# Scale by stated drinking frequency. Both sides use the same row index,
# so each respondent's own value is divided by the divisor for their category.
freq.divisor <- c("Several times a week" = 2.5, "Once a week" = 7,
                  "2-3 times a month" = 10, "Once a month" = 30,
                  "Less than once a month" = 50)

for (f in names(freq.divisor)) {
  rows <- which(data$alcfreq == f)
  data$alcohol_day[rows] <- data$alcohol_day[rows] / freq.divisor[[f]]
}

# Respondents who never drink consume 0 grams per day
data$alcohol_day[which(data$alcfreq == "Never")] <- 0

resp.y <- c(resp.y, "alcohol_day")
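To make the arithmetic concrete, here is a small worked example of the recoding rule described above, applied to three made-up respondents. The data frame toy and the vector divisor are hypothetical names; the divisors are the ones used in this section. Categories without a listed divisor are left unscaled in this sketch.

```r
# Three toy respondents (grams of alcohol on the last weekday / weekend day)
toy <- data.frame(alcfreq = c("Once a week", "Several times a week", "Never"),
                  alcwkdy = c(48, 42, NA),
                  alcwknd = c(96, 0, NA),
                  stringsAsFactors = FALSE)

# Divisors taken from the recoding rule in the text
divisor <- c("Several times a week" = 2.5, "Once a week" = 7,
             "2-3 times a month" = 10, "Once a month" = 30,
             "Less than once a month" = 50)

# Step 1: daily grams if the person drank every day
daily.if.everyday <- (toy$alcwkdy * 5 + toy$alcwknd * 2) / 7

# Step 2: look up each person's divisor by their frequency category
d <- unname(divisor[toy$alcfreq])
d[is.na(d)] <- 1  # no listed divisor: leave the value unscaled

toy$alcohol_day <- daily.if.everyday / d

# Step 3: people who never drink get 0
toy$alcohol_day[toy$alcfreq == "Never"] <- 0

print(toy)
```

For the first respondent this gives (48 * 5 + 96 * 2) / 7 / 7, and for the second (42 * 5 + 0 * 2) / 7 / 2.5. Note that every respondent is divided using their own weekday and weekend values.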

2.1 Exploring and presenting the dataset

The merged data set contains sampled respondents and non-respondents. It contains a total of 5600 units and 39 variables.

dim(data)

## [1] 5600   39

The data set contains information about 2265 respondents and 3335 non-respondents.
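One way to obtain such counts is to tabulate the interva variable, whose codes distinguish completed interviews from non-interviews. The sketch below shows the idea on a made-up vector; the full ‘No interview …’ label is not reproduced in this guide, so a placeholder string is used, and toy.interva and responded are hypothetical names.

```r
# Toy outcome codes; "No interview ..." is a placeholder, not the full ESS label
toy.interva <- c("Complete and valid interview related to CF",
                 "No interview ...",
                 "Complete and valid interview related to CF",
                 "No interview ...",
                 "No interview ...")

# TRUE for units whose outcome code starts with "Complete"
responded <- grepl("^Complete", toy.interva)

# Number of non-respondents (FALSE) and respondents (TRUE)
print(table(responded))

# Consistency check on the counts reported in the text
stopifnot(2265 + 3335 == 5600)
```

Applied to the merged data, the same call would be grepl("^Complete", data$interva), assuming the labels are as shown in the paradata output later in this chapter.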

And this is a list of the variables it contains (with their labels). idno is the individual identification variable.

data.variables <- cbind(names(data),attributes(data)$variable.labels) %>% 
  as_data_frame()

data.variables$V2 <- format(data.variables$V2 , justify = "left")

data.variables %>%
  print(n = 40)

## # A tibble: 39 x 2
##             V1                                                                              V2
##          <chr>                                                                           <chr>
##  1        idno Respondent's identification number                                             
##  2    typesamp Type of the sample                                                             
##  3     interva Interview information for the sample unit                                      
##  4      telnum Telephone number                                                               
##  5      agea_1 Estimation of age of respondent or household member who refuses, by interviewer
##  6    gendera1 Gender of respondent or household member who refuses, recorded by interviewer  
##  7        type Type of house respondent lives in                                              
##  8      access Entry phone or locked gate/door before reaching respondent's individual door   
##  9       physa Assessment overall physical condition building/house                           
## 10     littera Amount of litter and rubbish in the immediate vicinity                         
## 11      vandaa Amount of vandalism and graffiti in the immediate vicinity                     
## 12         psu PSU                                                                            
## 13        prob PROB                                                                           
## 14        vote Voted last national election                                                   
## 15    prtvtbgb Party voted for in last national election, United Kingdom                      
## 16    prtclbgb Which party feel closer to, United Kingdom                                     
## 17     prtdgcl How close to party                                                             
## 18     ctzcntr Citizen of country                                                             
## 19    ctzshipc Citizenship                                                                    
## 20     brncntr Born in country                                                                
## 21    cntbrthc Country of birth                                                               
## 22     cgtsmke Cigarettes smoking behaviour                                                   
## 23     cgtsday How many cigarettes smoke on typical day                                       
## 24     alcfreq How often drink alcohol                                                        
## 25     alcwkdy Grams alcohol, last time drinking on a weekday, Monday to Thursday             
## 26     alcwknd Grams alcohol, last time drinking on a weekend day, Friday to Sunday           
## 27       hhmmb Number of people living regularly as member of household                       
## 28        gndr Gender                                                                         
## 29        agea Age of respondent, calculated                                                  
## 30      eisced Highest level of education, ES - ISCED                                         
## 31       pdwrk Doing last 7 days: paid work                                                   
## 32       edctn Doing last 7 days: education                                                   
## 33      uempla Doing last 7 days: unemployed, actively looking for job                        
## 34      uempli Doing last 7 days: unemployed, not actively looking for job                    
## 35        rtrd Doing last 7 days: retired                                                     
## 36     wrkctra Employment contract unlimited or limited duration                              
## 37    hinctnta Household's total net income, all sources                                      
## 38      region Region                                                                         
## 39 alcohol_day Respondent's identification number

rm(data.variables)
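In the listing above, alcohol_day appears with the label of idno. A likely explanation is that the derived variable has no label of its own, so the label vector is one element shorter than the vector of names and cbind() recycles it from the start. The sketch below reproduces this behaviour on made-up vectors and shows one possible remedy; the label text given to alcohol_day is hypothetical.

```r
# Three variable names but only two labels
toy.names  <- c("idno", "psu", "alcohol_day")
toy.labels <- c(idno = "Respondent's identification number", psu = "PSU")

# cbind() recycles the shorter vector (with a warning),
# so the third row reuses the first label
recycled <- suppressWarnings(cbind(toy.names, toy.labels))
print(unname(recycled))

# Possible remedy: give the derived variable its own label first
toy.labels["alcohol_day"] <- "Grams of alcohol per day (derived)"
fixed <- cbind(toy.names, toy.labels)
print(unname(fixed))
```

In the guide's own objects, the analogous step would be to assign attributes(data)$variable.labels["alcohol_day"] after the recoding, under the assumption that the labels are stored as a named character vector.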

The goal of this guide is to give UK population estimates for cigarette and alcohol consumption based on ESS respondents. These will be our Y variables, or variables of interest. The idea is to describe the distribution of these two variables (with quantiles and the mean, for example) and then make a simple extrapolation to compute total cigarette and alcohol consumption for the whole UK.

These are our Y variables:

cgtsday : Number of cigarettes smoked on a typical day.
alcohol_day : Grams of alcohol ingested on a daily basis. Computed in the Recoding section from the amount of alcohol drunk the last time the respondent drank on a weekday and on a weekend day.

data[resp.y] %>%
  as_data_frame() %>%
  print()

## # A tibble: 5,600 x 6
##                            cgtsmke cgtsday              alcfreq alcwkdy alcwknd alcohol_day
##  *                          <fctr>   <dbl>               <fctr>   <dbl>   <dbl>       <dbl>
##  1             I have never smoked       0          Once a week       0       8  0.32653061
##  2                   I smoke daily      10          Once a week      48      96  8.81632653
##  3             I have never smoked       0 Several times a week      42       0  0.91428571
##  4             I have never smoked       0                Never      NA      NA  0.00000000
##  5                   I smoke daily      20    2-3 times a month      55       0  0.03265306
##  6 I don't smoke now but I used to       0 Several times a week      26      32 24.68571429
##  7 I don't smoke now but I used to       0          Once a week      17       0  0.13061224
##  8                   I smoke daily      15                Never      NA      NA  0.00000000
##  9             I have never smoked       0                Never      NA      NA  0.00000000
## 10                   I smoke daily      20 Several times a week      42     121 12.00000000
## # ... with 5,590 more rows

2.1.1 Paradata variables

The 7th ESS contains variables which give information about the data collection process. First, we have some variables that come from the ‘sample data (SDDF)’ file. These contain information about the ‘primary sampling unit’ and the probability of each unit being selected into the sample. These two variables are only available for respondents. In a real project we would most probably have to compute the probability of being sampled ourselves.

psu: This variable includes information on the primary sampling unit (PSU). In the UK this refers to the ‘postcode address file’.
prob: Probability of being included in the sample (i.e. approached for survey).

data[vars.sample.data] %>%
  as_data_frame() %>%
  print()

## # A tibble: 5,600 x 3
##         idno   psu          prob
##  *     <dbl> <dbl>         <dbl>
##  1 100000003  9388 0.00020306164
##  2 100000005  9241 0.00020306164
##  3 100000008  9472 0.00020306164
##  4 100000009  9450 0.00005076541
##  5 100000010  9479 0.00010153082
##  6 100000012  9460 0.00010153082
##  7 100000015  9440 0.00010153082
##  8 100000016  9494 0.00020306164
##  9 100000017  9376 0.00010153082
## 10 100000020  9273 0.00010153082
## # ... with 5,590 more rows
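The prob column matters because units with a smaller chance of selection need to count for more in the estimates. As a minimal sketch, and assuming the usual convention that a base (design) weight is the inverse of the inclusion probability, the three distinct probabilities shown above translate into weights as follows. The names base.weight and scaled.weight are introduced here for illustration.

```r
# Inclusion probabilities copied from the output above
prob <- c(0.00020306164, 0.00005076541, 0.00010153082)

# Assumption: base weight = 1 / probability of inclusion
base.weight <- 1 / prob

# Rescale so that the weights average 1 (relative weights)
scaled.weight <- base.weight / mean(base.weight)

print(base.weight)
print(scaled.weight)

# A unit with a quarter of the inclusion probability gets four times the weight
print(base.weight[2] / base.weight[1])
```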

The 7th ESS also contains variables for all sampled units (i.e. respondents and non-respondents). These give information about the events that occurred during the data collection process. We will use these variables as covariates when computing non-response weights in step two.

typesamp: Refers to the type of unit sampled. In the UK addresses were the final sampling units. In some other countries these were households and individual people.
interva: Shows the final outcome of the contact. In the UK sample, only codes ‘Complete …’ and ‘No interview …’ were used for respondents and non-respondents respectively.
telnum: Whether the interviewed person gave his/her telephone number to the interviewer.
agea_1: Interviewer estimation of age of respondent or household member who refuses to give the interview.
gendera1: Interviewer estimation of gender of respondent or household member who refuses to give the interview.
type: Type of house sampled unit lives in.
access: Entry phone or locked gate/door before reaching respondent’s individual door.
physa: Interviewer assessment of the overall physical condition of the building/house.
littera: Interviewer assessment of amount of litter and rubbish in the immediate vicinity.
vandaa: Interviewer assessment of amount of vandalism and graffiti in the immediate vicinity.

data[vars.paradata] %>%
  head(6)

##        idno typesamp                                    interva  telnum
## 1 100000003  Address Complete and valid interview related to CF Present
## 2 100000005  Address Complete and valid interview related to CF Present
## 3 100000008  Address Complete and valid interview related to CF Present
## 4 100000009  Address Complete and valid interview related to CF Present
## 5 100000010  Address Complete and valid interview related to CF Present
## 6 100000012  Address Complete and valid interview related to CF Present
##   agea_1 gendera1                             type                access
## 1   <NA>     <NA>      Single unit: Terraced house  No, neither of these
## 2   <NA>     <NA> Single unit: Semi-detached house  No, neither of these
## 3   <NA>     <NA>      Single unit: Terraced house Yes, locked gate/door
## 4   <NA>     <NA> Single unit: Semi-detached house  No, neither of these
## 5   <NA>     <NA>      Single unit: Terraced house  No, neither of these
## 6   <NA>     <NA>      Single unit: Terraced house  No, neither of these
##          physa             littera              vandaa
## 1 Satisfactory        Small amount None or almost none
## 2         Good None or almost none None or almost none
## 3         Good None or almost none None or almost none
## 4 Satisfactory        Large amount None or almost none
## 5 Satisfactory None or almost none None or almost none
## 6    Very good None or almost none None or almost none

2.1.2 Survey responses

Apart from the variables of interest (cigarette and alcohol consumption), our dataset has other variables obtained from survey responses. Naturally, these are only available for respondents. We will try to use some of these variables to calibrate the survey in the ‘Use of auxiliary data/calibration’ step. Some of these variables are:

vote: Voted last national election (Yes/No)
prtvtbgb: Party voted for in last national election
prtclbgb: Which party feel closer to, United Kingdom
prtdgcl: How close the respondent feels to the party from ‘prtclbgb’
ctzcntr: Has UK citizenship (Yes/No)
ctzshipc: Citizenship of respondent
brncntr: Respondent born in the UK
cntbrthc: Respondent country of birth
gndr: Gender of respondent
agea: Calculated age of respondent
eisced: Highest level of education of respondent
pdwrk: In paid work
edctn: In education
uempla: In unemployment, actively looking for a job
uempli: In unemployment, not actively looking for a job
rtrd: Retired
wrkctra: Employment contract unlimited or limited duration
hinctnta: Household’s total net income, all sources

data[c(resp.id, resp.x)] %>%
  head(6)

##        idno vote         prtvtbgb prtclbgb   prtdgcl ctzcntr ctzshipc
## 1 100000003  Yes Liberal Democrat     <NA>      <NA>     Yes     <NA>
## 2 100000005  Yes Liberal Democrat     <NA>      <NA>     Yes     <NA>
## 3 100000008  Yes           Labour   Labour Not close     Yes     <NA>
## 4 100000009  Yes           Labour     <NA>      <NA>     Yes     <NA>
## 5 100000010   No             <NA>     <NA>      <NA>     Yes     <NA>
## 6 100000012   No             <NA>     <NA>      <NA>     Yes     <NA>
##   brncntr cntbrthc   gndr agea hhmmb
## 1     Yes     <NA> Female   49     1
## 2     Yes     <NA> Female   45     2
## 3     Yes     <NA>   Male   76     1
## 4      No     <NA>   Male   50     5
## 5     Yes     <NA>   Male   67     2
## 6     Yes     <NA> Female   31     2
##                                                eisced
## 1 ES-ISCED V2, higher tertiary education, >= MA level
## 2     ES-ISCED V1, lower tertiary education, BA level
## 3              ES-ISCED I , less than lower secondary
## 4                        ES-ISCED II, lower secondary
## 5              ES-ISCED I , less than lower secondary
## 6           ES-ISCED IIIa, upper tier upper secondary
##                     region      pdwrk      edctn     uempla     uempli
## 1  West Midlands (England)     Marked Not marked Not marked Not marked
## 2                 Scotland     Marked Not marked Not marked Not marked
## 3  East Midlands (England) Not marked Not marked Not marked Not marked
## 4     South East (England) Not marked Not marked Not marked Not marked
## 5 Yorkshire and the Humber Not marked Not marked Not marked Not marked
## 6     South West (England)     Marked Not marked Not marked Not marked
##         rtrd     wrkctra       hinctnta
## 1 Not marked   Unlimited M - 4th decile
## 2 Not marked No contract R - 2nd decile
## 3     Marked   Unlimited R - 2nd decile
## 4 Not marked   Unlimited S - 6th decile
## 5     Marked        <NA> J - 1st decile
## 6 Not marked   Unlimited F - 5th decile