2 Import data and data management
library(RCurl)
library(MASS)
library(glmnet)
library(caret)
library(survey)
library(readxl)
library(stringr)
library(forcats)
library(foreign)
library(magrittr)
library(tidyverse)
options(scipen = 9999)
options(dplyr.width = Inf)
set.seed(456162)
We first need to import data into R. In this guide we will use UK data from the 7th round of the European Social Survey (ESS). The advantage of this data is that the ESS is a well-documented, high-quality probability survey. Its documentation allows us to understand how responses were collected and provides some useful information about non-respondents. In addition, the 7th ESS was weighted by expert statisticians, and the two phases of weighting they applied are explained on the ESS website. This will allow us to compare our own weights and results with those already computed by their team of experts. Focusing on the UK sample will allow us to narrow down the analysis and speed up computation by reducing the amount of data used in each step.
For this guide we will use the following 7th ESS datafiles in SPSS (‘.sav’) format:
- sample data (SDDF), edition 1.1, which contains the probability of being sampled for all respondents and non-respondents invited to the survey;
- the data from Contact forms, edition 2.1, which provides information about the process of data collection (e.g. number of times the person was approached for a response, ID of the interviewer in each approach, conditions of the house/area where the potential respondent lived). We will call this data the ‘paradata’ of the survey;
- the integrated interviewer data file, edition 2.1. These are the responses to the survey.
The following sections explain data import, selection, merging and recoding. Readers who are not interested in technical details about datasets can skip them and jump directly to the exploration and presentation of the data.
2.0.1 Import data
The following chunk of code loads the data sets from a data folder in the working directory. The sample data file is stored in the sample.data ‘data_frame’ object. The contact forms information is stored in the paradata object. Survey responses from the integrated interviewer file are saved in the responses object. We also store the weight variables included in the integrated interviewer file in the original.weights data_frame. Each file is filtered to keep only UK records.
# read.spss() comes from the 'foreign' package; value labels become factor levels
sample.data <- read.spss("data/ESS7SDDFe1_1.sav", to.data.frame = TRUE) %>%
  filter(cntry == "United Kingdom")
paradata <- read.spss("data/ess7CFe02_1.sav", to.data.frame = TRUE) %>%
  filter(cntry == "United Kingdom")
responses <- read.spss("data/ESS7e02_1.sav", to.data.frame = TRUE) %>%
  filter(cntry == "United Kingdom")
# Keep the weights computed by the ESS team to compare with our own later
original.weights <- responses %>% select(idno, dweight, pspwght, pweight)
2.0.2 Select variables
Once the data has been read into R, we select the variables we are going to use in our analysis. Selecting variables is good practice, as the ESS files contain much more information than we need for this example. This makes it easier to find and inspect the data that matters for this guide. Here we just write the names of the variables we intend to use; we will explain their content in more substantive terms later.
vars.sample.data <- c("idno", "psu", "prob")
vars.paradata <- c("idno", "typesamp", "interva", "telnum",
"agea_1", "gendera1", "type", "access",
"physa", "littera", "vandaa")
resp.id <- c("idno")
resp.y <- c("cgtsmke", "cgtsday",
"alcfreq", "alcwkdy", "alcwknd")
resp.x <- c("vote", "prtvtbgb",
"prtclbgb", "prtdgcl",
"ctzcntr", "ctzshipc",
"brncntr","cntbrthc",
"gndr", "agea", "hhmmb","eisced", "region",
"pdwrk", "edctn", "uempla", "uempli", "rtrd",
"wrkctra", "hinctnta")
We will also keep the variable labels from the SPSS (.sav) file, although these are not so common in R.
selected.labels.sample.data <- attributes(sample.data)$variable.labels[which(names(sample.data) %in% vars.sample.data)]
selected.labels.paradata <- attributes(paradata)$variable.labels[which(names(paradata) %in% vars.paradata)]
selected.labels.responses <- attributes(responses)$variable.labels[which(names(responses) %in% c(resp.y, resp.x))]
attributes(responses)$variable.labels %>%
cbind(names(responses),.) %>%
as_data_frame %>%
write_csv("interim_output/variable_labels.csv")
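The label selection relies on the fact that `read.spss()` with `to.data.frame = TRUE` attaches the SPSS variable labels to the data frame as a named character vector in the `variable.labels` attribute. The following minimal sketch mimics this on a toy data frame; `toy`, its values and its labels are invented for illustration and are not part of the ESS files.

```r
# Toy data frame mimicking what read.spss() returns (values and labels are invented)
toy <- data.frame(idno = 1:3, psu = c(10, 11, 12), extra = c("a", "b", "c"))
attr(toy, "variable.labels") <- c(idno = "Identification number",
                                  psu = "Primary sampling unit",
                                  extra = "Unused variable")

keep <- c("idno", "psu")

# Same pattern as in the guide: positions of the kept variables among the columns
toy.labels <- attr(toy, "variable.labels")[which(names(toy) %in% keep)]
print(toy.labels)
```

Because the labels are picked by column position, they come out in the order of the columns in the file, not in the order of the `keep` vector.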
Now we select the variables from the three data sets using the vectors of variable names defined above.
sample.data %<>%
.[vars.sample.data]
paradata %<>%
.[vars.paradata]
responses %<>%
.[which(names(responses) %in% c(resp.id, resp.y, resp.x))]
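Before subsetting, it can be worth checking that every requested name really exists in the file. Indexing a data frame with a name that is absent stops with an error, whereas the `which(names(...) %in% ...)` pattern silently drops it. The sketch below shows one possible check on an invented toy data frame; `wanted` and the absent name `"stratum"` are hypothetical.

```r
# Toy data frame standing in for one of the ESS files (invented values)
toy <- data.frame(idno = 1:3, psu = c(10, 11, 12), prob = c(0.1, 0.2, 0.1))
wanted <- c("idno", "psu", "prob", "stratum")  # "stratum" is deliberately absent

# Names that were requested but are not present in the data
missing.vars <- setdiff(wanted, names(toy))
print(missing.vars)

# Subset using only the names that exist, keeping the requested order
toy.selected <- toy[intersect(wanted, names(toy))]
print(names(toy.selected))
```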
2.0.3 Merging datafiles
After selecting the variables for the analysis, we merge the ‘paradata’ file, which contains all sampled units (respondents and non-respondents), with the sample data file and the ‘survey responses’ file, which contains interview responses (only for respondents). The resulting data_frame is the ‘data’ object. It contains the ‘paradata’ information for all sampled individuals and responses for those that were interviewed successfully.
In a real situation where we collect the data ourselves we would also have a ‘survey frame’. This ‘survey frame’ would ideally include all units from the population and their characteristics, such as stratification variables. A survey frame would include sampled units (respondents and non-respondents) as well as non-sampled units.
data <- paradata %>%
left_join(sample.data, by = "idno") %>%
left_join(responses, by = "idno") %>%
arrange(interva)
rm(paradata,
sample.data,
responses)
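To illustrate what the left joins do, here is a small sketch on invented data. It uses base R `merge()` with `all.x = TRUE`, which behaves like a left join apart from row ordering; the toy objects and their values are assumptions made for this example only.

```r
# Invented toy files: three sampled units, only two of them interviewed
toy.paradata <- data.frame(idno = c(1, 2, 3),
                           interva = c("Complete", "No interview", "Complete"))
toy.responses <- data.frame(idno = c(1, 3), cgtsday = c(0, 10))

# Keep every row of the paradata; unmatched units get NA in the response columns
toy.merged <- merge(toy.paradata, toy.responses, by = "idno", all.x = TRUE)
print(toy.merged)

# The identifier should be unique, otherwise the join would duplicate rows
stopifnot(!anyDuplicated(toy.merged$idno))
```

The non-respondent (idno 2) is kept with a missing value for `cgtsday`, which is the structure the merged ‘data’ object has for all non-respondents.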
Here we add the variable labels which we kept before to the merged data set.
attributes(data)$variable.labels <- c(selected.labels.paradata, selected.labels.sample.data[!names(selected.labels.sample.data) %in% "idno"],
selected.labels.responses)
2.0.4 Recoding
Here we will recode our two dependent variables: cigarette and alcohol consumption. All respondents who don’t smoke should have a 0 in the cigarettes smoked per day variable. To calculate the alcohol consumption of respondents, we first calculate what their daily consumption would be if they drank every day, and then scale it by their stated frequency of alcohol consumption.
data$cgtsday[data$cgtsmke %in% c("I have never smoked",
"I don't smoke now but I used to",
"I have only smoked a few times")] <- 0
# Grams of alcohol per day if drinking every day (5 weekdays, 2 weekend days)
data$alcohol_day <- (data$alcwkdy * 5 + data$alcwknd * 2) / 7
# Divisor applied for each stated drinking frequency; other frequencies are left unscaled
freq.divisor <- c("Several times a week" = 2.5, "Once a week" = 7,
                  "2-3 times a month" = 10, "Once a month" = 30,
                  "Less than once a month" = 50)
# Index both sides with the same rows so each unit is scaled using its own value
for (f in names(freq.divisor)) {
  rows <- which(data$alcfreq == f)
  data$alcohol_day[rows] <- data$alcohol_day[rows] / freq.divisor[[f]]
}
data$alcohol_day[which(data$alcfreq == "Never")] <- 0
resp.y <- c(resp.y, "alcohol_day")
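The frequency scaling can be checked by hand on a few rows. The sketch below applies two of the divisors to an invented toy data frame. Indexing the right-hand side with the same rows as the left-hand side matters: without it, R would take the first values of the full column rather than the values of the selected rows.

```r
# Toy rows (invented): grams on the last weekday and weekend drinking occasion
toy <- data.frame(
  alcfreq = c("Once a week", "Several times a week", "Never", "Several times a week"),
  alcwkdy = c(0, 42, NA, 26),
  alcwknd = c(8, 0, NA, 32)
)

# Grams per day if the person drank every day (5 weekdays, 2 weekend days)
toy$alcohol_day <- (toy$alcwkdy * 5 + toy$alcwknd * 2) / 7

# Scale each frequency group; the same row index appears on both sides
rows <- which(toy$alcfreq == "Several times a week")
toy$alcohol_day[rows] <- toy$alcohol_day[rows] / 2.5

rows <- which(toy$alcfreq == "Once a week")
toy$alcohol_day[rows] <- toy$alcohol_day[rows] / 7

# People who never drink get zero, even though their gram variables are missing
toy$alcohol_day[which(toy$alcfreq == "Never")] <- 0
print(toy)
```

For the second row, for example, the calculation is (42 * 5 + 0 * 2) / 7 = 30 grams on a drinking day, divided by 2.5 to give 12 grams per day.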
2.1 Exploring and presenting the dataset
The merged data set contains sampled respondents and non-respondents. It contains a total of 5600 units and 39 variables.
dim(data)
## [1] 5600 39
The data set contains information about 2265 respondents and 3335 non-respondents.
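One way to obtain such counts is to build a respondent indicator from the outcome variable `interva`. The sketch below does this on invented outcome codes; the text after "No interview" in the toy labels is hypothetical, and only the two prefixes are taken from the UK data.

```r
# Invented outcome codes using the two prefixes that appear in the UK sample
interva <- c("Complete and valid interview related to CF",
             "No interview: reason A",  # hypothetical label
             "No interview: reason B",  # hypothetical label
             "Complete and valid interview related to CF")

# TRUE for respondents, FALSE for non-respondents
respondent <- grepl("^Complete", interva)
counts <- table(respondent)
print(counts)
```

Applied to the merged data, `table(grepl("^Complete", data$interva))` should give the split between respondents and non-respondents, assuming the outcome labels start with these prefixes.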
And this is a list of the variables it contains (with their labels). idno is the individual identification variable.
data.variables <- cbind(names(data),attributes(data)$variable.labels) %>%
as_data_frame()
data.variables$V2 <- format(data.variables$V2 , justify = "left")
data.variables %>%
print(n = 40)
## # A tibble: 39 x 2
## V1 V2
## <chr> <chr>
## 1 idno Respondent's identification number
## 2 typesamp Type of the sample
## 3 interva Interview information for the sample unit
## 4 telnum Telephone number
## 5 agea_1 Estimation of age of respondent or household member who refuses, by interviewer
## 6 gendera1 Gender of respondent or household member who refuses, recorded by interviewer
## 7 type Type of house respondent lives in
## 8 access Entry phone or locked gate/door before reaching respondent's individual door
## 9 physa Assessment overall physical condition building/house
## 10 littera Amount of litter and rubbish in the immediate vicinity
## 11 vandaa Amount of vandalism and graffiti in the immediate vicinity
## 12 psu PSU
## 13 prob PROB
## 14 vote Voted last national election
## 15 prtvtbgb Party voted for in last national election, United Kingdom
## 16 prtclbgb Which party feel closer to, United Kingdom
## 17 prtdgcl How close to party
## 18 ctzcntr Citizen of country
## 19 ctzshipc Citizenship
## 20 brncntr Born in country
## 21 cntbrthc Country of birth
## 22 cgtsmke Cigarettes smoking behaviour
## 23 cgtsday How many cigarettes smoke on typical day
## 24 alcfreq How often drink alcohol
## 25 alcwkdy Grams alcohol, last time drinking on a weekday, Monday to Thursday
## 26 alcwknd Grams alcohol, last time drinking on a weekend day, Friday to Sunday
## 27 hhmmb Number of people living regularly as member of household
## 28 gndr Gender
## 29 agea Age of respondent, calculated
## 30 eisced Highest level of education, ES - ISCED
## 31 pdwrk Doing last 7 days: paid work
## 32 edctn Doing last 7 days: education
## 33 uempla Doing last 7 days: unemployed, actively looking for job
## 34 uempli Doing last 7 days: unemployed, not actively looking for job
## 35 rtrd Doing last 7 days: retired
## 36 wrkctra Employment contract unlimited or limited duration
## 37 hinctnta Household's total net income, all sources
## 38 region Region
## 39 alcohol_day Respondent's identification number
rm(data.variables)
The goal of this guide is to give UK population estimates for cigarette and alcohol consumption based on ESS respondents. These will be our Y variables, or variables of interest. The idea is to describe the distribution of these two variables (with quantiles and the mean, for example) and then, by simple extrapolation, to compute total cigarette and alcohol consumption for the whole UK.
These are our Y variables:
- cgtsday : Number of cigarettes smoked on a typical day.
- alcohol_day : Grams of alcohol ingested on a daily basis. Computed in the Recoding section from the amount of alcohol drunk the last time the respondent drank on a weekday and on a weekend day.
data[resp.y] %>%
as_data_frame() %>%
print()
## # A tibble: 5,600 x 6
## cgtsmke cgtsday alcfreq alcwkdy alcwknd alcohol_day
## * <fctr> <dbl> <fctr> <dbl> <dbl> <dbl>
## 1 I have never smoked 0 Once a week 0 8 0.32653061
## 2 I smoke daily 10 Once a week 48 96 8.81632653
## 3 I have never smoked 0 Several times a week 42 0 12.00000000
## 4 I have never smoked 0 Never NA NA 0.00000000
## 5 I smoke daily 20 2-3 times a month 55 0 3.92857143
## 6 I don't smoke now but I used to 0 Several times a week 26 32 11.08571429
## 7 I don't smoke now but I used to 0 Once a week 17 0 1.73469388
## 8 I smoke daily 15 Never NA NA 0.00000000
## 9 I have never smoked 0 Never NA NA 0.00000000
## 10 I smoke daily 20 Several times a week 42 121 25.82857143
## # ... with 5,590 more rows
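As a rough illustration of the descriptives and the simple extrapolation mentioned above, the following sketch computes an unweighted mean, quartiles and a naive total on invented values. The population size `N.assumed` is a hypothetical number chosen for the example, and no survey weights are applied here.

```r
# Invented values standing in for cigarettes smoked per day (NA = no response)
cgtsday <- c(0, 10, 0, 0, 20, 0, 0, 15, 0, 20, NA)

# Unweighted descriptives, ignoring missing values
mean.cig <- mean(cgtsday, na.rm = TRUE)
quartiles.cig <- quantile(cgtsday, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
print(mean.cig)
print(quartiles.cig)

# Naive extrapolation: mean per person times an assumed population size
N.assumed <- 1000  # hypothetical population size, for illustration only
total.cig <- mean.cig * N.assumed
print(total.cig)
```

Estimates of this kind are only trustworthy if respondents represent the population well, which is what weighting tries to achieve.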
2.1.1 Paradata variables
The 7th ESS survey contains variables which give information about the data collection process. First, we have some variables that come from the ‘sample data (SDDF)’ file. These contain information about the ‘primary sampling unit’ and the probability of each unit being selected in the sample. These two variables are only available for respondents. In a real project we would most probably have to compute the probability of being sampled ourselves.
- psu: This variable includes information on the primary sampling unit (PSU). In the UK this refers to the ‘postcode address file’.
- prob: Probability of being included in the sample (i.e. approached for survey).
data[vars.sample.data] %>%
as_data_frame() %>%
print()
## # A tibble: 5,600 x 3
## idno psu prob
## * <dbl> <dbl> <dbl>
## 1 100000003 9388 0.00020306164
## 2 100000005 9241 0.00020306164
## 3 100000008 9472 0.00020306164
## 4 100000009 9450 0.00005076541
## 5 100000010 9479 0.00010153082
## 6 100000012 9460 0.00010153082
## 7 100000015 9440 0.00010153082
## 8 100000016 9494 0.00020306164
## 9 100000017 9376 0.00010153082
## 10 100000020 9273 0.00010153082
## # ... with 5,590 more rows
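The inclusion probability is the usual starting point for a base (design) weight: a unit sampled with probability `prob` stands for roughly `1 / prob` units of the population. The sketch below applies this standard inverse-probability idea to three of the probabilities printed above, combined with invented `y` values. It is an illustration under these assumptions, not necessarily the procedure used for the official ESS weights.

```r
# Inclusion probabilities as printed above, paired with invented y values
prob <- c(0.00020306164, 0.00005076541, 0.00010153082)
y    <- c(0, 10, 20)

# Base weight: each sampled unit represents 1/prob units of the population
base.weight <- 1 / prob

# Weighted mean of y and estimated population total of y
weighted.mean <- sum(base.weight * y) / sum(base.weight)
estimated.total <- sum(base.weight * y)
print(round(base.weight))
print(weighted.mean)
```

Units with a lower probability of selection receive a larger weight, so they count for more in the weighted mean.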
The 7th ESS also contains variables for all sampled units (i.e. respondents and non-respondents). These give information about the events that occurred during the data collection process. We will use these variables as covariates during the computation of Non-response weights in step two.
- typesamp: Refers to the type of unit sampled. In the UK addresses were the final sampling units. In some other countries these were households and individual people.
- interva: Shows the final outcome of the contact. In the UK sample, only codes ‘Complete …’ and ‘No interview …’ were used for respondents and non-respondents respectively.
- telnum: The interviewed person gave his/her telephone number to the interviewer.
- agea_1: Interviewer’s estimate of the age of the respondent or household member who refuses to give the interview.
- gendera1: Gender of the respondent or household member who refuses to give the interview, as recorded by the interviewer.
- type: Type of house sampled unit lives in.
- access: Entry phone or locked gate/door before reaching respondent’s individual door.
- physa: Interviewer assessment of the overall physical condition of the building/house.
- littera: Interviewer assessment of amount of litter and rubbish in the immediate vicinity.
- vandaa: Interviewer assessment of amount of vandalism and graffiti in the immediate vicinity.
data[vars.paradata] %>%
head(6)
## idno typesamp interva telnum
## 1 100000003 Address Complete and valid interview related to CF Present
## 2 100000005 Address Complete and valid interview related to CF Present
## 3 100000008 Address Complete and valid interview related to CF Present
## 4 100000009 Address Complete and valid interview related to CF Present
## 5 100000010 Address Complete and valid interview related to CF Present
## 6 100000012 Address Complete and valid interview related to CF Present
## agea_1 gendera1 type access
## 1 <NA> <NA> Single unit: Terraced house No, neither of these
## 2 <NA> <NA> Single unit: Semi-detached house No, neither of these
## 3 <NA> <NA> Single unit: Terraced house Yes, locked gate/door
## 4 <NA> <NA> Single unit: Semi-detached house No, neither of these
## 5 <NA> <NA> Single unit: Terraced house No, neither of these
## 6 <NA> <NA> Single unit: Terraced house No, neither of these
## physa littera vandaa
## 1 Satisfactory Small amount None or almost none
## 2 Good None or almost none None or almost none
## 3 Good None or almost none None or almost none
## 4 Satisfactory Large amount None or almost none
## 5 Satisfactory None or almost none None or almost none
## 6 Very good None or almost none None or almost none
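Since these variables exist for both respondents and non-respondents, they can be related to the outcome of the contact. As a minimal sketch of that idea, the code below computes the response rate within each category of an interviewer assessment. The six rows and the categories "Good" and "Bad" are invented for the example.

```r
# Invented sample of six addresses with an interviewer assessment and an outcome
toy <- data.frame(
  physa = c("Good", "Good", "Bad", "Bad", "Bad", "Good"),
  interva = c("Complete", "Complete", "No interview",
              "No interview", "Complete", "No interview")
)

# Response indicator: 1 for completed interviews, 0 otherwise
toy$response <- as.integer(grepl("^Complete", toy$interva))

# Share of responding units within each category of the covariate
response.rate <- tapply(toy$response, toy$physa, mean)
print(response.rate)
```

Differences in response rates across categories are what make a variable useful as a covariate for non-response weights.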
2.1.2 Survey responses
Apart from the variables of interest (cigarette and alcohol consumption), our dataset has other variables obtained from survey responses. These are only available for respondents. We will try to use some of these variables to calibrate the survey in the ‘Use of auxiliary data/calibration’ step. Some of these variables are:
- vote: Voted last national election (Yes/No)
- prtvtbgb: Party voted for in last national election
- prtclbgb: Which party feel closer to, United Kingdom
- prtdgcl: How close the respondent feels to the party from ‘prtclbgb’
- ctzcntr: Has UK citizenship (Yes/No)
- ctzshipc: Citizenship of respondent
- brncntr: Respondent born in the UK
- cntbrthc: Respondent country of birth
- gndr: Gender of respondent
- agea: Calculated age of respondent
- eisced: Highest level of education of respondent
- pdwrk: In paid work
- edctn: In education
- uempla: In unemployment, actively looking for a job
- uempli: In unemployment, not actively looking for a job
- rtrd: Retired
- wrkctra: Employment contract unlimited or limited duration
- hinctnta: Household’s total net income, all sources
data[c(resp.id, resp.x)] %>%
head(6)
## idno vote prtvtbgb prtclbgb prtdgcl ctzcntr ctzshipc
## 1 100000003 Yes Liberal Democrat <NA> <NA> Yes <NA>
## 2 100000005 Yes Liberal Democrat <NA> <NA> Yes <NA>
## 3 100000008 Yes Labour Labour Not close Yes <NA>
## 4 100000009 Yes Labour <NA> <NA> Yes <NA>
## 5 100000010 No <NA> <NA> <NA> Yes <NA>
## 6 100000012 No <NA> <NA> <NA> Yes <NA>
## brncntr cntbrthc gndr agea hhmmb
## 1 Yes <NA> Female 49 1
## 2 Yes <NA> Female 45 2
## 3 Yes <NA> Male 76 1
## 4 No <NA> Male 50 5
## 5 Yes <NA> Male 67 2
## 6 Yes <NA> Female 31 2
## eisced
## 1 ES-ISCED V2, higher tertiary education, >= MA level
## 2 ES-ISCED V1, lower tertiary education, BA level
## 3 ES-ISCED I , less than lower secondary
## 4 ES-ISCED II, lower secondary
## 5 ES-ISCED I , less than lower secondary
## 6 ES-ISCED IIIa, upper tier upper secondary
## region pdwrk edctn uempla uempli
## 1 West Midlands (England) Marked Not marked Not marked Not marked
## 2 Scotland Marked Not marked Not marked Not marked
## 3 East Midlands (England) Not marked Not marked Not marked Not marked
## 4 South East (England) Not marked Not marked Not marked Not marked
## 5 Yorkshire and the Humber Not marked Not marked Not marked Not marked
## 6 South West (England) Marked Not marked Not marked Not marked
## rtrd wrkctra hinctnta
## 1 Not marked Unlimited M - 4th decile
## 2 Not marked No contract R - 2nd decile
## 3 Marked Unlimited R - 2nd decile
## 4 Not marked Unlimited S - 6th decile
## 5 Marked <NA> J - 1st decile
## 6 Not marked Unlimited F - 5th decile
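Several of these variables contain missing values even among respondents, as the `<NA>` entries above show. Before using a variable as auxiliary information it can help to look at how much of it is missing. The sketch below computes the share of missing values per variable on an invented toy data frame.

```r
# Invented respondent data with some missing auxiliary values
toy <- data.frame(
  vote     = c("Yes", "Yes", "No", NA),
  prtclbgb = c(NA, NA, "Labour", NA),
  agea     = c(49, 45, 76, 50)
)

# Proportion of missing values in each variable
share.missing <- colMeans(is.na(toy))
print(share.missing)
```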