Recipes (preprocessing)

Learning outcomes/objective: Learn…

Sources: Preprocess your data with recipes, Methods for selecting variables in step functions, Introduction to recipes, James et al. (2013, Ch. 5)

1 Findings

2 Why preprocess data and recipes?

  • Preprocessing: transforming raw data into a format that can be used by the model
    • e.g., data cleaning, feature selection, feature engineering, data normalization (e.g., scaling/transforming variables to similar ranges)
    • Various concepts overlap here (e.g., encoding data from text to numbers = feature engineering)
  • Feature engineering: the process of creating new features or transforming existing features in a dataset to improve the performance and interpretability of an ML model
    • Goal: create features that better capture the underlying patterns and relationships in the data to provide more accurate and robust predictions
  • Recipes provide pipeable steps for feature engineering and data preprocessing to prepare data for modeling

3 Feature engineering

  • Feature engineering involves a wide range of techniques, such as…
    • creating new features, i.e., combining existing features or transforming them to capture additional information
      • e.g., capture a person’s SES in a variable by combining different variables or adding polynomials
    • encoding of categorical variables to make them more informative for the model, e.g., adding them as dummies
    • normalizing or scaling features, e.g., centering/scaling variables to improve interpretability
    • handling missing values, e.g., imputing them (e.g., mean imputation, median imputation, or interpolation)
    • reducing dimensionality, e.g., to reduce complexity of the model and improve performance
    • extracting features from text or images, e.g., convert text or image data to numeric variables to use them in ML models
    • extracting key features from raw variables, e.g., getting the day of the week out of a date variable
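Several of these techniques map directly onto recipe step_*() functions. A minimal sketch with made-up example data (the variables income, gender, and start_date are hypothetical, not from the lab data used below):

```r
library(tidymodels)

# Hypothetical example data
df <- tibble(
  outcome    = c(1.2, 3.4, 2.1, 4.0),
  income     = c(20000, NA, 55000, 43000),
  gender     = factor(c("f", "m", "m", "f")),
  start_date = as.Date(c("2020-01-06", "2020-03-15", "2020-07-01", "2020-10-10"))
)

rec <- recipe(outcome ~ ., data = df) %>%
  step_impute_mean(income) %>%                 # handle missing values
  step_date(start_date, features = "dow") %>%  # extract day of week from a date
  step_dummy(all_nominal_predictors()) %>%     # encode categorical variables
  step_normalize(all_numeric_predictors())     # center/scale numeric features
rec
```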

4 Recipes with tidymodels (1)

  • With tidymodels we can also use so-called recipes to define models and preprocess the data
  • recipe(): Create a recipe for preprocessing data
    • step_*(): function for different preprocessing steps (see here for an overview)
    • bake(recipe_*, new_data = xxx): apply recipe to different datasets
  • Advantage
    • We can easily use & compare different recipes later on
    • Also sometimes necessary to define different recipes for different models (e.g., logistic model vs. random forest model)


5 Recipes with tidymodels (2)

  • Example below: a recipe containing an outcome plus three numeric predictors that centers and scales (“normalizes”) the predictors
library(tidymodels)
recipe1 <- recipe(lifesatisfaction ~ unemployed_active + age + education, 
                  data = data) %>%
  step_normalize(all_numeric_predictors())
recipe1
  • Recipes require definitions of variables (e.g., in the formula above), roles (i.e., how variables are used in the model), and terms (this partly happens automatically) (see here)
  • Using recipes: once a recipe is defined, it needs to be estimated before being applied to data
    • Most recipe steps have specific quantities that must be calculated or estimated
      • step_normalize(): needs to compute the training set’s mean for the selected columns
      • step_dummy(): needs to determine the factor levels of selected columns in order to make the appropriate indicator columns
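This estimation happens via prep(): it computes the required quantities from the data supplied to the recipe; bake() then applies them. A short sketch continuing with recipe1 from above:

```r
# prep() estimates the means and SDs that step_normalize() needs
recipe1_prepped <- prep(recipe1)

# tidy() shows the statistics estimated for a given step
tidy(recipe1_prepped, number = 1)

# bake() applies the estimated transformations;
# new_data = NULL returns the processed training data
bake(recipe1_prepped, new_data = NULL)
```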

6 Lab: Using recipes

We’ll use the ESS data that we have used in the previous lab. The variables are not very well named, so we have to rely on the codebook to understand their meaning. You can find it here or on the website of the ESS.

  • stflife: measures life satisfaction (How satisfied with life as a whole).
  • uempla: measures unemployment (Doing last 7 days: unemployed, actively looking for job).
  • uempli: measures unemployment (Doing last 7 days: unemployed, not actively looking for job).
  • eisced: measures education (Highest level of education, ES - ISCED).
  • cntry: measures a respondent’s country of origin (here held constant for France).
  • etc.
# install.packages("pacman")
pacman::p_load(tidyverse,
               tidymodels,
               naniar,
               knitr,
               kableExtra)

We first import the data into R:

# Import the ESS data
data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1azBdvM9-tDVQhADml_qlXjhpX72uHdWx"))

6.1 Renaming and recoding

We recode and rename some of the variables. In principle, this could be added to the recipe; however, it also represents an initial step to generate a clean dataset. Here we also create a factor variable containing religious denomination.

data <- data  %>%
    filter(cntry=="FR") %>% # filter French respondents 
 # Recode missings
 mutate(ppltrst = if_else(ppltrst %in% c(55, 66, 77, 88, 99), NA, ppltrst),
        eisced = if_else(eisced %in% c(55, 66, 77, 88, 99), NA, eisced),
        stflife = if_else(stflife %in% c(55, 66, 77, 88, 99), NA, stflife),
        rlgdnm = if_else(rlgdnm %in% c(66, 77, 99), NA, rlgdnm)) %>%
 # Rename variables
 rename(country = cntry,
        unemployed_active = uempla,
        unemployed = uempli,
        trust = ppltrst,
        lifesatisfaction = stflife,
        education = eisced,
        age = agea,
        religion = rlgdnm) %>% # we add this variable
  mutate(religion = factor(recode(religion, # Recode and factor
    "1" = "Roman Catholic",
    "2" = "Protestant",
    "3" = "Eastern Orthodox",
    "4" = "Other Christian denomination",
    "5" = "Jewish",
    "6" = "Islam",
    "7" = "Eastern religions",
    "8" = "Other Non-Christian religions",
    .default = NA_character_)))

6.2 Building a predictive linear model using a recipe

6.3 Split the data

Below we split the data and create resampled partitions with vfold_cv().

# Split the data into training and test data
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect
<Training/Testing/Total>
<1581/396/1977>
# Extract the two datasets
  data_train <- training(data_split)
  data_test <- testing(data_split) # Do not touch until the end!

# Create resampled partitions
  set.seed(345)
  data_folds <- vfold_cv(data_train, v = 10) # V-fold/k-fold cross-validation
  data_folds # data_folds now contains several resamples of our training data  
#  10-fold cross-validation 
# A tibble: 10 x 2
   splits             id    
   <list>             <chr> 
 1 <split [1422/159]> Fold01
 2 <split [1423/158]> Fold02
 3 <split [1423/158]> Fold03
 4 <split [1423/158]> Fold04
 5 <split [1423/158]> Fold05
 6 <split [1423/158]> Fold06
 7 <split [1423/158]> Fold07
 8 <split [1423/158]> Fold08
 9 <split [1423/158]> Fold09
10 <split [1423/158]> Fold10

6.4 Define model, recipe and workflow

Then we define our model and two recipes: recipe1, in which we specify only the formula and data, and recipe2, in which we specify the formula and data and additionally use step_naomit() to omit missings as well as step_poly() to add a polynomial expansion of some variables. These expansions are then passed as variables to the linear regression model.

By using step_poly(age, trust, education, degree = 3) in recipe2 we include polynomials of the three predictors up to degree 3.

# Define the model
  linear_mod <- 
  linear_reg() %>% # linear model
  set_engine("lm") %>%
  set_mode("regression")
  

# Define recipe 1
  recipe1 <- 
  recipe(lifesatisfaction ~ unemployed_active + age + trust + education + religion, 
         data = data)
  recipe1
  
# Define recipe 2
  recipe2 <- 
  recipe(lifesatisfaction ~ unemployed_active + age + trust + education + religion, 
         data = data) %>%
  step_normalize(all_numeric_predictors()) %>% 
  # step_dummy(all_nominal_predictors()) %>%
  step_naomit(age, trust, education) %>%
  step_poly(age, trust, education, degree = 3) # What does "degree" specify?
  recipe2

# Define workflows
  wflow1 <- 
  workflow() %>% 
  add_model(linear_mod) %>%
  add_recipe(recipe1)

  wflow2 <- 
  workflow() %>% 
  add_model(linear_mod) %>%
  add_recipe(recipe2)
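To inspect what step_poly() adds, one can estimate recipe2 with prep() and apply it with bake(): each selected predictor is replaced by orthogonal-polynomial columns (age_poly_1, age_poly_2, age_poly_3, etc.), matching the coefficient names that appear in the model output further below.

```r
# Estimate recipe2 and look at the generated polynomial columns
recipe2 %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(starts_with("age_poly"), starts_with("trust_poly")) %>%
  head()
```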



6.5 Fit and evaluate the models (resampled training data)

6.6 Fit models

Then we evaluate our models using resampling with the fit_resamples() function, fitting the workflows to data_folds.

  fit_rs1 <- wflow1 %>% fit_resamples(resamples = data_folds)
  fit_rs1
# Resampling results
# 10-fold cross-validation 
# A tibble: 10 x 4
   splits             id     .metrics         .notes          
   <list>             <chr>  <list>           <list>          
 1 <split [1422/159]> Fold01 <tibble [2 x 4]> <tibble [0 x 3]>
 2 <split [1423/158]> Fold02 <tibble [2 x 4]> <tibble [0 x 3]>
 3 <split [1423/158]> Fold03 <tibble [2 x 4]> <tibble [0 x 3]>
 4 <split [1423/158]> Fold04 <tibble [2 x 4]> <tibble [0 x 3]>
 5 <split [1423/158]> Fold05 <tibble [2 x 4]> <tibble [0 x 3]>
 6 <split [1423/158]> Fold06 <tibble [2 x 4]> <tibble [0 x 3]>
 7 <split [1423/158]> Fold07 <tibble [2 x 4]> <tibble [0 x 3]>
 8 <split [1423/158]> Fold08 <tibble [2 x 4]> <tibble [0 x 3]>
 9 <split [1423/158]> Fold09 <tibble [2 x 4]> <tibble [0 x 3]>
10 <split [1423/158]> Fold10 <tibble [2 x 4]> <tibble [0 x 3]>
  fit_rs2 <- wflow2 %>% fit_resamples(resamples = data_folds)
  fit_rs2
# Resampling results
# 10-fold cross-validation 
# A tibble: 10 x 4
   splits             id     .metrics         .notes          
   <list>             <chr>  <list>           <list>          
 1 <split [1422/159]> Fold01 <tibble [2 x 4]> <tibble [0 x 3]>
 2 <split [1423/158]> Fold02 <tibble [2 x 4]> <tibble [0 x 3]>
 3 <split [1423/158]> Fold03 <tibble [2 x 4]> <tibble [0 x 3]>
 4 <split [1423/158]> Fold04 <tibble [2 x 4]> <tibble [0 x 3]>
 5 <split [1423/158]> Fold05 <tibble [2 x 4]> <tibble [0 x 3]>
 6 <split [1423/158]> Fold06 <tibble [2 x 4]> <tibble [0 x 3]>
 7 <split [1423/158]> Fold07 <tibble [2 x 4]> <tibble [0 x 3]>
 8 <split [1423/158]> Fold08 <tibble [2 x 4]> <tibble [0 x 3]>
 9 <split [1423/158]> Fold09 <tibble [2 x 4]> <tibble [0 x 3]>
10 <split [1423/158]> Fold10 <tibble [2 x 4]> <tibble [0 x 3]>



6.6.1 Evaluate models

We use collect_metrics() to evaluate our models, averaging across our resamples and thereby accounting for the natural variability across the different analysis datasets that we could use to build the model.

# Collect metrics across resamples
  metrics1 <- collect_metrics(fit_rs1)
  metrics1
# A tibble: 2 x 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   1.97     10  0.0694 Preprocessor1_Model1
2 rsq     standard   0.111    10  0.0314 Preprocessor1_Model1
# Collect metrics across resamples
  metrics2 <- collect_metrics(fit_rs2)
  metrics2
# A tibble: 2 x 6
  .metric .estimator    mean     n  std_err .config             
  <chr>   <chr>        <dbl> <int>    <dbl> <chr>               
1 rmse    standard   241.       10 239.     Preprocessor1_Model1
2 rsq     standard     0.106    10   0.0259 Preprocessor1_Model1
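By default collect_metrics() averages across resamples; setting summarize = FALSE returns the metric for each individual fold, which is useful here to check whether the very large RMSE (and standard error) of recipe2 is driven by a single fold:

```r
# Fold-by-fold metrics instead of the averaged summary
collect_metrics(fit_rs2, summarize = FALSE)
```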

Our less flexible model (recipe1) seems to provide a better fit than the more flexible model that includes the polynomials (recipe2); the very large RMSE and standard error for recipe2 suggest that its predictions were far off in at least one resample.



6.7 Use workflow for final model/prediction and evaluation

Finally, when we are happy with our workflow/model (after our assessment using resampling), we can fit the model/workflow to data_train (without any resampling) and use it to predict the outcome in data_test to calculate the accuracy. Below we calculate the MAE (mean absolute error) for both models (based on the two recipes).
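The MAE is simply the average absolute prediction error, MAE = mean(|truth − prediction|). A tiny sketch with made-up values of what yardstick’s mae() computes (mae() removes missing values by default):

```r
# Manual MAE, equivalent to yardstick::mae() with na_rm = TRUE
truth    <- c(7, 5, 8, 2)
estimate <- c(6.5, 5.2, NA, 3.1)
mean(abs(truth - estimate), na.rm = TRUE)  # (0.5 + 0.2 + 1.1) / 3 = 0.6
```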

# Fit model 1
  fit1 <- wflow1 %>% 
  fit(data = data_train)
  
  extract_fit_parsnip(fit1) %>% # Check coefficients
  tidy()
# A tibble: 12 x 5
   term                                   estimate std.error statistic  p.value
   <chr>                                     <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                            4.96       0.705      7.04   4.36e-12
 2 unemployed_active                      0.0669     0.372      0.180  8.57e- 1
 3 age                                   -0.000359   0.00186   -0.193  8.47e- 1
 4 trust                                  0.264      0.0341     7.72   3.59e-14
 5 education                              0.121      0.0379     3.18   1.55e- 3
 6 religionEastern religions              1.01       1.18       0.858  3.91e- 1
 7 religionIslam                          0.172      0.684      0.251  8.02e- 1
 8 religionJewish                        -0.0731     0.926     -0.0790 9.37e- 1
 9 religionOther Christian denomination  -0.261      0.926     -0.282  7.78e- 1
10 religionOther Non-Christian religions  0.625      0.954      0.655  5.13e- 1
11 religionProtestant                    -0.119      0.744     -0.159  8.73e- 1
12 religionRoman Catholic                 0.516      0.661      0.780  4.36e- 1
# Fit model 2
  fit2 <- wflow2 %>% 
  fit(data = data_train)
  
  extract_fit_parsnip(fit2) %>%  # Check coefficients
  tidy()
# A tibble: 18 x 5
   term                                  estimate std.error statistic  p.value
   <chr>                                    <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                            6.64       0.658    10.1    1.69e-22
 2 unemployed_active                      0.00988    0.0702    0.141  8.88e- 1
 3 religionEastern religions              0.958      1.18      0.813  4.16e- 1
 4 religionIslam                          0.166      0.684     0.242  8.09e- 1
 5 religionJewish                        -0.212      0.929    -0.228  8.19e- 1
 6 religionOther Christian denomination  -0.115      0.926    -0.125  9.01e- 1
 7 religionOther Non-Christian religions  0.823      0.955     0.862  3.89e- 1
 8 religionProtestant                    -0.164      0.746    -0.220  8.26e- 1
 9 religionRoman Catholic                 0.541      0.665     0.813  4.16e- 1
10 age_poly_1                            -1.11       3.36     -0.331  7.41e- 1
11 age_poly_2                             3.04       3.22      0.945  3.45e- 1
12 age_poly_3                            -7.27       2.81     -2.59   9.81e- 3
13 trust_poly_1                          21.9        2.84      7.70   4.16e-14
14 trust_poly_2                           0.121      2.78      0.0434 9.65e- 1
15 trust_poly_3                          -3.59       2.77     -1.30   1.96e- 1
16 education_poly_1                      10.5        2.96      3.54   4.32e- 4
17 education_poly_2                       2.06       2.81      0.733  4.64e- 1
18 education_poly_3                      -2.72       2.84     -0.956  3.39e- 1
# Model 1: Calculate accuracy for test data
  fit1 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred) %>%
  mae(truth = lifesatisfaction, estimate = .pred)  
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard        1.54
# Model 2: Calculate accuracy for test data
  fit2 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred) %>%
  mae(truth = lifesatisfaction, estimate = .pred)  
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard        1.55

Beware: if the accuracy is too good to be true, there is probably some coding/recoding error… (I initially added a coding error at the start.)

And we can also check out the predicted values of the two models:

# Model 1: Augment & extract actual/predicted values
  fit1 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred)  
# A tibble: 396 x 2
   lifesatisfaction .pred
              <dbl> <dbl>
 1                7 NA   
 2                7  7.00
 3                5 NA   
 4                5 NA   
 5                8  7.91
 6                7 NA   
 7                8 NA   
 8                2 NA   
 9                9  7.78
10                8  5.69
# ... with 386 more rows
# Model 2: Augment & extract actual/predicted values
  fit2 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred) 
# A tibble: 396 x 2
   lifesatisfaction .pred
              <dbl> <dbl>
 1                7 NA   
 2                7  6.82
 3                5 NA   
 4                5 NA   
 5                8  7.89
 6                7 NA   
 7                8 NA   
 8                2 NA   
 9                9  7.68
10                8  5.95
# ... with 386 more rows

7 All the code

# install.packages("pacman")
pacman::p_load(tidyverse,
               tidymodels,
               naniar,
               knitr,
               kableExtra)
# Import the ESS data
data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1azBdvM9-tDVQhADml_qlXjhpX72uHdWx"))
data <- data  %>%
    filter(cntry=="FR") %>% # filter French respondents 
 # Recode missings
 mutate(ppltrst = if_else(ppltrst %in% c(55, 66, 77, 88, 99), NA, ppltrst),
        eisced = if_else(eisced %in% c(55, 66, 77, 88, 99), NA, eisced),
        stflife = if_else(stflife %in% c(55, 66, 77, 88, 99), NA, stflife),
        rlgdnm = if_else(rlgdnm %in% c(66, 77, 99), NA, rlgdnm)) %>%
 # Rename variables
 rename(country = cntry,
        unemployed_active = uempla,
        unemployed = uempli,
        trust = ppltrst,
        lifesatisfaction = stflife,
        education = eisced,
        age = agea,
        religion = rlgdnm) %>% # we add this variable
  mutate(religion = factor(recode(religion, # Recode and factor
    "1" = "Roman Catholic",
    "2" = "Protestant",
    "3" = "Eastern Orthodox",
    "4" = "Other Christian denomination",
    "5" = "Jewish",
    "6" = "Islam",
    "7" = "Eastern religions",
    "8" = "Other Non-Christian religions",
    .default = NA_character_)))
# Split the data into training and test data
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect

# Extract the two datasets
  data_train <- training(data_split)
  data_test <- testing(data_split) # Do not touch until the end!

# Create resampled partitions
  set.seed(345)
  data_folds <- vfold_cv(data_train, v = 10) # V-fold/k-fold cross-validation
  data_folds # data_folds now contains several resamples of our training data  
# Define the model
  linear_mod <- 
  linear_reg() %>% # linear model
  set_engine("lm") %>%
  set_mode("regression")
  

# Define recipe 1
  recipe1 <- 
  recipe(lifesatisfaction ~ unemployed_active + age + trust + education + religion, 
         data = data)
  recipe1
  
# Define recipe 2
  recipe2 <- 
  recipe(lifesatisfaction ~ unemployed_active + age + trust + education + religion, 
         data = data) %>%
  step_normalize(all_numeric_predictors()) %>% 
  # step_dummy(all_nominal_predictors()) %>%
  step_naomit(age, trust, education) %>%
  step_poly(age, trust, education, degree = 3) # What does "degree" specify?
  recipe2

# Define workflows
  wflow1 <- 
  workflow() %>% 
  add_model(linear_mod) %>%
  add_recipe(recipe1)

  wflow2 <- 
  workflow() %>% 
  add_model(linear_mod) %>%
  add_recipe(recipe2)
  fit_rs1 <- wflow1 %>% fit_resamples(resamples = data_folds)
  fit_rs1
  
  
  fit_rs2 <- wflow2 %>% fit_resamples(resamples = data_folds)
  fit_rs2
# Collect metrics across resamples
  metrics1 <- collect_metrics(fit_rs1)
  metrics1
  
# Collect metrics across resamples
  metrics2 <- collect_metrics(fit_rs2)
  metrics2
# Fit model 1
  fit1 <- wflow1 %>% 
  fit(data = data_train)
  
  extract_fit_parsnip(fit1) %>% # Check coefficients
  tidy()

# Fit model 2
  fit2 <- wflow2 %>% 
  fit(data = data_train)
  
  extract_fit_parsnip(fit2) %>%  # Check coefficients
  tidy()

  
# Model 1: Calculate accuracy for test data
  fit1 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred) %>%
  mae(truth = lifesatisfaction, estimate = .pred)  

# Model 2: Calculate accuracy for test data
  fit2 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred) %>%
  mae(truth = lifesatisfaction, estimate = .pred)  
# Model 1: Augment & extract actual/predicted values
  fit1 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred)  

# Model 2: Augment & extract actual/predicted values
  fit2 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred) 

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer.