library(tidymodels)

recipe1 <- recipe(lifesatisfaction ~ unemployed_active + age + education,
                  data = data) %>%
  step_normalize(all_numeric_predictors())
recipe1
Recipes (preprocessing)
Learning outcomes/objective: Learn…
- …what recipes are and how they help us to build ML models.
- …how to implement recipes with tidymodels in R.
Sources: Preprocess your data with recipes, Methods for selecting variables in step functions, Introduction to recipes, James et al. (2013, Ch. 5)
1 Findings
2 Why preprocess data and recipes?
- Preprocessing: transforming raw data into a format that can be used by the model
- e.g., data cleaning, feature selection, feature engineering, data normalization (e.g., scaling/transforming variables to similar ranges)
- Various concepts overlap here (e.g., data encoding from text to numbers = feature engineering)
- Feature engineering: the process of creating new features or transforming existing features in a dataset to improve the performance and interpretability of an ML model
- Goal: create features that better capture the underlying patterns and relationships in the data to provide more accurate and robust predictions
- Recipes provide pipeable steps for feature engineering and data preprocessing to prepare for modeling
3 Feature engineering
- Feature engineering involves a wide range of techniques such as…
- creating new features, i.e., combining existing features or transforming them to capture additional information
- e.g., capture a person’s SES in a variable by combining different variables or adding polynomials
- encoding of categorical variables to make them more informative for the model, e.g., adding them as dummies
- normalizing or scaling features, e.g., centering/scaling variables to improve interpretability
- handling missing values, e.g., imputing them (e.g., mean imputation, median imputation, or interpolation)
- reducing dimensionality, e.g., to reduce complexity of the model and improve performance
- extracting features from text or images, e.g., converting text or image data to numeric variables to use them in ML models
- extracting key features from raw variables, e.g., getting the day of the week out of a date variable
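A few of these techniques can be sketched in a few lines of base R (the toy data frame and variable names below are made up for illustration):

```r
# Toy data (hypothetical variables, for illustration only)
df <- data.frame(
  date      = as.Date(c("2024-01-01", "2024-01-06", "2024-01-10")),
  income    = c(1200, 3400, 2100),
  education = factor(c("low", "high", "mid"))
)

# Extract a key feature from a raw variable: day of week (1 = Monday)
df$weekday <- as.integer(format(df$date, "%u"))

# Normalize a numeric feature: center and scale it
df$income_z <- as.numeric(scale(df$income))

# Encode a categorical variable as dummy/indicator columns
dummies <- model.matrix(~ education - 1, data = df)
```

In recipes, the corresponding steps are step_normalize() and step_dummy() (both used below); step_date() handles date features.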
4 Recipes with tidymodels (1)
- With tidymodels we can also use so-called recipes to define models and preprocess the data

recipe()
: Create a recipe for preprocessing data

step_*()
: functions for the different preprocessing steps (see here for an overview)

bake(recipe_*, new_data = xxx)
: apply a recipe to different datasets
- Advantages
- We can easily use & compare different recipes later on
- Also sometimes necessary to define different recipes for different models (e.g., logistic model vs. random forest model)
5 Recipes with tidymodels (2)
- Example below: A recipe containing an outcome plus two numeric predictors that centers and scales (“normalizes”) the predictors
- Recipes require definitions of variables (e.g., in the formula above), roles (i.e., how variables are used in the model) and terms (this partly happens automatically) (see here)
- Using recipes: once a recipe is defined, it needs to be estimated before being applied to data
- Most recipe steps have specific quantities that must be calculated or estimated
step_normalize()
: needs to compute the training set’s means (and standard deviations) for the selected columns

step_dummy()
: needs to determine the factor levels of the selected columns in order to make the appropriate indicator columns
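A minimal sketch of this estimate-then-apply logic, using the built-in mtcars data rather than the ESS data: prep() estimates the required quantities (here: means and standard deviations) from the training data, and bake() applies them to a dataset.

```r
library(tidymodels)

rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_normalize(all_numeric_predictors())

rec_prepped <- prep(rec, training = mtcars)    # estimate means/SDs
baked <- bake(rec_prepped, new_data = mtcars)  # apply them

# After normalization both predictors have mean ~0 and SD 1
summary(baked$disp)
```

When a workflow is fit (as in the lab below), prep() and bake() happen automatically behind the scenes.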
6 Lab: Using recipes
We’ll use the ESS data that we have used in the previous lab. The variables are not named very well, so we have to rely on the codebook to understand their meaning. You can find it here or on the website of the ESS.
stflife
: measures life satisfaction (How satisfied with life as a whole).

uempla
: measures unemployment (Doing last 7 days: unemployed, actively looking for job).

uempli
: measures unemployment (Doing last 7 days: unemployed, not actively looking for job).

eisced
: measures education (Highest level of education, ES - ISCED).

cntry
: measures a respondent’s country of origin (here held constant for France).

- etc.
# install.packages("pacman")
pacman::p_load(tidyverse,
               tidymodels,
               naniar,
               knitr,
               kableExtra)
We first import the data into R:
# Import the data
data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1azBdvM9-tDVQhADml_qlXjhpX72uHdWx"))
6.1 Renaming and recoding
We recode and rename some of the variables. In principle, this could be added to the recipe; however, it also represents an initial step to generate a clean dataset. Here we also create a factor variable containing religious denomination.
data <- data %>%
  filter(cntry == "FR") %>% # filter French respondents
  # Recode missings
  mutate(ppltrst = if_else(ppltrst %in% c(55, 66, 77, 88, 99), NA, ppltrst),
         eisced = if_else(eisced %in% c(55, 66, 77, 88, 99), NA, eisced),
         stflife = if_else(stflife %in% c(55, 66, 77, 88, 99), NA, stflife),
         rlgdnm = if_else(rlgdnm %in% c(66, 77, 99), NA, rlgdnm)) %>%
  # Rename variables
  rename(country = cntry,
         unemployed_active = uempla,
         unemployed = uempli,
         trust = ppltrst,
         lifesatisfaction = stflife,
         education = eisced,
         age = agea,
         religion = rlgdnm) %>% # we add this variable
  # Recode religion and convert to factor
  mutate(religion = factor(recode(religion,
                                  "1" = "Roman Catholic",
                                  "2" = "Protestant",
                                  "3" = "Eastern Orthodox",
                                  "4" = "Other Christian denomination",
                                  "5" = "Jewish",
                                  "6" = "Islam",
                                  "7" = "Eastern religions",
                                  "8" = "Other Non-Christian religions",
                                  .default = NA_character_)))
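Since naniar is loaded anyway, one optional sanity check after recoding is to summarize missingness per variable. A sketch on a small toy data frame (with the ESS data you would pass data itself):

```r
library(naniar)

# Toy data frame standing in for the recoded ESS data
toy <- data.frame(trust = c(5, NA, 7), education = c(NA, NA, 3))

miss_var_summary(toy)  # one row per variable: n_miss and pct_miss
```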
6.2 Building a predictive linear model using a recipe
6.3 Split the data
Below we split the data and create resampled partitions with vfold_cv().
# Split the data into training and test data
data_split <- initial_split(data, prop = 0.80)

# Inspect data_split
data_split
<Training/Testing/Total>
<1581/396/1977>

# Extract the two datasets
data_train <- training(data_split)
data_test <- testing(data_split) # Do not touch until the end!

# Create resampled partitions
set.seed(345)
data_folds <- vfold_cv(data_train, v = 10) # V-fold/k-fold cross-validation

# data_folds now contains several resamples of our training data
data_folds
# 10-fold cross-validation
# A tibble: 10 x 2
splits id
<list> <chr>
1 <split [1422/159]> Fold01
2 <split [1423/158]> Fold02
3 <split [1423/158]> Fold03
4 <split [1423/158]> Fold04
5 <split [1423/158]> Fold05
6 <split [1423/158]> Fold06
7 <split [1423/158]> Fold07
8 <split [1423/158]> Fold08
9 <split [1423/158]> Fold09
10 <split [1423/158]> Fold10
6.4 Define model, recipe and workflow
Then we define our model and a recipe, namely recipe1, in which we specify only the formula and data. In recipe2 we specify formula and data, use step_naomit() to omit missings, and use step_poly() to add a polynomial expansion of some variables. Those expansions are then passed as variables to the linear regression model.

By using step_poly(age, trust, education, degree = 3) in recipe2 we include polynomials of the three predictors up to degree 3.
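To see what degree controls: step_poly() is built on stats::poly(), which by default returns orthogonal polynomial terms up to the given degree, one column per degree (this is why age_poly_1 to age_poly_3 appear as predictors in the fitted model later):

```r
x <- 1:10
p <- poly(x, degree = 3)  # orthogonal basis: linear, quadratic, cubic term
ncol(p)  # 3 columns
```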
# Define the model
linear_mod <-
  linear_reg() %>% # linear model
  set_engine("lm") %>%
  set_mode("regression")

# Define recipe 1
recipe1 <-
  recipe(lifesatisfaction ~ unemployed_active + age + trust + education + religion,
         data = data)
recipe1

# Define recipe 2
recipe2 <-
  recipe(lifesatisfaction ~ unemployed_active + age + trust + education + religion,
         data = data) %>%
  step_normalize(all_numeric_predictors()) %>%
  # step_dummy(all_nominal_predictors()) %>%
  step_naomit(age, trust, education) %>%
  step_poly(age, trust, education, degree = 3) # What does "degree" specify?
recipe2

# Define workflows
wflow1 <-
  workflow() %>%
  add_model(linear_mod) %>%
  add_recipe(recipe1)

wflow2 <-
  workflow() %>%
  add_model(linear_mod) %>%
  add_recipe(recipe2)
6.5 Fit and evaluate the models (resampled training data)
6.6 Fit models
Then we evaluate our models using resampling with the fit_resamples() function, fitting the workflows to data_folds.

fit_rs1 <- wflow1 %>% fit_resamples(resamples = data_folds)
fit_rs1
# Resampling results
# 10-fold cross-validation
# A tibble: 10 x 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [1422/159]> Fold01 <tibble [2 x 4]> <tibble [0 x 3]>
2 <split [1423/158]> Fold02 <tibble [2 x 4]> <tibble [0 x 3]>
3 <split [1423/158]> Fold03 <tibble [2 x 4]> <tibble [0 x 3]>
4 <split [1423/158]> Fold04 <tibble [2 x 4]> <tibble [0 x 3]>
5 <split [1423/158]> Fold05 <tibble [2 x 4]> <tibble [0 x 3]>
6 <split [1423/158]> Fold06 <tibble [2 x 4]> <tibble [0 x 3]>
7 <split [1423/158]> Fold07 <tibble [2 x 4]> <tibble [0 x 3]>
8 <split [1423/158]> Fold08 <tibble [2 x 4]> <tibble [0 x 3]>
9 <split [1423/158]> Fold09 <tibble [2 x 4]> <tibble [0 x 3]>
10 <split [1423/158]> Fold10 <tibble [2 x 4]> <tibble [0 x 3]>
fit_rs2 <- wflow2 %>% fit_resamples(resamples = data_folds)
fit_rs2
# Resampling results
# 10-fold cross-validation
# A tibble: 10 x 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [1422/159]> Fold01 <tibble [2 x 4]> <tibble [0 x 3]>
2 <split [1423/158]> Fold02 <tibble [2 x 4]> <tibble [0 x 3]>
3 <split [1423/158]> Fold03 <tibble [2 x 4]> <tibble [0 x 3]>
4 <split [1423/158]> Fold04 <tibble [2 x 4]> <tibble [0 x 3]>
5 <split [1423/158]> Fold05 <tibble [2 x 4]> <tibble [0 x 3]>
6 <split [1423/158]> Fold06 <tibble [2 x 4]> <tibble [0 x 3]>
7 <split [1423/158]> Fold07 <tibble [2 x 4]> <tibble [0 x 3]>
8 <split [1423/158]> Fold08 <tibble [2 x 4]> <tibble [0 x 3]>
9 <split [1423/158]> Fold09 <tibble [2 x 4]> <tibble [0 x 3]>
10 <split [1423/158]> Fold10 <tibble [2 x 4]> <tibble [0 x 3]>
6.6.1 Evaluate models
We use collect_metrics() to evaluate our models, averaging across our resamples and thereby accounting for the natural variability across the different analysis datasets that we could possibly use to build the model.
# Collect metrics across resamples
metrics1 <- collect_metrics(fit_rs1)
metrics1
# A tibble: 2 x 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 1.97 10 0.0694 Preprocessor1_Model1
2 rsq standard 0.111 10 0.0314 Preprocessor1_Model1
# Collect metrics across resamples
metrics2 <- collect_metrics(fit_rs2)
metrics2
# A tibble: 2 x 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 241. 10 239. Preprocessor1_Model1
2 rsq standard 0.106 10 0.0259 Preprocessor1_Model1
Our less flexible model (recipe1) seems to provide a better fit than the more flexible model that includes the polynomials (recipe2).
6.7 Use workflow for final model/prediction and evaluation
Finally, when we are happy with our workflow/model (after our assessment using resampling), we can fit the model/workflow to data_train (not using any resampling) and use it to predict the outcome in data_test to calculate the accuracy. Below we calculate the MAE (mean absolute error) for both our models (based on the two recipes).
# Fit model 1
fit1 <- wflow1 %>%
  fit(data = data_train)
extract_fit_parsnip(fit1) %>% # Check coefficients
  tidy()
# A tibble: 12 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.96 0.705 7.04 4.36e-12
2 unemployed_active 0.0669 0.372 0.180 8.57e- 1
3 age -0.000359 0.00186 -0.193 8.47e- 1
4 trust 0.264 0.0341 7.72 3.59e-14
5 education 0.121 0.0379 3.18 1.55e- 3
6 religionEastern religions 1.01 1.18 0.858 3.91e- 1
7 religionIslam 0.172 0.684 0.251 8.02e- 1
8 religionJewish -0.0731 0.926 -0.0790 9.37e- 1
9 religionOther Christian denomination -0.261 0.926 -0.282 7.78e- 1
10 religionOther Non-Christian religions 0.625 0.954 0.655 5.13e- 1
11 religionProtestant -0.119 0.744 -0.159 8.73e- 1
12 religionRoman Catholic 0.516 0.661 0.780 4.36e- 1
# Fit model 2
fit2 <- wflow2 %>%
  fit(data = data_train)
extract_fit_parsnip(fit2) %>% # Check coefficients
  tidy()
# A tibble: 18 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.64 0.658 10.1 1.69e-22
2 unemployed_active 0.00988 0.0702 0.141 8.88e- 1
3 religionEastern religions 0.958 1.18 0.813 4.16e- 1
4 religionIslam 0.166 0.684 0.242 8.09e- 1
5 religionJewish -0.212 0.929 -0.228 8.19e- 1
6 religionOther Christian denomination -0.115 0.926 -0.125 9.01e- 1
7 religionOther Non-Christian religions 0.823 0.955 0.862 3.89e- 1
8 religionProtestant -0.164 0.746 -0.220 8.26e- 1
9 religionRoman Catholic 0.541 0.665 0.813 4.16e- 1
10 age_poly_1 -1.11 3.36 -0.331 7.41e- 1
11 age_poly_2 3.04 3.22 0.945 3.45e- 1
12 age_poly_3 -7.27 2.81 -2.59 9.81e- 3
13 trust_poly_1 21.9 2.84 7.70 4.16e-14
14 trust_poly_2 0.121 2.78 0.0434 9.65e- 1
15 trust_poly_3 -3.59 2.77 -1.30 1.96e- 1
16 education_poly_1 10.5 2.96 3.54 4.32e- 4
17 education_poly_2 2.06 2.81 0.733 4.64e- 1
18 education_poly_3 -2.72 2.84 -0.956 3.39e- 1
# Model 1: Calculate accuracy for test data
fit1 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred) %>%
  mae(truth = lifesatisfaction, estimate = .pred)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 1.54
# Model 2: Calculate accuracy for test data
fit2 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred) %>%
  mae(truth = lifesatisfaction, estimate = .pred)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 1.55
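What mae() computes can be verified by hand on toy numbers: the mean of the absolute differences between truth and estimate.

```r
truth    <- c(3, 5, 7)
estimate <- c(2.5, 5.5, 9)
mean(abs(truth - estimate))  # (0.5 + 0.5 + 2) / 3 = 1
```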
Beware: If the accuracy is too good to be true, there is probably some coding/recoding error… (I initially had a coding error at the start.)
And we can also check out the predicted values of the two models (note the NA predictions: observations with missing predictor values cannot be scored):
# Model 1: Augment & extract actual/predicted values
fit1 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred)
# A tibble: 396 x 2
lifesatisfaction .pred
<dbl> <dbl>
1 7 NA
2 7 7.00
3 5 NA
4 5 NA
5 8 7.91
6 7 NA
7 8 NA
8 2 NA
9 9 7.78
10 8 5.69
# ... with 386 more rows
# Model 2: Augment & extract actual/predicted values
fit2 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred)
# A tibble: 396 x 2
lifesatisfaction .pred
<dbl> <dbl>
1 7 NA
2 7 6.82
3 5 NA
4 5 NA
5 8 7.89
6 7 NA
7 8 NA
8 2 NA
9 9 7.68
10 8 5.95
# ... with 386 more rows
7 All the code
# install.packages("pacman")
pacman::p_load(tidyverse,
               tidymodels,
               naniar,
               knitr,
               kableExtra)

# Import the data
data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1azBdvM9-tDVQhADml_qlXjhpX72uHdWx"))

data <- data %>%
  filter(cntry == "FR") %>% # filter French respondents
  # Recode missings
  mutate(ppltrst = if_else(ppltrst %in% c(55, 66, 77, 88, 99), NA, ppltrst),
         eisced = if_else(eisced %in% c(55, 66, 77, 88, 99), NA, eisced),
         stflife = if_else(stflife %in% c(55, 66, 77, 88, 99), NA, stflife),
         rlgdnm = if_else(rlgdnm %in% c(66, 77, 99), NA, rlgdnm)) %>%
  # Rename variables
  rename(country = cntry,
         unemployed_active = uempla,
         unemployed = uempli,
         trust = ppltrst,
         lifesatisfaction = stflife,
         education = eisced,
         age = agea,
         religion = rlgdnm) %>% # we add this variable
  # Recode religion and convert to factor
  mutate(religion = factor(recode(religion,
                                  "1" = "Roman Catholic",
                                  "2" = "Protestant",
                                  "3" = "Eastern Orthodox",
                                  "4" = "Other Christian denomination",
                                  "5" = "Jewish",
                                  "6" = "Islam",
                                  "7" = "Eastern religions",
                                  "8" = "Other Non-Christian religions",
                                  .default = NA_character_)))

# Split the data into training and test data
data_split <- initial_split(data, prop = 0.80)

# Inspect
data_split

# Extract the two datasets
data_train <- training(data_split)
data_test <- testing(data_split) # Do not touch until the end!

# Create resampled partitions
set.seed(345)
data_folds <- vfold_cv(data_train, v = 10) # V-fold/k-fold cross-validation

# data_folds now contains several resamples of our training data
data_folds

# Define the model
linear_mod <-
  linear_reg() %>% # linear model
  set_engine("lm") %>%
  set_mode("regression")

# Define recipe 1
recipe1 <-
  recipe(lifesatisfaction ~ unemployed_active + age + trust + education + religion,
         data = data)
recipe1

# Define recipe 2
recipe2 <-
  recipe(lifesatisfaction ~ unemployed_active + age + trust + education + religion,
         data = data) %>%
  step_normalize(all_numeric_predictors()) %>%
  # step_dummy(all_nominal_predictors()) %>%
  step_naomit(age, trust, education) %>%
  step_poly(age, trust, education, degree = 3) # What does "degree" specify?
recipe2

# Define workflows
wflow1 <-
  workflow() %>%
  add_model(linear_mod) %>%
  add_recipe(recipe1)

wflow2 <-
  workflow() %>%
  add_model(linear_mod) %>%
  add_recipe(recipe2)

# Fit the workflows to the resampled partitions
fit_rs1 <- wflow1 %>% fit_resamples(resamples = data_folds)
fit_rs1

fit_rs2 <- wflow2 %>% fit_resamples(resamples = data_folds)
fit_rs2

# Collect metrics across resamples
metrics1 <- collect_metrics(fit_rs1)
metrics1

metrics2 <- collect_metrics(fit_rs2)
metrics2

# Fit model 1
fit1 <- wflow1 %>%
  fit(data = data_train)
extract_fit_parsnip(fit1) %>% # Check coefficients
  tidy()

# Fit model 2
fit2 <- wflow2 %>%
  fit(data = data_train)
extract_fit_parsnip(fit2) %>% # Check coefficients
  tidy()

# Model 1: Calculate accuracy for test data
fit1 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred) %>%
  mae(truth = lifesatisfaction, estimate = .pred)

# Model 2: Calculate accuracy for test data
fit2 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred) %>%
  mae(truth = lifesatisfaction, estimate = .pred)

# Model 1: Augment & extract actual/predicted values
fit1 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred)

# Model 2: Augment & extract actual/predicted values
fit2 %>%
  augment(data_test) %>%
  select(lifesatisfaction, .pred)