library(tidymodels)
recipe1 <- recipe(lifesatisfaction ~ unemployed_active + age + education,
                  data = data) %>%
  step_normalize(all_numeric_predictors())
recipe1
Recipes (preprocessing)
Learning outcomes/objective: Learn…
- …what recipes are and how they help us to build ML models.
- …how to implement recipes with tidymodels in R.
Sources: Preprocess your data with recipes, Methods for selecting variables in step functions, Introduction to recipes, James et al. (2013, Ch. 5)
1 Finding(s)
2 Why preprocess data and recipes?
- Preprocessing = transforming raw data into a format that can be used by the model
- e.g., data cleaning, feature selection, feature engineering, data normalization (e.g., scale/transform variables to similar ranges, etc.)
- Various concepts overlap here (e.g., data encoding from text to numbers = feature engineering)
- Feature engineering: process of creating new features or transforming existing features in a dataset to improve the performance and interpretability of an ML model
- Goal: create features that better capture the underlying patterns and relationships in the data to provide more accurate and robust predictions
- Recipes provide pipeable steps for feature engineering and data preprocessing to prepare for modeling
3 Feature engineering
- Feature engineering involves a wide range of techniques such as…
- creating new features, i.e., combining existing features or transforming them to capture additional information
- e.g., capture a person’s SES in a variable by combining different variables or adding polynomials
- encoding of categorical variables to make them more informative for the model, e.g., adding them as dummies
- normalizing or scaling features, e.g., centering/scaling variables to improve interpretability
- handling missing values, e.g., imputing them (e.g., mean imputation, median imputation, or interpolation)
- reducing dimensionality, e.g., to reduce complexity of the model and improve performance
- extracting features from text or images, e.g., convert text or image data to numeric variables to use them in ML models
- extracting key features from raw variables, e.g., getting the day of the week out of a date variable
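Several of these transformations are easy to try out in plain base R before wrapping them in recipe steps. A minimal sketch with a hypothetical toy dataset (normalizing, dummy encoding, and polynomial expansion):

```r
# Hypothetical toy data to illustrate three common feature-engineering steps
df <- data.frame(
  age  = c(23, 35, 47, 59),
  educ = factor(c("low", "high", "mid", "high"))
)

# 1. Normalizing: center and scale a numeric feature (z-scores)
df$age_z <- as.numeric(scale(df$age))

# 2. Encoding: turn a factor into dummy/indicator columns
dummies <- model.matrix(~ educ - 1, data = df)

# 3. Polynomials: add orthogonal polynomial terms of a feature
age_poly <- poly(df$age, degree = 2)

round(mean(df$age_z), 10)  # ~0 after centering
ncol(dummies)              # one indicator column per factor level
```

In tidymodels, steps 1–3 correspond to `step_normalize()`, `step_dummy()`, and `step_poly()`, respectively.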
4 Recipes with tidymodels (1)
- With tidymodels we can also use so-called recipes to define models and preprocess the data
- recipe(): create a recipe for preprocessing data
- step_*(): functions for different preprocessing steps (see here for an overview)
- bake(recipe_*, new_data = xxx): apply a recipe to different datasets
- Advantage
- We can easily use & compare different recipes later on
- Also sometimes necessary to define different recipes for different models (e.g., logistic model vs. random forest model)
5 Recipes with tidymodels (2)
- Example: a recipe containing an outcome plus numeric predictors that centers and scales (“normalizes”) the predictors
- Recipes require definitions of variables (e.g., via the formula above), roles (i.e., how variables are used in the model), and terms (partly assigned automatically) (see here)
- Using recipes: once a recipe is defined, it needs to be estimated before being applied to data
- Most recipe steps have specific quantities that must be calculated or estimated
- step_normalize(): needs to compute the training set’s mean and standard deviation for the selected columns
- step_dummy(): needs to determine the factor levels of the selected columns in order to make the appropriate indicator columns
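This estimation step can be illustrated without tidymodels: conceptually, prep() computes the required statistics from the training data and bake() re-applies them, unchanged, to any dataset. A base-R sketch with hypothetical toy vectors:

```r
# What prep()/bake() do conceptually for step_normalize():
# the normalization statistics come from the *training* data only
train_x <- c(2, 4, 6, 8)
test_x  <- c(3, 5)

mu    <- mean(train_x)  # "prep": estimate quantities from the training set
sigma <- sd(train_x)

train_z <- (train_x - mu) / sigma  # "bake" on the training data
test_z  <- (test_x  - mu) / sigma  # "bake" on new data: same mu/sigma reused

test_z  # note: not mean 0 / sd 1, because the stats come from train_x
```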
6 Lab: Using recipes
We’ll use the ESS data from the previous lab. The variables are not named very intuitively, so we have to rely on the codebook to understand their meaning. You can find it here or on the website of the ESS.
- stflife: measures life satisfaction (How satisfied with life as a whole).
- uempla: measures unemployment (Doing last 7 days: unemployed, actively looking for job).
- uempli: measures unemployment (Doing last 7 days: unemployed, not actively looking for job).
- eisced: measures education (Highest level of education, ES - ISCED).
- cntry: measures a respondent’s country of origin (here held constant for France).
- etc.
# install.packages("pacman")
pacman::p_load(tidyverse,
tidymodels,
naniar,
knitr,
kableExtra)
We first import the data into R:
# Build and evaluate model without resampling
data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
"1azBdvM9-tDVQhADml_qlXjhpX72uHdWx"))6.1 Renaming and recoding
We recode and rename some of the variables. In principle, this could be added to the recipe; however, it also represents an initial step to generate a clean dataset. Here we also create a factor variable containing religious denomination.
data <- data %>%
filter(cntry=="FR") %>% # filter French respondents
# Recode missings
mutate(ppltrst = if_else(ppltrst %in% c(55, 66, 77, 88, 99), NA, ppltrst),
eisced = if_else(eisced %in% c(55, 66, 77, 88, 99), NA, eisced),
stflife = if_else(stflife %in% c(55, 66, 77, 88, 99), NA, stflife),
rlgdnm = if_else(rlgdnm %in% c(66, 77, 99), NA, rlgdnm)) %>%
# Rename variables
rename(country = cntry,
unemployed_active = uempla,
unemployed = uempli,
trust = ppltrst,
lifesatisfaction = stflife,
education = eisced,
age = agea,
religion = rlgdnm) %>% # we add this variable
mutate(religion = factor(recode(religion, # Recode and factor
"1" = "Roman Catholic",
"2" = "Protestant",
"3" = "Eastern Orthodox",
"4" = "Other Christian denomination",
"5" = "Jewish",
"6" = "Islam",
"7" = "Eastern religions",
"8" = "Other Non-Christian religions",
.default = NA_character_)))
6.2 Building a predictive linear model using a recipe
6.3 Split the data
Below we split the data and create resampled partitions with vfold_cv().
# Split the data into training and test data
data_split <- initial_split(data, prop = 0.80)
data_split # Inspect
<Training/Testing/Total>
<1581/396/1977>
# Extract the two datasets
data_train <- training(data_split)
data_test <- testing(data_split) # Do not touch until the end!
# Create resampled partitions
set.seed(345)
data_folds <- vfold_cv(data_train, v = 10) # V-fold/k-fold cross-validation
data_folds # data_folds now contains several resamples of our training data
# 10-fold cross-validation
# A tibble: 10 x 2
splits id
<list> <chr>
1 <split [1422/159]> Fold01
2 <split [1423/158]> Fold02
3 <split [1423/158]> Fold03
4 <split [1423/158]> Fold04
5 <split [1423/158]> Fold05
6 <split [1423/158]> Fold06
7 <split [1423/158]> Fold07
8 <split [1423/158]> Fold08
9 <split [1423/158]> Fold09
10 <split [1423/158]> Fold10
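What vfold_cv() does under the hood can be sketched in base R (hypothetical bookkeeping only; vfold_cv() additionally stores the analysis/assessment splits as rsplit objects):

```r
# Assign each of the n training rows to one of v folds at random
set.seed(345)
n <- 1581  # rows in data_train
v <- 10
fold_id <- sample(rep(seq_len(v), length.out = n))

# Fold k: assessment set = rows with fold_id == k, analysis set = all others
sizes <- table(fold_id)
sizes  # fold sizes differ by at most 1, since 1581 is not divisible by 10
```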
6.4 Define model, recipe and workflow
Then we define our model and a recipe, namely recipe1, in which we specify only the formula and data. In recipe2 we specify formula and data, use step_naomit() to omit missings, and use step_poly() to add a polynomial expansion of some variables. These expansions are then passed as variables to the linear regression model.
By using step_poly(age, trust, education, degree = 3) in recipe2, we include polynomials of the three predictors up to degree 3.
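What degree = 3 produces can be inspected with base R's poly(), which step_poly() relies on by default (hypothetical input vector):

```r
# step_poly(x, degree = 3) replaces x with three orthogonal polynomial
# columns (linear, quadratic, cubic), here named x_poly_1 ... x_poly_3
x  <- 18:80  # hypothetical age range
xp <- poly(x, degree = 3)

dim(xp)                   # length(x) rows, 3 columns
round(crossprod(xp), 10)  # columns are orthonormal
```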
# Define the model
linear_mod <-
linear_reg() %>% # linear model
set_engine("lm") %>%
set_mode("regression")
# Define recipe 1
recipe1 <-
recipe(lifesatisfaction ~ unemployed_active + age + trust + education + religion,
data = data)
recipe1
# Define recipe 2
recipe2 <-
recipe(lifesatisfaction ~ unemployed_active + age + trust + education + religion,
data = data) %>%
step_normalize(all_numeric_predictors()) %>%
# step_dummy(all_nominal_predictors()) %>%
step_naomit(age, trust, education) %>%
step_poly(age, trust, education, degree = 3) # What does "degree" specify?
recipe2
# Define workflows
wflow1 <-
workflow() %>%
add_model(linear_mod) %>%
add_recipe(recipe1)
wflow2 <-
workflow() %>%
add_model(linear_mod) %>%
add_recipe(recipe2)
6.5 Fit and evaluate the models (resampled training data)
6.6 Fit models
Then we evaluate our models using resampling with the fit_resamples() function, fitting the workflows to data_folds.
fit_rs1 <- wflow1 %>% fit_resamples(resamples = data_folds)
fit_rs1
# Resampling results
# 10-fold cross-validation
# A tibble: 10 x 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [1422/159]> Fold01 <tibble [2 x 4]> <tibble [0 x 3]>
2 <split [1423/158]> Fold02 <tibble [2 x 4]> <tibble [0 x 3]>
3 <split [1423/158]> Fold03 <tibble [2 x 4]> <tibble [0 x 3]>
4 <split [1423/158]> Fold04 <tibble [2 x 4]> <tibble [0 x 3]>
5 <split [1423/158]> Fold05 <tibble [2 x 4]> <tibble [0 x 3]>
6 <split [1423/158]> Fold06 <tibble [2 x 4]> <tibble [0 x 3]>
7 <split [1423/158]> Fold07 <tibble [2 x 4]> <tibble [0 x 3]>
8 <split [1423/158]> Fold08 <tibble [2 x 4]> <tibble [0 x 3]>
9 <split [1423/158]> Fold09 <tibble [2 x 4]> <tibble [0 x 3]>
10 <split [1423/158]> Fold10 <tibble [2 x 4]> <tibble [0 x 3]>
fit_rs2 <- wflow2 %>% fit_resamples(resamples = data_folds)
fit_rs2
# Resampling results
# 10-fold cross-validation
# A tibble: 10 x 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [1422/159]> Fold01 <tibble [2 x 4]> <tibble [0 x 3]>
2 <split [1423/158]> Fold02 <tibble [2 x 4]> <tibble [0 x 3]>
3 <split [1423/158]> Fold03 <tibble [2 x 4]> <tibble [0 x 3]>
4 <split [1423/158]> Fold04 <tibble [2 x 4]> <tibble [0 x 3]>
5 <split [1423/158]> Fold05 <tibble [2 x 4]> <tibble [0 x 3]>
6 <split [1423/158]> Fold06 <tibble [2 x 4]> <tibble [0 x 3]>
7 <split [1423/158]> Fold07 <tibble [2 x 4]> <tibble [0 x 3]>
8 <split [1423/158]> Fold08 <tibble [2 x 4]> <tibble [0 x 3]>
9 <split [1423/158]> Fold09 <tibble [2 x 4]> <tibble [0 x 3]>
10 <split [1423/158]> Fold10 <tibble [2 x 4]> <tibble [0 x 3]>
6.6.1 Evaluate models
We use collect_metrics() to evaluate our models, averaging across our resamples and thereby accounting for the natural variability across the different analysis datasets that we could use to build the model.
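The mean and std_err columns reported by collect_metrics() are simply the mean and standard error of the per-fold estimates. A base-R sketch with hypothetical per-fold RMSE values:

```r
# Hypothetical RMSE values from 10 resamples
fold_rmse <- c(1.9, 2.1, 1.8, 2.0, 2.2, 1.9, 2.0, 1.95, 2.05, 1.8)

mean_rmse <- mean(fold_rmse)                          # "mean" column
std_err   <- sd(fold_rmse) / sqrt(length(fold_rmse))  # "std_err" column

c(mean = mean_rmse, n = length(fold_rmse), std_err = std_err)
```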
# Collect metrics across resamples
metrics1 <- collect_metrics(fit_rs1)
metrics1
# A tibble: 2 x 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 1.97 10 0.0694 Preprocessor1_Model1
2 rsq standard 0.111 10 0.0314 Preprocessor1_Model1
# Collect metrics across resamples
metrics2 <- collect_metrics(fit_rs2)
metrics2
# A tibble: 2 x 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 241. 10 239. Preprocessor1_Model1
2 rsq standard 0.106 10 0.0259 Preprocessor1_Model1
Our less flexible model (recipe1) seems to provide a better fit than the more flexible model that includes the polynomials (recipe2). The huge RMSE mean and standard error for recipe2 (241 ± 239) indicate that the polynomial model produced extreme predictions in at least one resample.
6.7 Use workflow for final model/prediction and evaluation
Finally, when we are happy with our workflow/model (after our assessment using resampling), we can fit the model/workflow to data_train (without any resampling) and use it to predict the outcome in data_test to calculate the accuracy. Below we calculate the MAE (mean absolute error) for both our models (based on the two recipes).
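The MAE itself is easy to compute by hand as the mean absolute difference between truth and prediction; note that yardstick's mae() drops rows with missing predictions by default. A base-R sketch with hypothetical values:

```r
# Hypothetical observed and predicted life satisfaction values
truth <- c(7, 5, 8, 2, 9)
pred  <- c(7.0, 5.5, 7.9, 4.0, 7.8)

mae_by_hand <- mean(abs(truth - pred))
mae_by_hand
```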
# Fit model 1
fit1 <- wflow1 %>%
fit(data = data_train)
extract_fit_parsnip(fit1) %>% # Check coefficients
tidy()
# A tibble: 12 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.96 0.705 7.04 4.36e-12
2 unemployed_active 0.0669 0.372 0.180 8.57e- 1
3 age -0.000359 0.00186 -0.193 8.47e- 1
4 trust 0.264 0.0341 7.72 3.59e-14
5 education 0.121 0.0379 3.18 1.55e- 3
6 religionEastern religions 1.01 1.18 0.858 3.91e- 1
7 religionIslam 0.172 0.684 0.251 8.02e- 1
8 religionJewish -0.0731 0.926 -0.0790 9.37e- 1
9 religionOther Christian denomination -0.261 0.926 -0.282 7.78e- 1
10 religionOther Non-Christian religions 0.625 0.954 0.655 5.13e- 1
11 religionProtestant -0.119 0.744 -0.159 8.73e- 1
12 religionRoman Catholic 0.516 0.661 0.780 4.36e- 1
# Fit model 2
fit2 <- wflow2 %>%
fit(data = data_train)
extract_fit_parsnip(fit2) %>% # Check coefficients
tidy()
# A tibble: 18 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.64 0.658 10.1 1.69e-22
2 unemployed_active 0.00988 0.0702 0.141 8.88e- 1
3 religionEastern religions 0.958 1.18 0.813 4.16e- 1
4 religionIslam 0.166 0.684 0.242 8.09e- 1
5 religionJewish -0.212 0.929 -0.228 8.19e- 1
6 religionOther Christian denomination -0.115 0.926 -0.125 9.01e- 1
7 religionOther Non-Christian religions 0.823 0.955 0.862 3.89e- 1
8 religionProtestant -0.164 0.746 -0.220 8.26e- 1
9 religionRoman Catholic 0.541 0.665 0.813 4.16e- 1
10 age_poly_1 -1.11 3.36 -0.331 7.41e- 1
11 age_poly_2 3.04 3.22 0.945 3.45e- 1
12 age_poly_3 -7.27 2.81 -2.59 9.81e- 3
13 trust_poly_1 21.9 2.84 7.70 4.16e-14
14 trust_poly_2 0.121 2.78 0.0434 9.65e- 1
15 trust_poly_3 -3.59 2.77 -1.30 1.96e- 1
16 education_poly_1 10.5 2.96 3.54 4.32e- 4
17 education_poly_2 2.06 2.81 0.733 4.64e- 1
18 education_poly_3 -2.72 2.84 -0.956 3.39e- 1
# Model 1: Calculate accuracy for test data
fit1 %>%
augment(data_test) %>%
select(lifesatisfaction, .pred) %>%
mae(truth = lifesatisfaction, estimate = .pred)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 1.54
# Model 2: Calculate accuracy for test data
fit2 %>%
augment(data_test) %>%
select(lifesatisfaction, .pred) %>%
mae(truth = lifesatisfaction, estimate = .pred)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 1.55
Beware: if the accuracy seems too good to be true, there is probably some coding/recoding error… (I initially added a coding error at the start)
And we can also check out the predicted values of the two models:
# Model 1: Augment & extract actual/predicted values
fit1 %>%
augment(data_test) %>%
select(lifesatisfaction, .pred)
# A tibble: 396 x 2
lifesatisfaction .pred
<dbl> <dbl>
1 7 NA
2 7 7.00
3 5 NA
4 5 NA
5 8 7.91
6 7 NA
7 8 NA
8 2 NA
9 9 7.78
10 8 5.69
# ... with 386 more rows
# Model 2: Augment & extract actual/predicted values
fit2 %>%
augment(data_test) %>%
select(lifesatisfaction, .pred)
# A tibble: 396 x 2
lifesatisfaction .pred
<dbl> <dbl>
1 7 NA
2 7 6.82
3 5 NA
4 5 NA
5 8 7.89
6 7 NA
7 8 NA
8 2 NA
9 9 7.68
10 8 5.95
# ... with 386 more rows
7 All the code
# install.packages("pacman")
pacman::p_load(tidyverse,
tidymodels,
naniar,
knitr,
kableExtra)
# Build and evaluate model without resampling
data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
"1azBdvM9-tDVQhADml_qlXjhpX72uHdWx"))
data <- data %>%
filter(cntry=="FR") %>% # filter French respondents
# Recode missings
mutate(ppltrst = if_else(ppltrst %in% c(55, 66, 77, 88, 99), NA, ppltrst),
eisced = if_else(eisced %in% c(55, 66, 77, 88, 99), NA, eisced),
stflife = if_else(stflife %in% c(55, 66, 77, 88, 99), NA, stflife),
rlgdnm = if_else(rlgdnm %in% c(66, 77, 99), NA, rlgdnm)) %>%
# Rename variables
rename(country = cntry,
unemployed_active = uempla,
unemployed = uempli,
trust = ppltrst,
lifesatisfaction = stflife,
education = eisced,
age = agea,
religion = rlgdnm) %>% # we add this variable
mutate(religion = factor(recode(religion, # Recode and factor
"1" = "Roman Catholic",
"2" = "Protestant",
"3" = "Eastern Orthodox",
"4" = "Other Christian denomination",
"5" = "Jewish",
"6" = "Islam",
"7" = "Eastern religions",
"8" = "Other Non-Christian religions",
.default = NA_character_)))
# Split the data into training and test data
data_split <- initial_split(data, prop = 0.80)
data_split # Inspect
# Extract the two datasets
data_train <- training(data_split)
data_test <- testing(data_split) # Do not touch until the end!
# Create resampled partitions
set.seed(345)
data_folds <- vfold_cv(data_train, v = 10) # V-fold/k-fold cross-validation
data_folds # data_folds now contains several resamples of our training data
# Define the model
linear_mod <-
linear_reg() %>% # linear model
set_engine("lm") %>%
set_mode("regression")
# Define recipe 1
recipe1 <-
recipe(lifesatisfaction ~ unemployed_active + age + trust + education + religion,
data = data)
recipe1
# Define recipe 2
recipe2 <-
recipe(lifesatisfaction ~ unemployed_active + age + trust + education + religion,
data = data) %>%
step_normalize(all_numeric_predictors()) %>%
# step_dummy(all_nominal_predictors()) %>%
step_naomit(age, trust, education) %>%
step_poly(age, trust, education, degree = 3) # What does "degree" specify?
recipe2
# Define workflows
wflow1 <-
workflow() %>%
add_model(linear_mod) %>%
add_recipe(recipe1)
wflow2 <-
workflow() %>%
add_model(linear_mod) %>%
add_recipe(recipe2)
fit_rs1 <- wflow1 %>% fit_resamples(resamples = data_folds)
fit_rs1
fit_rs2 <- wflow2 %>% fit_resamples(resamples = data_folds)
fit_rs2
# Collect metrics across resamples
metrics1 <- collect_metrics(fit_rs1)
metrics1
# Collect metrics across resamples
metrics2 <- collect_metrics(fit_rs2)
metrics2
# Fit model 1
fit1 <- wflow1 %>%
fit(data = data_train)
extract_fit_parsnip(fit1) %>% # Check coefficients
tidy()
# Fit model 2
fit2 <- wflow2 %>%
fit(data = data_train)
extract_fit_parsnip(fit2) %>% # Check coefficients
tidy()
# Model 1: Calculate accuracy for test data
fit1 %>%
augment(data_test) %>%
select(lifesatisfaction, .pred) %>%
mae(truth = lifesatisfaction, estimate = .pred)
# Model 2: Calculate accuracy for test data
fit2 %>%
augment(data_test) %>%
select(lifesatisfaction, .pred) %>%
mae(truth = lifesatisfaction, estimate = .pred)
# Model 1: Augment & extract actual/predicted values
fit1 %>%
augment(data_test) %>%
select(lifesatisfaction, .pred)
# Model 2: Augment & extract actual/predicted values
fit2 %>%
augment(data_test) %>%
select(lifesatisfaction, .pred)