I am Kwangtae Kim. I have gained experience across a wide range of HR tasks in the HR Office of H Group,
and I analyze HR data such as personnel records, organizational culture surveys, and collaboration networks.
The courses and materials I studied from 2019 to early 2020 to build my analytics skills,
 together with HR-related analysis techniques, are organized at the link below.
The field of data analytics advances day by day and new techniques keep being introduced, so I never stop studying.
Sharing analysis know-how is always welcome, as are questions about anything I post.
 If you email me at yuaye.kt@gmail.com, I will reply :)
Attrition has always been a topic of interest in HR.
 In recent years, open innovation has blurred the boundaries of talent pools across industries and companies,
 so the importance of talent attraction for outstanding people and of attrition/turnover management
 for key employees keeps growing, especially in fields such as software and bio.1
Attrition modeling, which predicts who will leave, has long been studied in countries such as the United States,
 where talent pools cross industry boundaries; the chart below shows that research on attrition/turnover
 in the management field continues to increase.
Many companies already analyze attrition/turnover and build predictive models
 that they use for employee retention, talent attraction, and more.2
data.frame(x = 2000:2021, y = c(1,1,14,10,17,20,22,31,27,36,34,56,47,50,59,54,72,97,85,112,115,75)) %>% 
  ggplot(aes(x, y, label = paste0(y, " papers"))) +
  geom_line() + geom_point() + ggrepel::geom_text_repel(size = 3) +
  theme_bw() +
  theme(plot.caption = element_text(hjust = 0), axis.line = element_line(size = 1),
        axis.ticks = element_line(size = 1), panel.border = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  labs(x = "publication year", y = "Number of publications in the field of management",
       caption = "Note. Number of papers containing the Attrition/Turnover keywords, based on Web of Science")
Many open-source examples based on Python and R already exist,
 but most of them focus only on the analysis techniques themselves, such as hyperparameter tuning,
 so I carried out the modeling with an added theoretical background and my own HR work experience.
This post summarizes a basic analysis built around three models;
 content that cannot be shown all at once, such as HR-experience-driven modeling decisions and preprocessing, is not included.
Attrition models combine variables that predict turnover into statistical algorithms that then estimate the probability of employee turnover within a given timeframe, or at a specific timepoint;
Regarding the purpose of attrition modeling, four purposes have been proposed based on a number of prior studies, summarized as follows:
 1. pre-employment selection4
 2. validate and develop training initiatives5
 3. facilitate workforce planning discussions with specific parts of the company6
 4. create ad hoc programs to reduce attrition7
In this way, attrition modeling provides a variety of implications for strategic HR, such as hiring decisions,
 initiatives for workforce planning and development, and the design of interventions to reduce employee attrition.
Purpose of Attrition Modeling: The formed attrition estimates can then serve a number of purposes, including use for pre-employment selection (Gibson et al., 2019; Strickland, 2005), to validate and develop training initiatives (McCloy et al., 2016; Strickland, 2005), to facilitate workforce planning discussions with specific parts of the company (Speer et al., 2019), to create ad hoc programs to reduce attrition (Strickland, 2005) and a variety of other HR purposes generally aimed at understanding and impacting employee turnover. The work is conducted both internally and by external vendors as well. For example, HR software companies currently offer features that include projected group-level turnover estimates within HR dashboards, as well as risk projections for individual employees. These are often accompanied by in-depth studies into the root causes of turnover, which then facilitate turnover interventions. Thus, attrition models serve various strategic HR purposes.
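As a concrete illustration of the dashboard use case above, the individual attrition probabilities produced by any of the models below can be rolled up into group-level figures. A minimal sketch with made-up numbers (the scored tibble, the 0.5 flag threshold and the column names are hypothetical, not part of the original analysis):
library(tidyverse)
# Hypothetical scored employees: .pred_Yes is a model-estimated attrition probability
scored <- tibble(
  EmployeeNumber = 1:6,
  Department     = c("Sales", "Sales", "R&D", "R&D", "HR", "HR"),
  .pred_Yes      = c(0.72, 0.15, 0.05, 0.40, 0.63, 0.08)
)
scored %>%
  group_by(Department) %>%
  summarise(
    headcount        = n(),
    expected_leavers = sum(.pred_Yes),        # projected group-level turnover
    high_risk        = sum(.pred_Yes >= 0.5)  # employees flagged for individual follow-up
  )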
We build an attrition model based on the IBM HR Analytics dataset available on Kaggle.
Following the tidyverse ecosystem, I tried to write everything as tidy as possible
and to show the flow in which I have been carrying out People Analytics work.
The modeling follows the process below (the packages listed are assumed to be attached; a setup sketch follows the table).
| No | Process | R Packages | 
|---|---|---|
| 1 | Literature Review | |
| 2 | Data Import | tidyverse | 
| 3 | Tidy data + Transformation, Pre-Processing | tidyverse | 
| 4 | Visualization for EDA, Feature Engineering | dlookr, ExPanDaR, tidyverse | 
| 5 | Modeling(1) Logistic Regression | tidymodels | 
| 6 | Modeling(2) RandomForest | tidymodels, randomForest | 
| 7 | Modeling(3) AutoML | h2o | 
| 8 | Reporting | bookdown | 
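All of the code below assumes the required packages are already attached. A minimal setup sketch (this exact list and loading order are my assumptions, inferred from the calls used in the report):
# Packages used throughout the report (assumed setup)
library(tidyverse)     # import, wrangling, ggplot2
library(dlookr)        # diagnose(), diagnose_category(), diagnose_numeric(), plot_outlier()
library(ExPanDaR)      # ExPanD() for interactive EDA
library(tidymodels)    # rsample, recipes, parsnip, tune, workflows, yardstick
library(ranger)        # random forest engine
library(vip)           # variable importance
library(naniar)        # gg_miss_var()
library(chemometrics)  # Moutlier() for multivariate outliers
library(h2o)           # AutoML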
## [1] "/Users/raymondkim/Rproject/Turnover"
# Import as a tibble with read_csv 
Dataset <- read_csv("archive/Data.csv")
## 
## ─ Column specification ────────────────────────────
## cols(
##   .default = col_double(),
##   Attrition = col_character(),
##   BusinessTravel = col_character(),
##   Department = col_character(),
##   EducationField = col_character(),
##   Gender = col_character(),
##   JobRole = col_character(),
##   MaritalStatus = col_character(),
##   Over18 = col_character(),
##   OverTime = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
# Check that the data was imported correctly 
Dataset %>% glimpse
## Rows: 1,470
## Columns: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department               <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …
# Check which variables the dataset contains 
Dataset %>% diagnose() %>% arrange(unique_count)
## # A tibble: 35 x 6
##    variables      types   missing_count missing_percent unique_count unique_rate
##    <chr>          <chr>           <int>           <dbl>        <int>       <dbl>
##  1 EmployeeCount  numeric             0               0            1    0.000680
##  2 Over18         charac…             0               0            1    0.000680
##  3 StandardHours  numeric             0               0            1    0.000680
##  4 Attrition      charac…             0               0            2    0.00136 
##  5 Gender         charac…             0               0            2    0.00136 
##  6 OverTime       charac…             0               0            2    0.00136 
##  7 PerformanceRa… numeric             0               0            2    0.00136 
##  8 BusinessTravel charac…             0               0            3    0.00204 
##  9 Department     charac…             0               0            3    0.00204 
## 10 MaritalStatus  charac…             0               0            3    0.00204 
## # … with 25 more rows
  # Remove the variables with unique_count = 1 (Over18, EmployeeCount, StandardHours)
  # and the identifier variable EmployeeNumber, which carries no information
Dataset %>% dplyr::select(-Over18, -EmployeeCount, -StandardHours, -EmployeeNumber)->Dataset
Dataset %>% diagnose_category()
## # A tibble: 30 x 6
##    variables      levels                     N  freq ratio  rank
##    <chr>          <chr>                  <int> <int> <dbl> <int>
##  1 Attrition      No                      1470  1233 83.9      1
##  2 Attrition      Yes                     1470   237 16.1      2
##  3 BusinessTravel Travel_Rarely           1470  1043 71.0      1
##  4 BusinessTravel Travel_Frequently       1470   277 18.8      2
##  5 BusinessTravel Non-Travel              1470   150 10.2      3
##  6 Department     Research & Development  1470   961 65.4      1
##  7 Department     Sales                   1470   446 30.3      2
##  8 Department     Human Resources         1470    63  4.29     3
##  9 EducationField Life Sciences           1470   606 41.2      1
## 10 EducationField Medical                 1470   464 31.6      2
## # … with 20 more rows
Dataset %>% diagnose_numeric()
## # A tibble: 23 x 10
##    variables            min    Q1   mean median     Q3   max  zero minus outlier
##    <chr>              <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl> <int> <int>   <int>
##  1 Age                   18    30 3.69e1     36   43      60     0     0       0
##  2 DailyRate            102   465 8.02e2    802 1157    1499     0     0       0
##  3 DistanceFromHome       1     2 9.19e0      7   14      29     0     0       0
##  4 Education              1     2 2.91e0      3    4       5     0     0       0
##  5 EnvironmentSatisf…     1     2 2.72e0      3    4       4     0     0       0
##  6 HourlyRate            30    48 6.59e1     66   83.8   100     0     0       0
##  7 JobInvolvement         1     2 2.73e0      3    3       4     0     0       0
##  8 JobLevel               1     1 2.06e0      2    3       5     0     0       0
##  9 JobSatisfaction        1     2 2.73e0      3    4       4     0     0       0
## 10 MonthlyIncome       1009  2911 6.50e3   4919 8379   19999     0     0     114
## # … with 13 more rows
# Education, PerformanceRating, RelationshipSatisfaction, WorkLifeBalance, JobLevel,
# StockOptionLevel and NumCompaniesWorked are really categorical/ordinal variables coded as numbers.
# Convert the survey-style scales below to labelled factors so later analyses are easier to interpret.
#1) Education
gsub(1, 'below College',Dataset$Education) -> Dataset$Education
gsub(2, 'College',Dataset$Education) -> Dataset$Education
gsub(3, 'Bachelor',Dataset$Education) -> Dataset$Education
gsub(4, 'Master',Dataset$Education) -> Dataset$Education
gsub(5, 'Doctor',Dataset$Education) -> Dataset$Education
Dataset$Education %>% as.factor %>% unique
## [1] College       below College Master        Bachelor      Doctor       
## Levels: Bachelor below College College Doctor Master
#2) Performance Rating
gsub(1, 'Low',Dataset$PerformanceRating) -> Dataset$PerformanceRating
gsub(2, 'Good',Dataset$PerformanceRating) -> Dataset$PerformanceRating
gsub(3, 'Excellent',Dataset$PerformanceRating) -> Dataset$PerformanceRating
gsub(4, 'Outstanding',Dataset$PerformanceRating) -> Dataset$PerformanceRating
Dataset$PerformanceRating %>% as.factor %>% unique
## [1] Excellent   Outstanding
## Levels: Excellent Outstanding
#3) WorklifeBalance
gsub(1, 'Bad',Dataset$WorkLifeBalance) -> Dataset$WorkLifeBalance
gsub(2, 'Good',Dataset$WorkLifeBalance) -> Dataset$WorkLifeBalance
gsub(3, 'Better',Dataset$WorkLifeBalance) -> Dataset$WorkLifeBalance
gsub(4, 'Best',Dataset$WorkLifeBalance) -> Dataset$WorkLifeBalance
Dataset$WorkLifeBalance %>% as.factor %>% unique
## [1] Bad    Better Good   Best  
## Levels: Bad Best Better Good
#4) JobInvolvement
gsub(1, 'Low',Dataset$JobInvolvement) -> Dataset$JobInvolvement
gsub(2, 'Medium',Dataset$JobInvolvement) -> Dataset$JobInvolvement
gsub(3, 'High',Dataset$JobInvolvement) -> Dataset$JobInvolvement
gsub(4, 'Very High',Dataset$JobInvolvement) -> Dataset$JobInvolvement
Dataset$JobInvolvement %>% as.factor %>% unique
## [1] High      Medium    Very High Low      
## Levels: High Low Medium Very High
#5) EnvironmentSatisfaction
gsub(1, 'Low',Dataset$EnvironmentSatisfaction) -> Dataset$EnvironmentSatisfaction
gsub(2, 'Medium',Dataset$EnvironmentSatisfaction) -> Dataset$EnvironmentSatisfaction
gsub(3, 'High',Dataset$EnvironmentSatisfaction) -> Dataset$EnvironmentSatisfaction
gsub(4, 'Very High',Dataset$EnvironmentSatisfaction) -> Dataset$EnvironmentSatisfaction
Dataset$EnvironmentSatisfaction %>% as.factor %>% unique
## [1] Medium    High      Very High Low      
## Levels: High Low Medium Very High
#6) JobSatisfaction
gsub(1, 'Low',Dataset$JobSatisfaction) -> Dataset$JobSatisfaction
gsub(2, 'Medium',Dataset$JobSatisfaction) -> Dataset$JobSatisfaction
gsub(3, 'High',Dataset$JobSatisfaction) -> Dataset$JobSatisfaction
gsub(4, 'Very High',Dataset$JobSatisfaction) -> Dataset$JobSatisfaction
Dataset$JobSatisfaction %>% as.factor %>% unique
## [1] Very High Medium    High      Low      
## Levels: High Low Medium Very High
#7) RelationshipSatisfaction
gsub(1, 'Low',Dataset$RelationshipSatisfaction) -> Dataset$RelationshipSatisfaction
gsub(2, 'Medium',Dataset$RelationshipSatisfaction) -> Dataset$RelationshipSatisfaction
gsub(3, 'High',Dataset$RelationshipSatisfaction) -> Dataset$RelationshipSatisfaction
gsub(4, 'Very High',Dataset$RelationshipSatisfaction) -> Dataset$RelationshipSatisfaction
Dataset$RelationshipSatisfaction %>% as.factor %>% unique
## [1] Low       Very High Medium    High     
## Levels: High Low Medium Very High
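The repeated gsub() calls above work because each of these columns only contains the single digits 1-5. For reference, the same recoding can be written more compactly; a sketch of an assumed-equivalent version applied to the original numeric codes (an alternative to, not a continuation of, the block above):
# Alternative recoding sketch (assumes the original 1-5 integer codes)
library(dplyr)
satisfaction_labels <- c("Low", "Medium", "High", "Very High")
Dataset_alt <- Dataset %>%
  mutate(
    Education         = factor(Education, levels = 1:5,
                               labels = c("below College", "College", "Bachelor", "Master", "Doctor")),
    PerformanceRating = factor(PerformanceRating, levels = 1:4,
                               labels = c("Low", "Good", "Excellent", "Outstanding")),
    WorkLifeBalance   = factor(WorkLifeBalance, levels = 1:4,
                               labels = c("Bad", "Good", "Better", "Best")),
    across(c(EnvironmentSatisfaction, JobInvolvement, JobSatisfaction, RelationshipSatisfaction),
           ~ factor(.x, levels = 1:4, labels = satisfaction_labels))
  )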
# Count the character and numeric variables 
Dataset %>% diagnose() %>% dplyr::select(types) %>% table
## .
## character   numeric 
##        15        16
# Missing value check
Dataset %>% naniar::gg_miss_var()
# Save this first-pass transformed dataset
saveRDS(Dataset, "Dataset.RDS")
# Check the categorical variables
Dataset %>% diagnose_category() 
## # A tibble: 57 x 6
##    variables      levels                     N  freq ratio  rank
##    <chr>          <chr>                  <int> <int> <dbl> <int>
##  1 Attrition      No                      1470  1233 83.9      1
##  2 Attrition      Yes                     1470   237 16.1      2
##  3 BusinessTravel Travel_Rarely           1470  1043 71.0      1
##  4 BusinessTravel Travel_Frequently       1470   277 18.8      2
##  5 BusinessTravel Non-Travel              1470   150 10.2      3
##  6 Department     Research & Development  1470   961 65.4      1
##  7 Department     Sales                   1470   446 30.3      2
##  8 Department     Human Resources         1470    63  4.29     3
##  9 Education      Bachelor                1470   572 38.9      1
## 10 Education      Master                  1470   398 27.1      2
## # … with 47 more rows
# Check the numeric variables
Dataset %>% diagnose_numeric() 
## # A tibble: 16 x 10
##    variables          min    Q1     mean median     Q3   max  zero minus outlier
##    <chr>            <dbl> <dbl>    <dbl>  <dbl>  <dbl> <dbl> <int> <int>   <int>
##  1 Age                 18    30  3.69e+1    36  4.3 e1    60     0     0       0
##  2 DailyRate          102   465  8.02e+2   802  1.16e3  1499     0     0       0
##  3 DistanceFromHome     1     2  9.19e+0     7  1.4 e1    29     0     0       0
##  4 HourlyRate          30    48  6.59e+1    66  8.38e1   100     0     0       0
##  5 JobLevel             1     1  2.06e+0     2  3   e0     5     0     0       0
##  6 MonthlyIncome     1009  2911  6.50e+3  4919  8.38e3 19999     0     0     114
##  7 MonthlyRate       2094  8047  1.43e+4 14236. 2.05e4 26999     0     0       0
##  8 NumCompaniesWor…     0     1  2.69e+0     2  4   e0     9   197     0      52
##  9 PercentSalaryHi…    11    12  1.52e+1    14  1.8 e1    25     0     0       0
## 10 StockOptionLevel     0     0  7.94e-1     1  1   e0     3   631     0      85
## 11 TotalWorkingYea…     0     6  1.13e+1    10  1.5 e1    40    11     0      63
## 12 TrainingTimesLa…     0     2  2.80e+0     3  3   e0     6    54     0     238
## 13 YearsAtCompany       0     3  7.01e+0     5  9   e0    40    44     0     104
## 14 YearsInCurrentR…     0     2  4.23e+0     3  7   e0    18   244     0      21
## 15 YearsSinceLastP…     0     0  2.19e+0     1  3   e0    15   581     0     107
## 16 YearsWithCurrMa…     0     2  4.12e+0     3  7   e0    17   263     0      14
# Sort variables by outlier count (descending) and inspect them 
Dataset %>% diagnose_outlier() %>% arrange(desc(outliers_cnt))
## # A tibble: 16 x 6
##    variables    outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##    <chr>               <int>          <dbl>         <dbl>     <dbl>        <dbl>
##  1 TrainingTim…          238         16.2            4.14     2.80         2.54 
##  2 MonthlyInco…          114          7.76       18400.    6503.        5503.   
##  3 YearsSinceL…          107          7.28          11.1      2.19         1.48 
##  4 YearsAtComp…          104          7.07          23.5      7.01         5.75 
##  5 StockOption…           85          5.78           3        0.794        0.658
##  6 TotalWorkin…           63          4.29          32.6     11.3         10.3  
##  7 NumCompanie…           52          3.54           9        2.69         2.46 
##  8 YearsInCurr…           21          1.43          16        4.23         4.06 
##  9 YearsWithCu…           14          0.952         16.1      4.12         4.01 
## 10 Age                     0          0            NaN       36.9         36.9  
## 11 DailyRate               0          0            NaN      802.         802.   
## 12 DistanceFro…            0          0            NaN        9.19         9.19 
## 13 HourlyRate              0          0            NaN       65.9         65.9  
## 14 JobLevel                0          0            NaN        2.06         2.06 
## 15 MonthlyRate             0          0            NaN    14313.       14313.   
## 16 PercentSala…            0          0            NaN       15.2         15.2
# Check variables whose outlier ratio exceeds 5 (%)
Dataset %>% diagnose_outlier() %>%  filter(outliers_ratio > 5) %>% 
  mutate(rate = outliers_mean / with_mean) %>% 
  arrange(desc(rate)) %>% dplyr::select(-outliers_cnt)
## # A tibble: 5 x 6
##   variables            outliers_ratio outliers_mean with_mean without_mean  rate
##   <chr>                         <dbl>         <dbl>     <dbl>        <dbl> <dbl>
## 1 YearsSinceLastPromo…           7.28         11.1      2.19         1.48   5.09
## 2 StockOptionLevel               5.78          3        0.794        0.658  3.78
## 3 YearsAtCompany                 7.07         23.5      7.01         5.75   3.36
## 4 MonthlyIncome                  7.76      18400.    6503.        5503.     2.83
## 5 TrainingTimesLastYe…          16.2           4.14     2.80         2.54   1.48
For YearsSinceLastPromotion, StockOptionLevel, YearsAtCompany, MonthlyIncome and
 TrainingTimesLastYear, the mean of the outliers is larger than the overall mean.
When the ratio (rate) of the outlier mean to the overall mean is large, it is usually better to impute or remove those values.
 In a real work setting, however, tenure, stock option level, years since the last promotion, pay and training time
 can legitimately contain outliers, and those outliers may well influence actual attrition.
So we check the descriptive statistics of the observations that contain outliers to judge whether they need to be removed, and then plot them:
Dataset %>% dplyr::select(find_outliers(.)) %>% describe()
## # A tibble: 9 x 26
##   variable         n    na    mean      sd se_mean   IQR skewness kurtosis   p00
##   <chr>        <int> <int>   <dbl>   <dbl>   <dbl> <dbl>    <dbl>    <dbl> <dbl>
## 1 MonthlyInco…  1470     0 6.50e+3 4.71e+3 1.23e+2  5468    1.37    1.01    1009
## 2 NumCompanie…  1470     0 2.69e+0 2.50e+0 6.52e-2     3    1.03    0.0102     0
## 3 StockOption…  1470     0 7.94e-1 8.52e-1 2.22e-2     1    0.969   0.365      0
## 4 TotalWorkin…  1470     0 1.13e+1 7.78e+0 2.03e-1     9    1.12    0.918      0
## 5 TrainingTim…  1470     0 2.80e+0 1.29e+0 3.36e-2     1    0.553   0.495      0
## 6 YearsAtComp…  1470     0 7.01e+0 6.13e+0 1.60e-1     6    1.76    3.94       0
## 7 YearsInCurr…  1470     0 4.23e+0 3.62e+0 9.45e-2     5    0.917   0.477      0
## 8 YearsSinceL…  1470     0 2.19e+0 3.22e+0 8.40e-2     3    1.98    3.61       0
## 9 YearsWithCu…  1470     0 4.12e+0 3.57e+0 9.31e-2     5    0.833   0.171      0
## # … with 16 more variables: p01 <dbl>, p05 <dbl>, p10 <dbl>, p20 <dbl>,
## #   p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
## #   p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>
Dataset %>%
  plot_outlier(diagnose_outlier(Dataset) %>%
                 filter(outliers_ratio >= 0.5) %>%
                 dplyr::select(variables) %>%
                 unlist())
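If we chose to treat some of these outliers instead of keeping them, dlookr can also impute them. A sketch using the capping method (whether to do this at all is left open above, so this is purely illustrative):
# Illustrative only: cap MonthlyIncome outliers with dlookr's imputate_outlier()
MonthlyIncome_capped <- imputate_outlier(Dataset, MonthlyIncome, method = "capping")
summary(MonthlyIncome_capped)   # compares the original and imputed distributions
# Dataset <- Dataset %>% mutate(MonthlyIncome = as.numeric(MonthlyIncome_capped))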
# With dlookr, the single line below produces a full diagnostic report.
# Dataset %>% diagnose_web_report()
Dataset %>% dplyr::select(MonthlyIncome) %>% plot_box_numeric()
Dataset %>% dplyr::select(-MonthlyIncome) -> Dataset
# Keep only the numeric variables and compute multivariate outliers.
# The cutoff quantile is set to .99.
Dataset %>% purrr::keep(is.numeric) -> outcheck_num
outcheck_num %>% chemometrics::Moutlier(quantile=.99)-> Mout 
# Join the Mahalanobis distances (md) back onto the original dataset
Dataset %>% mutate(md=Mout$md)->Dataset
# Check the observations above the cutoff value (6.015885)
Dataset %>% filter(md>Mout$cutoff) %>% nrow
## [1] 68
Let's check how the 68 observations above Mout$cutoff break down by Attrition.
# Original dataset Attrition ratio
Dataset %>% dplyr::select(Attrition) %>% diagnose_category()
## # A tibble: 2 x 6
##   variables levels     N  freq ratio  rank
##   <chr>     <chr>  <int> <int> <dbl> <int>
## 1 Attrition No      1470  1233  83.9     1
## 2 Attrition Yes     1470   237  16.1     2
# Multivariate Outlier Dataset Attrition ratio
Dataset %>% filter(md>Mout$cutoff) %>% dplyr::select(Attrition) %>% diagnose_category()
## # A tibble: 2 x 6
##   variables levels     N  freq ratio  rank
##   <chr>     <chr>  <int> <int> <dbl> <int>
## 1 Attrition No        68    59  86.8     1
## 2 Attrition Yes       68     9  13.2     2
# Found 68 multivariate outlier observations; remove them and drop the md column
Dataset %>% filter(md<Mout$cutoff) -> Dataset
Dataset %>% dplyr::select(-md) -> Dataset
# Save the first-pass cleaned dataset again
saveRDS(Dataset,"Dataset_pre.RDS")
# Exploratory analysis on the cleaned dataset with the ExPanDaR package
# It shows correlations, scatterplots, etc., and runs in the browser.
# Dataset %>% ExPanD()
In actual exploratory work I use packages such as ExPanDaR to examine the relationships between variables from many different angles.
This completes the preprocessing and EDA of the data.
# Data Import
Dataset <- readRDS("Dataset_pre.RDS")
# split the Data set and set the reference level
Dataset  %>% mutate_if(is.character, factor)->Data_glm
# Setting Reference level
Data_glm$Attrition <- relevel(Data_glm$Attrition, ref = "Yes")
Data_glm$Attrition %>% levels
## [1] "Yes" "No"
set.seed(2727)
split <- initial_split(Data_glm, prop = .7, strata = Attrition)
glm_train <- training(split)
glm_test <- testing(split)
glm_train %>% nrow
## [1] 982
glm_test %>% nrow
## [1] 420
Using a recipe, we run a multicollinearity check, dummy coding, and normalization.
Looking at the trained recipe, no variables were removed for multicollinearity,
 and the dummy coding and normalization were applied as expected.
# pre-processing by recipe
glm_train %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_predictors()) %>% prep()-> glm_recipe
glm_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         29
## 
## Training data contained 982 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for Age, DailyRate, DistanceFromHome, ... [trained]
## Dummy variables from BusinessTravel, Department, Education, ... [trained]
## Correlation filter removed no terms [trained]
# Extract the processed training set
glm_recipe %>% juice -> glm_train_re
# Bake the test set with the same recipe
glm_recipe %>% bake(glm_test) -> glm_test_re
# Model Setting
glm_model <- logistic_reg() %>% 
  set_engine('glm') %>% 
  set_mode('classification')
glm_model
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
# Fitting Logistic Regression
glm_fit <- glm_model %>% fit(Attrition ~., data=glm_train_re)
# Examine the factors that influence Attrition (significant odds ratios only) 
tidy(glm_fit, exponentiate=T) %>% filter(p.value<.05)
## # A tibble: 18 x 5
##    term                             estimate std.error statistic  p.value
##    <chr>                               <dbl>     <dbl>     <dbl>    <dbl>
##  1 Age                                 1.42      0.155      2.25 2.47e- 2
##  2 DailyRate                           1.26      0.116      1.96 4.97e- 2
##  3 DistanceFromHome                    0.688     0.111     -3.36 7.69e- 4
##  4 NumCompaniesWorked                  0.642     0.131     -3.37 7.57e- 4
##  5 YearsInCurrentRole                  1.72      0.234      2.32 2.01e- 2
##  6 YearsSinceLastPromotion             0.535     0.168     -3.72 2.02e- 4
##  7 BusinessTravel_Travel_Frequently    0.134     0.535     -3.76 1.67e- 4
##  8 BusinessTravel_Travel_Rarely        0.369     0.489     -2.04 4.14e- 2
##  9 EnvironmentSatisfaction_Low         0.285     0.325     -3.86 1.11e- 4
## 10 Gender_Male                         0.541     0.249     -2.47 1.34e- 2
## 11 JobInvolvement_Low                  0.203     0.417     -3.82 1.33e- 4
## 12 JobRole_Laboratory.Technician       0.253     0.618     -2.22 2.62e- 2
## 13 JobSatisfaction_Very.High           2.25      0.311      2.60 9.24e- 3
## 14 MaritalStatus_Single                0.310     0.441     -2.66 7.85e- 3
## 15 OverTime_Yes                        0.111     0.264     -8.33 7.77e-17
## 16 RelationshipSatisfaction_Low        0.438     0.325     -2.53 1.13e- 2
## 17 WorkLifeBalance_Better              5.11      0.440      3.71 2.07e- 4
## 18 WorkLifeBalance_Good                2.59      0.463      2.06 3.98e- 2
# Model Prediction
pre_class <- glm_fit %>% predict(new_data=glm_test_re, type="class")
pre_class %>% head
## # A tibble: 6 x 1
##   .pred_class
##   <fct>      
## 1 Yes        
## 2 No         
## 3 No         
## 4 Yes        
## 5 No         
## 6 No
pre_prob <- glm_fit %>% predict(new_data=glm_test_re, type="prob")
pre_prob %>% head
## # A tibble: 6 x 2
##   .pred_Yes .pred_No
##       <dbl>    <dbl>
## 1   0.674      0.326
## 2   0.333      0.667
## 3   0.0371     0.963
## 4   0.800      0.200
## 5   0.00743    0.993
## 6   0.0659     0.934
evaluation_tbl <- glm_test_re %>% 
  dplyr::select(Attrition) %>% bind_cols(pre_class) %>% 
  bind_cols(pre_prob)
evaluation_tbl
## # A tibble: 420 x 4
##    Attrition .pred_class     .pred_Yes .pred_No
##    <fct>     <fct>               <dbl>    <dbl>
##  1 Yes       Yes         0.674            0.326
##  2 Yes       No          0.333            0.667
##  3 No        No          0.0371           0.963
##  4 Yes       Yes         0.800            0.200
##  5 No        No          0.00743          0.993
##  6 No        No          0.0659           0.934
##  7 No        No          0.000470         1.00 
##  8 Yes       No          0.214            0.786
##  9 No        No          0.00000000245    1.00 
## 10 Yes       No          0.100            0.900
## # … with 410 more rows
# Evaluation
conf_mat(evaluation_tbl, truth = Attrition, estimate = .pred_class)
##           Truth
## Prediction Yes  No
##        Yes  33  14
##        No   35 338
conf_mat(evaluation_tbl, truth = Attrition, estimate = .pred_class) %>% summary
## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.883
##  2 kap                  binary         0.509
##  3 sens                 binary         0.485
##  4 spec                 binary         0.960
##  5 ppv                  binary         0.702
##  6 npv                  binary         0.906
##  7 mcc                  binary         0.521
##  8 j_index              binary         0.446
##  9 bal_accuracy         binary         0.723
## 10 detection_prevalence binary         0.112
## 11 precision            binary         0.702
## 12 recall               binary         0.485
## 13 f_meas               binary         0.574
roc_auc(evaluation_tbl, truth = Attrition, .pred_Yes)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.887
evaluation_tbl %>% roc_curve(truth=Attrition, .pred_Yes) %>% autoplot()
# Oversample the minority class in the training set with SMOTE-NC
glm_train %>% as.data.frame %>% SMOTE_NC('Attrition')->glm_train_SMOTE
# Class ratio before and after oversampling
glm_train %>% dplyr::select(Attrition)  %>% diagnose_category()
## # A tibble: 2 x 6
##   variables levels     N  freq ratio  rank
##   <chr>     <chr>  <int> <int> <dbl> <int>
## 1 Attrition No       982   822  83.7     1
## 2 Attrition Yes      982   160  16.3     2
glm_train_SMOTE %>% dplyr::select(Attrition)  %>% diagnose_category()
## # A tibble: 2 x 6
##   variables levels     N  freq ratio  rank
##   <chr>     <chr>  <int> <int> <dbl> <int>
## 1 Attrition No      1644   822    50     1
## 2 Attrition Yes     1644   822    50     1
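For reference, SMOTE_NC() here is a SMOTE variant that handles nominal and continuous features together; the same oversampling could also be folded into the recipe itself with the themis package. A sketch under that assumption (not the approach used above):
# Alternative sketch: oversample inside the recipe with themis::step_smote()
library(themis)
glm_recipe_smote <- recipe(Attrition ~ ., data = glm_train) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_smote(Attrition) %>%   # balances the classes; applied to training data only
  prep()
# step_smote() is skipped when baking new data, so the test set stays untouched
glm_train_smote_re <- bake(glm_recipe_smote, new_data = NULL)
glm_test_smote_re  <- bake(glm_recipe_smote, new_data = glm_test)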
# pre-processing by recipe
glm_train_SMOTE %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_predictors()) %>% prep()-> glm_recipe
glm_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         29
## 
## Training data contained 1644 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for Age, DailyRate, DistanceFromHome, ... [trained]
## Dummy variables from BusinessTravel, Department, Education, ... [trained]
## Correlation filter removed Department_Research...Development [trained]
# Extract the processed (oversampled) training set
glm_recipe %>% juice -> glm_train_re
# Bake the test set with the same recipe
glm_recipe %>% bake(glm_test) -> glm_test_re
# Model Setting
glm_model <- logistic_reg() %>% 
  set_engine('glm') %>% 
  set_mode('classification')
glm_model
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
# Fitting Logistic Regression
glm_fit <- glm_model %>% fit(Attrition ~., data=glm_train_re)
tidy(glm_fit, exponentiate=T) %>% filter(p.value<.05)
## # A tibble: 32 x 5
##    term                    estimate std.error statistic      p.value
##    <chr>                      <dbl>     <dbl>     <dbl>        <dbl>
##  1 (Intercept)              190.       0.974       5.39 0.0000000706
##  2 Age                        1.31     0.0936      2.88 0.00403     
##  3 DailyRate                  1.42     0.0786      4.48 0.00000758  
##  4 DistanceFromHome           0.677    0.0795     -4.91 0.000000930 
##  5 NumCompaniesWorked         0.687    0.0852     -4.40 0.0000106   
##  6 PercentSalaryHike          0.816    0.0957     -2.12 0.0336      
##  7 TotalWorkingYears          1.42     0.136       2.57 0.0102      
##  8 TrainingTimesLastYear      1.18     0.0759      2.21 0.0273      
##  9 YearsInCurrentRole         1.56     0.137       3.24 0.00121     
## 10 YearsSinceLastPromotion    0.576    0.103      -5.37 0.0000000800
## # … with 22 more rows
# Model Prediction
pre_class2 <- glm_fit %>% predict(new_data=glm_test_re, type="class")
pre_class2 %>% head
## # A tibble: 6 x 1
##   .pred_class
##   <fct>      
## 1 Yes        
## 2 Yes        
## 3 No         
## 4 Yes        
## 5 No         
## 6 No
pre_prob2 <- glm_fit %>% predict(new_data=glm_test_re, type="prob")
pre_prob2 %>% head
## # A tibble: 6 x 2
##   .pred_Yes .pred_No
##       <dbl>    <dbl>
## 1   0.581     0.419 
## 2   0.645     0.355 
## 3   0.102     0.898 
## 4   0.985     0.0154
## 5   0.00205   0.998 
## 6   0.00592   0.994
evaluation_tbl2 <- glm_test_re %>% 
  dplyr::select(Attrition) %>% bind_cols(pre_class2) %>% 
  bind_cols(pre_prob2)
evaluation_tbl2
## # A tibble: 420 x 4
##    Attrition .pred_class    .pred_Yes .pred_No
##    <fct>     <fct>              <dbl>    <dbl>
##  1 Yes       Yes         0.581          0.419 
##  2 Yes       Yes         0.645          0.355 
##  3 No        No          0.102          0.898 
##  4 Yes       Yes         0.985          0.0154
##  5 No        No          0.00205        0.998 
##  6 No        No          0.00592        0.994 
##  7 No        No          0.00132        0.999 
##  8 Yes       No          0.372          0.628 
##  9 No        No          0.0000000108   1.00  
## 10 Yes       No          0.499          0.501 
## # … with 410 more rows
# Evaluation
conf_mat(evaluation_tbl2, truth = Attrition, estimate = .pred_class)
##           Truth
## Prediction Yes  No
##        Yes  46  65
##        No   22 287
conf_mat(evaluation_tbl2, truth = Attrition, estimate = .pred_class) %>% summary
## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.793
##  2 kap                  binary         0.392
##  3 sens                 binary         0.676
##  4 spec                 binary         0.815
##  5 ppv                  binary         0.414
##  6 npv                  binary         0.929
##  7 mcc                  binary         0.411
##  8 j_index              binary         0.492
##  9 bal_accuracy         binary         0.746
## 10 detection_prevalence binary         0.264
## 11 precision            binary         0.414
## 12 recall               binary         0.676
## 13 f_meas               binary         0.514
roc_auc(evaluation_tbl2, truth = Attrition, .pred_Yes)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.830
evaluation_tbl2 %>% roc_curve(truth=Attrition, .pred_Yes) %>% autoplot()
# Data Import
Dataset <- readRDS("Dataset_pre.RDS")
# split the Data set and set the reference level
Dataset  %>% mutate_if(is.character, factor)->Data_glm
# Setting Reference level
Data_glm$Attrition <- relevel(Data_glm$Attrition, ref = "Yes")
Data_glm$Attrition %>% levels
## [1] "Yes" "No"
set.seed(2727)
split <- initial_split(Data_glm, prop = .7, strata = Attrition)
glm_train <- training(split)
glm_test <- testing(split)
glm_train %>% nrow
## [1] 982
glm_test %>% nrow
## [1] 420
# pre-processing by recipe
glm_train %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_predictors()) %>% prep()-> glm_recipe
glm_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         29
## 
## Training data contained 982 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for Age, DailyRate, DistanceFromHome, ... [trained]
## Dummy variables from BusinessTravel, Department, Education, ... [trained]
## Correlation filter removed no terms [trained]
# Extract the processed training set
glm_recipe %>% juice -> glm_train_re
# Bake the test set with the same recipe
glm_recipe %>% bake(glm_test) -> glm_test_re
# Model Improvement: backward stepwise selection by AIC
glm(Attrition~., family = 'binomial', data=glm_train_re) %>% 
  MASS::stepAIC(direction = "backward") -> step_glm
glm_fit_mod <- glm_model %>% fit(step_glm$formula, data=glm_train_re)
# Improved Model Prediction
pre_class_re <-  glm_fit_mod %>% predict(new_data=glm_test_re, type="class")
pre_class_re %>% head
pre_prob_re <-  glm_fit_mod %>% predict(new_data=glm_test_re, type="prob")
pre_prob_re %>% head
evaluation_tbl_mod <- glm_test_re %>% 
  dplyr::select(Attrition) %>% bind_cols(pre_class_re) %>% 
  bind_cols(pre_prob_re)
evaluation_tbl_mod
# Improved Model Evaluation
conf_mat(evaluation_tbl_mod, truth = Attrition, estimate = .pred_class)
##           Truth
## Prediction Yes  No
##        Yes  31  14
##        No   37 338
conf_mat(evaluation_tbl_mod, truth = Attrition, estimate = .pred_class) %>% summary
## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.879
##  2 kap                  binary         0.482
##  3 sens                 binary         0.456
##  4 spec                 binary         0.960
##  5 ppv                  binary         0.689
##  6 npv                  binary         0.901
##  7 mcc                  binary         0.496
##  8 j_index              binary         0.416
##  9 bal_accuracy         binary         0.708
## 10 detection_prevalence binary         0.107
## 11 precision            binary         0.689
## 12 recall               binary         0.456
## 13 f_meas               binary         0.549
roc_auc(evaluation_tbl_mod, truth = Attrition, .pred_Yes)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.880
evaluation_tbl_mod %>% roc_curve(truth=Attrition, .pred_Yes) %>% autoplot()
vip::vi(glm_fit_mod)
## # A tibble: 26 x 3
##    Variable                         Importance Sign 
##    <chr>                                 <dbl> <chr>
##  1 OverTime_Yes                           8.47 NEG  
##  2 EnvironmentSatisfaction_Low            5.00 NEG  
##  3 MaritalStatus_Single                   4.62 NEG  
##  4 BusinessTravel_Travel_Frequently       3.86 NEG  
##  5 YearsSinceLastPromotion                3.77 NEG  
##  6 NumCompaniesWorked                     3.77 NEG  
##  7 JobInvolvement_Low                     3.71 NEG  
##  8 EducationField_Life.Sciences           3.65 POS  
##  9 WorkLifeBalance_Better                 3.61 POS  
## 10 EducationField_Medical                 3.44 POS  
## # … with 16 more rows
vip(glm_fit_mod)
Dataset <- readRDS("Dataset_pre.RDS")
# split the Data set and set the reference level
Dataset  %>% mutate_if(is.character, factor)->Data_rnd
# Setting Reference level
Data_rnd$Attrition <- relevel(Data_rnd$Attrition, ref = "Yes")
Data_rnd$Attrition %>% levels
## [1] "Yes" "No"
set.seed(2727)
split <- initial_split(Data_rnd, prop = .7, strata = Attrition)
rnd_train <- training(split)
rnd_test <- testing(split)
rnd_train %>% as.data.frame %>% SMOTE_NC('Attrition')->rnd_train_SMOTE
# pre-processing by recipe
rnd_train_SMOTE %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_predictors()) %>% prep()-> rnd_recipe
rnd_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         29
## 
## Training data contained 1644 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for Age, DailyRate, DistanceFromHome, ... [trained]
## Dummy variables from BusinessTravel, Department, Education, ... [trained]
## Correlation filter removed Department_Research...Development [trained]
# Extract the processed (oversampled) training set
rnd_recipe %>% juice -> rnd_train_re
# Bake the test set with the same recipe
rnd_recipe %>% bake(rnd_test) -> rnd_test_re
# Make the cross-validation folds
set.seed(2727)
data_fold <- vfold_cv(rnd_train_re)
# Hyperparameter tuning: only mtry and min_n are tuned; my machine has 8 cores,
# so num.threads is set to 6 for parallel processing
tune_spec <- rand_forest(mtry=tune(), trees = 1000, min_n = tune()) %>% 
  set_mode("classification") %>% set_engine('ranger', importance='impurity',seed=2727, num.threads=6)
tune_spec
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   importance = impurity
##   seed = 2727
##   num.threads = 6
## 
## Computational engine: ranger
workflow() %>%
  add_model(tune_spec) %>% 
  add_formula(Attrition ~ .)-> workflow
workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: rand_forest()
## 
## ─ Preprocessor ────────────────────────────────
## Attrition ~ .
## 
## ─ Model ────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   importance = impurity
##   seed = 2727
##   num.threads = 6
## 
## Computational engine: ranger
rnd_model<- workflow %>% 
  tune_grid(data_fold, 
          grid=20, 
          control=control_grid(save_pred = TRUE), 
          metrics=metric_set(roc_auc, sens, spec, recall, accuracy, precision, f_meas))
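Here grid = 20 lets tune_grid() pick 20 mtry/min_n candidates automatically; an explicit grid could be passed instead. A sketch (the ranges and levels below are assumptions, not the values searched above):
# Explicit tuning grid sketch
rf_grid <- dials::grid_regular(
  dials::mtry(range = c(5L, 30L)),
  dials::min_n(range = c(2L, 40L)),
  levels = 5                       # 5 x 5 = 25 candidate combinations
)
# workflow %>% tune_grid(data_fold, grid = rf_grid, control = control_grid(save_pred = TRUE))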
 
# Graph for hyperparameter tuning
rnd_model %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  dplyr::select(mean, min_n, mtry) %>%
  pivot_longer(min_n:mtry,
               values_to = "value",
               names_to = "parameter") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC")
rnd_model %>%
  collect_metrics()
## # A tibble: 140 x 8
##     mtry min_n .metric   .estimator  mean     n std_err .config              
##    <int> <int> <chr>     <chr>      <dbl> <int>   <dbl> <chr>                
##  1    46    18 accuracy  binary     0.886    10 0.00767 Preprocessor1_Model01
##  2    46    18 f_meas    binary     0.880    10 0.00862 Preprocessor1_Model01
##  3    46    18 precision binary     0.918    10 0.0131  Preprocessor1_Model01
##  4    46    18 recall    binary     0.848    10 0.0160  Preprocessor1_Model01
##  5    46    18 roc_auc   binary     0.956    10 0.00392 Preprocessor1_Model01
##  6    46    18 sens      binary     0.848    10 0.0160  Preprocessor1_Model01
##  7    46    18 spec      binary     0.926    10 0.0108  Preprocessor1_Model01
##  8    20    29 accuracy  binary     0.894    10 0.00712 Preprocessor1_Model02
##  9    20    29 f_meas    binary     0.888    10 0.00873 Preprocessor1_Model02
## 10    20    29 precision binary     0.928    10 0.0115  Preprocessor1_Model02
## # … with 130 more rows
# For logistic regression we focused on recall, i.e., how accurately leavers are predicted;
# from the random forest on we select by AUC, which balances recall with specificity (how well stayers are identified)
rnd_model %>% select_best('roc_auc') -> param_best
tune_spec %>% finalize_model(param_best) -> rnd_best_model
# Update the workflow with the best parameters
  workflow %>% finalize_workflow(param_best) -> workflow_final
  workflow_final %>% last_fit(split, metrics=metric_set(roc_auc, sens, spec, recall, accuracy, precision, f_meas))->rnd_best_fit2
  
  rnd_best_fit2 %>% collect_predictions() %>%   
    conf_mat(truth = Attrition, estimate=.pred_class)
##           Truth
## Prediction Yes  No
##        Yes  13  12
##        No   55 340
  rnd_best_fit2 %>% collect_predictions() %>% roc_curve(truth=Attrition, estimate=.pred_Yes) %>% autoplot()
# Fit the finalized workflow on the full dataset and inspect variable importance
deploy_randf <- fit(workflow_final, Data_glm)
pull_workflow_fit(deploy_randf)$fit %>% vip::vi()
## # A tibble: 29 x 2
##    Variable           Importance
##    <chr>                   <dbl>
##  1 Age                      28.4
##  2 OverTime                 27.0
##  3 DailyRate                24.0
##  4 TotalWorkingYears        22.3
##  5 DistanceFromHome         22.1
##  6 HourlyRate               21.1
##  7 MonthlyRate              19.9
##  8 YearsAtCompany           14.1
##  9 NumCompaniesWorked       13.4
## 10 PercentSalaryHike        13.2
## # … with 19 more rows
# Re-run the random forest using only the variables with an importance of 10 or more. 
Dataset <- readRDS("Dataset_pre.RDS")
# split the Data set and set the reference level
Dataset  %>% mutate_if(is.character, factor)->Data_rnd
# Setting Reference level
Data_rnd$Attrition <- relevel(Data_rnd$Attrition, ref = "Yes")
Data_rnd$Attrition %>% levels
## [1] "Yes" "No"
# pre-processing by recipe
Dataset %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_predictors()) %>% prep()-> rnd_recipe_re
rnd_recipe_re %>% juice-> rnd_dataset
set.seed(2727)
split_re <- initial_split(rnd_dataset, prop = .7, strata = Attrition)
rnd_train <- training(split_re)
rnd_test <- testing(split_re)
rnd_train %>% as.data.frame %>% SMOTE('Attrition')->rnd_train_SMOTE
rnd_train %>% dplyr::select(Attrition, Age , DailyRate , DistanceFromHome , NumCompaniesWorked , 
    TotalWorkingYears , TrainingTimesLastYear , YearsInCurrentRole , 
    YearsSinceLastPromotion , BusinessTravel_Travel_Frequently , 
    BusinessTravel_Travel_Rarely , EducationField_Life.Sciences , 
    EducationField_Medical , EducationField_Other , EnvironmentSatisfaction_Low , 
    Gender_Male , JobInvolvement_Low , JobInvolvement_Very.High , 
    JobRole_Laboratory.Technician , JobRole_Research.Director , 
    JobRole_Sales.Representative , JobSatisfaction_Low , JobSatisfaction_Very.High , 
    MaritalStatus_Single , OverTime_Yes , RelationshipSatisfaction_Low , 
    WorkLifeBalance_Better)-> rnd_train_re_sel
rnd_test  %>% dplyr::select(Attrition, Age , DailyRate , DistanceFromHome , NumCompaniesWorked , 
    TotalWorkingYears , TrainingTimesLastYear , YearsInCurrentRole , 
    YearsSinceLastPromotion , BusinessTravel_Travel_Frequently , 
    BusinessTravel_Travel_Rarely , EducationField_Life.Sciences , 
    EducationField_Medical , EducationField_Other , EnvironmentSatisfaction_Low , 
    Gender_Male , JobInvolvement_Low , JobInvolvement_Very.High , 
    JobRole_Laboratory.Technician , JobRole_Research.Director , 
    JobRole_Sales.Representative , JobSatisfaction_Low , JobSatisfaction_Very.High , 
    MaritalStatus_Single , OverTime_Yes , RelationshipSatisfaction_Low , 
    WorkLifeBalance_Better)-> rnd_test_re_sel
# Validate data again
data_fold2 <- vfold_cv(rnd_train_re_sel)
# Workflow setting
workflow() %>%
  add_model(tune_spec) %>% 
  add_formula(step_glm$formula)-> workflow2
workflow2
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: rand_forest()
## 
## ─ Preprocessor ────────────────────────────────
## Attrition ~ Age + DailyRate + DistanceFromHome + NumCompaniesWorked + 
##     TotalWorkingYears + TrainingTimesLastYear + YearsInCurrentRole + 
##     YearsSinceLastPromotion + BusinessTravel_Travel_Frequently + 
##     BusinessTravel_Travel_Rarely + EducationField_Life.Sciences + 
##     EducationField_Medical + EducationField_Other + EnvironmentSatisfaction_Low + 
##     Gender_Male + JobInvolvement_Low + JobInvolvement_Very.High + 
##     JobRole_Laboratory.Technician + JobRole_Research.Director + 
##     JobRole_Sales.Representative + JobSatisfaction_Low + JobSatisfaction_Very.High + 
##     MaritalStatus_Single + OverTime_Yes + RelationshipSatisfaction_Low + 
##     WorkLifeBalance_Better
## 
## ─ Model ────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   importance = impurity
##   seed = 2727
##   num.threads = 6
## 
## Computational engine: ranger
# hyperparameter tune
rnd_model_mod<- workflow2 %>% 
  tune_grid(data_fold2, 
          grid=20, 
          control=control_grid(save_pred = TRUE), 
          metrics=metric_set(roc_auc, sens, spec, recall, accuracy, precision, f_meas))
 
# Graph for hyperparameter tuning
rnd_model_mod %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  dplyr::select(mean, min_n, mtry) %>%
  pivot_longer(min_n:mtry,
               values_to = "value",
               names_to = "parameter") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC")
rnd_model_mod %>%
  collect_metrics()
## # A tibble: 140 x 8
##     mtry min_n .metric   .estimator  mean     n std_err .config              
##    <int> <int> <chr>     <chr>      <dbl> <int>   <dbl> <chr>                
##  1     5    12 accuracy  binary     0.857    10 0.0115  Preprocessor1_Model01
##  2     5    12 f_meas    binary     0.921    10 0.00716 Preprocessor1_Model01
##  3     5    12 precision binary     0.857    10 0.0126  Preprocessor1_Model01
##  4     5    12 recall    binary     0.995    10 0.00195 Preprocessor1_Model01
##  5     5    12 roc_auc   binary     0.804    10 0.0183  Preprocessor1_Model01
##  6     5    12 sens      binary     0.995    10 0.00195 Preprocessor1_Model01
##  7     5    12 spec      binary     0.149    10 0.00974 Preprocessor1_Model01
##  8     5    37 accuracy  binary     0.852    10 0.0132  Preprocessor1_Model02
##  9     5    37 f_meas    binary     0.918    10 0.00799 Preprocessor1_Model02
## 10     5    37 precision binary     0.851    10 0.0136  Preprocessor1_Model02
## # … with 130 more rows
# As before, select the best hyperparameters by AUC,
# which balances recall (leavers) with specificity (stayers)
rnd_model_mod %>% select_best('roc_auc') -> param_best_mod
tune_spec %>% finalize_model(param_best_mod) -> rnd_best_model
# workflow update  
  workflow2 %>% finalize_workflow(param_best_mod) -> workflow_final2
  workflow_final2 %>% last_fit(split_re, metrics=metric_set(roc_auc, sens, spec, recall, accuracy, precision, f_meas))->rnd_best_fit3
  
  rnd_best_fit3 %>% collect_predictions() %>%   
    conf_mat(truth = Attrition, estimate=.pred_class)
##           Truth
## Prediction  No Yes
##        No  351  65
##        Yes   1   3
  rnd_best_fit3 %>% collect_metrics() %>% arrange(desc(.estimate))
## # A tibble: 7 x 4
##   .metric   .estimator .estimate .config             
##   <chr>     <chr>          <dbl> <chr>               
## 1 sens      binary        0.997  Preprocessor1_Model1
## 2 recall    binary        0.997  Preprocessor1_Model1
## 3 f_meas    binary        0.914  Preprocessor1_Model1
## 4 precision binary        0.844  Preprocessor1_Model1
## 5 accuracy  binary        0.843  Preprocessor1_Model1
## 6 roc_auc   binary        0.834  Preprocessor1_Model1
## 7 spec      binary        0.0441 Preprocessor1_Model1
  # model deploy
  deploy_randf <- fit(workflow_final, Data_glm)
  deploy_randf
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: rand_forest()
## 
## ─ Preprocessor ────────────────────────────────
## Attrition ~ .
## 
## ─ Model ────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~13L,      x), num.trees = ~1000, min.node.size = min_rows(~5L, x),      importance = ~"impurity", seed = ~2727, num.threads = ~6,      verbose = FALSE, probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1000 
## Sample size:                      1402 
## Number of independent variables:  29 
## Mtry:                             13 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1090518
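With the workflow refitted on the full dataset, the deployed model can score current employees for the retention use cases described at the beginning. A minimal sketch (ranking employees into a "risk list" is my illustration, not an output of the original report):
# Score all employees with the deployed random forest and rank by predicted attrition risk
risk_scores <- predict(deploy_randf, new_data = Data_glm, type = "prob") %>%
  bind_cols(Data_glm %>% dplyr::select(Department, JobRole, Attrition)) %>%
  arrange(desc(.pred_Yes))
risk_scores %>% head(10)   # ten highest-risk employees for retention follow-up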
h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 7 minutes 
##     H2O cluster timezone:       Asia/Seoul 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.32.1.3 
##     H2O cluster version age:    3 months and 6 days  
##     H2O cluster name:           H2O_started_from_R_raymondkim_eou682 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.20 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.5 (2021-03-31)
Dataset <- readRDS("Dataset_pre.RDS")
Dataset  %>% mutate_if(is.character, factor)->Data_auto
# Setting Reference level
Data_auto$Attrition <- relevel(Data_auto$Attrition, ref = "Yes")
Data_auto %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_numeric()) %>% prep()-> h2o_recipe
h2o_recipe %>% juice -> Dataset_h2o
# Putting the original dataframe into an h2o format
Dataset_h2o %>% as.h2o(destination_frame = "h2o_df")->h2o_df
# Splitting into training, validation and testing sets
split_df <- h2o.splitFrame(h2o_df, c(0.7, 0.15), seed=12)
# Obtaining our three types of sets into three separate values
h2o_train <- h2o.assign(split_df[[1]], "train")
h2o_validation <- h2o.assign(split_df[[2]], "validation")
h2o_test <- h2o.assign(split_df[[3]], "test")
h2o.describe(h2o_train)
##                                 Label Type Missing Zeros PosInf NegInf
## 1                                 Age real       0     0      0      0
## 2                           DailyRate real       0     0      0      0
## 3                    DistanceFromHome real       0     0      0      0
## 4                          HourlyRate real       0     0      0      0
## 5                            JobLevel real       0     0      0      0
## 6                         MonthlyRate real       0     0      0      0
## 7                  NumCompaniesWorked real       0     0      0      0
## 8                   PercentSalaryHike real       0     0      0      0
## 9                    StockOptionLevel real       0     0      0      0
## 10                  TotalWorkingYears real       0     0      0      0
## 11              TrainingTimesLastYear real       0     0      0      0
## 12                     YearsAtCompany real       0     0      0      0
## 13                 YearsInCurrentRole real       0     0      0      0
## 14            YearsSinceLastPromotion real       0     0      0      0
## 15               YearsWithCurrManager real       0     0      0      0
## 16                          Attrition enum       0   838      0      0
## 17   BusinessTravel_Travel_Frequently  int       0   825      0      0
## 18       BusinessTravel_Travel_Rarely  int       0   283      0      0
## 19                   Department_Sales  int       0   681      0      0
## 20            Education_below.College  int       0   879      0      0
## 21                  Education_College  int       0   785      0      0
## 22                   Education_Doctor  int       0   958      0      0
## 23                   Education_Master  int       0   739      0      0
## 24       EducationField_Life.Sciences  int       0   577      0      0
## 25           EducationField_Marketing  int       0   889      0      0
## 26             EducationField_Medical  int       0   678      0      0
## 27               EducationField_Other  int       0   940      0      0
## 28    EducationField_Technical.Degree  int       0   904      0      0
## 29        EnvironmentSatisfaction_Low  int       0   801      0      0
## 30     EnvironmentSatisfaction_Medium  int       0   802      0      0
## 31  EnvironmentSatisfaction_Very.High  int       0   687      0      0
## 32                        Gender_Male  int       0   402      0      0
## 33                 JobInvolvement_Low  int       0   945      0      0
## 34              JobInvolvement_Medium  int       0   736      0      0
## 35           JobInvolvement_Very.High  int       0   893      0      0
## 36            JobRole_Human.Resources  int       0   960      0      0
## 37      JobRole_Laboratory.Technician  int       0   811      0      0
## 38                    JobRole_Manager  int       0   941      0      0
## 39     JobRole_Manufacturing.Director  int       0   885      0      0
## 40          JobRole_Research.Director  int       0   944      0      0
## 41         JobRole_Research.Scientist  int       0   795      0      0
## 42            JobRole_Sales.Executive  int       0   762      0      0
## 43       JobRole_Sales.Representative  int       0   930      0      0
## 44                JobSatisfaction_Low  int       0   809      0      0
## 45             JobSatisfaction_Medium  int       0   793      0      0
## 46          JobSatisfaction_Very.High  int       0   679      0      0
## 47              MaritalStatus_Married  int       0   539      0      0
## 48               MaritalStatus_Single  int       0   672      0      0
## 49                       OverTime_Yes  int       0   719      0      0
## 50      PerformanceRating_Outstanding  int       0   847      0      0
## 51       RelationshipSatisfaction_Low  int       0   799      0      0
## 52    RelationshipSatisfaction_Medium  int       0   783      0      0
## 53 RelationshipSatisfaction_Very.High  int       0   708      0      0
## 54               WorkLifeBalance_Best  int       0   886      0      0
## 55             WorkLifeBalance_Better  int       0   395      0      0
## 56               WorkLifeBalance_Good  int       0   765      0      0
##           Min      Max         Mean     Sigma Cardinality
## 1  -2.0784303 2.685350  0.015732536 1.0072595          NA
## 2  -1.7519591 1.722043 -0.006184071 0.9943607          NA
## 3  -1.0113326 2.484078  0.001687752 1.0064210          NA
## 4  -1.7707175 1.676989 -0.013879977 0.9941769          NA
## 5  -0.9502098 2.905634 -0.009546925 0.9905219          NA
## 6  -1.7264474 1.799486 -0.006659469 1.0059270          NA
## 7  -1.0754126 2.542171  0.007798073 1.0042799          NA
## 8  -1.1608358 2.710195 -0.012502504 1.0029059          NA
## 9  -0.9287806 2.612880 -0.029990700 0.9763015          NA
## 10 -1.5239202 3.707345 -0.004939040 1.0006931          NA
## 11 -2.1847628 2.498781 -0.010428912 0.9911376          NA
## 12 -1.2771588 3.788408 -0.009389695 0.9921338          NA
## 13 -1.1788794 3.826003 -0.012229704 0.9933579          NA
## 14 -0.6928449 4.421544 -0.044992894 0.9661122          NA
## 15 -1.1627758 3.583973 -0.005964821 0.9948860          NA
## 16  0.0000000 1.000000  0.156092649 0.3631260           2
## 17  0.0000000 1.000000  0.169184290 0.3751035          NA
## 18  0.0000000 1.000000  0.715005035 0.4516395          NA
## 19  0.0000000 1.000000  0.314199396 0.4644301          NA
## 20  0.0000000 1.000000  0.114803625 0.3189454          NA
## 21  0.0000000 1.000000  0.209466264 0.4071327          NA
## 22  0.0000000 1.000000  0.035246727 0.1844957          NA
## 23  0.0000000 1.000000  0.255790534 0.4365245          NA
## 24  0.0000000 1.000000  0.418932528 0.4936329          NA
## 25  0.0000000 1.000000  0.104733132 0.3063635          NA
## 26  0.0000000 1.000000  0.317220544 0.4656286          NA
## 27  0.0000000 1.000000  0.053373615 0.2248907          NA
## 28  0.0000000 1.000000  0.089627392 0.2857911          NA
## 29  0.0000000 1.000000  0.193353474 0.3951267          NA
## 30  0.0000000 1.000000  0.192346425 0.3943423          NA
## 31  0.0000000 1.000000  0.308157100 0.4619645          NA
## 32  0.0000000 1.000000  0.595166163 0.4911072          NA
## 33  0.0000000 1.000000  0.048338369 0.2145883          NA
## 34  0.0000000 1.000000  0.258811682 0.4382027          NA
## 35  0.0000000 1.000000  0.100704935 0.3010893          NA
## 36  0.0000000 1.000000  0.033232628 0.1793338          NA
## 37  0.0000000 1.000000  0.183282981 0.3870933          NA
## 38  0.0000000 1.000000  0.052366566 0.2228774          NA
## 39  0.0000000 1.000000  0.108761329 0.3114964          NA
## 40  0.0000000 1.000000  0.049345418 0.2166973          NA
## 41  0.0000000 1.000000  0.199395770 0.3997474          NA
## 42  0.0000000 1.000000  0.232628399 0.4227202          NA
## 43  0.0000000 1.000000  0.063444109 0.2438829          NA
## 44  0.0000000 1.000000  0.185297080 0.3887342          NA
## 45  0.0000000 1.000000  0.201409869 0.4012556          NA
## 46  0.0000000 1.000000  0.316213494 0.4652316          NA
## 47  0.0000000 1.000000  0.457200403 0.4984159          NA
## 48  0.0000000 1.000000  0.323262840 0.4679578          NA
## 49  0.0000000 1.000000  0.275931521 0.4472077          NA
## 50  0.0000000 1.000000  0.147029204 0.3543135          NA
## 51  0.0000000 1.000000  0.195367573 0.3966832          NA
## 52  0.0000000 1.000000  0.211480363 0.4085640          NA
## 53  0.0000000 1.000000  0.287009063 0.4525938          NA
## 54  0.0000000 1.000000  0.107754280 0.3102261          NA
## 55  0.0000000 1.000000  0.602215509 0.4896871          NA
## 56  0.0000000 1.000000  0.229607251 0.4207922          NA
# Establish X and Y (Features and Labels)
y <- "Attrition"
x <- setdiff(names(h2o_train), y)
auto_ml <- h2o.automl(
    y = y,
    x = x,
    training_frame = h2o_train,
    leaderboard_frame = h2o_validation,
    project_name = "Attrition",
    max_models = 10,
    seed = 1
)
## 
  |                                                                            
  |                                                                      |   0%
## 12:35:25.425: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 12:35:25.428: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 12 models.
## 12:38:05.936: Skipping training of model GBM_5_AutoML_20210826_123729 due to exception: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_5_AutoML_20210826_123729.  Details: ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 158.0.
## 12:38:10.40: StackedEnsemble_BestOfFamily_AutoML_20210826_123729 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## (the same validation-frame note, GBM_5 skip, and StackedEnsemble failures repeat for every later run that adds models to the shared "Attrition" leaderboard; progress bar trimmed)
  |                                                                            
  |======================================================================| 100%
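A short note on the log: because the same project_name ("Attrition") is reused, every AutoML call keeps appending to one shared leaderboard, and the stacked ensembles then fail when they cannot find cross-validation predictions for models built in earlier runs. A hedged sketch of the adjustment I would try (not part of the run above): give each experiment its own project.
# Not run here: a fresh project per experiment keeps leaderboards separate and
# lets the StackedEnsembles find the CV predictions of their own base models.
auto_ml_fresh <- h2o.automl(
    y = y,
    x = x,
    training_frame = h2o_train,
    leaderboard_frame = h2o_validation,
    project_name = paste0("Attrition_", format(Sys.time(), "%Y%m%d_%H%M%S")),
    max_models = 10,
    seed = 1
)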
# Best models
best_models <- auto_ml@leaderboard
best_models %>% as.data.frame %>% DT::datatable()
# Retrieve the best model from the leaderboard
best_model_id <- as.data.frame(best_models$model_id)[,1]
stacked_ensemble_model <- h2o.getModel(grep("StackedEnsemble_BestOfFamily", best_model_id, value=TRUE)[1])
metalearner <- h2o.getModel(stacked_ensemble_model@model$metalearner$name)
h2o.varimp_plot(metalearner) 
# (Optional) LIME-based local explanations for individual employees; kept
# commented out because `rnd_train` and `SEBOF` are created in an earlier
# section, not in this chunk.
# explainer <- lime(rnd_train, SEBOF)
# explain_top <- lime::explain(rnd_train[1:5], explainer, n_labels = 2, n_features = 10)
# plot_explanations(explain_top)
glm <- h2o.getModel(grep("GLM", best_model_id, value = TRUE)[1])
xgb <- h2o.getModel(grep("XGBoost", best_model_id, value = TRUE)[1])
# Examine the variable importance of the individual base learners;
# unlike the stacked ensemble itself, the GLM and XGBoost models expose feature importances directly.
h2o.varimp(glm) %>% DT::datatable()
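Since the GLM is a linear model, its standardized coefficients also show the direction of each effect, not just its size. A small optional sketch using h2o's built-in plot (not part of the original chunk):
# Standardized coefficient magnitudes of the GLM base learner (sign indicates direction).
h2o.std_coef_plot(glm, num_of_features = 15)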
# We can also plot the variable importance of each base learner.
h2o.varimp_plot(glm)
h2o.varimp_plot(xgb)
h2o.performance(auto_ml@leader, h2o_test) -> performance_automl
h2o.confusionMatrix(performance_automl)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.444758160347686:
##         No Yes    Error     Rate
## No     155   6 0.037267   =6/161
## Yes     12  23 0.342857   =12/35
## Totals 167  29 0.091837  =18/196
h2o.F1(performance_automl, thresholds = .5)
## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.50628906742409. Run
## `h2o.predict` and apply your desired threshold on a probability column.
## [[1]]
## [1] 0.6315789
h2o.accuracy(performance_automl, thresholds = .5)
## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.50628906742409. Run
## `h2o.predict` and apply your desired threshold on a probability column.
## [[1]]
## [1] 0.8928571
h2o.recall(performance_automl, thresholds = .5)
## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.50628906742409. Run
## `h2o.predict` and apply your desired threshold on a probability column.
## [[1]]
## [1] 0.5142857
h2o.auc(performance_automl)
## [1] 0.9110914
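The warnings above appear because 0.5 is not one of the thresholds stored in the metrics object. If a fixed cutoff is really needed, the cleaner route is the one the warning suggests: score the test frame and apply the cutoff to the probability column yourself. A minimal sketch, assuming the probability column is named `Yes` after the factor level used here:
# Score the test frame and apply a fixed 0.5 cutoff directly.
pred   <- h2o.predict(auto_ml@leader, h2o_test) %>% as.data.frame()
actual <- as.data.frame(h2o_test)$Attrition
table(actual = actual, predicted = ifelse(pred$Yes >= 0.5, "Yes", "No"))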
plot(performance_automl, type="roc")
model_path <- h2o.saveModel(auto_ml@leader, path=getwd(), force=TRUE)
model_path
## [1] "/Users/raymondkim/Rproject/Turnover/StackedEnsemble_BestOfFamily_AutoML_20210826_123312"
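The saved leader can be reloaded in a later session without retraining; a one-line sketch using the path returned above:
# Reload the saved leader from disk (e.g., in a new session after h2o.init()).
loaded_leader <- h2o.loadModel(model_path)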
\(Y = aX + b\)
The reason for doing predictive people analytics is to find changeable X's that influence a Y we want to move in a positive direction; the analyses so far, in contrast, threw in every available X and focused only on predicting well.
As we saw earlier, the things we would want to change through attrition prediction look roughly like this:
| No | Y | 
|---|---|
| 1 | Retention of key talent | 
| 2 | Workforce planning | 
| 3 | Pre-employment selection | 
| 4 | Training and development planning | 
And the changeable X's, together with possible interventions, might look like this:
| No | X | Intervention | 
|---|---|---|
| 1 | Years in Current Role | Internal transfer to another department | 
| 2 | Years with Current Manager | Internal transfer, change of team leader | 
| 3 | Overtime | Limits on overtime, remote work | 
| 4 | Work-Life Balance | Limits on overtime, remote work | 
| 5 | Environment Satisfaction | Improving the work environment | 
| 6 | Distance From Home | Remote work, satellite offices | 
| 7 | Business Travel | VR / video-conferencing systems | 
| 8 | Department | Internal transfer to another department | 
| 9 | Training Times Last Year | Designing and running a training curriculum | 
| 10 | Years Since Last Promotion | Promotion | 
Dataset <- readRDS("Dataset_pre.RDS")
Dataset %>% colnames()
##  [1] "Age"                      "Attrition"               
##  [3] "BusinessTravel"           "DailyRate"               
##  [5] "Department"               "DistanceFromHome"        
##  [7] "Education"                "EducationField"          
##  [9] "EnvironmentSatisfaction"  "Gender"                  
## [11] "HourlyRate"               "JobInvolvement"          
## [13] "JobLevel"                 "JobRole"                 
## [15] "JobSatisfaction"          "MaritalStatus"           
## [17] "MonthlyRate"              "NumCompaniesWorked"      
## [19] "OverTime"                 "PercentSalaryHike"       
## [21] "PerformanceRating"        "RelationshipSatisfaction"
## [23] "StockOptionLevel"         "TotalWorkingYears"       
## [25] "TrainingTimesLastYear"    "WorkLifeBalance"         
## [27] "YearsAtCompany"           "YearsInCurrentRole"      
## [29] "YearsSinceLastPromotion"  "YearsWithCurrManager"
Dataset %>% dplyr::select(Attrition,YearsInCurrentRole,OverTime, WorkLifeBalance, 
                          EnvironmentSatisfaction,DistanceFromHome, Department, 
                          TrainingTimesLastYear,YearsSinceLastPromotion,YearsWithCurrManager)->Dataset_HR
Dataset_HR  %>% mutate_if(is.character, factor)-> Data_HR
# Setting Reference level
Data_HR$Attrition <- relevel(Data_HR$Attrition, ref = "Yes")
Data_HR %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_numeric()) %>% prep()-> h2o_recipe_re
h2o_recipe_re %>% juice -> Dataset_h2o_re
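As a quick check that is not in the original chunk, the prepped recipe can report which columns step_corr (the third step above) removed for high pairwise correlation:
# Columns dropped by step_corr, if any (step 3 in the recipe above).
tidy(h2o_recipe_re, number = 3)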
# Putting the original dataframe into an h2o format
Dataset_h2o_re %>% as.h2o(destination_frame = "h2o_df_re")->h2o_df_re
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
# Splitting into training, validation and testing sets
split_df_re <- h2o.splitFrame(h2o_df_re, c(0.7, 0.15), seed=12)
# Obtaining our three types of sets into three separate values
h2o_train_re <- h2o.assign(split_df_re[[1]], "train")
h2o_validation_re <- h2o.assign(split_df_re[[2]], "validation")
h2o_test_re <- h2o.assign(split_df_re[[3]], "test")
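With the test frame taken from the third split, a quick size check (my addition) confirms the roughly 70/15/15 proportions:
# Row counts of the three frames.
sapply(list(train = h2o_train_re, validation = h2o_validation_re, test = h2o_test_re), nrow)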
h2o.describe(h2o_train_re)
##                                Label Type Missing Zeros PosInf NegInf
## 1                 YearsInCurrentRole real       0     0      0      0
## 2                   DistanceFromHome real       0     0      0      0
## 3              TrainingTimesLastYear real       0     0      0      0
## 4            YearsSinceLastPromotion real       0     0      0      0
## 5               YearsWithCurrManager real       0     0      0      0
## 6                          Attrition enum       0   838      0      0
## 7                       OverTime_Yes  int       0   719      0      0
## 8               WorkLifeBalance_Best  int       0   886      0      0
## 9             WorkLifeBalance_Better  int       0   395      0      0
## 10              WorkLifeBalance_Good  int       0   765      0      0
## 11       EnvironmentSatisfaction_Low  int       0   801      0      0
## 12    EnvironmentSatisfaction_Medium  int       0   802      0      0
## 13 EnvironmentSatisfaction_Very.High  int       0   687      0      0
## 14 Department_Research...Development  int       0   349      0      0
##           Min      Max         Mean     Sigma Cardinality
## 1  -1.1788794 3.826003 -0.012229704 0.9933579          NA
## 2  -1.0113326 2.484078  0.001687752 1.0064210          NA
## 3  -2.1847628 2.498781 -0.010428912 0.9911376          NA
## 4  -0.6928449 4.421544 -0.044992894 0.9661122          NA
## 5  -1.1627758 3.583973 -0.005964821 0.9948860          NA
## 6   0.0000000 1.000000  0.156092649 0.3631260           2
## 7   0.0000000 1.000000  0.275931521 0.4472077          NA
## 8   0.0000000 1.000000  0.107754280 0.3102261          NA
## 9   0.0000000 1.000000  0.602215509 0.4896871          NA
## 10  0.0000000 1.000000  0.229607251 0.4207922          NA
## 11  0.0000000 1.000000  0.193353474 0.3951267          NA
## 12  0.0000000 1.000000  0.192346425 0.3943423          NA
## 13  0.0000000 1.000000  0.308157100 0.4619645          NA
## 14  0.0000000 1.000000  0.648539778 0.4776669          NA
# Establish X and Y (Features and Labels)
y1 <- "Attrition"
x1 <- setdiff(names(h2o_train_re), y1)
automl <- h2o.automl(
    y = y1,
    x = x1,
    training_frame = h2o_train_re,
    validation_frame = h2o_validation_re,
    project_name = "Attrition",
    max_models = 10,
    seed = 1
)
## 
  |                                                                            
  |                                                                      |   0%
## (progress bar and H2O log trimmed; the log repeats the same messages as the previous run: the validation-frame note, the skipped GBM_5 models due to _min_rows, and the StackedEnsemble failures caused by missing cross-validation predictions on the shared "Attrition" leaderboard)
  |                                                                            
  |======================================================================| 100%
# Best models
best_models <- automl@leaderboard
best_models %>% as.data.frame %>% DT::datatable()
h2o.performance(automl@leader, h2o_test_re) -> performance_automl_re
h2o.confusionMatrix(performance_automl_re)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.261791635152933:
##         No Yes    Error     Rate
## No     146  15 0.093168  =15/161
## Yes     11  24 0.314286   =11/35
## Totals 157  39 0.132653  =26/196
h2o.F1(performance_automl_re, thresholds = .5)
## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.522384473931713. Run
## `h2o.predict` and apply your desired threshold on a probability column.
## [[1]]
## [1] 0.2
h2o.accuracy(performance_automl_re, thresholds = .5)
## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.522384473931713. Run
## `h2o.predict` and apply your desired threshold on a probability column.
## [[1]]
## [1] 0.8367347
h2o.recall(performance_automl_re, thresholds = .5)
## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.522384473931713. Run
## `h2o.predict` and apply your desired threshold on a probability column.
## [[1]]
## [1] 0.1142857
h2o.auc(performance_automl_re)
## [1] 0.8436557
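Recall at the 0.5 cutoff drops sharply for this reduced-feature model, which partly reflects the roughly 16% attrition base rate. One option worth trying (a sketch, not part of the original run) is to let AutoML oversample the minority class:
# Not run here: oversample the minority class during training.
automl_bal <- h2o.automl(
    y = y1,
    x = x1,
    training_frame = h2o_train_re,
    validation_frame = h2o_validation_re,
    balance_classes = TRUE,
    project_name = "Attrition_HR_balanced",
    max_models = 10,
    seed = 1
)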
plot(performance_automl_re, type="roc")
# Retrieve the best model from the leaderboard
best_model_id2 <- as.data.frame(best_models$model_id)[,1]
glm_re <- h2o.getModel(grep("GLM", best_model_id2, value = TRUE)[1])
h2o.varimp(glm_re) %>% DT::datatable()
# Group-level model: restrict to high performers (PerformanceRating == "Outstanding")
Dataset <- readRDS("Dataset_pre.RDS")
Dataset %>% dplyr::select(PerformanceRating) %>% unique
## # A tibble: 2 x 1
##   PerformanceRating
##   <chr>            
## 1 Excellent        
## 2 Outstanding
Dataset %>% filter(PerformanceRating %in% "Outstanding") %>% nrow
## [1] 212
Dataset %>% filter(PerformanceRating %in% "Outstanding") %>% dplyr::select(Attrition) %>% table
## .
##  No Yes 
## 175  37
# The Attrition split within the "Outstanding" group (37/212, about 17%) is similar to the full dataset
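A quick check (my addition) makes that comparison explicit, showing the attrition rate by performance rating:
# Attrition rate within each PerformanceRating group.
Dataset %>%
  group_by(PerformanceRating) %>%
  summarise(n = n(), attrition_rate = mean(Attrition == "Yes"))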
Dataset %>% filter(PerformanceRating %in% "Outstanding")->Dataset_High
Dataset_High %>% mutate_if(is.character, factor) -> Data_High
# Setting Reference level
Data_High$Attrition <- relevel(Data_High$Attrition, ref = "Yes")
Data_High %>% dplyr::select(-PerformanceRating) %>% 
  dplyr::select(Attrition,YearsInCurrentRole,OverTime, WorkLifeBalance, 
                          EnvironmentSatisfaction,DistanceFromHome, Department, 
                          TrainingTimesLastYear,YearsSinceLastPromotion,YearsWithCurrManager)->Data_High
Data_High %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_numeric()) %>% prep()-> h2o_recipe_High
h2o_recipe_High %>% juice -> Dataset_h2o_High
# Putting the original dataframe into an h2o format
Dataset_h2o_High %>% as.h2o(destination_frame = "h2o_df_High") -> h2o_df_High
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
# Splitting into training, validation and testing sets
split_df_High <- h2o.splitFrame(h2o_df_High, c(0.7, 0.15), seed=12)
# Obtaining our three types of sets into three separate values
h2o_train_High <- h2o.assign(split_df_High[[1]], "train")
h2o_validation_High <- h2o.assign(split_df_High[[2]], "validation")
h2o_test_High <- h2o.assign(split_df_High[[3]], "test")
h2o.describe(h2o_train_High)
##                                Label Type Missing Zeros PosInf NegInf
## 1                 YearsInCurrentRole real       0     0      0      0
## 2                   DistanceFromHome real       0     0      0      0
## 3              TrainingTimesLastYear real       0     0      0      0
## 4            YearsSinceLastPromotion real       0     0      0      0
## 5               YearsWithCurrManager real       0     0      0      0
## 6                          Attrition enum       0   135      0      0
## 7                       OverTime_Yes  int       0   110      0      0
## 8               WorkLifeBalance_Best  int       0   140      0      0
## 9             WorkLifeBalance_Better  int       0    62      0      0
## 10              WorkLifeBalance_Good  int       0   124      0      0
## 11       EnvironmentSatisfaction_Low  int       0   129      0      0
## 12    EnvironmentSatisfaction_Medium  int       0   127      0      0
## 13 EnvironmentSatisfaction_Very.High  int       0   114      0      0
## 14 Department_Research...Development  int       0    48      0      0
## 15                  Department_Sales  int       0   116      0      0
##           Min      Max        Mean     Sigma Cardinality
## 1  -1.1954805 3.396897 -0.01832029 0.9943792          NA
## 2  -1.0107498 2.250017 -0.03340425 0.9781307          NA
## 3  -2.1589112 2.583982 -0.02761118 1.0003229          NA
## 4  -0.6732738 3.673956  0.01916470 0.9943281          NA
## 5  -1.2019820 3.242556 -0.01114584 0.9985143          NA
## 6   0.0000000 1.000000  0.14556962 0.3537956           2
## 7   0.0000000 1.000000  0.30379747 0.4613586          NA
## 8   0.0000000 1.000000  0.11392405 0.3187292          NA
## 9   0.0000000 1.000000  0.60759494 0.4898387          NA
## 10  0.0000000 1.000000  0.21518987 0.4122607          NA
## 11  0.0000000 1.000000  0.18354430 0.3883430          NA
## 12  0.0000000 1.000000  0.19620253 0.3983862          NA
## 13  0.0000000 1.000000  0.27848101 0.4496767          NA
## 14  0.0000000 1.000000  0.69620253 0.4613586          NA
## 15  0.0000000 1.000000  0.26582278 0.4431750          NA
# Establish X and Y (Features and Labels)
y <- "Attrition"
x <- setdiff(names(h2o_train_High), y)
automl_high <- h2o.automl(
    y = y,
    x = x,
    training_frame = h2o_train_High,
    validation_frame = h2o_validation_High,
    project_name = "Attrition",
    max_models = 10,
    seed = 1
)
# Best models
best_models_High <- automl_high@leaderboard
best_models_High %>% as.data.frame %>% DT::datatable()
h2o.performance(automl_high@leader, h2o_test_High) -> performance_automl_High
# Confusion matrix at the max-F1 threshold (0.479166666666667)
h2o.confusionMatrix(performance_automl_High)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.479166666666667:
##        No Yes    Error   Rate
## No     20   0 0.000000  =0/20
## Yes     0   7 0.000000   =0/7
## Totals 20   7 0.000000  =0/27
h2o.F1(performance_automl_High)
##      threshold        f1
## 1  0.873958333 0.2500000
## 2  0.770833333 0.4444444
## 3  0.729166667 0.6000000
## 4  0.645833333 0.7272727
## 5  0.583333333 0.8333333
## 6  0.500000000 0.9230769
## 7  0.479166667 1.0000000
## 8  0.423611111 0.9333333
## 9  0.187500000 0.8750000
## 10 0.135416667 0.7777778
## 11 0.130555555 0.7368421
## 12 0.114583333 0.7000000
## 13 0.062500000 0.6666667
## 14 0.041666667 0.6086957
## 15 0.028472222 0.5833333
## 16 0.020833333 0.4827586
## 17 0.010416667 0.4666667
## 18 0.006944444 0.4516129
## 19 0.000000000 0.4117647
h2o.accuracy(performance_automl_High)
##      threshold  accuracy
## 1  0.873958333 0.7777778
## 2  0.770833333 0.8148148
## 3  0.729166667 0.8518519
## 4  0.645833333 0.8888889
## 5  0.583333333 0.9259259
## 6  0.500000000 0.9629630
## 7  0.479166667 1.0000000
## 8  0.423611111 0.9629630
## 9  0.187500000 0.9259259
## 10 0.135416667 0.8518519
## 11 0.130555555 0.8148148
## 12 0.114583333 0.7777778
## 13 0.062500000 0.7407407
## 14 0.041666667 0.6666667
## 15 0.028472222 0.6296296
## 16 0.020833333 0.4444444
## 17 0.010416667 0.4074074
## 18 0.006944444 0.3703704
## 19 0.000000000 0.2592593
h2o.recall(performance_automl_High)
##      threshold       tpr
## 1  0.873958333 0.1428571
## 2  0.770833333 0.2857143
## 3  0.729166667 0.4285714
## 4  0.645833333 0.5714286
## 5  0.583333333 0.7142857
## 6  0.500000000 0.8571429
## 7  0.479166667 1.0000000
## 8  0.423611111 1.0000000
## 9  0.187500000 1.0000000
## 10 0.135416667 1.0000000
## 11 0.130555555 1.0000000
## 12 0.114583333 1.0000000
## 13 0.062500000 1.0000000
## 14 0.041666667 1.0000000
## 15 0.028472222 1.0000000
## 16 0.020833333 1.0000000
## 17 0.010416667 1.0000000
## 18 0.006944444 1.0000000
## 19 0.000000000 1.0000000
h2o.precision(performance_automl_High)
##      threshold precision
## 1  0.873958333 1.0000000
## 2  0.770833333 1.0000000
## 3  0.729166667 1.0000000
## 4  0.645833333 1.0000000
## 5  0.583333333 1.0000000
## 6  0.500000000 1.0000000
## 7  0.479166667 1.0000000
## 8  0.423611111 0.8750000
## 9  0.187500000 0.7777778
## 10 0.135416667 0.6363636
## 11 0.130555555 0.5833333
## 12 0.114583333 0.5384615
## 13 0.062500000 0.5000000
## 14 0.041666667 0.4375000
## 15 0.028472222 0.4117647
## 16 0.020833333 0.3181818
## 17 0.010416667 0.3043478
## 18 0.006944444 0.2916667
## 19 0.000000000 0.2592593
h2o.auc(performance_automl_High)
## [1] 1
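An AUC of 1 should be read with care here: the Outstanding-only test frame holds just 27 employees, so a quick look at its class counts (my addition) is worthwhile before drawing conclusions:
# Class counts of the small high-performer test frame.
h2o.table(h2o_test_High$Attrition)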
plot(performance_automl_High, type="roc")
# Retrieve the best model from the leaderboard
best_model_id_High <- as.data.frame(best_models_High$model_id)[,1]
DRF_re <- h2o.getModel(grep("DRF", best_model_id_High, value = TRUE)[1])
h2o.varimp(DRF_re) %>% DT::datatable()
Splitting the data at the group level surfaced yet another set of influential variables.
Now it is a matter of interpreting these results in the organization's own situation and context, and of designing and running the interventions that fit!
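As a quick way to see that group-level difference side by side (a sketch reusing the objects created above), the top variables of the all-employee model and the high-performer model can be compared before shutting H2O down:
# Top five variables of each model, printed side by side.
list(
  all_employees    = h2o.varimp(glm_re),
  outstanding_only = h2o.varimp(DRF_re)
) %>%
  purrr::map(~ head(as.data.frame(.x), 5))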
h2o.shutdown()
## Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)?
https://hbr.org/2019/08/better-ways-to-predict-whos-going-to-quit↩︎
Speer, A. B. (2021). Empirical attrition modelling and discrimination: Balancing validity and group differences. Human Resource Management Journal.↩︎
Gibson, C., Koenig, N., Griffith, J., & Hardy, J. H. (2019). Selecting for retention: Understanding turnover prehire. Industrial and Organizational Psychology, 12(3), 338-341.↩︎
McCloy, R. A., Smith, E. A., & Anderson, M. G. (2016). Predicting voluntary turnover from engagement data. In 31st Annual Conference of the Society for Industrial & Organizational Psychology, Anaheim, CA.↩︎
Speer, A. B., Dutta, S., Chen, M., & Trussell, G. (2019). Here to stay or go? Connecting turnover research to applied attrition modeling. Industrial and Organizational Psychology, 12(3), 277-301.↩︎
Strickland, W. J. (2005). A longitudinal examination of first term attrition and reenlistment among FY1999 enlisted accessions. Human Resources Research Organization, Alexandria, VA.↩︎
Brown, I., & Mues, C. (2012). An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications, 39(3), 3446-3453.↩︎
박재신, & 방성완. (2015). 불균형 자료의 분류분석에서 샘플링 기법을 이용한 로지스틱 회귀분석. Journal of The Korean Data Analysis Society, 17(4), 1877-1888.↩︎