Chapter 11 Cluster Analysis
These notes are primarily drawn from the DataCamp courses Cluster Analysis in R and Unsupervised Learning in R, plus AIHR, the UC Business Analytics R Programming Guide, and PSU STAT-505.
Cluster analysis is a data exploration (mining) tool for dividing observations into clusters, distinct populations with no a priori defining characteristics. The goal is to describe those populations using the observed data. Popular uses of clustering include audience segmentation, creating personas, detecting anomalies, and pattern recognition in images.
There are two common approaches to cluster analysis.
Agglomerative hierarchical algorithms start by defining each data point as its own cluster, then repeatedly merge the two closest clusters until all data points belong to a single cluster.
Non-hierarchical methods such as K-means initially partition the data at random into a set of K clusters, then iteratively reassign observations among clusters until no sensible reassignment remains. A minimal sketch of both approaches follows.
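To make the distinction concrete, here is a minimal sketch of both approaches using base R's hclust() and kmeans() on the built-in mtcars data (a toy illustration, separate from the attrition analysis below).
# Toy illustration of both approaches on scaled mtcars data
m <- scale(mtcars)
# Agglomerative hierarchical: build the full merge tree, then cut it into 3 clusters
hc <- hclust(dist(m))
hc_clusters <- cutree(hc, k = 3)
# K-means: partition directly into 3 clusters (nstart re-runs to avoid poor random starts)
km <- kmeans(m, centers = 3, nstart = 25)
# Cross-tabulate the two cluster assignments
table(hc_clusters, km$cluster)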
Setup
I will learn by example, using the IBM HR Analytics Employee Attrition & Performance data set from Kaggle to discover which factors are associated with employee turnover and whether distinct clusters of employees are more susceptible to turnover. The clusters can help personalize the employee experience (AIHR). This data set includes 1,470 employee records consisting of the EmployeeNumber, a flag for Attrition during some time frame, and 33 other descriptive variables.
library(tidyverse)
library(plotly) # interactive graphing
library(cluster) # daisy and pam
library(Rtsne) # dimensionality reduction and visualization
library(dendextend) # color_branches
set.seed(1234) # reproducibility
dat <- read_csv("./input/WA_Fn-UseC_-HR-Employee-Attrition.csv")
dat <- dat %>%
  mutate_if(is.character, as_factor) %>%
  mutate(
    EnvironmentSatisfaction = factor(EnvironmentSatisfaction, ordered = TRUE),
    StockOptionLevel = factor(StockOptionLevel, ordered = TRUE),
    JobLevel = factor(JobLevel, ordered = TRUE),
    JobInvolvement = factor(JobInvolvement, ordered = TRUE)
  ) %>%
  select(EmployeeNumber, Attrition, everything())
# Trim the skim output: drop the quartile and histogram columns for numerics
my_skim <- skimr::skim_with(numeric = skimr::sfl(p25 = NULL, p50 = NULL, p75 = NULL, hist = NULL))
my_skim(dat)
Name | dat |
---|---|
Number of rows | 1470 |
Number of columns | 35 |
Column type frequency: factor | 13 |
Column type frequency: numeric | 22 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Attrition | 0 | 1 | FALSE | 2 | No: 1233, Yes: 237 |
BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 1043, Tra: 277, Non: 150 |
Department | 0 | 1 | FALSE | 3 | Res: 961, Sal: 446, Hum: 63 |
EducationField | 0 | 1 | FALSE | 6 | Lif: 606, Med: 464, Mar: 159, Tec: 132 |
EnvironmentSatisfaction | 0 | 1 | TRUE | 4 | 3: 453, 4: 446, 2: 287, 1: 284 |
Gender | 0 | 1 | FALSE | 2 | Mal: 882, Fem: 588 |
JobInvolvement | 0 | 1 | TRUE | 4 | 3: 868, 2: 375, 4: 144, 1: 83 |
JobLevel | 0 | 1 | TRUE | 5 | 1: 543, 2: 534, 3: 218, 4: 106 |
JobRole | 0 | 1 | FALSE | 9 | Sal: 326, Res: 292, Lab: 259, Man: 145 |
MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 673, Sin: 470, Div: 327 |
Over18 | 0 | 1 | FALSE | 1 | Y: 1470 |
OverTime | 0 | 1 | FALSE | 2 | No: 1054, Yes: 416 |
StockOptionLevel | 0 | 1 | TRUE | 4 | 0: 631, 1: 596, 2: 158, 3: 85 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p100 |
---|---|---|---|---|---|---|
EmployeeNumber | 0 | 1 | 1024.87 | 602.02 | 1 | 2068 |
Age | 0 | 1 | 36.92 | 9.14 | 18 | 60 |
DailyRate | 0 | 1 | 802.49 | 403.51 | 102 | 1499 |
DistanceFromHome | 0 | 1 | 9.19 | 8.11 | 1 | 29 |
Education | 0 | 1 | 2.91 | 1.02 | 1 | 5 |
EmployeeCount | 0 | 1 | 1.00 | 0.00 | 1 | 1 |
HourlyRate | 0 | 1 | 65.89 | 20.33 | 30 | 100 |
JobSatisfaction | 0 | 1 | 2.73 | 1.10 | 1 | 4 |
MonthlyIncome | 0 | 1 | 6502.93 | 4707.96 | 1009 | 19999 |
MonthlyRate | 0 | 1 | 14313.10 | 7117.79 | 2094 | 26999 |
NumCompaniesWorked | 0 | 1 | 2.69 | 2.50 | 0 | 9 |
PercentSalaryHike | 0 | 1 | 15.21 | 3.66 | 11 | 25 |
PerformanceRating | 0 | 1 | 3.15 | 0.36 | 3 | 4 |
RelationshipSatisfaction | 0 | 1 | 2.71 | 1.08 | 1 | 4 |
StandardHours | 0 | 1 | 80.00 | 0.00 | 80 | 80 |
TotalWorkingYears | 0 | 1 | 11.28 | 7.78 | 0 | 40 |
TrainingTimesLastYear | 0 | 1 | 2.80 | 1.29 | 0 | 6 |
WorkLifeBalance | 0 | 1 | 2.76 | 0.71 | 1 | 4 |
YearsAtCompany | 0 | 1 | 7.01 | 6.13 | 0 | 40 |
YearsInCurrentRole | 0 | 1 | 4.23 | 3.62 | 0 | 18 |
YearsSinceLastPromotion | 0 | 1 | 2.19 | 3.22 | 0 | 15 |
YearsWithCurrManager | 0 | 1 | 4.12 | 3.57 | 0 | 17 |
You would normally start a cluster analysis with an exploration of the data to determine which variables are interesting and relevant to your goal. I’ll bypass that rigor and just use a binary correlation analysis. Binary correlation analysis converts features into binary format by binning the continuous features and one-hot encoding the categorical features. correlate() calculates the correlation coefficient between each binary feature and the response variable. A correlation funnel is a tornado plot that lists the highest-correlation features (by absolute magnitude) at the top and the lowest-correlation features at the bottom. Read more on the correlationfunnel GitHub README.
Using binary correlation, I’ll include just the variables with a correlation coefficient of at least 0.10. For our employee attrition data set, OverTime (Y|N) has the largest correlation, followed by JobLevel = 1, MonthlyIncome <= 2,695.80, etc.
dat %>%
  select(-EmployeeNumber) %>%
  correlationfunnel::binarize(n_bins = 5, thresh_infreq = 0.01) %>%
  correlationfunnel::correlate(Attrition__Yes) %>%
  correlationfunnel::plot_correlation_funnel(interactive = FALSE) %>%
  ggplotly() # makes the plot interactive, but drops the labels
Using the cutoff of 0.1 leaves 14 features for the analysis.
vars <- c(
  "EmployeeNumber", "Attrition",
  "OverTime", "JobLevel", "MonthlyIncome", "YearsAtCompany", "StockOptionLevel",
  "YearsWithCurrManager", "TotalWorkingYears", "MaritalStatus", "Age",
  "YearsInCurrentRole", "JobRole", "EnvironmentSatisfaction", "JobInvolvement",
  "BusinessTravel"
)
dat_2 <- dat %>% select(one_of(vars))
Data Preparation
The concept of distance is central to clustering. Two observations are similar if the distance between their features is relatively small. To compare feature distances, the features should be on a similar scale. There are many ways to define distance (see the options in ?dist), but the two most common are Euclidean, \(d = \sqrt{\sum{(x_i - y_i)^2}}\), and binary, 1 minus the proportion of shared features (Wikipedia, PSU-505). If you have a mix of feature types, use the Gower distance (Analytics Vidhya), which range-normalizes the quantitative variables, one-hot encodes the nominal variables, and ranks the ordinal variables. It then calculates distances using the Manhattan distance for quantitative and ordinal variables and the Dice coefficient for nominal variables.
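To make these measures concrete, here is a minimal sketch on toy vectors of my own invention (not from the attrition data), comparing dist()'s Euclidean and binary methods with daisy()'s Gower distance on mixed feature types.
# Euclidean distance between two numeric observations
x <- rbind(c(1, 2, 3), c(4, 6, 3))
dist(x, method = "euclidean") # sqrt((1-4)^2 + (2-6)^2 + 0^2) = 5
# Binary distance: proportion of differing features among those "on" in at least one row
b <- rbind(c(1, 1, 0, 0), c(1, 0, 1, 0))
dist(b, method = "binary") # 2 of the 3 jointly-nonzero positions differ: 0.667
# Gower distance handles mixed numeric/factor types directly
d <- data.frame(income = c(2000, 5000), role = factor(c("Sales", "Manager")))
cluster::daisy(d, metric = "gower")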
Gower’s distance is computationally expensive, so you could instead one-hot encode the data and standardize the variables as \((x - \bar{x}) / \mathrm{sd}(x)\) so that each feature has a mean of 0 and a standard deviation of 1, like this:
# cannot one-hot encode ordered factors, so change them to unordered
x <- dat_2 %>% mutate_if(is.ordered, ~ factor(., ordered = FALSE))
# one-hot encode columns 2:16 (everything except the EmployeeNumber ID)
dat_2_mtrx <- mltools::one_hot(data.table::as.data.table(x[, 2:16])) %>%
  as.matrix()
row.names(dat_2_mtrx) <- dat_2$EmployeeNumber
dat_2_mtrx <- na.omit(dat_2_mtrx)
# standardize: mean 0, standard deviation 1
dat_2_mtrx <- scale(dat_2_mtrx)
dat_2_dist <- dist(dat_2_mtrx)
But normally you would go ahead and calculate Gower’s distance using daisy().
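The daisy() call producing the dat_2_gwr object used below isn't shown in these notes; a minimal sketch, assuming the same feature columns as the one-hot example above (everything except the EmployeeNumber ID):
# Gower distance on the mixed-type features; daisy() handles numeric,
# factor, and ordered-factor columns without manual encoding
dat_2_gwr <- cluster::daisy(dat_2[, 2:16], metric = "gower")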
As a sanity check, let’s see the most similar and dissimilar pairs of employees according to their Gower distance. Here are the most similar employees.
x <- as.matrix(dat_2_gwr)
dat_2[which(x == min(x[x != 0]), arr.ind = TRUE)[1, ], ] %>%
  t() %>% as.data.frame() %>% rownames_to_column() %>%
  flextable::flextable() %>% flextable::autofit()
rowname | V1 | V2 |
---|---|---|
EmployeeNumber | 1624 | 614 |
Attrition | Yes | Yes |
OverTime | Yes | Yes |
JobLevel | 1 | 1 |
MonthlyIncome | 1569 | 1878 |
YearsAtCompany | 0 | 0 |
StockOptionLevel | 0 | 0 |
YearsWithCurrManager | 0 | 0 |
TotalWorkingYears | 0 | 0 |
MaritalStatus | Single | Single |
Age | 18 | 18 |
YearsInCurrentRole | 0 | 0 |
JobRole | Sales Representative | Sales Representative |
EnvironmentSatisfaction | 2 | 2 |
JobInvolvement | 3 | 3 |
BusinessTravel | Travel_Frequently | Travel_Frequently |
They are identical except for MonthlyIncome. Here are the most dissimilar employees.
dat_2[which(x == max(x), arr.ind = TRUE)[1, ], ] %>%
  t() %>% as.data.frame() %>% rownames_to_column() %>%
  flextable::flextable() %>% flextable::autofit()
rowname | V1 | V2 |
---|---|---|
EmployeeNumber | 1094 | 825 |
Attrition | No | Yes |
OverTime | No | Yes |
JobLevel | 1 | 5 |
MonthlyIncome | 4621 | 19246 |
YearsAtCompany | 3 | 31 |
StockOptionLevel | 3 | 0 |
YearsWithCurrManager | 2 | 8 |
TotalWorkingYears | 3 | 40 |
MaritalStatus | Married | Single |
Age | 27 | 58 |
YearsInCurrentRole | 2 | 15 |
JobRole | Laboratory Technician | Research Director |
EnvironmentSatisfaction | 1 | 4 |
JobInvolvement | 1 | 3 |
BusinessTravel | Non-Travel | Travel_Rarely |
These two employees have nothing in common. With the data preparation complete, we can finally perform the cluster analysis. I’ll try K-means and hierarchical cluster analysis (HCA).