Chapter 11 Cluster Analysis
These notes are primarily drawn from the DataCamp courses Cluster Analysis in R and Unsupervised Learning in R, plus AIHR, the UC Business Analytics R Programming Guide, and PSU STAT-505.
Cluster analysis is a data exploration (mining) tool for dividing observations into clusters, distinct populations with no a priori defining characteristics. The goal is to describe those populations using the observed data. Popular uses of clustering include audience segmentation, creating personas, detecting anomalies, and pattern recognition in images.
There are two common approaches to cluster analysis.
Agglomerative hierarchical algorithms start by defining each data point as its own cluster, then repeatedly merge the two closest clusters until all data points belong to a single cluster.
Non-hierarchical methods such as K-means initially partition the data at random into a set of K clusters, then iteratively reassign observations among clusters until no sensible reassignment remains. A minimal sketch of both approaches follows.
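To make the distinction concrete, here is a minimal sketch of both approaches using base R's hclust() and kmeans() on the built-in mtcars data (a toy illustration, separate from the attrition analysis below).
# Toy illustration of both approaches on scaled mtcars data
m <- scale(mtcars)
# Agglomerative hierarchical: build the full merge tree, then cut it into 3 clusters
hc <- hclust(dist(m))
hc_clusters <- cutree(hc, k = 3)
# K-means: partition directly into 3 clusters (nstart re-runs to avoid poor random starts)
km <- kmeans(m, centers = 3, nstart = 25)
# Cross-tabulate the two cluster assignments
table(hc_clusters, km$cluster)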
Setup
I will learn by example, using the IBM HR Analytics Employee Attrition & Performance data set from Kaggle to discover which factors are associated with employee turnover and whether distinct clusters of employees are more susceptible to turnover. The clusters can help personalize the employee experience (AIHR). This data set includes 1,470 employee records consisting of the EmployeeNumber, a flag for Attrition during some time frame, and 33 other descriptive variables.
library(tidyverse)
library(plotly) # interactive graphing
library(cluster) # daisy and pam
library(Rtsne) # dimensionality reduction and visualization
library(dendextend) # color_branches
set.seed(1234) # reproducibility
dat <- read_csv("./input/WA_Fn-UseC_-HR-Employee-Attrition.csv")
dat <- dat %>%
  mutate_if(is.character, as_factor) %>%
  mutate(
    EnvironmentSatisfaction = factor(EnvironmentSatisfaction, ordered = TRUE),
    StockOptionLevel = factor(StockOptionLevel, ordered = TRUE),
    JobLevel = factor(JobLevel, ordered = TRUE),
    JobInvolvement = factor(JobInvolvement, ordered = TRUE)
  ) %>%
  select(EmployeeNumber, Attrition, everything())
# Trim the skim output: drop the quartile and histogram columns for numerics
my_skim <- skimr::skim_with(numeric = skimr::sfl(p25 = NULL, p50 = NULL, p75 = NULL, hist = NULL))
my_skim(dat)
Name | dat |
---|---|
Number of rows | 1470 |
Number of columns | 35 |
Column type frequency: factor | 13 |
Column type frequency: numeric | 22 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Attrition | 0 | 1 | FALSE | 2 | No: 1233, Yes: 237 |
BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 1043, Tra: 277, Non: 150 |
Department | 0 | 1 | FALSE | 3 | Res: 961, Sal: 446, Hum: 63 |
EducationField | 0 | 1 | FALSE | 6 | Lif: 606, Med: 464, Mar: 159, Tec: 132 |
EnvironmentSatisfaction | 0 | 1 | TRUE | 4 | 3: 453, 4: 446, 2: 287, 1: 284 |
Gender | 0 | 1 | FALSE | 2 | Mal: 882, Fem: 588 |
JobInvolvement | 0 | 1 | TRUE | 4 | 3: 868, 2: 375, 4: 144, 1: 83 |
JobLevel | 0 | 1 | TRUE | 5 | 1: 543, 2: 534, 3: 218, 4: 106 |
JobRole | 0 | 1 | FALSE | 9 | Sal: 326, Res: 292, Lab: 259, Man: 145 |
MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 673, Sin: 470, Div: 327 |
Over18 | 0 | 1 | FALSE | 1 | Y: 1470 |
OverTime | 0 | 1 | FALSE | 2 | No: 1054, Yes: 416 |
StockOptionLevel | 0 | 1 | TRUE | 4 | 0: 631, 1: 596, 2: 158, 3: 85 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p100 |
---|---|---|---|---|---|---|
EmployeeNumber | 0 | 1 | 1024.87 | 602.02 | 1 | 2068 |
Age | 0 | 1 | 36.92 | 9.14 | 18 | 60 |
DailyRate | 0 | 1 | 802.49 | 403.51 | 102 | 1499 |
DistanceFromHome | 0 | 1 | 9.19 | 8.11 | 1 | 29 |
Education | 0 | 1 | 2.91 | 1.02 | 1 | 5 |
EmployeeCount | 0 | 1 | 1.00 | 0.00 | 1 | 1 |
HourlyRate | 0 | 1 | 65.89 | 20.33 | 30 | 100 |
JobSatisfaction | 0 | 1 | 2.73 | 1.10 | 1 | 4 |
MonthlyIncome | 0 | 1 | 6502.93 | 4707.96 | 1009 | 19999 |
MonthlyRate | 0 | 1 | 14313.10 | 7117.79 | 2094 | 26999 |
NumCompaniesWorked | 0 | 1 | 2.69 | 2.50 | 0 | 9 |
PercentSalaryHike | 0 | 1 | 15.21 | 3.66 | 11 | 25 |
PerformanceRating | 0 | 1 | 3.15 | 0.36 | 3 | 4 |
RelationshipSatisfaction | 0 | 1 | 2.71 | 1.08 | 1 | 4 |
StandardHours | 0 | 1 | 80.00 | 0.00 | 80 | 80 |
TotalWorkingYears | 0 | 1 | 11.28 | 7.78 | 0 | 40 |
TrainingTimesLastYear | 0 | 1 | 2.80 | 1.29 | 0 | 6 |
WorkLifeBalance | 0 | 1 | 2.76 | 0.71 | 1 | 4 |
YearsAtCompany | 0 | 1 | 7.01 | 6.13 | 0 | 40 |
YearsInCurrentRole | 0 | 1 | 4.23 | 3.62 | 0 | 18 |
YearsSinceLastPromotion | 0 | 1 | 2.19 | 3.22 | 0 | 15 |
YearsWithCurrManager | 0 | 1 | 4.12 | 3.57 | 0 | 17 |
You would normally start a cluster analysis with an exploration of the data to determine which variables are interesting and relevant to your goal. I’ll bypass that rigor and just use a binary correlation analysis. Binary correlation analysis converts features into binary format by binning the continuous features and one-hot encoding the categorical features. correlate() calculates the correlation coefficient between each binary feature and the response variable. A correlation funnel is a tornado plot that lists the highest-correlation features (by absolute magnitude) at the top and the lowest-correlation features at the bottom. Read more on the correlationfunnel GitHub README.
Using binary correlation, I’ll include just the variables with a correlation coefficient of at least 0.10. For our employee attrition data set, OverTime (Y|N) has the largest correlation, followed by JobLevel = 1, MonthlyIncome <= 2,695.80, etc.
dat %>%
  select(-EmployeeNumber) %>%
  correlationfunnel::binarize(n_bins = 5, thresh_infreq = 0.01) %>%
  correlationfunnel::correlate(Attrition__Yes) %>%
  correlationfunnel::plot_correlation_funnel(interactive = FALSE) %>%
  ggplotly() # makes the plot interactive, but drops the labels
Using the cutoff of 0.1 leaves 14 features for the analysis.
vars <- c(
  "EmployeeNumber", "Attrition",
  "OverTime", "JobLevel", "MonthlyIncome", "YearsAtCompany", "StockOptionLevel",
  "YearsWithCurrManager", "TotalWorkingYears", "MaritalStatus", "Age",
  "YearsInCurrentRole", "JobRole", "EnvironmentSatisfaction", "JobInvolvement",
  "BusinessTravel"
)
dat_2 <- dat %>% select(one_of(vars))
Data Preparation
The concept of distance is central to clustering. Two observations are similar if the distance between their features is relatively small. To compare feature distances, the features should be on a similar scale. There are many ways to define distance (see the options in ?dist), but the two most common are Euclidean, \(d = \sqrt{\sum{(x_i - y_i)^2}}\), and binary, 1 minus the proportion of shared features (Wikipedia, PSU-505). If you have a mix of feature types, use the Gower distance (Analytics Vidhya), which range-normalizes the quantitative variables, one-hot encodes the nominal variables, and ranks the ordinal variables. It then calculates distances using the Manhattan distance for quantitative and ordinal variables and the Dice coefficient for nominal variables.
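To make these measures concrete, here is a minimal sketch on toy vectors of my own invention (not from the attrition data), comparing dist()'s Euclidean and binary methods with daisy()'s Gower distance on mixed feature types.
# Euclidean distance between two numeric observations
x <- rbind(c(1, 2, 3), c(4, 6, 3))
dist(x, method = "euclidean") # sqrt((1-4)^2 + (2-6)^2 + 0^2) = 5
# Binary distance: proportion of differing features among those "on" in at least one row
b <- rbind(c(1, 1, 0, 0), c(1, 0, 1, 0))
dist(b, method = "binary") # 2 of the 3 jointly-nonzero positions differ: 0.667
# Gower distance handles mixed numeric/factor types directly
d <- data.frame(income = c(2000, 5000), role = factor(c("Sales", "Manager")))
cluster::daisy(d, metric = "gower")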
Gower’s distance is computationally expensive, so you could instead one-hot encode the data and standardize the variables as \((x - \bar{x}) / \mathrm{sd}(x)\) so that each feature has a mean of 0 and a standard deviation of 1, like this:
# cannot one-hot encode ordered factors, so change them to unordered
x <- dat_2 %>% mutate_if(is.ordered, ~ factor(., ordered = FALSE))
# one-hot encode columns 2:16 (everything except the EmployeeNumber ID)
dat_2_mtrx <- mltools::one_hot(data.table::as.data.table(x[, 2:16])) %>%
  as.matrix()
row.names(dat_2_mtrx) <- dat_2$EmployeeNumber
dat_2_mtrx <- na.omit(dat_2_mtrx)
# standardize: mean 0, standard deviation 1
dat_2_mtrx <- scale(dat_2_mtrx)
dat_2_dist <- dist(dat_2_mtrx)
But normally you would go ahead and calculate Gower’s distance using daisy().
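The daisy() call producing the dat_2_gwr object used below isn't shown in these notes; a minimal sketch, assuming the same feature columns as the one-hot example above (everything except the EmployeeNumber ID):
# Gower distance on the mixed-type features; daisy() handles numeric,
# factor, and ordered-factor columns without manual encoding
dat_2_gwr <- cluster::daisy(dat_2[, 2:16], metric = "gower")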
As a sanity check, let’s see the most similar and dissimilar pairs of employees according to their Gower distance. Here are the most similar employees.
x <- as.matrix(dat_2_gwr)
dat_2[which(x == min(x[x != 0]), arr.ind = TRUE)[1, ], ] %>%
  t() %>% as.data.frame() %>% rownames_to_column() %>%
  flextable::flextable() %>% flextable::autofit()
rowname | V1 | V2 |
---|---|---|
EmployeeNumber | 1624 | 614 |
Attrition | Yes | Yes |
OverTime | Yes | Yes |
JobLevel | 1 | 1 |
MonthlyIncome | 1569 | 1878 |
YearsAtCompany | 0 | 0 |
StockOptionLevel | 0 | 0 |
YearsWithCurrManager | 0 | 0 |
TotalWorkingYears | 0 | 0 |
MaritalStatus | Single | Single |
Age | 18 | 18 |
YearsInCurrentRole | 0 | 0 |
JobRole | Sales Representative | Sales Representative |
EnvironmentSatisfaction | 2 | 2 |
JobInvolvement | 3 | 3 |
BusinessTravel | Travel_Frequently | Travel_Frequently |
They are identical except for MonthlyIncome. Here are the most dissimilar employees.
dat_2[which(x == max(x), arr.ind = TRUE)[1, ], ] %>%
  t() %>% as.data.frame() %>% rownames_to_column() %>%
  flextable::flextable() %>% flextable::autofit()
rowname | V1 | V2 |
---|---|---|
EmployeeNumber | 1094 | 825 |
Attrition | No | Yes |
OverTime | No | Yes |
JobLevel | 1 | 5 |
MonthlyIncome | 4621 | 19246 |
YearsAtCompany | 3 | 31 |
StockOptionLevel | 3 | 0 |
YearsWithCurrManager | 2 | 8 |
TotalWorkingYears | 3 | 40 |
MaritalStatus | Married | Single |
Age | 27 | 58 |
YearsInCurrentRole | 2 | 15 |
JobRole | Laboratory Technician | Research Director |
EnvironmentSatisfaction | 1 | 4 |
JobInvolvement | 1 | 3 |
BusinessTravel | Non-Travel | Travel_Rarely |
These two employees have nothing in common. With the data preparation complete, we can finally perform the cluster analysis. I’ll try K-means and hierarchical cluster analysis (HCA).