Chapter 11 Cluster Analysis

These notes are primarily taken from the DataCamp courses Cluster Analysis in R and Unsupervised Learning in R, along with material from AIHR, the UC Business Analytics R Programming Guide, and PSU STAT-505.

Cluster analysis is a data exploration (mining) tool for dividing observations into clusters: distinct sub-populations with no a priori defining characteristics. The goal is to describe those populations using the observed data. Popular uses of clustering include audience segmentation, creating personas, detecting anomalies, and pattern recognition in images.

There are two common approaches to cluster analysis.

  • Agglomerative hierarchical algorithms start by defining each data point as a cluster, then repeatedly combine the two closest clusters into a new cluster until all data points are merged into a single cluster.

  • Non-hierarchical methods such as K-means start by randomly partitioning the data into a set of K clusters, then iteratively move observations between clusters until no sensible reassignment remains (a toy sketch of both approaches follows this list).
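
As a concrete illustration, here is a minimal sketch of both approaches in base R using the built-in mtcars data. The variable selection, k = 3, and complete linkage are arbitrary choices for illustration, not recommendations; the attrition data is clustered properly later in the chapter.

# Toy data: three scaled numeric features from mtcars.
toy <- scale(mtcars[, c("mpg", "hp", "wt")])

# Agglomerative hierarchical: start with n singleton clusters, merge upward.
hc <- hclust(dist(toy), method = "complete")
cutree(hc, k = 3)   # cut the tree to assign each car to one of 3 clusters

# K-means: start from random centers, reassign until convergence.
km <- kmeans(toy, centers = 3, nstart = 25)
km$cluster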

Setup

I will learn by example, using the IBM HR Analytics Employee Attrition & Performance data set from Kaggle to discover which factors are associated with employee turnover and whether distinct clusters of employees are more susceptible to turnover. The clusters can help personalize the employee experience (AIHR). This data set includes 1,470 employee records consisting of the EmployeeNumber, a flag for Attrition during some time frame, and 33 other descriptive variables.

library(tidyverse)
library(plotly)            # interactive graphing
library(cluster)           # daisy and pam
library(Rtsne)             # dimensionality reduction and visualization
library(dendextend)        # color_branches

set.seed(1234)  # reproducibility

dat <- read_csv("./input/WA_Fn-UseC_-HR-Employee-Attrition.csv")
dat <- dat %>%
  mutate_if(is.character, as_factor) %>%
  mutate(
    EnvironmentSatisfaction = factor(EnvironmentSatisfaction, ordered = TRUE),
    StockOptionLevel = factor(StockOptionLevel, ordered = TRUE),
    JobLevel = factor(JobLevel, ordered = TRUE),
    JobInvolvement = factor(JobInvolvement, ordered = TRUE)
  ) %>%
  select(EmployeeNumber, Attrition, everything())
my_skim <- skimr::skim_with(numeric = skimr::sfl(p25 = NULL, p50 = NULL, p75 = NULL, hist = NULL))
my_skim(dat)
Table 11.1: Data summary

Name                     dat
Number of rows           1470
Number of columns        35
Column type frequency:
  factor                 13
  numeric                22
Group variables          None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Attrition 0 1 FALSE 2 No: 1233, Yes: 237
BusinessTravel 0 1 FALSE 3 Tra: 1043, Tra: 277, Non: 150
Department 0 1 FALSE 3 Res: 961, Sal: 446, Hum: 63
EducationField 0 1 FALSE 6 Lif: 606, Med: 464, Mar: 159, Tec: 132
EnvironmentSatisfaction 0 1 TRUE 4 3: 453, 4: 446, 2: 287, 1: 284
Gender 0 1 FALSE 2 Mal: 882, Fem: 588
JobInvolvement 0 1 TRUE 4 3: 868, 2: 375, 4: 144, 1: 83
JobLevel 0 1 TRUE 5 1: 543, 2: 534, 3: 218, 4: 106
JobRole 0 1 FALSE 9 Sal: 326, Res: 292, Lab: 259, Man: 145
MaritalStatus 0 1 FALSE 3 Mar: 673, Sin: 470, Div: 327
Over18 0 1 FALSE 1 Y: 1470
OverTime 0 1 FALSE 2 No: 1054, Yes: 416
StockOptionLevel 0 1 TRUE 4 0: 631, 1: 596, 2: 158, 3: 85

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p100
EmployeeNumber 0 1 1024.87 602.02 1 2068
Age 0 1 36.92 9.14 18 60
DailyRate 0 1 802.49 403.51 102 1499
DistanceFromHome 0 1 9.19 8.11 1 29
Education 0 1 2.91 1.02 1 5
EmployeeCount 0 1 1.00 0.00 1 1
HourlyRate 0 1 65.89 20.33 30 100
JobSatisfaction 0 1 2.73 1.10 1 4
MonthlyIncome 0 1 6502.93 4707.96 1009 19999
MonthlyRate 0 1 14313.10 7117.79 2094 26999
NumCompaniesWorked 0 1 2.69 2.50 0 9
PercentSalaryHike 0 1 15.21 3.66 11 25
PerformanceRating 0 1 3.15 0.36 3 4
RelationshipSatisfaction 0 1 2.71 1.08 1 4
StandardHours 0 1 80.00 0.00 80 80
TotalWorkingYears 0 1 11.28 7.78 0 40
TrainingTimesLastYear 0 1 2.80 1.29 0 6
WorkLifeBalance 0 1 2.76 0.71 1 4
YearsAtCompany 0 1 7.01 6.13 0 40
YearsInCurrentRole 0 1 4.23 3.62 0 18
YearsSinceLastPromotion 0 1 2.19 3.22 0 15
YearsWithCurrManager 0 1 4.12 3.57 0 17

You would normally start a cluster analysis with an exploration of the data to determine which variables are interesting and relevant to your goal. I’ll bypass that rigor and just use a binary correlation analysis. Binary correlation analysis converts features into binary format by binning the continuous features and one-hot encoding the categorical features. correlate() calculates the correlation coefficient of each binary feature with the response variable. A correlation funnel is a tornado plot that lists the features with the highest correlation (based on absolute magnitude) at the top and the lowest correlation features at the bottom. Read more in the correlationfunnel GitHub README.

Using binary correlation, I’ll include just the variables with a correlation coefficient of at least 0.10. For the employee attrition data set, OverTime (Yes/No) has the largest correlation, followed by JobLevel = 1, MonthlyIncome <= 2,695.80, and so on.

dat %>%
  select(-EmployeeNumber) %>%
  correlationfunnel::binarize(n_bins = 5, thresh_infreq = 0.01) %>%
  correlationfunnel::correlate(Attrition__Yes) %>%
  correlationfunnel::plot_correlation_funnel(interactive = FALSE) %>%
  ggplotly()  # Makes prettier, but drops the labels

Using the cutoff of 0.1 leaves 14 features for the analysis.

vars <- c(
  "EmployeeNumber", "Attrition", 
  "OverTime", "JobLevel", "MonthlyIncome", "YearsAtCompany", "StockOptionLevel",
  "YearsWithCurrManager", "TotalWorkingYears", "MaritalStatus", "Age", 
  "YearsInCurrentRole", "JobRole", "EnvironmentSatisfaction", "JobInvolvement", 
  "BusinessTravel"
)

dat_2 <- dat %>% select(all_of(vars))

Data Preparation

The concept of distance is central to clustering. Two observations are similar if the distance between their features is relatively small. To compare feature distances, they should be on a similar scale. There are many ways to define distance (see the options in ?dist), but the two most common are Euclidean, \(d = \sqrt{\sum{(x_i - y_i)^2}}\), and binary, one minus the proportion of shared features (Wikipedia, PSU-505). If you have a mix of feature types, use the Gower distance (Analytics Vidhya), which range-normalizes the quantitative variables, one-hot encodes the nominal variables, and ranks the ordinal variables. It then calculates distances using the Manhattan distance for quantitative and ordinal variables and the Dice coefficient for nominal variables.
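
To make the two common metrics concrete, here is a toy example run through dist(); the vectors are made up for illustration and are not the attrition data.

# Euclidean distance between two numeric rows:
num_x <- rbind(a = c(1, 2, 3), b = c(4, 6, 3))
dist(num_x, method = "euclidean")  # sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = 5

# Binary distance: mismatched bits / bits where at least one row is on.
bin_x <- rbind(a = c(1, 0, 1, 1), b = c(1, 1, 0, 1))
dist(bin_x, method = "binary")     # 2 mismatches / 4 active bits = 0.5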

Gower’s distance is computationally expensive, so you could one-hot encode the data and standardize the variables as \((x - \bar{x}) / sd(x)\) so that each feature has a mean of 0 and standard deviation of 1, like this:

# cannot one-hot encode ordered factors, so change to unordered
x <- dat_2 %>% mutate_if(is.ordered, ~factor(., ordered = FALSE))
dat_2_mtrx <- mltools::one_hot(data.table::as.data.table(x[, 2:16])) %>% 
  as.matrix() 
row.names(dat_2_mtrx) <- dat_2$EmployeeNumber
dat_2_mtrx <- na.omit(dat_2_mtrx)
dat_2_mtrx <- scale(dat_2_mtrx)
dat_2_dist <- dist(dat_2_mtrx)

But normally you would go ahead and calculate Gower’s distance using daisy().

dat_2_gwr <- cluster::daisy(dat_2[, 2:16], metric = "gower")
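
Before clustering, it is worth glancing at the distribution of the pairwise dissimilarities; the cluster package provides a summary() method for the "dissimilarity" objects that daisy() returns.

summary(dat_2_gwr)  # min/median/max of the pairwise Gower dissimilarities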

As a sanity check, let’s see the most similar and dissimilar pairs of employees according to their Gower distance. Here are the most similar employees.

x <- as.matrix(dat_2_gwr)
dat_2[which(x == min(x[x != 0]), arr.ind = TRUE)[1, ], ] %>% 
  t() %>% as.data.frame() %>% rownames_to_column() %>% 
  flextable::flextable() %>% flextable::autofit()

rowname                   V1                     V2
EmployeeNumber            1624                   614
Attrition                 Yes                    Yes
OverTime                  Yes                    Yes
JobLevel                  1                      1
MonthlyIncome             1569                   1878
YearsAtCompany            0                      0
StockOptionLevel          0                      0
YearsWithCurrManager      0                      0
TotalWorkingYears         0                      0
MaritalStatus             Single                 Single
Age                       18                     18
YearsInCurrentRole        0                      0
JobRole                   Sales Representative   Sales Representative
EnvironmentSatisfaction   2                      2
JobInvolvement            3                      3
BusinessTravel            Travel_Frequently      Travel_Frequently
They are identical except for MonthlyIncome. Here are the most dissimilar employees.

dat_2[which(x == max(x), arr.ind = TRUE)[1, ], ] %>% 
  t() %>% as.data.frame() %>% rownames_to_column() %>% 
  flextable::flextable() %>% flextable::autofit()

rowname                   V1                      V2
EmployeeNumber            1094                    825
Attrition                 No                      Yes
OverTime                  No                      Yes
JobLevel                  1                       5
MonthlyIncome             4621                    19246
YearsAtCompany            3                       31
StockOptionLevel          3                       0
YearsWithCurrManager      2                       8
TotalWorkingYears         3                       40
MaritalStatus             Married                 Single
Age                       27                      58
YearsInCurrentRole        2                       15
JobRole                   Laboratory Technician   Research Director
EnvironmentSatisfaction   1                       4
JobInvolvement            1                       3
BusinessTravel            Non-Travel              Travel_Rarely

These two employees have nothing in common. With the data preparation complete, we can finally perform the cluster analysis. I’ll try K-means and hierarchical cluster analysis (HCA).