7.4 Non-hierarchical clustering

We now perform a non-hierarchical cluster analysis in which we ask for three clusters (as determined by the hierarchical cluster analysis):

# there is an element of randomness in cluster analysis
# this means that you will not always get the same output every time you do a cluster analysis
# if you do want to always get the same output, you need to fix R's random number generator with the set.seed command
set.seed(1)

# nstart = 25 makes kmeans try 25 random starting configurations and keep the best solution; a fuller explanation is outside the scope of this tutorial
kmeans.clustering <- kmeans(cluster.data, 3, nstart = 25)
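
As a rough illustration of why nstart matters, the sketch below runs k-means 25 times from single random starts and compares the total within-cluster sum of squares across runs. It uses simulated toy data, not the equipment data, and the object names toy, wss, best and multi are made up for this example. Setting nstart = 25 lets kmeans do something similar internally and keep the best run.

```r
# Simulated toy data: 100 observations on 2 variables (an assumption for illustration only)
set.seed(1)
toy <- matrix(rnorm(200), ncol = 2)

# Run k-means 25 times, each time from a single random start,
# and store the total within-cluster sum of squares of each run
wss <- replicate(25, kmeans(toy, centers = 3, nstart = 1)$tot.withinss)

range(wss)        # different starts can end in solutions of different quality
best <- min(wss)  # the best (lowest) value among the 25 runs

# Let kmeans try 25 starts itself and keep the best one
multi <- kmeans(toy, centers = 3, nstart = 25)
multi$tot.withinss
```

A lower total within-cluster sum of squares means more compact clusters, which is why the best of several starts is preferred over a single start.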

Add to the equipment dataset a variable that indicates to which cluster an observation belongs:

equipment <- equipment %>% 
  mutate(km.group = factor(kmeans.clustering$cluster, labels=c("cl1","cl2","cl3"))) # Convert the cluster indicator stored in the kmeans.clustering object to a factor and add it to the equipment data frame.
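
The cluster numbers returned by kmeans are arbitrary labels: a different seed can produce the same groups numbered in a different order, so cl1, cl2 and cl3 only acquire meaning once the cluster means are inspected. A minimal sketch of what a kmeans object contains, again on simulated toy data (the names toy, fit and groups are assumptions for illustration):

```r
set.seed(1)
toy <- matrix(rnorm(200), ncol = 2)   # simulated data, not the equipment data
fit <- kmeans(toy, centers = 3, nstart = 25)

class(fit)          # a "kmeans" object (a list), not a data frame
head(fit$cluster)   # one cluster number per observation
fit$size            # number of observations in each cluster
fit$centers         # mean of each variable within each cluster

# The factor labels are attached in the order of the cluster numbers 1, 2, 3
groups <- factor(fit$cluster, labels = c("cl1", "cl2", "cl3"))
table(groups)
```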

Inspect the clusters:

equipment %>% 
  group_by(km.group) %>% # group by cluster (km.group)
  summarise(count = n(), 
            variety = mean(variety_of_choice), 
            electronics = mean(electronics), 
            furniture = mean(furniture), 
            service = mean(quality_of_service), 
            prices = mean(low_prices), 
            return = mean(return_policy)) # Then ask for the number of respondents and for the means of the ratings.

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 8
##   km.group count variety electronics furniture service prices return
##   <fct>    <int>   <dbl>       <dbl>     <dbl>   <dbl>  <dbl>  <dbl>
## 1 cl1         14    6.93        2.79      1.43    3.5    8.29   6.29
## 2 cl2         18    9.11        6.06      5.78    2.39   3.67   3.17
## 3 cl3          8    5           4.38      1.75    8.5    2.5    4.38

We see that:

cluster 1 attaches more importance (than other clusters) to low prices and to the return policy
cluster 2 attaches more importance to variety of choice (and also to electronics and furniture)
cluster 3 attaches more importance to quality of service
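
To read the table of means systematically, the reported means can be typed into a matrix and, for each attribute, the cluster with the highest mean can be looked up. The matrix below copies the values from the output above; the object names means and top are chosen for this sketch only.

```r
# Cluster means copied from the summarise() output above
means <- matrix(
  c(6.93, 2.79, 1.43, 3.50, 8.29, 6.29,
    9.11, 6.06, 5.78, 2.39, 3.67, 3.17,
    5.00, 4.38, 1.75, 8.50, 2.50, 4.38),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("cl1", "cl2", "cl3"),
                  c("variety", "electronics", "furniture",
                    "service", "prices", "return"))
)

# For each attribute (column), find the cluster (row) with the highest mean
top <- rownames(means)[apply(means, 2, which.max)]
names(top) <- colnames(means)
top
# variety, electronics, furniture -> "cl2"; service -> "cl3"; prices, return -> "cl1"
```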

We can also test whether there are significant differences between clusters in, for example, variety of choice. For this we use a one-way ANOVA:

# remotes::install_github("samuelfranssens/type3anova") # to install the type3anova package. 
# You need the remotes package for this and the car package needs to be installed for the type3anova package to work
library(type3anova)
type3anova(lm(variety_of_choice ~ km.group, data=equipment))

## # A tibble: 3 x 6
##   term            ss   df1   df2      f pvalue
##   <chr>        <dbl> <dbl> <int>  <dbl>  <dbl>
## 1 (Intercept) 1757.      1    37 1335.       0
## 2 km.group     101.      2    37   38.5      0
## 3 Residuals     48.7    37    37   NA       NA
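
The F statistic in the km.group row can be reproduced by hand from the sums of squares and degrees of freedom in the table, which also shows that the p-value is very small rather than exactly 0. The sketch below uses the rounded values printed above, so the result only approximately matches the reported F of 38.5; the variable names are chosen for this example.

```r
# Values copied from the type3anova output above (rounded as printed)
ss_group    <- 101    # sum of squares for km.group
ss_residual <- 48.7   # residual sum of squares
df_group    <- 2      # number of clusters minus 1
df_residual <- 37     # number of observations (40) minus number of clusters (3)

# F = mean square between clusters / mean square within clusters
f_value <- (ss_group / df_group) / (ss_residual / df_residual)
f_value

# Upper-tail probability of the F distribution: the p-value of the test
p_value <- pf(f_value, df_group, df_residual, lower.tail = FALSE)
p_value
```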

There are significant differences between clusters in the importance attached to variety of choice (F(2, 37) = 38.5, p < .001; the p-value is displayed as 0 because of rounding). This makes sense because the purpose of cluster analysis is to maximize between-cluster differences. Let’s follow this up with Tukey’s HSD to see exactly which means differ from each other:

TukeyHSD(aov(variety_of_choice ~ km.group, data=equipment), 
         "km.group") # The first argument is an "aov" object, the second is our independent variable.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = variety_of_choice ~ km.group, data = equipment)
## 
## $km.group
##              diff       lwr        upr     p adj
## cl2-cl1  2.182540  1.184332  3.1807470 0.0000145
## cl3-cl1 -1.928571 -3.170076 -0.6870668 0.0015154
## cl3-cl2 -4.111111 -5.301397 -2.9208248 0.0000000

We see that the difference is significant for every pair of cluster means (all adjusted p-values are below 0.01).
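
A significant pairwise difference at the 5% level corresponds to a 95% family-wise confidence interval that excludes zero. As a quick check, the sketch below copies the Tukey output above into a data frame (the object and column names are chosen for this example) and flags each pair both ways.

```r
# Values copied from the TukeyHSD output above
tukey <- data.frame(
  pair  = c("cl2-cl1", "cl3-cl1", "cl3-cl2"),
  diff  = c(2.182540, -1.928571, -4.111111),
  lwr   = c(1.184332, -3.170076, -5.301397),
  upr   = c(3.1807470, -0.6870668, -2.9208248),
  p_adj = c(0.0000145, 0.0015154, 0.0000000)
)

# An interval excludes zero when both bounds lie on the same side of zero
tukey$excludes_zero <- tukey$lwr > 0 | tukey$upr < 0

# The same conclusion expressed through the adjusted p-values
tukey$significant <- tukey$p_adj < 0.05

tukey[, c("pair", "diff", "excludes_zero", "significant")]
```

The sign of diff gives the direction: a positive cl2-cl1 difference means that cluster 2 rates variety of choice higher than cluster 1.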