7.4 Non-hierarchical clustering

We now perform a non-hierarchical cluster analysis and request three clusters, the number suggested by the hierarchical cluster analysis.
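A minimal sketch of this step, assuming the non-hierarchical method is k-means (the variable name `km.group` used later suggests so). The data frame `cluster.data` and every column name except `variety_of_choice` are hypothetical, and the ratings are simulated stand-ins rather than the real equipment data:

```r
set.seed(1)  # k-means starts from random centres, so fix the seed for reproducibility

# Simulated stand-in for six importance ratings from 40 respondents (not the real data)
cluster.data <- data.frame(
  variety_of_choice  = c(rnorm(14, 7),   rnorm(18, 9),   rnorm(8, 5)),
  electronics        = c(rnorm(14, 3),   rnorm(18, 6),   rnorm(8, 4)),
  furniture          = c(rnorm(14, 1.5), rnorm(18, 6),   rnorm(8, 2)),
  quality_of_service = c(rnorm(14, 3.5), rnorm(18, 2.5), rnorm(8, 8.5)),
  low_prices         = c(rnorm(14, 8),   rnorm(18, 3.5), rnorm(8, 2.5)),
  return_policy      = c(rnorm(14, 6),   rnorm(18, 3),   rnorm(8, 4.5))
)

# Ask for three clusters; nstart = 25 tries 25 random starts and keeps the best solution
kmeans.clustering <- kmeans(cluster.data, centers = 3, nstart = 25)

kmeans.clustering$size     # number of observations in each cluster
kmeans.clustering$centers  # cluster means on each variable
```

Because the starting centres are random, the numbering of the clusters is arbitrary and can change between runs unless the seed is fixed.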

Next, add a variable to the equipment dataset that indicates which cluster each observation belongs to.
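One way to do this, sketched in base R on simulated stand-in data. The object name `kmeans.clustering` and the column `quality_of_service` are assumptions; `km.group` and the labels `cl1` to `cl3` are taken from the output shown in this section:

```r
set.seed(1)

# Simulated stand-in for the equipment data: two ratings for 40 respondents
equipment <- data.frame(
  variety_of_choice  = c(rnorm(14, 7),   rnorm(18, 9),   rnorm(8, 5)),
  quality_of_service = c(rnorm(14, 3.5), rnorm(18, 2.5), rnorm(8, 8.5))
)

kmeans.clustering <- kmeans(equipment, centers = 3, nstart = 25)

# kmeans() returns an integer vector of cluster memberships;
# store it as a labelled factor next to the original variables
equipment$km.group <- factor(kmeans.clustering$cluster,
                             levels = 1:3,
                             labels = c("cl1", "cl2", "cl3"))

table(equipment$km.group)  # cluster sizes
```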

Inspect the clusters:

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 8
##   km.group count variety electronics furniture service prices return
##   <fct>    <int>   <dbl>       <dbl>     <dbl>   <dbl>  <dbl>  <dbl>
## 1 cl1         14    6.93        2.79      1.43    3.5    8.29   6.29
## 2 cl2         18    9.11        6.06      5.78    2.39   3.67   3.17
## 3 cl3          8    5           4.38      1.75    8.5    2.5    4.38
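A table of this shape can be produced with `dplyr`. The sketch below is an assumption about how the summary was built, not the original code; it uses simulated ratings, a hypothetical column `quality_of_service`, and only two of the six variables to stay short:

```r
library(dplyr)

set.seed(1)

# Simulated stand-in: cluster labels with the cluster sizes from the table above
equipment <- data.frame(
  km.group           = factor(rep(c("cl1", "cl2", "cl3"), times = c(14, 18, 8))),
  variety_of_choice  = c(rnorm(14, 7),   rnorm(18, 9),   rnorm(8, 5)),
  quality_of_service = c(rnorm(14, 3.5), rnorm(18, 2.5), rnorm(8, 8.5))
)

# One row per cluster: its size and the mean of each rating
cluster.profile <- equipment %>%
  group_by(km.group) %>%
  summarise(count   = n(),
            variety = mean(variety_of_choice),
            service = mean(quality_of_service),
            .groups = "drop")  # also silences the "ungrouping output" message

cluster.profile
```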

We see that:

  • cluster 1 attaches more importance (than other clusters) to low prices (mean 8.29)

  • cluster 2 attaches more importance to variety of choice (mean 9.11)

  • cluster 3 attaches more importance to quality of service (mean 8.5)

We can also test whether there are significant differences between clusters in, for example, variety of choice. For this we use a one-way ANOVA:

## # A tibble: 3 x 6
##   term            ss   df1   df2      f pvalue
##   <chr>        <dbl> <dbl> <int>  <dbl>  <dbl>
## 1 (Intercept) 1757.      1    37 1335.       0
## 2 km.group     101.      2    37   38.5      0
## 3 Residuals     48.7    37    37   NA       NA
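The table above appears to come from a function that reports type III sums of squares. As a hedged alternative, base R's `aov()` gives the same F test for the cluster factor in a one-way design. The data below are simulated stand-ins, so the numbers will not match the table:

```r
set.seed(1)

# Simulated stand-in: cluster labels (sizes 14, 18, 8) and one rating
equipment <- data.frame(
  km.group          = factor(rep(c("cl1", "cl2", "cl3"), times = c(14, 18, 8))),
  variety_of_choice = c(rnorm(14, 7), rnorm(18, 9), rnorm(8, 5))
)

# One-way ANOVA: does mean importance of variety differ between clusters?
fit <- aov(variety_of_choice ~ km.group, data = equipment)
summary(fit)  # F test with 2 and 37 degrees of freedom
```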

There are significant differences between clusters in the importance attached to variety of choice, F(2, 37) = 38.5, p < .001 (the p-value is printed as 0 because of rounding). This makes sense because the purpose of cluster analysis is to maximize between-cluster differences. Let’s follow this up with Tukey’s HSD to see exactly which means differ from each other:

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = variety_of_choice ~ km.group, data = equipment)
## 
## $km.group
##              diff       lwr        upr     p adj
## cl2-cl1  2.182540  1.184332  3.1807470 0.0000145
## cl3-cl1 -1.928571 -3.170076 -0.6870668 0.0015154
## cl3-cl2 -4.111111 -5.301397 -2.9208248 0.0000000
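The `Fit:` line in the output shows that the comparison was run on `aov(variety_of_choice ~ km.group, data = equipment)`. A minimal sketch of that call with `TukeyHSD()`, again on simulated stand-in data, so the estimates will differ from those above:

```r
set.seed(1)

# Simulated stand-in: cluster labels (sizes 14, 18, 8) and one rating
equipment <- data.frame(
  km.group          = factor(rep(c("cl1", "cl2", "cl3"), times = c(14, 18, 8))),
  variety_of_choice = c(rnorm(14, 7), rnorm(18, 9), rnorm(8, 5))
)

# Pairwise comparisons of cluster means with family-wise 95% confidence intervals
tukey <- TukeyHSD(aov(variety_of_choice ~ km.group, data = equipment))
tukey

# The matrix of differences, interval bounds and adjusted p-values
tukey$km.group
```

A pair of means differs significantly at the 5% level when its interval (`lwr` to `upr`) excludes zero, which is equivalent to `p adj` being below 0.05.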

We see that the difference is significant for every pair of cluster means: all adjusted p-values are below 0.01 and none of the confidence intervals includes zero.