Chapter 6 Cluster analysis

Cluster analysis is the collection of techniques designed to find subgroups or clusters in a dataset of variables \(X_1,\ldots,X_k\). Based on the similarities between the observations, these are partitioned into groups that are as homogeneous as possible internally and as well separated from each other as possible. Clustering methods can be classified into two main categories:

  • Partition methods. Given a fixed number of clusters \(k\), these methods aim to assign each observation of \(X_1,\ldots,X_k\) to a unique cluster, in such a way that the within-cluster variation is as small as possible (the clusters are as homogeneous as possible) while the between-cluster variation is as large as possible (the clusters are as separated as possible).
  • Hierarchical methods. These methods construct a hierarchy for the observations in terms of their similarities. This results in a tree-based representation of the data, the dendrogram, which depicts how the observations are clustered at different levels: from the smallest groups, each containing a single observation, to the largest group, which contains the whole dataset.
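
The partition criterion above can be made concrete with a small sketch. It assumes that "variation" is measured by sums of squared deviations, which is a common choice but is not fixed by the text. The toy data and the helper names (`mean`, `within_between`) are hypothetical and serve only as illustration. The sketch compares a sensible partition with a poor one: the total variation is the same in both cases, so a smaller within-cluster part necessarily means a larger between-cluster part.

```python
# Toy one-dimensional data, made up for illustration: two visibly separated groups.
data = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def within_between(data, labels):
    """Return (within, between) sums of squares for the partition given by labels.

    Assumption: variation is measured by squared deviations from means.
    """
    grand_mean = mean(data)
    within = 0.0
    between = 0.0
    for label in set(labels):
        group = [x for x, lab in zip(data, labels) if lab == label]
        center = mean(group)
        # Spread of the observations around their own cluster mean.
        within += sum((x - center) ** 2 for x in group)
        # Spread of the cluster mean around the overall mean, weighted by size.
        between += len(group) * (center - grand_mean) ** 2
    return within, between


# A partition that follows the two visible groups...
good = within_between(data, [0, 0, 0, 1, 1, 1])
# ...and one that mixes them.
bad = within_between(data, [0, 1, 0, 1, 0, 1])

print("good partition (within, between):", good)
print("bad partition  (within, between):", bad)
```

In this sketch, minimizing the within-cluster sum of squares and maximizing the between-cluster sum of squares are two views of the same goal, since both add up to the fixed total sum of squares of the data.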

We will see the basics of the best-known partition method, namely \(k\)-means clustering, and of agglomerative hierarchical clustering.
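
To give a first flavor of the agglomerative idea, here is a minimal sketch that starts with every observation as its own cluster and repeatedly merges the two closest clusters until a single one remains. Measuring the distance between two clusters by their closest pair of members (single linkage) is an assumption made here for simplicity, and other choices are possible. The function name `single_linkage_merges` and the toy data are hypothetical.

```python
def single_linkage_merges(points):
    """Repeatedly merge the two closest clusters and return the merge history.

    Each history entry is (merge distance, members of the newly formed cluster).
    Assumption: single linkage on one-dimensional data with absolute distance.
    """
    # Start: every observation is its own cluster.
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        history.append((d, merged))
        # Replace the two merged clusters by their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return history


# Toy data, made up for illustration.
history = single_linkage_merges([1.0, 1.5, 4.0, 9.0])
for distance, members in history:
    print(f"merge at distance {distance}: {members}")
```

The merge history is the information a dendrogram displays: each merge corresponds to a junction in the tree, drawn at a height equal to the merge distance.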