# Chapter 6 Cluster analysis

Cluster analysis is the collection of techniques designed to find subgroups, or *clusters*, in a dataset of variables \(X_1,\ldots,X_k\). The observations are partitioned, according to their similarities, into homogeneous groups that are as well separated from one another as possible. Clustering methods can be classified into two main categories:

**Partition methods**. Given a fixed number of clusters \(k\), these methods aim to assign each observation of \(X_1,\ldots,X_k\) to a unique cluster, in such a way that the *within-cluster variation* is as small as possible (the clusters are as homogeneous as possible) while the *between-cluster variation* is as large as possible (the clusters are as separated as possible).

**Hierarchical methods**. These methods construct a hierarchy for the observations in terms of their similarities. This results in a tree-based representation of the data, the *dendrogram*, which depicts how the observations are clustered at different levels, from the smallest groups of one element to the largest group representing the whole dataset.
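To make the within-cluster and between-cluster variations concrete, the following is a minimal sketch in plain Python. It assumes that variation is measured by sums of squared Euclidean distances to centroids (one common choice, not necessarily the only one). The toy data, the labels, and the helper names `centroid`, `sq_dist`, and `variation_split` are hypothetical and serve only as an illustration.

```python
# Toy data (hypothetical): six observations on two variables.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
good_labels = [0, 0, 0, 1, 1, 1]  # partition that matches the two visible groups
bad_labels = [0, 1, 0, 1, 0, 1]   # partition that mixes the two groups


def centroid(rows):
    """Coordinate-wise mean of a list of points."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def variation_split(points, labels):
    """Return (within, between, total) sums of squares for a partition."""
    overall = centroid(points)
    within = 0.0
    between = 0.0
    for c in sorted(set(labels)):
        members = [p for p, lab in zip(points, labels) if lab == c]
        center = centroid(members)
        # Within: spread of the members around their own cluster centroid.
        within += sum(sq_dist(p, center) for p in members)
        # Between: spread of the cluster centroids around the overall centroid,
        # weighted by cluster size.
        between += len(members) * sq_dist(center, overall)
    total = sum(sq_dist(p, overall) for p in points)
    return within, between, total


within, between, total = variation_split(points, good_labels)
within_bad, between_bad, total_bad = variation_split(points, bad_labels)
print("good partition:", within, between, total)
print("bad partition: ", within_bad, between_bad, total_bad)
```

Under this measure the total variation does not depend on the partition and equals the sum of the within and between parts, so making the within-cluster variation small and making the between-cluster variation large are two views of the same goal.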

We will see the basics of the best-known partition method, namely *\(k\)-means clustering*, and of *agglomerative hierarchical clustering*.
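As a preview of how an agglomerative method builds a hierarchy from single-element groups up to the whole dataset, here is a minimal sketch in plain Python. It assumes Euclidean distances and *single linkage* (the distance between two clusters is the smallest distance between their members); both are illustrative choices rather than the only options, and the function names and toy data are hypothetical.

```python
import math

# Toy data (hypothetical): five observations on a single variable.
points = [(0.0,), (1.0,), (5.0,), (7.0,), (20.0,)]


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(points):
    """Agglomerate clusters one merge at a time.

    Returns a list of merges (cluster_a, cluster_b, distance), where each
    cluster is a tuple of observation indices.
    """
    clusters = [(i,) for i in range(len(points))]  # start from singletons
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        # Replace the two merged clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = single_linkage(points)
for a, b, d in merges:
    print(f"merge {a} and {b} at distance {d}")
```

Each merge corresponds to one junction of the dendrogram, drawn at the height given by the merge distance. With \(n\) observations there are \(n-1\) merges, and the last one joins everything into a single cluster.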