# 3 Hierarchical Clustering

## 3.1 Introduction

**Hierarchical clustering** is an alternative approach to k-means clustering, which does not require pre-specifying the number of clusters.

The idea of hierarchical clustering is to start by treating every observation as its own cluster. Then, at each step, we merge the two clusters that are most similar, until all observations belong to a single cluster. This process can be represented in a tree-shaped diagram called a *dendrogram*.

The height of the branches indicates how different the clusters are. The
distance between groups is usually referred to as **linkage**. There are
four common types of linkage:

- **Complete linkage:** computes all pairwise dissimilarities between the observations in cluster A and cluster B, and takes the maximum of these dissimilarities as the distance between the two clusters.

- **Single linkage:** similar to complete linkage, but takes the smallest (minimum) dissimilarity as the distance between the two clusters.

- **Average linkage:** computes all pairwise dissimilarities between the observations in cluster A and cluster B, and takes the average of these dissimilarities as the distance between the two clusters.

- **Centroid linkage:** computes the dissimilarity between the centroid of cluster A (the mean vector of its p variables) and the centroid of cluster B.

Complete and average linkage are the most commonly used methods. As the dissimilarity measure, we will use the Euclidean distance, but other options exist.
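As a quick check of these definitions, the four linkage distances can be computed by hand on a toy example (a sketch, not part of the lab data):

```
#Toy example: two clusters of two points each
A <- rbind(c(0, 0), c(0, 1))               #cluster A
B <- rbind(c(4, 0), c(5, 1))               #cluster B
#all pairwise Euclidean distances between the points of A and the points of B
cross <- as.matrix(dist(rbind(A, B)))[1:2, 3:4]
max(cross)                                 #complete linkage: sqrt(26) = 5.099
min(cross)                                 #single linkage: 4
mean(cross)                                #average linkage: about 4.556
sqrt(sum((colMeans(A) - colMeans(B))^2))   #centroid linkage: 4.5
```

Note that the four linkages can give quite different distances for the same pair of clusters, which is why the dendrogram (and the resulting clusters) can change with the linkage method.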

## 3.2 Readings

Read the following chapters of *An Introduction to Statistical Learning*:

- 12.4.2 Hierarchical Clustering

- 12.4.3 Practical Issues in Clustering

## 3.3 Practice session

### Task 1 - Identify clusters

Using the bdiag.csv data, let's take two of the variables that characterise the cell nuclei, *radius_mean* and *texture_mean*, and build a dendrogram.

We will use the function `hclust()` to build the dendrogram, and the function `dist()` to compute the distances between the observations:

```
#read the dataset
bdiag.data <- read.csv("https://www.dropbox.com/s/vp44yozebx5xgok/bdiag.csv?dl=1",
                       stringsAsFactors = TRUE)
#select a subset of the variables
bdiag.2vars <- bdiag.data[,c("radius_mean", "texture_mean")]
#distances between the observations
bdiag.dist <- dist(bdiag.2vars, method = "euclidean")
#### what is dist() doing?##################################
bdiag.dist[1] #is the distance between obs1 and obs2
```

`## [1] 7.82742`

```
bdiag.2vars[1:2, ]  #the first two observations
```

```
## radius_mean texture_mean
## 1 17.99 10.38
## 2 20.57 17.77
```

```
sqrt((bdiag.2vars[1, 1] - bdiag.2vars[2, 1])^2 +
     (bdiag.2vars[1, 2] - bdiag.2vars[2, 2])^2) #Euclidean distance
```

`## [1] 7.82742`

```
#############################################################
#Dendrogram using the complete linkage method
bdiag.ddgram <- hclust(bdiag.dist, method="complete")
#Plot the dendrogram
#the option hang = -1 will make the
#labels appear below 0
plot(bdiag.ddgram, cex=.4, hang = -1)
```

If we cut the tree at a height of 20, we get 3 clusters.

We can draw a rectangle around the 3 clusters in the dendrogram, and obtain the cluster label for each observation:

```
## group3
## 1 2
## 498 71
```
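The cut/rectangle/membership steps above follow a standard `cutree()` and `rect.hclust()` pattern. A self-contained sketch on simulated data (the bdiag objects built earlier would take the place of `x` and `hc`):

```
#Simulate three well-separated groups of 2-D observations
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))
hc <- hclust(dist(x), method = "complete")
#cut the tree into 3 clusters (cutree(hc, h = 20) would instead cut at a height)
groups <- cutree(hc, k = 3)
plot(hc, cex = .4, hang = -1)
rect.hclust(hc, k = 3)          #draw rectangles around the 3 clusters
table(groups)                   #cluster membership counts
```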

**TRY IT YOURSELF:**

- Get 2 clusters with hierarchical clustering using the ten *mean* variables
(*radius_mean*, *texture_mean*, *perimeter_mean*, *area_mean*, *smoothness_mean*,
*compactness_mean*, *concavity_mean*, *concave.points_mean*, *symmetry_mean* and
*fractal_dimension_mean*) and compare the clustering result with the observed *diagnosis*.

## See the solution code

```
#select a subset of the variables
bdiag.10vars <- bdiag.data[,c("radius_mean", "texture_mean",
                              "perimeter_mean", "area_mean",
                              "smoothness_mean", "compactness_mean",
                              "concavity_mean", "concave.points_mean",
                              "symmetry_mean", "fractal_dimension_mean")]
#distances between the observations
bdiag.dist10 <- dist(bdiag.10vars, method = "euclidean")
#Dendrogram using the complete linkage method
bdiag.ddgram10 <- hclust(bdiag.dist10, method="complete")
plot(bdiag.ddgram10, cex=.4, hang = -1)
#get 2 clusters and compare them with the observed diagnosis
table(cutree(bdiag.ddgram10, k = 2), bdiag.data$diagnosis)
```

- How does the clustering change with different linkage methods?

## See the solution code

```
#select a subset of the variables
bdiag.10vars <- bdiag.data[,c("radius_mean", "texture_mean",
                              "perimeter_mean", "area_mean",
                              "smoothness_mean", "compactness_mean",
                              "concavity_mean", "concave.points_mean",
                              "symmetry_mean", "fractal_dimension_mean")]
#distances between the observations
bdiag.dist10 <- dist(bdiag.10vars, method = "euclidean")
#Dendrogram using the complete linkage method
bdiag.ddgram10.comp <- hclust(bdiag.dist10, method="complete")
bdiag.ddgram10.sing <- hclust(bdiag.dist10, method="single")
bdiag.ddgram10.aver <- hclust(bdiag.dist10, method="average")
bdiag.ddgram10.cent <- hclust(bdiag.dist10, method="centroid")
bdiag.2vars$cluster.comp <- cutree(bdiag.ddgram10.comp, k = 2)
bdiag.2vars$cluster.sing <- cutree(bdiag.ddgram10.sing, k = 2)
bdiag.2vars$cluster.aver <- cutree(bdiag.ddgram10.aver, k = 2)
bdiag.2vars$cluster.cent <- cutree(bdiag.ddgram10.cent, k = 2)
table(bdiag.2vars$cluster.comp, bdiag.2vars$cluster.sing)
table(bdiag.2vars$cluster.comp, bdiag.2vars$cluster.aver)
table(bdiag.2vars$cluster.comp, bdiag.2vars$cluster.cent)
```

## 3.4 Exercises

Solve the following exercises:

- The dataset *fat* is available in the *faraway* package (*library(faraway)*).
The dataset contains several physical measurements. Using the variables *age*,
*weight*, *height*, *adipos*, *free*, *neck*, *chest*, *abdom*, *hip*, *thigh*,
*knee*, *ankle*, *biceps*, *forearm* and *wrist*, plot the 3 clusters produced
by hierarchical clustering based on the first two principal components of the data.

- Compare the result above with the clusters obtained using all the variables.
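These exercises follow a `prcomp()` plus `hclust()` pattern. A hedged, self-contained sketch on simulated data (the *fat* variables would take the place of `x`, and `k = 3` gives the 3 clusters asked for):

```
#Simulated stand-in for the fat measurements: 50 observations, 6 variables
set.seed(7)
x <- matrix(rnorm(50 * 6), ncol = 6)
pc <- prcomp(x, scale. = TRUE)                   #principal components
pc2 <- pc$x[, 1:2]                               #first two components
cl.pc  <- cutree(hclust(dist(pc2)), k = 3)       #clusters from the 2 PCs
cl.all <- cutree(hclust(dist(scale(x))), k = 3)  #clusters from all variables
plot(pc2, col = cl.pc, pch = 19)                 #plot the clusters in PC space
table(cl.pc, cl.all)                             #compare the two clusterings
```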