3 Assignment K-Means Clustering
Let’s apply K-means clustering to the same data set we used for kNN.
The task is to find a still unknown number of clusters of, in this case, makes and models of cars.
Because clustering is unsupervised, there is no criterion (target) variable, so we cannot evaluate the solution with a training and a test set!
The questions and assignments are:
- Read the file (cars.csv).
- You have seen from the description in the previous assignment that the variable for origin (US versus non-US) is a categorical (factor) variable. We cannot calculate distances from a categorical variable. Because we want to include it anyway, we have to turn it into a dummy (0/1) variable.
- Normalize the data.
- Determine the number of clusters using the (graphical) method described above.
- Determine the clustering, and add the cluster to the data set.
- Describe the clusters in terms of all variables used in the clustering.
- Characterize (label) the clusters.
- Repeat the exercise with more or fewer clusters, and decide if the new solutions are better than the original solution!
3.1 Solution: Some Help
Read the data:
rm(list=ls())
cars <- read.csv("cars.csv",header=TRUE)
str(cars)
## 'data.frame': 74 obs. of 10 variables:
## $ origin : chr "usa" "usa" "usa" "usa" ...
## $ price : num 4099 4749 3799 4816 7827 ...
## $ mileage : num 8.8 6.8 8.8 8 6 7.2 10.4 8 6.4 7.6 ...
## $ repair : num 3 3 3 3 4 3 3 3 3 3 ...
## $ headspace : num 6.25 7.5 7.5 11.25 10 ...
## $ trunkspace : num 308 308 336 448 560 588 280 448 476 364 ...
## $ weight : num 1318 1508 1188 1462 1836 ...
## $ length : num 465 432 420 490 555 ...
## $ turningcircle: num 12.2 12.2 10.7 12.2 13.1 ...
## $ gear_ratio : num 3.58 2.53 3.08 2.93 2.41 2.73 2.87 2.93 2.93 3.08 ...
And make a function for normalizing your data:
# normalize the data (min-max scaling to the range 0-1)
normalizer <- function(x){
  return((x-min(x))/(max(x)-min(x)))
}
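As a quick illustration of what this function does, the sketch below applies it to a small made-up vector (the values are hypothetical, not taken from cars.csv). It also shows a caveat worth knowing: for a column in which all values are equal, max(x) equals min(x), so the division is 0/0 and the result is NaN.

```r
# Min-max normalizer, same definition as in the text
normalizer <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical example vector
x <- c(2, 4, 6, 10)
x_n <- normalizer(x)
print(x_n)   # 0.00 0.25 0.50 1.00

# Caveat: a constant column gives 0/0, which is NaN in R
const_n <- normalizer(c(5, 5, 5))
print(const_n)   # NaN NaN NaN
```

In the cars data every variable varies, so this caveat does not bite here, but it is a reason to check the summary of the normalized data before clustering.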
And normalize the data, after creating a dummy for origin.
cars2 <- cars # copy of cars
# Add dummy for origin
cars2$originDum <- ifelse(cars2$origin=="usa",1,0)
str(cars2)
## 'data.frame': 74 obs. of 11 variables:
## $ origin : chr "usa" "usa" "usa" "usa" ...
## $ price : num 4099 4749 3799 4816 7827 ...
## $ mileage : num 8.8 6.8 8.8 8 6 7.2 10.4 8 6.4 7.6 ...
## $ repair : num 3 3 3 3 4 3 3 3 3 3 ...
## $ headspace : num 6.25 7.5 7.5 11.25 10 ...
## $ trunkspace : num 308 308 336 448 560 588 280 448 476 364 ...
## $ weight : num 1318 1508 1188 1462 1836 ...
## $ length : num 465 432 420 490 555 ...
## $ turningcircle: num 12.2 12.2 10.7 12.2 13.1 ...
## $ gear_ratio : num 3.58 2.53 3.08 2.93 2.41 2.73 2.87 2.93 2.93 3.08 ...
## $ originDum : num 1 1 1 1 1 1 1 1 1 1 ...
table(cars2$origin,cars2$originDum)
##
## 0 1
## other 22 0
## usa 0 52
cars2$origin <- NULL
# normalize
cars2_n <- as.data.frame(lapply(cars2, normalizer))
summary(cars2_n)
## price mileage repair headspace
## Min. :0.00000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.07366 1st Qu.:0.2069 1st Qu.:0.500 1st Qu.:0.2857
## Median :0.13599 Median :0.2759 Median :0.500 Median :0.4286
## Mean :0.22784 Mean :0.3206 Mean :0.598 Mean :0.4266
## 3rd Qu.:0.24108 3rd Qu.:0.4397 3rd Qu.:0.750 3rd Qu.:0.5714
## Max. :1.00000 Max. :1.0000 Max. :1.000 Max. :1.0000
## trunkspace weight length turningcircle
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2917 1st Qu.:0.1591 1st Qu.:0.3077 1st Qu.:0.2504
## Median :0.5000 Median :0.4643 Median :0.5549 Median :0.4501
## Mean :0.4865 Mean :0.4089 Mean :0.5048 Mean :0.4329
## 3rd Qu.:0.6528 3rd Qu.:0.5974 3rd Qu.:0.6786 3rd Qu.:0.6007
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## gear_ratio originDum
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3176 1st Qu.:0.0000
## Median :0.4500 Median :1.0000
## Mean :0.4852 Mean :0.7027
## 3rd Qu.:0.6838 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
The normalized data to use are now in cars2_n.
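To see why clustering works on the normalized copy rather than on the raw data, the following sketch compares Euclidean distances before and after normalization. The three "cars" are hypothetical toy values chosen for illustration, not rows of cars.csv. On the raw scale, price dominates the distance; after min-max scaling, mileage counts as well and the nearest neighbour of the first car changes.

```r
# Hypothetical toy data: price is on a much larger scale than mileage
toy <- data.frame(price   = c(4000, 4500, 4600),
                  mileage = c(6, 10, 6.2))

normalizer <- function(x) (x - min(x)) / (max(x) - min(x))
toy_n <- as.data.frame(lapply(toy, normalizer))

# Pairwise Euclidean distances, raw and normalized
d_raw  <- as.matrix(dist(toy))
d_norm <- as.matrix(dist(toy_n))

# Nearest neighbour of car 1 (row 1, excluding the distance to itself)
nearest_raw  <- names(which.min(d_raw[1, -1]))
nearest_norm <- names(which.min(d_norm[1, -1]))
print(nearest_raw)    # "2": price difference alone decides
print(nearest_norm)   # "3": mileage now matters too
```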
Decide on the best number of clusters:
# optimal number of clusters
c <- 10 # max number for wss-plot
wss <- (nrow(cars2_n)-1)*sum(apply(cars2_n,2,var)); wss[1]
## [1] 52.77075
for (i in 2:c) wss[i] <- sum(kmeans(cars2_n, centers=i)$withinss)
plot(1:c, wss, type="b", xlab="# of Clusters", ylab="Heterogeneity")
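Because kmeans() starts from randomly chosen centers, the plot above can look slightly different from run to run. The sketch below shows one possible way to make it reproducible (set.seed() plus several random starts via nstart) and how the remaining steps of the assignment might look: fitting a chosen number of clusters, adding the cluster to the data set, and describing the clusters by their means. It makes several assumptions: the built-in mtcars data stands in for cars2_n so that the code runs on its own, the seed and nstart = 25 are arbitrary, and k = 3 is only an illustrative choice, not the answer for the cars data.

```r
# Stand-in data (assumption): a few numeric columns of the built-in mtcars;
# with the course data you would use cars2_n instead
normalizer <- function(x) (x - min(x)) / (max(x) - min(x))
dat_n <- as.data.frame(lapply(mtcars[, c("mpg", "hp", "wt", "qsec")], normalizer))

set.seed(123)   # arbitrary seed, only for reproducibility
max_k <- 10     # max number of clusters for the wss plot
wss <- numeric(max_k)

# With one cluster, the within sum of squares is the total sum of squares
wss[1] <- (nrow(dat_n) - 1) * sum(apply(dat_n, 2, var))

# nstart = 25 keeps the best of 25 random starts for each k
for (k in 2:max_k) {
  wss[k] <- kmeans(dat_n, centers = k, nstart = 25)$tot.withinss
}
plot(1:max_k, wss, type = "b", xlab = "# of Clusters", ylab = "Heterogeneity")

# Fit the chosen solution (k = 3 is illustrative only)
k_chosen <- 3
fit <- kmeans(dat_n, centers = k_chosen, nstart = 25)

# Add the cluster to the data set and describe the clusters
dat_n$cluster <- fit$cluster
cluster_sizes <- table(dat_n$cluster)
cluster_means <- aggregate(. ~ cluster, data = dat_n, FUN = mean)
print(cluster_sizes)
print(cluster_means)
```

The table of cluster means is the starting point for labeling: a cluster that is high on weight and horsepower and low on mileage, for example, could be characterized as large, heavy cars. Repeating the last part with a different k_chosen gives the alternative solutions to compare.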