3 Assignment K-Means Clustering

Let's apply K-means clustering to the same data set we used for kNN.

You have to group the observations (in this case, makes and models of cars) into a still unknown number of clusters.

There is no criterion (target) variable here, so we cannot use a training and a test set to check the result!

The questions and assignments are:

Read the file (cars.csv).
You have seen from the description in the previous assignment that the variable for origin (US versus non-US) is a categorical (factor) variable. We cannot calculate distances from a categorical variable. Because we want to include it anyway, we have to turn it into a dummy (0/1) variable.
Normalize the data.
Determine the number of clusters using the (graphical) method described above.
Determine the clustering, and add the cluster to the data set.
Describe the clusters in terms of all variables used in the clustering.
Characterize (label) the clusters.
Repeat the exercise with more or fewer clusters, and decide if the new solutions are better than the original solution!

3.1 Solution: Some Help

Read the data:

rm(list=ls())
cars<-read.csv("cars.csv",header=TRUE)
str(cars)

## 'data.frame':    74 obs. of  10 variables:
##  $ origin       : chr  "usa" "usa" "usa" "usa" ...
##  $ price        : num  4099 4749 3799 4816 7827 ...
##  $ mileage      : num  8.8 6.8 8.8 8 6 7.2 10.4 8 6.4 7.6 ...
##  $ repair       : num  3 3 3 3 4 3 3 3 3 3 ...
##  $ headspace    : num  6.25 7.5 7.5 11.25 10 ...
##  $ trunkspace   : num  308 308 336 448 560 588 280 448 476 364 ...
##  $ weight       : num  1318 1508 1188 1462 1836 ...
##  $ length       : num  465 432 420 490 555 ...
##  $ turningcircle: num  12.2 12.2 10.7 12.2 13.1 ...
##  $ gear_ratio   : num  3.58 2.53 3.08 2.93 2.41 2.73 2.87 2.93 2.93 3.08 ...

And make a function for normalizing your data:

# normalize the data
normalizer <- function(x){
  return((x-min(x))/(max(x)-min(x)))
}
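
To see what the function does, here is a small worked example. It is only a sketch: the input vectors are made up, and normalizer_na is a hypothetical variant (not part of the assignment) that shows one way to deal with missing values.

```r
# Min-max normalization, as defined above
normalizer <- function(x){
  return((x-min(x))/(max(x)-min(x)))
}

# Worked example on a small made-up vector
x <- c(2, 4, 6)
x_n <- normalizer(x)
print(x_n)   # 0.0 0.5 1.0

# Edge case: a constant column gives 0/0, i.e. NaN
print(normalizer(c(5, 5, 5)))

# Edge case: a missing value makes min() and max() return NA,
# so the whole result becomes NA
print(normalizer(c(1, NA, 3)))

# Hypothetical variant that ignores NAs when computing the range
normalizer_na <- function(x){
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
print(normalizer_na(c(1, NA, 3)))   # 0 NA 1
```

The smallest value is mapped to 0, the largest to 1, and everything else falls in between. Constant columns and missing values need attention before normalizing, because kmeans() does not accept NA or NaN values.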

Then create a dummy for origin, and normalize the data.

cars2 <- cars # copy of cars

# Add dummy for origin
cars2$originDum <- ifelse(cars2$origin=="usa",1,0)
str(cars2)

## 'data.frame':    74 obs. of  11 variables:
##  $ origin       : chr  "usa" "usa" "usa" "usa" ...
##  $ price        : num  4099 4749 3799 4816 7827 ...
##  $ mileage      : num  8.8 6.8 8.8 8 6 7.2 10.4 8 6.4 7.6 ...
##  $ repair       : num  3 3 3 3 4 3 3 3 3 3 ...
##  $ headspace    : num  6.25 7.5 7.5 11.25 10 ...
##  $ trunkspace   : num  308 308 336 448 560 588 280 448 476 364 ...
##  $ weight       : num  1318 1508 1188 1462 1836 ...
##  $ length       : num  465 432 420 490 555 ...
##  $ turningcircle: num  12.2 12.2 10.7 12.2 13.1 ...
##  $ gear_ratio   : num  3.58 2.53 3.08 2.93 2.41 2.73 2.87 2.93 2.93 3.08 ...
##  $ originDum    : num  1 1 1 1 1 1 1 1 1 1 ...

table(cars2$origin,cars2$originDum)

##        
##          0  1
##   other 22  0
##   usa    0 52

cars2$origin<-NULL

# normalize
cars2_n <- as.data.frame(lapply(cars2, normalizer))
summary(cars2_n)

##      price            mileage           repair        headspace     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.07366   1st Qu.:0.2069   1st Qu.:0.500   1st Qu.:0.2857  
##  Median :0.13599   Median :0.2759   Median :0.500   Median :0.4286  
##  Mean   :0.22784   Mean   :0.3206   Mean   :0.598   Mean   :0.4266  
##  3rd Qu.:0.24108   3rd Qu.:0.4397   3rd Qu.:0.750   3rd Qu.:0.5714  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.000   Max.   :1.0000  
##    trunkspace         weight           length       turningcircle   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2917   1st Qu.:0.1591   1st Qu.:0.3077   1st Qu.:0.2504  
##  Median :0.5000   Median :0.4643   Median :0.5549   Median :0.4501  
##  Mean   :0.4865   Mean   :0.4089   Mean   :0.5048   Mean   :0.4329  
##  3rd Qu.:0.6528   3rd Qu.:0.5974   3rd Qu.:0.6786   3rd Qu.:0.6007  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##    gear_ratio       originDum     
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.3176   1st Qu.:0.0000  
##  Median :0.4500   Median :1.0000  
##  Mean   :0.4852   Mean   :0.7027  
##  3rd Qu.:0.6838   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000

The normalized data are now in cars2_n.
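
Why normalizing matters for a distance-based method can be illustrated with a small sketch. It uses the price and mileage of the first three cars, copied from the str(cars) output; the helper price_share is a hypothetical name introduced only for this illustration, and the normalization here runs over these three cars only, not over the full data set.

```r
# First three cars, values copied from the str(cars) output
three <- data.frame(
  price   = c(4099, 4749, 3799),
  mileage = c(8.8, 6.8, 8.8)
)

# Min-max normalization, as defined earlier in this section
normalizer <- function(x){
  return((x-min(x))/(max(x)-min(x)))
}

# Hypothetical helper: share of the squared Euclidean distance
# between car 1 and car 2 that is due to price
price_share <- function(df){
  d2 <- (unlist(df[1, ]) - unlist(df[2, ]))^2   # squared difference per variable
  unname(d2["price"] / sum(d2))
}

three_n <- as.data.frame(lapply(three, normalizer))

raw_share  <- price_share(three)     # on the original scales
norm_share <- price_share(three_n)   # after normalization
print(c(raw = raw_share, normalized = norm_share))
```

On the original scales, almost the entire distance between the two cars comes from price, simply because price is measured in much larger numbers than mileage. After normalization, both variables can contribute to the distance.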

Decide on the best number of clusters:

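# optimal number of clusters
set.seed(123) # arbitrary seed: kmeans starts from random centers, so fix it for reproducibility
max_k <- 10   # max number of clusters for the wss-plot (named max_k so that base::c is not masked)
# with one cluster, the within sum of squares equals the total sum of squares
wss <- (nrow(cars2_n)-1)*sum(apply(cars2_n,2,var)); wss[1]

## [1] 52.77075

# nstart=25 tries 25 random starts per k and keeps the best solution
for (i in 2:max_k) wss[i] <- kmeans(cars2_n, centers=i, nstart=25)$tot.withinss
plot(1:max_k, wss, type="b", xlab="# of Clusters", ylab="Heterogeneity")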
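
The remaining steps (determining the clustering for a chosen number of clusters, adding the cluster to the data set, and describing the clusters) could be sketched as follows. This is only a sketch under assumptions: the data frame toy and its values are made up so that the block runs on its own, and k = 2 is an arbitrary choice for the illustration, not the answer for cars.csv. With the real data, you would cluster cars2_n and add the cluster to cars2.

```r
# Hypothetical toy data (not from cars.csv): two well-separated groups
toy <- data.frame(
  price  = c(4000, 4200, 3900, 11000, 11500, 12000),
  weight = c(1000, 1100, 1050, 1800, 1900, 1850)
)

# Min-max normalization, as defined earlier in this section
normalizer <- function(x){
  return((x-min(x))/(max(x)-min(x)))
}
toy_n <- as.data.frame(lapply(toy, normalizer))

set.seed(1)   # arbitrary seed, for reproducible starting centers
k <- 2        # assumed number of clusters, for illustration only
fit <- kmeans(toy_n, centers = k, nstart = 10)

# Add the cluster to the (unnormalized) data set
toy$cluster <- fit$cluster

# Describe the clusters: mean of every clustering variable, per cluster
cluster_means <- aggregate(. ~ cluster, data = toy, FUN = mean)
print(cluster_means)

# Cluster sizes
print(table(toy$cluster))
```

Describing the clusters on the original (unnormalized) scales makes them easier to label, for example as "cheap and light" versus "expensive and heavy" in this toy case. The cluster numbers themselves are arbitrary and can change between runs.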