Chapter 6 Assignment: A Solution

6.1 kNN Method

The tasks and questions are:

Read the file (cars.csv)
Describe the file
how many records?
How many variables?
What are the minimum, mean, and maximum values of those variables?
Prepare the data before applying the kNN method.
Determine the "label" variable!
Normalize the file (excluding the "label" variable!). The resulting values of all variables should be in the range from 0 to 1.
Create a training and test set. Use 80% of the records for the training set. The data may be sorted by important characteristics!
Determine k.
Train the model, and evaluate the model: how good are the predictions in the test set?
What are possible improvements in the model? Apply it!

Tip: the data are on GitHub.

You can download the data, or read the data directly using:

cars <- read.csv("https://raw.githubusercontent.com/ssmresearch/hanminor/main/cars.csv")

## KNN

rm(list=ls())

# Step 1: Read Data
cars <- read.csv("https://raw.githubusercontent.com/ssmresearch/hanminor/main/cars.csv")
head(cars)

##   origin price mileage repair headspace trunkspace weight length turningcircle
## 1    usa  4099     8.8      3      6.25        308 1318.5  465.0         12.20
## 2    usa  4749     6.8      3      7.50        308 1507.5  432.5         12.20
## 3    usa  3799     8.8      3      7.50        336 1188.0  420.0         10.68
## 4    usa  4816     8.0      3     11.25        448 1462.5  490.0         12.20
## 5    usa  7827     6.0      4     10.00        560 1836.0  555.0         13.12
## 6    usa  5788     7.2      3     10.00        588 1651.5  545.0         13.12
##   gear_ratio
## 1       3.58
## 2       2.53
## 3       3.08
## 4       2.93
## 5       2.41
## 6       2.73

str(cars)

## 'data.frame':    74 obs. of  10 variables:
##  $ origin       : chr  "usa" "usa" "usa" "usa" ...
##  $ price        : num  4099 4749 3799 4816 7827 ...
##  $ mileage      : num  8.8 6.8 8.8 8 6 7.2 10.4 8 6.4 7.6 ...
##  $ repair       : num  3 3 3 3 4 3 3 3 3 3 ...
##  $ headspace    : num  6.25 7.5 7.5 11.25 10 ...
##  $ trunkspace   : num  308 308 336 448 560 588 280 448 476 364 ...
##  $ weight       : num  1318 1508 1188 1462 1836 ...
##  $ length       : num  465 432 420 490 555 ...
##  $ turningcircle: num  12.2 12.2 10.7 12.2 13.1 ...
##  $ gear_ratio   : num  3.58 2.53 3.08 2.93 2.41 2.73 2.87 2.93 2.93 3.08 ...

# 10 variabels
# factor ("label") in first column
# "features" in kcolumns 2:10

# Step 2: Prepare Data

# 2.1 Normalize

normalizer <- function(x){
  return((x-min(x))/(max(x)-min(x)))
}

# Normalize data (except the label)
cars_n <- as.data.frame(lapply(cars[,2:10], normalizer))
summary(cars_n) # check that values of all variables are between 0 and 1!

##      price            mileage           repair        headspace     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.07366   1st Qu.:0.2069   1st Qu.:0.500   1st Qu.:0.2857  
##  Median :0.13599   Median :0.2759   Median :0.500   Median :0.4286  
##  Mean   :0.22784   Mean   :0.3206   Mean   :0.598   Mean   :0.4266  
##  3rd Qu.:0.24108   3rd Qu.:0.4397   3rd Qu.:0.750   3rd Qu.:0.5714  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.000   Max.   :1.0000  
##    trunkspace         weight           length       turningcircle   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2917   1st Qu.:0.1591   1st Qu.:0.3077   1st Qu.:0.2504  
##  Median :0.5000   Median :0.4643   Median :0.5549   Median :0.4501  
##  Mean   :0.4865   Mean   :0.4089   Mean   :0.5048   Mean   :0.4329  
##  3rd Qu.:0.6528   3rd Qu.:0.5974   3rd Qu.:0.6786   3rd Qu.:0.6007  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##    gear_ratio    
##  Min.   :0.0000  
##  1st Qu.:0.3176  
##  Median :0.4500  
##  Mean   :0.4852  
##  3rd Qu.:0.6838  
##  Max.   :1.0000

# 2.2 Training and test set

# Random sample of 80%
set.seed(98989)                     # For reproducible results

x  <- nrow(cars_n)                  # bepaal aantal observeringen (74)
xs <- round(0.80*x,0)
cat("# of records:",x,"\nSample:", xs)

## # of records: 74 
## Sample: 59

train_sample <- sample(x,xs)        # sample xs out of x (approx 80%)
str(train_sample)                   # show the sample

##  int [1:59] 58 32 19 8 24 13 74 39 22 64 ...

# split the data in training and test set (59 and 15 observations, respectively)
# For criteria, in columns 2:10, and for label variable (column 1)

cars_train <- cars_n[train_sample ,]
cars_test  <- cars_n[-train_sample,]
cars_train_label <- cars[train_sample, 1]
cars_test_label  <- cars[-train_sample, 1]

str(cars_train)

## 'data.frame':    59 obs. of  9 variables:
##  $ price        : num  0.1417 0.0971 0.0526 0.1505 0.087 ...
##  $ mileage      : num  0.414 0.207 0.241 0.276 0.552 ...
##  $ repair       : num  0.75 0.5 0.5 0.5 0.75 0.5 1 0.5 0.25 0.75 ...
##  $ headspace    : num  0.286 0.429 0.571 0.143 0 ...
##  $ trunkspace   : num  0.167 0.556 0.444 0.611 0.222 ...
##  $ weight       : num  0.169 0.523 0.542 0.494 0.013 ...
##  $ length       : num  0.3077 0.6154 0.6044 0.6374 0.0549 ...
##  $ turningcircle: num  0.151 0.501 0.601 0.55 0.101 ...
##  $ gear_ratio   : num  0.794 0.141 0.218 0.435 0.565 ...

str(cars_test)

## 'data.frame':    15 obs. of  9 variables:
##  $ price        : num  0.1156 0.0403 0.3596 0.1437 0.057 ...
##  $ mileage      : num  0.172 0.345 0.103 0.345 0.207 ...
##  $ repair       : num  0.5 0.5 0.75 0.25 0.25 0.5 0.75 0.75 0.5 0.5 ...
##  $ headspace    : num  0.429 0.429 0.714 0.143 0.714 ...
##  $ trunkspace   : num  0.333 0.389 0.833 0.611 0.667 ...
##  $ weight       : num  0.516 0.286 0.753 0.474 0.597 ...
##  $ length       : num  0.341 0.286 0.879 0.637 0.703 ...
##  $ turningcircle: num  0.45 0.201 0.601 0.501 0.75 ...
##  $ gear_ratio   : num  0.2 0.524 0.129 0.318 0.165 ...

str(cars_train_label); length(cars_train_label)

##  chr [1:59] "other" "usa" "usa" "usa" "usa" "usa" "other" "usa" "usa" ...

## [1] 59

str(cars_test_label) ; length(cars_test_label)

##  chr [1:15] "usa" "usa" "usa" "usa" "usa" "usa" "usa" "usa" "usa" "usa" ...

## [1] 15

# Step 3: Train model

# install.packages("class")
library(class)

# Choice of k: odd number
k <- round(sqrt(nrow(cars_n)))  # Rule of thumb
cat("Rule of thumb, for k:",k)

## Rule of thumb, for k: 9

k <- ifelse(k%%2==0, k-1,k)      # k is odd (9) already

cat("k is", k,"[ = odd number, approximate square root of",x,"]\n")

## k is 9 [ = odd number, approximate square root of 74 ]

cars_pred <- knn(train=cars_train, test=cars_test, cl=cars_train_label, k)

# Step 4: Evaluate Model

library(gmodels)
CrossTable(cars_test_label,cars_pred, expected=TRUE, format="SPSS")

## 
##    Cell Contents
## |-------------------------|
## |                   Count |
## |         Expected Values |
## | Chi-square contribution |
## |             Row Percent |
## |          Column Percent |
## |           Total Percent |
## |-------------------------|
## 
## Total Observations in Table:  15 
## 
##                 | cars_pred 
## cars_test_label |    other  |      usa  | Row Total | 
## ----------------|-----------|-----------|-----------|
##           other |        3  |        0  |        3  | 
##                 |    0.600  |    2.400  |           | 
##                 |    9.600  |    2.400  |           | 
##                 |  100.000% |    0.000% |   20.000% | 
##                 |  100.000% |    0.000% |           | 
##                 |   20.000% |    0.000% |           | 
## ----------------|-----------|-----------|-----------|
##             usa |        0  |       12  |       12  | 
##                 |    2.400  |    9.600  |           | 
##                 |    2.400  |    0.600  |           | 
##                 |    0.000% |  100.000% |   80.000% | 
##                 |    0.000% |  100.000% |           | 
##                 |    0.000% |   80.000% |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |        3  |       12  |       15  | 
##                 |   20.000% |   80.000% |           | 
## ----------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  15     d.f. =  1     p =  0.0001075112 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  9.401042     d.f. =  1     p =  0.002168622 
## 
##  
##        Minimum expected frequency: 0.6 
## Cells with Expected Frequency < 5: 3 of 4 (75%)

# Step 5: Improve Model

# Alternative values for k (3; 5; 7; 9; 11; 13; 15)

kset   <- c(3, 5, 7, 9, 11, 13, 15)
(ssize  <- x-xs) # opslaan van de steekproefomvang; 74 - 59 = 15

## [1] 15

for (k in kset) {
  cars_pred <- knn(train=cars_train, test=cars_test,cl=cars_train_label,k)
  true_neg <- table(cars_test_label,cars_pred)[1,1]
  true_pos <- table(cars_test_label,cars_pred)[2,2]
  corn = true_neg + true_pos
  corp = 100*round((corn / ssize), 4)
  cat("Correctly Predicted (k=", k,"):",corn, "(",corp,"%)\n")
}

## Correctly Predicted (k= 3 ): 15 ( 100 %)
## Correctly Predicted (k= 5 ): 15 ( 100 %)
## Correctly Predicted (k= 7 ): 14 ( 93.33 %)
## Correctly Predicted (k= 9 ): 15 ( 100 %)
## Correctly Predicted (k= 11 ): 15 ( 100 %)
## Correctly Predicted (k= 13 ): 15 ( 100 %)
## Correctly Predicted (k= 15 ): 15 ( 100 %)

correct <- c(15,15,14,15,15,15,15)
(k_opt <- data.frame(kset, correct))

##   kset correct
## 1    3      15
## 2    5      15
## 3    7      14
## 4    9      15
## 5   11      15
## 6   13      15
## 7   15      15

plot(kset, correct, type="l", main="% correct\nby # of neighbours (k)",
    xlab="# neighbours (k)", ylab="% Correct")

6.2 KMeans

## KMeans

rm(list=ls())
cars<-read.csv("cars.csv",header=TRUE)
str(cars)

## 'data.frame':    74 obs. of  10 variables:
##  $ origin       : chr  "usa" "usa" "usa" "usa" ...
##  $ price        : num  4099 4749 3799 4816 7827 ...
##  $ mileage      : num  8.8 6.8 8.8 8 6 7.2 10.4 8 6.4 7.6 ...
##  $ repair       : num  3 3 3 3 4 3 3 3 3 3 ...
##  $ headspace    : num  6.25 7.5 7.5 11.25 10 ...
##  $ trunkspace   : num  308 308 336 448 560 588 280 448 476 364 ...
##  $ weight       : num  1318 1508 1188 1462 1836 ...
##  $ length       : num  465 432 420 490 555 ...
##  $ turningcircle: num  12.2 12.2 10.7 12.2 13.1 ...
##  $ gear_ratio   : num  3.58 2.53 3.08 2.93 2.41 2.73 2.87 2.93 2.93 3.08 ...

# normaliseren van data
normalizer <- function(x){
  return((x-min(x))/(max(x)-min(x)))
}

# Include origin (now a factor variable; redefine as dummy 0/1)

cars2 <- cars # copy of cars

# Add dummy for origin
cars2$originDum <- ifelse(cars2$origin=="usa",1,0)
str(cars2)

## 'data.frame':    74 obs. of  11 variables:
##  $ origin       : chr  "usa" "usa" "usa" "usa" ...
##  $ price        : num  4099 4749 3799 4816 7827 ...
##  $ mileage      : num  8.8 6.8 8.8 8 6 7.2 10.4 8 6.4 7.6 ...
##  $ repair       : num  3 3 3 3 4 3 3 3 3 3 ...
##  $ headspace    : num  6.25 7.5 7.5 11.25 10 ...
##  $ trunkspace   : num  308 308 336 448 560 588 280 448 476 364 ...
##  $ weight       : num  1318 1508 1188 1462 1836 ...
##  $ length       : num  465 432 420 490 555 ...
##  $ turningcircle: num  12.2 12.2 10.7 12.2 13.1 ...
##  $ gear_ratio   : num  3.58 2.53 3.08 2.93 2.41 2.73 2.87 2.93 2.93 3.08 ...
##  $ originDum    : num  1 1 1 1 1 1 1 1 1 1 ...

table(cars2$origin,cars2$originDum)

##        
##          0  1
##   other 22  0
##   usa    0 52

cars2$origin<-NULL

# normalize
cars2_n <- as.data.frame(lapply(cars2, normalizer))
summary(cars2_n)

##      price            mileage           repair        headspace     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.07366   1st Qu.:0.2069   1st Qu.:0.500   1st Qu.:0.2857  
##  Median :0.13599   Median :0.2759   Median :0.500   Median :0.4286  
##  Mean   :0.22784   Mean   :0.3206   Mean   :0.598   Mean   :0.4266  
##  3rd Qu.:0.24108   3rd Qu.:0.4397   3rd Qu.:0.750   3rd Qu.:0.5714  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.000   Max.   :1.0000  
##    trunkspace         weight           length       turningcircle   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2917   1st Qu.:0.1591   1st Qu.:0.3077   1st Qu.:0.2504  
##  Median :0.5000   Median :0.4643   Median :0.5549   Median :0.4501  
##  Mean   :0.4865   Mean   :0.4089   Mean   :0.5048   Mean   :0.4329  
##  3rd Qu.:0.6528   3rd Qu.:0.5974   3rd Qu.:0.6786   3rd Qu.:0.6007  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##    gear_ratio       originDum     
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.3176   1st Qu.:0.0000  
##  Median :0.4500   Median :1.0000  
##  Mean   :0.4852   Mean   :0.7027  
##  3rd Qu.:0.6838   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000

# optimal number of clusters
c <- 10 # max number for wss-plot
wss <- (nrow(cars2_n)-1)*sum(apply(cars2_n,2,var)); wss[1]

## [1] 52.77075

for (i in 2:c) wss[i] <- sum(kmeans(cars2_n, centers=i)$withinss)
plot(1:c, wss, type="b", xlab="# of Clusters", ylab="Heterogeneity")

# Elbow at 3. Agree?

library(stats)
set.seed(111)
autoCluster <- kmeans(cars2_n, 3)
autoCluster$size

## [1] 30 22 22

round(autoCluster$centers, 2)

##   price mileage repair headspace trunkspace weight length turningcircle
## 1  0.31    0.17   0.52      0.64       0.68   0.65   0.74          0.63
## 2  0.10    0.40   0.49      0.24       0.35   0.31   0.40          0.37
## 3  0.25    0.44   0.82      0.32       0.36   0.18   0.29          0.22
##   gear_ratio originDum
## 1       0.28         1
## 2       0.47         1
## 3       0.77         0

# Add cluster (1, 2 of 3) to data set
cars2$cluster <- autoCluster$cluster

# Aggregate (or summarize) info
names(cars2)

##  [1] "price"         "mileage"       "repair"        "headspace"    
##  [5] "trunkspace"    "weight"        "length"        "turningcircle"
##  [9] "gear_ratio"    "originDum"     "cluster"

aggregate(data=cars2, mileage ~ cluster, mean)

##   cluster  mileage
## 1       1 6.786667
## 2       2 9.490909
## 3       3 9.909091

aggregate(data=cars2, length ~ cluster, mean)

##   cluster   length
## 1       1 523.2500
## 2       2 445.4545
## 3       3 421.3636

aggregate(data=cars2, price  ~ cluster, mean)

##   cluster    price
## 1       1 7225.967
## 2       2 4499.409
## 3       3 6384.682

aggregate(data=cars2, price  ~ cluster, min)

##   cluster price
## 1       1  3291
## 2       2  3299
## 3       3  3748

aggregate(data=cars2, price  ~ cluster, max)

##   cluster price
## 1       1 15906
## 2       2  6486
## 3       3 12990

# install.packages("doBy")
library(doBy)

## Warning: package 'doBy' was built under R version 4.0.4

summaryBy(list(c("price","headspace","originDum"), c("cluster")), data=cars2, FUN=c(min,mean,max))

##   cluster price.min headspace.min originDum.min price.mean headspace.mean
## 1       1      3291          6.25             1   7225.967       9.375000
## 2       2      3299          3.75             1   4499.409       5.852273
## 3       3      3748          3.75             0   6384.682       6.534091
##   originDum.mean price.max headspace.max originDum.max
## 1              1     15906         12.50             1
## 2              1      6486         10.00             1
## 3              0     12990          8.75             0