Chapter 6 KNN Algorithm

The KNN (K Nearest Neighbours) algorithm can be used for both unsupervised and supervised learning. In the supervised setting it can be applied in both a regression and a classification context.

6.1 KNN in Classification

The idea behind the KNN algorithm is simple. Consider a binary classification problem, i.e. a dataset that provides a number of features and a binary target variable Y, where the dataset contains historical data.
To predict the value of the target variable for a new observation, the distance of the new observation to every observation in the data set is calculated, based on the values of the features. The K observations with the lowest distance are selected, and the majority Y-value among these K observations is used as the predicted Y-value for the new observation.
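A minimal from-scratch sketch of this procedure in R (the function and its inputs are hypothetical; the examples later in this chapter use the ready-made implementations in the class and caret packages):

knn_predict <- function(train_x, train_y, new_x, k = 5) {
  #Euclidean distance from the new observation to every historical observation
  dists <- sqrt(rowSums(sweep(as.matrix(train_x), 2, as.numeric(new_x))^2))
  #select the K observations with the lowest distance
  nearest <- order(dists)[1:k]
  #majority vote over their Y-values
  names(which.max(table(train_y[nearest])))
}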

Before this algorithm can be used, two important choices must be made: (1) which value of K to use and (2) which distance metric to use.
Besides these choices, two important preprocessing steps have to be performed: (1) categorical variables must be transformed into dummy variables and (2) numeric variables must be standardized/normalized. The first step is necessary because calculating a distance requires numerical values. The second step is performed because otherwise the units chosen largely determine the distances.
Different techniques can be used for normalizing the values of a numeric variable. Standardizing the values, i.e. transforming the values into z-scores, is one option: \(z = \frac{x - \bar{x}}{SD}\). Another option is min-max normalization: \(x_{norm} = \frac{x - min}{max - min}\). As in all classification problems, a choice must also be made about the metric used to assess the model.
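Both transformations take a single line in R; a small sketch on a made-up numeric vector x:

x <- c(16, 39, 47.5, 57, 90)                 #made-up ages
z_score <- (x - mean(x)) / sd(x)             #standardization (z-scores)
x_norm <- (x - min(x)) / (max(x) - min(x))   #min-max normalization to [0, 1]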

6.1.1 Choosing an appropriate K value

There is no general rule for choosing the best K-value. The most common approach is to divide the data into a training and a test set, try a number of choices for K and check which choice leads to the best result (the stock-market example later in this chapter loops over a range of K-values in exactly this way).

6.1.2 Distance metrics

Commonly used distance metrics are (1) the Euclidean distance, \(\sqrt{\sum_i (x_i - y_i)^2}\), and (2) the Manhattan distance, \(\sum_i |x_i - y_i|\). Be aware that there are lots of other distance metrics as well.
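In R the built-in dist() function supports both metrics; a small sketch with two made-up points:

p <- c(1, 2, 3)
q <- c(4, 0, 3)
dist(rbind(p, q), method = "euclidean")  #sqrt((1-4)^2 + (2-0)^2 + (3-3)^2) = sqrt(13)
dist(rbind(p, q), method = "manhattan")  #|1-4| + |2-0| + |3-3| = 5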

Example 1

Predicting diabetes
Data set: example from UCI website

#load the packages used in this chapter
library(tidyverse)   #read_csv(), dplyr, ggplot2
library(flextable)   #formatted tables
library(caret)       #knn3(), train(), confusionMatrix()
library(class)       #knn()
library(e1071)

df <- read_csv("data/uci_diabetes/diabetes_data_upload.csv")

The data set consists of 520 observations on 16 features plus the target variable Class, which can take on two values, Positive and Negative. All but one of the features are binary; the non-binary feature is Age.

Table 1
First Six Observations of the Diabetes Data Set

#display the first six observations
flextable(head(df)) %>% 
  fontsize(size = 7, part = "all")


Table 2
Summary of the data

summary(df)
##       Age           Gender            Polyuria          Polydipsia       
##  Min.   :16.00   Length:520         Length:520         Length:520        
##  1st Qu.:39.00   Class :character   Class :character   Class :character  
##  Median :47.50   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :48.03                                                           
##  3rd Qu.:57.00                                                           
##  Max.   :90.00                                                           
##  sudden weight loss   weakness          Polyphagia        Genital thrush    
##  Length:520         Length:520         Length:520         Length:520        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  visual blurring      Itching          Irritability       delayed healing   
##  Length:520         Length:520         Length:520         Length:520        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  partial paresis    muscle stiffness     Alopecia           Obesity         
##  Length:520         Length:520         Length:520         Length:520        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     class          
##  Length:520        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Data cleaning and Preprocessing

  • Correct variable names (no spaces)
  • Preprocessing: normalize the Age variable
  • Transform the other variables into dummies
  • Transform the class variable into a factor variable; this is a requirement for the caret::knn3() function
  • Split the data set into a training and a test set; 70% in the training set, 30% in the test set

#correct the variable names: no spaces, upper case, underscores instead of dots
names(df) <- make.names(names(df)) %>% 
  toupper() %>% str_replace_all("[\\.]", "_")


#normalize AGE (min-max) and recode GENDER as a 0/1 dummy
df_prep <- df %>% 
  mutate(AGE_NORM = (AGE - min(AGE))/(max(AGE)-min(AGE))) %>% 
  select(AGE_NORM, everything(), -AGE) %>% 
  mutate(GENDER = ifelse(GENDER=="Male", 1, 0))

#the variables in columns 3-16 all take on only "Yes" or "No" values;
#these can be transformed into 0/1 variables with one line of code
df_prep[,3:16] <- apply(df_prep[,3:16], 2, function(x) ifelse(x=="Yes", 1, 0))

#transform CLASS variable into type factor variable
df_prep$CLASS <- factor(df_prep$CLASS, levels = c("Positive", "Negative"),
                        labels = c("POS", "NEG"))

#split into training and test set
set.seed(20210309)
train <- sample(1:nrow(df_prep), size = .7*nrow(df_prep))
df_train <- df_prep[train,]
df_test <- df_prep[-train,]

Generate a KNN model. Choice to make beforehand: the value of k; first attempt with k = 5.

#generate model using the caret::knn3() function
#first attempt with k = 5 as an arbitrary choice
knn_model_1 <- knn3(CLASS~., data = df_train, k = 5)
knn_model_1
5-nearest neighbor model
Training set outcome distribution:

POS NEG 
234 130 

Assess model

  • Choose a metric to assess the model: Accuracy
  • Use the model to make predictions for the test data
  • Calculate Accuracy

#make predictions with the generated model
knn_model_1_preds <- predict(knn_model_1, newdata = df_test, type = 'class')

#construct a confusion matrix
cf_1 <- table(knn_model_1_preds, df_test$CLASS)

#calculate Accuracy
acc_1 <- (cf_1[1,1] + cf_1[2,2]) / sum(cf_1)

Table 3
Confusion Matrix knn_1 model on Test Data Set

cf_1
                 
knn_model_1_preds POS NEG
              POS  77   5
              NEG   9  65


Accuracy = (77 + 65) / (77 + 5 + 9 + 65) = 0.91.

confusionMatrix(knn_model_1_preds, df_test$CLASS)

#knn_model_1 <- train(CLASS~., data = df_prep, method = "knn") #knn_model_1

Example 2

Predicting default of credit card clients
Data set: example from UCI website

The data set contains the following 23 explanatory variables:

  • X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
  • X2: Gender (1 = male; 2 = female).
  • X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
  • X4: Marital status (1 = married; 2 = single; 3 = others).
  • X5: Age (year).
  • X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
  • X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
  • X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.

There is one binary variable, default payment (Yes = 1, No = 0), as the response variable.

#https://archive.ics.uci.edu/ml/machine-learning-databases/00350/

data <- readxl::read_xls("data/uci_bank/default of credit card clients.xls",
                         col_names = TRUE)  
#the first data row contains the descriptive variable names from the
#second header row of the xls; drop it
data <- data[-1,]
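The example stops after loading the data. A possible continuation, following the same preprocessing steps as in the diabetes example (a sketch: it assumes the first header row of the xls supplies the names ID, X1, ..., X23 and Y, and the POS/NEG labels are an arbitrary choice):

#all columns were read as character because of the second header row
data_prep <- data %>%
  mutate(across(everything(), as.numeric)) %>%
  select(-ID) %>%
  #min-max normalize all predictors
  mutate(across(-Y, ~ (.x - min(.x)) / (max(.x) - min(.x)))) %>%
  mutate(Y = factor(Y, levels = c(1, 0), labels = c("POS", "NEG")))

set.seed(20210309)
train_idx <- sample(1:nrow(data_prep), size = .7 * nrow(data_prep))
knn_credit <- knn3(Y ~ ., data = data_prep[train_idx, ], k = 5)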

6.1.2.1 Example 3: Predicting the stock market going up or down

Dataset: ISLR::Smarket.
See ?ISLR::Smarket for more information.
Response variable: Direction; two possible values: “Up”, “Down”.
Predictors: Lag1, Lag2, Lag3, Lag4, Lag5.
Choices made:
- distance metric: Euclidean distance
- k: comparing different values to find the optimal value
- model assessment metric: accuracy

df <- ISLR::Smarket
#data from 2001 to 2004 as training data
#data from 2005 as test data
df_train <- filter(df, Year < 2005) %>% 
  select(Lag1, Lag2, Lag3, Lag4, Lag5, Direction)
df_test <- filter(df, Year == 2005) %>% 
  select(Lag1, Lag2, Lag3, Lag4, Lag5, Direction)

# model 1: k = 1, using class::knn(), which classifies the test observations directly
model_k1 <- knn(train = df_train[, 1:5],
                test = df_test[, 1:5],
                cl = df_train[, 6],
                k = 1)
table(model_k1, df_test$Direction)
        
model_k1 Down Up
    Down   55 66
    Up     56 75
#overall accuracy for k = 1
acc_k1 <- round(sum(model_k1 == df_test$Direction)/nrow(df_test), 3)

# accuracy for various k values

accuracies <- data.frame(K = 1:150, ACCURACY = NA)

for (k in 1:150) {
  model_knn <- knn(train = df_train[, 1:5],
                test = df_test[, 1:5],
                cl = df_train$Direction,
                k = k)
  accuracies[k, 2] <- round(sum(model_knn == df_test$Direction)/nrow(df_test), 3)
}

ggplot(accuracies, aes(x = K, y = ACCURACY)) +
  geom_line() +
  ylim(0.4, 0.65) +
  theme_minimal()
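The k with the highest test accuracy can then be read off directly:

accuracies[which.max(accuracies$ACCURACY), ]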

Same example using caret.

#fit on the training years only and let caret tune k over a grid
knn_model_caret <- train(Direction~.,
                         data = df_train,
                         method = "knn",
                         tuneGrid = data.frame(k = seq(1, 40, 2)))

preds_caret <- predict(knn_model_caret, newdata = df_test)
confusionMatrix(preds_caret, df_test$Direction)
Confusion Matrix and Statistics

          Reference
Prediction Down  Up
      Down   39  37
      Up     72 104
                                          
               Accuracy : 0.5675          
                 95% CI : (0.5038, 0.6295)
    No Information Rate : 0.5595          
    P-Value [Acc > NIR] : 0.425500        
                                          
                  Kappa : 0.092           
                                          
 Mcnemar's Test P-Value : 0.001128        
                                          
            Sensitivity : 0.3514          
            Specificity : 0.7376          
         Pos Pred Value : 0.5132          
         Neg Pred Value : 0.5909          
             Prevalence : 0.4405          
         Detection Rate : 0.1548          
   Detection Prevalence : 0.3016          
      Balanced Accuracy : 0.5445          
                                          
       'Positive' Class : Down            
                                          

#two small toy training sets with binary predictors, e.g. for working
#through the KNN distance calculations and the majority vote by hand
df_train <- tibble(Y = c("POS", "POS", "POS", "POS", "POS",
                         "POS", "NEG", "NEG", "NEG", "NEG"),
                   X1 = c(1, 1, 1, 1, 0, 0, 1, 0, 0, 0))

df_train_2 <- tibble(Y = c("POS", "POS", "POS", "POS", "POS",
                           "POS", "NEG", "NEG", "NEG", "NEG"),
                     X1 = c(1, 1, 1, 1, 0, 0, 1, 0, 0, 0),
                     X2 = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0))
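These toy sets make the majority vote easy to follow; a minimal sketch with caret::knn3() (the new observation (X1 = 1, X2 = 1) is made up for illustration):

#knn3() requires the target to be a factor
df_train_2 <- df_train_2 %>% mutate(Y = factor(Y))
toy_model <- knn3(Y ~ ., data = df_train_2, k = 3)

#the three nearest neighbours of (1, 1) are the three training points at
#distance 0 (two POS, one NEG), so the majority vote predicts POS
predict(toy_model, newdata = tibble(X1 = 1, X2 = 1), type = "class")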