Chapter 3 The k-NN Algorithm
It is easiest to illustrate the nearest neighbor method (k-NN, for short) with an example.
A microfinance institute in Africa makes loans available to starting entrepreneurs.
Until now, loan applications have been mainly processed and assessed by field workers (loan officers) who visit villages.
About 40% of the applications are approved.
The field workers have gained a lot of experience over time and are able to make an educated guess on whether clients are able to pay the interest and repay the loan (over 90% of the loans are repaid on time).
However, with the introduction of mobile technology, an increasing number of applications reach the head offices directly, and are assessed by administrative staff who have no direct contact with the applicants.
The head office has a wealth of data on loans granted and rejected, and on characteristics of the applicants.
The challenge is to find out whether it is possible to determine, on the basis of a number of simple characteristics (such as age, gender, and experience), whether a loan should be granted or not.
3.1 Distance and Similarity
New applicants can be compared with "similar" applicants from the past.
If a loan has been granted to the majority of that set, then we also grant the loan to the new applicant (and reject the loan otherwise).
The set of similar applicants is determined by a number of characteristics that we believe are relevant.
Ten characteristics have been selected from the data on the application form:
- Gender (a dummy variable: 0 = female; 1 = male)
- Age (in years)
- Experience (number of years working, self-employed or in paid employment)
- Own assessment of entrepreneurial skills (a 10-point scale; 1 = very poor; 10 = excellent)
- Been self-employed in the past (0 = no; 1 = yes)
- Education level (1 = none; 2 = primary education; 3 = secondary education; 4 = university)
- Parent education level (same coding)
- In possession of a savings account (0/1)
- Living in the city or countryside (0 = countryside; 1 = city)
- Marital status (0 = not married; 1 = married).
Discussions with field workers have shown that these characteristics are relevant, but in an often quite complex way.
Older applicants bring more experience, which is positive.
At the same time, older applicants often lack the energy to successfully start something new, especially if they have not had a business before.
Younger entrepreneurs have no experience but have better skills (such as knowledge of IT).
Education level is a good indication especially if it is higher than that of the parents (it indicates perseverance and commitment).
In short, the decision whether or not to allocate a loan is complex and, moreover, at least partly intuitive.
When a new applicant registers, we will compare his or her data with the applicants from the past.
We search the database for applicants with similar characteristics. We may even find applicants with exactly the same characteristics; the distance between the new applicant and such an existing customer is then zero.
The distance between two people is determined on the basis of the differences in the ten characteristics.
You can compute the distance in a variety of ways.
- You can measure the absolute differences, and then sum them over the 10 characteristics.
- But mostly we use the so-called Euclidean distance.
We work in three steps.
- In the first step, we determine the squared differences between two objects (or, as in our example, people).
- In the second step, we add up all (in our example, ten) squared differences.
- And, finally, in step 3, we will take the square root of the result of step 2.
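Written as a formula (with \(x_i\) and \(y_i\) denoting the values of the two persons on characteristic \(i\), and ten characteristics in our example), the three steps combine into the Euclidean distance
\[
d(x, y) = \sqrt{\sum_{i=1}^{10} (x_i - y_i)^2}.
\]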
Note: this is actually an extension of the Pythagorean theorem that you probably remember from high school!
The straight-line distance between two points can be calculated as the square root of the sum of the squared differences in horizontal and vertical directions. The extension is that we now have more than two dimensions.
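Just to make the three steps concrete, here is a minimal sketch in R; the two applicants and their three (already normalized) characteristics are made up for illustration.
# Two made-up applicants, scored on three already-normalized characteristics
a <- c(0.5, 1, 0.3)
b <- c(0.2, 0, 0.3)
squared_diff <- (a - b)^2     # step 1: squared differences per characteristic
sum_sq <- sum(squared_diff)   # step 2: sum the squared differences
sqrt(sum_sq)                  # step 3: take the square root -> Euclidean distance (about 1.04)
# or, in one line:
sqrt(sum((a - b)^2))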
3.2 Normalize or Standardize Data
When calculating distances, we have to take into account the units used.
Age is expressed in years, and the range of measurements (the highest observed age minus the lowest observed age) in our customer base can be as high as 40 (years).
For gender we use a (0/1) dummy variable that gives 1 as the maximum difference.
If we also included the amount in the savings account as a characteristic, the value (in units of local currency) could run in the thousands or even millions.
The problem of different and incomparable units can be solved in two ways: by normalization, or by standardization.
3.3 Normalizing Data
Normalization is the recalculation of the data to values between 0 (the minimum) and 1 (the maximum) with the formula:
\[
x_n = \frac{x - x_{min}}{x_{max} - x_{min}}
\]
For example, if age ranges from 20 to 50, then an applicant aged 35 has a normalized age of
\(x_n = (35-20)/(50-20) = 0.50\) (halfway between the lowest and highest age observed).
Normalizing variables makes them comparable. Computing distances for non-normalized data in effect assigns higher weights to variables with wider ranges of values.
Self Test: You are in a class of 20 students. Your grade for this module is 7. The grades in your class range from 4 (minimum) to 9 (maximum). What is your normalized grade?
3.4 Standardizing Data
An alternative to normalization is standardization.
Here we determine the Z-scores of the data, assuming the data are normally distributed.
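In formula form: for a value \(x\) of a variable with mean \(\bar{x}\) and standard deviation \(s\), the Z-score is \(z = (x - \bar{x})/s\).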
The Z scores are negative (positive) for values smaller (larger) than the mean.
Standardization is preferable to normalization if the criteria are unbounded (e.g., ranging from 0 to ∞) and/or there are outliers in either direction (e.g., very low or high incomes, compared to the rest of the group).
Whatever the choice, outliers tend to have an impact on your analysis, and it pays off to deal with this issue somehow! If the outlier reflects an error, then obviously it has to be corrected. If the outlier is not representative of the population (Bill Gates applying for a microloan), then deleting it is an option. Recoding very high values to a predetermined maximum might also work. In general: try to see how sensitive your analysis is to the different methods!
Suppose all persons in your data set earn between 0 and 100 monetary units, except for one person who earns 10,000. The bulk of the normalized values would then be between 0 and 0.01, and have little weight in finding nearest neighbours.
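A minimal sketch (with made-up incomes, not part of our loan data) shows how base R's scale() function computes these Z-scores, and how normalization would treat the same outlier:
# Standardizing with base R's scale(); the incomes are hypothetical, with one extreme outlier
income <- c(10, 20, 30, 40, 10000)
z <- scale(income)                # Z-scores: (income - mean(income)) / sd(income)
round(as.numeric(z), 2)
## [1] -0.45 -0.45 -0.45 -0.44  1.79
# compare with normalization: the outlier squeezes all other values towards 0
round((income - min(income)) / (max(income) - min(income)), 3)
## [1] 0.000 0.001 0.002 0.003 1.000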
3.5 Choosing the Number of Neighbors (k)
Oh yes, it's k-NN, not NN! Let's turn our attention to k!
The k signifies the number of nearest neighbours we want to analyze.
The outcome of the algorithm is sensitive to the choice of k.
If we set k very large, then it becomes ever more likely that the majority in the dataset will dominate.
In our case, since the majority of applications are rejected, the likelihood of a majority of rejections increases as k increases.
In the extreme case, with k as large as the data set, we reject all applications because the majority of applications in the past was rejected! But if the bank rejects all applications, then it will cease to exist. That's not what we want.
If, on the other hand, we choose a very small k (in the extreme case, k=1) we run the risk that small groups that may be incorrectly labeled will have an undesirably large influence on the predictions.
For example, a person with extreme scores on, say, age and entrepreneurial skills, who got a loan but really shouldn't have gotten it, has an undue influence on the predictions of new applicants with the same characteristics.
So, we have to find some optimum, in between these extremes.
As a rule of thumb, we often use the square root of the number of observations.
For example, for 500 observations, we would set k to √500 ≈ 22. In the case of a binary classification (a classification into two groups), however, we prefer an odd number so that there is always a majority; we can set k to 21 or 23.
This is just a rule of thumb. To make the model as good as possible, we will experiment with the value of k!
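As a quick sketch of this rule of thumb in R (using our 500 observations):
# rule of thumb: k around the square root of the number of observations
k_guess <- floor(sqrt(500))                     # 22
if (k_guess %% 2 == 0) k_guess <- k_guess + 1   # prefer an odd k for a binary classification
k_guess
## [1] 23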
3.6 Application of k-NN in R
Let's apply the k-NN algorithm step by step to an example.
The steps are as follows.
- Reading and describing the data
- Preparing the data for application of the k-NN algorithm:
- Normalize (or standardize) the data
- Splitting the data into a training set and a test set
- Training the model on the data
- Evaluating the model
- Improving the model.
3.6.1 Step 1: Reading and Describing the Data
We are going to work in an R-script.
Our data comes from a Stata file (ML3_1_KNN.dta) that we can read with read.dta().
To use this function we must first load the package foreign, with library(foreign).
We read the file as a data frame called knndata.
We can look at the structure of that file with the str() command.
library(foreign)
knndata<-read.dta("ML3_1_KNN.dta",convert.factors=FALSE)
str(knndata)
## 'data.frame': 500 obs. of 13 variables:
## $ id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ male : num 1 0 0 0 0 1 0 1 0 0 ...
## $ age : num 49 29 40 33 27 40 46 46 25 32 ...
## $ exp : num 2 1 20 12 1 8 21 2 7 1 ...
## $ skills : num 3 1 10 10 10 10 1 10 10 10 ...
## $ busexp : num 0 0 1 0 0 0 1 0 0 0 ...
## $ educ : num 1 1 1 1 2 3 3 2 2 1 ...
## $ educpar: num 1 1 1 1 4 2 4 2 4 1 ...
## $ saving : num 0 0 0 0 0 0 1 0 1 0 ...
## $ city : num 0 1 0 0 1 1 0 0 0 0 ...
## $ marital: num 1 0 1 1 1 0 1 1 0 1 ...
## $ savacc : num 6.28 14.37 11.51 5.16 17.86 ...
## $ accept : num 0 0 1 0 0 0 1 1 0 1 ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr "14 Apr 2016 11:56"
## - attr(*, "formats")= chr [1:13] "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int [1:13] 254 254 254 254 254 254 254 254 254 254 ...
## - attr(*, "val.labels")= chr [1:13] "" "" "" "" ...
## - attr(*, "var.labels")= chr [1:13] "" "" "" "" ...
## - attr(*, "version")= int 12
From the output we see that the data frame has 500 observations and 13 variables.
The values of the first records in the data frame are displayed.
Before we get started, we still have to do the necessary preliminary work.
- Some variables are not needed for the analysis (id; savacc).
- The variable id is a serial number of the records in the file and does not contain useful information.
- The variable savacc is an estimated amount of the applicant's savings. Because the savings balance is only a snapshot, the loan officers use whether or not the applicant has a savings account as an indication, rather than the savings balance itself.
- The variable that we call the label in the k-NN method must be a factor variable. The label here is accept, but it is a numeric variable.
- We have to check that all variables have values in the expected range (no ages above 80 or below 16; all dummies are indeed 0 or 1; the entrepreneurial skills are on a 10-point scale with values from 1 to 10; and so on).
3.6.1.1 Selection and Description of Variables
First of all, we select the variables to be used in our analysis.
There are several ways to do this.
It is good practice to position the label variable that indicates the group the object (or person) belongs to in the first column. Our label variable (accept) is now in the last of the 13 columns.
We do not need the variables id and savacc.
First we create a new factor variable, acceptFac, from the numeric variable accept, with the factor() function.
In that function we can indicate:
- what the values (levels) are in the original variable, and
- how we want to label those values.
We can describe the new variable with the self-written CFR() function introduced in the previous chapter, or simply by tabulating the percentages, as below.
knndata$acceptFac <- factor(knndata$accept, levels=c(0,1),
labels=c("Reject","Accept"))
round(100*table(knndata$acceptFac)/length(knndata$acceptFac),0)
##
## Reject Accept
## 61 39
Indeed, around 60% (61%, to be more precise) of the 500 applications were rejected.
Now that we have the new variable acceptFac, we can create a new data frame with only the required variables, with acceptFac in the first column.
With the head() function we display the first (6) observations, and summary() gives a description of all variables.
Always study the output carefully: any coding errors or outliers will quickly come to light! This file does not appear to contain any coding errors or strange values. The coding for education level is as expected from 1 to 4; ages range from 21 to 50; and so on.
knndata2 <- knndata[,c("acceptFac","male","age","exp","skills",
"busexp","educ","educpar","saving",
"city","marital")]
head(knndata2)
## acceptFac male age exp skills busexp educ educpar saving city marital
## 1 Reject 1 49 2 3 0 1 1 0 0 1
## 2 Reject 0 29 1 1 0 1 1 0 1 0
## 3 Accept 0 40 20 10 1 1 1 0 0 1
## 4 Reject 0 33 12 10 0 1 1 0 0 1
## 5 Reject 0 27 1 10 0 2 4 0 1 1
## 6 Reject 1 40 8 10 0 3 2 0 1 0
summary(knndata2)
## acceptFac male age exp skills
## Reject:304 Min. :0.000 Min. :21.00 Min. : 0.000 Min. : 1.000
## Accept:196 1st Qu.:0.000 1st Qu.:29.00 1st Qu.: 4.000 1st Qu.: 1.000
## Median :0.000 Median :37.00 Median : 8.000 Median : 3.000
## Mean :0.412 Mean :35.98 Mean : 9.214 Mean : 5.224
## 3rd Qu.:1.000 3rd Qu.:43.00 3rd Qu.:13.000 3rd Qu.:10.000
## Max. :1.000 Max. :50.00 Max. :30.000 Max. :10.000
## busexp educ educpar saving city
## Min. :0.000 Min. :1.000 Min. :1.00 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :2.000 Median :2.00 Median :0.000 Median :0.000
## Mean :0.232 Mean :2.016 Mean :2.16 Mean :0.244 Mean :0.498
## 3rd Qu.:0.000 3rd Qu.:3.000 3rd Qu.:3.00 3rd Qu.:0.000 3rd Qu.:1.000
## Max. :1.000 Max. :4.000 Max. :4.00 Max. :1.000 Max. :1.000
## marital
## Min. :0.000
## 1st Qu.:0.000
## Median :1.000
## Mean :0.614
## 3rd Qu.:1.000
## Max. :1.000
3.6.2 Step 2: Preparing the Data
3.6.2.1 Normalizing the Data
The k-NN method - like most methods for grouping or classifying objects - is based on distances between objects.
However, we are dealing with variables that are expressed in different units.
- There are dummy variables that take on value of 0 or 1
- age is expressed in years (with a maximum of 50)
- skills is measured on a 10-point scale
- educational level (educ) is measured on a 4-point scale.
To bring all variables to the same level, we can either normalize or standardize them.
Often we use normalization, which places the value of an object on a scale that runs from 0 (the minimum value of the variable) to 1 (the maximum). The normalized values always lie between 0 and 1, which is a nice feature.
Note that the dummy variables in this method remain unchanged!
We can determine the normalized value for each variable separately.
For example for age we can write the following lines:
# Example: normalizing age
knndata2$age_n <- (knndata2$age-min(knndata2$age))/(max(knndata2$age)-min(knndata2$age))
head(knndata2[,c("age","age_n")])
## age age_n
## 1 49 0.9655172
## 2 29 0.2758621
## 3 40 0.6551724
## 4 33 0.4137931
## 5 27 0.2068966
## 6 40 0.6551724
Just to check: someone aged 29 has a normalized age of \((29-21)/(50-21) = 0.2759\).
This method requires a lot of typing, and is very error-prone. Whenever things are repetitive, consider writing a function!
We will write a function called normalizer().
# normalize data
normalizer <- function(x){
return((x-min(x))/(max(x)-min(x)))
}
The normalizer() function is defined as an object.
Once run, the function will appear as an object in the Environment pane (top right) of RStudio.
Between the braces "{" and "}" we state what the function does.
We normalize the argument (x) of the function, with a simple formula that makes use of standard functions in R, namely max() and min().
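A quick sanity check, with three made-up ages, shows the function at work:
normalizer(c(21, 35, 50))   # 21 maps to 0, 50 to 1, and 35 to (35-21)/(50-21)
## [1] 0.0000000 0.4827586 1.0000000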
We can now apply the function to the 10 variables in our file.
We cannot and should not normalize the label variable: it's a factor variable which cannot be normalized!
This all can be done in one step by applying the function to a list of data. A data frame is in fact a list of vectors pasted together.
To apply operations to a list, we use the lapply() function.
We write the normalized data in a new data frame knndata2_n. Note again that we leave out the first column (the label variable: accept or reject).
# Normalized data; leave out the label variable acceptFac (in the first column)
knndata2_n <- as.data.frame(lapply(knndata2[,2:11], normalizer))
summary(knndata2_n)
## male age exp skills
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.2759 1st Qu.:0.1333 1st Qu.:0.0000
## Median :0.000 Median :0.5517 Median :0.2667 Median :0.2222
## Mean :0.412 Mean :0.5164 Mean :0.3071 Mean :0.4693
## 3rd Qu.:1.000 3rd Qu.:0.7586 3rd Qu.:0.4333 3rd Qu.:1.0000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## busexp educ educpar saving
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.000 Median :0.3333 Median :0.3333 Median :0.000
## Mean :0.232 Mean :0.3387 Mean :0.3867 Mean :0.244
## 3rd Qu.:0.000 3rd Qu.:0.6667 3rd Qu.:0.6667 3rd Qu.:0.000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.000
## city marital
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :1.000
## Mean :0.498 Mean :0.614
## 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.000 Max. :1.000
The output of summary() confirms that the normalizer() function does what it is supposed to do: the minima and maxima of all variables are now all 0 and 1. The variables are now expressed in comparable units. Or unit-free, if you prefer.
Normalizing is an important step. The software will happily work with data that has not been normalized, without giving any warning. But the variables with the greatest range (here, age) then carry much more weight, and that is probably not what we want. Computers are smarter than you think. But never smarter than you are.
3.6.2.2 Splitting the Data into a Training Set and a Test Set
The next step is to split the data into a training and a test set.
With the training set we want to develop a model that predicts grouping (accept or reject) based on the 10 criteria that we suspect are important.
But we must determine whether the model works properly on a test set of data not used in the development of the model.
A recurring question is the size of the two sets.
On the one hand, we naturally want to use as much information as possible to determine the best possible model.
But at the same time, we want to have a test set that is large enough to make statistically sound statements about the predictive power of the model.
Commonly used rules of thumb are to use 20% or 25% of the data set for testing the model. But with very large data sets, that percentage can be adjusted downwards.
Another issue is how to split the data into a training and a test set.
If the data are in random order then we can simply use the first 80% of the records for the training set and the last 20% for the test set. But if the data is sorted in some way (for example by age, or by the label variable) then it is very unwise to follow this method. In general it is better to take a random sample from the data file.
Often, we are not entirely sure of how the data are organized. So, a random sample is recommended!
Drawing a sample is done as follows.
We know that the data file has about 500 records (but we can let R do the work with the nrow() function).
For convenience, we want to have a test set of exactly 100 records.
Functions dealing with samples and random numbers are actually not truly random: given a certain starting point (indicated by a number that we call the seed), the result of the draws is fixed. This is often useful, because results can be reproduced with that information.
The starting point we chose with set.seed() is 12321. Running set.seed(12321) produces the same results as described below. Using a different starting point, or not running set.seed() at all, gives different results. However, if the sample is large enough, the conclusions are bound to be the same.
The sample stored in the vector train_set is a set of 400 unique numbers. The size of our file turns out to be 500 (from the nrow() function), and we ourselves determined that the test set must have a size of 100 records.
set.seed(12321) # for reproducibility of the results!
x <- nrow(knndata2_n) # determine number of observations (500)
train_set <- sample(x,x-100) # sample a training set (and leave 100 for testing)
str(train_set) # show me the structure of the training set
## int [1:400] 351 262 126 56 360 271 406 308 26 173 ...
# use the vector to split our data
# this must be applied to both the criteria and the label variable!
knn_train <- knndata2_n[train_set,]
knn_test <- knndata2_n[-train_set,]
knn_train_label <- knndata2[train_set, 1]
knn_test_label <- knndata2[-train_set, 1]
str(knn_train)
## 'data.frame': 400 obs. of 10 variables:
## $ male : num 1 1 0 1 1 0 1 0 0 1 ...
## $ age : num 0.448 0.862 0.586 0.862 0.241 ...
## $ exp : num 0.167 0.2 0.5 0.933 0.267 ...
## $ skills : num 0.222 0.111 1 0.111 0.222 ...
## $ busexp : num 0 0 1 1 1 0 1 1 0 0 ...
## $ educ : num 0 0 0.333 0 0 ...
## $ educpar: num 0 0 0 0.667 0.667 ...
## $ saving : num 0 0 0 0 0 0 0 0 0 0 ...
## $ city : num 1 1 0 1 0 0 1 1 0 1 ...
## $ marital: num 1 0 1 1 0 1 1 0 0 1 ...
str(knn_test)
## 'data.frame': 100 obs. of 10 variables:
## $ male : num 1 0 1 0 1 0 1 1 0 0 ...
## $ age : num 0.966 0.276 0.103 0.897 1 ...
## $ exp : num 0.0667 0.0333 0.1667 0.5333 0.8667 ...
## $ skills : num 0.222 0 0.889 1 0.111 ...
## $ busexp : num 0 0 0 1 1 0 0 1 0 0 ...
## $ educ : num 0 0 1 0 0 ...
## $ educpar: num 0 0 0 0 0.667 ...
## $ saving : num 0 0 0 0 0 1 0 0 0 0 ...
## $ city : num 0 1 1 1 0 1 0 0 1 1 ...
## $ marital: num 1 0 0 1 1 1 1 1 0 1 ...
str(knn_train_label); length(knn_train_label)
## Factor w/ 2 levels "Reject","Accept": 1 1 2 1 1 1 2 2 2 1 ...
## [1] 400
str(knn_test_label) ; length(knn_test_label)
## Factor w/ 2 levels "Reject","Accept": 1 1 1 2 1 2 1 1 1 1 ...
## [1] 100
===
The name of the technique, market basket analysis, seems to limit its application. But the principles of association rule analysis, with the core concepts of support, confidence and lift, are widely applicable, as is evident from the Titanic data video (see 2) and from our example below.
As an example of market basket analysis we will use a small data set with characteristics of 10 people who have shown criminal behavior.
Psychologists have interviewed these individuals and coded the information from the conversations using keywords (such as "drugs" if there was a drug addiction; "divorced", and so on).
The question now is whether these characteristics (items) are related to each other. For example, are drug users relatively often divorced; or vice versa, do criminals who are divorced more often use drugs?
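As a quick reminder of the core concepts mentioned above: the support of an item set is the share of records (transactions) that contain it; the confidence of a rule \(X \Rightarrow Y\) is the share of records containing \(X\) that also contain \(Y\); and the lift is the confidence divided by the support of \(Y\), so a lift above 1 means that \(X\) and \(Y\) occur together more often than expected by chance. In our example, the confidence of the hypothetical rule {drugs} \(\Rightarrow\) {divorced} is the share of drug users in the file who are also divorced.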
Note that this technique is geared toward finding associations. It is unsupervised, as we do not have a target variable to explain or predict. Nor do we have a theory that uses logical reasoning to link the LHS to the RHS.
That said, there often is a logic to the associations found. Consumers who buy bread may also buy peanut butter and chocolate sprinkles to use on the bread. But the point is that we do not start analyzing the data to test a priori theories. We will keep associations even if we do not fully understand their logic, like the beer-and-diapers example in the video!
We can record the information from the interviews in a database.
Such a data file (in Excel) can look like this:
In the first interview, for example, it turns out that the interviewee has divorced parents and has been guilty of shoplifting.
Shoplifting does not occur in any of the other interviews.
One conclusion is that shoplifting is always associated with divorced parents (although the evidence for this conclusion, with only one case of shoplifting, is pretty thin).
Conversely, there are several interviewees with divorced parents who have not been guilty of shoplifting.
Note that, due to the lack of structure, it is not so easy to identify these types of relationships, even in this small data set!
A number of things stand out in the database.
- The structure is different from what we are used to.
Columns normally have a fixed meaning: each column contains a variable with an unambiguous meaning or label (e.g. age or gender). The columns in this file, however, are the "loose" comments noted down by the psychologist during the interview. The drugs comment, for example, appears as the first comment for one person (person 2) and as the third comment for another (person 3). It is not relevant to this example, but in principle the interviewer can code a comment twice or more in one and the same interview.
- The number of filled columns differs from one person to the next. While person 3 has five comments, person 1 has only two.
This way of storing data has great advantages.
Think of the thousands of products offered by supermarkets, out of which only a very small proportion is part of each transaction.
If every column were to represent a product (item), then every record in the data file would mainly consist of zeros or empty cells! An online store would have to create a line with as many columns as there are items in the assortment, even though each transaction includes only one or a few items.
In its quest to be all things to all people, Amazon has built an unbelievable catalog of more than 12 million products, books, media, wine, and services. If you expand this to Amazon Marketplace sellers, as well, the number is closer to more than 350 million products.
Source: www.bigcommerce.com
In our small example, too, this method of data storage has advantages: the interviewer can note down the comments during the interview, without having to worry about the sequence and the number of possible comments.
In the same vein, the cashier in the supermarket also scans the products in any "random" order in which they are placed on the belt!
Self test: try to see for yourself if you can detect some pattern in the data. Although there are only 10 people in the file and the number of comments has a maximum of 5, it is not an easy task! Algorithms help!
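As a minimal sketch of how such an analysis could be set up in R: the code below assumes the arules package is installed, and the file name interviews.csv and its "basket" layout are hypothetical (one interview per row, keywords separated by commas), not part of the example above.
library(arules)   # package for association rule mining (assumed to be installed)
# read the (hypothetical) keyword file: one interview per row, items separated by commas
interviews <- read.transactions("interviews.csv", format = "basket", sep = ",")
# mine rules with minimum support and confidence thresholds (values chosen for illustration)
rules <- apriori(interviews, parameter = list(supp = 0.2, conf = 0.6))
inspect(sort(rules, by = "lift"))   # show the rules, strongest lift first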