Chapter 9 k Nearest Neighbors
9.1 k nearest neighbors and memory-based learning
K nearest neighbors (kNN) is another example of a supervised model, and perhaps the simplest and most intuitive model used for data mining. Despite its simplicity, it can work quite well with large data sets, and it ranks among the top 10 data mining tools in terms of popularity of usage. (“Top 10 Machine Learning Algorithms You Should Know in 2021,” n.d.)
kNN is an example of a family of algorithms known as instance-based or memory-based learning, which classify new objects by their similarity to previously known objects. kNN requires few assumptions about the input or predictor data and none about error distributions; formally, it is a non-parametric model. In parametric models (such as multiple regression), you must make assumptions about the distribution of the error term, and the parameters of that distribution are then estimated. No such considerations are involved with kNN.
A training phase is used to determine the best value for k (the number of “neighbors” used to compute similarity), but otherwise no training is needed: if you have a data set, you simply apply kNN directly to all of the data. An important disadvantage of kNN is that the entire data set must be scanned each time a new observation needs to be classified, which can be time consuming for large data sets. With logistic regression or OLS, you save only the fitted model and apply it to new data, which is much faster. Since there is no “model” per se in k nearest neighbors, the model essentially is re-created each time an analysis is needed.
9.2 Typical applications
If we look at typical applications, the list is much like that for other predictive models such as logistic regression and decision trees:
- Flagging fraudulent insurance claims.
- Predicting customer response to promotional offers.
- Selecting the most effective treatment for a medical condition.
- Classifying free-text responses.
- Recommending the next offer to a customer in retail settings.
- Searching for similar documents.
- Recommender systems.
9.3 What is kNN?
KNN is a method for classifying objects based on similarity. It is called a “lazy” algorithm, which means that it does not use the training data points to do any generalization; it is contrasted with “eager” algorithms. In other words, there is no explicit training phase, or it is very minimal, and most lazy algorithms – especially kNN – make decisions based on the entire data set. The distinctions between lazy and eager learners are described in Table 9.1 and summarized here.
- With lazy learners
- The data is stored rather than learned from.
- Classifications are made as soon as a new observation is received.
- The approach is more flexible since it does not rely on a single pre-specified model.
- Time to classify new observations can be considerable with large data sets.
- Learning time is non-existent, apart from determining the optimal value of k and minor preprocessing.
- With eager learners
- A model is developed (learned) based on the training data.
- The model is used to classify new observations as they are received.
- The model depends upon a single function that is derived in the learning phase.
- Classification of new observations is very fast.
- Learning time can be considerable.
The algorithm works like this. A labeled data set with a known categorical target vector, Y (which may be binary or multichotomous) and a matrix X with p potential predictors or features is used. Each case is considered a point in a multidimensional feature space. This allows the computation of “distances” among the points (or cases) based on locations (values) in the feature space.
KNN is then used to classify a new observation that is described by p variables (X1, X2, …, Xp). To do so, the algorithm computes the distance of the new observation to every other observation in the data set. Euclidean distance is typically used, but other metrics such as the absolute value of differences, squared Euclidean distance, Jaccard distance, Manhattan distance, or cosine distance are also used. In most cases it is important to standardize the X variables prior to determining distances, to avoid having the variables with the largest numerical values dominate the computed distances.
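The common distance metrics are each a one-line computation. The sketch below compares three of them on two illustrative standardized observations (the vectors are made up for this example, not taken from the chapter's data):

```python
import math

# Two hypothetical observations, already standardized (illustrative values).
a = [0.5, -1.2, 0.3]
b = [-0.4, 0.8, 1.1]

# Euclidean distance: square root of the summed squared differences.
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Manhattan distance: sum of the absolute differences.
manhattan = sum(abs(x - y) for x, y in zip(a, b))

# Cosine distance: 1 minus the cosine of the angle between the vectors.
cosine = 1 - (
    sum(x * y for x, y in zip(a, b))
    / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
)

print(round(euclidean, 3), round(manhattan, 3), round(cosine, 3))
```

Note that the three metrics are on different scales and can rank the same pair of points differently, which is why the choice of metric matters.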
The distances of the new observation to each row of the data set are computed and ranked from smallest to largest. The k smallest distances (the “nearest neighbors”) are selected. No assumptions about the form of the relationship between Y and the Xi’s are made, and no parameters are estimated. A majority “vote” of the k neighbors is taken, and the category receiving the most votes is selected as the predicted class. k – the number of nearest points, or neighbors, to consider in making an assignment to a category – is usually an odd number so that ties are avoided; if an even number is set for k, then ties in the vote are broken by a random choice of Y class. Since a majority vote is used to make predictions, kNN can be applied to targets with two or more possible outcomes.
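The rank-and-vote procedure can be sketched in a few lines of Python. This is a minimal illustration of the algorithm, not the KNIME implementation; the function name and the toy data are hypothetical:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, new_x, k=3):
    """Classify new_x by majority vote of its k nearest training points,
    using Euclidean distance on (already standardized) features."""
    # Rank every training point by its distance to the new observation.
    dists = sorted(
        (math.dist(row, new_x), label) for row, label in zip(train_X, train_y)
    )
    # Keep the k smallest distances and tally the class votes.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy standardized data with two classes; with odd k and two classes,
# no vote ties can occur.
X = [[0.0, 0.1], [0.2, 0.0], [1.8, 2.0], [2.1, 1.9], [0.1, -0.1]]
y = ["A", "A", "B", "B", "A"]
print(knn_predict(X, y, [0.0, 0.0], k=3))  # the three nearest points are all "A"
```

Note that `knn_predict` must scan the entire training set for every prediction, which is exactly the deployment cost discussed below.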
There is a trade-off implicit here. While there is only a minimal training phase used to determine the best k, deployment of kNN is not efficient in terms of processing time and memory requirements. More time is needed because all data points might take part in determining a prediction. Furthermore, all of the data must be stored and available for deployment.
9.4 A two-dimensional graphic example of kNN
A very simple example is used to illustrate how kNN works. Figure 9.1 shows a plot of characteristics of different types of homes. The two predictors are area in square feet and number of rooms. Both variables are standardized to zero mean and unit standard deviation.
Toward the center of the chart is an “unknown” type of home. The area and number of rooms for the unknown were standardized using the same mean and standard deviation as were computed for the known data. The distances of the unknown to every data point in the original set are represented by the arrows. The three closest known observations are circled, assuming for this example that k = 3. Two of the closest homes are flats and one is an apartment. Using the majority rule, the unknown is classified as a flat. (See Figure 9.1)
9.5 Example of kNN: Diagnosing heart disease
This database is part of a larger study of factors associated with heart disease collected at the Cleveland Clinic. The data set contains test results on 303 patients measured on 13 attributes plus a target variable indicating the presence or absence of heart disease. This data set has been used in several published studies on machine learning. The variables are listed in Figure 9.2 along with an indication of each as numeric or categorical. For each of the categorical variables, the levels are shown as well.
KNIME was used to determine the best value for k by finding the accuracy for values of k from 1 to 20. A plot of Accuracy versus k is shown in Figure 9.3, which indicates that the best k was 7.
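The search over k works by scoring each candidate value on held-out data and keeping the most accurate one. A self-contained sketch of that loop, with synthetic two-cluster data standing in for the heart data (all names and values here are illustrative), might look like:

```python
import math
import random
from collections import Counter

random.seed(0)
# Synthetic stand-in for the heart data: two labeled Gaussian clusters.
data = [([random.gauss(c, 1.0), random.gauss(c, 1.0)], c)
        for c in (0, 3) for _ in range(40)]
random.shuffle(data)
train, test = data[:56], data[56:]  # roughly a 70/30 split

def accuracy(k):
    """Fraction of test rows whose majority-vote kNN prediction is correct."""
    hits = 0
    for x, label in test:
        dists = sorted((math.dist(row, x), lab) for row, lab in train)
        vote = Counter(lab for _, lab in dists[:k]).most_common(1)[0][0]
        hits += vote == label
    return hits / len(test)

# Try k = 1..20 and keep the value with the highest test accuracy.
best_k = max(range(1, 21), key=accuracy)
print(best_k, round(accuracy(best_k), 3))
```

Plotting `accuracy(k)` against k reproduces the kind of curve shown in Figure 9.3.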
| Node | Type | Description |
|---|---|---|
| 1 | File Reader | Read the file heart_data.csv. |
| 2 | Data Explorer | Obtain histograms and descriptive statistics. |
| 3 | Number To String | Convert the numeric target to a string class column. |
| 4 | Partitioning | Create a 70/30 split into training and test partitions. |
| 5 | K Nearest Neighbor | Classify the test data using k = 7. |
| 6 | Scorer | Calculate performance metrics and the confusion matrix. |
The confusion matrix for the kNN model with k = 7 is in Table 9.2.
Accuracy measures for the kNN analysis are in Table 9.3.
9.6 kNN for continuous targets
While kNN is primarily a method for classification, it can also be used with continuous target variables, much like ordinary least squares (OLS) regression. KNIME does not include a node for kNN regression, so a small R Snippet was created using the package FNN. One advantage of kNN regression is that non-linear relationships are easily captured, whereas ordinary regression requires transformations and/or added predictors to capture non-linearities.
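The only change from classification is the final step: instead of a majority vote, the prediction is the average of the k nearest neighbors' target values. A minimal Python sketch of this idea (the function name and toy data are hypothetical; FNN's `knn.reg`, used later in this section, works on the same principle):

```python
import math

def knn_regress(train_X, train_y, new_x, k=3):
    """Predict a continuous target as the mean y of the k nearest
    training points, by Euclidean distance."""
    dists = sorted((math.dist(row, new_x), y) for row, y in zip(train_X, train_y))
    return sum(y for _, y in dists[:k]) / k

# Toy concave-upward data: y = x squared, with a single predictor.
X = [[x] for x in range(8)]
y = [x * x for x in range(8)]
print(knn_regress(X, y, [3.5], k=2))  # averages the y values at x = 3 and x = 4
```

Because the prediction is a local average, the curvature in the data is followed automatically, with no model form specified in advance.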
A simple data set with a single predictor (X) and a continuous target (Y) was created. A scatterplot of the data is shown in Figure 9.5. Note that the relationship is non-linear and slightly concave upward.
A KNIME workflow was created, shown in Figure 9.6, to compare kNN and OLS. For this small demonstration data, a test data subset was not created.
KNIME does not have a node to perform kNN regression, so the following R code was inserted in Node 7, an R Snippet node. Note that this code will have to be customized for new problems.
    library(FNN)                              # provides knn.reg()
    mydata <- as.data.frame(knime.in)         # data passed in from KNIME
    TrainData <- as.data.frame(mydata[, 1])   # predictor X
    TrainTarget <- mydata[, 2]                # continuous target Y
    TestData <- as.data.frame(mydata[, 1])    # predict back onto the same X values
    YKNN <- knn.reg(train = TrainData, test = TestData, y = TrainTarget, k = 3)
    knime.out <- as.data.frame(cbind(mydata, YKNN$pred))  # return predictions to KNIME
The metrics for the two models were:
| Metric | kNN | OLS |
|---|---|---|
| Mean absolute error | 2.545 | 7.209 |
| Root mean squared error | 3.132 | 8.694 |
| Mean absolute percentage error | 0.645 | 1.937 |
The metrics show that kNN more accurately captured the relationship in the data. OLS could be improved by creating a polynomial model, of course, but the point is that kNN regression did not require a model to be specified a priori.
Plots of the predicted Y (on the y-axis) against the actual Y values (on the x-axis) are shown in Figures 9.7 and 9.8 for the two models. Note that the OLS model did not capture the non-linearity and produced a downward concave result.
9.7 kNN for multiclass target variables
KNN is also effective with target variables that have more than two classes. An example data set was obtained from the UCI Machine Learning Repository to illustrate kNN with a multiclass target. The data set has 214 observations, a 6-level categorical target, Type of glass, and nine continuous predictors. The kNN workflow from KNIME is shown in Figure 9.9. The model was run with a 50/50 split between the training and test rows, and k was set to 2, which was found to result in the highest accuracy.
The confusion matrix for predicting type of glass in the test data set is shown in Table 9.4. The accuracy varies by type of glass, and the overall accuracy is just over 73%. While this may not seem very good, predicting six classes from a small data set is not an easy task for the model.
| Actual class | Building float | Building nonfloat | Vehicle windows float | Containers | Tableware | Headlamps |
|---|---|---|---|---|---|---|
| Vehicle windows float | 1 | 2 | 5 | 0 | 0 | 0 |