Chapter 5 Assignment

5.1 kNN Method

After studying the topic and getting some practice with the examples, it is time to try it yourself!

The kNN assignment uses a (small) file of passenger cars that were sold on the American market in the 1980s.

The file covers 74 makes and models, both American and foreign.

We have information on price, head and trunk space, length, fuel consumption (kilometers per liter), and so on.

Let's use the kNN (k-nearest neighbors) method to predict whether a car is American or non-American.

The tasks and questions are:

  • Read the file (cars.csv)
  • Describe the file:
      • How many records?
      • How many variables?
      • What are the minimum, mean, and maximum values of those variables?
  • Prepare the data before applying the kNN method.
  • Determine the "label" variable!
  • Normalize the file (excluding the "label" variable!). The resulting values of all variables should be in the range from 0 to 1.
  • Create a training and test set. Use 80% of the records for the training set. Note that the data may be sorted by important characteristics, so select the records at random rather than taking the first 80%!
  • Determine k.
  • Train the model, and evaluate the model: how good are the predictions in the test set?
  • What are possible improvements to the model? Apply them!
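
The preparation and modelling steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not a model answer: the data frame `toy` and its values are made up to stand in for cars.csv, the helper `normalize()` is a hypothetical name, and the choice of k (roughly the square root of the training set size, made odd) is just one common rule of thumb. The `knn()` function comes from the `class` package.

```r
library(class)  # provides knn()

set.seed(123)   # make the random split reproducible

# Hypothetical toy data standing in for cars.csv (values are made up)
toy <- data.frame(
  origin = factor(rep(c("usa", "other"), each = 10)),
  price  = c(runif(10, 4000, 8000), runif(10, 3500, 7000)),
  weight = c(runif(10, 1400, 1900), runif(10, 900, 1300)),
  length = c(runif(10, 450, 560), runif(10, 380, 450))
)

# Min-max normalization: rescales a numeric vector to the range 0 to 1
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Normalize every column except the label variable
toy_n <- as.data.frame(lapply(toy[, names(toy) != "origin"], normalize))

# Random 80/20 split, so that any sorting in the file does not matter
n         <- nrow(toy)
train_idx <- sample(n, size = round(0.8 * n))

train_x <- toy_n[train_idx, ]
test_x  <- toy_n[-train_idx, ]
train_y <- toy$origin[train_idx]
test_y  <- toy$origin[-train_idx]

# Rule of thumb (an assumption, not a requirement): k near sqrt(n_train),
# made odd to reduce tied votes between the two classes
k <- round(sqrt(length(train_idx)))
if (k %% 2 == 0) k <- k + 1

# knn() classifies each test record by majority vote among its k nearest
# training records
pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)

# Confusion table and share of correct predictions in the test set
conf_mat <- table(actual = test_y, predicted = pred)
accuracy <- mean(pred == test_y)
print(conf_mat)
print(accuracy)
```

Rerunning the sketch with other values of `k`, or with a different subset of variables, is one way to explore possible improvements.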

Tip: the data are on GitHub.

You can download the data, or read the data directly using:

MyData <- read.csv("https://raw.githubusercontent.com/ssmresearch/hanminor/main/cars.csv")
head(MyData)
##   origin price mileage repair headspace trunkspace weight length turningcircle
## 1    usa  4099     8.8      3      6.25        308 1318.5  465.0         12.20
## 2    usa  4749     6.8      3      7.50        308 1507.5  432.5         12.20
## 3    usa  3799     8.8      3      7.50        336 1188.0  420.0         10.68
## 4    usa  4816     8.0      3     11.25        448 1462.5  490.0         12.20
## 5    usa  7827     6.0      4     10.00        560 1836.0  555.0         13.12
## 6    usa  5788     7.2      3     10.00        588 1651.5  545.0         13.12
##   gear_ratio
## 1       3.58
## 2       2.53
## 3       3.08
## 4       2.93
## 5       2.41
## 6       2.73
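
Once the data are read in, the descriptive questions can be approached along these lines. The sketch below is only an illustration: to keep it self-contained it rebuilds a small stand-in for `MyData` from a few columns of the six rows printed above, so its counts are not the answers for the full file.

```r
# Small stand-in for MyData, copied from the head() output above
# (only six rows and four of the columns)
MyData <- data.frame(
  origin  = c("usa", "usa", "usa", "usa", "usa", "usa"),
  price   = c(4099, 4749, 3799, 4816, 7827, 5788),
  mileage = c(8.8, 6.8, 8.8, 8.0, 6.0, 7.2),
  weight  = c(1318.5, 1507.5, 1188.0, 1462.5, 1836.0, 1651.5)
)

nrow(MyData)          # number of records
ncol(MyData)          # number of variables
str(MyData)           # type of each variable
summary(MyData)       # minimum, mean, and maximum (plus quartiles) per numeric variable
table(MyData$origin)  # how often each value of the candidate label occurs
```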

5.2 K-means Clustering

Let's apply K-means clustering on the same data set.

The task is to find an as yet unknown number of clusters of, in this case, makes and models of cars.

There is no criterion (label) variable to predict here, so we do not split the data into a training and test set!

The questions and assignments are:

  • Read the file (cars.csv) if you haven't done so already.
  • You have seen from the description in the previous assignment that the variable for origin (US versus non-US) is a factor variable. We cannot calculate distances from a factor variable. Because we want to include it anyway, we have to turn it into a dummy (0/1) variable.
  • Normalize the data.
  • Determine the number of clusters using the (graphical) method described in the lesson material.
  • Determine the clustering, and add the cluster to the data set.
  • Describe the clusters in terms of all variables used in the clustering.
  • Characterize ("label") the clusters.
  • Repeat the exercise with more or fewer clusters, and decide if the new solutions are better than the original solution!
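
The clustering steps above can be sketched as follows. This is a hedged illustration, not a model answer: the data frame `toy` and its values are made up to stand in for cars.csv, the helper `normalize()` is a hypothetical name, the "graphical method" is assumed here to be an elbow plot of the total within-cluster sum of squares, and the choice of three clusters is arbitrary.

```r
set.seed(123)  # kmeans() starts from random centers

# Hypothetical toy data standing in for cars.csv (values are made up)
toy <- data.frame(
  origin = factor(rep(c("usa", "other"), each = 15)),
  price  = c(runif(15, 4000, 8000), runif(15, 3500, 7000)),
  weight = c(runif(15, 1400, 1900), runif(15, 900, 1300)),
  length = c(runif(15, 450, 560), runif(15, 380, 450))
)

# Dummy variable: 1 for US cars, 0 for all others
toy$usa <- as.integer(toy$origin == "usa")

# Min-max normalization of all variables used in the clustering
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
vars  <- c("usa", "price", "weight", "length")
toy_n <- as.data.frame(lapply(toy[, vars], normalize))

# Elbow method: total within-cluster sum of squares for 1 to 8 clusters
wss <- sapply(1:8, function(k) {
  kmeans(toy_n, centers = k, nstart = 20)$tot.withinss
})
plot(1:8, wss, type = "b",
     xlab = "Number of clusters", ylab = "Total within-cluster SS")

# Fit the chosen solution (3 is an arbitrary choice here) and add the
# cluster number to the data set
fit <- kmeans(toy_n, centers = 3, nstart = 20)
toy$cluster <- fit$cluster

# Describe the clusters: size, and mean of each original variable per cluster
table(toy$cluster)
aggregate(toy[, vars], by = list(cluster = toy$cluster), FUN = mean)
```

Changing `centers` and comparing the cluster means and sizes is one way to judge whether a solution with more or fewer clusters is easier to interpret.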