Chapter 2 Introduction

In this chapter, we will discuss two commonly used methods to classify objects based on their characteristics.

Classification is something we all do on a daily basis.

For example, we all tend to put the people we meet in groups.

When we talk about nerds, we picture someone with characteristics like:

  • young, male, and smart-looking
  • wearing glasses
  • dressed in an unfashionable manner
  • sporting a far-from-modern hairstyle.

A typical image Google returns when searching for "nerd" looks like:

Whether someone who looks like the guy in the picture actually is a nerd in the sense of, say, excelling in IT is a different story. Apparently, though, certain personal and behavioral characteristics go together, at least in our minds.

OK, let's introduce the two methods.

2.1 The k-NN Method

The first method is the k-NN method.

In this abbreviation, NN stands for nearest neighbor.

In short, we label someone a nerd if most of the k people who most resemble our subject are considered nerds.

Resemblance is then judged on characteristics such as age and gender, way of dressing, wearing glasses, and so on.
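To make the idea concrete, here is a minimal sketch in base R. The data, the column names, and the helper function knn_label() are all made up for illustration; they are assumptions, not part of any package. The sketch measures resemblance as the Euclidean distance on two characteristics and lets the k most similar people vote.

```r
# Hypothetical toy data: each row is a person we have already labelled.
people <- data.frame(
  age     = c(19, 22, 21, 45, 38, 50),
  glasses = c(1, 1, 0, 0, 1, 0),   # 1 = wears glasses, 0 = does not
  label   = c("nerd", "nerd", "nerd", "other", "other", "other")
)

# Illustrative helper (not from a package): classify a new person by
# majority vote among the k most similar people in the data.
knn_label <- function(new_person, data, k = 3) {
  features <- as.matrix(data[, c("age", "glasses")])
  # Euclidean distance from the new person to every labelled person
  distances <- sqrt(rowSums(sweep(features, 2, new_person)^2))
  nearest <- order(distances)[1:k]      # row numbers of the k closest people
  votes <- table(data$label[nearest])   # count the labels among them
  names(votes)[which.max(votes)]        # the most frequent label wins
}

knn_label(c(age = 20, glasses = 1), people, k = 3)
```

Note that age dominates the distance in this sketch because it is measured on a much larger scale than the glasses indicator; in practice the characteristics are usually rescaled first. For real work, the knn() function in the class package is a more likely choice than a hand-written helper.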

Is this a form of learning? Not really: we don't learn anything about the underlying process here. We simply persist in our existing habit of classifying people, without asking whether that habit does any good. We therefore refer to the k-NN method as a lazy learner.

Note that in some cases, this line of thinking leads to unethical practices. Ethnic profiling is an example.

The video below has some hidden truths about our habits in classifying people!

2.2 K-Means Clustering

The second, related method is more elegant in that it does not require us to know in advance which groups exist. Instead, we group objects based on a number of selected, relevant characteristics.

The goal is to form groups that are internally homogeneous and distinctly different from one another. Only after the groups have been formed do we give them names.
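As a rough illustration, the sketch below uses the kmeans() function from base R on a small, made-up customer data set (the data and column names are assumptions for this example only). We do have to choose the number of groups beforehand, here two, but not what those groups look like.

```r
set.seed(42)  # k-means starts from random centres, so fix the seed

# Hypothetical customer data: two characteristics per customer.
customers <- data.frame(
  visits_per_month = c(1, 2, 1, 2, 9, 10, 11, 10),
  avg_spend        = c(10, 12, 11, 9, 80, 85, 78, 90)
)

# Ask for two groups. scale() standardises both characteristics so that
# neither dominates the distance; nstart = 10 tries ten random starts.
fit <- kmeans(scale(customers), centers = 2, nstart = 10)

fit$cluster   # group number assigned to each customer

# Average characteristics per group, which helps when naming the groups
aggregate(customers, by = list(group = fit$cluster), FUN = mean)
```

The group numbers themselves are arbitrary. Inspecting the group averages is what lets us attach names afterwards, for instance "occasional visitors" and "frequent big spenders".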

The importance of classification cannot be overstated. The number of applications is virtually endless.

  • Companies use customer data to segment their market.
  • In the medical world, patterns in genetic data can be used to detect diseases.
  • Ethnic profiling by the police is the subject of much public debate!

We will illustrate classification techniques with popular algorithms in R packages.