Part 5 Support Vector Machines

I mentioned two chapters back that I constructed two variants of this classification algorithm: one with a radial kernel and one with a linear kernel.

If none of the above makes sense, I will explain what these terms mean in this chapter.

5.1 What is a Support Vector Machine?

Support Vector Machines (SVMs) are a kind of supervised machine learning algorithm used for classification (and also regression) problems (e.g., given a set of features, can you predict which class sample A belongs to?).

The way an SVM does classification is via hyperplanes - a plane with one dimension fewer than the data (e.g., if your data had three dimensions, an SVM would attempt to divide it with 2D planes). To be more specific, an SVM works by finding the hyperplane that most distinctly separates the data points into two or more categories.

I think we should start off with a simple example of how an SVM works.

5.2 Linearly Separable Data

Suppose that I want to classify iris flowers into two species, setosa and virginica, based on the lengths and widths of their petals:
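For reference, a scatterplot along these lines can be reproduced with a few lines of base R using the built-in iris dataset (a minimal sketch - the colours and styling of the actual figure may differ):

```r
# Subset the built-in iris data to the two species of interest
two_species <- droplevels(subset(iris, Species %in% c("setosa", "virginica")))

# Scatterplot of petal width against petal length, coloured by species
plot(Petal.Width ~ Petal.Length, data = two_species,
     col = c("purple", "darkgreen")[two_species$Species],
     pch = 19,
     xlab = "Petal Length (cm)", ylab = "Petal Width (cm)",
     main = "Iris: setosa vs. virginica")
legend("topleft", legend = levels(two_species$Species),
       col = c("purple", "darkgreen"), pch = 19)
```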

It is apparent from the above graph that one way I could go about doing this with a support vector machine is to draw diagonal lines like so:

However, which line actually gets picked in the end? The line that maximizes the distance to the nearest data points of each class ends up getting picked - so, our support vector machine might end up looking something like this:

Granted - this may not be the most accurate depiction of a support vector machine, but the diagonal line is called the decision boundary. Anything that falls below the decision boundary is determined to be of class setosa; anything that falls above is of class virginica.

Furthermore, the points that lie the closest to the decision boundary are called the support vectors. The distance between the support vectors and the decision boundary is called the margin.

The main idea is that the further away the support vectors are from the decision boundary, the higher the probability that the SVM will classify new points correctly. The support vectors themselves are crucial here: since the SVM algorithm aims to maximize the distance between the decision boundary and the support vectors, a change in a support vector's position can shift the decision boundary!
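To make this concrete, here is a minimal sketch (assuming the e1071 package is installed) of fitting a linear SVM to the two-species subset and pulling out its support vectors - the points stored in `$SV` (and indexed by `$index`) are the ones lying closest to the decision boundary:

```r
library(e1071)

# Two-species subset (same as in the earlier plotting sketch)
two_species <- droplevels(subset(iris, Species %in% c("setosa", "virginica")))

# Fit a linear SVM on petal length and width
fit_linear <- svm(Species ~ Petal.Length + Petal.Width,
                  data = two_species, kernel = "linear")

fit_linear$index  # row numbers of the support vectors in the training data
fit_linear$SV     # their (scaled) coordinates
```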

5.2.1 What if we had data that looked like this (see below)?

There are a couple of pink points in the above plot that look like they’re in the region of the scatterplot with green points.

Such points are treated as "outliers" by the support vector machine algorithm - one of the perks of using an SVM is that the algorithm is relatively resistant to outliers.

That said, this is roughly what the linear SVM would look like:

So, the linear SVM performs much the same as before (i.e., as on the data without outliers), except that for each pink point that crosses the decision boundary (i.e., the blue line), a penalty term is added to the optimization problem that defines the hyperplane.
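In e1071, the size of this penalty is controlled by the cost argument of svm() (the "C" in C-classification). A minimal sketch with some simulated overlapping data (purely illustrative, not the actual data plotted above):

```r
library(e1071)
set.seed(1)

# Simulated two-class data with a little overlap, mimicking the plot above
n <- 50
dat <- data.frame(
  X = c(rnorm(n, mean = 1), rnorm(n, mean = 3)),
  Y = c(rnorm(n, mean = 1), rnorm(n, mean = 3)),
  class = factor(rep(c("pink", "green"), each = n))
)

# A small cost tolerates points on the wrong side of the margin...
fit_soft <- svm(class ~ X + Y, data = dat, kernel = "linear", cost = 0.1)

# ...while a large cost penalizes each violation heavily
fit_hard <- svm(class ~ X + Y, data = dat, kernel = "linear", cost = 100)

# Fewer support vectors usually indicates a narrower, stricter margin
c(soft = nrow(fit_soft$SV), hard = nrow(fit_hard$SV))
```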

If you are interested in what these penalties are, I have included a fair chunk of the mathematics behind SVMs at the back of this chapter.

5.3 Non-Linearly Separable Data

Suppose that we have data that looks like this:

It is clear that we cannot use a linear decision boundary to divide this data. However, the points still look well clustered - so what the SVM algorithm can do is add a third dimension.

Up until now, the data has had two dimensions: \(X\) and \(Y\). In the SVM algorithm, a kernel function transforms the data into a higher dimension - in this case, a third dimension, \(Z\).

For the sake of simplicity, let’s suppose that the kernel maps the data via the equation of a paraboloid:

\[\begin{equation} z = \frac{x^2}{a^2} + \frac{y^2}{b^2} \tag{5.1} \end{equation}\]

Note that for \(a = 1\) and \(b = 1\), equation (5.1) reduces to \(z = x^2 + y^2\), so each horizontal slice of the paraboloid (i.e., a fixed value of \(z\)) is a circle!

Nonetheless, when mapped into three dimensions, the data looks something like this:

From here, it looks like we can draw a hyperplane at \(Z = 5\) or somewhere along those lines. Hence, substituting \(z = 5\) into equation (5.1) (with \(a = b = 1\)), we get the circular decision boundary:

\[\begin{equation} x^2 + y^2 = 5 \end{equation}\]

Hence, the decision boundary looks something like this on the 2D scatterplot (not accurate - just a rough drawing I quickly whipped up in base R):
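As a rough sketch of what this looks like in practice (using simulated "circular" data, since the figure above is hand-drawn), a radial-kernel SVM from e1071 recovers an approximately circular decision boundary without any manual mapping:

```r
library(e1071)
set.seed(42)

# Simulate "weird" data: one class inside a disc, the other scattered around it
n <- 200
x <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 < 1.5, "inner", "outer"))
dat <- data.frame(X = x[, 1], Y = x[, 2], class = y)

# A radial kernel handles the circular structure without an explicit mapping
fit_radial <- svm(class ~ X + Y, data = dat, kernel = "radial")

# e1071 provides a plot method that shades the two decision regions
plot(fit_radial, dat)
```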

5.4 Kernel Trick

While we can use equations like (5.1) to map our data to higher dimensions, doing so is computationally expensive and also time-consuming, especially if you have a large amount of data to deal with.

Hence, enter the kernel trick: a way of computing the dot product (in case you've forgotten your high-school Linear Algebra) between two mapped vectors \(x\) and \(y\) directly from their original coordinates, as a faster alternative to explicitly mapping the data like I did in equation (5.1).

There are also different kernels that can be used - this is what the kernel parameter in tune.svm() does: it determines which kernel e1071 will apply to the data. I will talk about the mathematics behind the radial and the linear kernel trick in the Kernels section.
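For instance, here is a hedged sketch of how kernels and their hyperparameters might be compared with tune.svm() (the parameter grids below are purely illustrative, not the ones used elsewhere in this document):

```r
library(e1071)
set.seed(42)

# Cross-validated grid search over cost (and gamma for the radial kernel)
tuned_linear <- tune.svm(Species ~ ., data = iris, kernel = "linear",
                         cost = 10^(-1:2))

tuned_radial <- tune.svm(Species ~ ., data = iris, kernel = "radial",
                         cost = 10^(-1:2), gamma = 10^(-2:1))

# Best cross-validated parameters and error for each kernel
tuned_linear$best.parameters
tuned_radial$best.parameters
tuned_radial$best.performance
```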

Now, let’s go back to the example used in the previous sub-section (i.e., the “Weird Data”). The transformation to the third dimension \(Z\) is given by equation (5.1).

Hence, suppose that we have two vectors \(a\) and \(b\) such that:

\[\begin{align} a = \left[ \begin{array}{c} x_a \\ y_a \\ z_a \end{array} \right] && b = \left[ \begin{array}{c} x_b \\ y_b \\ z_b \end{array} \right] \end{align}\]

Hence, the dot product \(a \cdot b\) is:

\[\begin{align} a \cdot b &= x_ax_b + y_ay_b + z_az_b \\ &= x_ax_b + y_ay_b + ({x_a}^2 + {y_a}^2) \times ({x_b}^2 + {y_b}^2) \end{align}\]

Thereafter, the support vector machine determines the best hyperplane to use as a decision boundary, except that it now works with these dot products rather than the explicitly mapped coordinates.
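To see the trick numerically, here is a small, purely illustrative R check that the value computed directly from the two-dimensional coordinates matches the dot product of the explicitly mapped three-dimensional vectors (using equation (5.1) with \(a = b = 1\)):

```r
# Explicit mapping to 3D: (x, y) -> (x, y, x^2 + y^2)
map_3d <- function(p) c(p[1], p[2], p[1]^2 + p[2]^2)

# "Kernel" version: the same quantity computed from the 2D coordinates alone
kernel_value <- function(p, q) {
  sum(p * q) + (p[1]^2 + p[2]^2) * (q[1]^2 + q[2]^2)
}

a2 <- c(1, 2)   # (x_a, y_a)
b2 <- c(3, -1)  # (x_b, y_b)

sum(map_3d(a2) * map_3d(b2))  # dot product after the explicit mapping: 51
kernel_value(a2, b2)          # same number, no mapping required: 51
```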