10.2 Support Vector Classifier
The maximal margin classifier can be generalized to non-separable cases using a so-called soft margin. The generalization is called the support vector classifier. The soft margin allows some misclassification in the interest of greater robustness to individual observations. The support vector classifier maximizes the margin \(M\) subject to
\[y_i (x_i^T \beta + \beta_0) \ge M(1 - \xi_i)\]
where the \(\xi_i\) are nonnegative slack variables whose sum is bounded by a constant tuning parameter, \(\sum{\xi_i} \le \Xi\). The slack variable values indicate where the observations lie: observations with \(\xi_i = 0\) lie on the correct side of the margin; observations with \(\xi_i > 0\) lie on the wrong side of the margin; observations with \(\xi_i > 1\) lie on the wrong side of the hyperplane. \(\Xi\) sets the tolerance for margin violations. If \(\Xi = 0\), then all observations must reside on the correct side of the margin, as in the maximal margin classifier. \(\Xi\) controls the bias-variance trade-off: as \(\Xi\) increases, the margin widens and allows more violations, increasing bias and decreasing variance.
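As a rough illustration of this trade-off (not from the text), the sketch below fits a linear support vector classifier with scikit-learn, which parameterizes the problem with a cost \(C\) rather than a budget \(\Xi\); large \(C\) corresponds roughly to a small budget. The simulated data and the two cost settings are arbitrary choices for illustration.

```python
# Sketch: effect of the violation tolerance on the fitted margin.
# sklearn's cost C is roughly the inverse of the budget Xi above:
# small C tolerates many violations (wide margin, many support vectors),
# large C tolerates few (narrow margin, fewer support vectors).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)),   # two overlapping classes
               rng.normal(+1, 1, size=(50, 2))])
y = np.repeat([-1, 1], 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2 / np.linalg.norm(clf.coef_)   # distance between the two margins is 2 / ||beta||
    print(f"C={C}: {clf.support_.size} support vectors, margin width {margin_width:.2f}")
```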
The support vector classifier is usually defined by dropping the \(||\beta|| = 1\) constraint, and defining \(M = 1 / ||\beta||\). The optimization problem then becomes
\[ \min ||\beta|| \hspace{2mm} s.t. \hspace{2mm} \begin{cases} y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i, \hspace{2mm} \forall i & \\ \xi_i \ge 0, \hspace{2mm} \sum \xi_i \le \Xi. \end{cases} \]
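To make the formulation concrete, here is a small sketch (my own illustration, not the solver used in practice) that hands this problem directly to a general-purpose constrained optimizer, `scipy.optimize.minimize` with SLSQP, treating \((\beta, \beta_0, \xi)\) as the unknowns. It minimizes \(\frac{1}{2}||\beta||^2\), which has the same minimizer as \(||\beta||\), and is only practical for small data sets.

```python
import numpy as np
from scipy.optimize import minimize

def fit_svc_budget(X, y, budget):
    """Solve: min (1/2)||beta||^2 over (beta, beta0, xi)
    s.t. y_i(x_i^T beta + beta0) >= 1 - xi_i, xi_i >= 0, sum(xi) <= budget.
    Labels y are assumed coded as +1/-1; budget must be large enough
    for the constraints to be feasible."""
    n, p = X.shape

    def unpack(z):                       # variables packed as [beta, beta0, xi]
        return z[:p], z[p], z[p + 1:]

    def objective(z):
        beta, _, _ = unpack(z)
        return 0.5 * beta @ beta

    cons = [
        # margin constraints: y_i(x_i^T beta + beta0) - 1 + xi_i >= 0
        {"type": "ineq", "fun": lambda z: y * (X @ z[:p] + z[p]) - 1 + z[p + 1:]},
        # slack nonnegativity and the budget constraint
        {"type": "ineq", "fun": lambda z: z[p + 1:]},
        {"type": "ineq", "fun": lambda z: budget - z[p + 1:].sum()},
    ]
    res = minimize(objective, np.zeros(p + 1 + n), method="SLSQP", constraints=cons)
    return unpack(res.x)                 # beta, beta0, xi
```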
The optimization problem above has a quadratic criterion with linear inequality constraints, so it is a convex optimization problem that can be solved using Lagrange multipliers. Re-express it as
\[ \min_{\beta_0, \beta} \frac{1}{2}||\beta||^2 + C\sum_{i = 1}^N \xi_i \\ s.t. \hspace{2mm} \xi_i \ge 0, \hspace{2mm} y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i, \hspace{2mm} \forall i \]
where the “cost” parameter \(C\) replaces the constant \(\Xi\), penalizing margin violations rather than bounding their sum. This optimization problem is equivalent to the familiar loss + penalty formulation:
\[\min_{\beta_0, \beta} \sum_{i=1}^N{[1 - y_if(x_i)]_+} + \frac{\lambda}{2} ||\beta||^2 \]
where \(\lambda = 1 / C\) and \([1 - y_if(x_i)]_+\) is the “hinge” loss function, with \(f(x_i) = x_i^T\beta + \beta_0\). The population minimizer of the hinge loss is \(sign[Pr(Y = +1|x) - 1/2]\), so minimizing it estimates the Bayes decision rule directly.
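The loss + penalty view also suggests a direct fitting strategy: minimize the criterion by (sub)gradient descent. The sketch below is a minimal NumPy implementation under that reading (labels assumed coded as \(\pm 1\); the learning rate and iteration count are arbitrary illustrative choices), not the quadratic-programming solver typically used in practice.

```python
import numpy as np

def fit_svc_subgradient(X, y, lam=1.0, lr=0.01, n_iter=1000, seed=0):
    """Minimize sum_i [1 - y_i(x_i^T beta + beta_0)]_+ + (lam/2)||beta||^2
    by plain subgradient descent. Labels y are assumed coded as +1/-1."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = rng.normal(scale=0.01, size=p)
    beta0 = 0.0
    for _ in range(n_iter):
        margins = y * (X @ beta + beta0)
        active = margins < 1                       # observations violating the margin
        grad_beta = lam * beta - X[active].T @ y[active]
        grad_beta0 = -y[active].sum()
        beta -= lr * grad_beta
        beta0 -= lr * grad_beta0
    return beta, beta0
```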
The parameter estimates can be written as functions of a set of unknown parameters \(\alpha_i\) and the data points. The solution to the optimization problem requires only the inner products of the observations, represented as \(\langle x_i, x_j \rangle\),
\[f(x) = \beta_0 + \sum_{i = 1}^n {\alpha_i \langle x, x_i \rangle}.\] The solution has the interesting property that only observations that lie on the margin or violate it affect the hyperplane. These observations are known as support vectors. As \(\Xi\) increases, the number of violating observations increases, and thus the number of support vectors increases. This property makes the algorithm robust to observations far from the hyperplane.
The estimated coefficients \(\alpha_i\) are nonzero only for the support vectors in the solution; if a training observation is not a support vector, then its \(\alpha_i\) equals zero.
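This sparseness is visible in fitted models. In scikit-learn's `SVC`, for example, `dual_coef_` stores the signed coefficients \(y_i \alpha_i\) for the support vectors only, and the decision function can be rebuilt from those observations alone. The sketch below is a hedged illustration under that reading (the data, seed, and query point are arbitrary).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)),
               rng.normal(+1, 1, size=(50, 2))])
y = np.repeat([-1, 1], 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# f(x) = beta_0 + sum over support vectors of (y_i alpha_i) <x, x_i>;
# non-support vectors contribute nothing, so they are not even stored.
x_new = np.array([0.25, -0.5])
f_manual = clf.intercept_[0] + clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new)
print(np.isclose(f_manual, clf.decision_function(x_new[None, :])[0]))  # True
```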
A shortcoming of the support vector classifier is that it presumes a linear decision boundary.