As shown in the other post, the support vector machine (SVM) method aims at separating the two types of data points with a hyperplane \(z\equiv w^Tx+b=0\).

The hyperplane is uniquely determined by maximizing the distance from the hyperplane to the closest points, the so-called support vectors. This optimization problem can be written as

\[\min_{w,b}\frac{1}{2}\|w\|^2, \quad s.t. \quad y^{(i)} z^{(i)}\ge 1,\]

where \(z^{(i)} = w^Tx^{(i)}+b\) characterizes the distance from the data point to the separation plane, the label \(y^{(i)}\) takes the value of \(\pm1\), and the unknowns are the weights \(w\) and the bias \(b\). Recall that the data points \(x^{(i)}\in\mathbb R^n\) and that there are \(m\) of them in the training set.
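To make the notation concrete, here is a minimal numpy sketch with made-up \(w\), \(b\), and data; it evaluates \(z^{(i)}=w^Tx^{(i)}+b\), checks the constraints \(y^{(i)}z^{(i)}\ge1\), and prints the objective value:

```python
import numpy as np

# Made-up hyperplane and data, only to illustrate the notation.
w = np.array([1.0, -1.0])           # normal vector of the hyperplane
b = 0.5                             # bias
X = np.array([[2.0, 0.0],           # m = 3 points in R^2
              [0.0, 2.0],
              [3.0, 1.0]])
y = np.array([1, -1, 1])            # labels y^(i) in {+1, -1}

z = X @ w + b                       # z^(i) = w^T x^(i) + b
print(y * z, np.all(y * z >= 1))    # constraints y^(i) z^(i) >= 1
print(0.5 * np.dot(w, w))           # objective ||w||^2 / 2
```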

This model only works if the data are linearly separable. When they are not, such as in Fig. 1, two more ideas need to be developed:

  • soft margin
  • kernel trick

The soft margin allows some points to be misclassified or to fall within the margin, and the kernel trick expands the feature space in the hope that the data become linearly separable in some higher-dimensional space.

I will introduce these two ideas one by one in the following sections.

Figure 1. Example of data points that are not linearly separable.

soft margin

To allow classification errors, we introduce slack variables \(\xi\in\mathbb R^m\) into the constraints:

\[y^{(i)} z^{(i)} \ge 1-\xi_i, \quad\text{with } \xi_i\ge0\]

Although errors are allowed, they are not preferred; thus a penalty term is added to the cost function

\[\min_{w,b,\xi}\frac{1}{2}\|w\|^2 + C\sum_{i=1}^m\xi_i\]

The regularization parameter \(C>0\) needs to be determined by trial and error.
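As a sketch of what is being minimized, the hypothetical helper below evaluates the soft-margin objective for a given \((w, b)\), taking \(\xi_i=\max(0,\,1-y^{(i)}z^{(i)})\), the smallest slacks that satisfy the constraints:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """0.5*||w||^2 + C*sum(xi), with xi_i = max(0, 1 - y^(i) z^(i)),
    the smallest slacks satisfying the constraints for this (w, b)."""
    z = X @ w + b
    xi = np.maximum(0.0, 1.0 - y * z)
    return 0.5 * np.dot(w, w) + C * np.sum(xi)
```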

To solve this constrained optimization problem, we use the generalized Lagrange multiplier method. The Lagrangian is given by

\[\mathcal L = \frac{1}{2}w^Tw + C\sum_{i=1}^m\xi_i + \sum_{i=1}^m\alpha_i\left(1-\xi_i-y^{(i)}z^{(i)}\right)-\sum_{i=1}^m r_i\xi_i\]

with nonnegative multipliers \(\alpha_i\ge0\) and \(r_i\ge0\).

The KKT stationarity conditions require

\[\frac{\partial\mathcal L}{\partial w}=0, \frac{\partial\mathcal L}{\partial b}=0, \frac{\partial\mathcal L}{\partial\xi}=0\]

which give rise to

\[w=\sum_{i=1}^m \alpha_i y^{(i)}x^{(i)} \\ \sum_{i=1}^m \alpha_iy^{(i)}=0 \\ C-\alpha_i=r_i\ge0\]

It shows that the normal vector \(w\) of the separation plane is a linear combination of the data points \(x^{(i)}\). The data points \(x^{(i)}\) with non-zero \(\alpha_i\) are called the support vectors. Ideally, the number of support vectors is small.
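In code, recovering \(w\) and the support-vector indices from the multipliers is straightforward; the helper names below are illustrative, and the tolerance `tol` is only there to absorb numerical noise:

```python
import numpy as np

def weights_from_alpha(alpha, X, y):
    """w = sum_i alpha_i y^(i) x^(i)  (KKT stationarity w.r.t. w)."""
    return (alpha * y) @ X                 # shape (n,)

def support_vector_indices(alpha, tol=1e-8):
    """Support vectors are the data points with non-zero alpha_i."""
    return np.flatnonzero(alpha > tol)
```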

The KKT complementary slackness conditions also require

\[\alpha_i\left(1-\xi_i-y^{(i)}z^{(i)}\right)=0 \\ r_i\xi_i =0\]

If \(\alpha_i=C\) (equivalently \(r_i=0\)), the first condition gives \(y^{(i)}z^{(i)}=1-\xi_i\le 1\), since \(\xi_i\ge0\).

If \(0<\alpha_i<C\), then \(r_i=C-\alpha_i>0\), so the second condition forces \(\xi_i=0\), and the first condition then requires \(y^{(i)}z^{(i)}=1\).

If \(\alpha_i=0\), then \(r_i=C>0\), hence \(\xi_i=0\) again and the original constraint reduces to \(y^{(i)}z^{(i)}\ge1\).

To summarize, the complementary conditions give rise to

\[\alpha_i=0 \rightarrow y^{(i)}z^{(i)}\ge1 \\ \alpha_i=C \rightarrow y^{(i)}z^{(i)}\le1 \\ 0<\alpha_i<C \rightarrow y^{(i)}z^{(i)}=1\]

The first type of points do not contribute to \(w\) and thus do not affect the decision boundary. The third type are the conventional support vectors lying exactly on the margin, whereas the second type are points that violate the margin or fall on the wrong side of the boundary.
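A small sketch that sorts the points into these three groups from a given \(\alpha\); the tolerance `tol` is a numerical detail, not part of the theory:

```python
import numpy as np

def classify_points(alpha, C, tol=1e-8):
    """Split points into the three groups implied by the complementary conditions."""
    non_sv    = alpha <= tol                       # alpha_i = 0: strictly outside the margin
    margin_sv = (alpha > tol) & (alpha < C - tol)  # 0 < alpha_i < C: exactly on the margin
    bound_sv  = alpha >= C - tol                   # alpha_i = C: inside the margin or misclassified
    return non_sv, margin_sv, bound_sv
```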

Plugging the expression of \(w\) back into the Lagrangian, we have

\[\mathcal L(\alpha)=\sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i,j=1}^m\alpha_i\alpha_jy^{(i)}y^{(j)}\left<x^{(i)},x^{(j)}\right>\]

where the brackets denote the inner product. Thus the dual problem is a quadratic program:

\[\max_{\alpha} \mathcal L \\ s.t. \sum_{i=1}^m\alpha_iy^{(i)}=0,\quad 0\le\alpha_i\le C\]
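For reference, the dual objective can be evaluated directly from the Gram matrix of inner products. The sketch below is plain numpy and is not an SMO solver; `dual_objective` is a hypothetical name:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """L(alpha) = sum_i alpha_i - 0.5 * sum_{i,j} alpha_i alpha_j y^(i) y^(j) <x^(i), x^(j)>."""
    K = X @ X.T                                # Gram matrix of inner products
    Q = (y[:, None] * y[None, :]) * K
    return alpha.sum() - 0.5 * alpha @ Q @ alpha
```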

Sequential minimal optimization (SMO) is a popular method for solving the dual problem. It is a type of block-coordinate-descent method that optimizes two multipliers \((\alpha_i, \alpha_j)\) at a time so that the equality constraint stays satisfied. After solving the dual problem, \(w\) can be formed from \(\alpha\), and \(b\) can be calculated from the set of support vectors using its geometric meaning.
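Assuming the dual has been solved, a common way to recover \(b\) is to average \(y^{(i)}-\sum_j\alpha_jy^{(j)}\langle x^{(j)},x^{(i)}\rangle\) over the margin support vectors (\(0<\alpha_i<C\)), which satisfy \(y^{(i)}z^{(i)}=1\); a minimal sketch:

```python
import numpy as np

def bias_from_support_vectors(alpha, X, y, C, tol=1e-8):
    """Average b over the margin support vectors (0 < alpha_i < C), for which y^(i) z^(i) = 1."""
    K = X @ X.T
    margin = (alpha > tol) & (alpha < C - tol)
    b_values = y[margin] - (alpha * y) @ K[:, margin]
    return b_values.mean()
```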

With a new data point \(x^{(k)}\), one can determine its label by computing the sign of

\[z^{(k)}=\sum_{i=1}^{\ell}\alpha_iy^{(i)}\left<x^{(i)},x^{(k)}\right>+b,\]

where \(\ell\) is the number of support vectors, in other words, the number of non-zero \(\alpha_i\).
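In code, the prediction step needs only the non-zero multipliers, their data points and labels, and \(b\); a minimal sketch with illustrative function names:

```python
import numpy as np

def decision_value(x_new, alpha, X, y, b, tol=1e-8):
    """z(x_new) = sum over support vectors of alpha_i y^(i) <x^(i), x_new> + b."""
    sv = alpha > tol                          # only non-zero alpha_i contribute
    return (alpha[sv] * y[sv]) @ (X[sv] @ x_new) + b

def predict(x_new, alpha, X, y, b):
    """The predicted label is the sign of the decision value."""
    return np.sign(decision_value(x_new, alpha, X, y, b))
```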

In general, it is preferable to use SVM with soft margin even if the data are linearly separable. This is because a small number of outliers may cause significant bias in the decision boundary. One such example is shown in Fig. 2.

Figure 2. Although the data points are linearly separable, the existence of one outlier may change the SVM result by a lot. SVM with soft margin is more robust to outliers.
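One way to observe this effect is to fit scikit-learn's `SVC` with a linear kernel on synthetic data containing a single outlier, once with a very large \(C\) (nearly a hard margin) and once with a moderate \(C\). The data below are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 0.5, size=(20, 2)),   # class -1 cluster
               rng.normal(+3.0, 0.5, size=(20, 2))])  # class +1 cluster
y = np.array([-1] * 20 + [1] * 20)
X = np.vstack([X, [[1.0, 1.0]]])     # a single -1 outlier close to the +1 cluster
y = np.append(y, -1)

for C in (1e6, 1.0):                 # nearly hard margin vs. soft margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:g}: w={clf.coef_[0]}, b={clf.intercept_[0]:.3f}, "
          f"#SV={len(clf.support_)}")
```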

kernel trick

In practice, there are cases where the data are not linearly separable but can be separated by a nonlinear surface. In some of these cases, the data become linearly separable when the feature vector \(x^{(i)}\) is expanded in some way. An example is shown in Fig. 3.

Figure 3. Transforming \(x\) from 1D to 2D via \(x\mapsto(x, x^2)\) makes linear separation possible.

The new feature vector is then some vector function of the original feature vector, i.e., \(\phi(x^{(i)})\). When the form of \(\phi(x)\) is determined, all preceding discussions hold with the replacement of \(x^{(i)}\) by \(\phi(x^{(i)})\).

Note that in the calculations, we only need to compute the inner product

\[K(x^{(i)}, x^{(j)}) = \left<\phi(x^{(i)}),\phi(x^{(j)})\right>\]

Thus the explicit calculation of \(\phi(x)\) may be avoided altogether, which is the key to efficiency.
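As a quick numeric check of this identity, consider the 1D degree-2 polynomial kernel with \(c=1\), whose explicit feature map is \(\phi(x)=(1,\sqrt2\,x,x^2)\):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the 1D kernel (xz + 1)^2:
    <phi(x), phi(z)> = 1 + 2xz + (xz)^2 = (xz + 1)^2."""
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

x, z = 0.7, -1.3
print(np.dot(phi(x), phi(z)))   # via the explicit feature map
print((x * z + 1.0) ** 2)       # via the kernel, no phi needed -- same value
```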

In practice, a low-degree polynomial kernel or a radial basis function (RBF) kernel with a reasonable width is a good first try.

Gaussian (RBF) kernel

\[K(x, z) = \exp\left(-\frac{\|x-z\|^2}{2\sigma^2} \right)\]

Polynomial kernel

\[K(x, z) = (x^Tz+c)^d, \quad d\in\mathbb{N}\]
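Both kernels are straightforward to implement directly; a minimal numpy sketch, where the width \(\sigma\) and the polynomial parameters \(c\) and \(d\) are free parameters to be tuned:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel: (x^T z + c)^d."""
    return (np.dot(x, z) + c) ** d
```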