FULL SOLUTION


Support Vector Machines

Consider a binary classification problem with a data set $\mathcal{D}$ that contains $n$ training examples in a $p$-dimensional space of the form:

$\mathcal{D}=\{(\mathbf{x}_i,y_i)|\mathbf{x}_i \in \mathbb{R}^p,y_i \in \{-1,1\}\}_{i=1}^{n}$

where $y_i$ indicates the class of the training sample $\mathbf{x}_i$.

The primal optimization problem for training an SVM with soft-margin classification (which allows samples inside the classification margin) and a kernel function is (Cortes, 1995):

$\min_{\mathbf{w},b,\boldsymbol{\xi}} \frac{1}{2}\Vert\mathbf{w}\Vert^2+C\sum_{i=1}^n{\xi_i}$
$y_i\left(\sum_{j=1}^n\alpha_jy_jK(\mathbf{x}_j,\mathbf{x}_i)+b\right)\ge 1-\xi_i,\forall i=1,2,...,n$
$\xi_i \ge 0,\forall i=1,2,...,n$
$\mathbf{w}=\sum_{i=1}^{n}y_i\alpha_i\phi(\mathbf{x}_i)$

where $C$ is a penalty parameter that trades off the misclassification error, $\phi(\cdot)$ is a (usually unknown) nonlinear transformation whose inner products can be computed with a kernel function, $\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)=K(\mathbf{x}_i,\mathbf{x}_j)$, and $\alpha_i$ is the Lagrange multiplier associated with the constraint of sample $\mathbf{x}_i$.

If the kernel is a radial basis function (RBF), also called the Gaussian kernel, $K(\mathbf{x}_i,\mathbf{x}_j)=\exp(-\gamma\Vert \mathbf{x}_i - \mathbf{x}_j \Vert^2)$, the corresponding feature space is an infinite-dimensional Hilbert space.
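
As an illustration, the following is a minimal NumPy sketch of this kernel matrix; the function name and the use of NumPy are choices made for the example, not part of the library:

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2)."""
    # Pairwise squared Euclidean distances between the rows of X and the rows of Z.
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# Example: 5 samples in R^3; the result is a 5 x 5 symmetric matrix with ones on its diagonal.
X = np.random.randn(5, 3)
K = rbf_kernel(X, X, gamma=0.5)
```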

This optimization problem is equivalent to solving an $n$-dimensional convex Quadratic Programming (QP) problem.
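
For reference, the $n$-dimensional QP is the standard dual form of the problem above, with one variable $\alpha_i$ per training sample:

$\max_{\boldsymbol{\alpha}} \sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_jy_iy_jK(\mathbf{x}_i,\mathbf{x}_j)$
$0 \le \alpha_i \le C,\forall i=1,2,...,n$
$\sum_{i=1}^n\alpha_iy_i=0$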

The solution is non-parametric: the size of the classification function (that is, the number of Support Vectors) is not predetermined but is determined by the training samples. To keep the complexity under control, semi-parametric models can be adopted (see our Budgeted solution).

IRWLS to train an SVM

The cost function of the SVM can be rearranged in the form of a Weighted Least Squares problem (Perez Cruz, 2001). That matrix formulation, whose complexity can be kept under control, can be efficiently parallelized as shown in (Díaz Morales, 2016b).

IRWLS rearranges the cost function to be independent of $\xi_i$ (Perez Cruz, 2001):

$\min_{\mathbf{w},b} \frac{1}{2}\Vert\mathbf{w}\Vert^2+\sum_{i=1}^n a_{i}e_i^2$
$e_i=y_i-\left(\sum_{j=1}^n\alpha_jy_jK(\mathbf{x}_i,\mathbf{x}_j)+b\right)$
$a_i=\begin{cases}0, & y_ie_i \le 0\\\frac{C}{y_ie_i}, & y_ie_i > 0\end{cases}$

The weights are then obtained by minimizing this cost while treating every $a_i$ as a constant (independent of $\mathbf{w}$), recalculating the $a_i$ with the new $\mathbf{w}$, and repeating the procedure until convergence.
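
A minimal dense NumPy sketch of this iteration follows, derived from the cost function above. It is illustrative only: it builds and solves the full linear system over all active samples at every step (exactly what the methods below avoid), and the function names, tolerances, and the $\frac{1}{2}$ normalization of the system are our own reading of the formulas, not the software's implementation.

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def irwls_svm(X, y, C=1.0, gamma=1.0, max_iter=50, tol=1e-6, eps=1e-12):
    """Dense IRWLS iteration for a kernel SVM (illustrative, O(n^3) per iteration)."""
    n = y.shape[0]
    K = rbf_kernel(X, X, gamma)
    beta = np.zeros(n)     # beta_i = alpha_i * y_i, the expansion coefficients of w
    b = 0.0
    for _ in range(max_iter):
        e = y - (K @ beta + b)                                      # errors e_i
        a = np.where(y * e > 0, C / np.maximum(y * e, eps), 0.0)    # weights a_i
        act = a > 0                      # samples that enter the weighted LS system
        m = int(act.sum())
        if m == 0:
            break
        # Setting the gradient of the weighted cost to zero (with a_i held fixed)
        # gives a linear system in (beta_act, b); samples with a_i = 0 get beta_i = 0.
        A = np.zeros((m + 1, m + 1))
        A[:m, :m] = K[np.ix_(act, act)] + 0.5 * np.diag(1.0 / a[act])
        A[:m, m] = 1.0
        A[m, :m] = 1.0
        rhs = np.append(y[act], 0.0)
        sol = np.linalg.solve(A, rhs)
        new_beta = np.zeros(n)
        new_beta[act] = sol[:m]
        new_b = sol[m]
        converged = np.linalg.norm(new_beta - beta) < tol and abs(new_b - b) < tol
        beta, b = new_beta, new_b
        if converged:
            break
    return beta, b    # decision function: f(x) = sum_i beta_i K(x_i, x) + b
```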

In every iteration, the size of the quadratic problem to be solved is equal to the number of elements in the training set. This is not affordable in most cases because solving the linear system that yields the weights $\mathbf{w}$ at every iteration requires storing in memory a matrix containing the kernel function of every pair of training samples (for example, storing that matrix for $n=100{,}000$ samples in double precision already requires about 80 GB). To make this procedure practical, two solutions have been proposed in the literature.

  • Training samples differentiation: The linear system does not need to work with the full training set, just with a reduced set that contains the unbounded Support Vectors. It works by dividing the training samples into three sets: $\mathcal{S}_1$ contains the unbounded SVs, $\mathcal{S}_2$ contains samples whose $\alpha_i=0$, and $\mathcal{S}_3$ contains bounded SVs. This method, although it reduces the complexity of the training procedure, is not enough when the number of unbounded Support Vectors is very high.
  • Using a working set: To use this procedure inside a chunking scheme, the training elements are divided into two groups. At every iteration the samples are split into a working set $\mathcal{S}_W$ of fixed size, whose elements' weights are updated in that iteration, and an inactive set $\mathcal{S}_{IN}$, whose elements' weights remain constant. The size of the quadratic problem is then the working-set size, which keeps the complexity under control (both divisions are illustrated in the sketch after this list).
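
A small sketch of both divisions is shown below, assuming the $\alpha_i$ of a current model are available; the working-set selection heuristic shown here (largest violations) is a hypothetical placeholder, not necessarily the rule used by this software:

```python
import numpy as np

def split_sets(alpha, C, tol=1e-8):
    """Training-samples differentiation: S1 (unbounded SVs), S2 (alpha = 0), S3 (bounded SVs)."""
    idx = np.arange(alpha.shape[0])
    s1 = idx[(alpha > tol) & (alpha < C - tol)]   # unbounded Support Vectors
    s2 = idx[alpha <= tol]                        # samples with alpha_i = 0
    s3 = idx[alpha >= C - tol]                    # bounded Support Vectors (alpha_i = C)
    return s1, s2, s3

def select_working_set(errors, y, size):
    """Hypothetical heuristic: take the 'size' samples with the largest |y_i * e_i|."""
    order = np.argsort(-np.abs(y * errors))
    return order[:size], order[size:]             # working set S_W and inactive set S_IN
```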

This software makes use of both methods to solve a full SVM: we use training samples differentiation, and there is an outer loop that selects a working set in every iteration and an inner loop that updates the weights of the working set using the IRWLS procedure.

To parallelize this software, a shared-memory parallelization has been implemented as described in (Díaz Morales, 2016b).