To parallelize this software, a shared memory parallelization has been performed as described in (Díaz Morales, 2016a)(Díaz Morales, 2016b).
We have implemented the algorithms to run on a shared memory multicore environment. In our schema, different threads usually perform different tasks and the method to solve the linear system is recursive so this is a bad scenario for a GPU framework.
Modern processors have several cache memories organized by levels (L1, L2,..., RAM). When programming high performance applications, it is important to take into account this architecture to avoid bottlenecks: different cores accessing the RAM at the same time, different copies of the data could exist in each core cache, if a core modifies its cache copy, then the system will need to update all caches and RAM to keep consistency, etc. It is also useful to consider that cache is more efficient reading continuous memory data.
In order to avoid these constraints as far as possible we have adopted the following strategies. Firstly, variables that are not going to be used by other cores are saved outside the shared memory to avoid cache updating. Secondly, every core works in a different memory area. Finally, the datasets are stored in a continuous memory block using a row major order schema.
A design pattern called Map Reduce (do not confuse the design pattern with the associated implementation for processing large datasets on clusters) has been widely used. This structure has been shown to be very useful for many tasks in the machine learning area. A Map Reduce algorithm is composed of a Map() procedure that operates on a portion of the input data and a Reduce() function that performs a summary operation.
In SGMA, the selection of the best basis element has been implemented using a map function to evaluate the error decrease and a reduce function that finds the candidate that has obtained the highest value.
In the IRWLS procedure, the steps to update the errors or the variables have also been implemented using this design pattern because the operations are independent among the different input elements.
We are working in a shared memory environment, to parallelize the matrix operations we have followed a quadtree schema, iteratively dividing the matrices into four submatrices and performing the operations on the submatrices. This schema allows us to use a number of cores equal to a power of 2 dividing recursively the matrices.
The parallelization of matrix multiplications can be done very efficiently in a multicore environment creating a different subtask for every position of the resulting matrix. Using a quadtree schema the operations are split like this:
$\left[ \begin{array}{cc} \mathbf{A_1} & \mathbf{A_2} \\ \mathbf{A_3} & \mathbf{A_4} \end{array} \right]= \left[ \begin{array}{cc} \mathbf{B_1} & \mathbf{B_2} \\ \mathbf{B_3} & \mathbf{B_4} \end{array} \right] \left[ \begin{array}{cc} \mathbf{C_1} & \mathbf{C_2} \\ \mathbf{C_3} & \mathbf{C_4} \end{array} \right]$
$\text{Task 1}\begin{cases}\mathbf{A_1}=\mathbf{B_1}\mathbf{C_1}+\mathbf{B_2}\mathbf{C_3}\\
\mathbf{A_2}=\mathbf{B_1}\mathbf{C_2}+\mathbf{B_2}\mathbf{C_4}\end{cases}$
$\text{Task 2}\begin{cases} \mathbf{A_3}=\mathbf{B_3}\mathbf{C_1}+\mathbf{B_4}\mathbf{C_3}\\
\mathbf{A_4}=\mathbf{B_3}\mathbf{C_2}+\mathbf{B_4}\mathbf{C_4}\end{cases}$
If one matrix is triangular, the procedure should be divided into subtasks with the same run time. For example, if $\mathbf{C}$ is a lower triangular matrix, then $\mathbf{C_1}$ and $\mathbf{C_4}$ are lower triangular (their products have half of the operations) and $\mathbf{C_2}=\mathbf{0}$:
$\text{Task 1}\begin{cases}\mathbf{A_1}=\mathbf{B_1}\mathbf{C_1}+\mathbf{B_2}\mathbf{C_3}\\
\mathbf{A_2}=\mathbf{B_2}\mathbf{C_4}\end{cases}$
$\text{Task 2}\begin{cases} \mathbf{A_3}=\mathbf{B_3}\mathbf{C_1}+\mathbf{B_4}\mathbf{C_3}\\
\mathbf{A_4}=\mathbf{B_4}\mathbf{C_4}\end{cases}$
The parallelization of additions and subtractions is a trivial matter and the operations can be directly parallelized:
$\left[ \begin{array}{cc} \mathbf{A_1} & \mathbf{A_2} \\ \mathbf{A_3} & \mathbf{A_4} \end{array} \right]= \left[ \begin{array}{cc} \mathbf{B_1} & \mathbf{B_2} \\ \mathbf{B_3} & \mathbf{B_4} \end{array} \right] +\left[ \begin{array}{cc} \mathbf{C_1} & \mathbf{C_2} \\ \mathbf{C_3} & \mathbf{C_4} \end{array} \right]$
$\text{Task 1}\begin{cases}\mathbf{A_1}=\mathbf{B_1}+\mathbf{C_1}\\
\mathbf{A_2}=\mathbf{B_2}+\mathbf{C_2}\end{cases}$
$\text{Task 2}\begin{cases} \mathbf{A_3}=\mathbf{B_3}+\mathbf{C_3}\\
\mathbf{A_4}=\mathbf{B_4}+\mathbf{C_4}\end{cases}$
Another operation that we have to deal with is the inverse of a triangular matrix. Having a triangular matrix $\mathbf{L}$, its inverse $\mathbf{L^{-1}}$ is triangular and $\mathbb{I}=\mathbf{L}\mathbf {L^{-1}}$, subdividing the matrices we have:
$\left[ \begin{array}{cc} \mathbb{I} & \mathbf{0} \\ \mathbf{0} & \mathbb{I} \end{array} \right]= \left[ \begin{array}{cc} \mathbf{L_1} & \mathbf{0} \\ \mathbf{L_3} & \mathbf{L_4} \end{array} \right] \left[ \begin{array}{cc} (\mathbf{L^{-1})_1} & \mathbf{0} \\ (\mathbf{L^{-1})_3} & (\mathbf{L^{-1})_4} \end{array} \right]$
$\text{Task 1}\begin{cases}(\mathbf{L^{-1})_1}=\mathbf{L_1}^{-1}\end{cases}$
$\text{Task 2}\begin{cases}(\mathbf{L^{-1})_4}=\mathbf{L_4}^{-1}\end{cases}$
$\mathbf{(L^{-1})_3}=-(\mathbf{L^{-1})_4}\mathbf{L_3}(\mathbf{L^{-1})_1}\Rightarrow \text{Parallelizable}$
In this case, ($\mathbf{L^{-1})_1}$ and ($\mathbf{L^{-1})_4}$ can be computed in parallel at the same time and ($\mathbf{L^{-1})_3}$ can be parallelized using the products described previously.
Since the IRWLS procedure solves just one linear system in every iteration, it is not computationally efficient to invert the matrix to solve the linear system that obtains the weights. In the case of IRWLS the matrix to be inverted contains kernel evaluation of every pair of elements. This means that the matrix is positive definite.
The Cholesky factorization is a decomposition of a positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose, it is efficient for many numerical solutions and is twice as efficient as the LU decomposition for solving systems of linear equations:
$\mathbf{A}=\mathbf{L_A}\mathbf{L_A}^*$
$\mathbf{L_A}=Chol(\mathbf{A})$
Once the factorization has been done, it is possible to solve very efficiently a linear system using backsubstitution with no need to compute the inverse matrix.
To perform the parallel Cholesky factorization using the quadtree schema we have previously divided the matrix into four submatrices:
$\left[ \begin{array}{cc} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{array} \right]= \left[ \begin{array}{cc} \mathbf{L_1} & \mathbf{L_2} \\ \mathbf{L_3} & \mathbf{L_4} \end{array} \right] \left[ \begin{array}{cc} \mathbf{L_1^*} & \mathbf{L_3^*} \\ \mathbf{L_2^*} & \mathbf{L_4^*} \end{array} \right]$
The decomposition obtains a lower triangular matrix and:
$\mathbf{L_2} = \mathbf{0}$
$\mathbf{A}=\mathbf{L_1}\mathbf{L_1^*}\Rightarrow \mathbf{L_1} = Chol(\mathbf{A})$
$\mathbf{C}=\mathbf{L_3}\mathbf{L_1^*}\Rightarrow \mathbf{L_3}=\mathbf{C}\mathbf{L_1^*}^{-1}$
$\mathbf{D}=\mathbf{L_3}\mathbf{L_3^*}+\mathbf{L_4}\mathbf{L_4^*}\Rightarrow \mathbf{L_4} = Chol(\mathbf{D}-\mathbf{L_3}\mathbf{L_3}^*)$
Our implementation to perform the Cholesky decomposition in parallel consists on a recursive implementation dividing the matrix submatrices and solving the matrix operations in parallel. Once the factorization has been done, the backsubstitution procedure solves the linear system in much less time than the decomposition itself.
The algorithms of this library (to solve full and budgeted SVMs) have been implemented using the procedures and considerations analyzed this section. The few portions of non-parallel code take negligible run time, meaning that these models are a good schema to make SVMs practical in an increasing number of scenarios.