The International Conference for High Performance Computing, Networking, Storage and Analysis
Performance of Sparse Matrix-Multiple Vectors Multiplication on Multicore and GPUs
Authors: Walid Abu-Sufah (University of Illinois at Urbana-Champaign and University of Jordan), Khalid Ahmad (University of Jordan)
Abstract: Sparse matrix-vector and matrix-multiple-vector multiplications (SpMV and SpMM) are performance bottlenecks in numerous applications. We implemented two SpMM kernels to integrate into our library of auto-tuned kernels for GPUs. Our kernels use registers to exploit data reuse in SpMM. DIA-SpMM targets structured matrices and ELL-SpMM targets matrices with uniform row lengths. Work is continuing on SpMM kernels for unstructured matrices.
Executing on an NVIDIA Kepler Tesla K40m, DIA-SpMM is 2.4x faster than NVIDIA CUSP DIA-SpMV, and ELL-SpMM is 2.8x faster than CUSP ELL-SpMV.
DIA-SpMM is 5.2x faster than the highly optimized NVIDIA CUSPARSE CSR-SpMV, with a maximum speedup of 6.5x. ELL-SpMM is 3.9x faster than CUSPARSE CSR-SpMV, with a maximum speedup of 8.3x.
DIA-SpMM is 2x faster than CUSPARSE CSR-SpMM. ELL-SpMM is 1.6x faster.
For structured matrices, DIA-SpMM on the K40m GPU is 7.2x faster than Intel MKL CSR-SpMV on a dual-socket 10-core Intel Ivy Bridge E5-2690, with a maximum speedup of 12.3x.