The International Conference for High Performance Computing, Networking, Storage and Analysis

Authors: Walid Abu-Sufah (University of Illinois at Urbana-Champaign and University of Jordan), Khalid Ahmad (University of Jordan)

Abstract: Sparse matrix-vector and multiple-vector multiplications (SpMV and SpMM) are performance bottlenecks in numerous applications. We implemented two SpMM kernels to integrate into our library of auto-tuned kernels for GPUs. Our kernels use registers to exploit the data reuse inherent in SpMM. DIA-SpMM targets structured matrices and ELL-SpMM targets matrices with uniform row lengths. Work is continuing on SpMM kernels for unstructured matrices. Executing on an NVIDIA Kepler Tesla K40m, DIA-SpMM is 2.4x faster than NVIDIA CUSP DIA-SpMV, and ELL-SpMM is 2.8x faster than CUSP ELL-SpMV. DIA-SpMM is 5.2x faster than the highly optimized NVIDIA CUSPARSE CSR-SpMV (maximum speedup 6.5x), and ELL-SpMM is 3.9x faster than CUSPARSE CSR-SpMV (maximum speedup 8.3x). DIA-SpMM is 2x faster than CUSPARSE CSR-SpMM, and ELL-SpMM is 1.6x faster. For structured matrices, DIA-SpMM on the K40m GPU is 7.2x faster than Intel MKL CSR-SpMV on a dual-socket 10-core Intel Ivy Bridge E5-2690 (maximum speedup 12.3x).
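The register-reuse idea behind the kernels can be illustrated with a minimal serial sketch of SpMM over the DIA (diagonal) format. This is not the authors' CUDA implementation; the matrix sizes, names, and the tridiagonal example matrix are assumptions chosen for illustration. The point it shows: each stored matrix entry is loaded once into a register and reused across all K right-hand-side vectors, which is the reuse that SpMM offers over K independent SpMV calls.

```c
#define N 4      /* matrix dimension (illustrative) */
#define NDIAG 3  /* number of stored diagonals */
#define K 2      /* number of right-hand-side vectors */

/* DIA storage: a diagonal offset array plus an NDIAG x N value array,
   where diag_vals[d][i] holds A[i][i + offsets[d]] (padded with zeros
   where the diagonal runs off the matrix). Example: tridiagonal [1,2,1]. */
static const int offsets[NDIAG] = {-1, 0, 1};
static const double diag_vals[NDIAG][N] = {
    {0.0, 1.0, 1.0, 1.0},  /* sub-diagonal (first entry is padding) */
    {2.0, 2.0, 2.0, 2.0},  /* main diagonal */
    {1.0, 1.0, 1.0, 0.0},  /* super-diagonal (last entry is padding) */
};

/* DIA SpMM: Y = A * X for K vectors at once. The value `a` is loaded
   once and reused K times; an SpMV-per-vector loop would reload it. */
void dia_spmm(const double X[N][K], double Y[N][K]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < K; k++)
            Y[i][k] = 0.0;
    for (int d = 0; d < NDIAG; d++) {
        for (int i = 0; i < N; i++) {
            int j = i + offsets[d];          /* column index on this diagonal */
            if (j < 0 || j >= N) continue;   /* skip padded entries */
            double a = diag_vals[d][i];      /* one load, K uses */
            for (int k = 0; k < K; k++)
                Y[i][k] += a * X[j][k];
        }
    }
}
```

In the GPU setting, the same reuse is obtained by keeping the matrix entry in a register while a thread accumulates partial results for several vectors, amortizing each memory load over K floating-point operations.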

Poster: pdf

Two-page extended abstract: pdf
