Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply

作者: Rajesh Nishtala , Richard Vuduc , James W. Demmel , Shoaib Kamil , Katherine A. Yelick

DOI: 10.5555/762761.762822

关键词:

摘要: We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM × V), one the most important computational kernels in scientific applications. This paper addresses fundamental questions what limits exist on such how closely tuned approaches these limits.Specifically, we develop upper lower bounds (Mflop/s) SpM V when using our previously proposed register blocking optimization. These are based non-zero pattern matrix cost basic memory operations, as cache hits misses. evaluate implementations with respect to hardware counter 4 different platforms a test set 44 matrices. find that can often get within 20% bound, particularly class matrices from finite element modeling (FEM) problems; non-FEM matrices, improvements 2× still possible. Lastly, present new heuristic selects optimal or near-optimal block sizes (the key tuning parameters) more accurately than previous heuristic. Using heuristic, show much 2.5× over an untuned implementation.Collectively, results suggest future improvements, beyond those have already demonstrated for V, will come two sources: (1) consideration higher-level structures (e.g., exploiting symmetry, reordering, multiple sizes), (2) optimizing opportunity reuse matrix-multiple vector multiply, multiplication ATA vector).

参考文章(39)
Roldan Pozo, Karin A Remington, NIST sparse BLAS user's guide National Institute of Standards and Technology. ,(2001) , 10.6028/NIST.IR.6744
Dora Blanco Heras, Vicente Blanco Pérez, José Carlos Cabaleiro Domínguez, Francisco Fernández Rivera, Modeling and Improving Locality for Irregular Problems: Sparse Matrix-Vector Product on Cache Memories as a Cache Study ieee international conference on high performance computing data and analytics. pp. 201- 210 ,(1999) , 10.1007/BFB0100581
Paul Vinson Stodghill, A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. ,(1997)
Eun-Jin Im, Katherine Yelick, Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY international conference on computational science. pp. 127- 136 ,(2001) , 10.1007/3-540-45545-0_22
Markus Püschel, Bryan Singer, Manuela Veloso, José M. F. Moura, Fast Automatic Generation of DSP Algorithms international conference on computational science. pp. 97- 106 ,(2001) , 10.1007/3-540-45545-0_19
William Pugh, Tatiana Shpeisman, SIPR: A New Framework for Generating Efficient Code for Sparse Matrix Computations Languages and Compilers for Parallel Computing. pp. 213- 229 ,(1999) , 10.1007/3-540-48319-5_14
Todd L. Veldhuizen, Arrays in Blitz Lecture Notes in Computer Science. pp. 223- 230 ,(1998) , 10.1007/3-540-49372-7_24
Rafael Hector Saavedra-Barrera, CPU performance evaluation and execution time prediction using narrow spectrum benchmarking CPU performance evaluation and execution time prediction using narrow spectrum benchmarking. ,(1992)