Authors: Rajesh Nishtala, Richard Vuduc, James W. Demmel, Shoaib Kamil, Katherine A. Yelick
Keywords:
Abstract: We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM×V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned approaches can achieve these limits.

Specifically, we develop upper and lower bounds on the performance (in Mflop/s) of SpM×V when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern of the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our implementations with respect to these bounds, using hardware counter data, on 4 different platforms and a test set of 44 matrices. We find that we can often get within 20% of the upper bound, particularly on a class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, improvements of 2× may still be possible. Lastly, we present a new heuristic that selects optimal or near-optimal block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show speedups of as much as 2.5× over an untuned implementation.

Collectively, these results suggest that future improvements, beyond those we have already demonstrated for SpM×V, will come from two sources: (1) consideration of higher-level matrix structures (e.g., exploiting symmetry, reordering, multiple block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g., sparse matrix-multiple vector multiply, multiplication of AᵀA by a vector).
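The performance bounds mentioned in the abstract are derived by charging a cost to each basic memory operation. The following is a sketch of the general form such a bound takes; the symbols (k non-zeros, h_i hits, m misses, α costs) are generic assumptions of this sketch, not necessarily the paper's exact notation:

```latex
% Illustrative form of a memory-cost performance bound (generic
% notation, not necessarily the paper's).  The kernel performs
% 2k flops for k non-zeros; h_i hits at cache level i cost
% \alpha_i seconds each, and m accesses to main memory cost
% \alpha_{mem} each.  Minimizing T over feasible hit/miss counts
% gives T_min, and hence an upper bound on performance P.
\[
  T \;\ge\; \sum_i h_i \,\alpha_i \;+\; m\,\alpha_{\mathrm{mem}},
  \qquad
  P_{\mathrm{upper}} \;=\; \frac{2k}{T_{\mathrm{min}}} \times 10^{-6}
  \;\;\text{Mflop/s}.
\]
```

Register blocking itself reorganizes the matrix into small dense r×c blocks stored contiguously (a Block Compressed Sparse Row layout), so the inner loop can be fully unrolled and its operands held in registers. Below is a minimal C sketch for a fixed 2×2 block size; the data-structure names (brow_ptr, bcol_ind, bval) are illustrative assumptions, not the authors' implementation:

```c
#include <stdio.h>

/* Sketch: y += A*x with A in 2x2 Block Compressed Sparse Row (BCSR)
 * format.  Fixing the block shape at compile time lets the 2x2 inner
 * body be fully unrolled with its operands kept in registers.
 * Names here are illustrative, not the paper's code. */
void bcsr_spmv_2x2(int n_brows,          /* number of 2-row block rows    */
                   const int *brow_ptr,  /* n_brows+1 block-row offsets   */
                   const int *bcol_ind,  /* left column of each block     */
                   const double *bval,   /* 4 values per block, row-major */
                   const double *x, double *y)
{
    for (int I = 0; I < n_brows; I++) {
        /* Accumulate both output elements of this block row in
         * registers; write them back to memory once, after the loop. */
        double y0 = y[2*I], y1 = y[2*I + 1];
        for (int k = brow_ptr[I]; k < brow_ptr[I+1]; k++) {
            const double *b = &bval[4*k];
            int j = bcol_ind[k];
            double x0 = x[j], x1 = x[j+1];  /* x reused across both rows */
            y0 += b[0]*x0 + b[1]*x1;        /* fully unrolled 2x2 block  */
            y1 += b[2]*x0 + b[3]*x1;
        }
        y[2*I]     = y0;
        y[2*I + 1] = y1;
    }
}

int main(void) {
    /* 4x4 block-diagonal example: blocks [[1 2],[3 4]] and [[5 6],[7 8]]. */
    int brow_ptr[] = {0, 1, 2};
    int bcol_ind[] = {0, 2};
    double bval[]  = {1,2,3,4, 5,6,7,8};
    double x[] = {1,1,1,1}, y[4] = {0};
    bcsr_spmv_2x2(2, brow_ptr, bcol_ind, bval, x, y);
    printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]); /* prints 3 7 11 15 */
    return 0;
}
```

The fixed 2×2 shape is what allows the unrolled body and the register-resident accumulators; a tuning heuristic of the kind the abstract describes would choose among such (r, c) variants per matrix and machine.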