Optimization of Dense Matrix Multiplication on IBM Cyclops-64: Challenges and Experiences

Authors: Ziang Hu, Juan del Cuvillo, Weirong Zhu, Guang R. Gao

DOI: 10.1007/11823285_14

Keywords: CPU cache, Memory hierarchy, Shared memory, Computer architecture, Parallel computing, Shared memory architecture, Multiplication, Memory bandwidth, Memory architecture, Sparse matrix, Matrix multiplication, Computer science

Abstract: This paper presents a study of performance optimization of dense matrix multiplication on the IBM Cyclops-64 (C64) chip architecture. Although much has been published on how to optimize applications on shared memory architectures with multi-level caches, little has been reported on the applicability of the existing methods to the new generation of multi-core architectures like C64. For such architectures, a more economical use of on-chip storage resources appears to discourage the use of caches, while providing tremendous bandwidth per area. This paper presents an in-depth case study of a collection of well known optimization methods and tries to re-engineer them to address the challenges and opportunities provided by this emerging class of architectures. Our study demonstrates that efficiently exploiting the memory hierarchy is the key to achieving good performance. The main contributions include: (a) identifying a set of key optimizations for C64-like architectures, and (b) exploring a practical order of the optimizations, which yields good performance for matrix multiplication.
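As a rough illustration of what "exploiting the memory hierarchy" can mean for dense matrix multiplication, the sketch below shows generic loop tiling (blocking) in Python. It is not the paper's C64 implementation: the function name `matmul_tiled` and the `tile` parameter are hypothetical, and the record above does not say which tile sizes or which ordering of optimizations the authors used.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply dense matrices a (n x k) and b (k x m) using square loop tiles.

    Generic blocking sketch; `tile` is an illustrative parameter, not a value
    taken from the paper.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    # The three outer loops step from tile to tile; the three inner loops stay
    # inside one tile, so small blocks of a, b and c are reused before the
    # computation moves on to the next block.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += a_ip * b[p][j]
    return c
```

On a machine with explicitly managed on-chip memory, the tile size would presumably be chosen so that the working blocks fit in the fast storage level; here it only changes the order in which the same products are accumulated.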
