Optimization of Dense Matrix Multiplication on IBM Cyclops-64: Challenges and Experiences

Authors: Ziang Hu, Juan del Cuvillo, Weirong Zhu, Guang R. Gao

Keywords: CPU cache, Memory hierarchy, Shared memory, Computer architecture, Parallel computing, Shared memory architecture, Multiplication, Memory bandwidth, Memory architecture, Sparse matrix, Matrix multiplication, Computer science

Abstract: This paper presents a study of performance optimization of dense matrix multiplication on the IBM Cyclops-64 (C64) chip architecture. Although much has been published on how to optimize applications on shared memory architectures with multi-level caches, little has been reported on the applicability of existing methods to new-generation multi-core architectures like C64. For such architectures, a more economical use of on-chip storage resources appears to discourage the use of caches while providing tremendous bandwidth per area. This paper presents an in-depth case study of a collection of well-known optimization methods and tries to re-engineer them to address the challenges and opportunities provided by this emerging class of architectures. Our study demonstrates that efficiently exploiting the memory hierarchy is the key to achieving good performance. The main contributions include: (a) identifying a set of key optimizations for C64-like architectures, and (b) exploring a practical order of these optimizations, which yields good performance for dense matrix multiplication.
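The abstract does not name the individual optimizations it studies. As an illustration only, loop tiling (blocking) is one well-known way to exploit a memory hierarchy in dense matrix multiplication: the loops are restructured so that each small block of the operands is reused many times while it sits in fast storage. The sketch below is not the paper's implementation; the function names `matmul_tiled` and `matmul_naive` and the `tile` parameter are hypothetical, and plain Python lists stand in for on-chip memory.

```python
def matmul_naive(a, b, n):
    """Reference triple loop for two n x n matrices stored as lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile=4):
    """Same product, computed one tile x tile block at a time.

    Assumption for illustration: `tile` is chosen so that three blocks
    (one each from a, b and c) fit in the fastest level of memory.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):          # block row of c
        for kk in range(0, n, tile):      # block along the shared dimension
            for jj in range(0, n, tile):  # block column of c
                # min(...) handles edge blocks when n is not a multiple of tile
                for i in range(ii, min(ii + tile, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

Both functions perform the same arithmetic; only the order in which elements are visited differs. On real hardware the tile size would be tuned to the capacity of the targeted memory level.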

