Authors: Ziang Hu, Juan del Cuvillo, Weirong Zhu, Guang R. Gao
DOI: 10.1007/11823285_14
Keywords: CPU cache, Memory hierarchy, Shared memory, Computer architecture, Parallel computing, Shared memory architecture, Multiplication, Memory bandwidth, Memory architecture, Sparse matrix, Matrix multiplication, Computer science
Abstract: This paper presents a study of performance optimization of dense matrix multiplication on the IBM Cyclops-64 (C64) chip architecture. Although much has been published on how to optimize applications on shared memory architectures with multi-level caches, little has been reported on the applicability of existing methods to the new generation of multi-core architectures like C64. For such architectures, a more economical use of on-chip storage resources appears to discourage the use of caches while providing tremendous bandwidth per area. This paper presents an in-depth case study of a collection of well-known optimization methods and tries to re-engineer them to address the challenges and opportunities provided by this emerging class of architectures. Our study demonstrates that efficiently exploiting the memory hierarchy is the key to achieving good performance. The main contributions include: (a) identifying a set of optimizations for C64-like architectures, and (b) exploring a practical order of optimizations, which yields good performance for matrix multiplication.