Unveiling Thread Communication Bottlenecks Using Hardware-Independent Metrics

Authors: Arya Mazaheri, Felix Wolf, Ali Jannesari

DOI: 10.1145/3225058.3225142

Abstract: A critical factor for developing robust shared-memory applications is the efficient use of the cache and the communication between threads. Inappropriate data structures, algorithm design, and inefficient thread affinity may result in superfluous communication between threads/cores and severe performance problems. For this reason, state-of-the-art profiling tools focus on thread communication and behavior to present different metrics that enable programmers to write cache-friendly programs. The data shared between a pair of threads should be reused with a reasonable distance to preserve data locality. However, existing tools do not take into account the locality of communication events, mainly analyzing the amount of communication instead. In this paper, we introduce a new method to analyze performance and communication bottlenecks that arise from the data-access patterns and thread interactions of each code region. We propose new hardware-independent metrics to characterize thread communication and provide suggestions for applying appropriate optimizations to a specific code region. We evaluated our approach on the SPLASH and Rodinia benchmark suites. Experimental results validate the effectiveness of our approach by finding communication locality issues due to inefficient data structures and/or poor algorithm implementations. By applying the suggested optimizations, we improved benchmark performance by up to 56%. Furthermore, by varying the input size, we demonstrated the ability of our method to assess the cache usage and scalability of a given application in terms of its inherent communication.

References (39)
Xin Yuan, Ahmad Faraj. Communication Characteristics in the NAS Parallel Benchmarks. IASTED PDCS, pp. 724-729 (2002).
Matthias Diener, Eduardo H. M. Cruz, Marco A. Z. Alves, Mohammad S. Alhakeem, Philippe O. A. Navaux, Hans-Ulrich Heiß. Locality and Balance for Communication-Aware Thread Mapping in Multicore Systems. Lecture Notes in Computer Science, pp. 196-208 (2015). DOI: 10.1007/978-3-662-48096-0_16
Zhen Li, Ali Jannesari, Felix Wolf. An Efficient Data-Dependence Profiler for Sequential and Parallel Programs. International Parallel and Distributed Processing Symposium, pp. 484-493 (2015). DOI: 10.1109/IPDPS.2015.41
Yunlian Jiang, Eddy Z. Zhang, Kai Tian, Xipeng Shen. Is reuse distance applicable to data locality analysis on chip multiprocessors? Compiler Construction, pp. 264-282 (2010). DOI: 10.1007/978-3-642-11970-5_15
Derek L. Schuff, Milind Kulkarni, Vijay S. Pai. Accelerating multicore reuse distance analysis with sampling and parallelization. International Conference on Parallel Architectures and Compilation Techniques, pp. 53-64 (2010). DOI: 10.1145/1854273.1854286
David Eklov, Erik Hagersten. StatStack: Efficient modeling of LRU caches. International Symposium on Performance Analysis of Systems and Software, pp. 55-65 (2010). DOI: 10.1109/ISPASS.2010.5452069
Arun Raman, Ayal Zaks, Jae W. Lee, David I. August. Parcae: a system for flexible parallel execution. Programming Language Design and Implementation, vol. 47, pp. 133-144 (2012). DOI: 10.1145/2254064.2254082
Matthias Diener, Eduardo H. M. Cruz, Laércio L. Pilla, Fabrice Dupros, Philippe O. A. Navaux. Characterizing Communication and Page Usage of Parallel Applications for Thread and Data Mapping. Performance Evaluation, vol. 88, pp. 18-36 (2015). DOI: 10.1016/J.PEVA.2015.03.001
Meng-Ju Wu, Minshu Zhao, Donald Yeung. Studying multicore processor scaling via reuse distance analysis. International Symposium on Computer Architecture, vol. 41, pp. 499-510 (2013). DOI: 10.1145/2485922.2485965