Unveiling Thread Communication Bottlenecks Using Hardware-Independent Metrics

Authors: Arya Mazaheri, Felix Wolf, Ali Jannesari

Abstract: A critical factor for developing robust shared-memory applications is the efficient use of the cache and of communication between threads. Inappropriate data structures, poor algorithm design, and inefficient thread affinity may result in superfluous communication between threads/cores and in severe performance problems. For this reason, state-of-the-art profiling tools focus on thread communication behavior and present different metrics that enable programmers to write cache-friendly programs. The data shared between a pair of threads should be reused within a reasonable distance to preserve data locality. However, existing tools do not take into account the locality of communication events, mainly analyzing the amount of communication instead. In this paper, we introduce a new method to analyze communication bottlenecks that arise from the data-access patterns and thread interactions of each code region. We propose new hardware-independent metrics to characterize thread communication and to provide suggestions for applying appropriate optimizations to a specific code region. We evaluated our approach on the SPLASH and Rodinia benchmark suites. Experimental results validate the effectiveness of our approach by finding communication locality issues due to inefficient data structures and/or poor algorithm implementations. By applying the suggested optimizations, we improved the performance of the benchmarks by up to 56%. Furthermore, by varying the input size, we demonstrated the ability of our method to assess the cache usage and scalability of a given application in terms of its inherent communication.



