Authors: Arya Mazaheri, Felix Wolf, Ali Jannesari
Keywords:
Abstract: A critical factor for developing robust shared-memory applications is the efficient use of the cache and the communication between threads. Inappropriate data structures, algorithm design, and inefficient thread affinity may result in superfluous communication between threads/cores and severe performance problems. For this reason, state-of-the-art profiling tools focus on thread communication and behavior to present different metrics that enable programmers to write cache-friendly programs. The data shared between a pair of threads should be reused with a reasonable distance to preserve data locality. However, existing tools do not take into account the locality of communication events and mainly focus on analyzing the amount of communication instead. In this paper, we introduce a new method to analyze performance and communication bottlenecks that arise from the data-access patterns and thread interactions of each code region. We propose new hardware-independent metrics to characterize thread communication and provide suggestions for applying appropriate optimizations to a specific code region. We evaluated our approach on the SPLASH and Rodinia benchmark suites. Experimental results validate the effectiveness of our approach by finding communication locality issues due to inefficient data structures and/or poor algorithm implementations. By applying the suggested optimizations, we improved the performance of the benchmarks by up to 56%. Furthermore, by varying the input size, we demonstrated the ability of our method to assess the cache usage and scalability of a given application in terms of its inherent communication.