A data-centric profiler for parallel programs

Authors: Xu Liu, John Mellor-Crummey

DOI: 10.1145/2503210.2503297

Keywords: Computer science, Distributed computing, Database-centric architecture, Profiling (computer programming), Locality, Latency (engineering), Scalability

Abstract: It is difficult to manually identify opportunities for enhancing data locality. To address this problem, we extended the HPCToolkit performance tools to support data-centric profiling of scalable parallel programs. Our tool uses hardware counters to directly measure memory access latency and attributes latency metrics to both variables and instructions. Different hardware counters provide insight into different aspects of data locality (or lack thereof). Unlike prior tools for data-centric analysis, our tool employs scalable measurement, analysis, and presentation methods that enable it to analyze the memory access behavior of scalable parallel programs with low runtime and space overhead. We demonstrate the utility of HPCToolkit's new data-centric analysis capabilities with case studies of five well-known benchmarks. In each benchmark, we identify performance bottlenecks caused by poor data locality and demonstrate non-trivial performance optimizations enabled by this guidance.
