Scalable communication event tracing via clustering

Authors: Amir Bahmani, Frank Mueller

DOI: 10.1016/J.JPDC.2017.06.008

Keywords: TRACE (psycholinguistics), Distributed computing, Scalability, Set (abstract data type), Partition coefficient, Computer science, Time complexity, Node (networking), Parallel computing, Cluster analysis

Abstract: Communication traces help developers of high-performance computing (HPC) applications understand and improve their codes. When run on large-scale HPC facilities, the scalability of tracing tools becomes a challenge. To address this problem, traces can be clustered into groups of processes that exhibit similar behavior. Instead of collecting trace information for each individual node, it then suffices to collect traces for a small set of representative nodes, namely one per cluster. However, clustering algorithms themselves need to have low overhead, be scalable, and adapt to application characteristics. We devised an adaptive clustering algorithm, called ACURDION, for tracing the MPI communication of a code with O(log P) time complexity. First, ACURDION identifies the parameters that differ across processes by using a logarithmic algorithm called Adaptive Signature Building. Second, ACURDION clusters the processes based on those parameters. Experiments show that traces of just nine nodes/clusters suffice to capture the communication behavior of all nodes for a wide set of HPC benchmark codes while retaining sufficient accuracy of trace events. In summary, ACURDION improves automation over prior approaches.
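
The two phases named in the abstract can be illustrated with a small Python sketch. This is not the ACURDION implementation: the function names, the dictionary-based parameter summaries, and the simulated in-process "ranks" are all assumptions made for illustration. Phase 1 uses a binary-tree reduction, which needs ceil(log2 P) merge rounds, to find the parameters that differ across ranks. Phase 2 here is a plain sequential grouping rather than a logarithmic one, and it picks the lowest rank of each group as the representative.

```python
from collections import defaultdict

DIFFERS = object()  # sentinel: this parameter takes more than one value


def merge(a, b):
    """Combine two per-parameter summaries, marking parameters that disagree."""
    return {
        k: a[k] if a[k] is not DIFFERS and a[k] == b[k] else DIFFERS
        for k in a
    }


def differing_parameters(params_by_rank):
    """Tree-reduce the summaries of all ranks (assumes at least one rank).

    Returns the sorted names of parameters that differ across ranks and
    the number of merge rounds used, which is ceil(log2 P).
    """
    summaries = [dict(p) for p in params_by_rank]
    n, stride, rounds = len(summaries), 1, 0
    while stride < n:
        # In a real MPI setting, the merges of one round would run in parallel.
        for i in range(0, n - stride, 2 * stride):
            summaries[i] = merge(summaries[i], summaries[i + stride])
        stride *= 2
        rounds += 1
    differing = sorted(k for k, v in summaries[0].items() if v is DIFFERS)
    return differing, rounds


def cluster(params_by_rank, differing):
    """Group ranks whose differing parameters match; key = representative rank."""
    groups = defaultdict(list)
    for rank, p in enumerate(params_by_rank):
        groups[tuple(p[k] for k in differing)].append(rank)
    return {members[0]: members for members in groups.values()}


# Hypothetical 1D stencil on 8 ranks: boundary ranks have one neighbor.
ranks = [
    {"count": 1024, "tag": 7, "neighbors": 1 if r in (0, 7) else 2}
    for r in range(8)
]
differing, rounds = differing_parameters(ranks)
clusters = cluster(ranks, differing)
```

In this made-up example only `neighbors` varies, so the eight ranks collapse into two clusters, and tracing one representative rank from each would stand in for the whole run.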

References (32)
J. Ziv, A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, vol. 23, pp. 337-343 (1977). DOI: 10.1109/TIT.1977.1055714
Xing Wu, Frank Mueller. ScalaExtrap: trace-based communication extrapolation for SPMD programs. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, vol. 46, pp. 113-122 (2011). DOI: 10.1145/1941553.1941569
Jidong Zhai, Wenguang Chen, Weimin Zheng. PHANTOM: predicting performance of parallel applications on large-scale parallel machines using a single node. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, vol. 45, pp. 305-314 (2010). DOI: 10.1145/1693453.1693493
Shigeru Ishizuki, Shuji Yamamura, Hiroaki Honda, Mutsumi Aoyagi, Hisashige Ando, Kazuaki J. Murakami, Koji Inoue, Yunqing Yu, Yuichi Inadomi, Hidetomo Shibamura, Ryutaro Susukita, Hidemi Komatsu, Motoyoshi Kurokawa, Yasunori Kimura. Performance prediction of large-scale parallell system and application using macro-level simulation. IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 20- (2008). DOI: 10.5555/1413370.1413391
Juan Gonzalez, Judit Gimenez, Jesus Labarta. Automatic detection of parallel applications computation phases. International Parallel and Distributed Processing Symposium, pp. 1-11 (2009). DOI: 10.1109/IPDPS.2009.5161027
Ian Karlin, Abhinav Bhatele, Jeff Keasler, Bradford L. Chamberlain, Jonathan Cohen, Zachary Devito, Riyaz Haque, Dan Laney, Edward Luke, Felix Wang, David Richards, Martin Schulz, Charles H. Still. Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application. International Parallel and Distributed Processing Symposium, pp. 919-932 (2013). DOI: 10.1109/IPDPS.2013.115
Sameer S. Shende, Allen D. Malony. The Tau Parallel Performance System. IEEE International Conference on High Performance Computing, Data, and Analytics, vol. 20, pp. 287-311 (2006). DOI: 10.1177/1094342006064482
German Llort, Juan Gonzalez, Harald Servat, Judit Gimenez, Jesus Labarta. On-line detection of large-scale parallel application's structure. International Parallel and Distributed Processing Symposium, pp. 1-10 (2010). DOI: 10.1109/IPDPS.2010.5470350
Michael D. Bond, Kathryn S. McKinley. Probabilistic calling context. Conference on Object-Oriented Programming Systems, Languages, and Applications, vol. 42, pp. 97-112 (2007). DOI: 10.1145/1297027.1297035
Xing Wu, Frank Mueller. Elastic and scalable tracing and accurate replay of non-deterministic events. Proceedings of the 27th International ACM Conference on Supercomputing (ICS '13), pp. 59-68 (2013). DOI: 10.1145/2464996.2465001