Lessons learned at 208K: towards debugging millions of cores

Authors: Bronis R. de Supinski, Ben Liblit, Matthew Legendre, Dorian C. Arnold, Dong H. Ahn

Keywords: Process (engineering), Stack trace, System software, Petascale computing, Distributed computing, Computer science, InfiniBand, File system, Debugging, Scalability

Abstract: Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and process application data. In addition, at such scales, each tool itself will become a large parallel application; already, debugging the full Blue-Gene/L (BG/L) installation at Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach such sizes and beyond, tools must use a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become tool bottlenecks. In this paper, we present challenges to petascale tool development, using the stack trace analysis tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an InfiniBand cluster and results up to 208K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present implemented solutions to these challenges and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.
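The merging step the abstract describes can be illustrated with a minimal sketch. It is not STAT's implementation: the function names, the dictionary-based tree, and the sample traces are assumptions made for illustration, and everything runs in one process rather than across tool daemons. It groups ranks whose stack traces are identical into equivalence classes and merges all traces into a call-path prefix tree.

```python
from collections import defaultdict

def group_by_trace(traces):
    """Map each distinct call path to the sorted list of ranks that share it.

    `traces` is a hypothetical input: {rank: [outermost_frame, ..., innermost_frame]}.
    """
    classes = defaultdict(list)
    for rank, frames in traces.items():
        classes[tuple(frames)].append(rank)
    return {path: sorted(ranks) for path, ranks in classes.items()}

def merge_into_prefix_tree(traces):
    """Merge all traces into one tree; each node records the ranks passing through it."""
    root = {"ranks": set(), "children": {}}
    for rank, frames in traces.items():
        node = root
        node["ranks"].add(rank)
        for frame in frames:
            node = node["children"].setdefault(
                frame, {"ranks": set(), "children": {}}
            )
            node["ranks"].add(rank)
    return root

# Made-up sample data: three ranks wait in a barrier, one is still computing.
sample = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "compute"],
    3: ["main", "solve", "MPI_Barrier"],
}

classes = group_by_trace(sample)
tree = merge_into_prefix_tree(sample)

for path, ranks in classes.items():
    print(" > ".join(path), ranks)
```

With a grouping like this, a debugger only needs to be attached to one representative rank per class, which is the motivation for identifying equivalence classes at large process counts.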

