Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols

Authors: George Bosilca, Aurelien Bouteiller, Thomas Herault, Pierre Lemarinier, Jack J. Dongarra

DOI: 10.1007/978-3-642-15646-5_20

Keywords: Message logging, Computer network, Network performance, Engineering, Fault tolerance, Distributed computing, Payload (computing), Logging, Scalability

Abstract: With the number of computing elements spiraling to hundreds of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault tolerant; most are in need of a seamless recovery framework. Among the automatic fault-tolerant techniques proposed for MPI, message logging is preferable for its scalable recovery. The major challenge for message logging protocols is the performance penalty on communications during failure-free periods, mostly coming from the payload copy introduced for each message. In this paper, we investigate different approaches and compare their impact on network performance.
