SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing

Authors: Thomas Ropars, Tatiana V. Martsinkevich, Amina Guermouche, André Schiper, Franck Cappello

DOI: 10.1145/2503210.2503271

Abstract: The high failure rate expected for future supercomputers requires the design of new fault-tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: we identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation, called the always-happens-before relation, between events of such applications. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, combines coordinated checkpointing and message logging in a hierarchical way. It is the first protocol that provides failure containment without logging any information reliably apart from process checkpoints, and it does so without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate the good performance of our protocol during both failure-free execution and recovery.
