Authors: Thomas Ropars, Tatiana V. Martsinkevich, Amina Guermouche, André Schiper, Franck Cappello
Keywords:
Abstract: The high failure rate expected for future supercomputers requires the design of new fault-tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: we identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation, called always-happens-before, between events of such applications. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, combines coordinated checkpointing and message logging in a hierarchical way. It is the first protocol that provides failure containment without logging any information reliably apart from process checkpoints, and it does so without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate good performance of our protocol during both failure-free execution and recovery.