Authors: George Bosilca, Aurelien Bouteiller, Thomas Herault, Pierre Lemarinier, Jack J. Dongarra
DOI: 10.1007/978-3-642-15646-5_20
Keywords: Message logging, Computer network, Network performance, Engineering, Fault tolerance, Distributed computing, Payload (computing), Logging, Scalability
Abstract: With the number of computing elements spiraling to hundreds of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault tolerant; most are in need of a seamless recovery framework. Among the automatic fault-tolerant techniques proposed for MPI, message logging is preferable for its scalable recovery. The major challenge for message logging protocols is the performance penalty on communications during failure-free periods, mostly coming from the payload copy introduced for each message. In this paper, we investigate different approaches to logging the payload and compare their impact on network performance.