Authors: Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Vishal Sahay, Andrew Lumsdaine
Keywords: Scheduling (computing), Scalability, Fault tolerance, Computer science, LAM/MPI, Distributed computing
Abstract: As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.