The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing

Authors: Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Vishal Sahay, Andrew Lumsdaine

DOI: 10.1177/1094342005056139

Keywords: Scheduling (computing), Scalability, Fault tolerance, Computer science, LAM/MPI, Distributed computing

Abstract: As high performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.

References (43)
James Vaigl, Greg Burns, Raja Daoud, LAM: An Open Cluster Environment for MPI (2002)
Augusto Ciuffoletti, Luca Simoncini, D. Briatico, A Distributed Domino-Effect Free Recovery Algorithm, Symposium on Reliability in Distributed Software and Database Systems, pp. 207-215 (1984)
MPI Forum, MPI: A Message-Passing Interface, Oregon Graduate Institute School of Science & Engineering (1994)
William Gropp, The MPI-2 Extensions, MIT Press (1998)
Michael Litzkow, Marvin Solomon, The Evolution of Condor Checkpointing, International Conference on Mobile Technology, Applications, and Systems, pp. 163-164 (1999)
Sam Toueg, Richard Koo, Checkpointing and Rollback-Recovery for Distributed Systems, Fall Joint Computer Conference, pp. 1150-1158 (1986), DOI: 10.5555/324493.325074
Frederick Douglis, Richard Wheeler, Dejan Milojičić, Mobility: Processes, Computers, and Agents (1999)
William Gropp, Ewing Lusk, Anthony Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface (1994)