The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing

Authors: Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Vishal Sahay, Andrew Lumsdaine

DOI: 10.1177/1094342005056139

Keywords: Scheduling (computing), Scalability, Fault tolerance, Computer science, LAM/MPI, Distributed computing

Abstract: As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.


