A Distributed Scheme for Fault-Tolerance in Large Clusters of Workstations

Authors: Angelo Duarte, Dolores Rexachs, Emilio Luque

References (10)
Graham E. Fagg, Jack J. Dongarra. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World. European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 346-353 (2000). DOI: 10.1007/3-540-45255-9_47
Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, Timothy S. Woodall. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. Lecture Notes in Computer Science, pp. 97-104 (2004). DOI: 10.1007/978-3-540-30218-6_19
S. Rao, L. Alvisi, H.M. Vin. Egida: An Extensible Toolkit for Low-Overhead Fault-Tolerance. IEEE International Symposium on Fault Tolerant Computing, pp. 48-55 (1999). DOI: 10.1109/FTCS.1999.781033
Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Vishal Sahay, Andrew Lumsdaine, Jason Duell, Paul Hargrove, Eric Roman. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing. IEEE International Conference on High Performance Computing Data and Analytics, vol. 19, pp. 479-493 (2005). DOI: 10.1177/1094342005056139
Elmootazbellah Nabil Elnozahy, Lorenzo Alvisi, Yi-Min Wang, David B. Johnson. A Survey of Rollback-Recovery Protocols in Message-Passing Systems. ACM Computing Surveys, vol. 34, pp. 375-408 (2002). DOI: 10.1145/568522.568525
Soulla Louca, Neophytos Neophytou, Adrianos Lachanas, Paraskevas Evripidou. MPI-FT: Portable Fault Tolerance Scheme for MPI. Parallel Processing Letters, vol. 10, pp. 371-382 (2000). DOI: 10.1142/S0129626400000342
K. Mani Chandy, Leslie Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, vol. 3, pp. 63-75 (1985). DOI: 10.1145/214451.214456
Bouteiller, Lemarinier, Krawezik, Cappello. Coordinated Checkpoint versus Message Log for Fault Tolerant MPI. International Conference on Cluster Computing, pp. 242-250 (2003). DOI: 10.1109/CLUSTR.2003.1253321
G. Stellner. CoCheck: Checkpointing and Process Migration for MPI. International Conference on Parallel Processing, pp. 526-531 (1996). DOI: 10.1109/IPPS.1996.508106
Adnan Agbaria, Roy Friedman. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations. Cluster Computing, vol. 6, pp. 227-236 (2003). DOI: 10.1023/A:1023540604208