Toward Message Passing Failure Management

Author: Wesley B. Bland

Keywords: Construct (python library), Distributed computing, Protocol (object-oriented programming), Supercomputer, Message Passing Interface, Programming paradigm, Message passing, Fault tolerance, Computer science, Overhead (computing)

Abstract: As machine sizes have increased and application runtimes have lengthened, research into fault tolerance has evolved alongside. Moving from result checking, to rollback recovery, and to algorithm-based fault tolerance, the type of recovery being performed has changed, but the programming model in which it executes has remained virtually static since the publication of the original Message Passing Interface (MPI) Standard in 1992. Since that time, applications have used a message passing paradigm to communicate between processes, but they could not perform process recovery within an MPI implementation due to limitations of the MPI Standard. This dissertation describes a new protocol using the existing MPI Standard, called Checkpoint-on-Failure, to perform limited fault tolerance within the current framework of MPI, and proposes a new platform titled User Level Failure Mitigation (ULFM) to build more complete and complex fault tolerance solutions with a true fault-tolerant MPI implementation. We will demonstrate the overhead involved in using these fault-tolerant solutions and give examples of applications and libraries which construct other fault tolerance mechanisms based on the constructs provided in ULFM.
