Author: Wesley B. Bland
DOI:
Keywords: Construct (python library), Distributed computing, Protocol (object-oriented programming), Supercomputer, Message Passing Interface, Programming paradigm, Message passing, Fault tolerance, Computer science, Overhead (computing)
Abstract: As machine sizes have increased and application runtimes have lengthened, research into fault tolerance has evolved alongside. Moving from result checking, to rollback recovery, and to algorithm-based fault tolerance, the type of recovery being performed has changed, but the programming model in which it executes has remained virtually static since the publication of the original Message Passing Interface (MPI) Standard in 1992. Since that time, applications have used a message passing paradigm to communicate between processes, but they could not perform process recovery within an MPI implementation due to limitations of the MPI Standard. This dissertation describes a new protocol using the existing MPI Standard, called Checkpoint-on-Failure, to perform limited fault tolerance within the current framework of MPI, and proposes a new platform titled User Level Failure Mitigation (ULFM) to build more complete and complex fault tolerance solutions with a true fault tolerant MPI implementation. We will demonstrate the overhead involved in using these fault tolerant solutions and give examples of applications and libraries which construct other fault tolerance mechanisms based on the constructs provided in ULFM.