Authors: Sam Toueg, Richard Koo
Keywords:
Abstract: We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state. In contrast to previous algorithms, they tolerate failures that occur during their executions. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a minimal number of additional processes are forced to roll back with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions.