Checkpointing and rollback-recovery for distributed systems

Authors: Sam Toueg, Richard Koo

Abstract: We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state. In contrast to previous algorithms, they tolerate failures that occur during their executions. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a minimal number of additional processes are forced to roll back with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions.
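
To make the idea of additional processes being forced to take checkpoints more concrete, here is a minimal sketch. It is not the authors' protocol, only an illustration under stated assumptions. It assumes a hypothetical dependency map `received_from`, where `received_from[p]` lists the processes whose messages `p` has received since those senders last checkpointed, and it collects every process reachable from the initiator through that map. The names `forced_checkpoint_set` and `received_from` are invented for this example; the abstract does not specify any data structures or message formats.

```python
from collections import deque


def forced_checkpoint_set(initiator, received_from):
    """Return the set of processes that checkpoint together with `initiator`.

    Assumption (not from the abstract): received_from[p] lists the senders
    whose messages p has received since those senders last checkpointed.
    If p records such a receipt in its checkpoint, the sender must record
    the matching send, so the sender is forced to checkpoint as well.
    """
    forced = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        for q in received_from.get(p, ()):
            if q not in forced:
                forced.add(q)
                queue.append(q)
    return forced


# Hypothetical four-process example: P1 received from P2, P2 from P3,
# and P4 received from P1.
deps = {"P1": ["P2"], "P2": ["P3"], "P3": [], "P4": ["P1"]}
print(sorted(forced_checkpoint_set("P1", deps)))  # ['P1', 'P2', 'P3']
```

In this example P4 is not forced to checkpoint: it has received a message from P1, but P1 has received nothing from P4, so P1's new checkpoint never records a receipt whose send P4 would have to record.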

References (13)

B. Randell, P. Lee, P. C. Treleaven. Reliability Issues in Computing System Design. ACM Computing Surveys, vol. 10, pp. 123-165 (1978). doi:10.1145/356725.356729

David L. Russell. Process backup in producer-consumer systems. ACM SIGOPS Operating Systems Review, vol. 11, pp. 151-157 (1977). doi:10.1145/1067625.806558

R. Koo, S. Toueg. Checkpointing and Rollback-Recovery for Distributed Systems. IEEE Transactions on Software Engineering, vol. 13, pp. 23-31 (1987). doi:10.1109/TSE.1987.232562

Thomas A. Joseph, Kenneth P. Birman. Low cost management of replicated data in fault-tolerant distributed systems. ACM Transactions on Computer Systems, vol. 4, pp. 54-70 (1986). doi:10.1145/6306.6309

Cynthia Dwork, Dale Skeen. The inherent cost of nonblocking commitment. Principles of Distributed Computing, pp. 1-11 (1983). doi:10.1145/800221.806705

M. J. Fischer, N. D. Griffeth, N. A. Lynch. Global States of a Distributed System. IEEE Transactions on Software Engineering, vol. 8, pp. 198-202 (1982). doi:10.1109/TSE.1982.235418

Vassos Hadzilacos. An algorithm for minimizing roll back cost. Proceedings of the 1st ACM SIGACT-SIGMOD Symposium on Principles of Database Systems (PODS '82), pp. 93-97 (1982). doi:10.1145/588111.588128

Rob Strom, Shaula Yemini. Optimistic recovery in distributed systems. ACM Transactions on Computer Systems, vol. 3, pp. 204-226 (1985). doi:10.1145/3959.3962

Michael L. Powell, David L. Presotto. Publishing: a reliable broadcast communication mechanism. Symposium on Operating Systems Principles, vol. 17, pp. 100-109 (1983). doi:10.1145/773379.806618

K. Mani Chandy, Leslie Lamport. Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems, vol. 3, pp. 63-75 (1985). doi:10.1145/214451.214456
