Checkpointing and rollback-recovery for distributed systems

Authors: Sam Toueg, Richard Koo

DOI: 10.5555/324493.325074

Abstract: We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state. In contrast to previous algorithms, they tolerate failures that occur during their executions. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a minimal number of additional processes are forced to roll back with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions.

References (13)
B. Randell, P. Lee, P. C. Treleaven. Reliability Issues in Computing System Design. ACM Computing Surveys, vol. 10, pp. 123-165 (1978). DOI: 10.1145/356725.356729
David L. Russell. Process backup in producer-consumer systems. ACM SIGOPS Operating Systems Review, vol. 11, pp. 151-157 (1977). DOI: 10.1145/1067625.806558
R. Koo, S. Toueg. Checkpointing and Rollback-Recovery for Distributed Systems. IEEE Transactions on Software Engineering, vol. 13, pp. 23-31 (1987). DOI: 10.1109/TSE.1987.232562
Thomas A. Joseph, Kenneth P. Birman. Low cost management of replicated data in fault-tolerant distributed systems. ACM Transactions on Computer Systems, vol. 4, pp. 54-70 (1986). DOI: 10.1145/6306.6309
Cynthia Dwork, Dale Skeen. The inherent cost of nonblocking commitment. Principles of Distributed Computing, pp. 1-11 (1983). DOI: 10.1145/800221.806705
M.J. Fischer, N.D. Griffeth, N.A. Lynch. Global States of a Distributed System. IEEE Transactions on Software Engineering, vol. 8, pp. 198-202 (1982). DOI: 10.1109/TSE.1982.235418
Vassos Hadzilacos. An algorithm for minimizing roll back cost. Proceedings of the 1st ACM SIGACT-SIGMOD Symposium on Principles of Database Systems - PODS '82, pp. 93-97 (1982). DOI: 10.1145/588111.588128
Rob Strom, Shaula Yemini. Optimistic recovery in distributed systems. ACM Transactions on Computer Systems, vol. 3, pp. 204-226 (1985). DOI: 10.1145/3959.3962
Michael L. Powell, David L. Presotto. Publishing: a reliable broadcast communication mechanism. Symposium on Operating Systems Principles, vol. 17, pp. 100-109 (1983). DOI: 10.1145/773379.806618
K. Mani Chandy, Leslie Lamport. Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems, vol. 3, pp. 63-75 (1985). DOI: 10.1145/214451.214456