Experimental assessment of workstation failures and their impact on checkpointing systems

Authors: J.S. Plank, W.R. Elwasif

DOI: 10.1109/FTCS.1998.689454

Keywords: Local area network, Condition monitoring, Rollback recovery, Computer science, System recovery, Experimental research, Running time, Distributed computing, Theoretical research, Workstation

Abstract: In the past twenty years, there has been a wealth of theoretical research on minimizing the expected running time of a program in the presence of failures by employing checkpointing and rollback recovery. In the same period, little experimental research has been performed to corroborate these theoretical results. We study three separate projects that monitor failure in workstation networks. Our goals are twofold. The first is to see how these results correlate with the theoretical results, and the second is to assess their impact on strategies for checkpointing long-running computations on workstations and networks of workstations. A significant result of our work is that although the base assumptions of the theoretical research do not hold, many of the results are still applicable.

References (26)
James S. Plank, Kai Li, Micah Beck, Gerry Kingsley. Libckpt: transparent checkpointing under Unix. USENIX Annual Technical Conference, pp. 18-18 (1995).
C. M. Krishna, Yann-Hang Lee, Kang G. Shin. Optimization criteria for checkpoint placement. Communications of the ACM, vol. 27, pp. 1008-1012 (1984). DOI: 10.1145/358274.358282
V.S. Sunderam, G.A. Geist, J. Dongarra, R. Manchek. The PVM concurrent computing system: evolution, experiences, and trends. Parallel Computing, vol. 20, pp. 531-545 (1994). DOI: 10.1016/0167-8191(94)90027-2
Erol Gelenbe. On the Optimum Checkpoint Interval. Journal of the ACM, vol. 26, pp. 259-270 (1979). DOI: 10.1145/322123.322131
Matt W. Mutka, Miron Livny. The available capacity of a privately owned workstation environment. Performance Evaluation, vol. 12, pp. 269-284 (1991). DOI: 10.1016/0166-5316(91)90005-N
N.H. Vaidya. Impact of checkpoint latency on overhead ratio of a checkpointing scheme. IEEE Transactions on Computers, vol. 46, pp. 942-947 (1997). DOI: 10.1109/12.609281
E. Gelenbe, D. Derochette. Performance of rollback recovery systems under intermittent failures. Communications of the ACM, vol. 21, pp. 493-499 (1978). DOI: 10.1145/359511.359531
Andrzej Duda. The effects of checkpointing on program execution time. Information Processing Letters, vol. 16, pp. 221-229 (1983). DOI: 10.1016/0020-0190(83)90093-5
Larry H. Crow, Nozer D. Singpurwalla. An Empirically Developed Fourier Series Model for Describing Software Failures. IEEE Transactions on Reliability, vol. R-33, pp. 176-183 (1984). DOI: 10.1109/TR.1984.5221770
John W. Young. A first order approximation to the optimum checkpoint interval. Communications of the ACM, vol. 17, pp. 530-531 (1974). DOI: 10.1145/361147.361115