libhashckpt: hash-based incremental checkpointing using GPU's

Authors: Kurt B. Ferreira, Rolf Riesen, Ron Brightwell, Patrick Bridges, Dorian Arnold


Abstract: Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpointing is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real workloads, we show the merit of this technique for a certain class of applications.
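The hashing half of the idea in the abstract can be illustrated with a minimal sketch. This is not libhashckpt's API: the function names, the 4096-byte block size, and the use of SHA-256 on the CPU are all assumptions made for illustration, and the sketch omits both the page-protection step and the GPU offload the abstract describes. It only shows how per-block hashes let a checkpointer save the blocks that actually changed.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size (a common page size), not taken from the paper


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def incremental_checkpoint(data, prev_hashes, block_size=BLOCK_SIZE):
    """Compare current block hashes against the previous checkpoint's hashes.

    Returns (changed, new_hashes), where `changed` maps a block index to a
    copy of that block's bytes for every block whose hash differs from, or
    is missing in, `prev_hashes`.
    """
    new_hashes = block_hashes(data, block_size)
    changed = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            start = idx * block_size
            changed[idx] = bytes(data[start:start + block_size])
    return changed, new_hashes


# Hypothetical application state: four zero-filled blocks.
state = bytearray(4 * BLOCK_SIZE)

# First checkpoint: no previous hashes, so every block is saved.
first_changed, hashes = incremental_checkpoint(state, [])

# Modify a single byte inside block 1, then checkpoint again.
state[BLOCK_SIZE + 10] = 1
second_changed, hashes = incremental_checkpoint(state, hashes)

print(sorted(second_changed))  # [1]
```

Only block 1 is written in the second checkpoint, even though the whole buffer was rehashed. Hashing every block on the CPU, as done here, is the expensive part of this scheme; the abstract's use of page protection and GPUs targets that cost.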

