libhashckpt: hash-based incremental checkpointing using GPU's

Authors: Kurt B. Ferreira, Rolf Riesen, Ron Brightwell, Patrick Bridges, Dorian Arnold

DOI: 10.1007/978-3-642-24449-0_31

Abstract: Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real workloads, we show the merit of this technique for a certain class of applications.

References (17)
James S. Plank, Kai Li, Micah Beck, Gerry Kingsley. Libckpt: transparent checkpointing under Unix. USENIX Annual Technical Conference, pp. 18-18 (1995).
Alfred J. Menezes, Paul C. van Oorschot, Scott A. Vanstone. Handbook of Applied Cryptography (1996).
E. N. Elnozahy. How safe is probabilistic checkpointing? IEEE International Symposium on Fault-Tolerant Computing, pp. 358-363 (1998). DOI: 10.1109/FTCS.1998.689486
Steve Plimpton. Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics, vol. 117, pp. 1-19 (1995). DOI: 10.1006/JCPH.1995.1039
J. S. Plank, Kai Li. ickp: a consistent checkpointer for multicomputers. IEEE Parallel & Distributed Technology: Systems & Applications, vol. 2, pp. 62-67 (1994). DOI: 10.1109/88.311574
Yuqun Chen, James S. Plank, Kai Li. CLIP: A Checkpointing Tool for Message Passing Parallel Programs. Conference on High Performance Computing (Supercomputing), pp. 1-11 (1997). DOI: 10.1145/509593.509626
Stuart I. Feldman, Channing B. Brown. IGOR: a system for program debugging via reversible execution. Workshop on Parallel & Distributed Debugging, vol. 24, pp. 112-123 (1988). DOI: 10.1145/68210.69226
Elmootazbellah Nabil Elnozahy, Lorenzo Alvisi, Yi-Min Wang, David B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, vol. 34, pp. 375-408 (2002). DOI: 10.1145/568522.568525
E. S. Hertel, R. L. Bell, M. G. Elrick, A. V. Farnsworth, G. I. Kerley, J. M. McGlaun, S. V. Petney, S. A. Silling, P. A. Taylor, L. Yarrington. CTH: A Software Family for Multi-Dimensional Shock Physics Analysis. Shock Waves @ Marseille I, pp. 377-382 (1995). DOI: 10.1007/978-3-642-78829-1_61
Hyo-chang Nam, Jong Kim, Sung Je Hong, Sunggu Lee. A secure checkpointing system. Pacific Rim International Symposium on Dependable Computing, pp. 49-56 (2001). DOI: 10.1109/PRDC.2001.992679