Authors: Kurt B. Ferreira, Rolf Riesen, Ron Brightwell, Patrick Bridges, Dorian Arnold
DOI: 10.1007/978-3-642-24449-0_31
Keywords:
Abstract: Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads, we show the merit of this technique for a certain class of HPC applications.
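
The hybrid idea named in the abstract can be illustrated with a minimal sketch. This is not libhashckpt's implementation or API; every name below (`PAGE_SIZE`, `page_hashes`, `changed_pages`, `touched`) is hypothetical. The sketch assumes that page protection yields a set of page indices that were written since the last checkpoint (simulated here by the `touched` argument), and that hashing then filters out pages whose contents did not actually change. SHA-256 on the CPU stands in for whatever hash the library computes on GPUs.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical constant)


def page_hashes(data, page_size=PAGE_SIZE):
    """Return one SHA-256 digest per fixed-size page of `data`."""
    return [
        hashlib.sha256(data[off:off + page_size]).digest()
        for off in range(0, len(data), page_size)
    ]


def changed_pages(data, old_hashes, touched, page_size=PAGE_SIZE):
    """Among pages flagged as written, keep only those whose content differs.

    `touched` simulates the dirty-page set that page protection would
    produce; it is an assumption of this sketch, not a real tracking layer.
    """
    changed = []
    for idx in sorted(touched):
        start = idx * page_size
        digest = hashlib.sha256(data[start:start + page_size]).digest()
        # A page with no previous hash is treated as changed.
        if idx >= len(old_hashes) or digest != old_hashes[idx]:
            changed.append(idx)
    return changed


# Usage: four zeroed pages, two of which are written before the next checkpoint.
state = bytearray(4 * PAGE_SIZE)
baseline = page_hashes(bytes(state))
state[PAGE_SIZE + 10] = 7  # page 1: content really changes
state[3 * PAGE_SIZE] = 0   # page 3: written, but with the same value
print(changed_pages(bytes(state), baseline, touched={1, 3}))  # prints [1]
```

In this sketch, only page 1 would need to be saved in the next incremental checkpoint: page 3 was written but its hash matches the baseline, which is the kind of redundant write that page protection alone cannot rule out.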