Understanding Checkpointing Overheads on Massive-Scale Systems: Analysis of the IBM Blue Gene/P System

Authors: Rinku Gupta, Harish Naik, Pete Beckman

DOI: 10.1177/1094342010369118

Keywords: Grid, Input/output, Distributed computing, Overhead (business), IBM, Software, Systems analysis, Operating system, Computer science, Task (computing), Fault tolerance, Petascale computing

Abstract: Providing fault tolerance in high-end petascale systems, consisting of millions of hardware components and complex software stacks, is becoming an increasingly challenging task. Checkpointing continues to be the most prevalent technique for providing fault tolerance in such systems. Considerable research has focussed on optimizing checkpointing; however, in practice, checkpointing still involves a high-cost overhead for users. In this paper, we study the checkpointing overhead seen by various applications running on leadership-class machines like the IBM Blue Gene/P at Argonne National Laboratory. In addition to studying popular applications, we design a methodology to help users understand and intelligently choose an optimal checkpointing frequency to reduce the overall checkpointing overhead incurred. In particular, we study the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, the Nek5000 computational fluid dynamics application, and the Parallel Ocean Program application, and analyze their memory usage and possible checkpointing trends on 65,536 processors of the Blue Gene/P system.
