Authors: Rinku Gupta, Harish Naik, Pete Beckman
Keywords: Grid, Input/output, Distributed computing, Overhead (business), IBM, Software, Systems analysis, Operating system, Computer science, Task (computing), Fault tolerance, Petascale computing
Abstract: Providing fault tolerance in high-end petascale systems, consisting of millions of hardware components and complex software stacks, is becoming an increasingly challenging task. Checkpointing continues to be the most prevalent technique for providing fault tolerance in such systems. Considerable research has focused on optimizing checkpointing; however, in practice, checkpointing still involves a high-cost overhead for users. In this paper, we study the checkpointing overhead seen by various applications running on leadership-class machines like the IBM Blue Gene/P at Argonne National Laboratory. In addition to studying popular applications, we design a methodology to help users understand and intelligently choose an optimal checkpointing frequency to reduce the overall checkpointing overhead incurred. In particular, we study four applications (the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, the Nek5000 computational fluid dynamics application, and the Parallel Ocean Program application) and analyze their memory usage and possible checkpointing trends on 65,536 processors of the Blue Gene/P system.