Granularity and the cost of error recovery in resilient AMR scientific applications

Authors: Devesh Tiwari, Hajime Fujita, Daniel T. Graves, Anshu Dubey, Andrew Chien

Abstract: Supercomputing platforms are expected to have larger failure rates in the future because of scaling and power concerns. The memory and performance impact may vary with error types and failure modes. Therefore, localized recovery schemes will be important for scientific computations, including failure modes where application intervention is suitable for recovery. We present a resiliency methodology for applications using structured adaptive mesh refinement, where failure modes map to granularities within the application for detection and correction. This approach also enables parameterization of cost for differentiated recovery. The cost model is built using tuning parameters that can be used to customize the strategy for different failure rates in different computing environments. We also show that this approach can make recovery cost proportional to the failure rate.
