Granularity and the cost of error recovery in resilient AMR scientific applications

Authors: Devesh Tiwari, Hajime Fujita, Daniel T. Graves, Anshu Dubey, Andrew Chien

DOI: 10.5555/3014904.3014961

Abstract: Supercomputing platforms are expected to have larger failure rates in the future because of scaling and power concerns. The memory and performance impact may vary with error types and failure modes. Therefore, localized recovery schemes will be important for scientific computations, including failure modes where application intervention is suitable for recovery. We present a resiliency methodology for applications using structured adaptive mesh refinement, where failure modes map to granularities within the application for detection and correction. This approach also enables parameterization of cost for differentiated recovery. The cost model is built using tuning parameters that can be used to customize the strategy for different failure rates in different computing environments. We also show that this approach can make recovery cost proportional to the failure rate.

References (23)
H. S. Johansen, P. W. McCorquodale, T. J. Ligocki, N. D. Keen, P. Colella, D. F. Martin, D. T. Graves, P. O. Schwartz, J. N. Johnson, B. Van Straalen, T. D. Sternberg, D. Modiano, M. Adams. Chombo Software Package for AMR Applications Design Document. (2014).
Ziming Zheng, Andrew A. Chien, Keita Teranishi. Fault Tolerance in an Inner-Outer Solver: A GVR-Enabled Case Study. Lecture Notes in Computer Science, vol. 8969, pp. 124-132 (2015). DOI: 10.1007/978-3-319-17353-5_11
Hajime Fujita, Nan Dun, Zachary A. Rubenstein, Andrew A. Chien. Log-structured global array for efficient multi-version snapshots. IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 281-291 (2015). DOI: 10.1109/CCGRID.2015.80
Saurabh Gupta, Devesh Tiwari, Christopher Jantzi, James Rogers, Don Maxwell. Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems. 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 37-44 (2015). DOI: 10.1109/DSN.2015.52
Devesh Tiwari, Saurabh Gupta, George Gallarno, Jim Rogers, Don Maxwell. Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility. IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 38- (2015). DOI: 10.1145/2807591.2807666
Guoming Lu, Ziming Zheng, Andrew A. Chien. When is multi-version checkpointing needed? Proceedings of the 3rd Workshop on Fault-tolerance for HPC at Extreme Scale, pp. 49-56 (2013). DOI: 10.1145/2465813.2465821
Marc Gamell, Keita Teranishi, Michael A. Heroux, Jackson Mayo, Hemanth Kolla, Jacqueline Chen, Manish Parashar. Local recovery and failure masking for stencil-based applications at extreme scales. IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 70- (2015). DOI: 10.1145/2807591.2807672
Clark Mobarry, Peter MacNeice, Kevin M. Olson, Charles Packer, Rosalinda de Fainchtein. Paramesh: A Parallel Adaptive Mesh Refinement Community Toolkit. (2013).
Adam Moody, Greg Bronevetsky, Kathryn Mohror, Bronis R. de Supinski. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System. IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 1-11 (2010). DOI: 10.1109/SC.2010.18