ABFR: convenient management of latent error resilience using application knowledge

Authors: Aiman Fang, Andrew A. Chien

DOI: 10.1145/3208040.3208046

Keywords: Stencil, Programming complexity, Distributed computing, Exploit, Latency (engineering), Computation, Generality, Scalability, Computer science, Memory errors

Abstract: Exascale systems face high error rates due to increasing scale (10^9 cores), software complexity, and rising memory error rates. Increasingly, errors escape immediate hardware-level detection, silently corrupting application states. Such latent errors can often be detected by application-level tests, but typically at long latencies. We propose a new approach called application-based focused recovery (ABFR) that defines the application knowledge needed for efficient latent error recovery. This allows application designers to pursue strategies exploiting a range of application semantics within a well-defined resilience framework. The ABFR runtime then exploits this knowledge to achieve efficient latent error tolerance. ABFR enables application designers to express resilience without concern for underlying architectures and systems. Together, these properties support flexible application-based resilience. To demonstrate its generality, we apply ABFR to three varied scientific computations (stencil, N-Body tree, Monte Carlo). We measure ABFR performance at varying error rates; results indicate significant reductions in error recovery cost (up to 367x) and recovery latency (up to 24x). ABFR also achieves scalable resilience with high error rates for these computations.
