Assessing the Impact of Partial Verifications against Silent Data Corruptions

Authors: Aurelien Cavelan, Saurabh K. Raina, Yves Robert, Hongyang Sun

DOI: 10.1109/ICPP.2015.53

Abstract: Silent errors, or silent data corruptions, constitute a major threat on very large scale platforms. When a silent error strikes, it is not detected immediately but only after some delay, which prevents the use of pure periodic checkpointing approaches devised for fail-stop errors. Instead, checkpointing must be coupled with a verification mechanism to guarantee that corrupted data will never be written into the checkpoint file. Such a guaranteed verification mechanism typically incurs a high cost. In this paper, we assess the impact of using partial verification mechanisms in addition to a guaranteed verification. The main objective is to investigate to what extent it is worthwhile to use light-cost but less accurate verifications in the middle of a periodic computing pattern, which ends with a guaranteed verification right before each checkpoint. Introducing partial verifications dramatically complicates the analysis, but we are able to analytically determine the optimal computing pattern (up to first-order approximation), including the optimal length of the pattern, the optimal number of partial verifications, as well as their optimal positions inside the pattern. Performance evaluations based on a wide range of parameters confirm the benefit of partial verifications under certain scenarios, when compared to the baseline algorithm that uses only guaranteed verifications.
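
The computing pattern described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's analytical model: every name and parameter below (`W`, `m`, `lam`, `V`, `r`, `Vg`, `C`, `R`) is hypothetical, the partial verifications are placed at equal intervals (whereas the paper optimizes their positions), and silent errors are assumed to strike only during computation, never during verification, checkpointing, or recovery.

```python
import math
import random


def simulate_pattern(W, m, lam, V, r, Vg, C, R, rng):
    """Time to complete one pattern successfully (illustrative model only).

    W: work per pattern, m: number of equal segments, lam: silent error rate,
    V / r: cost and recall of a partial verification (assumed values),
    Vg: cost of the guaranteed verification, C: checkpoint cost,
    R: recovery cost paid when an error is detected.
    """
    seg = W / m
    # Probability that at least one silent error strikes during one segment.
    p_err = 1.0 - math.exp(-lam * seg)
    elapsed = 0.0
    while True:
        corrupted = False
        detected = False
        for i in range(m):
            elapsed += seg
            if rng.random() < p_err:
                corrupted = True
            last = i == m - 1
            # The last verification is guaranteed (recall 1); others are partial.
            elapsed += Vg if last else V
            recall = 1.0 if last else r
            if corrupted and rng.random() < recall:
                detected = True
                break
        if detected:
            # Roll back to the previous checkpoint and redo the whole pattern.
            elapsed += R
            continue
        # No corruption reached the end of the pattern: safe to checkpoint.
        elapsed += C
        return elapsed


def mean_pattern_time(W, m, lam, V, r, Vg, C, R, trials=20000, seed=0):
    """Average pattern completion time over many simulated runs."""
    rng = random.Random(seed)
    total = sum(
        simulate_pattern(W, m, lam, V, r, Vg, C, R, rng) for _ in range(trials)
    )
    return total / trials
```

Dividing `mean_pattern_time(...)` by `W` gives a rough execution overhead for a chosen `m`, which can be compared against `m = 1` (guaranteed verification only) to see when cheap partial verifications might pay off under these assumptions.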
