Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors

Authors: Anne Benoit, Aurélien Cavelan, Florina M. Ciorba, Valentin Le Fèvre, Yves Robert

DOI: 10.15803/IJNC.9.1_2

Keywords: Task (computing), Replication (computing), Workflow, Replicate, Improved performance, Distributed computing, Quadratic complexity, Interrupt, Dynamic programming, Computer science

Abstract: Large-scale platforms currently experience errors from two different sources, namely fail-stop errors (which interrupt the execution) and silent errors (which strike unnoticed and corrupt data). This work combines checkpointing and replication for the reliable execution of linear workflows on platforms subject to these two error types. While checkpointing and replication have been studied separately, their combination has not yet been investigated despite its promising potential to minimize the execution time of linear workflows in error-prone environments. Moreover, combined checkpointing and replication has not yet been studied in the presence of both fail-stop and silent errors. The combination raises new problems: for each task, we have to decide whether to checkpoint and/or replicate it to ensure its reliable execution. We provide an optimal dynamic programming algorithm of quadratic complexity to solve these problems. The algorithm has been validated through extensive simulations that reveal the conditions in which checkpointing only, replication only, or the combination of both techniques leads to improved performance.
