Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart

Authors: Ziming Zheng, Li Yu, Zhiling Lan

DOI: 10.1109/TC.2014.2317182

Abstract: Speedup models are powerful analytical tools for evaluating and predicting the performance of parallel applications. Unfortunately, well-known speedup models like Amdahl's law and Gustafson's law do not take reliability into consideration and therefore cannot accurately account for application performance in the presence of failures. In this study, we enhance Amdahl's law and Gustafson's law by considering the impact of failures and the effect of coordinated checkpointing/restart. Unlike existing studies relying on the Exponential failure distribution alone, in this work we consider both Exponential and Weibull failure distributions in the construction of our reliability-aware speedup models. The derived models are validated through trace-based simulations under a variety of parameter settings. Our simulations demonstrate that these models can effectively quantify the impact of failures on application speedup. Moreover, we present two case studies to illustrate the use of these models.
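For orientation, the two classical laws named in the abstract can be written in a few lines, together with a rough illustration of how failures and checkpointing erode speedup. The sketch below is not the paper's reliability-aware model: the failure adjustment uses Young's well-known first-order approximation for periodic checkpointing under Exponential failures, and the function names, the `node_mtbf / n` scaling of system MTBF, and the omission of restart cost are all assumptions made for illustration.

```python
import math


def amdahl_speedup(parallel_fraction, n):
    """Amdahl's law: fixed problem size on n processors."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


def gustafson_speedup(parallel_fraction, n):
    """Gustafson's law: problem size scales with n processors."""
    return (1.0 - parallel_fraction) + parallel_fraction * n


def first_order_waste(checkpoint_cost, system_mtbf):
    """Fraction of time lost to checkpoints and rework (illustrative).

    Uses Young's first-order approximation: checkpoint interval
    tau = sqrt(2 * C * M), waste ~= C / tau + tau / (2 * M).
    Restart cost is ignored; the result is capped at 1.0.
    """
    tau = math.sqrt(2.0 * checkpoint_cost * system_mtbf)
    waste = checkpoint_cost / tau + tau / (2.0 * system_mtbf)
    return min(waste, 1.0)


def failure_adjusted_speedup(base_speedup, n, node_mtbf, checkpoint_cost):
    """Scale a failure-free speedup by the useful-work fraction.

    Assumption: independent node failures, so system MTBF = node_mtbf / n.
    """
    system_mtbf = node_mtbf / n
    return base_speedup * (1.0 - first_order_waste(checkpoint_cost, system_mtbf))


if __name__ == "__main__":
    p = 0.99  # hypothetical parallel fraction
    for n in (64, 1024, 16384):
        ideal = amdahl_speedup(p, n)
        adjusted = failure_adjusted_speedup(
            ideal, n, node_mtbf=5 * 365 * 24.0, checkpoint_cost=0.1  # hours
        )
        print(n, round(ideal, 2), round(adjusted, 2))
```

Because the system MTBF shrinks as `n` grows while the checkpoint cost stays fixed, the adjusted speedup falls further behind the failure-free value at larger scales, which is the qualitative effect the abstract says the classical laws miss.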

References (52)
Patricia J. Teller, Sarala Arunagiri, Seetharami Seelam, Ron A. Oldfield, John T. Daly, Rolf Riesen, Maria Ruiz Varela. Opportunistic Checkpoint Intervals to Improve System Performance. (2008).
P. Lemarinier, A. Bouteiller, T. Herault, G. Krawezik, F. Cappello. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI. International Conference on Cluster Computing, pp. 115-124 (2004). DOI: 10.1109/CLUSTR.2004.1392609
B. Schroeder, G.A. Gibson. A large-scale study of failures in high-performance computing systems. Dependable Systems and Networks, pp. 249-258 (2006). DOI: 10.1109/DSN.2006.5
J.S. Plank, W.R. Elwasif. Experimental assessment of workstation failures and their impact on checkpointing systems. IEEE International Symposium on Fault-Tolerant Computing, pp. 48-57 (1998). DOI: 10.1109/FTCS.1998.689454
John L. Gustafson. Reevaluating Amdahl's law. Communications of the ACM, vol. 31, pp. 532-533 (1988). DOI: 10.1145/42411.42415
Adam Moody, Greg Bronevetsky, Kathryn Mohror, Bronis R. de Supinski. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System. IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 1-11 (2010). DOI: 10.1109/SC.2010.18
Thomas Ropars, Tatiana V. Martsinkevich, Amina Guermouche, André Schiper, Franck Cappello. SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing. IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 8- (2013). DOI: 10.1145/2503210.2503271
Sachin Garg, Yennun Huang, Chandra Kintala, Kishor S. Trivedi. Minimizing completion time of a program by checkpointing and rejuvenation. Measurement and Modeling of Computer Systems, vol. 24, pp. 252-261 (1996). DOI: 10.1145/233008.233050
Guohong Cao, Mukesh Singhal. Checkpointing with mutable checkpoints. Theoretical Computer Science, vol. 290, pp. 1127-1148 (2003). DOI: 10.1016/S0304-3975(02)00566-2