Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

Authors: Anne Benoit, Aurélien Cavelan, Yves Robert, Hongyang Sun

DOI: 10.1007/978-3-319-17248-4_11

Keywords: Job shop scheduling, Workflow, Energy consumption, Statistics, Computer science, Rollback recovery, General purpose, Distributed computing, Graph (abstract data type)

Abstract: In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to address both fail-stop and silent errors. The objective is to minimize either makespan or energy consumption. While DVFS is a popular approach for reducing energy consumption, using lower speeds/voltages can increase the number of errors, thereby complicating the problem. We consider an application workflow whose dependence graph is a chain of tasks, and we study three execution scenarios: (i) a single speed is used during the whole execution; (ii) a second, possibly higher speed is used for any potential re-execution; (iii) different pairs of speeds can be used throughout the execution. For each scenario, we determine the optimal checkpointing and verification locations (and the optimal speeds for the third scenario) to minimize either objective. The different execution scenarios are then assessed and compared through an extensive set of experiments.
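
To make the idea of "optimal checkpointing locations on a chain of tasks" concrete, the following is a minimal sketch of a dynamic program for a much simpler setting than the one studied in the paper. It is not the authors' algorithm: it assumes fail-stop errors only (no silent errors, no verifications), a single fixed speed, exponentially distributed failures with rate `lam`, and the classical closed-form expected time for a checkpointed segment. The function names and the parameters `ckpt`, `recovery`, and `downtime` are hypothetical, chosen for illustration.

```python
import math
from itertools import accumulate


def expected_segment_time(work, ckpt, recovery, downtime, lam):
    """Expected time to run `work` followed by a checkpoint of cost `ckpt`,
    under exponential fail-stop errors of rate `lam` (assumed model):
    exp(lam * R) * (1 / lam + D) * (exp(lam * (W + C)) - 1)."""
    return math.exp(lam * recovery) * (1.0 / lam + downtime) * math.expm1(lam * (work + ckpt))


def place_checkpoints(weights, ckpt, recovery, downtime, lam):
    """Choose after which tasks of a linear chain to checkpoint so that the
    expected makespan is minimized. A checkpoint is always taken after the
    last task (an assumption of this sketch).

    Returns (expected_makespan, positions), where positions are 1-based
    task indices followed by a checkpoint."""
    n = len(weights)
    prefix = [0.0] + list(accumulate(weights))  # prefix[j] = total work of tasks 1..j
    best = [0.0] + [math.inf] * n               # best[j]: optimum for tasks 1..j, checkpoint after j
    prev = [0] * (n + 1)                        # prev[j]: position of the previous checkpoint

    for j in range(1, n + 1):
        for i in range(j):
            # Last segment covers tasks i+1..j and ends with a checkpoint.
            segment = expected_segment_time(prefix[j] - prefix[i], ckpt, recovery, downtime, lam)
            cost = best[i] + segment
            if cost < best[j]:
                best[j], prev[j] = cost, i

    # Walk back through the recorded predecessors to recover the positions.
    positions = []
    j = n
    while j > 0:
        positions.append(j)
        j = prev[j]
    positions.reverse()
    return best[n], positions
```

Under these assumptions the program runs in O(n^2) time for a chain of n tasks. With free checkpoints and a high error rate it checkpoints after every task, while with expensive checkpoints and rare errors it keeps only the final one. Handling silent errors, verification placement, or speed pairs as in the three scenarios above would require a richer state than this sketch carries.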
