Authors: Anne Benoit, Aurélien Cavelan, Florina M. Ciorba, Valentin Le Fèvre, Yves Robert
DOI: 10.15803/IJNC.9.1_2
Keywords: Task (computing), Replication (computing), Workflow, Replicate, Improved performance, Distributed computing, Quadratic complexity, Interrupt, Dynamic programming, Computer science
Abstract: Large-scale platforms currently experience errors from two different sources, namely fail-stop errors (which interrupt the execution) and silent errors (which strike unnoticed and corrupt data). This work combines checkpointing and replication for the reliable execution of linear workflows on platforms subject to these two error types. While checkpointing and replication have been studied separately, their combination has not yet been investigated despite its promising potential to minimize the execution time of linear workflows in error-prone environments. Moreover, combined checkpointing and replication has not yet been studied in the presence of both fail-stop and silent errors. The combination raises new problems: for each task, we have to decide whether to checkpoint and/or replicate it to ensure its reliable execution. We provide an optimal dynamic programming algorithm of quadratic complexity to solve these problems. The algorithm has been validated through extensive simulations that reveal the conditions in which checkpointing only, replication only, or the combination of both techniques, lead to improved performance.