An Examination of the Impact of Failure Distribution on Coordinated Checkpoint/Restart

Authors: Scott Levy, Kurt B. Ferreira

DOI: 10.1145/2909428.2909430

Keywords: Fault tolerance, Context (language use), Reliability engineering, Exponential distribution, Probability distribution, Process (engineering), Distributed computing, Engineering, Interval (mathematics), Standard deviation, Stochastic process

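Abstract: Fault tolerance is a key challenge to building the first exascale system. To understand the potential impacts of failures on next-generation systems, significant effort has been devoted to collecting, characterizing and analyzing failures on current systems. These studies require large volumes of data and complex analysis. Because the occurrence of failures in large-scale systems is unpredictable, failures are commonly modeled as a stochastic process. Failure data from current systems is examined in an attempt to identify the underlying probability distribution and its statistical properties. In this paper, we use modeling to examine the impact of failure distributions on the time-to-solution and the optimal checkpoint interval of applications that use coordinated checkpoint/restart. Using this approach, we show that as failures become more frequent, the failure distribution has a larger influence on application performance. We also show that as failure times are less tightly grouped (i.e., as the standard deviation increases), the underlying probability distribution has a greater impact on application performance. Finally, we show that computing the checkpoint interval based on the assumption that failures are exponentially distributed has a modest impact on application performance even when failure times are drawn from a different probability distribution. Our work provides critical analysis and guidance to the process of analyzing failure data in the context of coordinated checkpoint/restart. Specifically, the data presented in this paper helps to distinguish cases where the failure distribution has a strong influence on application performance from those cases where the failure distribution has relatively little impact.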
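The abstract's central comparison (a checkpoint interval chosen under an exponential-failure assumption, then applied when failures follow another distribution) can be illustrated with a small sketch. This is not the authors' model. It assumes Young's first-order approximation (listed in the references) for the interval, and a deliberately simplified simulation in which the failure clock restarts after each recovery and no failures occur during restart. All function names, parameters, and numeric values are hypothetical.

```python
import math
import random


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation: sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def exponential_sampler(mtbf):
    """Return a function that draws exponential failure times with mean `mtbf`."""
    return lambda rng: rng.expovariate(1.0 / mtbf)


def weibull_sampler(mtbf, shape):
    """Return a function that draws Weibull failure times with mean `mtbf`."""
    # Choose the scale so that the mean, scale * Gamma(1 + 1/shape), equals mtbf.
    scale = mtbf / math.gamma(1.0 + 1.0 / shape)
    return lambda rng: rng.weibullvariate(scale, shape)


def simulate_time_to_solution(work, interval, checkpoint_cost, restart_cost,
                              next_failure, rng):
    """Wall-clock time to complete `work` seconds of computation.

    Simplifying assumptions (not from the paper): a checkpoint follows every
    segment, a failure discards the current segment, the failure clock is
    redrawn after each recovery, and restarts themselves never fail.
    """
    elapsed = 0.0
    done = 0.0
    time_to_failure = next_failure(rng)
    while done < work:
        segment = min(interval, work - done)
        needed = segment + checkpoint_cost
        if time_to_failure >= needed:
            # Segment and its checkpoint complete before the next failure.
            elapsed += needed
            time_to_failure -= needed
            done += segment
        else:
            # Failure: lose the partial segment, pay the restart cost.
            elapsed += time_to_failure + restart_cost
            time_to_failure = next_failure(rng)
    return elapsed


if __name__ == "__main__":
    # Hypothetical parameters, in seconds.
    mtbf, ckpt, restart, work = 10_000.0, 50.0, 100.0, 200_000.0
    tau = young_interval(ckpt, mtbf)
    for name, sampler in [("exponential", exponential_sampler(mtbf)),
                          ("weibull(k=0.7)", weibull_sampler(mtbf, 0.7))]:
        rng = random.Random(0)
        runs = [simulate_time_to_solution(work, tau, ckpt, restart, sampler, rng)
                for _ in range(200)]
        print(name, sum(runs) / len(runs))
```

Both samplers are normalized to the same mean time between failures, so any difference in the printed averages reflects the shape of the distribution rather than the failure rate.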

References (15)
B. Schroeder, G.A. Gibson. A large-scale study of failures in high-performance computing systems. Dependable Systems and Networks, pp. 249-258 (2006). DOI: 10.1109/DSN.2006.5
George Bosilca, Aurélien Bouteiller, Elisabeth Brunet, Franck Cappello, Jack Dongarra, Amina Guermouche, Thomas Herault, Yves Robert, Frédéric Vivien, Dounia Zaidouni. Unified model for assessing checkpointing protocols at extreme-scale. Concurrency and Computation: Practice and Experience, vol. 26, pp. 2772-2791 (2014). DOI: 10.1002/CPE.3173
Mohamed-Slim Bouguerra, Thierry Gautier, Denis Trystram, Jean-Marc Vincent. A flexible checkpoint/restart model in distributed systems. Parallel Processing and Applied Mathematics, pp. 206-215 (2009). DOI: 10.1007/978-3-642-14390-8_22
Vilas Sridharan, Nathan DeBardeleben, Sean Blanchard, Kurt B. Ferreira, Jon Stearley, John Shalf, Sudhanva Gurumurthi. Memory Errors in Modern Systems: The Good, The Bad, and The Ugly. Architectural Support for Programming Languages and Operating Systems, vol. 50, pp. 297-310 (2015). DOI: 10.1145/2694344.2694348
Henri Casanova, Yves Robert, Frédéric Vivien, Dounia Zaidouni. On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing. Future Generation Computer Systems, vol. 51, pp. 7-19 (2015). DOI: 10.1016/J.FUTURE.2015.04.003
Thomas J. Hacker, Fabian Romero, Christopher D. Carothers. An analysis of clustered failures on large supercomputing systems. Journal of Parallel and Distributed Computing, vol. 69, pp. 652-665 (2009). DOI: 10.1016/J.JPDC.2009.03.007
John W. Young. A first order approximation to the optimum checkpoint interval. Communications of the ACM, vol. 17, pp. 530-531 (1974). DOI: 10.1145/361147.361115
Vilas Sridharan, Dean Liberty. A study of DRAM failures in the field. IEEE International Conference on High Performance Computing Data and Analytics, pp. 1-11 (2012). DOI: 10.5555/2388996.2389100
Marin Bougeret, Henri Casanova, Mikael Rabie, Yves Robert, Frédéric Vivien. Checkpointing strategies for parallel jobs. IEEE International Conference on High Performance Computing Data and Analytics, pp. 33- (2011). DOI: 10.1145/2063384.2063428
Ziming Zheng, Li Yu, Zhiling Lan. Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart. IEEE Transactions on Computers, vol. 64, pp. 1402-1415 (2015). DOI: 10.1109/TC.2014.2317182