An Examination of the Impact of Failure Distribution on Coordinated Checkpoint/Restart

Keywords: Fault tolerance, Context (language use), Reliability engineering, Exponential distribution, Probability distribution, Process (engineering), Distributed computing, Engineering, Interval (mathematics), Standard deviation, Stochastic process

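Abstract: Fault tolerance is a key challenge to building the first exascale system. To understand the potential impacts of failures on next-generation systems, significant effort has been devoted to collecting, characterizing and analyzing failures on current systems. These studies require large volumes of data and complex analysis. Because the occurrence of failures in large-scale systems is unpredictable, failures are commonly modeled as a stochastic process. Failure data from current systems is examined in an attempt to identify the underlying probability distribution and its statistical properties. In this paper, we use modeling to examine the impact of failure distributions on the time-to-solution and the optimal checkpoint interval of applications that use coordinated checkpoint/restart. Using this approach, we show that as failures become more frequent, the failure distribution has a larger influence on application performance. We also show that as failure times are less tightly grouped (i.e., as the standard deviation increases), the underlying probability distribution has a greater impact on application performance. Finally, we show that computing the checkpoint interval based on the assumption that failures are exponentially distributed has a modest impact on application performance even when failure times are drawn from a different distribution. Our work provides critical analysis and guidance to the process of analyzing failure data in the context of coordinated checkpoint/restart. Specifically, the data presented in this paper helps to distinguish cases where the failure distribution has a strong influence on application performance from those cases where the failure distribution has relatively little impact.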
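The kind of comparison the abstract describes can be illustrated with a small simulation. The sketch below is not the paper's model: it is a minimal, assumed setup in which the checkpoint interval is computed with Young's first-order approximation (interval = sqrt(2 x checkpoint cost x MTBF), which presumes exponentially distributed failures) and time-to-solution is then measured under exponential and Weibull failure inter-arrival times with the same mean. The function names, the Weibull shape of 0.7, the costs, and the assumption that restarts are failure-free are all illustrative choices, not values taken from the paper.

```python
import math
import random


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def exponential(mtbf):
    """Failure inter-arrival sampler: exponential with mean `mtbf`."""
    return lambda rng: rng.expovariate(1.0 / mtbf)


def weibull(mtbf, shape):
    """Failure inter-arrival sampler: Weibull rescaled so its mean is `mtbf`."""
    scale = mtbf / math.gamma(1.0 + 1.0 / shape)
    return lambda rng: rng.weibullvariate(scale, shape)


def simulate(work, interval, checkpoint_cost, restart_cost, draw_failure, rng):
    """Wall-clock time needed to complete `work` seconds of computation.

    Simplifying assumptions (hypothetical, for illustration only):
    a failure loses all progress since the last completed checkpoint,
    and the restart itself cannot fail.
    """
    elapsed = 0.0
    done = 0.0
    next_failure = draw_failure(rng)
    while done < work:
        segment = min(interval, work - done)
        need = segment + checkpoint_cost
        if elapsed + need <= next_failure:
            # Segment and its checkpoint finish before the next failure.
            elapsed += need
            done += segment
        else:
            # Failure strikes: roll back, pay the restart, draw a new failure.
            elapsed = next_failure + restart_cost
            next_failure = elapsed + draw_failure(rng)
    return elapsed


def mean_time_to_solution(draw_failure, trials=200, seed=0):
    """Average simulated time-to-solution over `trials` runs (assumed parameters)."""
    rng = random.Random(seed)
    mtbf, cost, restart, work = 3600.0, 60.0, 30.0, 20000.0
    interval = young_interval(cost, mtbf)  # interval assumes exponential failures
    total = sum(simulate(work, interval, cost, restart, draw_failure, rng)
                for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    print("exponential:", mean_time_to_solution(exponential(3600.0)))
    print("weibull k=0.7:", mean_time_to_solution(weibull(3600.0, 0.7)))
```

Varying the assumed MTBF or the Weibull shape in this sketch gives a rough feel for how failure frequency and spread change the gap between the two distributions; it makes no claim about the magnitudes reported in the paper.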
