Authors: Scott Levy, Kurt B. Ferreira
Keywords: Fault tolerance, Context (language use), Reliability engineering, Exponential distribution, Probability distribution, Process (engineering), Distributed computing, Engineering, Interval (mathematics), Standard deviation, Stochastic process
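Abstract: Fault tolerance is a key challenge to building the first exascale system. To understand the potential impacts of failures on next-generation systems, significant effort has been devoted to collecting, characterizing and analyzing failures on current systems. These studies require large volumes of data and complex analysis. Because the occurrence of failures in large-scale systems is unpredictable, failures are commonly modeled as a stochastic process. Failure data from current systems is examined in an attempt to identify the underlying probability distribution and its statistical properties. In this paper, we use modeling to examine the impact of failure distributions on the time-to-solution and the optimal checkpoint interval of applications that use coordinated checkpoint/restart. Using this approach, we show that as failures become more frequent, the failure distribution has a larger influence on application performance. We also show that as failure times are less tightly grouped (i.e., as the standard deviation increases), the underlying probability distribution has a greater impact on application performance. Finally, we show that computing the checkpoint interval based on the assumption that failures are exponentially distributed has a modest impact on application performance even when failure times are drawn from a different distribution. Our work provides critical analysis and guidance to the process of analyzing failure data in the context of coordinated checkpoint/restart. Specifically, the data presented in this paper helps to distinguish cases where the failure distribution has a strong influence on application performance from those where it has relatively little impact.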