Authors: Aurelien Cavelan, Saurabh K. Raina, Yves Robert, Hongyang Sun
DOI: 10.1109/ICPP.2015.53
Keywords:
Abstract: Silent errors, or silent data corruptions, constitute a major threat to very large-scale platforms. When a silent error strikes, it is not detected immediately but only after some delay, which prevents the use of pure periodic checkpointing approaches devised for fail-stop errors. Instead, checkpointing must be coupled with some verification mechanism to guarantee that corrupted data will never be written into the checkpoint file. Such a guaranteed verification mechanism typically incurs a high cost. In this paper, we assess the impact of using partial verification mechanisms in addition to a guaranteed verification. The main objective is to investigate to what extent it is worthwhile to use some light-cost but less accurate verifications in the middle of a periodic computing pattern, which ends with a guaranteed verification right before each checkpoint. Introducing partial verifications dramatically complicates the analysis, but we are able to analytically determine the optimal computing pattern (up to the first-order approximation), including the optimal length of the pattern, the optimal number of partial verifications, as well as their optimal positions inside the pattern. Performance evaluations based on a wide range of parameters confirm the benefit of partial verifications under certain scenarios, when compared to the baseline algorithm that uses only guaranteed verifications.