Authors: Aiman Fang, Andrew A. Chien
Keywords: Stencil, Programming complexity, Distributed computing, Exploit, Latency (engineering), Computation, Generality, Scalability, Computer science, Memory errors
Abstract: Exascale systems face high error rates due to increasing scale (10^9 cores), software complexity, and rising memory error rates. Increasingly, errors escape immediate hardware-level detection, silently corrupting application state. Such latent errors can often be detected by application-level tests, but typically only at long latencies. We propose a new approach called application-based focused recovery (ABFR), which defines the knowledge needed for efficient recovery. This allows applications to pursue strategies that exploit a range of semantics within a well-defined resilience framework. The ABFR runtime then exploits this knowledge to achieve error tolerance. ABFR enables designers to express resilience without concern for the underlying architectures and systems. Together, these properties support flexible resilience. To demonstrate its generality, we apply ABFR to three varied scientific computations (stencil, N-Body tree, Monte Carlo). We measure performance across error rates; results indicate significant reductions in cost (up to 367x) and latency (up to 24x). ABFR also achieves scalable performance across error rates for these computations.