Authors: Antonio J. Peña, Wesley Bland, Pavan Balaji
Keywords: Programming paradigm, Computer science, Fault tolerance, Embedded system, Test case, Coprocessor, Soft error, Leverage (statistics)
Abstract: Popular accelerator programming models rely on offloading computation operations and their corresponding data transfers to the coprocessors, leveraging synchronization points where needed. In this paper we identify and explore how such a model enables optimization opportunities not utilized in traditional checkpoint/restart systems, and we analyze them as building blocks for an efficient fault-tolerant system for accelerators. Although we leverage our techniques to protect from detected but uncorrected ECC errors in the device memory of OpenCL-accelerated applications, coprocessor reliability solutions based on different error detectors and similar API semantics can directly adopt the techniques we propose. Adding error detection and protection involves a tradeoff between runtime overhead and recovery time. Although optimal configurations depend on the particular application, the length of the run, the error rate, and the temporary storage speed, our test cases reveal a good balance with significantly reduced runtime overheads.