VOCL-FT: introducing techniques for efficient soft error coprocessor recovery

Authors: Antonio J. Peña, Wesley Bland, Pavan Balaji

DOI: 10.1145/2807591.2807640

Keywords: Programming paradigm, Computer science, Fault tolerance, Embedded system, Test case, Coprocessor, Soft error, Leverage (statistics)

Abstract: Popular accelerator programming models rely on offloading computation operations and their corresponding data transfers to the coprocessors, leveraging synchronization points where needed. In this paper we identify and explore how such a programming model enables optimization opportunities not utilized in traditional checkpoint/restart systems, and we analyze them as building blocks for an efficient fault-tolerant system for accelerators. Although we leverage our techniques to protect from detected but uncorrected ECC errors in the device memory of OpenCL-accelerated applications, coprocessor reliability solutions based on different error detectors and similar API semantics can directly adopt the techniques we propose. Adding error detection and protection involves a tradeoff between runtime overhead and recovery time. Although optimal configurations depend on the particular application, the length of the run, the error rate, and the temporary storage speed, our test cases reveal a good balance with significantly reduced runtime overheads.
