GPGPUs: how to combine high computational power with high reliability

Authors: N. DeBardeleben, S. Gurumurthi, M. Sonza Reorda, F. Cappello, P. Rech

DOI: 10.5555/2616606.2617090

Keywords: Power (physics); Reliability (statistics); Fault injection; CUDA; Embedded system; Computer science; Distributed computing; General-purpose computing on graphics processing units; Fault tolerance

Abstract: GPGPUs are used increasingly in several domains, from gaming to different kinds of computationally intensive applications. In many applications GPGPU reliability is becoming a serious issue, and research activities are focusing on its evaluation. This paper offers an overview of some major results in the area. First, it shows and analyzes experiments assessing GPGPU reliability in HPC datacenters. Second, it provides recent results derived from radiation experiments on GPGPUs. Third, it describes the characteristics of an advanced fault-injection environment, allowing effective evaluation of the resiliency of running …

References (39)
Wen-mei W. Hwu, David B. Kirk, Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann (2012)
D.T. Stott, B. Floering, D. Burke, Z. Kalbarczyk, R.K. Iyer, NFTAPE: a framework for assessing dependability in distributed systems with lightweight fault injectors. Proceedings IEEE International Computer Performance and Dependability Symposium (IPDS 2000), pp. 91-100 (2000), 10.1109/IPDS.2000.839467
Mark D. Lerner, Algorithm Based Fault Tolerance in Massively Parallel Systems. Department of Computer Science, Columbia University (1988), 10.7916/D88P67MN
Thomas Y. Yeh, Glenn Reinman, Sanjay J. Patel, Petros Faloutsos, Fool me twice. ACM Transactions on Graphics, vol. 29, pp. 1-11 (2009), 10.1145/1640443.1640448
Hans-Joachim Wunderlich, Claus Braun, Sebastian Halder, Efficacy and efficiency of algorithm-based fault-tolerance on GPUs. International On-Line Testing Symposium, pp. 240-243 (2013), 10.1109/IOLTS.2013.6604090
P. Rech, C. Aguiar, C. Frost, L. Carro, An Efficient and Experimentally Tuned Software-Based Hardening Strategy for Matrix Multiplication on GPUs. IEEE Transactions on Nuclear Science, vol. 60, pp. 2797-2804 (2013), 10.1109/TNS.2013.2252625
Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, Tor M. Aamodt, Analyzing CUDA workloads using a detailed GPU simulator. International Symposium on Performance Analysis of Systems and Software, pp. 163-174 (2009), 10.1109/ISPASS.2009.4919648
Jeffrey S. Vetter, Weikuan Yu, Dong Li, Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool. IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 1-11 (2012), 10.5555/2388996.2389074
P. Rech, C. Aguiar, R. Ferreira, C. Frost, L. Carro, Neutron radiation test of graphic processing units. 2012 IEEE 18th International On-Line Testing Symposium (IOLTS), pp. 55-60 (2012), 10.1109/IOLTS.2012.6313841
Bo Fang, Karthik Pattabiraman, Matei Ripeanu, Sudhanva Gurumurthi, GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications. International Symposium on Performance Analysis of Systems and Software, pp. 221-230 (2014), 10.1109/ISPASS.2014.6844486