Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility

Authors: Devesh Tiwari, Saurabh Gupta, George Gallarno, Jim Rogers, Don Maxwell

DOI: 10.1145/2807591.2807666

Abstract: … Counting only one DBE error per card addresses the previously mentioned issues, and … it is that a DBE is followed by an ECC page retirement error. Note that the DBE occurrences …

References (30)
Saurabh Gupta, Devesh Tiwari, Christopher Jantzi, James Rogers, Don Maxwell. Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems. 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 37-44 (2015). DOI: 10.1109/DSN.2015.52
Nosayba El-Sayed, Bianca Schroeder. Reading between the lines of failure logs: Understanding how HPC systems fail. Dependable Systems and Networks, pp. 1-12 (2013). DOI: 10.1109/DSN.2013.6575356
Devesh Tiwari, Saurabh Gupta, James Rogers, Don Maxwell, Paolo Rech, Sudharshan Vazhkudai, Daniel Oliveira, Dave Londo, Nathan DeBardeleben, Philippe Navaux, Luigi Carro, Arthur Bland. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. High-Performance Computer Architecture, pp. 331-342 (2015). DOI: 10.1109/HPCA.2015.7056044
Ana Gainaru, Franck Cappello, Joshi Fullop, Stefan Trausan-Matu, William Kramer. Adaptive event prediction strategy with dynamic time window for large-scale HPC systems. Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques, pp. 4- (2011). DOI: 10.1145/2038633.2038637
Qingrui Liu, Changhee Jung, Dongyoon Lee, Devesh Tiwari. Clover: Compiler Directed Lightweight Soft Error Resilience. Languages, Compilers, and Tools for Embedded Systems, vol. 50, pp. 2- (2015). DOI: 10.1145/2670529.2754959
Catello Di Martino, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Fabio Baccanico, Joseph Fullop, William Kramer. Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters. Dependable Systems and Networks, pp. 610-621 (2014). DOI: 10.1109/DSN.2014.62
Bo Fang, Karthik Pattabiraman, Matei Ripeanu, Sudhanva Gurumurthi. GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications. International Symposium on Performance Analysis of Systems and Software, pp. 221-230 (2014). DOI: 10.1109/ISPASS.2014.6844486
Jingweijia Tan, Nilanjan Goswami, Tao Li, Xin Fu. Analyzing soft-error vulnerability on GPGPU microarchitecture. IEEE International Symposium on Workload Characterization, pp. 226-235 (2011). DOI: 10.1109/IISWC.2011.6114182
Devesh Tiwari, Saurabh Gupta, Sudharshan S. Vazhkudai. Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems. Dependable Systems and Networks, pp. 25-36 (2014). DOI: 10.1109/DSN.2014.101
Bo Fang, Jiesheng Wei, Karthik Pattabiraman, Matei Ripeanu. Poster: Evaluating Error Resiliency of GPGPU Applications. IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 1504-1504 (2012). DOI: 10.1109/SC.COMPANION.2012.289