Real-world design and evaluation of compiler-managed GPU redundant multithreading

Authors: Jack Wadden, Alexander Lyashevsky, Sudhanva Gurumurthi, Vilas Sridharan, Kevin Skadron

DOI: 10.1145/2678373.2665686

Keywords: Computer science, Supercomputer, Compiler, Thread (computing), Multithreading, Fault coverage, General-purpose computing on graphics processing units, Software, Parallel computing

Abstract: Reliability for general purpose processing on the GPU (GPGPU) is becoming a weak link in the construction of reliable supercomputer systems. Because hardware protection is expensive to develop, requires dedicated on-chip resources, and is not portable across different architectures, the efficiency of software solutions such as redundant multithreading (RMT) must be explored. This paper presents a real-world design and evaluation of automatic software RMT on GPU hardware. We first describe a compiler pass that automatically converts GPGPU kernels into redundantly threaded versions. We then perform detailed power and performance evaluations of three RMT algorithms, each of which provides fault coverage to a set of structures in the GPU. Using real hardware, we show that compiler-managed software RMT has highly variable costs. We further analyze the individual costs of redundant work scheduling, redundant computation, and inter-thread communication, showing that no single component is responsible for high overheads across all applications; instead, certain workload properties tend to cause RMT to perform well or poorly. Finally, we demonstrate the benefit of architectural support for RMT with a specific example of fast, register-level thread communication.
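
The idea of a redundantly threaded kernel can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not the paper's compiler pass or any of its three RMT algorithms: the names `saxpy_item` and `run_rmt`, the pairing of adjacent work-items, and the dictionary standing in for the inter-thread communication channel are all hypothetical. Each logical work-item is executed twice, and a store is committed only if both copies agree on the address and the value.

```python
def saxpy_item(logical_id, a, x, y):
    """Original per-work-item computation: returns (address, value) to store."""
    return logical_id, a * x[logical_id] + y[logical_id]


def run_rmt(kernel, n, args, fault=None):
    """Simulate 2*n work-items; each adjacent pair shares one logical ID.

    fault: optional (physical_id, delta) that corrupts one work-item's value,
    standing in for a transient hardware error (an assumption of this sketch).
    Returns (output buffer, list of logical IDs where a mismatch was detected).
    """
    out = [None] * n
    detected = []
    comm = {}  # hypothetical stand-in for the inter-thread communication buffer
    for physical_id in range(2 * n):
        logical_id = physical_id // 2  # remap: pair (2k, 2k+1) -> logical ID k
        addr, value = kernel(logical_id, *args)
        if fault is not None and fault[0] == physical_id:
            value += fault[1]  # inject the simulated error
        if physical_id % 2 == 0:
            comm[logical_id] = (addr, value)  # producer publishes its result
        elif comm.pop(logical_id) == (addr, value):
            out[addr] = value  # consumer commits the store only on agreement
        else:
            detected.append(logical_id)  # mismatch: suppress the store
    return out, detected


if __name__ == "__main__":
    x, y = [1, 2, 3], [10, 20, 30]
    print(run_rmt(saxpy_item, 3, (2, x, y)))
    print(run_rmt(saxpy_item, 3, (2, x, y), fault=(2, 1)))
```

The sketch makes the three cost components named in the abstract visible: twice as many work-items are scheduled, the computation runs twice, and every store requires one exchange between the paired work-items.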

References (25)
Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt. Cache-Conscious Wavefront Scheduling. International Symposium on Microarchitecture, pp. 72-83 (2012). DOI: 10.1109/MICRO.2012.16
Yun Zhang, Soumyadeep Ghosh, Jialu Huang, Jae W. Lee, Scott A. Mahlke, David I. August. Runtime asynchronous fault tolerance via speculation. Symposium on Code Generation and Optimization, pp. 145-154 (2012). DOI: 10.1145/2259016.2259035
Steven K. Reinhardt, Shubhendu S. Mukherjee. Transient fault detection via simultaneous multithreading. International Symposium on Computer Architecture, vol. 28, pp. 25-36 (2000). DOI: 10.1145/339647.339652
Jeremy W. Sheaffer, David P. Luebke, Kevin Skadron. The visual vulnerability spectrum: characterizing architectural vulnerability for graphics hardware. International Conference on Computer Graphics and Interactive Techniques, pp. 9-16 (2006). DOI: 10.1145/1283900.1283902
T. N. Vijaykumar, Irith Pomeranz, Karl Cheng. Transient-fault recovery using simultaneous multithreading. ACM SIGARCH Computer Architecture News, vol. 30, pp. 87-98 (2002). DOI: 10.1145/545214.545226
Shubhendu S. Mukherjee, Michael Kontz, Steven K. Reinhardt. Detailed design and evaluation of redundant multithreading alternatives. ACM SIGARCH Computer Architecture News, vol. 30, pp. 99-110 (2002). DOI: 10.1145/545214.545227
Cheng Wang, Ho-seop Kim, Youfeng Wu, Victor Ying. Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection. Symposium on Code Generation and Optimization, pp. 244-258 (2007). DOI: 10.1109/CGO.2007.7
Keun Soo Yim, Cuong Pham, Mushfiq Saleheen, Zbigniew Kalbarczyk, Ravishankar Iyer. Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU. International Parallel and Distributed Processing Symposium, pp. 287-300 (2011). DOI: 10.1109/IPDPS.2011.36
M.K. Qureshi, O. Mutlu, Y.N. Patt. Microarchitecture-based introspection: a technique for transient-fault tolerance in microprocessors. Dependable Systems and Networks, pp. 434-443 (2005). DOI: 10.1109/DSN.2005.62
N. Oh, P.P. Shirvani, E.J. McCluskey. Error detection by duplicated instructions in super-scalar processors. IEEE Transactions on Reliability, vol. 51, pp. 63-75 (2002). DOI: 10.1109/24.994913