Authors: Jack Wadden, Alexander Lyashevsky, Sudhanva Gurumurthi, Vilas Sridharan, Kevin Skadron
Keywords: Computer science, Supercomputer, Compiler, Thread (computing), Multithreading, Fault coverage, General-purpose computing on graphics processing units, Software, Parallel computing
Abstract: Reliability for general purpose processing on the GPU (GPGPU) is becoming a weak link in the construction of reliable supercomputer systems. Because hardware protection is expensive to develop, requires dedicated on-chip resources, and is not portable across different architectures, the efficiency of software solutions such as redundant multithreading (RMT) must be explored. This paper presents a real-world design and evaluation of automatic software RMT on GPU hardware. We first describe a compiler pass that automatically converts GPGPU kernels into redundantly threaded versions. We then perform detailed power and performance evaluations of three RMT algorithms, each of which provides fault coverage to a set of structures in the GPU. Using real hardware, we show that compiler-managed software RMT has highly variable costs. We further analyze the individual costs of redundant work scheduling, redundant computation, and inter-thread communication, showing that no single component is responsible for high overheads across all applications; instead, certain workload properties tend to cause RMT to perform well or poorly. Finally, we demonstrate the benefit of architectural support for RMT with a specific example of fast, register-level thread communication.