Architecture-Aware Optimization on a 1600-core Graphics Processor

Authors: Thomas R. W. Scogland, Mayank Daga, Wu-chun Feng

DOI:

Keywords: Parallel computing, TOP500, Graphics, General-purpose computing on graphics processing units, CUDA, Node (networking), Computer science, Computer cluster, Graphics processing unit, Isolation (database systems)

Abstract: The graphics processing unit (GPU) continues to make significant strides as an accelerator in commodity cluster computing for high-performance computing (HPC). For example, three of the five fastest supercomputers in the world, as ranked by the TOP500, employ GPUs as accelerators. Despite this increasing interest in GPUs, however, optimizing the performance of a GPU-accelerated compute node requires deep technical knowledge of the underlying architecture. Although significant literature exists on how to optimize GPU performance on the more mature NVIDIA CUDA architecture, the converse is true for OpenCL on the AMD GPU. Consequently, we present and evaluate architecture-aware optimizations for the AMD GPU. The most prominent optimizations include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our optimizations by applying each optimization in isolation as well as in concert to a large-scale molecular modeling application called GEM. Via these AMD-specific optimizations, the AMD Radeon HD 5870 GPU delivers 65% better performance than with the well-known NVIDIA-specific optimizations.
