Architecture-Aware Optimization on a 1600-core Graphics Processor

Authors: Thomas R. W. Scogland, Mayank Daga, Wu-chun Feng

DOI:

Keywords: Parallel computing, TOP500, Graphics, General-purpose computing on graphics processing units, CUDA, Node (networking), Computer science, Computer cluster, Graphics processing unit, Isolation (database systems)

Abstract: The graphics processing unit (GPU) continues to make significant strides as an accelerator in commodity cluster computing for high-performance computing (HPC). For example, three of the top five fastest supercomputers in the world, as ranked by the TOP500, employ GPUs as accelerators. Despite this increasing interest in GPUs, however, optimizing the performance of a GPU-accelerated compute node requires deep technical knowledge of the underlying architecture. Although significant literature exists on how to optimize GPU performance on the more mature NVIDIA CUDA architecture, the converse is true for OpenCL on the AMD GPU. Consequently, we present and evaluate architecture-aware optimizations for the AMD GPU. The most prominent optimizations include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our optimizations by applying each optimization in isolation as well as in concert to a large-scale molecular modeling application called GEM. Via these AMD-specific optimizations, the Radeon HD 5870 delivers 65% better performance than with the well-known NVIDIA-specific optimizations.
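As a rough illustration of two of the optimizations named in the abstract (removal of branches and use of vector types), the following minimal sketch in plain C contrasts a branching scalar loop with a branch-free, four-wide variant. It is not taken from the paper or from GEM: the vec4 struct is a hypothetical stand-in for OpenCL's float4, the function names are invented for this example, and the inputs are assumed to be finite. Whether the rewritten loop is actually faster depends on the compiler and the hardware.

```c
#include <stddef.h>

/* Hypothetical four-wide vector, standing in for OpenCL's float4. */
typedef struct { float x, y, z, w; } vec4;

/* Baseline: scalar loop with a data-dependent branch. */
float sum_above_branchy(const float *v, size_t n, float cutoff) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (v[i] > cutoff) {
            acc += v[i];
        }
    }
    return acc;
}

/* Branch-free select: the comparison yields 0 or 1 and is used as a
 * multiplier. Assumes finite inputs (0 * infinity would give NaN). */
static float keep_if_above(float value, float cutoff) {
    return (float)(value > cutoff) * value;
}

/* Four elements per iteration with four independent accumulators held in a
 * local variable, followed by a scalar loop for any leftover elements. */
float sum_above_vec4(const float *v, size_t n, float cutoff) {
    vec4 acc = {0.0f, 0.0f, 0.0f, 0.0f};
    float tail = 0.0f;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        acc.x += keep_if_above(v[i],     cutoff);
        acc.y += keep_if_above(v[i + 1], cutoff);
        acc.z += keep_if_above(v[i + 2], cutoff);
        acc.w += keep_if_above(v[i + 3], cutoff);
    }
    for (; i < n; ++i) {
        tail += keep_if_above(v[i], cutoff);
    }
    return (acc.x + acc.y) + (acc.z + acc.w) + tail;
}
```

Because the four-wide version adds the elements in a different order, its result can differ from the baseline in the last bits for general floating-point input; the two agree exactly when every partial sum is exactly representable.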

