Automated Transformation of GPU-Specific OpenCL Kernels Targeting Performance Portability on Multi-Core/Many-Core CPUs

Authors: Dafei Huang, Mei Wen, Changqing Xun, Dong Chen, Xing Cai

DOI: 10.1007/978-3-319-09873-9_18

Keywords: Many core; Parallel computing; Multi-core processor; Coprocessor; Central processing unit; Software portability; Thread (computing); Locality; Computer science

Abstract: When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and thus extensively used. However, locality concerns exposed in GPU-specific OpenCL code are usually inherited without analysis, which may have side effects on CPU performance. When GPU-specific kernels execute on CPUs, local-memory arrays no longer match the hardware well, and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns by using array-access descriptors derived from GPU-specific kernels, which can then be adapted for CPUs by removing all the unwanted local-memory arrays together with the obsolete barrier statements. Experiments show that the automated transformation can satisfactorily improve OpenCL kernel performance on a Sandy Bridge CPU and on Intel’s Many-Integrated-Core coprocessor.

References (10)
John A. Stratton, Sam S. Stone, Wen-mei W. Hwu. MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs. Languages and Compilers for Parallel Computing, pp. 16-30 (2008). DOI: 10.1007/978-3-540-89740-8_2
Sangmin Seo, Gangwon Jo, Jaejin Lee, Jun Lee. Automatic OpenCL Work-Group Size Selection for Multicore CPUs. International Conference on Parallel Architectures and Compilation Techniques, pp. 387-398 (2013). DOI: 10.5555/2523721.2523772
Jayanth Gummaraju, Laurent Morichetti, Michael Houston, Ben Sander, Benedict R. Gaster, Bixia Zheng. Twin Peaks: A Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors. International Conference on Parallel Architectures and Compilation Techniques, pp. 205-216 (2010). DOI: 10.1145/1854273.1854302
Cedric Bastoul. Code Generation in the Polyhedral Model Is Easier Than You Think. International Conference on Parallel Architectures and Compilation Techniques, pp. 7-16 (2004). DOI: 10.5555/1025127.1025992
S.J. Pennycook, S.D. Hammond, S.A. Wright, J.A. Herdman, I. Miller, S.A. Jarvis. An Investigation of the Performance Portability of OpenCL. Journal of Parallel and Distributed Computing, vol. 73, pp. 1439-1450 (2013). DOI: 10.1016/J.JPDC.2012.07.005
M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, P. Sadayappan. A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs. Proceedings of the 22nd Annual International Conference on Supercomputing (ICS '08), pp. 225-234 (2008). DOI: 10.1145/1375527.1375562
John A. Stratton, Vinod Grover, Jaydeep Marathe, Bastiaan Aarts, Mike Murphy, Ziang Hu, Wen-mei W. Hwu. Efficient Compilation of Fine-Grained SPMD-Threaded Programs for Multicore CPUs. Symposium on Code Generation and Optimization, pp. 111-119 (2010). DOI: 10.1145/1772954.1772971
V. Balasundaram, K. Kennedy. A Technique for Summarizing Data Access and Its Use in Parallelism Enhancing Transformations. Proceedings of the ACM SIGPLAN 1989 Conference on Programming Language Design and Implementation (PLDI '89), vol. 24, pp. 41-53 (1989). DOI: 10.1145/73141.74822
Wen-mei W. Hwu, John A. Stratton, Thomas B. Jablin, Hee-Seok Kim. Performance Portability in Accelerated Parallel Kernels. hgpu.org (2013).