Auto-tuning Streamed Applications on Intel Xeon Phi

Authors: Peng Zhang, Jianbin Fang, Tao Tang, Canqun Yang, Zheng Wang

DOI: 10.1109/IPDPS.2018.00061

Abstract: Many-core accelerators, as represented by the XeonPhi coprocessors and GPGPUs, allow software to exploit spatial and temporal sharing of computing resources to improve overall system performance. Unlocking this performance potential requires software to effectively partition the hardware resource to maximize the overlap between host-device communication and accelerator computation, and to match the granularity of task parallelism to the resource partition. However, determining the right resource partition and task parallelism on a per-program, per-dataset basis is challenging. This is because the number of possible solutions is huge, and the benefit of choosing the right solution may be large, but mistakes can seriously hurt performance. In this paper, we present an automatic approach to determine the hardware resource partition and the task granularity for any given streamed application, targeting the Intel XeonPhi architecture. Instead of hand-crafting a heuristic, for which the process will have to repeat for each hardware generation, we employ machine learning techniques to automatically learn it. We achieve this by first learning a predictive model offline using training programs; we then use the learned model to predict the resource partition and task granularity for unseen programs at runtime. We apply our approach to 23 representative parallel applications and evaluate it on a CPU-XeonPhi mixed heterogeneous many-core platform. Our approach achieves, on average, a 1.6x (up to 5.6x) speedup, which translates to 94.5% of the performance delivered by a theoretically perfect predictor.
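The offline-training, runtime-prediction workflow described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: a 1-nearest-neighbour lookup stands in for whatever learner the paper uses, and the feature names, training rows, and the `predict_config` helper are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical offline training data (illustrative numbers, not from the paper).
# Each row maps program features to the best configuration found by an
# offline search: (number of resource partitions, number of tasks).
# Assumed features: (log2 of bytes transferred, log2 of compute operations).
TRAINING = [
    ((10.0, 12.0), (2, 4)),
    ((20.0, 18.0), (4, 16)),
    ((26.0, 30.0), (8, 32)),
]


def predict_config(features, training=TRAINING):
    """Return the configuration of the closest training program (1-NN)."""
    nearest = min(training, key=lambda row: math.dist(row[0], features))
    return nearest[1]


# Runtime use: predict a configuration for an unseen program.
partitions, tasks = predict_config((19.0, 17.5))
print(partitions, tasks)
```

In a real tuner the lookup table would be replaced by a trained model and the features by measurements of the target program, but the split between offline learning and cheap runtime prediction stays the same.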

References (42)
Zheng Wang, Daniel Powell, Björn Franke, Michael O'Boyle. Exploitation of GPUs for the Parallelisation of Probably Parallel Legacy Code. Compiler Construction, pp. 154-173 (2014). DOI: 10.1007/978-3-642-54807-9_9
Elad Yom-Tov, Olivier Temam, Mircea Namolaru, Michael O'Boyle, Ayal Zaks, Grigori Fursin, Eric Courtois, Phil Barnard, Christopher K. I. Williams, Hugh Leather, Elton Ashton, Edwin Bonilla, Cupertino Miranda, Francois Bodin, Bilha Mendelson, John Thomson. MILEPOST GCC: machine learning based research compiler. GCC Summit (2008).
Yuan Wen, Zheng Wang, Michael F. P. O'Boyle. Smart multi-task scheduling for OpenCL programs on CPU/GPU heterogeneous platforms. IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 1-10 (2014). DOI: 10.1109/HIPC.2014.7116910
Dominik Grewe, Zheng Wang, Michael F. P. O'Boyle. OpenCL Task Partitioning in the Presence of GPU Contention. Languages and Compilers for Parallel Computing, pp. 87-101 (2013). DOI: 10.1007/978-3-319-09967-5_5
Sparsh Mittal, Jeffrey S. Vetter. A Survey of CPU-GPU Heterogeneous Computing Techniques. ACM Computing Surveys, vol. 47, pp. 69- (2015). DOI: 10.1145/2788396
Juan Gómez-Luna, José María González-Linares, José Ignacio Benavides, Nicolás Guil. Performance models for asynchronous data transfers on consumer Graphics Processing Units. Journal of Parallel and Distributed Computing, vol. 72, pp. 1117-1126 (2012). DOI: 10.1016/J.JPDC.2011.07.011
Michael Boyer, Jiayuan Meng, Kalyan Kumaran. Improving GPU Performance Prediction with Data Transfer Modeling. IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, pp. 1097-1106 (2013). DOI: 10.1109/IPDPSW.2013.236
D. Grewe, Zheng Wang, M. F. P. O'Boyle. Portable mapping of data parallel programs to OpenCL for heterogeneous systems. Symposium on Code Generation and Optimization, pp. 1-10 (2013). DOI: 10.1109/CGO.2013.6494993
B. van Werkhoven, J. Maassen, F. J. Seinstra, H. E. Bal. Performance models for CPU-GPU data transfers. IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 11-20 (2014). DOI: 10.1109/CCGRID.2014.16
Zheng Wang, Georgios Tournavitis, Björn Franke, Michael F. P. O'Boyle. Integrating profile-driven parallelism detection and machine-learning-based mapping. ACM Transactions on Architecture and Code Optimization, vol. 11, pp. 1-26 (2014). DOI: 10.1145/2579561