Authors: Jianbin Fang, Zheng Wang, Peng Zhang, Tao Tang, Canqun Yang
DOI:
Keywords: Speedup, Granularity, Machine learning, Artificial intelligence, Partition (database), Computer science, Coprocessor, Exploit, Xeon Phi, Software, Task parallelism
Abstract: Many-core accelerators, as represented by the XeonPhi coprocessors and GPGPUs, allow software to exploit spatial and temporal sharing of computing resources to improve overall system performance. Unlocking this performance potential requires software to effectively partition the hardware resources to maximize the overlap between host-device communication and accelerator computation, and to match the granularity of task parallelism to the resource partition. However, determining the right partition and task granularity on a per-program, per-dataset basis is challenging, because the number of possible solutions is huge; the benefit of choosing the right solution may be large, but mistakes can seriously hurt performance. In this paper, we present an automatic approach to determine the hardware resource partition and task granularity for any given application, targeting the Intel XeonPhi architecture. Instead of hand-crafting a heuristic, a process that must be repeated for each hardware generation, we employ machine learning techniques to automatically learn it. We achieve this by first building a predictive model offline using a set of training programs; we then use the learned model to predict the resource partition and task granularity for unseen programs at runtime. We apply our approach to 23 representative parallel applications and evaluate it on a CPU-XeonPhi mixed heterogeneous many-core platform. Our approach achieves, on average, a 1.6x (up to 5.6x) speedup, which translates to 94.5% of the performance delivered by a theoretically perfect predictor.
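The abstract describes a two-phase workflow: an offline phase that learns a predictive model from labelled training programs, and a runtime phase that queries the model with an unseen program's features to pick a resource partition and task granularity. The sketch below illustrates that workflow only; the feature names, the training data, and the nearest-neighbour lookup are hypothetical placeholders standing in for whatever features and model the authors actually use.

```python
# Minimal sketch of an offline-training / runtime-prediction workflow, as
# outlined in the abstract. All names and values here are illustrative
# assumptions, not the paper's actual features, labels, or model.
from dataclasses import dataclass
from typing import List, Tuple
import math


@dataclass
class Sample:
    features: Tuple[float, ...]  # normalised program/dataset features (hypothetical)
    partition: int               # accelerator cores assigned to computation
    granularity: int             # streaming task granularity, e.g. chunk size


def train(training_programs: List[Sample]) -> List[Sample]:
    """Offline phase: this toy 'model' just memorises the labelled samples;
    a real system would fit a learned model (e.g. a classifier) here."""
    return list(training_programs)


def predict(model: List[Sample], features: Tuple[float, ...]) -> Tuple[int, int]:
    """Runtime phase: predict partition and granularity for an unseen
    program via nearest-neighbour lookup over the training samples."""
    best = min(model, key=lambda s: math.dist(s.features, features))
    return best.partition, best.granularity


if __name__ == "__main__":
    model = train([
        Sample((0.2, 0.8), partition=30, granularity=1024),
        Sample((0.9, 0.1), partition=57, granularity=256),
    ])
    print(predict(model, (0.25, 0.7)))  # -> (30, 1024)
```

In this sketch the "model" is a stored sample set queried by Euclidean distance; the point is only to show where the offline and runtime phases sit relative to each other, not to reproduce the paper's learning technique.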