Authors: Peng Zhang, Jianbin Fang, Tao Tang, Canqun Yang, Zheng Wang
Keywords:
Abstract: Many-core accelerators, as represented by the XeonPhi coprocessors and GPGPUs, allow software to exploit spatial and temporal sharing of computing resources to improve overall system performance. To unlock this performance potential, software must effectively partition the hardware resource to maximize the overlap between host-device communication and accelerator computation, and match the granularity of task parallelism to the resource partition. However, determining the right resource partition and task granularity on a per program, per dataset basis is challenging. This is because the number of possible solutions is huge, and the benefit of choosing the right solution may be large, but mistakes can seriously hurt performance. In this paper, we present an automatic approach to determine the hardware resource partition and the task granularity for any given streamed application, targeting the Intel XeonPhi architecture. Instead of hand-crafting a heuristic, for which the process will have to repeat for each hardware generation, we employ machine learning techniques to automatically learn it. We achieve this by first learning a predictive model offline using training programs; we then use the learned model to predict the resource partition and task granularity for unseen programs at runtime. We apply our approach to 23 representative parallel applications and evaluate it on a CPU-XeonPhi mixed heterogeneous many-core platform. Our approach achieves, on average, a 1.6x (up to 5.6x) speedup, which translates to 94.5% of the performance delivered by a theoretically perfect predictor.
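Illustration: the two-stage workflow described in the abstract (learn a model offline from training programs, then predict a configuration for an unseen program at runtime) can be sketched roughly as follows. This is only a minimal sketch under stated assumptions. The abstract does not name the learning algorithm or the program features, so a 1-nearest-neighbour lookup is used here as a stand-in, and the feature vectors, the `(partitions, tasks)` configurations, and the names `TRAINING`, `train`, and `predict` are all hypothetical.

```python
from math import dist

# Hypothetical offline training data: each entry pairs a program's feature
# vector with the best (num_partitions, num_tasks) configuration found for it.
# All values are made up for illustration; they are not from the paper.
TRAINING = [
    ((0.2, 0.9), (4, 16)),
    ((0.8, 0.1), (2, 4)),
    ((0.5, 0.5), (8, 32)),
]


def train(samples):
    """Offline step: a 1-nearest-neighbour 'model' simply stores the samples."""
    return list(samples)


def predict(model, features):
    """Runtime step: return the configuration of the closest training program."""
    _, config = min(model, key=lambda sample: dist(sample[0], features))
    return config


model = train(TRAINING)
# An unseen program whose (hypothetical) features resemble the second sample.
print(predict(model, (0.75, 0.2)))
```

The point of the sketch is the split between an expensive offline phase and a cheap runtime lookup. A real system would replace the toy features and the nearest-neighbour rule with whatever the learned model actually uses.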