Authors: Chen Ding, Ken Kennedy
DOI:
Keywords:
Abstract: While CPU speed has improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply simply cannot keep the CPU busy, and applications often utilize only a few percent of peak CPU performance. The hardware solution, which provides layers of high-bandwidth data cache, is not effective for large and complex applications, primarily for two reasons: far-separated data reuse and large-stride data access. The first repeats unnecessary transfer and the second communicates useless data. Both waste memory bandwidth. This dissertation pursues a software remedy. It investigates the potential for compiler optimizations to alter program behavior and reduce its memory bandwidth consumption. To this end, this research has studied a two-step transformation strategy: first fuse computations on the same data, then group data used by the same computation. Existing techniques such as loop blocking can be viewed as an application of this strategy within a single loop nest. In order to carry out this strategy to its full extent, this research has developed a set of compiler transformations that perform computation fusion and data grouping over the whole program and during the entire execution. The major new techniques and their unique contributions are: (1) Maximal loop fusion: an algorithm that achieves maximal fusion among all program statements and bounded reuse distance within a fused loop. (2) Inter-array data regrouping: the first to selectively group global data structures and to do so with guaranteed profitability and compile-time optimality. (3) Locality grouping and dynamic packing: the first set of compiler-inserted and compiler-optimized computation and data transformations at run time. These optimizations have been implemented in a research compiler and evaluated on real-world applications on an SGI Origin2000. The result shows that, on average, the new strategy eliminates 41% of memory loads in regular applications and 63% in irregular and dynamic programs. As a result, the overall execution time is shortened by 12% to 77%. In addition to compiler optimizations, this research has developed a performance model and designed a performance tool. The former allows precise measurement of the memory bandwidth bottleneck; the latter enables effective user tuning and accurate performance prediction for large applications: neither goal was achieved before this thesis.
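
The two-step strategy named in the abstract (fuse computations that use the same data, then group data used by the same computation) can be illustrated with a minimal sketch. This is not code from the dissertation: the function names, the arrays `a`, `b`, `c`, and the interleaved layout `ab` are all hypothetical, chosen only for illustration. Python lists do not expose cache behavior, so the sketch shows only that the transformations preserve results, not the bandwidth savings a compiler would obtain on real arrays.

```python
# Illustrative sketch only: hypothetical names, not code from the dissertation.

def unfused(a, b):
    """Two separate loops: each a[i] is reused in a second, far-separated pass."""
    n = len(a)
    c = [0.0] * n
    for i in range(n):            # first pass reads a[i] and b[i]
        c[i] = a[i] + b[i]
    total = 0.0
    for i in range(n):            # second pass reads a[i] again, much later
        total += a[i] * c[i]
    return c, total


def fused(a, b):
    """Step 1 (fusion): both uses of a[i] occur in the same iteration."""
    n = len(a)
    c = [0.0] * n
    total = 0.0
    for i in range(n):
        c[i] = a[i] + b[i]
        total += a[i] * c[i]
    return c, total


def regroup(a, b):
    """Step 2 (grouping): interleave a and b so values used together are adjacent."""
    ab = [0.0] * (2 * len(a))
    ab[0::2] = a                  # a[i] is stored at ab[2*i]
    ab[1::2] = b                  # b[i] is stored at ab[2*i + 1]
    return ab


def fused_regrouped(ab):
    """The fused loop, reading from the interleaved layout."""
    n = len(ab) // 2
    c = [0.0] * n
    total = 0.0
    for i in range(n):
        x, y = ab[2 * i], ab[2 * i + 1]
        c[i] = x + y
        total += x * c[i]
    return c, total
```

All three variants compute the same values; in a compiled setting the fused version would shorten the reuse distance of `a[i]`, and the regrouped layout would place `a[i]` and `b[i]` next to each other in memory so that one transfer brings in both.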