Improving effective bandwidth through compiler enhancement of global and dynamic cache reuse

Authors: Chen Ding, Ken Kennedy

Abstract: While CPU speed has been improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply simply cannot keep a CPU busy, and applications often utilize only a few percent of peak CPU performance. The hardware solution, which provides layers of high-bandwidth data cache, is not effective for large and complex applications, primarily for two reasons: far-separated data reuse and large-stride data access. The first repeats unnecessary transfer and the second communicates useless data. Both waste memory bandwidth. This dissertation pursues a software remedy. It investigates the potential for compiler optimizations to alter program behavior and reduce its memory bandwidth consumption. To this end, this research has studied a two-step transformation strategy: first fuse computations on the same data, and then group data used by the same computation. Existing techniques such as loop blocking can be viewed as an application of this strategy within a single loop nest. In order to carry out this strategy to its full extent, this research has developed a set of compiler transformations that perform computation fusion and data grouping over the whole program and during the entire execution. The major new techniques and their unique contributions are: Maximal loop fusion: an algorithm that achieves maximal fusion among all program statements and bounded reuse distance within a fused loop. Inter-array data regrouping: the first to selectively group global data structures and to do so with guaranteed profitability and compile-time optimality. Locality grouping and dynamic packing: the first set of compiler-inserted and compiler-optimized computation and data transformations at run time. These optimizations have been implemented in a research compiler and evaluated on real-world applications on SGI Origin2000. The result shows that, on average, the new strategy eliminates 41% of memory loads in regular applications and 63% in irregular and dynamic programs. As a result, the overall execution time is shortened by 12% to 77%. In addition to compiler optimizations, this research has developed a performance model and designed a performance tool. The former allows precise measurement of the memory bandwidth bottleneck; the latter enables effective user tuning and accurate performance prediction for large applications: neither goal was achieved before this thesis.
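The two-step strategy named in the abstract (fuse computations on the same data, then group data used by the same computation) can be illustrated with a toy example. The sketch below is not the dissertation's algorithm or compiler output. All function names and the arithmetic are hypothetical, chosen only to show the change in access order and data layout. Python lists do not expose cache behavior, so the bandwidth benefit would only appear for contiguous arrays in a compiled language.

```python
# Illustrative sketch only: names and arithmetic are hypothetical.

def unfused(a, b):
    """Two separate loops: each a[i] is read twice, a whole array pass apart."""
    n = len(a)
    tmp = [0.0] * n
    for i in range(n):            # first loop reads a[i]
        tmp[i] = a[i] * 2.0
    out = [0.0] * n
    for i in range(n):            # second loop reads a[i] again (far-separated reuse)
        out[i] = tmp[i] + a[i] + b[i]
    return out


def fused(a, b):
    """Step 1, computation fusion: both uses of a[i] fall in one iteration."""
    n = len(a)
    out = [0.0] * n
    for i in range(n):
        t = a[i] * 2.0            # the temporary array shrinks to a scalar
        out[i] = t + a[i] + b[i]
    return out


def regroup(a, b):
    """Step 2, data grouping: interleave two arrays that are always used together."""
    ab = [0.0] * (2 * len(a))
    ab[0::2] = a                  # even slots hold a[i]
    ab[1::2] = b                  # odd slots hold b[i]
    return ab


def fused_regrouped(ab):
    """Fused loop over the interleaved layout: a[i] and b[i] are adjacent."""
    n = len(ab) // 2
    out = [0.0] * n
    for i in range(n):
        x = ab[2 * i]
        y = ab[2 * i + 1]
        out[i] = x * 2.0 + x + y
    return out
```

All three variants compute the same values; only the order of accesses and the placement of the data differ, which is the property a compiler must preserve when applying such transformations.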
