PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches

Author: Mainak Chaudhuri

DOI: 10.1109/HPCA.2009.4798258

Abstract: As the last-level on-chip caches in chip-multiprocessors increase in size, the physical locality of on-chip data becomes important for delivering high performance. The non-uniform access latency seen by a core to the different independent banks of a large cache spread over the chip necessitates active mechanisms for improving data locality. The central proposal of this paper is a fully hardwired coarse-grain data migration mechanism that dynamically monitors the access patterns of the cores at the granularity of a page, to reduce the book-keeping overhead, and decides when and where to migrate an entire page of data, to amortize the performance overhead. The page-grain migration mechanism is compared against two variants of previously proposed block-grain dynamic migration mechanisms and OS-assisted static locality management mechanisms. Our detailed execution-driven simulation of an eight-core chip-multiprocessor with a shared 16 MB L2 cache, employing a bidirectional ring to connect the cores and the L2 cache banks, shows that hardwired dynamic page migration, while using only 4.8% extra storage out of the total L2 cache and book-keeping budget, delivers the best performance and energy-efficiency across a set of shared memory parallel applications selected from the SPLASH-2, SPEC OMP, DARPA DIS, and FFTW suites and multiprogrammed workloads prepared out of the SPEC 2000 and BioBench suites. It reduces execution time by 18.7% and 12.6% on average (geometric mean), respectively, for the shared memory applications and the multiprogrammed workloads, compared to a baseline architecture that distributes the pages round-robin across the L2 cache banks.
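
To make the baseline and the locality problem concrete, here is a minimal Python sketch of round-robin page-to-bank mapping and of hop distance on a bidirectional ring. It is an illustration under stated assumptions, not the paper's hardware design: the page size, the bank count, the one-bank-per-ring-stop layout, and the helper names (`home_bank_round_robin`, `ring_hops`, `best_bank`) are all hypothetical, and the access-weighted target selection is a simple stand-in for whatever migration policy the paper actually uses.

```python
from collections import Counter

# All three constants are assumptions for illustration only.
PAGE_SIZE = 4096   # bytes per page
NUM_BANKS = 16     # number of L2 banks
NUM_STOPS = 16     # ring stops; bank b is assumed to sit at stop b


def home_bank_round_robin(phys_addr: int) -> int:
    """Baseline placement: consecutive physical pages go to consecutive banks."""
    return (phys_addr // PAGE_SIZE) % NUM_BANKS


def ring_hops(src: int, dst: int) -> int:
    """Hop count between two stops, taking the shorter ring direction."""
    d = abs(src - dst) % NUM_STOPS
    return min(d, NUM_STOPS - d)


def best_bank(access_counts: Counter, core_stop: dict) -> int:
    """Bank that minimizes the access-weighted hop count for one page.

    access_counts maps core id -> number of accesses to the page;
    core_stop maps core id -> ring stop of that core (hypothetical layout).
    """
    def cost(bank: int) -> int:
        return sum(n * ring_hops(core_stop[c], bank)
                   for c, n in access_counts.items())
    return min(range(NUM_BANKS), key=cost)
```

Under these assumptions, a page touched only by a core at stop 3 would ideally live in bank 3, whereas round-robin placement picks the bank from the page number alone, regardless of which core uses the page.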
