Author: Mainak Chaudhuri
DOI: 10.1109/HPCA.2009.4798258
Keywords:
Abstract: As the last-level on-chip caches in chip-multiprocessors increase in size, the physical locality of on-chip data becomes important for delivering high performance. The non-uniform access latency seen by a core to different independent banks of a large cache spread over the chip necessitates active mechanisms for improving data locality. The central proposal of this paper is a fully hardwired coarse-grain data migration mechanism that dynamically monitors the access patterns of the cores at the granularity of a page to reduce the book-keeping overhead, and decides when and where to migrate an entire page of data to amortize the performance overhead. The page-grain migration mechanism is compared against two variants of previously proposed block-grain dynamic migration and against OS-assisted static locality management mechanisms. Our detailed execution-driven simulation of an eight-core chip-multiprocessor with a shared 16 MB L2 cache, employing a bidirectional ring to connect the cores and the L2 cache banks, shows that hardwired dynamic page migration, while using only 4.8% extra storage out of the total budget, delivers the best performance and energy-efficiency across a set of shared-memory parallel applications selected from the SPLASH-2, SPEC OMP, DARPA DIS, and FFTW suites and multiprogrammed workloads prepared from the SPEC 2000 and BioBench suites. It reduces execution time by 18.7% and 12.6% on average (geometric mean) for the shared-memory applications and the multiprogrammed workloads, respectively, compared to a baseline architecture that distributes the pages round-robin across the L2 cache banks.