Abstract: In response to increasing (relative) wire delay, architects have proposed various technologies to manage the impact of slow wires on large uniprocessor L2 caches. Block migration (e.g., D-NUCA and NuRapid) reduces average hit latency by migrating frequently used blocks towards lower-latency banks. Transmission Line Caches (TLC) use on-chip transmission lines to provide low latency to all banks. Traditional stride-based hardware prefetching strives to tolerate, rather than reduce, latency. Chip multiprocessors (CMPs) present additional challenges. First, CMPs often share the on-chip L2 cache, requiring multiple ports to provide sufficient bandwidth. Second, multiple threads mean multiple working sets, which compete for limited on-chip storage. Third, sharing code and data interferes with block migration, since one processor's low-latency bank is another processor's high-latency bank. In this paper, we develop L2 cache designs for CMPs that incorporate these three latency management techniques. We use detailed full-system simulation to analyze the performance trade-offs for both commercial and scientific workloads. We demonstrate that block migration is less effective for CMPs because 40-60% of L2 hits in commercial workloads are satisfied in the central banks, which are equally far from all processors. We observe that although transmission lines provide low latency, contention for their restricted bandwidth limits performance. We show that stride-based prefetching between L1 and L2 caches alone improves performance at least as much as the other two techniques. Finally, we present a hybrid design, combining all three techniques, that provides an additional 2% to 19% improvement over prefetching alone.