Reproducible user-level simulation of multi-threaded workloads

Authors: Brad Calder, Cristiano Pereira

Keywords: Porting, Hand coding, Emulation, Distributed computing, Deterministic simulation, Benchmark (computing), Task (computing), Real-time computing, Interrupt, System call, Computer science

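Abstract: As the complexity of processors increases, it becomes harder for designers to understand non-trivial and often non-intuitive interactions among internal micro-architecture structures. Understanding these interactions is important because it helps designers pinpoint bottlenecks, enabling them to reason about sources of performance loss and improve their next generation of processors. To help in understanding current and, more importantly, future designs, designers make heavy use of detailed computer architecture simulation. These simulators model the behavior of the processor on a per-cycle basis, allowing designers to look at very detailed trade-offs. Building and maintaining these simulators is a large and complicated task. In addition, the recent trend of designing micro-architectures with multiple cores on the same chip brings new challenges that affect the way simulation results should be compared. This dissertation focuses on techniques to help designers build and maintain simulators, as well as to help architects evaluate design choices using simulators. Existing user-level simulators require manual hand coding of the emulation of each and every possible system effect (e.g., system call, interrupt, DMA transfer) that can impact the application's execution. Developing such an emulator for a given operating system is a tedious exercise, and it can also be costly to maintain it to support newer versions of that operating system. Furthermore, porting the emulator to a completely different operating system might involve building it all over again from scratch. The first contribution of this dissertation is a technique to automatically capture the system effects of an application. The system effects are captured in logs, and the logs are then used to guide the simulation. By using the proposed technique, the complexity of implementing the simulators is greatly reduced. The technique also guarantees deterministic simulation on uni-processor systems. As multi-core systems become mainstream, techniques to address efficient simulation of multi-threaded workloads are needed. Simulation of multi-threaded workloads on multi-core systems suffers from non-determinism across runs and across configurations. If the execution paths between two simulation runs of the same benchmark, with the same input, are too different, the simulation results cannot be used to compare the configurations. The other contributions of this dissertation focus on how to efficiently collect checkpoints for multi-threaded workloads. It extends the previous technique to collect checkpoints for multi-threaded workloads. Using the checkpoints, simulation of multi-threaded workloads can be made deterministic. The deterministic simulation introduces stalls that would not naturally occur in the execution of the workload. This dissertation proposes techniques to allow one to accurately compare performance across configurations in the presence of these stalls.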
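
The log-and-replay idea described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the class names (`SystemEffectLog`, `RecordingSystem`, `ReplayingSystem`), the two toy system calls, and the use of Python's `random` module as a stand-in for operating-system behavior that varies between runs are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SystemEffectLog:
    """Ordered record of the value each system effect returned to the application."""
    entries: list = field(default_factory=list)


class RecordingSystem:
    """Executes (toy, nondeterministic) system calls and logs their effects."""

    def __init__(self, log):
        self.log = log

    def syscall(self, name, *args):
        result = self._execute(name, *args)
        self.log.entries.append((name, args, result))
        return result

    def _execute(self, name, *args):
        # Hypothetical stand-ins for OS behaviour that differs from run to run.
        if name == "gettime":
            return random.randint(0, 10**9)
        if name == "read":
            return bytes(random.getrandbits(8) for _ in range(args[0]))
        raise ValueError(f"unknown system call: {name}")


class ReplayingSystem:
    """Never executes a system call; injects the logged effect instead."""

    def __init__(self, log):
        self._entries = iter(log.entries)

    def syscall(self, name, *args):
        logged_name, logged_args, result = next(self._entries)
        if (logged_name, logged_args) != (name, args):
            raise RuntimeError("replay diverged from the recorded execution")
        return result


def application(system):
    """Toy workload whose result depends on system effects."""
    start = system.syscall("gettime")
    data = system.syscall("read", 4)
    return start + sum(data)


def record_and_replay(runs=3):
    """Record one execution, then replay it `runs` times from the log."""
    log = SystemEffectLog()
    recorded = application(RecordingSystem(log))
    replayed = [application(ReplayingSystem(log)) for _ in range(runs)]
    return recorded, replayed
```

In this sketch the replaying side needs no emulation code for individual system calls: it only checks that the application requests the same effects in the same order and hands back the logged results, which is why every replayed run produces the same value as the recorded one.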
