Pipelined multithreading transformations and support mechanisms

Authors: Ram Rangan, David I. August

DOI:

Keywords: State (computer science), Parallel computing, Variable (computer science), Multithreading, Overhead (engineering), Computer science, Scalability, Legacy system, Software pipelining, Queue

Abstract: Even though chip multiprocessors have emerged as the predominant organization for future microprocessors, the multiple on-chip cores do not directly result in improved application performance (especially for legacy applications, which are predominantly sequential C/C++ codes). Consequently, parallelizing applications to execute on multiple cores is essential to their success. Independent multithreading techniques, like DOALL extraction, create partially or fully independent threads, which communicate rarely, if at all. While such strategies keep high inter-thread communication costs from impacting program performance, they cannot be applied to parallelize general-purpose applications, which are characterized by difficult-to-break recurrences. Even though cyclic multithreading techniques, like DOACROSS, are more applicable, the cyclic inter-thread dependences created by these techniques cause them to have very low tolerance to rising inter-core latencies. To address these problems, this work introduces a pipelined multithreading (PMT) transformation called Decoupled Software Pipelining (DSWP). DSWP, in particular, and PMT techniques, in general, are able to tolerate inter-core latencies, while still handling codes with complex recurrences. They achieve this by enforcing an acyclic communication discipline amongst threads, which allows threads to use inter-thread queues to communicate in a pipelined fashion. This dissertation demonstrates that DSWPed codes not only tolerate inter-core communication costs, but also effectively tolerate variable latency stalls better than single-threaded execution on both in-order and out-of-order issue processors with comparable resources. It then performs a thorough analysis of the scalability of automatically generated DSWPed codes and identifies the conditions necessary to achieve peak performance. Next, it shows that even though PMT codes tolerate inter-core latencies well, the high frequency of inter-thread communication (once every 5 to 20 dynamic instructions) in these codes makes them very sensitive to the intra-thread overhead imposed by communication operations. In order to understand the issues surrounding inter-thread communication, this dissertation undertakes a methodical exploration of the design space of communication support options for PMT. Three new communication mechanisms with varying cost-performance tradeoffs are introduced and shown to perform 38% to 200% better than the state of the art.
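
The acyclic, queue-based discipline described in the abstract can be illustrated with a small sketch. This is not the dissertation's compiler transformation or its hardware queue mechanisms; it is a hand-written, two-stage split of a linked-list loop, assuming Python's standard `queue.Queue` as a stand-in for an inter-thread queue. All names (`Node`, `build_list`, `pipelined_sum_of_squares`, `SENTINEL`, `capacity`) are hypothetical. The first stage owns the pointer-chasing recurrence, the second stage does the per-element work, and data flows in one direction only.

```python
import queue
import threading

# End-of-stream marker; an assumption of this sketch, not part of DSWP itself.
SENTINEL = object()


class Node:
    """Singly linked list node; traversing it is a loop-carried recurrence."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def build_list(values):
    """Build a linked list holding the given values in order."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def sequential_sum_of_squares(head):
    """Original single-threaded loop: traversal and work are interleaved."""
    total = 0
    node = head
    while node is not None:
        total += node.value * node.value
        node = node.next
    return total


def pipelined_sum_of_squares(head, capacity=32):
    """Two-stage pipeline: stage 1 traverses, stage 2 computes.

    Communication is acyclic: stage 1 only produces into the queue and
    stage 2 only consumes from it, so stage 1 never waits on stage 2's
    results (it blocks only when the bounded queue is full).
    """
    q = queue.Queue(maxsize=capacity)
    result = []

    def traverse():
        # Stage 1: keeps the recurrence (node = node.next) inside one thread.
        node = head
        while node is not None:
            q.put(node.value)
            node = node.next
        q.put(SENTINEL)

    def compute():
        # Stage 2: consumes values and performs the per-element work.
        total = 0
        while True:
            item = q.get()
            if item is SENTINEL:
                break
            total += item * item
        result.append(total)

    stages = [threading.Thread(target=traverse),
              threading.Thread(target=compute)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return result[0]
```

Because of CPython's global interpreter lock, this sketch shows only the structure of the pipeline, not a speedup. It also makes the overhead concern in the abstract concrete: every element costs one `put` and one `get`, so the per-operation cost of the queue directly affects each stage.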
