High-Performance Communication Primitives and Data Structures on Message-Passing Manycores: Broadcast and Map

Author: Omid Shahmirzadi

DOI:

Keywords:

Abstract: The constant increase in single-core frequency has reached a plateau in recent years. This is due to a physical phenomenon, known as the power wall, where the heat produced inside the chip is so high that it cannot be cooled down by existing technologies. An alternative way to harvest more computational power per die is to fabricate a larger number of cores into a single chip. Therefore, manycore chips with more than a thousand cores are expected by the end of the decade. These environments provide a high level of parallel processing power, while their energy consumption is considerably lower than that of their multi-chip counterparts. Although shared-memory programming is the classical paradigm to program these environments, there are numerous claims that, taking into account the full life cycle of software, the message-passing programming model has advantages. The direct architectural consequence of applying a message-passing programming model is to support message passing between the processing entities directly in hardware. Therefore, manycore architectures with hardware support for message passing are becoming more and more visible. These platforms can be seen in two ways: (i) as a High Performance Computing (HPC) cluster programmed by highly trained scientists using Message Passing Interface (MPI) libraries; or (ii) as a mainstream computing platform requiring a global operating system to abstract away the architectural complexities from the ordinary programmer. In the first view, the performance of communication primitives is an important bottleneck for MPI applications. In the second view, kernel data structures have been shown to be a limiting factor. In this thesis, we overview the state-of-the-art techniques to circumvent the mentioned bottlenecks, and we study a high-performance broadcast communication primitive and a map data structure on modern manycore architectures with hardware support for message passing, in different chapters respectively. In one chapter, we study how to make use of hardware features to implement an efficient broadcast primitive. We consider the Intel Single-chip Cloud Computer (SCC) as our target platform, which offers the ability to move data between on-chip Message Passing Buffers (MPB) using Remote Memory Access (RMA). We propose OC-Bcast (On-Chip Broadcast), a pipelined k-ary tree algorithm tailored to exploit the parallelism provided by on-chip RMA. Experimental results show that OC-Bcast attains better performance in terms of latency and throughput compared to state-of-the-art solutions. This performance improvement highlights the benefits of exploiting the hardware features of the target platform: our broadcast algorithm takes direct advantage of RMA, unlike other broadcast solutions, which are based on a higher-level send/receive interface. The other chapter studies the implementation of high-throughput concurrent maps.
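The idea of a pipelined k-ary tree broadcast mentioned in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the OC-Bcast implementation: the heap-style rank numbering (root at rank 0, children of rank r at k*r+1 to k*r+k), the synchronous step model, and the names `children` and `pipelined_broadcast` are all hypothetical, and no MPB or RMA hardware behaviour is modelled.

```python
def children(rank, k, n):
    """Ranks of the children of `rank` in a k-ary tree over ranks 0..n-1.

    Assumption: heap-style numbering with the root at rank 0.
    """
    first = k * rank + 1
    return list(range(first, min(first + k, n)))


def pipelined_broadcast(message, n, k, chunk_size):
    """Simulate a pipelined k-ary tree broadcast of `message` from rank 0.

    The message is split into chunks. In each synchronous step, every node
    forwards to each child the next chunk that child is missing, so chunks
    flow down the tree one level per step.
    Returns (per-rank reassembled buffers, number of steps taken).
    """
    chunks = [message[i:i + chunk_size]
              for i in range(0, len(message), chunk_size)]
    received = [[] for _ in range(n)]
    received[0] = list(chunks)  # the root starts with the whole message
    steps = 0
    while any(len(r) < len(chunks) for r in received):
        # Snapshot the chunk counts so a chunk advances only one level per step.
        have = [len(r) for r in received]
        for parent in range(n):
            for child in children(parent, k, n):
                if have[child] < have[parent]:
                    received[child].append(received[parent][have[child]])
        steps += 1
    return [b"".join(r) for r in received], steps


if __name__ == "__main__":
    buffers, steps = pipelined_broadcast(b"abcdefgh", n=7, k=2, chunk_size=2)
    print(steps, buffers[-1])
```

In this model, a tree of depth d delivers C chunks in d + C - 1 steps rather than d * C, which is the benefit pipelining is meant to capture; a larger k reduces d at the cost of more children per parent.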
