Efficient intra-node communication on Intel-MIC clusters

Authors: S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla, D. K. Panda

DOI: 10.1109/CCGRID.2013.86

Keywords: Computer science, Parallel computing, x86, Coprocessor, Performance per watt, InfiniBand, Xeon Phi, Shared memory, Supercomputer, Operating system, POSIX

Abstract: Accelerators and coprocessors have become a key component in modern supercomputing systems due to the superior performance per watt that they offer. Intel's Xeon Phi coprocessor packs up to 1 TFLOP of double precision performance in a single chip while providing x86 compatibility and supporting popular programming models like MPI and OpenMP. This makes it an attractive choice for accelerating HPC applications. The Xeon Phi provides several channels for communication between MPI processes running on the coprocessor and the host. While supporting POSIX shared memory within the coprocessor, it exposes a low level API called the Symmetric Communication Interface (SCIF) that gives direct control of the DMA engine to the user. SCIF can also be used for communication between the coprocessor and the host. Xeon Phi also provides an implementation of the InfiniBand (IB) Verbs interface that enables a direct communication link with the InfiniBand adapter. In this paper, we propose and evaluate design alternatives for efficient communication on a node with a Xeon Phi coprocessor. We incorporate our designs in the MVAPICH2 MPI library. We use shared memory, IB Verbs and SCIF to design a hybrid solution that improves the MPI communication latency from Xeon Phi to the Host by 70%, for 4MByte messages, compared to an out-of-the-box version of MVAPICH2. Our solution delivers more than 6x improvement in peak uni-directional bandwidth from Xeon Phi to the Host and more than 3x improvement in bi-directional bandwidth. Through our designs, we are able to improve the performance of 16 process Gather, Alltoall and Allgather collective operations by 70%, 85% and 80%, respectively, for 4MB messages. We further evaluate our designs using application benchmarks and show improvements of up to 18% with a 3D Stencil kernel and up to 11.5% with the P3DFFT library.

References (6)
Larry Meadows. Experiments with WRF on Intel® Many Integrated Core (Intel MIC) Architecture. International Workshop on OpenMP, pp. 130-139 (2012). DOI: 10.1007/978-3-642-30961-8_10
Dmitry Pekurovsky. P3DFFT: A Framework for Parallel Computations of Fourier Transforms in Three Dimensions. SIAM Journal on Scientific Computing, vol. 34 (2012). DOI: 10.1137/11082748X
L. Koesterke, J. Boisseau, J. Cazes, K. Milfeld, D. Stanzione. Early Experiences with the Intel Many Integrated Cores Accelerated Computing Technology. TeraGrid Conference, pp. 21- (2011). DOI: 10.1145/2016741.2016764
Michael Deisher, Mikhail Smelyanskiy, Brian Nickerson, Victor W. Lee, Michael Chuvelev, Pradeep Dubey. Designing and Dynamically Load Balancing Hybrid LU for Multi/Many-Core. Computer Science - Research and Development, vol. 26, pp. 211-220 (2011). DOI: 10.1007/S00450-011-0169-X
K. Tomko, H. Subramoni, S. Potluri, J. Vienne, K. Kandalla, B. Barth, D. K. Panda, K. Schulz, A. Moody, J. Keasler. Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes. IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 1-12 (2012). DOI: 10.5555/2388996.2389091
Sayantan Sur, Hyun-Wook Jin, Lei Chai, Dhabaleswar K. Panda. RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 32-39 (2006). DOI: 10.1145/1122971.1122978