Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments

Authors: John Jenkins, James Dinan, Pavan Balaji, Nagiza F. Samatova, Rajeev Thakur

DOI: 10.1109/CLUSTER.2012.72

Abstract: Lack of efficient and transparent interaction with GPU data in hybrid MPI+GPU environments challenges GPU acceleration of large-scale scientific computations. A particular challenge is the transfer of noncontiguous data to and from GPU memory. MPI implementations currently do not provide an efficient means of utilizing datatypes for noncontiguous communication of GPU data. To address this gap, we present an MPI datatype-processing system capable of efficiently processing arbitrary datatypes directly on the GPU. We present a means for converting conventional datatype representations into a GPU-amenable format. Fine-grained, element-level parallelism is then utilized by a GPU kernel to perform in-device packing and unpacking of noncontiguous elements. We demonstrate a several-fold performance improvement for noncontiguous column vectors, 3D array slices, and 4D array subvolumes over CUDA-based alternatives. Compared with optimized, layout-specific implementations, our approach incurs low overhead, while enabling the packing of datatypes that do not have a direct CUDA equivalent. These improvements are demonstrated to translate to significant improvements in end-to-end, GPU-to-GPU communication time. In addition, we identify and evaluate communication patterns that may cause resource contention with packing operations, providing a baseline for adaptively selecting data-processing strategies.
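The element-level packing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes only a single strided layout described by a (count, blocklength, stride) triple, in the spirit of the arguments to `MPI_Type_vector`, and it emulates GPU threads with an ordinary loop. The function names `source_offset`, `pack`, and `unpack` are hypothetical.

```python
# Sketch only: emulates per-element packing of a strided "vector" layout.
# count, blocklength, and stride are measured in elements, as in MPI_Type_vector.
# All function names here are hypothetical and not taken from the paper.

def source_offset(i, blocklength, stride):
    """Map packed index i to its offset in the noncontiguous buffer."""
    block, within = divmod(i, blocklength)
    return block * stride + within

def pack(src, count, blocklength, stride):
    """Gather count blocks of blocklength elements, spaced stride apart."""
    n = count * blocklength
    packed = [None] * n
    # On a GPU, each iteration would map to one thread; the iterations
    # are independent, so they can run in any order or in parallel.
    for i in range(n):
        packed[i] = src[source_offset(i, blocklength, stride)]
    return packed

def unpack(packed, dst, count, blocklength, stride):
    """Scatter packed elements back into the strided destination buffer."""
    for i in range(count * blocklength):
        dst[source_offset(i, blocklength, stride)] = packed[i]
    return dst

# Column 1 of a 3x4 row-major matrix: 3 blocks of 1 element, 4 elements apart.
matrix = list(range(12))
column = pack(matrix[1:], count=3, blocklength=1, stride=4)
```

Because every packed index computes its own source offset without reference to any other element, the work divides naturally among threads. Arbitrary datatypes would need a more general offset computation than the single triple assumed here.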
