Authors: John Jenkins, James Dinan, Pavan Balaji, Nagiza F. Samatova, Rajeev Thakur
Keywords:
Abstract: The lack of efficient and transparent interaction with GPU data in hybrid MPI+GPU environments challenges the acceleration of large-scale scientific computations. A particular challenge is the transfer of noncontiguous data to and from GPU memory. MPI implementations currently do not provide an efficient means of utilizing datatypes for such communication. To address this gap, we present an MPI datatype-processing system capable of efficiently processing arbitrary datatypes directly on the GPU. We present a means of converting conventional datatype representations into a GPU-amenable format. Fine-grained, element-level parallelism is then utilized by a GPU kernel to perform in-device packing and unpacking of noncontiguous elements. We demonstrate several-fold performance improvements for packing column vectors, 3D array slices, and 4D subvolumes over CUDA-based alternatives. Compared with optimized, layout-specific implementations, our approach incurs low overhead while enabling the transfer of layouts that have no direct CUDA equivalent. These improvements are demonstrated to translate into significant improvements in end-to-end, GPU-to-GPU communication time. In addition, we identify and evaluate communication patterns that may cause resource contention with packing operations, providing a baseline for adaptively selecting data-processing strategies.