Speeding up Collective Communications Through Inter-GPU Re-Routing

Authors: Kiran Ranganath, AmirAli Abdolrashidi, Shuaiwen Leon Song, Daniel Wong

DOI: 10.1109/LCA.2019.2933842

Abstract: In order to address the vast needs of disparate domains, computing engines are becoming more sophisticated and complex. A typical high-performance computational engine is composed of several accelerator units, in most cases GPUs, plus one or more CPU controllers. All these components are increasingly interconnected to satisfy the bandwidth and latency demands of modern workloads. Due to these constraints, solutions that efficiently interconnect them and systematically manage their traffic (such as PCIe v3 and NVLink v1 and v2 on the hardware side, and the NVIDIA Collective Communication Library (NCCL) and the AMD ROCm layer on the software side) are commonplace inside HPC systems and cloud data centers. However, as the number of accelerators increases, workloads (especially machine learning) might not be able to fully exploit this substrate due to inefficient use of the interconnects. Such scenarios can lead to performance bottlenecks where high-bandwidth links used by the underlying libraries are under-performing or overused. This work proposes Workload Optimization Through Inter-GPU Re-routing (WOTIR), which consists of enhanced NCCL-based collective primitives that aim to boost interconnect utilization (through efficient routing) and reduce communication overhead. WOTIR targets GPU pairs with no direct NVLink path (which leads to PCIe-based communications) and instead re-routes their traffic through intermediate GPUs acting as bridges over NVLink segments to avoid PCIe communications. This method allows the maximum possible bandwidth between GPUs without routing over the PCIe bus. Using this method, we see a reduction of up to 34 percent in execution time for selected machine learning workloads when non-optimal GPU allocations arise.
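The bridging idea is easy to illustrate at the CUDA runtime level. The sketch below is a minimal, hypothetical point-to-point version of the re-routing, not the paper's actual NCCL implementation: if two GPUs lack a direct peer path (e.g. NVLink), the transfer is split into two hops through an intermediate GPU that can reach both, so neither leg falls back to the slower PCIe/host staging path. The helpers enable_peer, find_bridge, and bridged_copy, and the extra staging buffer bridgeBuf, are assumptions for illustration.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Enable direct peer access from GPU `from` to GPU `to` (idempotent).
static void enable_peer(int from, int to) {
    cudaSetDevice(from);
    cudaError_t e = cudaDeviceEnablePeerAccess(to, 0);
    if (e != cudaSuccess && e != cudaErrorPeerAccessAlreadyEnabled)
        fprintf(stderr, "peer enable %d->%d: %s\n", from, to,
                cudaGetErrorString(e));
}

// Find a GPU with peer access to both src and dst, or return -1.
// Note: peer access does not guarantee NVLink specifically; a real
// implementation would also consult the link topology.
static int find_bridge(int src, int dst) {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int g = 0; g < n; ++g) {
        if (g == src || g == dst) continue;
        int toSrc = 0, toDst = 0;
        cudaDeviceCanAccessPeer(&toSrc, g, src);
        cudaDeviceCanAccessPeer(&toDst, g, dst);
        if (toSrc && toDst) return g;
    }
    return -1;
}

// Copy `bytes` from srcBuf on GPU `src` to dstBuf on GPU `dst`. When the
// two GPUs have no direct peer path, hop through `bridgeBuf` on a bridge
// GPU so both legs stay on fast peer links instead of staging over PCIe.
static void bridged_copy(void* dstBuf, int dst, const void* srcBuf, int src,
                         void* bridgeBuf, size_t bytes) {
    int direct = 0;
    cudaDeviceCanAccessPeer(&direct, dst, src);
    if (direct) {                       // one direct hop is best
        enable_peer(dst, src);
        cudaMemcpyPeer(dstBuf, dst, srcBuf, src, bytes);
        return;
    }
    int bridge = find_bridge(src, dst);
    if (bridge < 0) {                   // no bridge: default (staged) path
        cudaMemcpyPeer(dstBuf, dst, srcBuf, src, bytes);
        return;
    }
    enable_peer(bridge, src);
    enable_peer(dst, bridge);
    cudaMemcpyPeer(bridgeBuf, bridge, srcBuf, src, bytes);  // src -> bridge
    cudaMemcpyPeer(dstBuf, dst, bridgeBuf, bridge, bytes);  // bridge -> dst
}
```

Per the abstract, WOTIR applies this re-routing inside enhanced NCCL-based collective primitives rather than to individual copies, but the two-hop structure over a bridge GPU is the same idea.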
