Authors: Kiran Ranganath, AmirAli Abdolrashidi, Shuaiwen Leon Song, Daniel Wong
Keywords:
Abstract: In order to address the vast needs of disparate domains, computing engines are becoming more sophisticated and complex. A typical high-performance computational engine is composed of several accelerator units, in most cases GPUs, plus one or more CPU controllers. All these components are increasingly interconnected to satisfy the bandwidth and latency-tolerance demands of modern workloads. Due to these constraints, solutions that efficiently interconnect the components and systematically manage their traffic (such as PCIe v3, NVLink v1 and v2 on the hardware side, and the NVIDIA Collective Communication Library (NCCL) and the AMD ROCm layer on the software side) are commonplace inside HPC systems and cloud data centers. However, as the number of accelerators increases, workloads (especially machine learning) might not be able to fully exploit this substrate due to inefficient use of the interconnects. Such scenarios can lead to performance bottlenecks where the high-bandwidth links used by the underlying libraries under-perform or are overused. This work proposes Workload Optimization Through Inter-GPU Re-routing (WOTIR), which consists of enhanced NCCL-based collective primitives that aim to boost interconnect utilization (through efficient routing) and reduce communication overhead. WOTIR targets GPUs with no direct path (which leads to slow communications) and instead re-routes their traffic through intermediate bridge segments to avoid such slow communications. This method allows the maximum possible bandwidth between GPUs without routing through the system bus. Using this method, we see a reduction of up to 34 percent in execution time for selected machine learning workloads when non-optimal GPU allocations arise.
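To make the re-routing idea concrete, the following is a minimal sketch (not the authors' implementation) of the core decision WOTIR describes: when two GPUs share no direct high-bandwidth link, relay their traffic through an intermediate "bridge" GPU rather than crossing the slower system bus. The 4-GPU topology and the `route` helper are hypothetical, purely for illustration.

```python
# Hypothetical sketch of the bridge-routing idea from the abstract.
# The topology below is an illustrative 4-GPU layout where each GPU
# has direct high-bandwidth links (e.g. NVLink) to only two peers.
NVLINK = {
    0: {1, 2},
    1: {0, 3},
    2: {0, 3},
    3: {1, 2},
}

def route(src, dst):
    """Return a GPU-to-GPU path, inserting one bridge hop if needed."""
    if dst in NVLINK[src]:
        return [src, dst]              # direct link: use it as-is
    for bridge in sorted(NVLINK[src]): # sorted for a deterministic choice
        if dst in NVLINK[bridge]:
            return [src, bridge, dst]  # relay through an intermediate GPU
    return None                        # no 2-hop path: fall back to the bus

print(route(0, 1))  # [0, 1]    direct link
print(route(0, 3))  # [0, 1, 3] no direct link: bridge through GPU 1
```

A real implementation would also have to account for link contention and bridge-GPU load; this sketch only shows the path-selection principle.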