Authors: Hao Wang, Yuxuan Qin, ChonLam Lao, Yanfang Le, Wenfei Wu
DOI:
Keywords:
Abstract: Recent work introduces In-Network Aggregation (INA) for distributed training (DT), which moves gradient summation into programmable network switches. INA can reduce traffic volume and accelerate communication in DT jobs. However, switch memory is a scarce resource that cannot support massive DT jobs in data centers, and existing INA solutions do not use switch memory to its full extent. We propose DSA, an Efficient Data-Plane switch memory Scheduler for in-network Aggregation. DSA introduces preemption into switch memory management for INA jobs. In the data plane, DSA allows high-priority gradient tensors to preempt switch aggregators (the basic computation unit in INA) from low-priority tensors, which keeps aggregators from sitting idle. In the control plane, DSA devises a priority policy that assigns high priority to gradient tensors that benefit overall job efficiency …
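The data-plane preemption described in the abstract can be illustrated with a toy model of a single aggregator slot. This is only a sketch under stated assumptions, not DSA's actual switch program: the `Aggregator` class, the `on_packet` function, the event names, and the choice to emit an evicted partial sum and to send losing packets to a fallback path are all hypothetical, chosen to show how a higher-priority tensor might take over a slot held by a lower-priority one.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Event = Tuple[str, str, float]  # (event kind, tensor id, value)


@dataclass
class Aggregator:
    """One switch aggregator slot (hypothetical model, not DSA's layout)."""
    tensor_id: Optional[str] = None  # tensor currently holding the slot
    priority: int = 0                # priority of that tensor
    partial_sum: float = 0.0         # gradient values summed so far
    count: int = 0                   # worker contributions received


def on_packet(slot: Aggregator, tensor_id: str, priority: int,
              value: float, num_workers: int) -> List[Event]:
    """Handle one gradient packet and return the events it triggers."""
    events: List[Event] = []
    if slot.tensor_id is None:
        # Free slot: the arriving tensor claims it.
        slot.tensor_id, slot.priority = tensor_id, priority
        slot.partial_sum, slot.count = value, 1
    elif slot.tensor_id == tensor_id:
        # Same tensor: keep aggregating.
        slot.partial_sum += value
        slot.count += 1
    elif priority > slot.priority:
        # Assumption: the holder's partial sum is emitted on eviction so it
        # can be completed elsewhere (for example at an end host).
        events.append(("evicted", slot.tensor_id, slot.partial_sum))
        slot.tensor_id, slot.priority = tensor_id, priority
        slot.partial_sum, slot.count = value, 1
    else:
        # Assumption: a packet that cannot preempt bypasses the switch.
        events.append(("fallback", tensor_id, value))
        return events
    if slot.count == num_workers:
        # All workers contributed: emit the result and free the slot.
        events.append(("done", slot.tensor_id, slot.partial_sum))
        slot.tensor_id, slot.priority = None, 0
        slot.partial_sum, slot.count = 0.0, 0
    return events
```

For example, with two workers, a packet for a priority-1 tensor claims the free slot, a later packet for a priority-5 tensor evicts it, and the slot is released once both workers of the high-priority tensor have contributed. How priorities are assigned is the control-plane policy's job and is left out of this sketch.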