Authors: Thomas Courtade, Vaishaal Shankar, Kannan Ramchandran, Vipul Gupta, Yaoqing Yang
DOI: 10.1109/ICDCS47774.2020.00019
Keywords:
Abstract: Inexpensive cloud services, such as serverless computing, are often vulnerable to straggling nodes that increase the end-to-end latency for distributed computation. We propose and implement simple yet principled approaches for straggler mitigation in serverless systems for matrix multiplication, and evaluate them on several common applications from machine learning and high-performance computing. The proposed schemes are inspired by error-correcting codes and employ parallel encoding and decoding over the data stored in the cloud using serverless workers. This creates a fully distributed computing framework without using a master node to conduct encoding or decoding, which removes the computation, communication, and storage bottleneck at the master. On the theory side, we establish that our proposed scheme is asymptotically optimal in terms of decoding time and provide a lower bound on the number of stragglers it can tolerate with high probability. Through extensive experiments, we show that our scheme outperforms existing schemes such as speculative execution and other coding-theoretic methods by at least 25%.