Serverless Straggler Mitigation using Error-Correcting Codes

作者: Thomas Courtade , Vaishaal Shankar , Kannan Ramchandran , Vipul Gupta , Yaoqing Yang

DOI: 10.1109/ICDCS47774.2020.00019

关键词:

摘要: Inexpensive cloud services, such as serverless computing, are often vulnerable to straggling nodes that increase the end-to-end latency for distributed computation. We propose and implement simple yet principled approaches straggler mitigation in systems matrix multiplication evaluate them on several common applications from machine learning high-performance computing. The proposed schemes inspired by error-correcting codes employ parallel encoding decoding over data stored using workers. This creates a fully computing framework without master node conduct or decoding, which removes computation, communication storage bottleneck at master. On theory side, we establish our scheme is asymptotically optimal terms of time provide lower bound number stragglers it can tolerate with high probability. Through extensive experiments, show outperforms existing speculative execution other coding theoretic methods least 25%.

参考文章(44)
Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh, WTF Proceedings of the 22nd international conference on World Wide Web - WWW '13. pp. 505- 514 ,(2013) , 10.1145/2488388.2488433
Edgar Solomonik, James Demmel, Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms international conference on parallel processing. pp. 90- 109 ,(2011) , 10.1007/978-3-642-23397-5_10
Rajeev Motwani, Terry Winograd, Lawrence Page, Sergey Brin, The PageRank Citation Ranking : Bringing Order to the Web the web conference. ,vol. 98, pp. 161- 172 ,(1999)
Per-Gunnar Martinsson, A Fast Direct Solver for a Class of Elliptic Partial Differential Equations Journal of Scientific Computing. ,vol. 38, pp. 316- 330 ,(2009) , 10.1007/S10915-008-9240-6
M. J. D. Powell, Updating conjugate directions by the BRGS formula Mathematical Programming. ,vol. 38, pp. 29- 46 ,(1987) , 10.1007/BF02591850
Jeffrey Dean, Luiz André Barroso, The tail at scale Communications of The ACM. ,vol. 56, pp. 74- 80 ,(2013) , 10.1145/2408776.2408794
Parikshit Gopalan, Cheng Huang, Huseyin Simitci, Sergey Yekhanin, On the Locality of Codeword Symbols IEEE Transactions on Information Theory. ,vol. 58, pp. 6925- 6934 ,(2012) , 10.1109/TIT.2012.2208937
Dimitris S. Papailiopoulos, Alexandros G. Dimakis, Locally Repairable Codes IEEE Transactions on Information Theory. ,vol. 60, pp. 5843- 5855 ,(2014) , 10.1109/TIT.2014.2325570
Yehuda Koren, Robert Bell, Chris Volinsky, Matrix Factorization Techniques for Recommender Systems IEEE Computer. ,vol. 42, pp. 30- 37 ,(2009) , 10.1109/MC.2009.263
R. A. Van De Geijn, J. Watts, SUMMA: Scalable Universal Matrix Multiplication Algorithm Concurrency and Computation: Practice and Experience. ,vol. 9, pp. 255- 274 ,(1995) , 10.1002/(SICI)1096-9128(199704)9:4<255::AID-CPE250>3.0.CO;2-2