A characteristic study on failures of production distributed data-parallel programs

Authors: Tao Xie, Hucheng Zhou, Tian Xiao, Haoxiang Lin, Sihan Li

DOI: 10.5555/2486788.2486921

Abstract: SCOPE is adopted by thousands of developers from tens of different product teams in Microsoft Bing for daily web-scale data processing, including index building, search ranking, and advertisement display. A SCOPE job is composed of declarative SQL-like queries and imperative C# user-defined functions (UDFs), which are executed in a pipeline by thousands of machines. There are tens of thousands of SCOPE jobs executed on Microsoft clusters per day, while some of them fail after a long execution time and thus waste tremendous resources. Reducing SCOPE failures would save significant resources. This paper presents a comprehensive characteristic study on 200 SCOPE failures/fixes and 50 SCOPE failures with debugging statistics from Microsoft Bing, investigating not only major failure types, failure sources, and fixes, but also current debugging practice. Our major findings include (1) most of the failures (84.5%) are caused by defects in data processing rather than defects in code logic; (2) table-level failures (22.5%) are mainly caused by programmers' mistakes and frequent data-schema changes, while row-level failures (62%) are mainly caused by exceptional data; (3) 93.0% of fixes do not change data processing logic; (4) there are 8.0% of failures with the root cause not at the failure-exposing stage, making current debugging practice insufficient in this case. Our study results provide valuable guidelines for future development of data-parallel programs. We believe that these results are not limited to SCOPE, but can also be generalized to other similar data-parallel platforms.
