MLCask: Efficient Management of Component Evolution in Collaborative Data Analytics Pipelines.

Authors: Beng Chin Ooi, Meihui Zhang, Kaiping Zheng, Gang Chen, Zhaojing Luo

DOI:

Keywords: Computer science; Data deduplication; Software deployment; Search tree; Software versioning; Pipeline transport; Merge (version control); Software engineering; Analytics; Data analysis

Abstract: With the ever-increasing adoption of machine learning for data analytics, maintaining a machine learning pipeline is becoming more complex as both the datasets and the trained models evolve with time. In a collaborative environment, the changes and updates due to pipeline evolution often cause cumbersome coordination and maintenance work, raising the costs and making the pipeline hard to use. Existing solutions, unfortunately, do not address the version evolution problem, especially in a collaborative environment where non-linear version control semantics are necessary to isolate operations made by different user roles. The lack of version control semantics also incurs unnecessary storage consumption and lowers efficiency due to data duplication and repeated data pre-processing, both of which are avoidable. In this paper, we identify two main challenges that arise during the deployment of machine learning pipelines, and address them with the design of versioning for an end-to-end analytics system, MLCask. The system supports multiple user roles with the ability to perform Git-like branching and merging operations in the context of machine learning pipelines. We define and accelerate the metric-driven merge operation by pruning the pipeline search tree using reusable history records and pipeline compatibility information. Further, we design and implement the prioritized pipeline search, which gives preference to the pipelines that probably yield better performance. The effectiveness of MLCask is evaluated through an extensive study over several real-world deployment cases. The performance evaluation shows that the proposed merge operation is up to 7.8x faster and saves up to 11.9x storage space compared with the baseline method that does not utilize history records.
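To make the idea of a metric-driven merge with search-tree pruning more concrete, the following is a toy Python sketch and not MLCask's actual implementation. It assumes a pipeline is a tuple with one component version per stage, that a hypothetical `compatible` predicate says whether two adjacent versions can be chained, and that a hypothetical `history` dictionary holds scores of pipelines evaluated earlier. All names, version labels, and the scoring rule are invented for illustration. The prioritized search mentioned in the abstract is not modelled here.

```python
def merge_search(stages, compatible, evaluate, history):
    """Return the best-scoring pipeline over all compatible version combinations.

    stages:     list of lists; stages[i] holds the candidate versions of stage i
    compatible: function (upstream_version, downstream_version) -> bool
    evaluate:   function (pipeline_tuple) -> float, higher is better (assumed costly)
    history:    dict mapping pipeline_tuple -> score from earlier runs (reused, updated)
    """
    best, best_score, evaluated = None, float("-inf"), 0

    def dfs(prefix):
        nonlocal best, best_score, evaluated
        depth = len(prefix)
        if depth == len(stages):
            key = tuple(prefix)
            if key not in history:          # only run pipelines with no stored record
                history[key] = evaluate(key)
                evaluated += 1
            if history[key] > best_score:
                best, best_score = key, history[key]
            return
        for version in stages[depth]:
            # Prune the whole subtree when adjacent components cannot be chained.
            if prefix and not compatible(prefix[-1], version):
                continue
            dfs(prefix + [version])

    dfs([])
    return best, best_score, evaluated


# Hypothetical two-branch merge: each stage has an old (v1) and a new (v2) version.
stages = [["clean_v1", "clean_v2"], ["feat_v1", "feat_v2"], ["model_v1", "model_v2"]]

# Assumed compatibility rule: feat_v2 cannot consume the output of clean_v1.
incompatible = {("clean_v1", "feat_v2")}
compatible = lambda up, down: (up, down) not in incompatible

# Toy metric: the number of v2 components in the pipeline.
evaluate = lambda pipeline: float(sum(v.endswith("_v2") for v in pipeline))

# One pipeline was already evaluated before the merge, so its score is reused.
history = {("clean_v1", "feat_v1", "model_v1"): 0.0}

best, best_score, evaluated = merge_search(stages, compatible, evaluate, history)
print(best, best_score, evaluated)
```

In this toy setting the full search tree has eight leaves; the compatibility rule removes two of them and the stored record avoids one more evaluation, which is the kind of saving the abstract attributes to history records and compatibility information.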
