Authors: Beng Chin Ooi, Meihui Zhang, Kaiping Zheng, Gang Chen, Zhaojing Luo
DOI:
Keywords: Computer science, Data deduplication, Software deployment, Search tree, Software versioning, Pipeline transport, Merge (version control), Software engineering, Analytics, Data analysis
Abstract: With the ever-increasing adoption of machine learning for data analytics, maintaining a machine learning pipeline is becoming more complex as both the datasets and the trained models evolve with time. In a collaborative environment, the changes and updates due to pipeline evolution often cause cumbersome coordination and maintenance work, raising the costs and making the pipeline hard to use. Existing solutions, unfortunately, do not address the version evolution problem, especially in a collaborative environment where non-linear version control semantics are necessary to isolate operations made by different user roles. The lack of version control semantics also incurs unnecessary storage consumption and lowers efficiency due to data duplication and repeated data pre-processing, both of which are avoidable. In this paper, we identify two main challenges that arise during the deployment of machine learning pipelines, and address them with the design of versioning for an end-to-end analytics system, MLCask. The system supports multiple user roles with the ability to perform Git-like branching and merging operations in the context of machine learning pipelines. We define and accelerate the metric-driven merge operation by pruning the pipeline search tree using reusable history records and pipeline compatibility information. Further, we design and implement a prioritized pipeline search, which gives preference to the pipelines that will probably yield better performance. The effectiveness of MLCask is evaluated through an extensive study over several real-world deployment cases. The performance evaluation shows that the proposed merge operation is up to 7.8x faster and saves up to 11.9x storage space compared with the baseline method that does not utilize history records.
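The metric-driven merge described in the abstract can be illustrated with a minimal Python sketch. It is not MLCask's implementation or API: the function name `metric_driven_merge`, the `compatible` and `evaluate` callbacks, the component names, and the toy scoring rule are all hypothetical, chosen only to show how compatibility information can prune subtrees of the pipeline search tree and how history records can avoid re-running pipelines that were already evaluated. The prioritized search is not covered here.

```python
from typing import Callable, Dict, List, Optional, Tuple

Pipeline = Tuple[str, ...]  # one component version per pipeline stage


def metric_driven_merge(
    candidates: List[List[str]],
    compatible: Callable[[str, str], bool],
    evaluate: Callable[[Pipeline], float],
    history: Dict[Pipeline, float],
) -> Tuple[Optional[Pipeline], Optional[float], int]:
    """Depth-first search over candidate pipelines (hypothetical sketch).

    Returns the best pipeline, its score, and the number of fresh evaluations.
    """
    best: Optional[Pipeline] = None
    best_score: Optional[float] = None
    fresh_runs = 0

    def dfs(prefix: Pipeline) -> None:
        nonlocal best, best_score, fresh_runs
        if len(prefix) == len(candidates):
            # Leaf: reuse a history record if one exists, otherwise evaluate.
            if prefix not in history:
                history[prefix] = evaluate(prefix)
                fresh_runs += 1
            score = history[prefix]
            if best_score is None or score > best_score:
                best, best_score = prefix, score
            return
        for version in candidates[len(prefix)]:
            # Skip the whole subtree when adjacent components are incompatible.
            if prefix and not compatible(prefix[-1], version):
                continue
            dfs(prefix + (version,))

    dfs(())
    return best, best_score, fresh_runs


# Toy example: three stages, two candidate versions each (all names assumed).
candidates = [
    ["clean_v1", "clean_v2"],
    ["feat_v1", "feat_v2"],
    ["model_v1", "model_v2"],
]
incompatible = {("clean_v2", "feat_v1")}
history = {("clean_v1", "feat_v1", "model_v1"): 0.3}  # previously recorded run

best, score, fresh_runs = metric_driven_merge(
    candidates,
    compatible=lambda a, b: (a, b) not in incompatible,
    evaluate=lambda p: sum(int(c[-1]) for c in p) / 10,  # stand-in metric
    history=history,
)
print(best, score, fresh_runs)
```

In this toy setup, the incompatible pair removes two of the eight candidate pipelines before they are evaluated, and the single history record saves one more run, so only five pipelines are executed.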