The Adverse Effects of Code Duplication in Machine Learning Models of Code

Author: Miltiadis Allamanis

Abstract: The field of big code relies on mining large corpora of code to perform some learning task. A significant threat to this approach has recently been identified by Lopes et al. (2017), who found a large amount of near-duplicate code on GitHub. However, the impact of code duplication has not been noticed by researchers devising machine learning models for source code. In this work, we explore the effects of code duplication on machine learning models, showing that reported performance metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora, which more accurately represent how machine learning models of code are used by software engineers. We present a duplication index for widely used datasets, and list best practices for collecting code corpora and evaluating machine learning models on them. Finally, we release tools to help the community avoid this problem in future research.
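The train/test leakage described in the abstract can be illustrated with a minimal sketch. It is not the author's released tooling: the function names, the regex tokenizer, and the 0.8 similarity threshold are all assumptions chosen for illustration. It drops a test snippet when the Jaccard similarity of its token set to any training snippet reaches the threshold.

```python
import re


def tokens(source: str) -> frozenset:
    """Return the set of identifier-like tokens in a code snippet.

    A crude regex tokenizer (an assumption for this sketch); a real
    pipeline would use a language-specific lexer.
    """
    return frozenset(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_test_duplicates(train, test, threshold=0.8):
    """Keep only test snippets that are not near-duplicates of any training snippet.

    `threshold` is a hypothetical cut-off, not a value taken from this record.
    The pairwise comparison is quadratic and only suitable for small corpora.
    """
    train_sets = [tokens(s) for s in train]
    kept = []
    for snippet in test:
        t = tokens(snippet)
        if all(jaccard(t, s) < threshold for s in train_sets):
            kept.append(snippet)
    return kept
```

Evaluating a model on both the original test set and the filtered one gives a rough sense of how much of a reported metric may be due to duplicated code.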

References (27)
Miltiadis Allamanis, Charles Sutton. Mining source code repositories at massive scale using language modeling. Mining Software Repositories, pp. 207-216 (2013). doi:10.1109/MSR.2013.6624029
Veselin Raychev, Martin Vechev, Andreas Krause. Predicting Program Properties from "Big Code". Symposium on Principles of Programming Languages, vol. 50, pp. 111-124 (2015). doi:10.1145/2676726.2677009
Veselin Raychev, Pavol Bielik, Martin Vechev, Andreas Krause. Learning programs from noisy data. Symposium on Principles of Programming Languages, vol. 51, pp. 761-774 (2016). doi:10.1145/2837614.2837671
Jeffrey Svajlenko, Hitesh Sajnani, Vaibhav Saini, Cristina V. Lopes, Chanchal K. Roy. SourcererCC: scaling code clone detection to big-code. International Conference on Software Engineering, pp. 1157-1168 (2016). doi:10.1145/2884781.2884877
James R. Cordy, Chanchal Kumar Roy. A Survey on Software Clone Detection Research (2007).
Martin Vechev, Veselin Raychev, Pavol Bielik. PHOG: probabilistic model for code. International Conference on Machine Learning, pp. 2933-2942 (2016).
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Luke Zettlemoyer. Summarizing Source Code using a Neural Attention Model. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2073-2083 (2016). doi:10.18653/V1/P16-1195
Rico Sennrich, Antonio Valerio Miceli Barone. A parallel corpus of Python functions and documentation strings for automated code documentation and code generation. arXiv: Computation and Language (2017).
Vincent J. Hellendoorn, Premkumar Devanbu. Are deep neural networks the best choice for modeling source code? Foundations of Software Engineering, pp. 763-773 (2017). doi:10.1145/3106237.3106290
Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, Jan Vitek. DéjàVu: a map of code duplicates on GitHub. Proceedings of the ACM on Programming Languages, vol. 1, pp. 84- (2017). doi:10.1145/3133908