The Adverse Effects of Code Duplication in Machine Learning Models of Code

Author: Miltiadis Allamanis

Abstract: The field of big code relies on mining large corpora of code to perform some learning task. A significant threat to this approach has been recently identified by Lopes et al. (2017), who found a large amount of near-duplicate code on GitHub. However, the impact of code duplication has not been noticed by researchers devising machine learning models for source code. In this work, we explore the effects of code duplication on machine learning models, showing that reported performance metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora, which more accurately represent how machine learning models of code are used by software engineers. We present a duplication index for widely used datasets, and list best practices for collecting code corpora and evaluating machine learning models on them. Finally, we release tools to help the community avoid this problem in future research.
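To make the idea of a de-duplicated test set concrete, here is a minimal Python sketch. It is not the tooling released with the paper. It assumes near-duplicates can be approximated by Jaccard similarity over token sets. The function names, the simple regex tokenizer, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import re


def tokens(code):
    """Return the set of identifier and number tokens in a code string."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a, b):
    """Jaccard similarity of two token sets (defined as 1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_test_duplicates(train, test, threshold=0.8):
    """Keep only test snippets that are not near-duplicates of a training snippet.

    The threshold is an illustrative assumption, not a value taken from this page.
    The pairwise comparison is quadratic, so this only suits small corpora.
    """
    train_tokens = [tokens(snippet) for snippet in train]
    kept = []
    for snippet in test:
        candidate = tokens(snippet)
        if all(jaccard(candidate, seen) < threshold for seen in train_tokens):
            kept.append(snippet)
    return kept


if __name__ == "__main__":
    train = ["def add(a, b):\n    return a + b"]
    test = [
        "def add(a, b):\n    return a + b  # sum",  # near-duplicate of train[0]
        "def is_even(n):\n    return n % 2 == 0",   # unrelated snippet
    ]
    print(drop_test_duplicates(train, test))
```

Evaluating a model on both the original and the filtered test set, and comparing the two scores, gives a rough sense of how much a reported metric depends on duplicated examples.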

References (27)

Miltiadis Allamanis, Charles Sutton. Mining source code repositories at massive scale using language modeling. Mining Software Repositories, pp. 207-216 (2013). DOI: 10.1109/MSR.2013.6624029

Veselin Raychev, Martin Vechev, Andreas Krause. Predicting Program Properties from "Big Code". Symposium on Principles of Programming Languages, vol. 50, pp. 111-124 (2015). DOI: 10.1145/2676726.2677009

Veselin Raychev, Pavol Bielik, Martin Vechev, Andreas Krause. Learning programs from noisy data. Symposium on Principles of Programming Languages, vol. 51, pp. 761-774 (2016). DOI: 10.1145/2837614.2837671

Jeffrey Svajlenko, Hitesh Sajnani, Vaibhav Saini, Cristina V. Lopes, Chanchal K. Roy. SourcererCC: scaling code clone detection to big-code. International Conference on Software Engineering, pp. 1157-1168 (2016). DOI: 10.1145/2884781.2884877

James R. Cordy, Chanchal Kumar Roy. A Survey on Software Clone Detection Research (2007).

Martin Vechev, Veselin Raychev, Pavol Bielik. PHOG: probabilistic model for code. International Conference on Machine Learning, pp. 2933-2942 (2016).

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Luke Zettlemoyer. Summarizing Source Code using a Neural Attention Model. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2073-2083 (2016). DOI: 10.18653/V1/P16-1195

Rico Sennrich, Antonio Valerio Miceli Barone. A parallel corpus of Python functions and documentation strings for automated code documentation and code generation. arXiv: Computation and Language (2017).

Vincent J. Hellendoorn, Premkumar Devanbu. Are deep neural networks the best choice for modeling source code? Foundations of Software Engineering, pp. 763-773 (2017). DOI: 10.1145/3106237.3106290

Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, Jan Vitek. DéjàVu: a map of code duplicates on GitHub. Proceedings of the ACM on Programming Languages, vol. 1, pp. 84- (2017). DOI: 10.1145/3133908
