Author: Miltiadis Allamanis
DOI:
Keywords:
Abstract: The field of big code relies on mining large corpora to perform some learning task. A significant threat to this approach was recently identified by Lopes et al. (2017), who found a large amount of near-duplicate code on GitHub. However, the impact of this duplication has not been noticed by researchers devising machine learning models for source code. In this work, we explore the effects of code duplication, showing that reported performance metrics are sometimes inflated by up to 100% when testing on duplicated corpora compared to de-duplicated corpora, which more accurately represent how the models are used by software engineers. We present a duplication index for widely used datasets and list best practices for collecting corpora and evaluating on them. Finally, we release tools to help the community avoid this problem in future research.