Authors: James A. Hammerton, Michael Granitzer, Dan Harvey, Maya Hristakeva, Kris Jack
Keywords: Ground truth, Task (project management), Focus (computing), Data mining, Quality (business), Computer science, Upload, Scale (map), Data deduplication, Nearest neighbor search, Information retrieval
Abstract: Mendeley's crowd-sourced catalogue of research papers forms the basis of features such as the ability to search for papers, the display of papers related to the one currently being viewed, and personalised recommendations. In order to generate this catalogue it is necessary to deduplicate the records uploaded from users' libraries and imported from external sources such as PubMed and arXiv. This task has been achieved at Mendeley via an automated system. However, the quality of the deduplication needs to be improved. "Ground truth" data sets are thus needed for evaluating the system's performance, but the existing datasets are very small. In this paper, the problem of generating a large scale ground truth database is tackled. An approach based purely on random sampling produced examples that were too easy, so approaches that focus on more difficult examples were explored. We found that selecting duplicates and non-duplicates from documents with similar titles produces challenging datasets. Additionally, we established that a Solr-based system can achieve performance comparable to the fingerprint-based system currently employed. Finally, we introduce a ground truth dataset that we hope will be useful to others tackling deduplication.
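The abstract's core idea of mining hard examples from documents with similar titles can be sketched as follows. This is a minimal illustration only, not the authors' actual pipeline: the record format, the `threshold` value, and the use of `difflib` for title similarity are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Normalised character-level similarity between two titles (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def hard_candidate_pairs(records, threshold=0.6):
    """Return id pairs whose titles are similar enough to be challenging:
    candidate duplicates *or* hard non-duplicates, to be labelled by hand."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            sim = title_similarity(records[i]["title"], records[j]["title"])
            if sim >= threshold:
                pairs.append((records[i]["id"], records[j]["id"]))
    return pairs


records = [
    {"id": 1, "title": "Deep learning for image recognition"},
    {"id": 2, "title": "Deep learning for image recognition."},  # near-duplicate
    {"id": 3, "title": "Deep learning for speech recognition"},  # hard non-duplicate
    {"id": 4, "title": "A survey of ant colony optimisation"},   # easy non-duplicate
]
print(hard_candidate_pairs(records))
```

Pairs (1, 2) and (1, 3) are both selected, which is the point: similar-title sampling yields both true duplicates and confusable non-duplicates, whereas purely random sampling would mostly produce obvious non-matches like (1, 4). A production system would replace the quadratic pairwise loop with an index-based candidate lookup (e.g. Solr, as in the paper).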