On generating large-scale ground truth datasets for the deduplication of bibliographic records

Authors: James A. Hammerton, Michael Granitzer, Dan Harvey, Maya Hristakeva, Kris Jack

DOI: 10.1145/2254129.2254153

Keywords: Ground truth, Task (project management), Focus (computing), Data mining, Quality (business), Computer science, Upload, Scale (map), Data deduplication, Nearest neighbor search, Information retrieval

Abstract: Mendeley's crowd-sourced catalogue of research papers forms the basis of features such as the ability to search for papers, finding papers related to the one currently being viewed, and personalised recommendations. In order to generate this catalogue it is necessary to deduplicate the records uploaded from users' libraries and imported from external sources such as PubMed and arXiv. This task has been achieved at Mendeley via an automated system. However, the quality of the deduplication needs to be improved. "Ground truth" data sets are thus needed for evaluating the system's performance, but existing datasets are very small. In this paper, the problem of generating large-scale data sets from Mendeley's database is tackled. An approach based purely on random sampling produced very easy data sets, so approaches that focus on more difficult examples were explored. We found that selecting duplicates and non-duplicates from documents with similar titles produced more challenging datasets. Additionally, we established that a Solr-based deduplication system can achieve a deduplication quality similar to that of the fingerprint-based system currently employed. Finally, we introduce a large-scale deduplication ground truth dataset that we hope will be useful to others tackling deduplication.
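The idea of drawing candidate pairs from documents with similar titles can be illustrated with a minimal sketch. This is not the authors' system: the token-based Jaccard measure, the function names, the record ids and the threshold value are all assumptions chosen for illustration, and the all-pairs loop would only suit a small sample rather than a full catalogue.

```python
import re
from itertools import combinations


def title_tokens(title):
    """Lowercase a title and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_pairs(records, threshold=0.6):
    """Return (id_a, id_b, similarity) for record pairs with similar titles.

    `records` maps a hypothetical record id to its title. The threshold is an
    illustrative assumption, not a value taken from the paper. Comparing all
    pairs is quadratic, so this is only practical for small samples.
    """
    tokens = {rid: title_tokens(title) for rid, title in records.items()}
    pairs = []
    for a, b in combinations(sorted(tokens), 2):
        sim = jaccard(tokens[a], tokens[b])
        if sim >= threshold:
            pairs.append((a, b, sim))
    # Most similar pairs first; these are the ones a labeller would review.
    return sorted(pairs, key=lambda p: -p[2])
```

Pairs returned this way would still need to be labelled as duplicate or non-duplicate; the point of the selection step is only that both classes then contain harder cases than a purely random sample would provide.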

References (13)
Moses S. Charikar. Similarity estimation techniques from rounding algorithms. Symposium on the Theory of Computing, pp. 380-388 (2002). DOI: 10.1145/509907.509965
Sunita Sarawagi, Anuradha Bhamidipaty. Interactive deduplication using active learning. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02, pp. 269-278 (2002). DOI: 10.1145/775047.775087
Mikhail Bilenko, Raymond J. Mooney. On Evaluation and Training-Set Construction for Duplicate Detection. pp. 7-12 (2003).
V.S. Verykios, P.G. Ipeirotis, A.K. Elmagarmid. Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, vol. 19, pp. 1-16 (2007). DOI: 10.1109/TKDE.2007.9
Hannaneh Hajishirzi, Wen-tau Yih, Aleksander Kolcz. Adaptive near-duplicate detection via similarity learning. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 419-426 (2010). DOI: 10.1145/1835449.1835520
Steve Lawrence, C. Lee Giles, Kurt D. Bollacker. Autonomous citation matching. Adaptive Agents and Multi-Agents Systems, pp. 392-393 (1999). DOI: 10.1145/301136.301255
Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma. Detecting near-duplicates for web crawling. The Web Conference, pp. 141-150 (2007). DOI: 10.1145/1242572.1242592
Isaac G. Councill, Huajing Li, Ziming Zhuang, Sandip Debnath, Levent Bolelli, Wang Chien Lee, Anand Sivasubramaniam, C. Lee Giles. Learning metadata from the evidence in an on-line citation matching scheme. ACM/IEEE Joint Conference on Digital Libraries, pp. 276-285 (2006). DOI: 10.1145/1141753.1141817
Rares Vernica, Michael J. Carey, Chen Li. Efficient parallel set-similarity joins using MapReduce. Proceedings of the 2010 international conference on Management of data - SIGMOD '10, pp. 495-506 (2010). DOI: 10.1145/1807167.1807222
S. Lawrence, C. Lee Giles, K. Bollacker. Digital libraries and autonomous citation indexing. Computer, vol. 32, pp. 67-71 (1999). DOI: 10.1109/2.769447