On generating large-scale ground truth datasets for the deduplication of bibliographic records

Authors: James A. Hammerton, Michael Granitzer, Dan Harvey, Maya Hristakeva, Kris Jack

Keywords: Ground truth, Task (project management), Focus (computing), Data mining, Quality (business), Computer science, Upload, Scale (map), Data deduplication, Nearest neighbor search, Information retrieval

Abstract: Mendeley's crowd-sourced catalogue of research papers forms the basis of features such as the ability to search for papers, finding papers related to the one currently being viewed, and personalised recommendations. In order to generate this catalogue it is necessary to deduplicate the records uploaded from users' libraries and imported from external sources such as PubMed and arXiv. This task has been achieved at Mendeley via an automated system. However, the quality of the deduplication needs to be improved. "Ground truth" data sets are thus needed for evaluating the system's performance, but existing datasets are very small. In this paper, the problem of generating large scale data sets from Mendeley's database is tackled. An approach based purely on random sampling produced very easy data sets, so approaches that focus on more difficult examples were explored. We found that selecting duplicates and non-duplicates from documents with similar titles produced more challenging datasets. Additionally, we established that a Solr-based deduplication system can achieve a similar deduplication quality to the fingerprint-based system currently employed. Finally, we introduce a large scale deduplication ground truth dataset that we hope will be useful to others tackling deduplication.

