Effective and Fast Near Duplicate Detection via Signature-Based Compression Metrics

作者: Xi Zhang , Yuntao Yao , Yingsheng Ji , Binxing Fang

DOI: 10.1155/2016/3919043

关键词:

摘要: Detecting near duplicates on the web is challenging due to its volume and variety. Most of previous studies require setting input parameters, making it difficult for them achieve robustness across various scenarios without careful tuning. Recently, a universal parameter-free similarity metric, normalized compression distance or NCD, has been employed effectively in diverse applications. Nevertheless, there are problems preventing NCD from being applied medium-to-large datasets as lacks efficiency tends get skewed by large object size. To make this method feasible corpus documents, we propose new called SigNCD which measures based lightweight signatures instead full leading improved stability. We derive lower bounds pruning policies further reduce computational complexity. evaluate both English Chinese show an increase score compared with original significant reduction runtime. Comparisons other competitive methods also demonstrate superiority our method. Moreover, no parameter tuning required SigNCD, except threshold.

参考文章(30)
Shengli Wu, Jinbo Feng, Detecting Near-Duplicate Documents Using Sentence Level Features database and expert systems applications. pp. 195- 204 ,(2015) , 10.1007/978-3-319-22852-5_17
Omar Alonso, Dennis Fetterly, Mark Manasse, Duplicate News Story Detection Revisited asia information retrieval symposium. pp. 203- 214 ,(2013) , 10.1007/978-3-642-45068-6_18
Nicholas Tran, The normalized compression distance and image distinguishability electronic imaging. ,vol. 6492, ,(2007) , 10.1117/12.704334
Manuel Alfonseca, Manuel Cebrián, Alfonso Ortega, Common Pitfalls Using the Normalized Compression Distance: What to Watch Out for in a Compressor Communications in Information and Systems. ,vol. 5, pp. 367- 384 ,(2005) , 10.4310/CIS.2005.V5.N4.A1
Tanaya Guha, Rabab K. Ward, Image Similarity Using Sparse Representation and Compression Distance IEEE Transactions on Multimedia. ,vol. 16, pp. 980- 987 ,(2014) , 10.1109/TMM.2014.2306175
Piotr Indyk, A small approximately min-wise independent family of hash functions symposium on discrete algorithms. ,vol. 38, pp. 454- 456 ,(1999) , 10.1006/JAGM.2000.1131
Junghoo Cho, Hector Garcia-Molina, Taher Haveliwala, Wang Lam, Andreas Paepcke, Sriram Raghavan, Gary Wesley, Stanford WebBase components and applications ACM Transactions on Internet Technology. ,vol. 6, pp. 153- 186 ,(2006) , 10.1145/1149121.1149124
Moses S. Charikar, Similarity estimation techniques from rounding algorithms symposium on the theory of computing. pp. 380- 388 ,(2002) , 10.1145/509907.509965
Rudi Cilibrasi, Paul Vitányi, Ronald de Wolf, Algorithmic Clustering of Music Based on String Compression Computer Music Journal. ,vol. 28, pp. 49- 67 ,(2004) , 10.1162/0148926042728449