Authors: Xi Zhang, Yuntao Yao, Yingsheng Ji, Binxing Fang
DOI: 10.1155/2016/3919043
Keywords:
Abstract: Detecting near duplicates on the web is challenging due to its volume and variety. Most previous studies require setting input parameters, making it difficult for them to achieve robustness across various scenarios without careful tuning. Recently, a universal, parameter-free similarity metric, the normalized compression distance (NCD), has been employed effectively in diverse applications. Nevertheless, there are problems preventing NCD from being applied to medium-to-large datasets, as it lacks efficiency and tends to get skewed by large object size. To make this method feasible on a large corpus of documents, we propose a new method called SigNCD, which measures NCD based on lightweight signatures instead of full documents, leading to improved stability. We derive lower bounds of NCD and pruning policies to further reduce computational complexity. We evaluate SigNCD on both English and Chinese datasets and show an increase in score compared with the original NCD, together with a significant reduction in runtime. Comparisons with other competitive methods also demonstrate the superiority of our method. Moreover, no parameter tuning is required in SigNCD, except for a threshold.
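For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of plain NCD, using the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed size. It does not implement SigNCD, its signatures, or its pruning policies, none of which are specified in this record. The choice of `zlib` as the compressor and the helper names `compressed_size` and `ncd` are assumptions made for illustration only.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Standard normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs: a document, a near duplicate of it, and an unrelated text.
doc_a = b"the quick brown fox jumps over the lazy dog " * 20
doc_b = doc_a + b"extra"
doc_c = b"normalized compression distance compares compressed sizes " * 20

near_duplicate_distance = ncd(doc_a, doc_b)
unrelated_distance = ncd(doc_a, doc_c)
```

A near-duplicate pair should yield a smaller distance than an unrelated pair; a detector would then compare the distance against a threshold, which the abstract names as the only setting SigNCD needs.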