Effectively and efficiently detect web page duplication

Authors: Zhongming Han, Qian Mo, Hongzhi Liu, Jianzhi Sun

DOI: 10.1109/ICDIM.2009.5356801

Abstract: There are many redundant web pages on the Internet. In this paper, we present a novel multilayer framework for detecting duplicated web pages, based on tag statistics and text similarity comparison. We propose two paragraph detection algorithms to implement the framework. The experimental results show that the approach achieves high performance, which means that duplicated web pages can be detected efficiently by simple comparison.
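The abstract does not describe the two paragraph detection algorithms, so the following is only an illustrative sketch of paragraph-level text similarity comparison, not the authors' method. It assumes paragraphs are separated by blank lines, uses Jaccard similarity over word sets as the similarity measure, and applies an arbitrary threshold of 0.8; the function names (`split_paragraphs`, `word_set`, `jaccard`, `duplicate_ratio`) are hypothetical.

```python
import re


def split_paragraphs(text):
    """Split plain text into non-empty paragraphs on blank lines (assumption)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def word_set(paragraph):
    """Lowercased set of word tokens in a paragraph."""
    return set(re.findall(r"\w+", paragraph.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def duplicate_ratio(page_a, page_b, threshold=0.8):
    """Fraction of paragraphs in page_a with a near-match in page_b.

    The threshold is an illustrative value, not one taken from the paper.
    """
    paras_a = [word_set(p) for p in split_paragraphs(page_a)]
    paras_b = [word_set(p) for p in split_paragraphs(page_b)]
    if not paras_a:
        return 0.0
    matched = sum(
        1 for a in paras_a
        if any(jaccard(a, b) >= threshold for b in paras_b)
    )
    return matched / len(paras_a)


if __name__ == "__main__":
    page_a = "The quick brown fox jumps.\n\nA second paragraph here."
    page_b = "the quick brown fox jumps!\n\nSomething entirely different."
    print(duplicate_ratio(page_a, page_b))
```

A pair of pages could then be flagged as duplicates when this ratio exceeds a chosen cutoff. A tag-statistics layer, as mentioned in the abstract, would presumably run before such a comparison, but its details are not given here.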

