Effectively and efficiently detect web page duplication

Authors: Zhongming Han, Qian Mo, Hongzhi Liu, Jianzhi Sun

DOI: 10.1109/ICDIM.2009.5356801

Abstract: There are many redundant web pages on the Internet. Based on tag statistics and text similarity comparison, we present a novel multilayer framework for detecting duplicated web pages in this paper. We propose two paragraph detection algorithms to implement our framework. The experimental results show that our approach achieves high performance, which means duplicated web pages can be efficiently detected simply by comparison.
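The abstract does not give the details of the two paragraph detection algorithms, so the following is only a minimal sketch of the general idea of comparing pages paragraph by paragraph using text similarity. Everything in it is an assumption made for illustration: the function names, the blank-line paragraph splitting, the word-shingle Jaccard measure, and the 0.8 threshold are not taken from the paper, and the tag-statistics layer is not modeled at all.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a paragraph (hypothetical helper)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short paragraphs become a single shingle; empty ones an empty set.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets; two empty sets count as equal."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def split_paragraphs(page_text):
    """Split already-extracted page text on blank lines (an assumed convention)."""
    return [p.strip() for p in re.split(r"\n\s*\n", page_text) if p.strip()]


def duplicated_fraction(page_a, page_b, threshold=0.8):
    """Fraction of paragraphs in page_a with a near-duplicate in page_b.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    paras_a = [shingles(p) for p in split_paragraphs(page_a)]
    paras_b = [shingles(p) for p in split_paragraphs(page_b)]
    if not paras_a:
        return 0.0
    matched = sum(
        1 for a in paras_a
        if any(jaccard(a, b) >= threshold for b in paras_b)
    )
    return matched / len(paras_a)
```

A caller might treat two pages as duplicates when `duplicated_fraction` exceeds some cutoff in both directions; that cutoff would have to be tuned on real data.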
