Enhancing duplicate collection detection through replica boundary discovery

Authors: Weijia Jia, Xiaoming Li, Zhigang Zhang

DOI: 10.1007/11731139_42

Keywords: Replica, Electronic document, Web page, Web service, Boundary (topology), Web application, Knowledge extraction, The Internet, Data mining, Computer science

Abstract: Web documents are widely replicated on the Internet. These replicated documents bring potential problems to Web-based information systems, so replica detection is an indispensable task. The challenge is to find these duplicated collections from a very large data set with limited hardware resources in acceptable time. In this paper, we first introduce the notion of replica boundary to roughly reflect the situation of replicas; we then propose an effective and efficient approach to discover replicas. The advantages of the proposed approach include: first, it dramatically reduces pair-wise document similarity computation, making it much faster than traditional approaches; second, it can identify the replica boundary accurately, demonstrating to what extent two collections are replicated. On web page sets containing 24 million and 30 million pages respectively, we evaluated the accuracy of the approach.
