Enhancing duplicate collection detection through replica boundary discovery

Authors: Weijia Jia, Xiaoming Li, Zhigang Zhang

Keywords: Replica, Electronic document, Web page, Web service, Boundary (topology), Web application, Knowledge extraction, The Internet, Data mining, Computer science

Abstract: Web documents are widely replicated on the Internet. These replicated documents bring potential problems to Web-based information systems, so replica detection is an indispensable task. The challenge is to find these duplicated collections in a very large data set with limited hardware resources and in acceptable time. In this paper, we first introduce the notion of a replica boundary to roughly reflect the situation of replicas; we then propose an effective and efficient approach to discover the boundary of replicas. The advantages of the proposed approach include: first, it dramatically reduces pair-wise document similarity computation, making it much faster than traditional approaches; second, it can identify the replica boundary accurately, demonstrating to what extent two collections are replicated. On two web page sets containing 24 million and 30 million pages respectively, we evaluated the accuracy of the approach.

