Authors: Weijia Jia, Xiaoming Li, Zhigang Zhang
DOI: 10.1007/11731139_42
Keywords: Replica, Electronic document, Web page, Web service, Boundary (topology), Web application, Knowledge extraction, The Internet, Data mining, Computer science
Abstract: Web documents are widely replicated on the Internet. These replicas bring potential problems to Web-based information systems, so replica detection is an indispensable task. The challenge is to find these duplicated collections in a very large data set with limited hardware resources and in acceptable time. In this paper, we first introduce the notion of a replica boundary to roughly reflect the situation of replicas; we then propose an effective and efficient approach to discover replicas. The advantages of the proposed approach include: first, it dramatically reduces pair-wise document similarity computation, making it much faster than traditional approaches; second, it can identify replicas accurately, demonstrating to what extent two collections are replicated. On two web page sets containing 24 million and 30 million pages respectively, we evaluated the accuracy of the approach.