A preprocessing framework and approach for web applications

Authors: Jing Chen, Zhigang Zhang, Xiaoming Li

DOI:

Keywords: Web modeling, Data mining, Web page, Computer science, Static web page, Information retrieval, Rewrite engine, Same-origin policy, Web mapping, Mashup, Data Web

Abstract: Aiming to meet the common requirements of several typical web applications, we propose a new preprocessing framework and a corresponding approach. The framework includes three parts: Web page cleaning, replica removal, and integration. After the preprocessing stage, web pages are purified and transformed into a general model called DocView. DocView consists of eight elements: identifier, type, content classification code, title, keywords, abstract, topic content, and relevant hyperlinks. Most of them are metadata, while the latter two are content data. The approach first partitions a web page into blocks according to some selected tags in the markup tag tree. Based on a set of heuristics, it identifies the blocks that contain the topic content of the page. Then a quantitative measure (a feature vector) is obtained with respect to the identified content. From the feature vector, the elements of DocView are extracted by algorithms. The main advantage of our approach is that it needs no other information beyond the raw page, while additional information is usually necessary in previous related work. The framework and approach have been applied to a search engine (Tianwang [15]) and another system. The strong evidence of improvement in the applications shows the practicability of the framework and verifies the validity of the approach. It is not difficult to realize that after such preprocessing, one can set up a well-formed, purified, and easily manipulated layer on top of any web page collection (including the WWW) for web applications.
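The DocView model and the block-partitioning step described in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the field names and types of `DocView`, the `BLOCK_TAGS` selection, and the "longest text block wins" rule, which is a simple stand-in for the paper's heuristics and feature vector and not the authors' method.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List


@dataclass
class DocView:
    """The eight elements named in the abstract (field names and types are assumed)."""
    identifier: str
    doc_type: str
    classification_code: str
    title: str
    keywords: List[str]
    abstract: str
    topic_content: str
    relevant_hyperlinks: List[str] = field(default_factory=list)


# Assumed selection of block-level tags; the abstract does not list the tags used.
BLOCK_TAGS = {"table", "tr", "td", "div", "p"}


class BlockPartitioner(HTMLParser):
    """Splits a page into text blocks at the boundaries of the selected tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.title_parts = []
        self._in_title = False
        self._new_block()

    def _new_block(self):
        self.blocks.append({"text": [], "links": []})

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._new_block()
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.blocks[-1]["links"].append(href)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._new_block()
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title_parts.append(text)
        else:
            self.blocks[-1]["text"].append(text)


def build_docview(identifier: str, html: str) -> DocView:
    """Builds a DocView from raw HTML only, using a stand-in heuristic."""
    parser = BlockPartitioner()
    parser.feed(html)
    parser.close()
    blocks = [b for b in parser.blocks if b["text"]]
    # Stand-in heuristic: treat the block with the most text as the topic content.
    empty = {"text": [], "links": []}
    main = max(blocks, key=lambda b: len(" ".join(b["text"])), default=empty)
    topic = " ".join(main["text"])
    return DocView(
        identifier=identifier,
        doc_type="unknown",          # type detection is not sketched here
        classification_code="",      # classification is not sketched here
        title=" ".join(parser.title_parts),
        keywords=[],                 # keyword extraction is not sketched here
        abstract=topic[:200],        # naive abstract: leading characters of the topic
        topic_content=topic,
        relevant_hyperlinks=list(main["links"]),
    )
```

Calling `build_docview("doc-1", raw_html)` on a page with a navigation block and a longer article block would return a `DocView` whose `topic_content` and `relevant_hyperlinks` come from the article block only. As in the abstract, nothing beyond the raw page is required as input.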

References (21)
H. Garcia-Molina, A. Crespo, J. Hammer, J. Cho, R. Aranha. Extracting Semistructured Information from the Web. Stanford InfoLab (1997).
Hector Garcia-Molina, Narayanan Shivakumar. Finding Near-Replicas of Documents and Servers on the Web. International Workshop on the Web and Databases, pp. 204-212 (1998).
Nicholas Kushmerick, Daniel S. Weld. Wrapper Induction for Information Extraction. International Joint Conference on Artificial Intelligence, pp. 729-737 (1997).
Hector Garcia-Molina, Narayanan Shivakumar. SCAM: A Copy Detection Mechanism for Digital Documents. DL (1995).
David Hawking, Nick Craswell, Peter Bailey, Kathleen Griffiths. Measuring Search Engine Quality. Information Retrieval, vol. 4, pp. 33-59 (2001). DOI: 10.1023/A:1011468107287
Udi Manber. Finding Similar Files in a Large File System. USENIX Winter Technical Conference, p. 2 (1994).
Yiming Yang. Noise Reduction in a Statistical Approach to Text Categorization. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), pp. 256-263 (1995). DOI: 10.1145/215206.215367
Gerard Salton, Christopher Buckley. Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, vol. 24, pp. 323-328 (1988). DOI: 10.1016/0306-4573(88)90021-0
Lan Yi, Bing Liu, Xiaoli Li. Eliminating Noisy Information in Web Pages for Data Mining. Knowledge Discovery and Data Mining, pp. 296-305 (2003). DOI: 10.1145/956750.956785