A preprocessing framework and approach for web applications

Authors: Jing Chen, Zhigang Zhang, Xiaoming Li

DOI:

Keywords: Web modeling, Data mining, Web page, Computer science, Static web page, Information retrieval, Rewrite engine, Same-origin policy, Web mapping, Mashup, Data Web

Abstract: Aiming to meet the common requirements of several typical web applications, we propose a new preprocessing framework and a corresponding approach. The framework includes three parts: Web page cleaning, replica removal, and integration. After the preprocessing stage, web pages are purified and transformed into a general model called DocView. The model consists of eight elements: identifier, type, content classification code, title, keywords, abstract, topic content, and relevant hyperlinks. Most of them are meta data, while the latter two are content data. The approach first partitions a web page into blocks according to some selected tags in the markup tag tree. Based on a set of heuristics, it identifies the blocks that contain the topic content of the page. Then a quantitative measure (a feature vector) with respect to the topic content is obtained. From the feature vector, the elements of DocView are extracted by corresponding algorithms. The main advantage of our approach is that it needs no information beyond the raw page, while additional information is usually necessary in previous related work. The framework and approach have been applied to a search engine (Tianwang [15]) and one other system. Strong evidence of improvement in these applications shows the practicability of the framework and verifies the validity of the approach. It is not difficult to see that, after such preprocessing, the web pages can make up a well-formed, purified, and easily manipulated layer on top of any web page collection (including the WWW) for web applications.
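The block-partitioning and topic-identification steps described in the abstract can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration, not the authors' method: the set of block-level tags (`BLOCK_TAGS`), the two heuristics in `is_topic_block` (a minimum text length and a maximum share of anchor text), their thresholds, and the use of plain term frequencies as the feature vector. The abstract does not state which tags, heuristics, or features the paper actually uses.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed set of block-level tags; the abstract does not list the selected tags.
BLOCK_TAGS = {"table", "tr", "td", "div", "p", "ul", "ol"}


class BlockPartitioner(HTMLParser):
    """Splits a page into text blocks at the boundaries of BLOCK_TAGS."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # each block: {"text": str, "link_chars": int}
        self._text = []
        self._link_chars = 0
        self._in_anchor = 0
        self._skip = 0         # depth inside <script>/<style>

    def _flush(self):
        text = " ".join(" ".join(self._text).split())
        if text:
            self.blocks.append({"text": text, "link_chars": self._link_chars})
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_anchor += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_anchor = max(0, self._in_anchor - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_anchor:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def partition(html):
    parser = BlockPartitioner()
    parser.feed(html)
    parser.close()
    return parser.blocks


def is_topic_block(block, min_chars=40, max_link_ratio=0.5):
    # Assumed heuristics: enough text, and not dominated by hyperlink text.
    text_len = len(block["text"])
    return text_len >= min_chars and block["link_chars"] / text_len <= max_link_ratio


def feature_vector(blocks):
    # Assumed feature vector: raw term frequencies over the topic blocks.
    counts = Counter()
    for block in blocks:
        counts.update(re.findall(r"[a-z]+", block["text"].lower()))
    return counts


page = """<html><head><title>Demo</title><style>p{color:red}</style></head><body>
<div><a href="/">Home</a> <a href="/news">News</a> <a href="/sports">Sports</a></div>
<p>Web page cleaning removes navigation bars and advertisements so that the topic content of the page remains.</p>
<div>Copyright 2004</div>
</body></html>"""

blocks = partition(page)
topic_blocks = [b for b in blocks if is_topic_block(b)]
vector = feature_vector(topic_blocks)
```

In this sketch the navigation block is rejected because almost all of its text sits inside anchors, and the title and copyright fragments are rejected as too short, so only the paragraph contributes to `vector`. A real system would need heuristics tuned to its own page collection.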

References (21)

H. Garcia-Molina, A. Crespo, J. Hammer, J. Cho, R. Aranha. Extracting Semistructured Information from the Web. Stanford InfoLab (1997).

Hsinchun Chen, Thian-Huat Ong. Updateable PAT-Tree Approach to Chinese Key Phrase Extraction using Mutual Information: A Linguistic Foundation for Knowledge Management (1999).

Hector Garcia-Molina, Narayanan Shivakumar. Finding Near-Replicas of Documents and Servers on the Web. International Workshop on the Web and Databases, pp. 204-212 (1998).

Nicholas Kushmerick, Daniel S. Weld. Wrapper Induction for Information Extraction. International Joint Conference on Artificial Intelligence, pp. 729-737 (1997).

Hector Garcia-Molina, Narayanan Shivakumar. SCAM: A Copy Detection Mechanism for Digital Documents. DL (1995).

David Hawking, Nick Craswell, Peter Bailey, Kathleen Griffiths. Measuring Search Engine Quality. Information Retrieval, vol. 4, pp. 33-59 (2001). doi:10.1023/A:1011468107287

Udi Manber. Finding Similar Files in a Large File System. USENIX Winter Technical Conference, pp. 2-2 (1994).

Yiming Yang. Noise Reduction in a Statistical Approach to Text Categorization. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), pp. 256-263 (1995). doi:10.1145/215206.215367

Gerard Salton, Christopher Buckley. Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, vol. 24, pp. 323-328 (1988). doi:10.1016/0306-4573(88)90021-0

Lan Yi, Bing Liu, Xiaoli Li. Eliminating Noisy Information in Web Pages for Data Mining. Knowledge Discovery and Data Mining, pp. 296-305 (2003). doi:10.1145/956750.956785
