Authors: Jing Chen, Zhigang Zhang, Xiaoming Li
DOI:
Keywords: Web modeling, Data mining, Web page, Computer science, Static web page, Information retrieval, Rewrite engine, Same-origin policy, Web mapping, Mashup, Data Web
Abstract: Aiming to meet the common requirements of several typical web applications, we propose a new preprocessing framework and a corresponding approach. The framework includes three parts: web page cleaning, replica removal, and integration. After the preprocessing stage, web pages are purified and transformed into a general model called DocView. The model consists of eight elements: identifier, type, content classification code, title, keywords, abstract, topic content, and relevant hyperlinks. Most of them are metadata, while the latter two are content data. The approach first partitions a web page into blocks according to some selected tags in the markup tag tree. Based on a set of heuristics, it identifies the blocks that contain the content of the page. Then a quantitative measure (a feature vector) with respect to these blocks is obtained. From the vector, the elements of DocView are extracted by algorithms. The main advantage of our approach is that it needs no information beyond the raw page, while additional information is usually necessary in previous related work. The framework and approach have been applied to a search engine (Tianwang [15]) system. The strong evidence of improvement in the applications shows the practicability of the framework and verifies the validity of the approach. It is not difficult to see that, after such preprocessing, one can set up a well-formed, purified, easily manipulated layer on top of any web page collection (including the WWW) for web applications.
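The pipeline outlined in the abstract (partition a page into blocks at selected tags, pick the content blocks heuristically, fill a DocView record) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not the authors' method: the tag set `BLOCK_TAGS`, the `block_score` heuristic (non-link text length, standing in for the paper's feature vector), and the helper names `BlockPartitioner` and `to_docview`. Only the eight field names of `DocView` come from the abstract.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumed block-delimiting tags; the abstract does not say which tags are selected.
BLOCK_TAGS = {"div", "td", "p", "table"}


@dataclass
class DocView:
    """The eight DocView elements listed in the abstract."""
    identifier: str
    type: str = ""
    classification_code: str = ""
    title: str = ""
    keywords: list = field(default_factory=list)
    abstract: str = ""
    topic_content: str = ""
    relevant_hyperlinks: list = field(default_factory=list)


class BlockPartitioner(HTMLParser):
    """Splits a page into text blocks at every BLOCK_TAGS boundary."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.title_parts = []
        self._in_title = False
        self._in_link = False
        self._skip = 0  # nesting depth inside <script>/<style>
        self._new_block()

    def _new_block(self):
        self.blocks.append({"text": [], "links": [], "link_chars": 0})

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title = True
        elif tag in BLOCK_TAGS:
            self._new_block()
        elif tag == "a":
            self._in_link = True
            href = dict(attrs).get("href")
            if href:
                self.blocks[-1]["links"].append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "title":
            self._in_title = False
        elif tag in BLOCK_TAGS:
            self._new_block()
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip:
            return
        if self._in_title:
            self.title_parts.append(text)
            return
        block = self.blocks[-1]
        block["text"].append(text)
        if self._in_link:
            block["link_chars"] += len(text)


def block_score(block):
    """Hypothetical heuristic: reward plain text, penalize anchor text."""
    total = sum(len(t) for t in block["text"])
    return total - 2 * block["link_chars"]


def to_docview(identifier, html):
    """Build a partial DocView from raw HTML only (no external information)."""
    parser = BlockPartitioner()
    parser.feed(html)
    parser.close()
    blocks = [b for b in parser.blocks if b["text"]]
    best = max(blocks, key=block_score, default=None)
    return DocView(
        identifier=identifier,
        title=" ".join(parser.title_parts),
        topic_content=" ".join(best["text"]) if best else "",
        relevant_hyperlinks=list(best["links"]) if best else [],
    )
```

Calling `to_docview("doc-1", raw_html)` on a page with a navigation bar and an article body should return the article text as `topic_content`, since link-heavy blocks score low. The remaining fields (type, classification code, keywords, abstract) are left empty because the abstract does not describe how they are derived.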