Authors: Eduard C. Dragut, Thomas Kabisch, Clement Yu, Ulf Leser
Keywords: Web modeling, Web page, Query expansion, Information retrieval, Computer science, Data integration, Crawling, Query optimization, Web search query, Tree structure, Web service, Web query classification
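Abstract: Much data in the Web is hidden behind Web query interfaces. In most cases the only means to "surface" the content of a Web database is by formulating complex queries on such interfaces. Applications such as Deep Web crawling and Web database integration require an automatic usage of these interfaces. Therefore, an important problem to be addressed is the automatic extraction of query interfaces into an appropriate model. We hypothesize the existence of a set of domain-independent "commonsense design rules" that guides the creation of Web query interfaces. These rules transform query interfaces into schema trees. In this paper we describe a Web query interface extraction algorithm, which combines HTML tokens and the geometric layout of these tokens within a Web page. Tokens are classified into several classes, out of which the most significant ones are text tokens and field tokens. A tree structure is derived for the text tokens using their geometric layout. Another tree structure is derived for the field tokens. The hierarchical representation of a query interface is obtained by iteratively merging these two trees. Thus, we convert the extraction problem into an integration problem. Our experiments show the promise of our algorithm: it outperforms the previous approaches to extracting query interfaces by about 6.5% in accuracy, as evaluated over three corpora with more than 500 Deep Web interfaces from 15 different domains.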