A hierarchical approach to model web query interfaces for web source integration

Authors: Eduard C. Dragut, Thomas Kabisch, Clement Yu, Ulf Leser

DOI: 10.14778/1687627.1687665

Keywords: Web modeling, Web page, Query expansion, Information retrieval, Computer science, Data integration, Crawling, Query optimization, Web search query, Tree structure, Web service, Web query classification

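Abstract: Much data in the Web is hidden behind Web query interfaces. In most cases the only means to "surface" the content of a Web database is by formulating complex queries on such interfaces. Applications such as Deep Web crawling and Web database integration require an automatic usage of these interfaces. Therefore, an important problem to be addressed is the automatic extraction of query interfaces into an appropriate model. We hypothesize the existence of a set of domain-independent "commonsense design rules" that guides the creation of Web query interfaces. These rules transform query interfaces into schema trees. In this paper we describe a Web query interface extraction algorithm, which combines HTML tokens and the geometric layout of these tokens within a Web page. Tokens are classified into several classes, out of which the most significant ones are text tokens and field tokens. A tree structure is derived for the text tokens using their geometric layout. Another tree structure is derived for the field tokens. The hierarchical representation of a query interface is obtained by iteratively merging these two trees. Thus, we convert the extraction problem into an integration problem. Our experiments show the promise of our algorithm: it outperforms the previous approaches on extracting query interfaces by about 6.5% in accuracy, as evaluated over three corpora with more than 500 Deep Web interfaces from 15 different domains.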
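The token classification step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it only separates text tokens from field tokens in an HTML form using Python's standard `html.parser`, and it ignores geometric layout and tree merging entirely. The names `TokenClassifier` and `classify_tokens`, the choice of field tags, and the decision to skip hidden and button-like inputs are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as "field tokens" in this sketch.
FIELD_TAGS = {"input", "select", "textarea"}
# Assumption: text inside these elements is not a label, so it is ignored.
SKIP_TEXT_INSIDE = {"select", "textarea", "script", "style"}
# Assumption: these input types carry no queryable attribute.
NON_QUERY_INPUTS = {"hidden", "submit", "button", "reset", "image"}


class TokenClassifier(HTMLParser):
    """Collects (kind, value) pairs, where kind is 'text' or 'field'."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._skip_depth = 0  # > 0 while inside an element whose text is ignored

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in FIELD_TAGS:
            input_type = (attr_map.get("type") or "text").lower()
            if not (tag == "input" and input_type in NON_QUERY_INPUTS):
                self.tokens.append(("field", attr_map.get("name") or ""))
        if tag in SKIP_TEXT_INSIDE:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TEXT_INSIDE and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and self._skip_depth == 0:
            self.tokens.append(("text", text))


def classify_tokens(html):
    """Return the text and field tokens of an HTML fragment in document order."""
    parser = TokenClassifier()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

In a fuller pipeline, each token would also need its rendered position on the page so that the text-token tree and the field-token tree could be built from layout; that information is not available from the HTML source alone and is left out of this sketch.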
