Automatic discovery of Web Query Interfaces using machine learning techniques

Authors: Heidy M. Marin-Castro, Victor J. Sosa-Sosa, Jose F. Martinez-Trinidad, Ivan Lopez-Arevalo

DOI: 10.1007/S10844-012-0217-4

Keywords:

Abstract: The amount of information contained in databases available on the Web has grown explosively in recent years. This information, known as the Deep Web, is heterogeneous and dynamically generated by querying these back-end (relational) databases through Web Query Interfaces (WQIs), which are a special type of HTML form. Accessing the information in the Deep Web is a great challenge because it is usually not indexed by general-purpose search engines. Therefore, it is necessary to create efficient mechanisms to access, extract, and integrate the information contained in the Deep Web. Since WQIs are the only means of access to the Deep Web, their automatic identification plays an important role. It helps traditional search engines increase their coverage and reach interesting information that is not available on the indexable Web. The accurate identification of Deep Web data sources is a key issue in the information retrieval process. In this paper we propose a new strategy for the automatic discovery of WQIs. This novel proposal makes an adequate selection of HTML elements extracted from HTML forms, which are used in a set of heuristic rules that help to identify WQIs. The proposed strategy uses machine learning algorithms to classify HTML forms as searchable (WQI) or non-searchable (non-WQI), using a prototype selection algorithm that removes irrelevant or redundant data from the training set. The internal content of WQIs was analyzed with the objective of identifying only those HTML elements that appear frequently and provide relevant information for WQI identification. For testing, we use three groups of datasets: two available at the UIUC repository and a new dataset created with a generic crawler supported by human experts, which includes advanced and simple query interfaces. The experimental results show that the proposed strategy outperforms previously reported works.
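The idea of selecting HTML elements from a form and applying a heuristic rule can be sketched as follows. This is a minimal illustration only: the feature set, the function names (`extract_form_features`, `looks_searchable`), and the single rule shown are assumptions made for this example and are not the elements, rules, or classifiers used in the paper.

```python
from html.parser import HTMLParser

# Input types counted as features (an illustrative choice, not the paper's).
COUNTED_INPUT_TYPES = ("text", "password", "checkbox", "submit")


class FormFeatureExtractor(HTMLParser):
    """Collects simple per-form counts of HTML elements."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one feature dict per <form>
        self._current = None   # features of the form currently open

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._current = {"text": 0, "password": 0, "select": 0,
                             "checkbox": 0, "submit": 0}
            self.forms.append(self._current)
        elif self._current is not None:
            if tag == "select":
                self._current["select"] += 1
            elif tag == "input":
                # An <input> without a type attribute defaults to "text".
                kind = (dict(attrs).get("type") or "text").lower()
                if kind in COUNTED_INPUT_TYPES:
                    self._current[kind] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def extract_form_features(html):
    """Return a list of feature dicts, one per form found in the HTML."""
    parser = FormFeatureExtractor()
    parser.feed(html)
    return parser.forms


def looks_searchable(features):
    """Hypothetical rule: a password field suggests a login form (non-WQI);
    otherwise at least one text box or select menu suggests a query interface."""
    if features["password"] > 0:
        return False
    return features["text"] + features["select"] > 0
```

In a pipeline like the one the abstract describes, feature vectors of this kind would feed a trained classifier rather than a single hand-written rule; the rule here only shows how element counts can separate a search form from a login form.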

References (29)
J. Arturo Olvera-López, J. Francisco Martínez-Trinidad, J. Ariel Carrasco-Ochoa. Mixed data object selection based on clustering and border objects. Iberoamerican Congress on Pattern Recognition, pp. 674-683 (2007). DOI: 10.1007/978-3-540-76725-1_70
Javier Raymundo García-Serrano, José Francisco Martínez-Trinidad. Extension to C-means Algorithm for the Use of Similarity Functions. European Conference on Principles of Data Mining and Knowledge Discovery, pp. 354-359 (1999). DOI: 10.1007/978-3-540-48247-5_42
Ling Lin, Lizhu Zhou. Web database schema identification through simple query interface. Lecture Notes in Computer Science, vol. 6162, pp. 18-34 (2009). DOI: 10.1007/978-3-642-14415-8_2
Lu Jiang, Zhaohui Wu, Qian Feng, Jun Liu, Qinghua Zheng. Efficient deep web crawling using reinforcement learning. Knowledge Discovery and Data Mining, pp. 428-439 (2010). DOI: 10.1007/978-3-642-13657-3_46
Ralph B. D'Agostino, Albert Belanger, Ralph B. D'Agostino Jr. A Suggestion for Using Powerful and Informative Tests of Normality. The American Statistician, vol. 44, pp. 316-321 (1990). DOI: 10.1080/00031305.1990.10475751
Luciano Barbosa, Hoa Nguyen, Thanh Nguyen, Ramesh Pinnamaneni, Juliana Freire. Creating and exploring web form repositories. Proceedings of the 2010 International Conference on Management of Data (SIGMOD '10), pp. 1175-1178 (2010). DOI: 10.1145/1807167.1807311
Yingjun Li, Tiezheng Nie, Derong Shen, Ge Yu. Domain-oriented Deep Web Data Sources' Discovery and Identification. Asia-Pacific Web Conference, pp. 464-467 (2010). DOI: 10.1109/APWEB.2010.54
Yanbo Ru, Ellis Horowitz. Indexing the invisible web: a survey. Online Information Review, vol. 29, pp. 249-265 (2005). DOI: 10.1108/14684520510607579
Jianguo Lu, Dingding Li. Estimating deep web data source size by capture-recapture method. Information Retrieval, vol. 13, pp. 70-95 (2010). DOI: 10.1007/S10791-009-9107-Y