Authors: Heidy M. Marin-Castro, Victor J. Sosa-Sosa, Jose F. Martinez-Trinidad, Ivan Lopez-Arevalo
DOI: 10.1007/S10844-012-0217-4
Keywords:
Abstract: The amount of information contained in databases available on the Web has grown explosively in recent years. This information, known as the Deep Web, is heterogeneous and dynamically generated by querying these back-end (relational) databases through Web Query Interfaces (WQIs), which are a special type of HTML form. Accessing the information in the Deep Web is a great challenge because it is usually not indexed by general-purpose search engines. Therefore, it is necessary to create efficient mechanisms to access, extract, and integrate the information contained in the Deep Web. Since WQIs are the only means of access to the Deep Web, the automatic identification of WQIs plays an important role. It helps traditional search engines increase their coverage and access interesting information that is not available on the indexable Web. Accurate identification of Deep Web data sources is a key issue in the information retrieval process. In this paper we propose a new strategy for the automatic discovery of WQIs. This novel proposal makes an adequate selection of HTML elements extracted from HTML forms, which are used in a set of heuristic rules that help to identify WQIs. The proposed strategy uses machine learning algorithms for the classification of searchable (WQI) and non-searchable (non-WQI) HTML forms, using a prototype selection algorithm that removes irrelevant or redundant data from the training set. The internal content of WQIs was analyzed with the objective of identifying only those HTML elements that appear frequently and provide relevant information for WQI identification. For testing, we use three groups of datasets: two available at the UIUC repository and a dataset created with a generic crawler supported by human experts, which includes advanced and simple query interfaces. The experimental results show that the proposed strategy outperforms previously reported works.