QProber: A system for automatic classification of hidden-Web databases

Authors: Luis Gravano, Panagiotis G. Ipeirotis, Mehran Sahami

DOI: 10.1145/635484.635485

Abstract: The contents of many valuable Web-accessible databases are only available through search interfaces and are hence invisible to traditional Web "crawlers." Recently, commercial Web sites have started to manually organize Web-accessible databases into Yahoo!-like hierarchical classification schemes. Here we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred Web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
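
The probing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the probe queries, the `min_share` threshold, the flat (non-hierarchical) category set, and the toy in-memory database are all assumptions made for illustration. The only element taken from the abstract is that classification relies on the number of matches each probe reports, never on the documents themselves.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical probes: each category maps to conjunctive keyword queries,
# standing in for probes generated from a trained document classifier.
PROBES: Dict[str, List[str]] = {
    "Health": ["cancer treatment", "diabetes insulin"],
    "Sports": ["soccer league", "baseball pitcher"],
    "Computers": ["linux kernel", "java compiler"],
}


def classify_database(
    count_matches: Callable[[str], int],
    probes: Dict[str, List[str]],
    min_share: float = 0.4,  # assumed threshold, not taken from the paper
) -> List[str]:
    """Return the categories whose probes attract at least min_share of all matches."""
    totals: Counter = Counter()
    for category, queries in probes.items():
        for query in queries:
            # Only the reported match count is used; no documents are fetched.
            totals[category] += count_matches(query)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return sorted(c for c, n in totals.items() if n / grand_total >= min_share)


def make_toy_database(docs: List[str]) -> Callable[[str], int]:
    """Simulate a search interface that reports how many documents contain all query words."""
    tokenized = [set(d.lower().split()) for d in docs]

    def count_matches(query: str) -> int:
        words = set(query.lower().split())
        return sum(1 for doc in tokenized if words <= doc)

    return count_matches


if __name__ == "__main__":
    db = make_toy_database([
        "new cancer treatment options",
        "insulin pumps for diabetes",
        "linux kernel release",
    ])
    print(classify_database(db, PROBES))
```

In a real setting, `count_matches` would wrap a database's search form and parse the "N results found" figure from the response page, which is why the sketch takes it as a plain callable.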
