Authors: Yan Wang, Jianguo Lu, Jie Liang, Jessica Chen, Jiming Liu
Keywords:
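Abstract: This paper studies the problem of selecting queries to efficiently crawl a deep web data source using a set of sample documents. Crawling the deep web is the process of collecting data from search interfaces by issuing queries. One of the major challenges in crawling the deep web is the selection of queries so that most of the data can be retrieved at a low cost. We propose to learn a set of queries from a sample of the data source. To verify that the queries selected from a sample also produce a good result for the entire data source, we carried out a set of experiments on large corpora including Gov2, newsgroups, Wikipedia, and Reuters. We show that our sampling-based method is effective by empirically proving that 1) the queries selected from samples can harvest most of the data in the original database; 2) queries with a low overlapping rate in samples will also result in a low overlapping rate in the original database; and 3) the size of the sample and the size of the set of terms from which to select the queries do not need to be very large. Compared with other query selection methods, our method obtains the queries by analyzing a small set of sample documents, instead of learning the next best query incrementally from all the documents matched with previous queries.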