Selecting queries from sample to crawl deep web data sources

Authors: Yan Wang, Jianguo Lu, Jie Liang, Jessica Chen, Jiming Liu

DOI: 10.3233/WIA-2012-0232

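Abstract: This paper studies the problem of selecting queries to efficiently crawl a deep web data source using a set of sample documents. Crawling the deep web is the process of collecting data from search interfaces by issuing queries. One of the major challenges in crawling the deep web is the selection of queries so that most of the data can be retrieved at a low cost. We propose to learn a set of queries from a sample of the data source. To verify that the queries selected from a sample also produce a good result for the entire data source, we carried out a set of experiments on large corpora including Gov2, newsgroups, Wikipedia and Reuters. We show that our sampling-based method is effective by empirically proving that 1) the queries selected from samples can harvest most of the data in the original database; 2) the queries with a low overlapping rate in samples will also result in a low overlapping rate in the original database; and 3) the size of the sample and the size of the terms from where to select the queries do not need to be very large. Compared with other query selection methods, our method obtains the queries by analyzing a small set of sample documents, instead of learning the next best query incrementally from all the documents matched with previous queries.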
