Estimating deep web data source size by capture-recapture method

Authors: Jianguo Lu, Dingding Li

DOI: 10.1007/S10791-009-9107-Y

Abstract: This paper addresses the problem of estimating the size of a deep web data source that is accessible by queries only. Since most deep web data sources are non-cooperative, the size of a data source can only be estimated by sending queries and analyzing the returned results. We propose an efficient estimator based on the capture-recapture method. First we derive an equation between the overlapping rate and the percentage of the data examined when random samples are retrieved from a uniform distribution. This equation is conceptually simple and leads to the derivation of an estimator for samples obtained by random queries. Since random queries do not produce random documents, it is well known that the traditional methods based on random queries underestimate the size, i.e., those estimators have negative bias. Based on the simple estimator for random samples, we adjust the equation so that it can handle the samples returned by random queries. We conduct both simulation studies and experiments on corpora including Gov2, Reuters, Newsgroups, and Wikipedia. The results show that our method has small bias and small standard deviation.
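For readers unfamiliar with capture-recapture, the following minimal sketch illustrates the general idea under the uniform-sampling assumption mentioned in the abstract. It uses the classic two-sample Lincoln-Petersen estimate, not the estimator derived in the paper; the function names and the simulation parameters are hypothetical and chosen only for illustration.

```python
import random


def lincoln_petersen(first_sample, second_sample):
    """Classic two-sample capture-recapture estimate: N is roughly n1 * n2 / m.

    n1 and n2 are the numbers of distinct items in each sample and m is the
    number of items that appear in both. This is NOT the paper's estimator.
    """
    first, second = set(first_sample), set(second_sample)
    recaptured = len(first & second)
    if recaptured == 0:
        raise ValueError("no overlap between samples; cannot estimate size")
    return len(first) * len(second) / recaptured


def simulate(true_size=100_000, sample_size=5_000, seed=0):
    """Draw two uniform random samples of document IDs and estimate the size.

    Assumes every document is equally likely to be drawn, which random
    queries against a real data source generally do not guarantee.
    """
    rng = random.Random(seed)
    population = range(true_size)
    first = rng.sample(population, sample_size)
    second = rng.sample(population, sample_size)
    return lincoln_petersen(first, second)


if __name__ == "__main__":
    print(f"estimated size: {simulate():.0f}")
```

With uniform samples the estimate should land close to the true size. The abstract's point is that documents returned by random queries are not uniform samples, so an unadjusted estimate of this kind tends to come out too low.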

References (42)
Adrian Dobra, Stephen E. Fienberg. How Large Is the World Wide Web? Web Dynamics, pp. 23-44 (2004).
Ken Lang. NewsWeeder: Learning to Filter Netnews. Machine Learning Proceedings 1995, pp. 331-339 (1995). DOI: 10.1016/B978-1-55860-377-6.50048-7
Igor A. Bolshakov, Sofia N. Galicia-Haro. Can we correctly estimate the total number of pages in Google for a specific language? International Conference on Computational Linguistics, pp. 415-419 (2003). DOI: 10.1007/3-540-36456-0_44
Otis Gospodnetić, Erik Hatcher, Doug Cutting. Lucene in Action (2004).
Shengli Wu, Forbes Gibb, Fabio Crestani. Experiments with Document Archive Size Detection. Lecture Notes in Computer Science, pp. 294-304 (2003). DOI: 10.1007/3-540-36618-0_21
Lynne Stokes, Jeffrey F. Naughton, Peter J. Haas, S. Seshadri. Sampling-Based Estimation of the Number of Distinct Values of an Attribute. Very Large Data Bases, pp. 311-322 (1995).
Krishna Bharat, Andrei Broder. A technique for measuring the relative size and overlap of public Web search engines. The Web Conference, vol. 30, pp. 379-388 (1998). DOI: 10.1016/S0169-7552(98)00127-5
Craig A. Knoblock, Kristina Lerman, Steven Minton, Ion Muslea. Accurately and reliably extracting data from the Web: a machine learning approach. Intelligent Exploration of the Web, pp. 275-287 (2003). DOI: 10.1007/978-3-7908-1772-0_17