Estimating deep web data source size by capture-recapture method

Authors: Jianguo Lu, Dingding Li

Abstract: This paper addresses the problem of estimating the size of a deep web data source that is accessible by queries only. Since most sources are non-cooperative, the size can only be estimated by sending queries and analyzing the returned results. We propose an efficient estimator based on the capture-recapture method. First, we derive an equation between the overlapping rate and the percentage of the data examined when random samples are retrieved from a uniform distribution. This equation is conceptually simple and leads to the derivation of an estimator for samples obtained by random queries. Since random queries do not produce random documents, it is well known that the traditional methods underestimate the size, i.e., those estimators have negative bias. Based on the estimator for random samples, we adjust the equation so that it can handle the samples returned by random queries. We conduct both simulation studies and experiments on corpora including Gov2, Reuters, Newsgroups, and Wikipedia. The results show that our method has small bias and small standard deviation.
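As a point of reference for the capture-recapture idea mentioned in the abstract, the following is a minimal sketch of the classical two-sample (Lincoln-Petersen) estimator applied to uniformly drawn samples. It is not the estimator proposed in the paper, which is not given on this page. The function names `lincoln_petersen` and `simulate`, and the default collection and sample sizes, are assumptions made for illustration only.

```python
import random


def lincoln_petersen(first, second):
    """Classical two-sample capture-recapture estimate of population size.

    Illustrative only: this is the textbook estimator n1 * n2 / overlap,
    not the adjusted estimator described in the abstract.
    """
    first_ids, second_ids = set(first), set(second)
    overlap = len(first_ids & second_ids)
    if overlap == 0:
        raise ValueError("no recaptured documents; draw larger samples")
    return len(first_ids) * len(second_ids) / overlap


def simulate(true_size=10_000, sample_size=1_000, seed=0):
    """Draw two uniform samples of document ids and estimate the size.

    The collection and sample sizes are hypothetical values chosen for
    the demonstration, not figures from the paper.
    """
    rng = random.Random(seed)
    docs = range(true_size)
    first = rng.sample(docs, sample_size)   # "capture"
    second = rng.sample(docs, sample_size)  # "recapture"
    return lincoln_petersen(first, second)


if __name__ == "__main__":
    print(f"estimated size: {simulate():.0f}")
```

With uniform samples the estimate should land near the true size. The abstract's point is that samples returned by random queries are not uniform over documents, so an unadjusted estimator of this kind tends to underestimate.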

