Authors: Jianguo Lu, Dingding Li
DOI: 10.1007/S10791-009-9107-Y
Keywords:
Abstract: This paper addresses the problem of estimating the size of a deep web data source that is accessible by queries only. Since most data sources are non-cooperative, the size can only be estimated by sending queries and analyzing the returned results. We propose an efficient estimator based on the capture-recapture method. First, we derive an equation between the overlapping rate and the percentage of the data examined when random samples are retrieved from a uniform distribution. This equation is conceptually simple and leads to the derivation of an estimator for samples obtained by random queries. Since random queries do not produce random documents, it is well known that the traditional methods underestimate the size, i.e., those estimators have negative bias. Based on the estimator for random samples, we adjust the equation so that it can handle the samples returned by random queries. We conduct both simulation studies and experiments on corpora including Gov2, Reuters, Newsgroups, and Wikipedia. The results show that our method has small bias and standard deviation.
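As a rough illustration of the capture-recapture idea the abstract builds on, the sketch below applies the textbook two-sample (Lincoln-Petersen) estimate to uniform random samples of document IDs. It is not the estimator proposed in the paper. The function names `lincoln_petersen` and `simulate`, the collection size, and the sample size are all assumptions chosen for illustration.

```python
import random


def lincoln_petersen(first_sample, second_sample):
    """Textbook two-sample capture-recapture estimate: n1 * n2 / overlap.

    This is the classical estimator, used here only as an illustration;
    it is not the adjusted estimator described in the abstract.
    """
    first, second = set(first_sample), set(second_sample)
    overlap = len(first & second)
    if overlap == 0:
        raise ValueError("no recaptured documents; the estimate is undefined")
    return len(first) * len(second) / overlap


def simulate(true_size=100_000, sample_size=5_000, seed=0):
    """Draw two uniform samples of document IDs and estimate the collection size.

    true_size and sample_size are hypothetical values, not figures from the paper.
    """
    rng = random.Random(seed)
    ids = range(true_size)
    first = rng.sample(ids, sample_size)   # "capture"
    second = rng.sample(ids, sample_size)  # "recapture"
    return lincoln_petersen(first, second)


if __name__ == "__main__":
    print(f"estimated size: {simulate():.0f}")
```

The simulation draws documents uniformly at random, which is the idealized setting. Samples obtained through queries are not uniform, which is the source of the negative bias the abstract mentions, and this sketch makes no attempt to correct for it.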