The analytical bootstrap: a new method for fast error estimation in approximate query processing

作者: Kai Zeng , Shi Gao , Barzan Mozafari , Carlo Zaniolo

DOI: 10.1145/2588555.2588579

关键词:

摘要: Sampling is one of the most commonly used techniques in Approximate Query Processing (AQP)-an area research that now made more critical by need for timely and cost-effective analytics over "Big Data". Assessing quality (i.e., estimating error) approximate answers essential meaningful AQP, two main approaches past to address this problem are based on either (i) analytic error quantification or (ii) bootstrap method. The first approach extremely efficient but lacks generality, whereas second quite general suffers from its high computational overhead. In paper, we introduce a probabilistic relational model process, along with rigorous semantics unified model, which bridges gap between these traditional approaches. Based our framework, develop algorithms predict distribution approximation results. These enable computation any bootstrap-based measure large class SQL queries via single-round evaluation slightly modified query. Extensive experiments both synthetic real-world datasets show method has superior prediction accuracy measures, several orders magnitude faster than bootstrap.

参考文章(41)
George E. P. Box, William G. Hunter, J. Stuart Hunter, Statistics for experimenters : design, innovation, and discovery John Wiley & Sons. ,(2005)
Hector Garcia-Molina, Jennifer Widom, Jeffrey D. Ullman, Database Systems: The Complete Book ,(2001)
Ariel Kleiner, Ameet Talwalkar, Sameer Agarwal, Ion Stoica, Michael I Jordan, None, A general bootstrap performance diagnostic knowledge discovery and data mining. pp. 419- 427 ,(2013) , 10.1145/2487575.2487650
Grigoris Karvounarakis, Todd J. Green, Semiring-annotated data ACM SIGMOD Record. ,vol. 41, pp. 5- 14 ,(2012) , 10.1145/2380776.2380778
Barzan Mozafari, Carlo Zaniolo, Optimal load shedding with aggregates and mining queries 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010). pp. 76- 88 ,(2010) , 10.1109/ICDE.2010.5447867
Christopher Ré, Dan Suciu, The trichotomy of HAVING queries on a probabilistic database very large data bases. ,vol. 18, pp. 1091- 1116 ,(2009) , 10.1007/S00778-009-0151-4
Thanh T. L. Tran, Liping Peng, Yanlei Diao, Andrew McGregor, Anna Liu, CLARO: modeling and processing uncertain data streams very large data bases. ,vol. 21, pp. 651- 676 ,(2012) , 10.1007/S00778-011-0261-7
Yingwei Cui, Jennifer Widom, Janet L Wiener, None, Tracing the lineage of view data in a warehousing environment ACM Transactions on Database Systems. ,vol. 25, pp. 179- 227 ,(2000) , 10.1145/357775.357777
Robert J Tibshirani, Bradley Efron, An introduction to the bootstrap ,(1993)