Probery: A Probability-based Incomplete Query Optimization for Big Data.

Authors: Ge Yu, Yichuan Zhang, Yubin Bao, Jie Song

DOI:

Keywords: Data mining, Completeness (order theory), NoSQL, Online transaction processing, Computer science, Big data, Query optimization

Abstract: Query optimization has received considerable attention in big data management, especially for NoSQL databases. Approximate queries boost query performance at the cost of accuracy; for example, sampling approaches trade query completeness for efficiency. In contrast, we propose an uncertain notion of query completeness, called Probability of query Completeness (PC for short). PC refers to the probability that query results contain all satisfying records. For example, PC = 0.95 guarantees that no more than 5 out of 100 queries are incomplete, but it does not guarantee how incomplete they are. We trade PC for query performance, and experiments show that a small loss of PC doubles query performance. The proposed Probery (PROBability-based quERY) adopts PC to accelerate OLTP queries. This paper illustrates the probability models, the probability-based data placement and query processing, and the Apache Drill-based implementation of Probery. In the experiments, we first show that the percentage of complete queries is larger than the given confidence in various cases, namely that the PC guarantee is valid. Probery is then compared with Drill, Impala and Hive in terms of query performance. The results indicate that Probery performs as fast as Drill for complete queries, while being on average 1.8x, 1.3x and 1.6x faster than Drill, Impala and Hive, respectively, for possibly complete queries.
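The PC guarantee described in the abstract can be read as a simple check over a batch of query outcomes. The sketch below is illustrative only and is not taken from the paper: the function name `pc_guarantee_holds` and the list of boolean outcomes are hypothetical, and it assumes each query has already been labeled complete or incomplete.

```python
def pc_guarantee_holds(complete_flags, pc):
    """Return True if the observed share of complete queries is at least pc.

    complete_flags: list of booleans, True when a query returned every
                    satisfying record (hypothetical labeling, assumed given).
    pc:             the requested Probability of query Completeness, e.g. 0.95.
    """
    if not complete_flags:
        raise ValueError("need at least one query outcome")
    observed = sum(complete_flags) / len(complete_flags)
    return observed >= pc


# 100 hypothetical queries, 4 of them incomplete: within the PC = 0.95 bound.
within_bound = [True] * 96 + [False] * 4
print(pc_guarantee_holds(within_bound, 0.95))

# 100 hypothetical queries, 6 of them incomplete: the bound is violated.
over_bound = [True] * 94 + [False] * 6
print(pc_guarantee_holds(over_bound, 0.95))
```

Note that the check says nothing about how many records an incomplete query missed, which matches the abstract's remark that PC bounds the number of incomplete queries but not their degree of incompleteness.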
