VerdictDB: Universalizing Approximate Query Processing

Authors: Yongjoo Park, Barzan Mozafari, Joseph Sorenson, Junhao Wang

DOI: 10.1145/3183713.3196905

Keywords: Correctness, Speedup, Database, Result set, SQL, Spark (mathematics), Computer science

Abstract: Despite 25 years of research in academia, approximate query processing (AQP) has had little industrial adoption. One of the major causes of this slow adoption is the reluctance of traditional vendors to make radical changes to their legacy codebases, and the preoccupation of newer vendors (e.g., SQL-on-Hadoop products) with implementing standard features. Additionally, the few AQP engines that are available are each tied to a specific platform and require users to completely abandon their existing databases (an unrealistic expectation given the infancy of the AQP technology). Therefore, we argue that a universal solution is needed: a database-agnostic approximation engine that will widen the reach of this emerging technology across various platforms. Our proposal, called VerdictDB, uses a middleware architecture that requires no changes to the backend database, and thus can work with all off-the-shelf engines. Operating at the driver level, VerdictDB intercepts analytical queries issued to the database and rewrites them into another query that, if executed by any standard relational engine, will yield sufficient information for computing an approximate answer. VerdictDB uses the returned result set to compute an approximate answer and error estimates, which are then passed on to the user or application. However, lack of access to the query execution layer introduces significant challenges in terms of generality, correctness, and efficiency. This paper shows how VerdictDB overcomes these challenges and delivers up to 171× speedup (18.45× on average) for a variety of existing engines, such as Impala, Spark SQL, and Amazon Redshift, while incurring less than 2.6% relative error. VerdictDB is open-sourced under the Apache License.
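The intercept-and-rewrite idea in the abstract can be illustrated with a minimal sketch. This is not VerdictDB's implementation or API: SQLite stands in for the backend engine, the helpers `build_sample`, `rewrite_avg`, and `approximate_avg` are hypothetical names, the table and column names are made up, and the error estimate is a plain normal-approximation interval chosen for brevity rather than the paper's own error-estimation technique.

```python
import math
import random
import sqlite3


def build_sample(conn, table, column, sample_table, ratio, seed=0):
    """Materialize a uniform random sample of one column (hypothetical helper)."""
    rng = random.Random(seed)
    rows = conn.execute(f"SELECT {column} FROM {table}").fetchall()
    kept = [row for row in rows if rng.random() < ratio]
    conn.execute(f"CREATE TABLE {sample_table} ({column} REAL)")
    conn.executemany(f"INSERT INTO {sample_table} VALUES (?)", kept)


def rewrite_avg(column, sample_table):
    """Rewrite AVG(column) so the backend also returns what error estimation needs."""
    return (
        f"SELECT AVG({column}), AVG({column} * {column}), COUNT(*) "
        f"FROM {sample_table}"
    )


def approximate_avg(conn, column, sample_table):
    """Run the rewritten query and derive an estimate plus a 95% half-width."""
    mean, mean_sq, n = conn.execute(rewrite_avg(column, sample_table)).fetchone()
    variance = max(mean_sq - mean * mean, 0.0)  # guard against rounding below zero
    return mean, 1.96 * math.sqrt(variance / n)


def demo():
    """Compare the approximate answer with the exact one on synthetic data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (price REAL)")
    rng = random.Random(42)
    conn.executemany(
        "INSERT INTO orders VALUES (?)",
        [(rng.uniform(0.0, 100.0),) for _ in range(100_000)],
    )
    build_sample(conn, "orders", "price", "orders_sample", ratio=0.05)
    estimate, half_width = approximate_avg(conn, "price", "orders_sample")
    exact = conn.execute("SELECT AVG(price) FROM orders").fetchone()[0]
    conn.close()
    return estimate, half_width, exact


if __name__ == "__main__":
    estimate, half_width, exact = demo()
    print(f"approximate AVG = {estimate:.2f} +/- {half_width:.2f}, exact = {exact:.2f}")
```

The point of the sketch is that everything happens through ordinary SQL: the backend only sees a standard aggregate query over a smaller table, and the answer and its error bound are assembled outside the engine from the returned result set.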
