BlinkDB

Authors: Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden

DOI: 10.1145/2465351.2465355

Keywords: Data mining, Adaptive optimization, SQL, Response time, Sample (statistics), Node (networking), Bounded function, Computer science, Set (abstract data type), Massively parallel

Abstract: In this paper, we present BlinkDB, a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. BlinkDB allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from the original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements. We evaluate BlinkDB against the well-known TPC-H benchmarks and a real-world analytic workload derived from Conviva Inc., a company that manages video distribution over the Internet. Our experiments on a 100-node cluster show that BlinkDB can answer queries on up to 17 TBs of data in less than 2 seconds (over 200x faster than Hive), within an error of 2-10%.
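The accuracy/response-time trade-off described in the abstract can be illustrated with a minimal sketch. This is not BlinkDB's implementation: it draws uniform samples at query time rather than using pre-built stratified samples, it handles only a mean aggregate, and it uses a normal-approximation 95% error bar. The function names `approx_mean` and `answer_within_error`, the candidate sample sizes, and the synthetic data are all assumptions made for illustration.

```python
import math
import random
import statistics


def approx_mean(sample, z=1.96):
    """Return (estimate, half_width): the sample mean and an approximate
    95% error bar based on the normal approximation."""
    n = len(sample)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(n)
    return mean, z * stderr


def answer_within_error(data, sizes, max_rel_error, rng):
    """Try candidate sample sizes from smallest to largest and return the
    first answer whose relative error bar is within max_rel_error.
    Falls back to the exact answer over all data if no sample suffices."""
    for size in sorted(sizes):
        sample = rng.sample(data, size)
        mean, half_width = approx_mean(sample)
        if mean != 0 and half_width / abs(mean) <= max_rel_error:
            return size, mean, half_width
    return len(data), statistics.fmean(data), 0.0


# Hypothetical data set: 100,000 synthetic values (not from the paper).
rng = random.Random(0)
data = [rng.gauss(100.0, 20.0) for _ in range(100_000)]

# Ask for the mean with at most 2% relative error.
size, mean, half_width = answer_within_error(data, [100, 1_000, 10_000], 0.02, rng)
print(f"rows scanned: {size}, mean = {mean:.2f} +/- {half_width:.2f}")
```

A smaller error bound forces a larger sample and therefore more rows scanned, which is the trade-off a user controls when specifying an accuracy or response time requirement.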

References (31)
Lefteris Sidirourgos, Peter A. Boncz, Martin L. Kersten. SciBORQ: Scientific Data Management with Bounds on Runtime and Quality. Conference on Innovative Data Systems Research, pp. 296-301 (2011).
Joseph M. Hellerstein, Peter J. Haas, Helen J. Wang. Online aggregation. International Conference on Management of Data, pp. 171-182 (1997). DOI: 10.1145/253260.253291
C. Sapia. PROMISE: Predicting query behavior to enable predictive caching strategies for OLAP systems. Lecture Notes in Computer Science, pp. 224-233 (2000).
Tyson Condie, Joseph M. Hellerstein, Khaled Elmeleegy, Neil Conway, Peter Alvaro, Russell Sears. MapReduce online. Networked Systems Design and Implementation, p. 21 (2010). DOI: 10.5555/1855711.1855732
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis. Dremel: interactive analysis of web-scale datasets. Communications of the ACM, vol. 54, pp. 114-123 (2011). DOI: 10.1145/1953122.1953148
Srikanta Tirthapura, David P. Woodruff. Optimal Random Sampling from Distributed Streams Revisited. Lecture Notes in Computer Science, vol. 6950, pp. 283-297 (2011). DOI: 10.1007/978-3-642-24100-0_27
Srikanth Kandula, Nicolas Bruno, Ion Stoica, Ming-Chuan Wu, Sameer Agarwal, Jingren Zhou. Re-optimizing data-parallel computing. Networked Systems Design and Implementation, p. 21 (2012).
Minos N. Garofalakis, Phillip B. Gibbons. Approximate Query Processing: Taming the TeraBytes. Very Large Data Bases, pp. 725- (2001).
Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Ion Stoica, Randy Katz. Improving MapReduce performance in heterogeneous environments. Operating Systems Design and Implementation, pp. 29-42 (2008). DOI: 10.5555/1855741.1855744
Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya. On random sampling over joins. ACM SIGMOD Record, vol. 28, pp. 263-274 (1999). DOI: 10.1145/304181.304206