Authors: Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden
Keywords: Data mining, Adaptive optimization, SQL, Response time, Sample (statistics), Node (networking), Bounded function, Computer science, Set (abstract data type), Massively parallel
Abstract: In this paper, we present BlinkDB, a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. BlinkDB allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from the original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements. We evaluate BlinkDB against the well-known TPC-H benchmarks and a real-world analytic workload derived from Conviva Inc., a company that manages video distribution over the Internet. Our experiments on a 100-node cluster show that BlinkDB can answer queries on up to 17 TB of data in less than 2 seconds (over 200x faster than Hive), within an error of 2-10%.
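The idea of answering a query from a sample, attaching an error bar, and choosing the smallest sample that satisfies an accuracy requirement can be sketched roughly as follows. This is only an illustration under simplifying assumptions: it uses uniform random samples rather than the stratified samples the abstract describes, a normal-approximation confidence interval, and synthetic data. The names `approximate_mean` and `pick_sample` are hypothetical and are not part of BlinkDB.

```python
import math
import random
import statistics


def approximate_mean(sample, z=1.96):
    """Estimate the mean from a sample.

    Returns (estimate, half_width), where half_width is the half-width of an
    approximate 95% confidence interval (normal approximation, assumed here).
    """
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return estimate, z * std_err


def pick_sample(samples, max_relative_error):
    """Return (sample, estimate, half_width) for the smallest pre-built sample
    whose relative error bar is within the requested bound.

    Falls back to the largest sample if none satisfies the bound.
    """
    for sample in sorted(samples, key=len):
        estimate, half_width = approximate_mean(sample)
        if estimate != 0 and half_width / abs(estimate) <= max_relative_error:
            return sample, estimate, half_width
    largest = max(samples, key=len)
    estimate, half_width = approximate_mean(largest)
    return largest, estimate, half_width


# Synthetic "table column" and three pre-built uniform samples (assumed sizes).
rng = random.Random(0)
population = [rng.uniform(0, 100) for _ in range(100_000)]
samples = [rng.sample(population, k) for k in (100, 1_000, 10_000)]

# Ask for an answer whose error bar is at most 5% of the estimate.
chosen, estimate, half_width = pick_sample(samples, max_relative_error=0.05)
print(f"rows scanned: {len(chosen)}, mean ~ {estimate:.2f} +/- {half_width:.2f}")
```

The trade-off is visible in the loop: a tighter `max_relative_error` forces a larger sample to be scanned, which costs more time, while a looser bound lets the query finish on a smaller one.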