Authors: Victor Bittorf, Marcel Kornacker, Christopher Ré, Ce Zhang
DOI:
Keywords:
Abstract: Recent years have seen a surge in main-memory SQL-style analytic solutions to quickly deliver business-critical information over massive data sets [1, 7, 14]. At the same time, there is an arms race to offer increasingly sophisticated statistical analytics inspired by the success of web search, voice recognition, and image analysis, e.g., Google Brain [8], Facebook [6], and Microsoft’s Adam [2]. This talk describes the first author’s experience porting statistical analytics to Impala via MADlib and observations about research for high-performance main-memory analytics that may be relevant for systems like Impala. A major motivation for Impala was to enable interactive SQL analytics queries over data stored in Hadoop. Impala achieves high performance through many techniques, including co-location of computation with data in HDFS, LLVM code generation [13], and aggressive use of SIMD instructions. These optimizations allow Impala to achieve 8x the query throughput of Shark and Hive for queries in the TPC-DS benchmark [3], and a recent independent benchmark has shown that Impala is about 5 times faster than Hive on MapReduce for TPC-H queries on uncompressed data [10]. We also want high-performance statistical analytics in Impala without major changes to its infrastructure. We started with an approach popularized in MADlib, an existing package for in-RDBMS analytics [4]. We ported a subset of MADlib’s statistical models to Impala [5], many of which use the Bismarck architecture [9], which allows statistical analytics via user-defined functions. In particular, the main algorithm is Stochastic Gradient Descent (SGD), a method that has …
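To make the idea of running SGD through user-defined functions more concrete, the following is a minimal sketch of SGD written in the shape of a user-defined aggregate, with init, update, merge, and finalize steps. It is an illustration under stated assumptions, not the Impala or MADlib code: the function names (`sgd_init`, `sgd_update`, `sgd_merge`, `sgd_finalize`, `run_epoch`, `predict`), the choice of logistic loss, the fixed step size, and the use of row-count-weighted model averaging in the merge step are all hypothetical choices made for this example.

```python
import math
import random


def sgd_init(dim):
    # Aggregate state: model weights plus the number of rows folded in so far.
    return {"w": [0.0] * dim, "n": 0}


def predict(w, x):
    # Logistic prediction; z is clamped so math.exp cannot overflow.
    z = sum(wi * xi for wi, xi in zip(w, x))
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def sgd_update(state, x, y, step=0.1):
    # One SGD step on a single row (x, y) with y in {0, 1}.
    # This plays the role of the per-row "update" function of an aggregate.
    w = state["w"]
    p = predict(w, x)
    for i, xi in enumerate(x):
        w[i] -= step * (p - y) * xi
    state["n"] += 1
    return state


def sgd_merge(a, b):
    # Combine two partial states. Weighted model averaging is an assumption
    # made for this sketch, not necessarily what any real system does.
    total = a["n"] + b["n"]
    if total == 0:
        return a
    w = [(a["n"] * wa + b["n"] * wb) / total for wa, wb in zip(a["w"], b["w"])]
    return {"w": w, "n": total}


def sgd_finalize(state):
    # Return the model produced by the aggregate.
    return state["w"]


def run_epoch(partitions, w0, step=0.1):
    # Simulate one pass of the aggregate: each partition starts from the
    # current model w0, scans its rows, and the partial states are merged.
    merged = None
    for rows in partitions:
        state = sgd_init(len(w0))
        state["w"] = list(w0)
        for x, y in rows:
            state = sgd_update(state, x, y, step)
        merged = state if merged is None else sgd_merge(merged, state)
    return list(w0) if merged is None else sgd_finalize(merged)


if __name__ == "__main__":
    # Toy data: a bias feature plus one value v; the label is 1 when v > 0.
    rng = random.Random(0)
    rows = [([1.0, v], 1 if v > 0 else 0)
            for v in (rng.uniform(-1, 1) for _ in range(200))]
    partitions = [rows[:100], rows[100:]]
    w = [0.0, 0.0]
    for _ in range(20):
        w = run_epoch(partitions, w, step=0.5)
    print("weights:", w)
```

In this sketch each call to `run_epoch` corresponds to one aggregate query over the data, and the outer loop reruns that query with the previous model as the starting point. The appeal of this shape is that the engine only needs to support ordinary aggregates; the statistical logic lives entirely in the update and merge functions.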