A rewrite-based optimizer for Spark

Authors: Zeinab Shmeis, Mohamad Jaber

DOI: 10.1016/J.FUTURE.2019.03.044

Abstract: Spark is the leading platform for distributed large-scale data processing. Spark's Application Programming Interface (API) offers powerful, easy-to-use abstractions related to functional programming (e.g., map, filter, reduce) in several different languages. However, writing efficient Spark applications is still error-prone and time-consuming, and requires a clear, deep understanding of the inner workings of Spark. For instance, the same task can be implemented in several ways, yet the execution time may vary drastically between them. To remedy this, we introduce TaBOS, a rewrite-based optimizer for Spark programs. TaBOS takes a Spark job and automatically generates the state-space of equivalent optimized jobs using a set of semantics-preserving rewrite rules. Then, from the generated state-space, it selects one optimal program based on a predefined strategy. We define several selection strategies (e.g., the program with the maximum number of applied rules, or the program with the minimum number of heavy operations) for identifying the optimal program in the state-space. We evaluate the effectiveness, robustness, and speedup gain of our solutions using several case studies.
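The pipeline described in the abstract (rewrite rules, state-space generation, strategy-based selection) can be illustrated with a small sketch. This is not TaBOS itself: the job is modelled as a plain tuple of operation names rather than a real Spark program, the three rewrite rules are hypothetical examples assumed to be semantics-preserving, and the choice of which operations count as "heavy" is an assumption made only for this illustration.

```python
from collections import deque

# Toy model: a job is a tuple of operation names, standing in for a Spark lineage.
# ASSUMPTION: these rules are illustrative only, not the rule set used by TaBOS.
RULES = [
    (("map", "map"), ("map",)),                           # fuse two adjacent maps
    (("groupByKey", "mapValues(sum)"), ("reduceByKey",)),  # aggregate while shuffling
    (("sortByKey", "filter"), ("filter", "sortByKey")),    # filter before sorting
]

# ASSUMPTION: which operations count as "heavy" is chosen for this sketch only.
HEAVY_OPS = {"groupByKey", "sortByKey"}


def apply_rule(job, pattern, replacement):
    """Yield every job obtained by rewriting one occurrence of pattern."""
    n = len(pattern)
    for i in range(len(job) - n + 1):
        if job[i:i + n] == pattern:
            yield job[:i] + replacement + job[i + n:]


def generate_state_space(job):
    """Breadth-first exploration of all jobs reachable through the rules.

    Returns a dict mapping each equivalent job to the number of rule
    applications on the shortest derivation found from the input job.
    """
    seen = {job: 0}
    queue = deque([job])
    while queue:
        current = queue.popleft()
        for pattern, replacement in RULES:
            for rewritten in apply_rule(current, pattern, replacement):
                if rewritten not in seen:
                    seen[rewritten] = seen[current] + 1
                    queue.append(rewritten)
    return seen


def heavy_count(job):
    """Number of operations in the job that are considered heavy."""
    return sum(op in HEAVY_OPS for op in job)


def select_max_rules(space):
    """Strategy: pick the job reached with the most rule applications."""
    return max(space, key=space.get)


def select_min_heavy(space):
    """Strategy: fewest heavy operations; ties go to the most-rewritten job."""
    return min(space, key=lambda j: (heavy_count(j), -space[j]))


job = ("map", "map", "groupByKey", "mapValues(sum)", "sortByKey", "filter")
space = generate_state_space(job)
print(len(space), "equivalent jobs")
print("max rules :", select_max_rules(space))
print("min heavy :", select_min_heavy(space))
```

The two selection functions are interchangeable, which mirrors the idea of a predefined strategy applied to an already generated state-space. The rewrite rules terminate here because each one either shortens the job or moves a filter strictly to the left.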
