Evaluating Hadoop for Data-Intensive Scientific Operations

Authors: Zacharia Fadika, Madhusudhan Govindaraju, Richard Canon, Lavanya Ramakrishnan

DOI: 10.1109/CLOUD.2012.118

Keywords: Supercomputer, Computer science, Cloud computing, Software, Data-intensive computing, Database, Distributed computing, File system, Wireless sensor network

Abstract: Emerging sensor networks, more capable instruments, and ever increasing simulation scales are generating data at a rate that exceeds our ability to effectively manage, curate, analyze, and share it. Data-intensive computing is expected to revolutionize the next-generation software stack. Hadoop, an open source implementation of the MapReduce model, provides a way for large data volumes to be seamlessly processed through the use of commodity computers. The inherent parallelization, synchronization, and fault-tolerance the model offers make it ideal for highly-parallel data-intensive applications. MapReduce and Hadoop have traditionally been used for web data processing and only recently for scientific applications. There is limited understanding of the performance characteristics that scientific data-intensive applications can obtain from MapReduce and Hadoop. Thus, it is important to evaluate Hadoop specifically for data-intensive scientific operations (filter, merge, and reorder) to understand its various design considerations and performance trade-offs. In this paper, we evaluate Hadoop for these data operations in the context of High Performance Computing (HPC) environments to understand the impact of the file system, network, and programming modes on performance.
