Authors: Zacharia Fadika, Madhusudhan Govindaraju, Richard Canon, Lavanya Ramakrishnan
Keywords: Supercomputer, Computer science, Cloud computing, Software, Data-intensive computing, Database, Distributed computing, File system, Wireless sensor network
Abstract: Emerging sensor networks, more capable instruments, and ever increasing simulation scales are generating data at a rate that exceeds our ability to effectively manage, curate, analyze, and share it. Data-intensive computing is expected to revolutionize the next-generation software stack. Hadoop, an open source implementation of the MapReduce model, provides a way for large volumes of data to be seamlessly processed through the use of commodity computers. The inherent parallelization, synchronization, and fault-tolerance it offers make it ideal for highly-parallel data-intensive applications. Hadoop has traditionally been used for web data processing and only recently for scientific applications. There is limited understanding of the performance characteristics that data-intensive applications can obtain from Hadoop. Thus, it is important to evaluate Hadoop specifically for data-intensive operations -- filter, merge, and reorder -- to understand its various design considerations and trade-offs. In this paper, we evaluate Hadoop for these operations in the context of High Performance Computing (HPC) environments to understand the impact of the file system, network, and programming modes on performance.