ParSA: High-throughput scientific data analysis framework with distributed file system

Authors: Tao Zhang, XiangZheng Sun, Wei Xue, Nan Qiao, Huang Huang

DOI: 10.1016/J.FUTURE.2014.10.015

Keywords: Visualization, Scheduling (computing), Distributed computing, NetCDF, Computer science, Scalability, Distributed File System, Parallel computing, Data-intensive computing

Abstract: Scientific data analysis and visualization have become key components of today's large-scale simulations. Due to the rapidly increasing data volume and the awkward I/O patterns among highly structured files, known serial methods and tools cannot scale well and usually lead to poor performance over traditional architectures. In this paper, we propose a new framework, ParSA (parallel scientific data analysis), for high-throughput and scalable scientific data analysis with a distributed file system. ParSA presents optimization strategies for grouping and splitting logical units to utilize the distributed I/O property of the distributed file system, for scheduling the distribution of block replicas to reduce network reading, and for maximizing the overlap of data reading, processing, and transferring during computation. Besides, ParSA provides interfaces similar to the NetCDF Operator (NCO), which is used in most climate data diagnostic packages, making the framework easy to use. We use ParSA to accelerate well-known analysis methods for climate models on the Hadoop Distributed File System (HDFS). Experimental results demonstrate the high efficiency and scalability of ParSA, which reaches a maximum throughput of 1.3 GB/s on a six-node cluster with five disks per node. By contrast, only 392 MB/s can be obtained on RAID-6 storage.
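The abstract's idea of using block-replica placement to reduce network reading can be illustrated with a minimal sketch. This is not ParSA's actual scheduler, which the abstract does not specify; it is a generic greedy locality-aware assignment. The function name `assign_tasks`, the block and node names, and the tie-breaking rules are all hypothetical and chosen only for illustration.

```python
def assign_tasks(block_replicas, nodes):
    """Greedy locality-aware assignment (illustrative, not ParSA's algorithm).

    block_replicas: dict mapping block id (str) -> list of nodes holding a replica.
    nodes: list of compute node names.
    Returns (assignment, remote_reads), where assignment maps block id -> node
    and remote_reads counts blocks that must be read over the network.
    """
    load = {n: 0 for n in nodes}
    assignment = {}
    remote_reads = 0
    # Place blocks with the fewest replicas first: they have the least choice.
    for block in sorted(block_replicas, key=lambda b: (len(block_replicas[b]), b)):
        # Only replicas stored on a compute node allow a local read.
        local = [n for n in block_replicas[block] if n in load]
        if local:
            candidates = local
        else:
            candidates = nodes
            remote_reads += 1
        # Among the candidates, pick the least-loaded node (name breaks ties).
        target = min(candidates, key=lambda n: (load[n], n))
        assignment[block] = target
        load[target] += 1
    return assignment, remote_reads


if __name__ == "__main__":
    # Hypothetical replica map: block b3 only lives on a node outside the cluster.
    replicas = {"b0": ["n1", "n2"], "b1": ["n1"], "b2": ["n2", "n3"], "b3": ["n9"]}
    plan, remote = assign_tasks(replicas, ["n1", "n2", "n3"])
    print(plan, remote)
```

Processing the least-replicated blocks first is one simple heuristic for keeping reads local while balancing load; a real scheduler would also need to account for block sizes, disk contention, and the overlap of reading, processing, and transferring that the abstract mentions.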
