Reliable and scalable checkpointing systems for distributed computing environments

Authors: Saurabh Bagchi, Tanzima Zerin Islam

Abstract: By leveraging enormous computational capabilities, scientists today are able to make significant progress on problems ranging from finding a cure for cancer to using fusion to solve the world's clean energy crisis. The number of components in extreme-scale computing environments is growing exponentially. Since the failure rate of each component starts factoring in, the reliability of the overall system decreases proportionately. Hence, in spite of having these capabilities, groundbreaking simulations may never run to completion. The only way to ensure their timely completion is to make these systems reliable, so that no failure can hinder the progress of science. On such systems, long-running scientific applications periodically store their execution states in checkpoint files on stable storage, and recover from a failure by restarting from the last saved checkpoint file. Resilient high-throughput and high-performance systems enable applications to simulate problems at granularities finer than ever thought possible. Unfortunately, this explosion in capabilities generates large amounts of state. As a result, today's checkpointing systems crumble under the increased amount of checkpoint data. Additionally, network and I/O bandwidth is not growing nearly as fast as compute cycles. These two factors have caused scalability challenges for checkpointing systems. The focus of this thesis is to develop scalable checkpointing systems for different environments: grids and clusters.

In a grid environment, machine owners voluntarily share their idle CPU cycles with other users of the system, as long as the performance degradation of host processes remains under a certain threshold. The challenge in such an environment is to ensure end-to-end application performance given the high rate of machine unavailability and of guest-job eviction. Today's systems often use expensive, dedicated checkpoint servers. In this thesis, we present a system, FALCON, which uses the available disk resources of grid machines as shared checkpoint repositories. However, an unavailable storage host may lead to loss of checkpoint data. Therefore, we model the failures of storage hosts and predict the availability of checkpoint repositories. Experiments on the production DiaGrid system show that FALCON improves the performance of benchmark applications that write gigabytes of checkpoint data by between 11% and 44% compared to the widely used Condor checkpointing solutions.

In high-performance computing (HPC) systems, applications store checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. These overheads force large-scale applications to reduce their checkpoint frequency, which means more compute time is lost in the event of a failure. We alleviate this problem by developing MCRENGINE. MCRENGINE aggregates checkpoints from multiple application processes, using knowledge of the data semantics available through I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves the compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation shows that MCRENGINE reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression. We believe that the contributions made in this thesis serve as a good foundation for further research on improving the scalability and reliability of large-scale, distributed computing environments.
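The difference between simple concatenation and semantics-aware aggregation can be illustrated with a minimal Python sketch. This is not MCRENGINE's implementation: it assumes each process checkpoint is a dictionary from variable name to raw bytes (a stand-in for what an I/O library such as HDF5 or netCDF would expose), that every process holds the same variables with the same sizes, and it uses `zlib` as an arbitrary compressor. The names `make_checkpoint`, `compress_concatenated`, `compress_grouped`, and `restore_grouped` are hypothetical.

```python
import random
import zlib


def make_checkpoint(rank, n=4096):
    """Build a toy checkpoint for one process (illustrative data only)."""
    rng = random.Random(rank)  # seeded so the example is reproducible
    return {
        "flags": bytes([rank % 2]) * n,                            # highly regular
        "temperature": bytes(rng.randrange(16) for _ in range(n)),  # noisier
    }


def compress_concatenated(checkpoints):
    """Baseline: append each process's whole checkpoint, then compress."""
    blob = b"".join(
        b"".join(ckpt[name] for name in sorted(ckpt)) for ckpt in checkpoints
    )
    return zlib.compress(blob)


def compress_grouped(checkpoints):
    """Place same-named variables from all processes side by side, then compress."""
    names = sorted(checkpoints[0])
    blob = b"".join(ckpt[name] for name in names for ckpt in checkpoints)
    return zlib.compress(blob)


def restore_grouped(payload, layout, num_procs):
    """Invert compress_grouped; layout is a list of (name, size) in sorted-name order."""
    blob = zlib.decompress(payload)
    checkpoints = [{} for _ in range(num_procs)]
    offset = 0
    for name, size in layout:
        for rank in range(num_procs):
            checkpoints[rank][name] = blob[offset:offset + size]
            offset += size
    return checkpoints


checkpoints = [make_checkpoint(rank) for rank in range(4)]
print("concatenated:", len(compress_concatenated(checkpoints)), "bytes")
print("grouped:     ", len(compress_grouped(checkpoints)), "bytes")
```

Whether grouping actually helps depends on how similar the same variable is across processes and on the compressor's window size, so the printed sizes should be read as an experiment rather than a reproduction of the reported results.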
