Checkpointing as a service in heterogeneous cloud environments

Authors: Jiajun Cao, Matthieu Simonin, Gene Cooperman, Christine Morin

DOI: 10.1109/CCGRID.2015.160

Abstract: A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application, and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher-priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.
