Adaptive Two-Level Blocking Coordinated Checkpointing for High Performance Cluster Computing Systems

Authors: Seyed Ahmad Motamedi, Mehdi Lotfi

DOI: 10.6688/JISE.2010.26.3.14

Keywords: Node (networking), Distributed computing, Reduction (complexity), Computer science, Parallel computing, Computer cluster, Stable storage, Interval (mathematics), Blocking (computing), Fault tolerance, Process (computing)

Abstract: Blocking coordinated checkpointing is a well-known method for achieving fault tolerance in cluster computing systems. In this work, we introduce a new approach to blocking coordinated checkpointing using two-level checkpointing. The first level is local checkpointing, in which nodes save the checkpoints to local disk. If a transient failure occurs in a node, the process can recover from the local checkpoint. The second level is global checkpointing, in which nodes send their checkpoints to highly reliable stable storage. If a permanent failure occurs in a node, that node cannot be used, and the process instead recovers from stable storage. Local checkpoints are taken more frequently than global checkpoints. Also, at the end of each checkpoint interval, the system determines the expected recovery time in case of failure and adaptively either takes a checkpoint or skips it. Experimental results show that the average execution time of the NAS-BT application is significantly reduced by the proposed method. The maximum reduction is 38%.
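The abstract does not state the rule used for the adaptive take-or-skip decision. The sketch below is only an illustration of the two-level idea under an assumed rule: a checkpoint is taken when the expected rework (failure probability in the next interval times the work currently at risk) exceeds the checkpoint cost. All names (`CheckpointState`, `should_checkpoint`, `end_of_interval`, `recovery_source`), the failure rates, and the cost figures are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class CheckpointState:
    """Work (in seconds) that would be lost to each failure type."""
    since_local: float = 0.0   # work done since the last local checkpoint
    since_global: float = 0.0  # work done since the last global checkpoint


def should_checkpoint(work_at_risk, failure_rate, interval, checkpoint_cost):
    """Assumed rule (not from the paper): checkpoint when expected rework > cost.

    The chance of a failure in the next interval is approximated as
    failure_rate * interval, which is reasonable only when that product is small.
    """
    expected_rework = failure_rate * interval * work_at_risk
    return expected_rework > checkpoint_cost


def end_of_interval(state, interval, local_cost, global_cost,
                    transient_rate, permanent_rate):
    """Decide at the end of one interval which checkpoints to take, if any."""
    state.since_local += interval
    state.since_global += interval
    actions = []
    # Level 1: local checkpoint on the node's own disk (guards transient failures).
    if should_checkpoint(state.since_local, transient_rate, interval, local_cost):
        state.since_local = 0.0
        actions.append("local")
    # Level 2: global checkpoint to stable storage (guards permanent failures).
    if should_checkpoint(state.since_global, permanent_rate, interval, global_cost):
        state.since_global = 0.0
        actions.append("global")
    return actions  # an empty list means the checkpoint was skipped


def recovery_source(failure_kind):
    """Pick the checkpoint level to recover from, as described in the abstract."""
    if failure_kind == "transient":
        return "local disk"
    if failure_kind == "permanent":
        return "stable storage"
    raise ValueError(f"unknown failure kind: {failure_kind}")


if __name__ == "__main__":
    state = CheckpointState()
    for step in range(1, 4):
        taken = end_of_interval(state, interval=60, local_cost=5, global_cost=20,
                                transient_rate=1e-3, permanent_rate=1e-4)
        print(step, taken or "skip")
```

With these made-up parameters, local checkpoints are triggered far more often than global ones because the local cost is lower and the transient failure rate is higher, which matches the qualitative behaviour the abstract describes. The paper's actual decision is based on expected recovery time and may differ substantially from this rule.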

References (22)
James Vaigl, Greg Burns, Raja Daoud, LAM: An Open Cluster Environment for MPI (2002)
Junyoung Heo, Jiman Hong, Yookun Cho, Sangho Yi, Taking point decision mechanism for page-level incremental checkpointing based on cost analysis of process execution time, Journal of Information Science and Engineering, vol. 23, pp. 1325-1337 (2007)
Jiman Hong, Yookun Cho, Sangsu Kim, Cost Analysis of Optimistic Recovery Model for Forked Checkpointing, IEICE Transactions on Information and Systems, vol. 86, pp. 1534-1541 (2003)
Richard Graham, Galen Shipman, Brian Barrett, Ralph Castain, George Bosilca, Andrew Lumsdaine, Open MPI: A High-Performance, Heterogeneous MPI, International Conference on Cluster Computing, pp. 1-9 (2006), 10.1109/CLUSTR.2006.311904
Jason Duell, The design and implementation of Berkeley Lab's Linux checkpoint/restart, Lawrence Berkeley National Laboratory (2005), 10.2172/891617
N.H. Vaidya, Impact of checkpoint latency on overhead ratio of a checkpointing scheme, IEEE Transactions on Computers, vol. 46, pp. 942-947 (1997), 10.1109/12.609281
Andrzej Duda, The effects of checkpointing on program execution time, Information Processing Letters, vol. 16, pp. 221-229 (1983), 10.1016/0020-0190(83)90093-5
James S. Plank, Michael G. Thomason, Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, Journal of Parallel and Distributed Computing, vol. 61, pp. 1570-1590 (2001), 10.1006/JPDC.2001.1757
John W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol. 17, pp. 530-531 (1974), 10.1145/361147.361115
Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Vishal Sahay, Andrew Lumsdaine, Jason Duell, Paul Hargrove, Eric Roman, The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing, IEEE International Conference on High Performance Computing, Data, and Analytics, vol. 19, pp. 479-493 (2005), 10.1177/1094342005056139