Adaptive Two-Level Blocking Coordinated Checkpointing for High Performance Cluster Computing Systems

Authors: Seyed Ahmad Motamedi, Mehdi Lotfi

DOI: 10.6688/JISE.2010.26.3.14

Keywords: Node (networking), Distributed computing, Reduction (complexity), Computer science, Parallel computing, Computer cluster, Stable storage, Interval (mathematics), Blocking (computing), Fault tolerance, Process (computing)

Abstract: Blocking coordinated checkpointing is a well-known method for achieving fault tolerance in cluster computing systems. In this work, we introduce a new approach to blocking coordinated checkpointing using two-level checkpointing. The first level is local checkpointing, in which nodes save their checkpoints to local disk. If a transient failure occurs in a node, the process can recover from the local checkpoint. The second level is global checkpointing, in which nodes send their checkpoints to highly reliable stable storage. If a permanent failure occurs in a node, that node cannot be used, and the process recovers from the checkpoint in stable storage. Local checkpoints are taken more frequently than global checkpoints. Also, at the end of each checkpoint interval, the system determines the expected recovery time in case of failure and adaptively either takes a checkpoint or skips it. Experimental results show that the average execution time of the NAS-BT application is significantly reduced by the proposed method. The maximum reduction is 38%.
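The two ideas in the abstract, choosing a recovery source by failure type and deciding at the end of each interval whether to checkpoint or skip, can be illustrated with a minimal Python sketch. The abstract does not give the paper's cost model, so everything below is an assumption made for illustration: the exponential failure model, the comparison rule in `should_checkpoint`, and all function and parameter names are hypothetical and are not taken from the paper.

```python
import math


def failure_probability(failure_rate, duration):
    """Probability of at least one failure within `duration` seconds.

    Assumes exponentially distributed failures (an assumption of this
    sketch, not a model stated in the abstract).
    """
    return 1.0 - math.exp(-failure_rate * duration)


def should_checkpoint(work_since_last, next_interval, failure_rate, checkpoint_cost):
    """Hypothetical adaptive rule: take a checkpoint only if the expected
    rework it would save during the next interval exceeds its cost.

    All times are in seconds; `failure_rate` is failures per second.
    """
    p_fail = failure_probability(failure_rate, next_interval)
    expected_rework_saved = p_fail * work_since_last
    return expected_rework_saved > checkpoint_cost


def recovery_source(failure_kind):
    """Map the failure type to the checkpoint level used for recovery,
    following the two levels described in the abstract."""
    if failure_kind == "transient":
        return "local disk"       # first level: local checkpoint
    if failure_kind == "permanent":
        return "stable storage"   # second level: global checkpoint
    raise ValueError(f"unknown failure kind: {failure_kind!r}")
```

Under these assumptions, a long stretch of unsaved work combined with a high failure rate favors taking the checkpoint, while a short stretch with a very low failure rate favors skipping it. The real decision in the paper is based on the expected recovery time, whose exact form is not available in this record.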



