Authors: Seyed Ahmad Motamedi, Mehdi Lotfi
DOI: 10.6688/JISE.2010.26.3.14
Keywords: Node (networking), Distributed computing, Reduction (complexity), Computer science, Parallel computing, Computer cluster, Stable storage, Interval (mathematics), Blocking (computing), Fault tolerance, Process (computing)
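Abstract: Blocking coordinated checkpointing is a well-known method for achieving fault tolerance in cluster computing systems. In this work, we introduce a new approach to blocking coordinated checkpointing using two-level checkpointing. The first level is local checkpointing, in which nodes save their checkpoints to local disk. If a transient failure occurs in a node, the process can recover from the local checkpoints. The second level is global checkpointing, in which nodes send their checkpoints to highly reliable stable storage. If a permanent failure occurs in a node, that node cannot be used, and the process recovers from the checkpoints in stable storage. Local checkpoints are taken more frequently than global checkpoints. In addition, at the end of each checkpoint interval, the system determines the expected recovery time in case of failure and adaptively takes a local checkpoint, takes a global checkpoint, or skips the checkpoint. Experimental results show that the average execution time of the NAS-BT application is significantly reduced by the proposed method. The maximum reduction in average execution time is 38%.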
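The adaptive step at the end of each checkpoint interval can be illustrated with a minimal Python sketch. The abstract does not give the paper's formula for expected recovery time, so everything here is an assumption made for illustration: the simple cost model, the function name `choose_action`, and all of its parameters are hypothetical. The sketch also assumes that a global checkpoint protects against both transient and permanent failures.

```python
def choose_action(work_since_local, work_since_global,
                  p_transient, p_permanent,
                  local_cost, global_cost):
    """Pick "skip", "local", or "global" at the end of a checkpoint interval.

    Hypothetical cost model (not taken from the paper):
      work_since_local  -- seconds of work done since the last local checkpoint
      work_since_global -- seconds of work done since the last global checkpoint
      p_transient       -- assumed probability of a transient failure next interval
      p_permanent       -- assumed probability of a permanent failure next interval
      local_cost        -- seconds needed to write a checkpoint to local disk
      global_cost       -- seconds needed to send a checkpoint to stable storage
    """
    # Expected rework if a failure hits: a transient failure rolls back to the
    # last local checkpoint, a permanent failure to the last global checkpoint.
    expected_time = {
        # No checkpoint: pay nothing now, keep both exposures.
        "skip": p_transient * work_since_local + p_permanent * work_since_global,
        # Local checkpoint: pay its cost, remove the transient exposure only.
        "local": local_cost + p_permanent * work_since_global,
        # Global checkpoint: pay its cost, remove both exposures (assumption).
        "global": global_cost,
    }
    # Choose the action with the smallest expected time overhead.
    return min(expected_time, key=expected_time.get)


if __name__ == "__main__":
    # Little work at risk: checkpointing costs more than it saves.
    print(choose_action(10, 10, 0.01, 0.001, 5, 30))
    # Much work since the last local checkpoint: a cheap local checkpoint pays off.
    print(choose_action(1000, 1000, 0.05, 0.001, 5, 30))
    # Very long time since the last global checkpoint: go to stable storage.
    print(choose_action(100, 100000, 0.05, 0.001, 5, 30))
```

Under this assumed model, local checkpoints are chosen more often than global ones whenever `local_cost` is much smaller than `global_cost`, which matches the qualitative behavior described in the abstract. The actual decision rule in the paper may differ.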