Authors: Kalman Zvi Meth, Adnan M. Agbaria
DOI:
Keywords:
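Abstract: A complete and consistent set of checkpoint files is captured and identified for use in restarting a parallel program. When each process of a parallel program takes a checkpoint, it creates a checkpoint file. The checkpoint file is named, and part of that name includes a version number for the checkpoint file. When the parallel program is to be restarted, each process identifies its most current valid checkpoint file. It provides the version number of this file to a coordinating process. The coordinating process then decides which version of the checkpoint files is valid and consistent for all of the processes participating in the restart. Once this version number is determined, it is forwarded to the processes, and the processes restore themselves using the corresponding checkpoint file having that particular version number.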
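
The restart handshake described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The file-naming scheme (`<program>.<rank>.ckpt.<version>`), the function names, and the selection rule are all assumptions made for illustration. In particular, the abstract does not say how the coordinating process decides; the sketch assumes each process still holds every version up to the one it reports, so the lowest reported version is available to all participants.

```python
import re
from typing import Dict, Iterable, Optional

# Hypothetical naming scheme: <program>.<rank>.ckpt.<version>
_NAME = re.compile(r"^(?P<prog>.+)\.(?P<rank>\d+)\.ckpt\.(?P<ver>\d+)$")


def checkpoint_name(program: str, rank: int, version: int) -> str:
    """Build a checkpoint file name that embeds the version number."""
    return f"{program}.{rank}.ckpt.{version}"


def latest_version(names: Iterable[str], program: str, rank: int) -> Optional[int]:
    """Return the highest version among one process's checkpoint files.

    Validity checking (for example, detecting a partially written file) is
    out of scope; every name passed in is assumed to refer to a valid file.
    """
    versions = []
    for name in names:
        m = _NAME.match(name)
        if m and m["prog"] == program and int(m["rank"]) == rank:
            versions.append(int(m["ver"]))
    return max(versions, default=None)


def choose_restart_version(reported: Dict[int, Optional[int]]) -> Optional[int]:
    """Coordinator step: pick one version number for all processes.

    Assumption: each process still holds every version up to the one it
    reports, so the minimum reported version exists everywhere.
    """
    if not reported or any(v is None for v in reported.values()):
        return None  # at least one process has no usable checkpoint
    return min(reported.values())


# Process 0 completed three checkpoints; process 1 completed only two.
files = {
    0: [checkpoint_name("sim", 0, v) for v in (1, 2, 3)],
    1: [checkpoint_name("sim", 1, v) for v in (1, 2)],
}
reported = {rank: latest_version(names, "sim", rank) for rank, names in files.items()}
restart_version = choose_restart_version(reported)
print(restart_version)  # prints 2
```

Each process would then restore itself from `checkpoint_name("sim", rank, restart_version)`. Taking the minimum is only one possible policy; a coordinator could instead intersect explicit lists of valid versions if processes may be missing intermediate files.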