Authors: Jiajun Cao, Matthieu Simonin, Gene Cooperman, Christine Morin
Keywords:
Abstract: A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application, and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher-priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.