Towards Fast Crash-Consistent Cluster Checkpointing

Authors: Andrew Wood, Moshik Hershcovitch, Ilias Ennmouri, Weiyu Zong, Saurav Chennuri

Abstract: Machine learning models are expensive to train: they require costly high-compute hardware and long training times. Model training is therefore especially sensitive to program faults and unexpected system crashes, which can erase hours, if not days, of work. Although many strategies exist to mitigate the risk of unexpected system downtime, the most popular one in machine learning is checkpointing: periodically saving the state of the model to persistent storage. Checkpointing is effective, but it requires carefully balancing two factors: how often a checkpoint is made (the checkpointing schedule) and the cost of creating each checkpoint. In this paper, we leverage Python Memory Manager (PyMM), which provides Python support for Persistent Memory, and emerging Persistent Memory technology (Optane DC) to accelerate the checkpointing operation while …
