作者: Andrew Wood , Moshik Hershcovitch , Ilias Ennmouri , Weiyu Zong , Saurav Chennuri
DOI:
关键词:
摘要: Machine Learning models are expensive to train: they require expensive high-compute hardware and have long training times. Therefore, models are extra sensitive to program faults or unexpected system crashes, which can erase hours if not days worth of work. While there are plenty of strategies designed to mitigate the risk of unexpected system downtime, the most popular strategy in machine learning is called checkpointing: periodically saving the state of the model to persistent storage. Checkpointing is an effective strategy, however, it requires carefully balancing two operations: how often a checkpoint is made (the checkpointing schedule), and the cost of creating a checkpoint itself. In this paper, we leverage Python Memory Manager (PyMM), which provides Python support for Persistent Memory and emerging Persistent Memory technology (Optane DC) to accelerate the checkpointing operation while …