Evaluating Multi-Level Checkpointing for Distributed Deep Neural Network Training
Parallel Programming Languages and Models
Reliability and Resiliency
Time: Monday, 15 November 2021, 2pm - 2:30pm CST
Description: Deep learning (DL) applications are becoming some of the most important workloads for HPC and cloud systems. The massive datasets and deep neural networks (DNNs) used by DL applications introduce many HPC challenges, and the long-running training jobs they require are increasingly exposed to system failures; HPC checkpoint/restart tools are therefore an attractive choice. However, most data-parallel DL training jobs use a naive scheme called root checkpointing, in which all ranks block while a single root process writes the checkpoint, stalling forward progress. In this work, we apply a multi-level checkpointing tool (SCR-Exa) to distributed DL applications. We examine the performance of two DNN models at scale on Lassen (a leading TOP500 system), while verifying that DNN accuracy is maintained after restart from simulated system failures. Our results show that multi-level checkpointing schemes achieve nearly constant overhead at scale. To the best of our knowledge, this study presents the first evaluation to demonstrate strong scalability of a checkpointing scheme for distributed DL without framework-specific changes.
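To make the "root checkpointing" pattern the abstract critiques concrete, here is a minimal sketch (not from the talk itself): in data-parallel training every rank holds a replica of the model, so the naive scheme has rank 0 alone serialize the state while all other ranks block at a barrier. Ranks, the barrier, and the model state are simulated here with stdlib threads and JSON; a real job would use MPI or a DL framework's distributed runtime, and the function names are illustrative.

```python
# Sketch of root checkpointing: rank 0 writes, everyone else blocks.
import json
import os
import tempfile
import threading

NUM_RANKS = 4
barrier = threading.Barrier(NUM_RANKS)  # stand-in for an MPI barrier
checkpoint_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")

def train_step(rank, step):
    # Placeholder for one optimizer step on this rank's model replica.
    return {"step": step, "weights": [0.1 * step] * 3}

def root_checkpoint(rank, state):
    # Only rank 0 writes; every other rank idles at the barrier until the
    # (potentially slow) write to the shared file system completes.
    if rank == 0:
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    barrier.wait()  # blocking semantics: no rank proceeds early

def worker(rank):
    state = None
    for step in range(3):
        state = train_step(rank, step)
    root_checkpoint(rank, state)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

with open(checkpoint_path) as f:
    restored = json.load(f)
print(restored["step"])  # step at which a restarted job would resume
```

Multi-level schemes such as SCR avoid this serialization by letting each rank write its own checkpoint fragment to fast local storage first, flushing to the parallel file system asynchronously, which is what enables the nearly constant overhead reported above.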