SC21 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Evaluating Multi-Level Checkpointing for Distributed Deep Neural Network Training


Workshop: SuperCheck-SC21: Second International Symposium on Checkpointing for Supercomputing

Authors: Quentin Anthony (X-ScaleSolutions, Ohio State University) and Donglai Dai (X-ScaleSolutions)


Abstract: Deep learning (DL) applications are becoming one of the most important applications for HPC and cloud systems. The massive datasets and deep neural networks (DNN) used by DL applications introduce many HPC challenges. Therefore, HPC checkpoint/restart tools are an attractive choice. However, most data-parallel DL training jobs use a naive scheme called root checkpointing, which is subject to blocking semantics and straggling forward progress. In this work, we apply a multi-level checkpointing tool (SCR-Exa) to distributed DL applications. We examine the performance of two DNN models at scale on Lassen (a leading TOP500 system), while ensuring the DNN's accuracy is maintained after restart from simulated system failures. Our test results show that multi-level checkpointing schemes are able to achieve nearly constant overhead at scale. To the best of our knowledge, this study presents the first evaluation to demonstrate strong scalability of a checkpointing scheme for distributed DL without making framework-specific changes.
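The contrast the abstract draws between root checkpointing and multi-level checkpointing can be illustrated with a minimal single-process sketch. It is not the SCR-Exa API and not the authors' implementation: the function names, the `flush_every` parameter, the file naming, and the use of plain directories to stand in for node-local storage and a parallel file system are all assumptions made for illustration.

```python
import os
import pickle
import shutil
import tempfile


def root_checkpoint(rank, state, pfs_dir, step):
    """Root checkpointing (hypothetical helper): only rank 0 writes the
    model replica to shared storage. In a real job the other ranks would
    wait at a barrier until the write finishes."""
    if rank != 0:
        return None
    path = os.path.join(pfs_dir, f"step{step}.pkl")
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path


def multilevel_checkpoint(rank, state, local_dir, pfs_dir, step, flush_every=4):
    """Multi-level checkpointing (hypothetical helper).
    Level 1: every rank writes to fast node-local storage at every call.
    Level 2: when step is a multiple of flush_every (an assumed policy),
    the file is also copied to the slower shared file system."""
    name = f"rank{rank}_step{step}.pkl"
    local_path = os.path.join(local_dir, name)
    with open(local_path, "wb") as f:
        pickle.dump(state, f)
    written = [local_path]
    if step % flush_every == 0:
        pfs_path = os.path.join(pfs_dir, name)
        shutil.copy(local_path, pfs_path)
        written.append(pfs_path)
    return written


def restart(rank, local_dir, pfs_dir, step):
    """Prefer the node-local copy; fall back to shared storage if the
    local copy was lost (for example, after a node failure)."""
    name = f"rank{rank}_step{step}.pkl"
    for directory in (local_dir, pfs_dir):
        path = os.path.join(directory, name)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
    raise FileNotFoundError(name)
```

In this sketch, a checkpoint taken at a flushed step survives the loss of the node-local directory, while one taken between flushes can only be restored from the local copy. Real tools add further levels and redundancy schemes that are omitted here.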




