SC21 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Towards Aggregated Asynchronous Checkpointing


Workshop: SuperCheck-SC21: Second International Symposium on Checkpointing for Supercomputing

Authors: Mikaila Gossman (Clemson University)


Abstract: High-Performance Computing (HPC) applications need to checkpoint massive amounts of data at scale and with increasing frequency. Multi-level asynchronous checkpoint runtimes like VELOC (Very Low Overhead Checkpoint Strategy) are gaining popularity among application scientists for their ability to leverage fast node-local storage and flush independently to stable, external storage (e.g., parallel file systems) in the background. Currently, VELOC adopts a one-file-per-process flush strategy, which results in a large number of files being written to external storage, thereby overwhelming metadata servers and making it difficult to transfer and access checkpoints as a whole. This paper discusses the challenges and opportunities of designing aggregation techniques for asynchronous multi-level checkpointing. To this end, we implement and study two aggregation strategies, examine their limitations, and propose a new aggregation strategy designed specifically for asynchronous multi-level checkpointing.
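
The contrast between a one-file-per-process flush and an aggregated flush can be illustrated with a minimal sketch. This is not VELOC's API and not one of the strategies studied in the paper: the function names, the file naming scheme, and the in-memory list standing in for per-process checkpoints are all hypothetical. The aggregated variant assumes each process's checkpoint size is known up front, so that an exclusive prefix sum gives every process a non-overlapping offset in one shared file.

```python
import os
import tempfile
from itertools import accumulate


def flush_file_per_process(checkpoints, out_dir):
    """Write one file per rank (hypothetical naming); return the paths created."""
    paths = []
    for rank, data in enumerate(checkpoints):
        path = os.path.join(out_dir, f"ckpt_rank{rank}.dat")
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
    return paths


def flush_aggregated(checkpoints, out_path):
    """Write all ranks into one shared file; return each rank's byte offset."""
    sizes = [len(c) for c in checkpoints]
    # Exclusive prefix sum: rank i starts where ranks 0..i-1 end.
    offsets = [0] + list(accumulate(sizes))[:-1]
    with open(out_path, "wb") as f:
        for offset, data in zip(offsets, checkpoints):
            f.seek(offset)
            f.write(data)
    return offsets


if __name__ == "__main__":
    # Three simulated ranks with checkpoints of different sizes.
    ranks = [b"aa", b"bbbb", b"c"]
    with tempfile.TemporaryDirectory() as d:
        per_process = flush_file_per_process(ranks, d)
        offsets = flush_aggregated(ranks, os.path.join(d, "ckpt_all.dat"))
        print(len(per_process), "files vs. 1 file; offsets:", offsets)
```

In this toy version the file count grows with the number of processes in the first function and stays at one in the second, which is the metadata pressure the abstract refers to. A real asynchronous runtime would additionally have to coordinate offsets across nodes while flushes proceed in the background, which this sketch does not attempt.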




