SC21 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Analyzing Performance of File Aggregation in VELOC


Student: Mikaila Gossman (Clemson University)
Supervisor: Jon C. Calhoun (Clemson University)

Abstract: As High-Performance Computing (HPC) systems and applications continue to grow in size and complexity, checkpointing to stable, external storage often causes I/O contention and degraded performance. Multi-level asynchronous checkpointing systems like VELOC (Very Low Overhead Checkpointing System) are gaining popularity among application scientists for their ability to leverage fast node-local storage and flush independently to stable storage in the background. Currently, VELOC adopts a one-file-per-process flush strategy; in large scientific applications, the resulting hundreds of thousands of files quickly saturate network bandwidth, strain the filesystem, and become unmanageable for application scientists. Thus, applications need to reduce the number of checkpoint files while preserving the I/O efficiency of asynchronous multi-level workflows. This work implements several aggregation strategies in VELOC's asynchronous checkpoint runtime and analyzes their impact on performance.
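
To make the contrast between one-file-per-process flushing and file aggregation concrete, the following is a minimal sketch of one possible aggregation scheme: per-rank checkpoint files are concatenated into a single file, with each rank's offset computed as an exclusive prefix sum of the file sizes. This is an illustration only, not VELOC's implementation or API; the function names `aggregate_checkpoints` and `read_rank` and the index layout are hypothetical, and the abstract does not say which strategies the poster evaluates.

```python
import os
import tempfile
from itertools import accumulate


def aggregate_checkpoints(rank_files, out_path):
    """Concatenate per-rank checkpoint files into one aggregated file.

    Hypothetical helper (not part of VELOC). Returns an index mapping
    rank -> (offset, length) inside the aggregated file.
    """
    sizes = [os.path.getsize(p) for p in rank_files]
    # Exclusive prefix sum: rank i starts where ranks 0..i-1 end.
    offsets = [0] + list(accumulate(sizes))[:-1]
    with open(out_path, "wb") as out:
        for path in rank_files:
            with open(path, "rb") as src:
                out.write(src.read())
    return {rank: (off, size)
            for rank, (off, size) in enumerate(zip(offsets, sizes))}


def read_rank(out_path, index, rank):
    """Recover a single rank's checkpoint bytes from the aggregated file."""
    offset, length = index[rank]
    with open(out_path, "rb") as f:
        f.seek(offset)
        return f.read(length)


if __name__ == "__main__":
    # Simulate three ranks that each wrote a node-local checkpoint file.
    workdir = tempfile.mkdtemp()
    files = []
    for rank, payload in enumerate([b"aaa", b"bb", b"cccc"]):
        path = os.path.join(workdir, f"ckpt-rank{rank}.dat")
        with open(path, "wb") as f:
            f.write(payload)
        files.append(path)

    aggregated = os.path.join(workdir, "ckpt-aggregated.dat")
    index = aggregate_checkpoints(files, aggregated)
    print(index)
    print(read_rank(aggregated, index, 1))
```

In a real parallel setting the offsets would typically come from a collective prefix sum across processes, and each process would write its own region of the shared file; the serial loop here only demonstrates the layout and how a restart could locate one rank's data.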

ACM-SRC Semi-Finalist: no


