
Analyzing Performance of File Aggregation in VELOC
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
In-Person Only
TP
XO / EX
Time
Thursday, 18 November 2021, 8:30am - 5pm CST
Location
Second Floor Atrium
Description
As High-Performance Computing (HPC) systems and applications continue to grow in size and complexity, checkpointing to stable, external storage often results in I/O contention and degraded performance. Multi-level asynchronous checkpointing runtimes like VELOC (Very Low Overhead Checkpointing System) have been gaining popularity among application scientists for their ability to leverage fast node-local storage and flush checkpoints to stable storage independently in the background. Currently, VELOC adopts a one-file-per-process flush strategy; in large scientific applications, the resulting hundreds of thousands of files quickly overwhelm the network bandwidth, the filesystem, and application scientists. Thus, applications need to reduce the number of checkpoint files while preserving the I/O efficiency of asynchronous multi-level workflows. This work integrates different aggregation strategies into VELOC's asynchronous checkpoint runtime and analyzes their impact on performance.
