SC21 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

SuperCheck-SC21 Plenary Talk: Let’s Make MPI and Checkpoint-Restart Libraries Work Better Together


Workshop: SuperCheck-SC21: Second International Symposium on Checkpointing for Supercomputing

Authors: Anthony Skjellum (University of Tennessee, Chattanooga)


Abstract: In this SuperCheck plenary, the audience will undoubtedly hear about newer and better ways to checkpoint and restart scalable applications, typically MPI+X (where X = GPU, OpenMP, or another accelerator). This plenary reviews all the pieces that make up an MPI+X application. It looks at silos, policies, and opportunities for communities to work better together. As a designer of message passing libraries for over 30 years and of MPI implementations since MPI's inception, this speaker seeks to bring his perspective on "the other components" such as resource managers and Checkpoint-Restart (CPR) libraries. Since this is a SuperCheck workshop, the focus on MPI+X with CPR will be at the forefront, but interactions with other important components, including what we could do better, are also mentioned. Opportunities for standardization of more interfaces are described.

The following themes are considered:

* The goal of using MPI+X applications in places where resources are more ephemeral or subject to major cost changes, motivating malleability;

* An MPI designer's viewpoint of how checkpoint-restart systems (both explicit and transparent) fit within a more open-policy, integrated world;

* What MPI-5 and beyond could or should do to help the entire program stack, including the CPR library, work better; and

* The lack of open, potentially standardized mechanisms for multiple components to work smoothly to manage malleable resources, faults, migration, etc.

