SuperCheck-SC21: Plenary Talk – Let’s Make MPI and Checkpoint-Restart Libraries Work Better Together
Time: Monday, 15 November 2021, 9:05am - 10am CST
Description: In this SuperCheck plenary, the audience will undoubtedly hear about newer and better ways to checkpoint and restart scalable applications, typically MPI+X (where X = GPU, OpenMP, or other accelerator). This plenary reviews all the pieces that make up an MPI+X application. It looks at silos, policies, and opportunities for communities to work better together. As a designer of message passing libraries for over 30 years and of MPI implementations since its inception, this speaker seeks to bring his perspective on "the other components" such as resource managers and Checkpoint-Restart (CPR) libraries. Since this is a SuperCheck workshop, the focus on MPI+X with CPR will be at the forefront, but interactions with other important components, including what we could do better, are also mentioned. Opportunities for standardization of more interfaces are described.

The following themes are considered:

* The goal of using MPI+X applications in places where resources are more ephemeral or subject to major cost changes, motivating malleability;

* An MPI designer's viewpoint of how checkpoint-restart systems (both explicit and transparent) fit within a more open-policy, integrated world;

* What MPI-5 and beyond could or should do to help the entire program stack, including the CPR library, work better; and

* The lack of open, potentially standardized mechanisms for multiple components to work smoothly to manage malleable resources, faults, migration, etc.
