Workshop: SuperCheck-SC21: Second International Symposium on Checkpointing for Supercomputing
Authors: Rebecca Hartman-Baker (Lawrence Berkeley National Laboratory (LBNL)), Gene Cooperman (Northeastern University), Bogdan Nicolae (Argonne National Laboratory (ANL)), Sarp Oral (Oak Ridge National Laboratory (ORNL)), and Eric Roman and John Shalf (Lawrence Berkeley National Laboratory (LBNL))
Abstract: Checkpoint/restart (C/R) tools of the past and present are constrained by the software and hardware architectures of the systems they are developed to run on. They then chase the underlying hardware architecture as it evolves. In practical terms, this means C/R tools may not be ready to use until midway through the life span of a cutting-edge supercomputing system. This has severely limited the ability of HPC communities to reap the benefits of C/R. As software and hardware evolve rapidly and become more complicated and heterogeneous, the development cycles for C/R tools will grow longer in order to support new hardware and the workloads that run on it. Can checkpointing tools ever catch up with fast-changing HPC architectures, technologies, and workloads?
This panel brings together experts on application-level and transparent checkpointing, operating systems, I/O and storage systems, computer architectures, and future technologies to debate the most productive approaches for developing ready-to-use checkpointing tools for future HPC systems.