Panel Discussion: Can Checkpoint/Restart Tools Ever Keep Pace with Fast-Changing HPC Architectures, Technologies, and Workloads?
Parallel Programming Languages and Models
Reliability and Resiliency
Time: Monday, 15 November 2021, 3:30pm - 4:30pm CST
Description: Checkpoint/restart (C/R) tools of the past and present are constrained by the software and hardware architectures of the systems they are developed to run on, and they must then chase the underlying hardware architecture as it evolves. In practical terms, this means C/R tools may not be ready to use until midway through the life span of a cutting-edge supercomputing system, which has severely limited the ability of HPC communities to reap the benefits of C/R. Because software and hardware are evolving quickly and becoming more complicated and heterogeneous, the development cycles for C/R tools will grow longer as they add support for new hardware and the workloads that run on it. Can checkpointing tools ever catch up with fast-changing HPC architectures, technologies, and workloads?
This panel brings together experts on both application-level and transparent checkpointing, operating systems, I/O and storage systems, computer architectures and future technologies to debate and discuss the most productive approaches for developing ready-to-use checkpointing tools for future HPC systems.