Remote Participation
Workshop: SuperCheck-SC21: Second International Symposium on Checkpointing for Supercomputing
Event Type: Workshop
Parallel Programming Languages and Models
Reliability and Resiliency
Time: Monday, 15 November 2021, 9am - 5:30pm CST
Description: As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) is essential to a wide range of high performance computing (HPC) communities. While there has been much C/R research and tools development, continued C/R research is indispensable to keep pace with ever-changing HPC architectures, technologies, and workloads. More effort is also needed to narrow the gap between proof-of-concept C/R research codes and production-quality codes capable of deployment in real-world workloads. In this workshop, we will bring together C/R researchers and tools developers, practitioners, application developers, and end users to focus on C/R research and successes in production use, motivating the development of usable C/R tools, the closing of the gap between state-of-the-art research and production, and the harnessing of the full benefits of C/R for the HPC community.
9:00am - 9:01am CST Second International Symposium on Checkpointing for Supercomputing
9:01am - 9:05am CST SuperCheck-SC21: Opening Remarks
9:05am - 10:00am CST SuperCheck-SC21: Plenary Talk – Let’s Make MPI and Checkpoint-Restart Libraries Work Better Together
10:00am - 10:30am CST SuperCheck-SC21: Morning Break (10-10:30)
10:30am - 11:00am CST Toward Aggregated Asynchronous Checkpointing
11:00am - 11:30am CST Checkpoint-Restart Libraries Must Become More Fault Tolerant
11:30am - 12:00pm CST SuperCheck-SC21: Invited Talk – How Realtime Supercomputing Is Powering New Models of Experimental Science
12:00pm - 12:30pm CST MANA-2.0: A Future-Proof Design for Transparent Checkpointing of MPI at Scale
12:30pm - 2:00pm CST SuperCheck-SC21: Lunch Break (12:30-2)
2:00pm - 2:30pm CST Evaluating Multi-Level Checkpointing for Distributed Deep Neural Network Training
2:30pm - 3:00pm CST Erasure-Coding-Based Fault Tolerance for Recommendation Model Training
3:00pm - 3:30pm CST SuperCheck-SC21: Afternoon Break (3-3:30)
3:30pm - 4:30pm CST Panel Discussion: Can Checkpoint/Restart Tools Ever Keep Pace with Fast-Changing HPC Architectures, Technologies, and Workloads?
4:30pm - 4:40pm CST Survey on Checkpoint/Restart
4:40pm - 4:45pm CST SuperCheck-SC21: Closing Remarks
4:45pm - 5:30pm CST C/R Collaboration Updates
