Student: Nigel Tan (University of Tennessee, Knoxville)
Supervisor: Michela Taufer (University of Tennessee, Knoxville)
Abstract: The common checkpointing philosophy, checkpoint everything as frequently as possible, is becoming ineffective as we progress toward exascale machines, where the time between failures is shrinking. This makes portability and resilience vital for the future of HPC. This poster demonstrates the need for, and lays the foundation for, enhancing checkpointing to take advantage of application properties. Specifically, using incremental checkpoints of sparsely updated data as an example, we show how access-pattern-aware checkpointing improves performance. We also define how the portable checkpointing abstractions in Kokkos Resilience can be modified to support such an enhancement transparently.
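To make the idea of incremental checkpoints of sparsely updated data concrete, here is a minimal sketch in Python. It is not the poster's implementation and does not use the Kokkos Resilience API. The chunk size, the hash-based change detection, and the function names (`chunk_hashes`, `incremental_checkpoint`, `restore`) are all assumptions made for illustration, and the sketch assumes the buffer keeps a fixed size between checkpoints.

```python
import hashlib

# Hypothetical chunk size in bytes; kept tiny so the example is easy to follow.
CHUNK_SIZE = 4


def chunk_hashes(data, chunk_size=CHUNK_SIZE):
    """Hash each fixed-size chunk of the buffer."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]


def incremental_checkpoint(data, prev_hashes, chunk_size=CHUNK_SIZE):
    """Return (delta, new_hashes).

    delta maps chunk index -> chunk bytes for every chunk whose hash differs
    from the previous checkpoint. With prev_hashes=None every chunk is stored,
    which amounts to a full checkpoint.
    """
    new_hashes = chunk_hashes(data, chunk_size)
    delta = {}
    for idx, digest in enumerate(new_hashes):
        changed = (
            prev_hashes is None
            or idx >= len(prev_hashes)
            or prev_hashes[idx] != digest
        )
        if changed:
            start = idx * chunk_size
            delta[idx] = data[start:start + chunk_size]
    return delta, new_hashes


def restore(base, deltas, chunk_size=CHUNK_SIZE):
    """Rebuild a buffer by applying deltas, oldest first, on top of base."""
    buf = bytearray(base)
    for delta in deltas:
        for idx, chunk in delta.items():
            start = idx * chunk_size
            buf[start:start + len(chunk)] = chunk
    return bytes(buf)
```

In this sketch, a first call with `prev_hashes=None` stores every chunk. After a single byte of a 16-byte buffer is modified, a second call stores only the one 4-byte chunk containing it, and `restore` applied to both deltas reproduces the modified buffer. The saving grows as the fraction of updated data shrinks, which is the situation the abstract describes.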
ACM-SRC Semi-Finalist: no
Poster: PDF
Poster Summary: PDF