Accelerating Checkpoint/Restart with Lossy Methods
Reliability and Resiliency
Time: Sunday, 14 November 2021, 12:05pm - 12:15pm CST
Description: Approximate computing targets applications that can tolerate a loss of accuracy in their computational results. The essence of approximate computing is to use a data representation that reduces the data size or speeds up computation at the cost of data accuracy.
Data reduction is a desirable goal when using checkpoint-and-restart to ensure the resiliency of an HPC application. Because the performance of supercomputer IO subsystems improves slowly compared with that of computing resources such as CPUs and dynamic memory, IO becomes a bottleneck. Reducing the data before writing a checkpoint can help make checkpointing viable for extreme-scale applications that need to checkpoint frequently.
In this work, we implement and evaluate two approximate checkpoint mechanisms: Precision Bound Differential Checkpointing and Checkpointing with Lossy Compression. Both methods reduce the amount of data and restore an approximate representation of the application state upon recovery.
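The abstract does not spell out how either mechanism works, so the following is only a minimal sketch of the two general ideas, not the authors' implementation. It assumes that a precision-bound differential checkpoint writes only those blocks whose values have drifted from the last saved copy by more than a tolerance, and that lossy compression is uniform quantization under an absolute error bound followed by a lossless pass. All names (`diff_checkpoint`, `restore`, `lossy_compress`, `lossy_decompress`, `tol`, `error_bound`) are hypothetical and chosen for illustration.

```python
import struct
import zlib


def diff_checkpoint(state, last_saved, block_size, tol):
    """Return {block_index: values} for blocks that drifted by more than tol.

    `last_saved` is updated in place so that it always mirrors what a
    restart would reconstruct. (Illustrative assumption, not the paper's API.)
    """
    delta = {}
    for start in range(0, len(state), block_size):
        block = state[start:start + block_size]
        ref = last_saved[start:start + block_size]
        # Skip the block if every element is still within the tolerance.
        if any(abs(a - b) > tol for a, b in zip(block, ref)):
            delta[start // block_size] = list(block)
            last_saved[start:start + block_size] = block
    return delta


def restore(base, deltas, block_size):
    """Rebuild an approximate state from a full base plus ordered deltas."""
    state = list(base)
    for delta in deltas:
        for idx, block in delta.items():
            start = idx * block_size
            state[start:start + len(block)] = block
    return state


def lossy_compress(values, error_bound):
    """Quantize floats to a grid of width 2*error_bound, then deflate."""
    step = 2.0 * error_bound
    codes = [round(v / step) for v in values]
    raw = struct.pack(f"<{len(codes)}q", *codes)  # 64-bit signed integers
    return zlib.compress(raw)


def lossy_decompress(blob, error_bound):
    """Invert lossy_compress; each value is off by about error_bound at most."""
    step = 2.0 * error_bound
    raw = zlib.decompress(blob)
    codes = struct.unpack(f"<{len(raw) // 8}q", raw)
    return [c * step for c in codes]
```

In this sketch the differential variant compares against the last saved copy rather than the previous time step, so the error of a restored state stays within `tol` instead of accumulating across checkpoints. Production error-bounded compressors are considerably more sophisticated than the quantize-and-deflate scheme shown here.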