Accelerating Checkpoint/Restart with Lossy Methods

SC21 Proceedings

Workshop: FTXS: Workshop on Fault-Tolerance for HPC at Extreme Scale

Authors: Kevser Ildes (Marmara University, Turkey); Athanasios Kastoras (University of Thessaly, Greece); and Kai Keller and Leonardo Bautista Gomez (Barcelona Supercomputing Center (BSC))

Abstract: Approximate computing targets applications that can tolerate a loss of accuracy in their computational results. The essence of approximate computing is to use a data representation that reduces the data size or speeds up computation at the cost of data accuracy.
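
As a minimal illustration of trading accuracy for size, the sketch below stores double-precision values as 4-byte single-precision floats. It is not taken from the paper; the helper name `to_single_precision` and the sample values are hypothetical.

```python
import struct


def to_single_precision(values):
    """Store doubles as 4-byte IEEE 754 floats and read them back.

    Hypothetical helper for illustration only. Returns the packed bytes
    and the approximate values recovered from them.
    """
    fmt = f"<{len(values)}f"  # little-endian, 4 bytes per value
    packed = struct.pack(fmt, *values)
    restored = list(struct.unpack(fmt, packed))
    return packed, restored


values = [0.1, 1.0 / 3.0, 2.5]
packed, restored = to_single_precision(values)

# Size of the same data stored as 8-byte doubles, for comparison.
full_size = len(struct.pack(f"<{len(values)}d", *values))

# Largest absolute error introduced by the reduced representation.
max_error = max(abs(a - b) for a, b in zip(values, restored))
```

The reduced representation takes half the space of the double-precision one, and values that are not exactly representable in single precision come back slightly changed.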

Data reduction is a desirable goal when leveraging checkpoint-and-restart to ensure the resiliency of an HPC application. Because the performance of the I/O subsystem of supercomputers increases slowly compared to that of computing resources such as CPU and dynamic memory, I/O becomes a bottleneck. Reducing the data before writing a checkpoint can help to provide viable checkpointing solutions for extreme-scale applications that need frequent checkpointing.

In this work, we implement and evaluate two approximate checkpoint mechanisms: Precision Bound Differential Checkpointing and Checkpointing with Lossy Compression. Both methods reduce the amount of data and restore an approximate representation of the application state upon recovery.
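
The abstract does not describe how either mechanism works internally, so the following is only a sketch of one plausible reading, not the authors' implementation. It assumes that a precision-bounded differential checkpoint rewrites only those blocks whose values drifted from the last stored copy by more than a tolerance, and that lossy compression can be approximated by uniform quantization followed by a lossless compressor. All names (`diff_checkpoint`, `lossy_compress`, `lossy_decompress`, `BLOCK`, `tol`) are hypothetical.

```python
import struct
import zlib

BLOCK = 4  # values per block; hypothetical, real systems use far larger blocks


def diff_checkpoint(state, baseline, tol):
    """Write only the blocks that differ from the stored copy by more than tol.

    `baseline` holds the values a restart would see. It is updated in place
    for every block that gets written. Returns {block_index: values}.
    """
    written = {}
    for start in range(0, len(state), BLOCK):
        cur = state[start:start + BLOCK]
        old = baseline[start:start + BLOCK]
        if any(abs(c - o) > tol for c, o in zip(cur, old)):
            written[start // BLOCK] = list(cur)
            baseline[start:start + BLOCK] = cur
    return written


def lossy_compress(state, tol):
    """Quantize to multiples of 2*tol, then compress the integer codes."""
    step = 2 * tol
    codes = [round(v / step) for v in state]
    raw = struct.pack(f"<{len(codes)}q", *codes)  # 8-byte signed integers
    return zlib.compress(raw)


def lossy_decompress(blob, tol):
    """Recover approximate values; each is within about tol of the original."""
    raw = zlib.decompress(blob)
    codes = struct.unpack(f"<{len(raw) // 8}q", raw)
    return [c * 2 * tol for c in codes]
```

In this sketch, each block is compared against the stored copy rather than against the previous iteration, so the error seen on restart stays within `tol` per value instead of accumulating across checkpoints.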
