Presentation

Accelerating Checkpoint/Restart with Lossy Methods

Session

FTXS: Workshop on Fault-Tolerance for HPC at Extreme Scale

Author/Presenters

Kevser Ildes

Athanasios Kastoras

Kai Keller

Leonardo Bautista Gomez

Event Type

Workshop

Time

Sunday, 14 November 2021, 12:05pm - 12:15pm CST

Location

Online

Description

Approximate computing targets applications that can tolerate a loss of accuracy in their computational results. Its essence is to use a data representation that reduces the data size or speeds up computation at the cost of accuracy.
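As a rough illustration of trading accuracy for size, the sketch below stores double-precision values in single precision. It is not taken from the presented work; the helper name `to_single_precision` and the sample data are assumptions made for this example only.

```python
from array import array


def to_single_precision(values):
    """Hypothetical helper: store the given floats as 4-byte singles."""
    return array("f", list(values))


# Sample application state held in double precision (8 bytes per value).
state = array("d", [0.1 * i for i in range(1000)])
reduced = to_single_precision(state)

# Size of each representation in bytes.
full_bytes = state.itemsize * len(state)
reduced_bytes = reduced.itemsize * len(reduced)

# Largest absolute deviation introduced by the smaller representation.
max_abs_error = max(abs(a - b) for a, b in zip(state, reduced))

print(full_bytes, reduced_bytes, max_abs_error)
```

The reduced array takes half the space, and the price is a small but nonzero error in every value that single precision cannot represent exactly.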

Data reduction is desirable when using checkpoint/restart to make an HPC application resilient. Because the performance of supercomputer I/O subsystems grows slowly compared with that of compute resources such as CPUs and memory, I/O becomes a bottleneck. Reducing the data before writing a checkpoint can help make checkpointing viable for extreme-scale applications that need to checkpoint frequently.

In this work, we implement and evaluate two approximate checkpoint mechanisms: Precision Bound Differential Checkpointing and Checkpointing with Lossy Compression. Both methods reduce the amount of data and restore an approximate representation of the application state upon recovery.
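The abstract does not describe how either mechanism works internally, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a precision-bound differential checkpoint rewrites only those blocks in which some value has moved by more than a tolerance since the last checkpointed state, and that lossy compression quantizes values within an absolute error bound before applying a lossless compressor. All function names, the block size, and the sample data are hypothetical.

```python
import struct
import zlib

BLOCK = 4  # values per block; tiny on purpose, for illustration only


def changed_blocks(previous, current, tolerance, block=BLOCK):
    """Return {block_index: values} for blocks where any value differs
    from the last checkpointed state by more than `tolerance`."""
    delta = {}
    for start in range(0, len(current), block):
        old = previous[start:start + block]
        new = current[start:start + block]
        if any(abs(a - b) > tolerance for a, b in zip(old, new)):
            delta[start // block] = list(new)
    return delta


def apply_delta(base, delta, block=BLOCK):
    """Rebuild an approximate state from the base checkpoint plus a delta."""
    restored = list(base)
    for index, values in delta.items():
        start = index * block
        restored[start:start + len(values)] = values
    return restored


def lossy_compress(values, error_bound):
    """Quantize to multiples of 2 * error_bound, then deflate with zlib."""
    step = 2.0 * error_bound
    codes = [round(v / step) for v in values]
    return zlib.compress(struct.pack(f"<{len(codes)}q", *codes))


def lossy_decompress(blob, error_bound):
    """Invert lossy_compress; each value is off by about error_bound at most."""
    step = 2.0 * error_bound
    raw = zlib.decompress(blob)
    codes = struct.unpack(f"<{len(raw) // 8}q", raw)
    return [c * step for c in codes]


# Differential checkpoint: only the block with a significant change is kept.
base = [float(i) for i in range(16)]
current = list(base)
current[1] += 1e-9  # below the tolerance, block 0 is skipped
current[9] += 0.5   # above the tolerance, block 2 is rewritten
delta = changed_blocks(base, current, tolerance=1e-6)
restored = apply_delta(base, delta)

# Lossy compression: the restored values stay within the error bound.
field = [0.001 * i for i in range(1000)]
blob = lossy_compress(field, error_bound=1e-3)
approx = lossy_decompress(blob, error_bound=1e-3)
```

In both cases the restored state is only an approximation of the original. In the differential variant, `previous` should be the state as the checkpoints would restore it rather than the true previous state, otherwise skipped changes can accumulate beyond the tolerance.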

Author/Presenters

Kevser Ildes

Marmara University, Turkey

Athanasios Kastoras

University of Thessaly, Greece

Kai Keller

Barcelona Supercomputing Center (BSC)

Leonardo Bautista Gomez

Barcelona Supercomputing Center (BSC)
