Relaxed Replication for Energy Efficient and Resilient GPU Computing
Event Type
Online Only
Extreme Scale Computing
Reliability and Resiliency
Time: Sunday, 14 November 2021, 4:05pm - 4:30pm CST
Description: Power and reliability are two intertwined challenges in GPU-accelerated large-scale computing. Aggressive power reduction pushes hardware to its operating limit and increases the failure rate. Resilience allows programs to make progress when subjected to faults and is an integral component of large-scale systems, but it incurs significant time and energy overhead. Managing power and resilience together is challenging because CPUs and GPUs differ in compute capability, power consumption, and failure rates. Previous work has shown that redundancy-based approaches are more energy efficient than checkpoint/restart at extreme scale, but current solutions only support parallel programs running on CPU-based homogeneous systems. Simply extending redundancy approaches from CPU-based systems results in sub-optimal performance and/or energy efficiency because existing redundancy solutions typically rely on identical replicas with expensive synchronization. In this work, we explore redundancy and energy-efficiency techniques for GPU-accelerated systems running MPI parallel workloads. Specifically, we design a novel redundancy technique that relaxes the requirement that replica processes be synchronized and identical. Replicas can instead run in lower precision and at lower power/performance states, with periodic rejuvenation or asynchronization, which reduces resource and power consumption.

This relaxed replication mechanism complicates fault detection and recovery compared with homogeneous exact replication. We discuss techniques to handle and mitigate these complexities for both process/node failures and silent data corruption. Evaluation results on a 16-GPU cluster show that our techniques reduce energy by up to 15% for unmodified programs and 32% for programs that are able to adapt the precision of the replicas.
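
To make the idea of a relaxed replica concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. A primary computes in double precision, a replica follows in emulated single precision, the two are compared against a tolerance rather than for bit-exact equality, and the replica is periodically "rejuvenated" by resetting it from the primary's state. All names (`to_f32`, `step`, `step_f32`, `run`), the toy update rule, the tolerance, and the fault injection are assumptions made for illustration. In a real system the replicas would be separate MPI processes on GPUs, and lowered power states are not modeled here.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def step(x: float) -> float:
    """One iteration of a toy contractive update standing in for real work."""
    return 0.9 * x + 0.3


def step_f32(x: float) -> float:
    """The same update, with intermediates rounded to emulate binary32."""
    return to_f32(to_f32(0.9 * x) + 0.3)


def run(steps, rejuvenate_every, tol, fault_at=None):
    """Run primary and relaxed replica side by side.

    Returns the step at which the two diverge beyond `tol`, or None if
    no mismatch is seen. `fault_at` injects a hypothetical silent data
    corruption into the primary at that step.
    """
    primary = 0.0
    replica = 0.0
    for i in range(1, steps + 1):
        primary = step(primary)
        replica = step_f32(replica)
        if fault_at == i:
            primary += 1.0  # injected corruption (illustrative only)
        # Relaxed check: the replica is not bit-identical to the primary,
        # so compare against a relative tolerance.
        if abs(primary - replica) > tol * max(1.0, abs(primary)):
            return i
        # Periodic rejuvenation: discard accumulated low-precision drift
        # by resetting the replica from the primary's state.
        if i % rejuvenate_every == 0:
            replica = to_f32(primary)
    return None


if __name__ == "__main__":
    print(run(steps=50, rejuvenate_every=10, tol=1e-4))
    print(run(steps=50, rejuvenate_every=10, tol=1e-4, fault_at=5))
```

A two-way comparison like this can only flag that the primary and replica disagree. It cannot tell which side is faulty, and the tolerance has to be loose enough to absorb legitimate precision drift between rejuvenations. This is one way to see why relaxed replication makes detection and recovery harder than exact replication.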