Fault-Tolerance for High Performance and Big Data Applications: Theory and Practice
Reliability and Resiliency
Time: Sunday, 14 November 2021, 8am - 5pm CST
Description: Resilience is a critical issue for large-scale platforms, and this tutorial provides a comprehensive survey of fault-tolerant techniques for high-performance and big data applications, with a fair balance between theory and practice. This tutorial is organized along four main topics: an overview of failure types (software/hardware, transient/fail-stop) and typical probability distributions (Exponential, Weibull, Log-Normal); general-purpose techniques, which include several checkpoint and rollback recovery protocols, replication, prediction, and silent error detection; application-specific techniques, such as user-level in-memory checkpointing, data replication (map-reduce), or fixed-point convergence for iterative applications (back-propagation); and practical deployment of fault-tolerance techniques with User Level Failure Mitigation (a proposed MPI standard extension).
The tutorial is open to all SC21 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific and big data applications. Basic knowledge of MPI will be helpful for the hands-on session.