SC21 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Application-Based Fault Tolerance for Numerical Linear Algebra at Large Scale


Student: Daniel Alberto Torres Gonzalez (LIPN, CNRS UMR 7030; Institut Galilée - Université Paris 13)
Supervisor: Camille Coti (LIPN, CNRS UMR 7030; Institut Galilée - Université Paris 13)

Abstract: Large-scale architectures provide high computing power, but as systems grow, their computation units become more likely to fail. Fault-tolerance mechanisms have emerged in parallel computing to deal with errors that may occur at any moment during the execution of a parallel program. The algorithms used by fault-tolerant programs must both scale and be resilient to software and hardware failures. Recent parallel algorithms have demonstrated properties that can be exploited to make them fault-tolerant. In my thesis, I design, implement, and evaluate parallel and distributed fault-tolerant numerical computation kernels for dense linear algebra, taking advantage of the intrinsic algebraic properties of communication-avoiding algorithms. I am focusing on dense matrix factorization kernels: I have results on LU and preliminary results on QR. Using performance evaluation and formal methods, I am showing that these kernels can tolerate crash-type failures, either by re-spawning new processes on the fly or by ignoring the error.

ACM-SRC Semi-Finalist: no


