Application-Based Fault Tolerance for Numerical Linear Algebra at Large Scale
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
Tags
In-Person Only
Registration Categories
TP
XO / EX
Time
Tuesday, 16 November 2021, 8:30am - 5pm CST
Location
Second Floor Atrium
Description
Large-scale architectures provide high computing power, but as systems grow, their computation units become more likely to fail. Fault-tolerance mechanisms have arisen in parallel computing to handle errors that may occur at any moment during the execution of parallel programs. Algorithms used by fault-tolerant programs must scale and be resilient to software and hardware failures. Recent parallel algorithms have demonstrated properties that can be exploited to make them fault-tolerant. In my thesis, I design, implement, and evaluate parallel and distributed fault-tolerant numerical computation kernels for dense linear algebra, taking advantage of the intrinsic algebraic properties of communication-avoiding algorithms. I focus on dense matrix factorization kernels: I have results on LU and preliminary results on QR. Using performance evaluation and formal methods, I show that these kernels can tolerate crash-type failures, either by re-spawning new processes on the fly or by ignoring the error.