Presentation

Pinpointing Crash-Consistency Bugs in the HPC I/O Stack: A Cross-Layer Approach

Session

Storage and Application Characteristics

Authors

Jinghan Sun

Jian Huang

Marc Snir

Event Type

Paper

Time

Thursday, 18 November 2021, 4pm - 4:30pm CST

Location

240-241-242

Description

We present ParaCrash, a testing framework for studying crash recovery in a typical HPC I/O stack, and demonstrate its use by identifying 15 new crash-consistency bugs in various parallel file systems (PFS) and I/O libraries. ParaCrash uses a "golden version" approach to test the entire HPC I/O stack: storage state after recovery from a crash is correct if it matches the state that can be achieved by a partial execution with no crashes. It supports systematic testing of a multilayered I/O stack while properly identifying the layer responsible for the bugs.
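The "golden version" criterion can be illustrated with a small sketch. This is not ParaCrash's implementation: it assumes a hypothetical toy model in which the workload is a single, totally ordered list of whole-file writes, and the names `apply_ops`, `golden_states`, and `is_crash_consistent` are invented for illustration. A recovered state is accepted only if it equals the state produced by some crash-free prefix of the workload.

```python
from typing import Dict, List, Tuple

# Hypothetical toy model (an assumption, not taken from the paper):
# each operation overwrites a named file with new contents.
Op = Tuple[str, str]      # (filename, data)
State = Dict[str, str]    # filename -> contents


def apply_ops(ops: List[Op]) -> State:
    """Replay operations in order on an empty store, with no crash."""
    state: State = {}
    for name, data in ops:
        state[name] = data
    return state


def golden_states(ops: List[Op]) -> List[State]:
    """States reachable by running a prefix of `ops` without crashing."""
    return [apply_ops(ops[:k]) for k in range(len(ops) + 1)]


def is_crash_consistent(recovered: State, ops: List[Op]) -> bool:
    """Accept the recovered state only if it matches some golden state."""
    return recovered in golden_states(ops)


ops = [("a.dat", "v1"), ("b.dat", "v1"), ("a.dat", "v2")]

# Matches the state after the first two writes, so it is accepted.
ok = is_crash_consistent({"a.dat": "v1", "b.dat": "v1"}, ops)

# The third write is visible but the second is missing: no crash-free
# prefix produces this state, so it is rejected.
bad = is_crash_consistent({"a.dat": "v2"}, ops)
```

A real HPC I/O stack involves concurrent clients and several layers, so the set of legal states is generally larger than the simple prefixes enumerated here; the sketch only conveys the comparison against crash-free partial executions.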

Paper available from the ACM OpenTOC

Authors

Jinghan Sun

University of Illinois

Jian Huang

University of Illinois

Marc Snir

University of Illinois
