Erasure-Coding-Based Fault Tolerance for Recommendation Model Training

SC21 Proceedings

Workshop: SuperCheck-SC21: Second International Symposium on Checkpointing for Supercomputing

Authors: Kaige Liu (Facebook, Carnegie Mellon University); Jack Kosaian and K. V. Vinayak (Carnegie Mellon University)

Abstract: Deep-learning-based recommendation models (DLRMs) are widely deployed to serve personalized content. DLRMs are trained by distributing the model across the memory of tens to hundreds of servers. Server failures are common in such settings and must be mitigated for training to progress. Checkpointing is the primary approach used for fault tolerance in these systems, but it incurs significant training-time overhead. Because this overhead increases with DLRM size, checkpointing will become even more costly as DLRMs grow. This calls for rethinking fault-tolerant DLRM training.

We present ECRM, a DLRM training system that achieves efficient fault tolerance using erasure coding. ECRM chooses which DLRM parameters to encode, efficiently updates parities, and enables training to proceed without pauses. Compared to checkpointing, ECRM reduces training-time overhead, recovers from failures faster, and allows training to proceed during recovery. These results show the promise of erasure coding in enabling efficient fault tolerance for DLRM training.
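The idea of keeping parities up to date during training and recovering from them after a failure can be illustrated with a minimal sketch. The abstract does not specify ECRM's code or data layout, so everything here is an assumption made for illustration: a single sum-based parity that tolerates one lost shard, parameters held as plain Python lists, and the hypothetical helper names `encode_parity`, `apply_update`, and `recover_shard`.

```python
from typing import List

Shard = List[float]


def encode_parity(shards: List[Shard]) -> Shard:
    """Build one parity shard as the element-wise sum of all data shards."""
    return [sum(vals) for vals in zip(*shards)]


def apply_update(shards: List[Shard], parity: Shard,
                 shard_id: int, index: int, delta: float) -> None:
    """Apply an update to one parameter and patch the parity with the same delta.

    Only the touched entry of the parity changes, so the parity never has to
    be recomputed from every shard.
    """
    shards[shard_id][index] += delta
    parity[index] += delta


def recover_shard(shards: List[Shard], parity: Shard, lost_id: int) -> Shard:
    """Rebuild a lost shard by subtracting the surviving shards from the parity."""
    survivors = [s for i, s in enumerate(shards) if i != lost_id]
    return [p - sum(vals) for p, vals in zip(parity, zip(*survivors))]


# Three hypothetical servers, each holding a two-parameter shard.
shards = [[1.0, 2.0], [0.5, 4.0], [8.0, 0.25]]
parity = encode_parity(shards)

# One training step changes a parameter on server 1; the parity follows.
apply_update(shards, parity, shard_id=1, index=0, delta=-0.25)

# Server 1 fails; its shard is rebuilt from the parity and the survivors.
recovered = recover_shard(shards, parity, lost_id=1)
```

In this toy setup the parity costs one extra shard of memory regardless of the number of servers, whereas a checkpoint stores a full copy of the model. A real system would need a code that tolerates more than one failure and would have to account for floating-point rounding, neither of which this sketch addresses.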
