Erasure-Coding-Based Fault Tolerance for Recommendation Model Training
Parallel Programming Languages and Models
Reliability and Resiliency
Time: Monday, 15 November 2021, 2:30pm - 3pm CST
Description: Deep-learning-based recommendation models (DLRMs) are widely deployed to serve personalized content. DLRMs are trained by distributing the model across the memory of tens to hundreds of servers. Server failures are common at this scale and must be mitigated for training to progress. Checkpointing is the primary approach used for fault tolerance in these systems, but it incurs significant training-time overhead. Because checkpointing overhead grows with DLRM size, it is slated to become an even larger burden as DLRMs grow. This calls for rethinking fault-tolerant DLRM training.
We present ECRM, a DLRM training system that achieves efficient fault tolerance using erasure coding. ECRM chooses which DLRM parameters to encode, efficiently updates parities, and enables training to proceed without pauses. Compared to checkpointing, ECRM reduces training-time overhead, recovers from failures faster, and allows training to proceed during recovery. These results show the promise of erasure coding in enabling efficient fault tolerance for DLRM training.
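To make the erasure-coding idea concrete, the sketch below shows the simplest such code: a single XOR parity over parameter shards. The helper names (`make_parity`, `update_parity`, `recover`) are hypothetical and not from ECRM itself; ECRM's actual design chooses which DLRM parameters to encode and how parities are maintained during distributed training. The sketch illustrates two properties the abstract relies on: a parity can be updated incrementally when one parameter changes, and a failed server's shard can be reconstructed from the parity plus the surviving shards.

```python
# Minimal sketch of XOR-parity erasure coding over parameter shards.
# Hypothetical helper names; ECRM's real encoding and update protocol
# for DLRM parameters is more involved than this single-parity example.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(shards: list[bytes]) -> bytes:
    # Parity is the XOR of all shards.
    parity = shards[0]
    for s in shards[1:]:
        parity = xor_bytes(parity, s)
    return parity

def update_parity(parity: bytes, old: bytes, new: bytes) -> bytes:
    # Incremental update on a parameter write:
    # new_parity = parity XOR old XOR new (no need to re-read all shards).
    return xor_bytes(xor_bytes(parity, old), new)

def recover(shards: list[bytes], parity: bytes, failed: int) -> bytes:
    # Reconstruct one lost shard by XORing parity with the survivors.
    rec = parity
    for i, s in enumerate(shards):
        if i != failed:
            rec = xor_bytes(rec, s)
    return rec

# Usage: three 4-byte "parameter" shards, one per server.
shards = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = make_parity(shards)

# A training update changes shard 1; parity is updated incrementally.
new_shard1 = b"\x11\x22\x33\x44"
parity = update_parity(parity, shards[1], new_shard1)
shards[1] = new_shard1

# Server 1 fails: its shard is recovered from parity plus survivors.
assert recover(shards, parity, failed=1) == new_shard1
```

A single XOR parity tolerates one failure per group; production erasure codes (e.g. Reed-Solomon) generalize this to multiple failures, but the incremental parity-update property shown here is what lets training proceed without pausing to re-encode everything.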